Troubleshooting Condor Jobs

The first step in troubleshooting a Condor job is to run condor_q:

~$ condor_q


-- Submitter: pongo.cacr.caltech.edu : <131.215.145.189:52215> : pongo.cacr.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  22.0   user            3/31 06:07   0+00:00:00 H  0   0.0  peakfinderBinningF
  23.0   user            3/31 06:08   0+00:00:00 H  0   0.0  peakfinderBinningF

The column labeled ST shows the job's state. The main states are:

Running (R)

Your job is running somewhere.

Idle (I)

Your job is waiting to be scheduled. See Troubleshooting Idle below.

Held (H)

Your job has a problem that prevents it from running. See Troubleshooting Held below.
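When there are many jobs in the queue, it can help to pull the state out of the condor_q listing with a small script. condor_q abbreviates the state as a single letter in the ST column: R for Running, I for Idle, H for Held. The following is a minimal sketch that assumes the default column layout shown above; the function name job_states is made up for illustration and is not part of Condor.

```python
# Map the one-letter codes in condor_q's ST column to state names.
STATE_NAMES = {"R": "Running", "I": "Idle", "H": "Held"}


def job_states(condor_q_output):
    """Return {job_id: state_name} parsed from plain `condor_q` text output.

    Assumes the default column order: ID OWNER SUBMITTED RUN_TIME ST ...
    """
    states = {}
    for line in condor_q_output.splitlines():
        fields = line.split()
        # Job rows start with an ID such as "22.0"; skip headers and blank lines.
        if len(fields) < 6 or not fields[0].replace(".", "", 1).isdigit():
            continue
        # SUBMITTED is two fields (date and time), so ST is the sixth field.
        code = fields[5]
        states[fields[0]] = STATE_NAMES.get(code, code)
    return states


sample = """\
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  22.0   user            3/31 06:07   0+00:00:00 H  0   0.0  peakfinderBinningF
  23.0   user            3/31 06:08   0+00:00:00 H  0   0.0  peakfinderBinningF
"""
print(job_states(sample))
```

Codes that are not in the table are passed through unchanged, so an unfamiliar state still shows up rather than being dropped.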

Troubleshooting Idle

Condor can take up to a minute to get around to scheduling your job, so first be patient. If a job stays in the idle state for more than a few minutes, something is likely wrong.

The first thing to do is run condor_status.

~$ condor_status 

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@mondom.cacr. LINUX      X86_64 Unclaimed Idle     7.020  47804  0+00:10:05
slot1@myogenin.cac LINUX      X86_64 Unclaimed Idle     15.090  127701  0+03:15:05
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     2     0       0         2       0          0        0

No Output

If you run condor_status and get no output, there's something wrong with condor and you should contact a system administrator.

Claimed

The next possibility is that Condor is busy processing someone else's jobs. If all the slots are Claimed, you'll have to wait your turn.

High Load Average

The other problem is that some hosts aren't dedicated to Condor, so Condor is configured to play nice and not push the system load too high. If the number in the LoadAv column is above the number of CPUs on a machine, Condor won't try to run jobs on that shared host.
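The load comparison described above can be sketched in a few lines. This is only a rule-of-thumb check, not Condor's actual policy, which is set by the administrator in the Condor configuration; the function name host_looks_overloaded is hypothetical. The example reads the load average of the machine you are logged in to, which works on Unix systems only.

```python
import os


def host_looks_overloaded(load_average, cpu_count):
    """True when the load average exceeds the number of CPUs."""
    return load_average > cpu_count


# Check the local machine: 1-minute load average versus CPU count.
one_minute_load = os.getloadavg()[0]
cpus = os.cpu_count() or 1  # cpu_count() can return None
print(host_looks_overloaded(one_minute_load, cpus))
```

To judge a remote host, compare the LoadAv value that condor_status reports for it with that host's CPU count instead.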

Troubleshooting Held

If your job is in the Held state, we need to investigate what's wrong.

~$ condor_q -better-analyze -global


-- Submitter: pongo.cacr.caltech.edu : <131.215.145.189:52215> : pongo.cacr.caltech.edu
---
022.000:  Request is held.

Hold reason: Error from slot1@myogenin.cacr.caltech.edu: Failed to execute '/woldlab/castor/data00/home/user/peakfinderBinningForChIP.py' with arguments files.txt: Permission denied

The end of the Hold reason contains the likely error message. For instance, in the above case the error is "Permission denied".
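If you are scanning many held jobs, the trailing error message can be pulled out of each Hold reason line mechanically. The sketch below assumes the message follows the last ": " in the line, as it does in the example above; other hold reasons may be formatted differently, and the function name hold_error is made up for illustration.

```python
def hold_error(hold_reason_line):
    """Return the text after the last ': ' in a 'Hold reason:' line."""
    return hold_reason_line.rsplit(": ", 1)[-1].strip()


line = (
    "Hold reason: Error from slot1@myogenin.cacr.caltech.edu: Failed to execute "
    "'/woldlab/castor/data00/home/user/peakfinderBinningForChIP.py' "
    "with arguments files.txt: Permission denied"
)
print(hold_error(line))
```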

Permission denied

If you get this on a script, the error means the file is not marked "executable". An executable file has permissions that look like:

~$ ls -l script.py
-rwxrwxr-x 1 diane diane 103 2009-11-12 14:24 script.py*

Note the x's in the first column. They tell the operating system and Condor that the owning user (first x), the owning group (second x), and everyone else (third x) can run this script.
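The same three execute bits can be checked from Python, which is handy when you want to test a batch of scripts before submitting them. This is a minimal sketch using the standard os and stat modules; the function name who_can_execute is hypothetical, and the demonstration uses a throwaway temporary file rather than a real script.

```python
import os
import stat
import tempfile


def who_can_execute(path):
    """Report which of user, group, and other have the execute bit set."""
    mode = os.stat(path).st_mode
    return {
        "user": bool(mode & stat.S_IXUSR),
        "group": bool(mode & stat.S_IXGRP),
        "other": bool(mode & stat.S_IXOTH),
    }


# Demonstrate on a temporary file set to rwxrwxr-x (775), as in the listing above.
with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as handle:
    demo_path = handle.name
os.chmod(demo_path, 0o775)
print(who_can_execute(demo_path))
os.remove(demo_path)
```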

However, if you only change the permissions, you're likely to run into the Failed to execute <script> error next, so go read that solution now as well.

Failed to execute <script>

This can happen with many kinds of scripts: shell scripts, Python scripts, and so on; basically anything that is not a binary executable. The discussion below assumes a Python script.

Condor doesn't know that .py files should be run with the Python interpreter, so you have two choices for how to tell it.

One is to change the permissions of eland_results... to include the "executable" bit with something like chmod a+x eland_results... (or chmod 755 eland_results...), which should change the ls -l output from

-rw-r--r-- 1 user user 953 2010-01-09 15:31 eland_results_to_fasta_input.py

to:

-rwxr-xr-x 1 user user 953 2010-01-09 15:31 eland_results_to_fasta_input.py

In addition, if you use this method you'll also need to add:

#!/usr/bin/env python

to the top of the file. The advantage of this approach is that Linux now knows the file is executable, so you can also run it directly from the shell (leaving off the python), for example as ./eland_results_to_fasta_input.py args..
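Both requirements of this first method, the execute bits and the #! line, can be checked and applied from Python. The following is a rough sketch with hypothetical helper names (make_executable, has_shebang); it demonstrates on a temporary file so that nothing real is modified.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add execute permission for user, group, and other (like `chmod a+x`)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def has_shebang(path):
    """True when the file starts with the two characters '#!'."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"


# Demonstrate on a temporary script that starts out as rw-r--r-- (644).
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
    handle.write("#!/usr/bin/env python\nprint('hello')\n")
    demo_script = handle.name
os.chmod(demo_script, 0o644)

make_executable(demo_script)
print(oct(stat.S_IMODE(os.stat(demo_script).st_mode)), has_shebang(demo_script))
os.remove(demo_script)
```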

The other choice is to change the executable in the Condor submit script from eland_results_to_fasta_input.py to python and pass eland_results_to_fasta_input.py as the first argument.
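To make this second method concrete, the sketch below generates a submit description that names the interpreter as the executable and the script as its first argument. Several details are assumptions rather than anything taken from this page: the interpreter path /usr/bin/python (check yours with which python), the script argument input.txt, the output, error, and log file names, and the helper name submit_description. Adapt them to your own job.

```python
def submit_description(interpreter, script, script_args):
    """Build a Condor submit description that runs `script` via `interpreter`."""
    lines = [
        "universe   = vanilla",
        f"executable = {interpreter}",
        # The script becomes the first argument to the interpreter.
        f"arguments  = {script} {script_args}",
        "output     = job.out",  # hypothetical file names
        "error      = job.err",
        "log        = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


# /usr/bin/python and input.txt are placeholders, not values from this page.
print(submit_description("/usr/bin/python", "eland_results_to_fasta_input.py", "input.txt"))
```

With this approach the script itself needs neither the execute bit nor a #! line, because Condor only ever executes the interpreter.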

WoldlabWiki: Condor/Troubleshooting (last edited 2010-03-31 21:50:23 by diane)