Troubleshooting Condor Jobs
The first step to troubleshoot a condor job is to run:
~$ condor_q
-- Submitter: pongo.cacr.caltech.edu : <131.215.145.189:52215> : pongo.cacr.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  22.0   user            3/31 06:07   0+00:00:00 H  0   0.0  peakfinderBinningF
  23.0   user            3/31 06:08   0+00:00:00 H  0   0.0  peakfinderBinningF
The column labeled ST is the job's state. The main states are:
- Running (R)
Your job is running somewhere.
- Idle (I)
Your job is waiting to be scheduled. See Troubleshooting Idle.
- Held (H)
Your job has a problem that prevents it from running. See Troubleshooting Held.
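As a rough illustration of reading the ST column, here is a minimal Python sketch that pulls the state of each job out of saved condor_q text. It assumes the column order in the sample output on this page (SUBMITTED takes two fields, so ST is the sixth) and the one-letter codes R, I and H. The names CONDOR_Q_SAMPLE and job_states are made up for this example, and other condor versions may lay the columns out differently.

```python
import re

# Sample condor_q output copied from this page.
CONDOR_Q_SAMPLE = """\
-- Submitter: pongo.cacr.caltech.edu : <131.215.145.189:52215> : pongo.cacr.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  22.0   user            3/31 06:07   0+00:00:00 H  0   0.0  peakfinderBinningF
  23.0   user            3/31 06:08   0+00:00:00 H  0   0.0  peakfinderBinningF
"""

# One-letter codes for the three main states discussed here.
STATE_NAMES = {"R": "Running", "I": "Idle", "H": "Held"}


def job_states(condor_q_text):
    """Map job ID -> state name, assuming the column order shown above."""
    states = {}
    for line in condor_q_text.splitlines():
        fields = line.split()
        # Job rows start with an ID such as "22.0". SUBMITTED occupies two
        # fields (date and time), so ST is the sixth field.
        if len(fields) >= 6 and re.fullmatch(r"\d+\.\d+", fields[0]):
            code = fields[5]
            states[fields[0]] = STATE_NAMES.get(code, code)
    return states


if __name__ == "__main__":
    for job_id, state in job_states(CONDOR_Q_SAMPLE).items():
        print(job_id, state)
```

Any job reported as Idle or Held is a candidate for the matching troubleshooting section.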
Troubleshooting Idle
Condor can take up to a minute to get around to scheduling your job, so first be patient. If a job stays in the idle state for more than a few minutes, there is likely something wrong.
The first thing to do is run condor_status.
~$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime
slot1@mondom.cacr. LINUX      X86_64 Unclaimed Idle      7.020  47804  0+00:10:05
slot1@myogenin.cac LINUX      X86_64 Unclaimed Idle     15.090 127701  0+03:15:05
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX     2     0       0         2       0          0        0
No Output
If you run condor_status and get no output, there's something wrong with condor and you should contact a system administrator.
Claimed
Condor may also be busy running someone else's jobs. If all the slots are Claimed, you'll have to wait your turn.
High Load Average
The other problem is that some hosts aren't dedicated to condor, so condor is configured to play nice and not push their system load too high. If the number in the LoadAv column is above the number of CPUs on a shared machine, condor won't try to run jobs on it.
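The two checks above (all slots Claimed, or load average too high) can be sketched in Python against saved condor_status text. This is only an illustration: the cpus argument is a number you supply yourself, since the output does not show CPU counts, and the sketch applies the load rule to every slot even though only shared hosts are configured that way. The names CONDOR_STATUS_SAMPLE and slot_problems are invented for this example.

```python
# Slot states that appear in the condor_status summary header.
SLOT_STATES = {"Owner", "Claimed", "Unclaimed", "Matched", "Preempting", "Backfill"}

# Sample condor_status output copied from this page.
CONDOR_STATUS_SAMPLE = """\
Name               OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime
slot1@mondom.cacr. LINUX      X86_64 Unclaimed Idle      7.020  47804  0+00:10:05
slot1@myogenin.cac LINUX      X86_64 Unclaimed Idle     15.090 127701  0+03:15:05
                     Total Owner Claimed Unclaimed Matched Preempting Backfill
        X86_64/LINUX     2     0       0         2       0          0        0
"""


def slot_problems(condor_status_text, cpus):
    """Return (slot name, reason) for each slot unlikely to accept a job.

    cpus is an assumed CPU count applied to every host.
    """
    problems = []
    for line in condor_status_text.splitlines():
        fields = line.split()
        # Slot rows have eight columns with a known state in the fourth;
        # this skips the column header and the summary table.
        if len(fields) != 8 or fields[3] not in SLOT_STATES:
            continue
        name, state, load = fields[0], fields[3], float(fields[5])
        if state == "Claimed":
            problems.append((name, "claimed by another job"))
        elif load > cpus:
            problems.append((name, "load average above CPU count"))
    return problems


if __name__ == "__main__":
    for name, reason in slot_problems(CONDOR_STATUS_SAMPLE, cpus=8):
        print(name, reason)
```

An empty result with jobs still idle suggests the cause lies elsewhere.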
Troubleshooting Held
If your job is in the Held state, we need to investigate what's wrong.
~$ condor_q -better-analyze -global
-- Submitter: pongo.cacr.caltech.edu : <131.215.145.189:52215> : pongo.cacr.caltech.edu
---
022.000:  Request is held.
Hold reason: Error from slot1@myogenin.cacr.caltech.edu: Failed to execute '/woldlab/castor/data00/home/user/peakfinderBinningForChIP.py' with arguments files.txt: Permission denied
The end of the Hold reason contains the likely error message. For instance, in the above case the error is "Permission denied".
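If you are scanning many held jobs, picking out that trailing message can be automated. A small sketch, assuming the message is whatever follows the last ": " in the hold reason (true of the example above, but not guaranteed for every hold reason); final_error is a name made up for this example.

```python
# Hold reason copied from the condor_q -better-analyze output on this page.
HOLD_REASON = (
    "Error from slot1@myogenin.cacr.caltech.edu: Failed to execute "
    "'/woldlab/castor/data00/home/user/peakfinderBinningForChIP.py' "
    "with arguments files.txt: Permission denied"
)


def final_error(hold_reason):
    """Return the text after the last ': ' in a hold reason."""
    return hold_reason.rsplit(": ", 1)[-1].strip()


if __name__ == "__main__":
    print(final_error(HOLD_REASON))
```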
Permission denied
If you get this on a script, condor is checking whether the file is "executable", which means its permissions look like:
~$ ls -l script.py
-rwxrwxr-x 1 diane diane 103 2009-11-12 14:24 script.py*
Note the 'x'es in the first column. They tell the operating system and condor that the owning user (first x), the owning group (second x), and everyone else (third x) can run this script.
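The same three permission bits can be inspected from Python, which is handy when checking many scripts at once. A minimal sketch using the standard os and stat modules; executable_bits is a name invented for this example.

```python
import os
import stat
import sys


def executable_bits(path):
    """Report whether user, group and other have the execute bit set."""
    mode = os.stat(path).st_mode
    return {
        "user": bool(mode & stat.S_IXUSR),
        "group": bool(mode & stat.S_IXGRP),
        "other": bool(mode & stat.S_IXOTH),
    }


if __name__ == "__main__":
    # Usage: python check_bits.py script.py [more files...]
    for path in sys.argv[1:]:
        print(path, executable_bits(path))
```

A file listed as -rw-r--r-- would report False for all three.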
However, if you just change the permissions, you're likely to run into the Failed to execute <script> error next, so go read that solution now.
Failed to execute <script>
This can happen with a variety of scripts (shell scripts, python scripts, etc.), basically anything that is not a binary executable. The discussion below assumes a python script.
Condor doesn't know that '.py' files should be run with the python interpreter, so you have two choices for how to tell it.
One is to change the permissions of eland_results... to include the "executable" bit with something like chmod a+x eland_results... (or chmod 755 eland_results...), which should change the ls -l output from
-rw-r--r-- 1 user user 953 2010-01-09 15:31 eland_results_to_fasta_input.py
to:
-rwxr-xr-x 1 user user 953 2010-01-09 15:31 eland_results_to_fasta_input.py
In addition, if you use this method you'll also need to add:
#!/usr/bin/env python
to the top of the file. The advantage of this is that linux will now know the file is an executable, so you can also run it from the shell in its own directory as ./eland_results_to_fasta_input.py args.. (leaving off the python).
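The two steps of this first choice (add the #! line, set the executable bits) can also be done from Python. This is a sketch rather than a recommended tool: make_script_executable is a made-up helper, it rewrites the file in place, and it treats any existing first line starting with #! as already correct.

```python
import os
import stat

SHEBANG = "#!/usr/bin/env python\n"


def make_script_executable(path):
    """Prepend the #! line if missing, then do the equivalent of chmod a+x."""
    with open(path) as handle:
        body = handle.read()
    if not body.startswith("#!"):
        with open(path, "w") as handle:
            handle.write(SHEBANG + body)
    # Keep the existing permission bits and add x for user, group and other.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | 0o111)
```

Starting from -rw-r--r--, this leaves the file as -rwxr-xr-x, matching the ls -l output above.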
The other choice is to change the executable in the condor submit script from eland_results_to_fasta_input.py to python and treat eland_results_to_fasta_input.py as the first argument in the condor submit script.
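To make the second choice concrete, the sketch below prints what such a submit script might contain. The interpreter path /usr/bin/python, the input.txt argument, and the job.out, job.err and job.log file names are all placeholders chosen for illustration, and submit_description is a made-up helper. It also assumes the script is visible at that path on the machine that runs the job.

```python
def submit_description(executable, arguments, tag="job"):
    """Build the text of a simple condor submit script."""
    lines = [
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"output = {tag}.out",
        f"error = {tag}.err",
        f"log = {tag}.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # python is the executable; the script becomes the first argument.
    print(submit_description(
        "/usr/bin/python",  # placeholder path to the interpreter
        "eland_results_to_fasta_input.py input.txt",  # input.txt is hypothetical
    ))
```

With this form the script itself needs neither the executable bit nor the #! line, because python is what condor actually executes.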