Why Isn't My Job Running?
VPAC's HPC facilities use a scheduler that balances competing job requests according to policy. The policy takes into account the walltime you have set in your PBS script, the queue you have submitted to, the share of resources used by your institution, and so forth.
Checkjob
One of the first diagnostics you may wish to use is the checkjob command. Enter checkjob <jobid> to view the details of a job in the queue and the reasons why it may be idle, blocked, etc.
Use checkjob -v <jobid> for a verbose explanation, which provides a node-by-node listing of why the job is still in the queue.
Walltime and Showstart
The typical command for submitting a job is: qsub pbs_batch_script. The PBS script usually requests a number of nodes or cores (possibly with a number of processors per node) and a walltime.
After submitting a job you can use the command: showstart <jobid> to receive an estimate of when your job will start. This estimate varies with the resources and walltime requested, as described in the previous paragraph, and with the system load.
If you request a large number of nodes (e.g., 64+) and a long walltime (e.g., two weeks or more), it is likely that your job will not start as soon as that of someone who has requested fewer nodes or a shorter walltime.
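The submit-then-inspect steps described above can be wrapped in a small shell function. This is a minimal sketch, not a VPAC-provided tool: the function name submit_and_inspect is hypothetical, and it assumes that qsub prints the new job ID on standard output in the form <number>.<server> and that showstart and checkjob accept the numeric part.

```shell
# Hypothetical helper: submit a PBS script, then ask the scheduler
# when the job is expected to start and why it may be waiting.
submit_and_inspect() {
    script="$1"
    # Assumption: qsub prints the new job ID (e.g. 12345.server) on stdout.
    jobid=$(qsub "$script") || return 1
    # Keep only the numeric part before the first dot.
    jobnum=${jobid%%.*}
    echo "Submitted job: $jobnum"
    # Estimated start time, given current load and policy.
    showstart "$jobnum"
    # Current state and any reasons the job is idle or blocked.
    checkjob "$jobnum"
}
```

After defining the function in your shell, call it with the name of your script, for example: submit_and_inspect pbs_batch_script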
Queues
It is possible to specify which queue a PBS job is submitted to. Each queue provides a standing reservation for jobs that fit its pre-established limits.
The most common is the short queue (sque), specifically designed for short jobs.
The queue can be specified in the PBS script, e.g.,
# set your walltime=hours:minutes:seconds
#PBS -l walltime=0:15:00
#PBS -q sque
Institution Limits
VPAC has a global configuration for all clusters and specific cluster rules. The global rules include the following:
In addition to this there are "fairshare" policies for each institution. These allow historical resource use to be incorporated into job priority. At VPAC, fairshare usage is tracked on a weekly basis. The fairshare policy is based on usage of dedicated resources, and job priority is adjusted both up and down to bring each group towards its target usage. The targets are based on the percentage subscription values of each VPAC group.
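The direction of the fairshare adjustment can be illustrated with a short shell function. This is a simplified sketch only: it shows whether priority would move up or down for a group, given a target and an actual usage as integer percentages. The function name and the example numbers are hypothetical, and the scheduler's real weighting is not reproduced here.

```shell
# Hypothetical illustration: direction of the fairshare adjustment.
# Arguments: target usage and actual usage, as integer percentages.
fairshare_adjustment() {
    target="$1"
    usage="$2"
    delta=$((target - usage))
    if [ "$delta" -gt 0 ]; then
        # Group has used less than its target share.
        echo "raise priority (under target by $delta points)"
    elif [ "$delta" -lt 0 ]; then
        # Group has used more than its target share.
        echo "lower priority (over target by $((0 - delta)) points)"
    else
        echo "no change (on target)"
    fi
}

# Example with made-up figures: target 20%, actual usage 15%.
fairshare_adjustment 20 15
```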
A comparison between allocated and actual usage is available.
In addition to this, projects may also have generated fairshare caps.
Individual Limits
The configuration also has limits on how many processors and jobs an individual can run on each HPC system. These limits vary per cluster. There is a 'soft' limit, which can be exceeded if no other users have eligible jobs, and a 'hard' limit, which is absolute.
The following are the individual limits on Tango (each pair is soft,hard):
USERCFG[DEFAULT] MAXJOB=25,35
USERCFG[DEFAULT] MAXPROC=150,200
For example, if you are already running more than 25 jobs or using more than 150 cores, it is probable that your job will be delayed in the queue.
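The soft and hard limits above can be checked by hand with a few lines of shell. This is a minimal sketch with a hypothetical function name, classify_request; it assumes that a limit is exceeded when your current usage plus the new request is greater than the configured value, and you need to supply your current usage yourself.

```shell
# Hypothetical helper: would a new request push a user past the
# soft or hard limit? Arguments: current requested soft hard
classify_request() {
    total=$(( $1 + $2 ))
    if [ "$total" -gt "$4" ]; then
        # Absolute limit: the job cannot start until usage drops.
        echo "over hard limit"
    elif [ "$total" -gt "$3" ]; then
        # Can only start if no other users have eligible jobs.
        echo "over soft limit"
    else
        echo "within limits"
    fi
}

# One more job when 25 are already running (MAXJOB=25,35).
classify_request 25 1 25 35
# A 16-core job when 140 cores are in use (MAXPROC=150,200).
classify_request 140 16 150 200
```

Both example calls report "over soft limit", which corresponds to the case where a job is likely to be delayed in the queue.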