After successfully applying for a user account, launching a job on the clusters is a straightforward procedure involving three basic steps:
- Setup and Launch
- Monitor progress
- Retrieve and analyze results
1. Setup and launching of HPC jobs
When uploading files in preparation for launching an HPC job, it is good practice to keep each job in a separate folder, labeled in an intuitive way, for example Namd_albumin_run_01.
Note: If you write all files to the top level of your home directory, they will very quickly become difficult to follow and mistakes become easy to make!
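As a minimal sketch of the one-folder-per-job habit, the shell function below creates a numbered folder for each run. The function name new_job_dir and the zero-padded numbering are illustrative assumptions, not part of the cluster software.

```shell
# new_job_dir is a hypothetical helper, not a cluster command.
# Usage: new_job_dir <base-name> <run-number>
new_job_dir() {
    # Build a name such as Namd_albumin_run_01 (run number padded to 2 digits).
    dir=$(printf '%s_%02d' "$1" "$2")
    # -p means no error is raised if the folder already exists.
    mkdir -p "$dir" && echo "$dir"
}

# Example call: new_job_dir Namd_albumin_run 1
```

Calling it again with the next run number gives each job its own folder, so input and output files from different runs never mix.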
Launching jobs on the cluster is controlled by PBS (Portable Batch System) software, which allocates the compute nodes and time that the user requests in the pbs_batch_script.
The user can edit the pbs_batch_script to change a number of preferences, including the number of cpus to use, the length of time to run the job and the name of the program executable. It is important to realize that all lines that start with #PBS pass a PBS command, while a line with white space between # and PBS does not; it is treated as an ordinary comment.
For example, compare the lines:
#PBS -l walltime=24:0:0 <- works!
# PBS -l walltime=24:0:0 <- doesn't work.
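Because a stray space silently turns a directive into a comment, it can help to check a script before submitting it. The sketch below uses grep to list such lines; the function name check_pbs_directives is an assumption made for this example.

```shell
# check_pbs_directives is a hypothetical helper, not a PBS command.
# It prints lines that have white space between '#' and 'PBS', which
# PBS would treat as ordinary comments rather than directives.
# Usage: check_pbs_directives pbs_batch_script
check_pbs_directives() {
    # -n prefixes each matching line with its line number in the script.
    grep -n '^#[[:space:]][[:space:]]*PBS[[:space:]]' "$1"
}
```

If the function prints nothing (grep then returns a non-zero exit status), no suspicious lines were found.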
More notes on pbs scripts
Requesting more nodes for a particular job does not necessarily mean better performance or faster completion. This depends on how well parallelized the program is. Also, requesting large numbers of cpus (> 64) may result in the job waiting in the queue for days or weeks while the scheduler makes resources available. For example, a job that runs in 5 days on 32 cpus may take 7 days on 64 cpus (3 days to run plus an extra 4 days waiting in the queue!).
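The comparison above is simply queue wait plus run time. The sketch below works through the same figures, assuming (this is not stated in the text) that the 32 cpu job starts without any queue wait; turnaround_days is a hypothetical helper name.

```shell
# Turnaround in days = days waiting in the queue + days running.
# Usage: turnaround_days <wait-days> <run-days>
turnaround_days() {
    echo $(( $1 + $2 ))
}

turnaround_days 0 5   # 32 cpus: assumed no wait, 5 days running
turnaround_days 4 3   # 64 cpus: 4 days waiting, 3 days running
```

The two calls print 5 and 7, so the smaller request finishes first even though the larger one computes faster.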
VPAC's HPC facilities use a scheduler which juggles the various requests for different jobs according to policy. This policy can include the walltime you have set in your PBS script, the queue you have submitted to, share of resources used by your institution and so forth.
Walltime is the length of time, specified in the pbs script, for which the job is allowed to run. Make sure that you have specified enough walltime for a particular job to complete! Your job will end after the allocated walltime whether it is finished or not, sometimes resulting in lost data if the program does not write checkpoint or restart files. If you are not sure how long a job will run, set a generous walltime, and then check the job status before the walltime expires. If your job needs more time, simply email email@example.com and ask the system administrators to change the walltime for the job.
Modules are simply a way to set the environment variables appropriate for the job you want to run. In your pbs_batch_script make sure you have a “module load” line for the software you want to use, for example:
module load namd
module load fluent
You can type “module avail” at the command line to see what modules are installed, or “module list” to see which modules you have loaded.
If you want to unload a module (say fluent), simply type:
module unload fluent
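Putting the directives, walltime and module lines together, a batch script might look roughly like the one written out by the sketch below. The job name, the nodes=2:ppn=8 resource request (Torque-style syntax), the mpiexec launcher, the namd2 executable and the albumin file names are all assumptions for illustration; check the cluster documentation for the exact forms in use.

```shell
# Write a minimal example batch script to the file pbs_batch_script.
# The quoted 'EOF' stops the shell expanding $PBS_O_WORKDIR here.
cat > pbs_batch_script <<'EOF'
#!/bin/bash
# Job name (hypothetical).
#PBS -N albumin_run_01
# Resources: 2 nodes with 8 cpus each (assumed Torque-style syntax).
#PBS -l nodes=2:ppn=8
# Walltime limit of 24 hours.
#PBS -l walltime=24:0:0

# Start in the folder the job was submitted from.
cd "$PBS_O_WORKDIR"

# Set up the environment for the program.
module load namd

# Hypothetical launcher, executable and input/output file names.
mpiexec namd2 albumin.conf > albumin.log
EOF
```

Note that every directive line starts with #PBS with no space after the #, while the explanatory lines are ordinary comments.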
Submitting the job
Once you are satisfied with your pbs_batch_script and have uploaded all the input files for your job, simply type at the command line:
qsub pbs_batch_script
to submit your job. Easy!
2. Monitoring progress
Once a job is launched it is possible to monitor how it is progressing. One way to do this is to type at the command line:
showq
This will show all current jobs on the cluster, so to pick out the lines relevant to you, use the “grep” command with your username:
showq | grep <username>
If all is well you should see a line like:
> showq | grep joe
83238 joe Running 16 23:26:49 Wed Feb 13 14:38:33
Reading across the line, the first number is the job number (83238), the job is owned by joe, and its status is “Running”. The job is running on 16 cpus and has 23 hours 26 minutes and 49 seconds remaining. It was submitted on Wed Feb 13 at 14:38:33.
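The same reading of the line can be done with awk. The sketch below labels the first five fields and converts the remaining time to seconds. It assumes the column layout of the sample line above (job number, user, status, cpus, remaining time as HH:MM:SS); summarise_job is a hypothetical helper name, and a remaining time that includes a days field would need extra handling.

```shell
# summarise_job is a hypothetical helper, not a cluster command.
# It reads showq lines on standard input, in the layout shown above:
#   jobnumber user status cpus HH:MM:SS ...
summarise_job() {
    awk '{
        split($5, t, ":")                        # hours, minutes, seconds
        secs = t[1] * 3600 + t[2] * 60 + t[3]    # remaining time in seconds
        printf "job=%s user=%s state=%s cpus=%s remaining_seconds=%d\n", $1, $2, $3, $4, secs
    }'
}

# Example call: showq | grep joe | summarise_job
```

For the sample line, 23:26:49 works out to 84409 seconds remaining.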
Another way to see this information is with the -u option:
If the cluster is busy, your job may have to wait in the queue, in which case the status of the job would be “Idle”.
If your job is listed as “Idle”, one can check when it is likely to start by the command:
If you don't see any line, then your job has either finished or died!
If you know the job number you can view the output while it is still running by typing at the command line:
jobstatus o
Likewise, to see error messages (if any), type
Stopping a job
If you see some output you don't like and wish to stop your job prematurely, you may do so with the command:
3. Retrieving and Analyzing Results
Often it is most convenient, depending on the type of simulation, to download the output of the job to your local computer for analysis. If you have run your job in its own folder, then it is simply a matter of downloading that entire folder to your desktop.
Most post-processing of the data can be done comfortably on a desktop and doesn't require further HPC resources at that point.