There are two ways to submit a batch job on an ATS-Hosted Cluster: from the UCLA Grid Portal or from the cluster login node.
The UCLA Grid Portal provides a web portal interface to the ATS-Hosted clusters. Every user of an ATS-Hosted cluster can access the UCLA Grid Portal. To submit a batch job from the UCLA Grid Portal, click the "Job Services" tab. There are four kinds of jobs that you can submit:
- Generic Jobs
Use this page to submit a job that runs a program or script that either you or a colleague have written and is usually installed in your home directory. In the fill-in form provided, supply the name of the executable, any job parameters, time limit, number of processors, etc. and click the Submit button.
- Applications
Use this page to submit a commonly used application. Normally, you are required to know less about an application than a generic job, as the UCLA Grid Portal keeps track of the location of the executable and other information about the application. You normally must prepare an input file that the application will read or run. Some applications can present forms to you on the UCLA Grid Portal that you can fill in to create the input file if you are not familiar with its requirements.
- Multi-Jobs
Use this page to submit multiple jobs that run a program or script that either you or a colleague have written. For details, see Running an Array of Jobs Using SGE on an ATS-Hosted Cluster.
- Multi-Serial-Jobs on the same node
Use this page to submit multiple serial jobs on a single node that run a program or script that either you or a colleague have written. For details, see Running Multiple Serial Jobs on a Single Node Using SGE on an ATS-Hosted Cluster.
After you submit a job, click the UCLA Grid Portal "Job Status" tab where you can monitor its progress, and view and download its output after your job completes.
In order to run a job under the Sun Grid Engine (SGE), you need to create a command file that consists of a set of SGE commands or directives along with the commands required to execute the actual job. You can build the command file either with a queue script provided by ATS or by writing it yourself. We recommend that you use an ATS-provided queue script if you are not familiar with SGE commands.
Each ATS-provided queue script is named for a type of job or application. The queue script builds an SGE command file for that particular type of job or application. A queue script can be run either as a single command to which you provide appropriate options, or as an interactive application which presents you with a menu of choices and prompts you for the values of options.
For example, if you simply enter a queue script command such as:
job.q
without any command-line arguments, the queue script will enter its interactive mode and present you with a menu of tasks you can perform. One of these tasks is to build the command file, another is to submit a command file that has already been built, another is to show the status of jobs you have already submitted. You can also enter myjobs at the shell prompt to show the status of jobs you have submitted and which have not already completed. You can also enter groupjobs at the shell prompt to show the status of pending jobs everyone in your group has submitted. Enter groupjobs -help for options.
ATS-provided queue scripts can be used to run the following five types of jobs:
Or, you can build an SGE command file yourself and use SGE commands directly on a login node.
A serial job runs on a single thread on a single node. It does not take advantage of multi-processor nodes or the multiple compute nodes available with a cluster.
To build or submit an SGE command file for a serial job, you can either enter:
job.q
or, you can provide the name of your executable on the command line:
job.q name_of_executable
When you enter job.q without any command line arguments, it will interactively ask you to enter required memory, wall-clock time limit and other options, and let you submit the job. You can also quit out of the queue script menu and edit the SGE command file, which the script built, to manually change or add other Sun Grid Engine options.
If you did not submit the command file at the end of the menu dialog and decided to edit the file before submitting it, you can submit your command file using the command:
qsub executable.cmd
When you enter job.q name_of_executable, it will build the command file with the default queue script options, submit it to run, and delete the command file that it built.
Array jobs are serial jobs or multi-threaded jobs that use the same executable but different input variables or input files, as in parametric studies. Users typically run thousands of jobs with one submission.
The SGE command file for a serial array job will, at the minimum, contain the SGE keyword statement for a lower index value and an upper index value. By default, the index interval is one. SGE keeps track of the jobs using the environment variable SGE_TASK_ID which varies from the lower index value to the upper index value for each job. Your program can use SGE_TASK_ID to select the input files to read or the options to be used for that particular run.
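As a rough sketch of how a command file might use SGE_TASK_ID, the fragment below maps the task index to an input file name. The naming pattern input.N.dat and the function name select_input are assumptions made for illustration, not something a queue script generates. SGE sets SGE_TASK_ID to the string "undefined" for jobs that are not array jobs, so the sketch falls back to index 1 in that case.

```shell
#!/bin/sh
# Sketch: pick an input file from the array task index.
# The name pattern input.<N>.dat is hypothetical.
select_input() {
    task_id="${1:-${SGE_TASK_ID:-1}}"
    # SGE sets SGE_TASK_ID to "undefined" outside array jobs.
    if [ "$task_id" = "undefined" ]; then
        task_id=1
    fi
    echo "input.${task_id}.dat"
}

input_file=$(select_input)
echo "This task would read ${input_file}"
```

The same index could just as well select a line from a parameter file or a command-line option for your program.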
If your program is multi-threaded, you must edit the SGE command file built by the jobarray.q script and add an SGE keyword statement that specifies the shared parallel environment and the number of slots your job requires. You can only request up to 8 slots because the maximum number of cores per node on ATS-Hosted clusters is 8. See For a multi-threaded OpenMP job below.
To build or submit an SGE command file for a serial array job, enter:
jobarray.q
For details, see Running an Array of Jobs Using SGE on an ATS-Hosted Cluster.
Multi-threaded jobs are jobs which will run on more than one thread on the same node. Programs using the OpenMP-based threaded library are a typical example of those that can take advantage of multi-core nodes.
If you know your program is multi-threaded, you need to request that SGE allocate multiple core resources. Otherwise your job will contend for resources with other jobs that are running on the same node, and all jobs on that node may be adversely affected. The queue script will prompt you to enter the number of tasks for your job. The queue script default is 4 tasks. You should request at least as many tasks as your program has threads, but you can only request up to 8 tasks because the maximum number of cores per node on ATS-Hosted clusters is 8. Please see Scalability Benchmark below for information on how to determine the optimal number of tasks.
To build or submit an SGE command file for a multi-threaded job, enter:
openmp.q
For details, see OpenMP programs and Multi threaded programs.
MPI parallel jobs are those executable programs that are linked with one of the message passing libraries like OpenMPI or MVAPICH. These applications explicitly send messages from one node to another using either a Gigabit Ethernet (GE) interface or an Infiniband (IB) interface. ATS recommends that everyone use the Infiniband interface because message-passing latency is lower with the IB interface than with the GE interface.
When MPI jobs are submitted to the cluster, one needs to tell the SGE scheduler how many slots are needed to run the jobs. The queue script will prompt you to enter the number of tasks for your job. The queue script default for generic jobs is 4 parallel tasks. Please see Scalability Benchmark below for information on how to determine the optimal number of tasks.
To build or submit an SGE command file for a parallel job, enter:
mpi.q
For details, see How to Run MPI.
To build or submit an SGE command file for a parallel job which also uses the OpenMP threaded library, enter:
mpiomp.q
For details, see OpenMP and programs that combine MPI with OpenMP.
An application job is one which runs software that is provided by a commercial vendor or is open source, and that is usually installed in system directories (e.g., matlab).
To build or submit an SGE command file for an application job, enter:
application.q
where application is replaced with the name of the application. For example, use matlab.q to run matlab batch jobs. For details, see Software Installed on ATS-Hosted Clusters, and its subsequent links for each package or program to How to run on ATS-Hosted Clusters.
This section describes building an SGE command file yourself, instead of letting a queue script build it for you. Or you may modify an SGE command file that a queue script has built, according to the information presented here.
For parallel jobs, ATS strongly recommends that you use the queue script mpi.q to initially create the SGE command file. In addition to the SGE keyword statements that specify time and memory resources appropriate for your job, a queue script-built SGE command file contains shell commands that initialize the environment for the job, invoke the program that the job will run, and perform any job post-processing needed.
The SGE keyword statements in a command file are called active comments because they begin with #$ and comments in a script file normally begin with #. The format of the SGE job command file, with examples, is documented by Sun Microsystems in the Sun Grid Engine User's Guide section on submitting batch jobs.
Any qsub command line option can be used in the command file as an active comment. The qsub command line options are listed on the qsub man page.
Each SGE keyword statement begins with #$ followed by the SGE keyword and its value, if any. For example:
#$ -cwd
#$ -o jobname.joblog
#$ -j y
Here, the first SGE statement (#$ -cwd) specifies that the current working directory is to be used for the job; the second SGE statement (#$ -o jobname.joblog) names the output file in which the SGE command file will write its standard out messages; and the third (#$ -j y) specifies that any messages that SGE may write to standard error are to be merged with those it writes to standard out.
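Put together, a minimal hand-written command file for a serial job might look like the sketch below. The resource values (one hour of run time, 1024 MB per slot) are illustrative only, and the program line is a stand-in: uname -a is used here simply so that the file runs anywhere, and you would replace it with your own executable.

```shell
#!/bin/sh
#$ -cwd
#$ -o jobname.joblog
#$ -j y
#$ -l h_rt=1:00:00,h_data=1024M

# JOB_ID is set by SGE; the fallback lets the script run outside SGE.
echo "Job ${JOB_ID:-none} running on $(uname -n)"

# Stand-in for the real program; replace with your own executable.
program="uname -a"
$program > jobname.output
echo "Program exit status: $?"
```

Because the lines beginning with #$ are ordinary comments to the shell, the file can be run directly for a quick check before it is submitted with qsub.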
For a serial or multi-threaded job using job arrays you need to use an SGE keyword statement of the form:
#$ -t lower-upper:interval
Please see Job Arrays for more information.
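The sketch below illustrates the interval part of the -t statement. It assumes a hypothetical data set of 100 records that is processed in chunks of 10, so each task works out the first and last record it is responsible for. The variable total_records and the function task_range are names invented for this example; SGE_TASK_ID and SGE_TASK_STEPSIZE are set by SGE for array jobs.

```shell
#!/bin/sh
#$ -cwd
#$ -j y
#$ -t 1-100:10

# Hypothetical total number of records; matches the upper index above.
total_records=100

# Print the first and last record handled by one task.
task_range() {
    first=$1
    step=$2
    last=$(( first + step - 1 ))
    if [ "$last" -gt "$total_records" ]; then
        last=$total_records
    fi
    echo "${first} ${last}"
}

# Fallbacks let the script run outside SGE for a quick check.
task_range "${SGE_TASK_ID:-1}" "${SGE_TASK_STEPSIZE:-10}"
```

With -t 1-100:10, SGE starts ten tasks whose SGE_TASK_ID values are 1, 11, 21 and so on up to 91.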
For a parallel MPI job you need to have a line that specifies a parallel environment, similar to one of the examples below. The parallel environment names in these examples are specific to the Hoffman2 Cluster.
If you are not a member of a shared cluster and are only in the campus group, or if you are a member of a shared cluster whose nodes are located in the IDRE Data Center use the dc_idre parallel environment:
#$ -pe dc_idre number_of_slots_requested
If you are a member of a shared cluster whose nodes are located in the MSA Data Center use the dc_msa parallel environment:
#$ -pe dc_msa number_of_slots_requested
If you are not sure in which data center your shared cluster nodes are located, or if you belong to more than one group and are authorized to run on nodes in either data center, you can use -pe dc* number_of_slots_requested and let SGE decide in which data center to run your job. Your parallel job can't use cores from both the IDRE and the MSA Data Centers because a parallel job must run on nodes from a single data center.
The maximum number_of_slots_requested value that you should use depends not only on the number of slots authorized for your group or shared cluster, but also on the parallel scalability of your program. You need to verify your program's actual speed-up before making long production runs. If your code does not scale, it may run slower with more slots, which wastes your time and uses cluster resources inefficiently. Please see Scalability Benchmark below for information on how to determine the optimal number of slots.
An SGE slot usually corresponds to a single core or processor on a multiple-core node. The slots will be distributed among many nodes, and MPI needs to be told on which nodes the allocated slots reside. A different set of nodes, and a different number of slots on each node, may be reserved each time a job runs. That information can be retrieved from an SGE run-time file named by the SGE environment variable $PE_HOSTFILE, which is available inside your SGE command file. If you initially use the mpi.q or mpishm.q script to build your SGE command file, it will make the MPI hostfile for you.
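If you write the command file yourself, the conversion from $PE_HOSTFILE to an MPI hostfile could be sketched as follows. This assumes that each line of $PE_HOSTFILE starts with a host name followed by the number of slots granted on that host, and it writes the "host slots=N" form used by Open MPI; other MPI libraries such as MVAPICH may expect a different layout. The function name make_mpi_hostfile and the output file name are hypothetical, and this is not necessarily how the ATS queue scripts do it.

```shell
#!/bin/sh
# Sketch: turn the SGE PE hostfile into an Open MPI style hostfile.
# Assumes each $PE_HOSTFILE line begins with "hostname slot_count".
make_mpi_hostfile() {
    awk '{ print $1 " slots=" $2 }' "$1" > "$2"
}

# Only runs inside a parallel SGE job, where PE_HOSTFILE is set.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    make_mpi_hostfile "$PE_HOSTFILE" "hostfile.${JOB_ID:-0}"
fi
```

The resulting file could then be handed to the MPI launcher, for example through the --hostfile option of Open MPI's mpirun.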
For a multi-threaded OpenMP job you need to request that all slots be on the same node by using the shared parallel environment.
#$ -pe shared number_of_slots_requested
where the maximum number_of_slots_requested is 8 because the maximum number of cores per node on ATS-Hosted clusters is 8. You should request at least as many slots as your program has threads. Please see Scalability Benchmark below for information on how to determine the optimal number of slots.
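One way to keep the thread count in step with the slot request is to set OMP_NUM_THREADS from NSLOTS, which SGE sets to the number of slots granted to the job. The sketch below assumes a request of 4 slots; the executable name in the final comment is a placeholder.

```shell
#!/bin/sh
#$ -cwd
#$ -j y
#$ -pe shared 4

# NSLOTS is set by SGE to the number of slots granted;
# default to 1 so the script also runs outside SGE.
OMP_NUM_THREADS="${NSLOTS:-1}"
export OMP_NUM_THREADS
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
# ./my_openmp_program   (hypothetical executable, left commented out)
```

Deriving the value from NSLOTS means that changing the -pe line is enough to change the number of threads.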
For an OpenMP job which combines OpenMP and MPI you need to specify one of the nthreads or nthreads_msa parallel environments. Please see the table below for a list of parallel environment names for the Hoffman2 Cluster.
Example of using a possible mix of 4-core nodes and 8-core nodes in the IDRE Data Center. You may receive 4 slots on some 8-core nodes:
#$ -pe 4threads number_of_slots_requested
Example of using only 8-core nodes in the IDRE Data Center:
#$ -pe 8threads number_of_slots_requested
Specifying the 4threads parallel environment gives SGE more flexibility to choose IDRE Data Center nodes, so your job might start sooner. On the other hand, specifying 5threads, 6threads, 7threads or 8threads ensures that your job is allocated slots only on 8-core nodes in the IDRE Data Center.
If you are not sure in which data center your shared cluster nodes are located, or if you belong to more than one group and are authorized to run on nodes in either data center, you can use -pe nthreads* number_of_slots_requested or -pe shared number_of_slots_requested and let SGE decide in which data center to run your job.
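For a combined MPI and OpenMP job, the slot request has to be divided between MPI ranks and threads. The sketch below assumes one MPI rank per node with one thread per slot on that node, so a request of 16 slots in the 8threads environment would give 2 ranks of 8 threads each. This layout and the function name ranks_for are assumptions for illustration; the command file built by mpiomp.q may arrange things differently.

```shell
#!/bin/sh
#$ -pe 8threads 16

# Sketch: derive an MPI-rank / OpenMP-thread layout from the slot request.
# Assumes one MPI rank per node with one thread per slot on that node.
ranks_for() {
    total_slots=$1
    threads_per_node=$2
    if [ $(( total_slots % threads_per_node )) -ne 0 ]; then
        echo "slots must be a multiple of threads per node" >&2
        return 1
    fi
    echo $(( total_slots / threads_per_node ))
}

threads=8
ranks=$(ranks_for "${NSLOTS:-16}" "$threads")
export OMP_NUM_THREADS="$threads"
echo "Would start ${ranks} MPI rank(s) with ${threads} thread(s) each"
```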
| Hoffman2 Cluster Parallel Environments | |
|---|---|
| Parallel Environments for Threaded Programs (OpenMP) | |
| Either 4, 8, 12 or 16-processor node(s) in IDRE, MSA or POD Data Center | |
| shared p | p processors on a single node, p ≤ 16 see notes |
| 2threads p 2threads_msa p 2threads_pod p | 2 processors per node, total p processors |
| 3threads p 3threads_msa p 3threads_pod p | 3 processors per node, total p processors |
| 4threads p 4threads_msa p 4threads_pod p | 4 processors per node, total p processors |
| 5threads p 5threads_msa p 5threads_pod p | 5 processors per node, total p processors |
| 6threads p 6threads_msa p 6threads_pod p | 6 processors per node, total p processors |
| 7threads p 7threads_msa p 7threads_pod p | 7 processors per node, total p processors |
| 8threads p 8threads_msa p 8threads_pod p | 8 processors per node, total p processors |
| 12threads p 12threads_msa p 12threads_pod p | 12 processors per node, total p processors |
| 16threads p 16threads_msa p 16threads_pod p | 16 processors per node, total p processors |
| Parallel Environments for MPI Programs (OpenMPI) | |
| dc_idre p mpi p | p processors on multiple nodes in IDRE Data Center see notes. mpi is deprecated. |
| dc_msa p | p processors on multiple nodes in MSA Data Center see notes |
| dc_pod p | p processors on multiple nodes in POD Data Center see notes |
| alldc p | p processors on multiple nodes irrespective of Data Centers see notes |
| Parallel Environments for Threaded MPI Programs. Other nthreads parallel environments are also available. | |
| 4threads p 4threads_msa p 4threads_pod p | 4 processors per node on either 4 or 8-processor nodes in IDRE, MSA or POD Data Center |
| 8threads p 8threads_msa p 8threads_pod p | 8 processors per node on 8-processor nodes in IDRE, MSA and POD Data Centers. see notes |
| shared note: -pe shared 12, -pe 12threads 12, -pe 12threads_msa 12 and -pe 12threads_pod 12 reserve entire nodes. No other jobs will run or start on the assigned node. | |
| number of processors note: 1: For campus users, currently p ≤ 128. 2: For shared cluster users, p is the entire shared cluster for ≤ 24 hour jobs, or limited to the number of processors contributed by the research group. 3: Permission to use more processors on request to IDRE. 4: While using alldc the job will span across datacenters, over a long distance network and therefore parallel applications may not perform as they performed within a single datacenter. | |
 
Benchmarking. Depending on your access privileges, if you use the 12threads, 12threads_msa or 12threads_pod parallel environment, and number_of_slots_requested is a multiple of 12, then no other jobs will run or start on the same nodes while your job is running. This may be useful for benchmarking, where you want to avoid contention from other jobs running on the same node.
Scalability Benchmark. Before submitting your parallel code for production runs, you should determine the optimal number of processors to use. You can do this by performing a scalability benchmark.
A common way to carry out a so-called strong scaling benchmark is to examine the wall-clock elapsed time of your code for runs with different numbers of processors. You can time your code with the /usr/bin/time command inside your SGE command file or, after a job completes, get the wall-clock seconds with this command:
qacct -j jobid | grep ru_wallclock
Start with a small number of processors (say 2 or 4), and increase the number of processors in each run. Your code scales well if doubling the number of processors halves the wall-clock elapsed time.
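A series of benchmark runs could be submitted with a small loop like the one sketched below. It defaults to a dry run that only prints each qsub command; setting QSUB=qsub on a login node would submit the jobs. The command file name myjob.cmd, the job names and the choice of the dc_idre parallel environment are placeholders. Options given on the qsub command line normally take precedence over the matching #$ lines in the command file, which is what lets one file serve every run.

```shell
#!/bin/sh
# Sketch: submit the same command file with increasing slot counts.
# QSUB defaults to a dry run that only prints each command;
# set QSUB=qsub on a login node to submit for real.
# myjob.cmd and the dc_idre environment are placeholders.
QSUB="${QSUB:-echo qsub}"

submit_scaling_runs() {
    for slots in 2 4 8 16; do
        $QSUB -N "scale_${slots}" -pe dc_idre "$slots" myjob.cmd
    done
}

submit_scaling_runs
```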
In many cases, for a given problem or data size, the code's wall-clock elapsed time stops decreasing once the number of processors is increased beyond a certain point, and beyond that point using more processors may even slow down your job. Normally, the number of processors you should use for production runs is smaller than the number at which the elapsed time stops improving.
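To compare runs, it can help to turn the wall-clock times into a speed-up and a parallel efficiency relative to the smallest run. The sketch below does this with awk; the function name scaling and the timings in the example call are made up for illustration.

```shell
#!/bin/sh
# Sketch: strong-scaling speed-up and parallel efficiency
# relative to a baseline run.
# Arguments: base_procs base_seconds procs seconds
scaling() {
    awk -v p0="$1" -v t0="$2" -v p="$3" -v t="$4" 'BEGIN {
        speedup = t0 / t
        ideal = p / p0
        printf "%.2f %.2f\n", speedup, speedup / ideal
    }'
}

# Example with made-up timings: 120 s on 2 processors, 40 s on 8.
scaling 2 120 8 40
```

For the invented timings above the sketch prints 3.00 0.75: the job ran 3 times faster on 4 times as many processors, an efficiency of 75 percent. An efficiency that drops sharply from one run to the next suggests that the extra processors are no longer paying off.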
High-memory job requirements. Use of the shared, nthreads, nthreads_msa or nthreads_pod parallel environments, where n is between 2 and 16, may provide a work-around for serial jobs requiring more than the default amount of memory per slot. For example, if a sequential code running in the IDRE Data Center requires 4 GB of memory, one may specify "-pe 4threads 4" in addition to specifying "-l h_data=1024M". That way the job reserves 4 slots with 1024 MB each and will have access to a total of 4 GB of memory.
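The arithmetic behind this work-around is a rounded-up division of the memory the job needs by the memory requested per slot. A small sketch, with the function name slots_for_memory chosen only for this example:

```shell
#!/bin/sh
# Sketch: number of slots needed so that slots x per-slot memory
# covers the memory the job requires (both values in MB).
slots_for_memory() {
    required_mb=$1
    per_slot_mb=$2
    # Integer division rounded up.
    echo $(( (required_mb + per_slot_mb - 1) / per_slot_mb ))
}

# 4 GB needed at 1024 MB per slot, as in the example above.
slots_for_memory 4096 1024
```

The result still has to fit within the number of cores on a single node.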
After you have created the SGE command file, issue the appropriate SGE commands from a login node to submit and monitor the job.
When a job has completed, SGE messages will be available in the stdout and stderr files that were defined in your SGE command file with the -o and -e or -j keywords. Program output will be available in any files that your program has written. If your SGE command file was built using a queue script, stdout and stderr from SGE will be found in the file jobname.joblog and output from your program will be found in jobname.output or jobname.output.$SGE_TASK_ID or jobname.output.$JOB_ID.
The recommended way to get an entire node is to use -l exclusive=true in the qsub/qrsh command, e.g.
qsub -l exclusive ... (with other options)
or add it in the job script, e.g.
... (in job script)
#$ -l exclusive=true,h_rt=... (with other options)
...
Using this approach, you will get a whole node with 8, 12 or 16 processor cores. If your application requires a certain number of processor cores, you need to retrieve that information at run time in your job script.
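One way to retrieve the core count at run time is to ask the operating system from inside the job script, as in the sketch below. It tries getconf first and falls back to nproc; which of the two is available depends on the node's operating system. Using the result to set OMP_NUM_THREADS is only one possible use and is an assumption about your program.

```shell
#!/bin/sh
#$ -l exclusive=true

# Ask the operating system how many cores are online on this node.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc 2>/dev/null || echo 1)

# Example use: one OpenMP thread per core.
OMP_NUM_THREADS="$cores"
export OMP_NUM_THREADS
echo "Node $(uname -n) has ${cores} core(s)"
```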
If you want to get a whole node with a certain number of cores, use the num_proc option in addition to the exclusive option, e.g. to get an 8-core whole node:
qsub -l exclusive,num_proc=8
Be aware that specifying the additional num_proc parameter further constrains the job scheduler in selecting available nodes, so your job wait time may increase.
Note: You cannot request multiple whole-nodes using -l exclusive.
March 2010