The Grid Engine (GE) is the job management system used on the Hoffman2 Cluster to ensure balanced use of resources by matching job needs to available compute resources. GE serves as the job scheduler and enforces the queuing policies described on this page. GE always sends you mail if your batch job fails.
Specify your resource requirements accurately so that GE can pick the correct resources for your job's execution. Always specify the amount of time that your job requires (the h_rt or time parameter); otherwise the job scheduler will enforce its default, which is currently two hours. Do not request more time, memory or processors than your job or interactive session requires, because over-requesting delays the start of the job. It may also defeat the job scheduler's back-filling capability, which wastes cluster resources.
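As an illustration of requesting an explicit run time, the following minimal Python sketch formats a time limit and assembles a `qsub` command line that passes it through the `-l h_rt=` resource request. The helper names and the script name `myjob.sh` are hypothetical and are not part of GE or of the Hoffman2 tools; adapt the resource list to what your job actually needs.

```python
from datetime import timedelta
from typing import List


def format_h_rt(limit: timedelta) -> str:
    """Render a run-time limit as H:MM:SS for use with `-l h_rt=`."""
    total = int(limit.total_seconds())
    if total <= 0:
        raise ValueError("run-time limit must be positive")
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def build_qsub_command(script: str, limit: timedelta) -> List[str]:
    """Return the argument list for submitting `script` with a time limit.

    Hypothetical helper: it only builds the command and does not run it.
    """
    return ["qsub", "-l", f"h_rt={format_h_rt(limit)}", script]


if __name__ == "__main__":
    # Request 8 hours for a hypothetical job script.
    print(" ".join(build_qsub_command("myjob.sh", timedelta(hours=8))))
```

Requesting a limit close to the job's real run time, rather than the maximum a queue allows, gives the scheduler more room to back-fill the job into idle slots.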
The Hoffman2 Cluster has the following types of queues:
The purpose of these queues is to allow users who belong to a resource group that has contributed nodes to the Hoffman2 Cluster to use their group's nodes for batch jobs that need to run for an extended period of time. Users who belong to more than one resource group and want to direct their job to a particular group's nodes are able to do so.
Note that 24 hours is a maximum wait period. Nodes could be available sooner depending on currently running and pending jobs.
Note that the 24-hour availability does not mean that every job submitted to these queues will start within 24 hours, because the nodes may be in use by other members of the same resource group.
The purpose of these queues is to allow users to run batch jobs on the extended shared Hoffman2 Cluster and to utilize free cycles on nodes contributed by resource groups. The 24-hour queues have access to ATS-contributed nodes from the Base Shared Cluster and to resource group processors that are not currently running jobs.
The interactive queues are intended for interactive sessions, including licensed applications which ATS has purchased for general use. These queues include ATS-contributed nodes which are not included in the 14-day, 24-hour or express queues. See How to Get an Interactive Session through GE for more information.
The purpose of the express queue is to increase the availability of shared resources to users' batch jobs and to increase the overall utilization of the cluster. Although no active users are excluded from using the express queue, it is general campus users who benefit by being able to run a larger number of jobs concurrently. See Express queue for more information.
Programs that require more than 24 hours to complete but must run in queues limited to 24 hours should checkpoint before the 24 hours are up so that they can be continued later.
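One way to approach checkpointing is sketched below in Python, under stated assumptions: the program's state is small enough to store as JSON, the work can be split into steps, and the file name, step interval and time budget are illustrative values chosen for this example rather than Hoffman2 settings. The program saves its state periodically, stops cleanly once its time budget is used up, and resumes from the saved state when it is submitted again.

```python
import json
import os
import tempfile
import time


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place,
    so an interrupted write never leaves a truncated checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, n_steps, budget_seconds, save_every=100, clock=time.monotonic):
    """Run up to n_steps of work, resuming from `path` if it exists.

    Returns True if all steps finished, False if the time budget ran out
    first (the state is saved, so a later run can continue).
    """
    state = load_checkpoint(path)
    start = clock()
    while state["step"] < n_steps:
        if clock() - start >= budget_seconds:
            save_checkpoint(path, state)
            return False
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return True


if __name__ == "__main__":
    # Budget of 23 hours leaves a margin before a 24-hour limit (assumed value).
    finished = run("checkpoint.json", n_steps=1000, budget_seconds=23 * 3600)
    print("finished" if finished else "stopped early; resubmit to continue")
```

Choosing a budget somewhat below the queue limit leaves time for the final save before the scheduler ends the job.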
If your account is in the general campus group and your program absolutely requires more than 24 hours to run and cannot be stopped and restarted within the 24-hour time frame, you can make a special request to have it run for a longer period of time. Send your request by email to atshpc@ucla.edu. Include the following in your request:
ATS staff will respond to requests during normal business hours.
January 2011