Computational Cluster Programs

Job Queues and Policies on the Hoffman2 Cluster

The Grid Engine (GE) is the job management system used on the Hoffman2 Cluster to ensure balanced use of resources by matching job needs to available compute resources. GE serves as the job scheduler and enforces the queuing policies described on this page. GE can send you mail when a batch job is aborted, provided you request notification (for example with the -m a option of qsub).

It is important to specify your resource requirements accurately so that GE can pick the correct resources for your job's execution. You should always specify the amount of time that your job requires (the h_rt or time parameter); otherwise the job scheduler enforces its default, which is currently two hours. Do not request more time, memory or processors than your job or interactive session requires, because doing so can delay its start. It may also defeat the job scheduler's back-filling capability, which wastes cluster resources.
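As an illustration of requesting an explicit run time, the following minimal Python sketch formats a time limit as an h_rt value and assembles a qsub command line. It only builds the argument list and does not submit anything. The helper names and the script name myjob.sh are hypothetical; requests for memory or processors are omitted because their resource names are site-specific.

```python
def format_h_rt(hours, minutes=0):
    """Format a run-time limit as an H:MM:SS string for the h_rt resource."""
    total_minutes = hours * 60 + minutes
    if total_minutes <= 0:
        raise ValueError("run time must be positive")
    h, m = divmod(total_minutes, 60)
    return f"{h}:{m:02d}:00"


def qsub_args(script, hours, minutes=0):
    """Build (but do not run) a qsub command line with an explicit h_rt."""
    return ["qsub", "-l", f"h_rt={format_h_rt(hours, minutes)}", script]


# myjob.sh is a hypothetical job script name.
print(" ".join(qsub_args("myjob.sh", 3, 30)))
# prints: qsub -l h_rt=3:30:00 myjob.sh
```

Requesting 3 hours 30 minutes rather than a round 24 hours gives the scheduler a better chance of fitting the job into an otherwise idle gap.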

The Hoffman2 Cluster has the following types of queues:

Queues with time limit of 14 days (highp)

The purpose of these queues is to allow users who belong to a resource group that has contributed nodes to the Hoffman2 Cluster to use their group's nodes for batch jobs that need to run for an extended period of time. Users who belong to more than one resource group and want to direct their job to a particular group's nodes are able to do so.

  • Jobs may run for as long as 14 days (336 hours).
  • Available only to members of resource groups that have contributed nodes to the Hoffman2 Cluster.
  • All members of a resource group are limited to using the nodes that their group has contributed.
  • Nodes contributed by a resource group as part of the Shared Cluster Program will be made available within 24 hours of being requested.

    Note that 24 hours is a maximum wait period. Nodes could be available sooner depending on currently running and pending jobs.

    Note that the 24-hour availability does not mean that every job submitted to these queues will start within 24 hours, because the nodes may be in use by other members of the same resource group.

Queues with time limit of 24 hours

The purpose of these queues is to allow users to run batch jobs on the extended shared Hoffman2 Cluster and to utilize free cycles on nodes contributed by resource groups. The 24-hour queues have access to ATS-contributed nodes from the Base Shared Cluster and to resource group processors that are not currently running jobs.

  • Jobs may run for as long as 24 hours.
  • Available to all users on the Hoffman2 Cluster including general campus users.
  • All members of a resource group that has contributed nodes to the cluster may use more resources than their own group's contribution.
  • There is no guaranteed start time in these queues. Start time is subject to overall cluster utilization, and the availability of nodes that can satisfy the amount of memory and number of processors requested by the job.

Queues which use the interactive nodes

The interactive queues are intended for interactive sessions, including licensed applications which ATS has purchased for general use. These queues include ATS-contributed nodes which are not included in the 14-day, 24-hour or express queues. See How to Get an Interactive Session through GE for more information.

  • Sessions have a 24-hour time limit.
  • Available to all users on the Hoffman2 Cluster.
  • Limited to 8 processors per user.
  • Sessions should start immediately in these queues, depending on the number of processors requested. Immediate startup of interactive sessions is guaranteed for sessions requesting a single processor. If all requirements are met and an interactive session has not started within one or two minutes, please send email to atshpc@ucla.edu.

Express queue

The purpose of the express queue is to increase the availability of shared resources to users' batch jobs and to increase the overall utilization of the cluster. Although no active users are excluded from using the express queue, it is general campus users who benefit by being able to run a larger number of jobs concurrently. See Express queue for more information.

  • Jobs have a 2-hour time limit.
  • Available to all users on the Hoffman2 Cluster.
  • Each job or array job task must run on a single node.
  • There is no guaranteed start time. Jobs usually start running within 5-10 minutes.
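The limits of the three batch queue types described on this page can be summarized in a small Python sketch. This is only a restatement of the stated limits, not how GE itself selects a queue; the function and parameter names are hypothetical, and interactive queues are left out.

```python
EXPRESS_LIMIT_H = 2        # express queue: 2-hour limit, single node
SHARED_LIMIT_H = 24        # 24-hour queues: open to all users
HIGHP_LIMIT_H = 14 * 24    # highp queues: 14 days (336 hours)


def eligible_queue_types(run_hours, single_node=True,
                         group_contributed_nodes=False):
    """List the batch queue types whose stated limits a job satisfies."""
    if run_hours <= 0:
        raise ValueError("run_hours must be positive")
    types = []
    if run_hours <= EXPRESS_LIMIT_H and single_node:
        types.append("express")
    if run_hours <= SHARED_LIMIT_H:
        types.append("24-hour")
    if run_hours <= HIGHP_LIMIT_H and group_contributed_nodes:
        types.append("highp")
    return types


print(eligible_queue_types(1))
# prints: ['express', '24-hour']
print(eligible_queue_types(72, group_contributed_nodes=True))
# prints: ['highp']
print(eligible_queue_types(72))
# prints: []
```

An empty result corresponds to a job that must either checkpoint within 24 hours or be covered by a special request.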

Checkpointing

Programs that require more than 24 hours to complete and that need to run in queues limited to 24 hours should checkpoint before the 24-hour limit is reached so that they can be continued later.
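One common way to checkpoint at the application level is to save the program's state to a file before the time limit and to resume from that file on the next run. The Python sketch below illustrates the idea under stated assumptions: the state is a small dictionary, the "work" is a placeholder sum, and all names (load_state, save_state, run, the checkpoint path) are hypothetical. It is not a cluster-provided checkpointing facility.

```python
import json
import os
import time


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint via a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, time_budget_s, clock=time.monotonic):
    """Work until finished or until the time budget is spent.

    Returns True if all steps completed, False if the run stopped early
    after saving a checkpoint.
    """
    start = clock()
    state = load_state(path)
    while state["next_step"] < n_steps:
        if clock() - start >= time_budget_s:
            save_state(path, state)
            return False
        state["total"] += state["next_step"]  # placeholder for real work
        state["next_step"] += 1
    save_state(path, state)
    return True
```

In a 24-hour queue, the time budget would be set comfortably below the limit (for example 23 hours, expressed in seconds) so that the checkpoint is written before the scheduler ends the job; resubmitting the same job then continues from the saved step.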

Special Requests

If your account is in the general campus group and your program absolutely requires more than 24 hours to run and cannot be stopped and restarted within the 24-hour time frame, you can make a special request to have it run for a longer period of time. Send your request by email to atshpc@ucla.edu. Include the following in your request:

  • Your name
  • Your sponsor or Principal Investigator's name
  • Your login id
  • An explanation of why access to a longer duration queue is critical for your work.
  • How long your job needs to run (e.g., 3 days).
  • The duration of your request (i.e., how many days or weeks you will need access to this queue).

ATS staff will respond to requests during normal business hours.

January 2011