Computational Cluster Programs

Batch Job Queues and Policies on the Hoffman2 Cluster

Number of Batch Nodes and Processors

As of October 2009, the Hoffman2 Cluster has:

  • 256 cores from compute nodes in the Campus General Purpose Cluster.
  • A number of Research Group Virtual clusters.

The current size of the Hoffman2 Cluster is more than 3400 cores and still growing. Since each node has either 4 or 8 cores, the queues have been set up to run either 4 or 8 processes or jobs per node.

Batch Queues and Policies (or see the Summary Table)

The Hoffman2 Cluster has the following two types of queues:

  • Queues with time limit 14 days and higher priority.
  • Queues with time limit 24 hours.

Queues with a 14-day limit and higher priority are accessible only to members of Research Groups that have contributed nodes to the shared Hoffman2 Cluster. These queues also have allocation rules that restrict the number of processors that each research group can use.

  • High Priority Queues

    Each research group's allocation in higher priority queues (14 day limit) consists of either:

    • the number of "equivalent cores" equal to the number of cores purchased for the shared Hoffman2 Cluster by that research group
    • or, for those groups which have nodes that are fundamentally different from the standard nodes in the cluster, the physical nodes themselves.

    High Priority Queues Properties:

    • Except for research groups that have fundamentally different nodes, these queues are limited to 1 GB per processor.
    • Jobs run in the higher priority queues have a 14-day time limit.
    • Research group allocations are constrained to the number of processors that the research group contributed to the Hoffman2 Cluster.
    • Only members of a research group which has contributed to the shared cluster can submit jobs to the higher priority queues.
    • Equivalent cores or physical nodes, whichever the case may be, purchased by a research group as part of the Shared Cluster Program will be made available within 24 hours of being requested through a higher priority queue. Note that 24 hours is a maximum wait period; nodes could be available sooner depending on currently queued and running jobs. Note also that the 24 hour availability does not mean that every job submitted to these Research Group Queues will start within 24 hours, because the nodes or cores may be in use by other jobs submitted by the same group.

  • 24 Hour Shared Queues

    The purpose of these queues is both to harvest unused cycles and to allow members of research groups that have contributed nodes to run jobs on the extended shared Hoffman2 Cluster.

    The 24 hour queues have access to ATS-contributed cores from the Base Shared Cluster, and research group equivalent cores that are not currently running jobs.

    Only research groups that have contributed nodes to the shared Hoffman2 Cluster can use another research group's idle contributed processors.

    24 Hour Shared Queues Properties:

    • 1 GB per processor, or research group equivalent
    • The time limit in this type of queue is 24 hours.
    • There is no guaranteed start time in these queues. Start time is subject to overall cluster utilization and the number of equivalent cores requested.
    • Jobs submitted to the 24 hour queues run on any idle node. Members of a research group that has contributed nodes that are fundamentally different from the other nodes, or that have distinguishing characteristics, therefore cannot rely on these queues to reach those unique nodes.
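
As a rough illustration of the two queue types described above, the following Python sketch models how a research group's allocation limits its high priority jobs, and which cores the 24 hour shared queues can draw on. It is not the scheduler's actual logic; the function names and parameters are hypothetical.

```python
def high_priority_status(requested, group_in_use, group_contributed):
    """Classify a high priority job against its research group's allocation.

    All three arguments are core counts (hypothetical inputs for illustration).
    """
    if requested > group_contributed:
        # Larger than the group's whole allocation: waiting will not help.
        return "never"
    if requested > group_contributed - group_in_use:
        # Fits the allocation, but the group's own jobs currently hold the cores,
        # so the 24 hour availability window does not apply yet.
        return "wait"
    # Cores are due to be made available within 24 hours of the request.
    return "eligible"


def shared_pool_cores(base_cores, idle_group_cores):
    """Cores the 24 hour shared queues can draw on.

    base_cores: ATS-contributed cores from the Base Shared Cluster.
    idle_group_cores: iterable of per-group counts of contributed cores
    that are not currently running jobs.
    """
    return base_cores + sum(idle_group_cores)
```

A group that contributed 8 cores and already has 4 in use would see a new 8-core high priority job classified as "wait", which matches the note above that jobs may not start within 24 hours when the group's own jobs occupy its cores.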

For Users of the Campus General Purpose nodes of the shared Hoffman2 Cluster:

  • Resources for Campus Users

    The 24 hour queues are intended for parallel jobs submitted by members of the UCLA community who have access to the Campus General Purpose Cluster. These queues are limited to the number of processors in that part of the Cluster.

    Campus Queues Properties:

    • 1 GB per processor.
    • The time limit in this queue is 24 hours.
    • There is no guaranteed start time in this queue. Start time is subject to overall cluster utilization and the number of cores requested.

    If your program, for some reason, absolutely requires more than 24 hours to run and cannot be stopped and restarted in the 24 hour time frame, you can make a special request to have it run for a maximum of either 3 or 5 days. Send your request by email to atshpc@ucla.edu. Include the following in your request:

    • Your name
    • Your sponsor or Principal Investigator's name
    • Your login id
    • An explanation of why access to a longer duration queue is critical for your work.
    • Which queue you will need (3 days or 5 days) and the duration of your request (i.e., for how many days or weeks you will need access to this queue).

    ATS staff will respond to requests during normal business hours.

For all Users:

  • Interactive Queues

    The Interactive queues are intended for interactive sessions, including licensed applications which ATS has purchased for general use.

    Interactive Queues Properties:

    • 1 GB per processor.
    • The time limit in this queue is 24 hours.
    • Jobs in these queues should start immediately, depending on the number of cores requested. Immediate startup of an interactive session is guaranteed for a job requesting a single processor, or for up to 8 processors in combined use by a single research group that has contributed nodes to the shared cluster. If all requirements are met and your interactive session does not start immediately, please send email to atshpc@ucla.edu.

Submitting Jobs to Run

The Sun Grid Engine (SGE) is the job management system used on the Hoffman2 Cluster to ensure balanced use of resources by matching job needs to available compute resources. SGE serves as the job scheduler. SGE knows which users are in which groups and enforces the queuing policies. Therefore it is important to specify your job's resource requirements correctly. SGE will pick the correct resources for its execution. Do not request more resources than your job requires because that may delay your job starting, and will defeat SGE's backfilling capability.
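
To make the idea of a resource request concrete, here is a minimal Python sketch that assembles a qsub argument list with a wall clock limit. It assumes the standard Grid Engine resource request `-l h_rt=HH:MM:SS`; the helper names and the script name are hypothetical, and site-specific options (parallel environments, memory or priority resources) are left out because the text above does not specify them.

```python
def format_h_rt(hours):
    """Convert a number of hours (may be fractional) to an H:MM:SS string."""
    total_seconds = round(hours * 3600)
    h, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"


def build_qsub_command(script, hours):
    """Return the argument list for submitting `script` with a wall clock limit.

    Assumption: the cluster honors the standard Grid Engine h_rt resource.
    The list is only built here, not executed.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    return ["qsub", "-l", f"h_rt={format_h_rt(hours)}", script]
```

Requesting only the time the job really needs, for example `build_qsub_command("myjob.sh", 6)` rather than a blanket 24 hours, is what lets the scheduler fit the job into gaps between larger jobs.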

When you submit a job by any method (from the UCLA Grid Portal, via the queue scripts, or with the qsub command), request the number of wall clock hours of execution required, the type of job (for example, high priority or interactive), and any needed applications. Your job will automatically be assigned to a queue as follows:

  • Queue a job will run in for a member of a research group that has contributed nodes to the shared Hoffman2 Cluster:

    High Priority Request. The queue in which the job will start depends on the number of hours requested and on whether the number of cores requested by the job is greater than the number contributed by the research group to the shared Hoffman2 Cluster:

    Cores requested > contributed?   <= 24 hours                      > 24 hours
    No                               High priority (14-day limit)     High priority (14-day limit)
    Yes                              This job can never run.          This job can never run.

    No Priority Request. The queue in which the job will start depends on the same two conditions:

    Cores requested > contributed?   <= 24 hours                      > 24 hours
    No                               Shared queues (24 hour)          This job can never run.
    Yes                              Shared queues (24 hour)          This job can never run.

  • Queue a job will run in for a campus user. Note that campus users are limited to 24 hours. Jobs requesting more than 24 hours may generate a qsub error message and not be accepted.

    Is this job asking for a licensed application ATS is providing?   Queue
    No                                                                The shared 24 hour queues.
    Yes                                                               The shared 24 hour or interactive queues.
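
The queue assignment rules above can be written out as a small Python sketch. This is only a restatement of the tables for illustration, not how SGE is configured; the function and constant names are hypothetical, and the check against the 14-day limit is an assumption drawn from the stated limit rather than a case the tables list explicitly.

```python
HIGH_PRIORITY = "high priority (14-day limit)"
SHARED_24H = "shared 24 hour"
SHARED_OR_INTERACTIVE = "shared 24 hour or interactive"

MAX_HIGH_PRIORITY_HOURS = 14 * 24  # 14-day limit stated for these queues


def research_group_queue(high_priority, cores_requested, cores_contributed, hours):
    """Queue for a member of a contributing research group, or None if the job can never run."""
    if high_priority:
        if cores_requested > cores_contributed:
            return None  # exceeds the group's allocation
        if hours > MAX_HIGH_PRIORITY_HOURS:
            return None  # assumption: beyond the 14-day limit
        return HIGH_PRIORITY
    # No priority request: the core count does not matter, only the time limit.
    return SHARED_24H if hours <= 24 else None


def campus_queue(needs_licensed_app, hours):
    """Queue for a campus user, or None if the request exceeds 24 hours."""
    if hours > 24:
        return None  # qsub may reject the job with an error message
    return SHARED_OR_INTERACTIVE if needs_licensed_app else SHARED_24H
```

For example, a research group member asking for 48 hours gets a queue only with a high priority request that fits the group's contributed cores; the same request without high priority can never run.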

Checkpointing

Programs that require more than 24 hours to complete and which have to be run in queues limited to 24 hours should checkpoint before 24 hours is up so they can be continued later.
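
A minimal checkpoint and restart pattern might look like the following Python sketch. It is an illustration only: the state layout, the checkpoint interval, and the toy loop standing in for real work are all hypothetical, and many applications provide their own restart-file mechanisms that should be preferred.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an interrupted write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Resume from the checkpoint at `path` and work toward n_steps.

    stop_after simulates running out of time after that many steps in this run.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            break
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # also save on a clean exit
    return state
```

A job killed at the time limit loses only the work done since the last periodic checkpoint, so the interval should be chosen so that the final checkpoint lands comfortably before the 24 hour mark; resubmitting the same job then continues from the saved step.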

October 2009