Computational Cluster Programs

Batch Job Queues and Policies on the Hoffman2 Cluster

Number of Batch Nodes and Processors

As of Dec. 2007, the Hoffman2 Cluster has:

  • 40 compute nodes in the Campus General Purpose Cluster.
  • 24 compute nodes in the Shared Cluster System.
  • A number of Research Group Virtual clusters.
  • An Application Cluster whose purpose is to run serial applications and those licensed applications which ATS provides.

Since each node has 4 cores, the queues have been set up to run 4 processes or jobs per node.
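
As a quick check on the figures above, the number of job slots in each part of the cluster is the node count times 4 cores per node. The sketch below is illustrative only: the dictionary and function names are made up for this example, and it covers only the two parts of the cluster whose node counts are stated.

```python
# Node counts as of Dec. 2007, taken from the list above.
CORES_PER_NODE = 4

NODE_COUNTS = {
    "Campus General Purpose Cluster": 40,
    "Shared Cluster System": 24,
}


def job_slots(nodes, cores_per_node=CORES_PER_NODE):
    """Return how many processes can run at once, one per core."""
    return nodes * cores_per_node


for name, nodes in NODE_COUNTS.items():
    print(f"{name}: {nodes} nodes x {CORES_PER_NODE} cores = {job_slots(nodes)} slots")
```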

Batch Queues and Policies (or see the Summary Table)

The Hoffman2 Cluster has the following queues:

  • For members of Research Groups that have contributed nodes to the shared Hoffman2 Cluster:
    • researchGroup Queue

      Each research group that has contributed nodes to the Hoffman2 Cluster has a queue named after the group.

      Each research group queue consists of either:

      • a set of "equivalent cores" that are equivalent to the number of nodes/cores purchased for the shared Hoffman2 Cluster by the group
      • or, for those groups which have nodes that are fundamentally different from the standard nodes in the cluster, the physical nodes themselves.

      Queue properties:

      • Except for research groups that have fundamentally different nodes, these queues are limited to 1 GB per processor.

      • Jobs run in the Research Group Queues have a 14-day time limit.

      • Research group queues are constrained to the number of processors in the nodes that the research group contributed to the Hoffman2 Cluster.

      • Only the members of a particular research group can submit jobs to the research group's queues.

      • Equivalent cores or physical nodes, whichever the case may be, purchased by a research group as part of the Shared Cluster Program will be made available within 24 hours of being requested. Note that 24 hours is a maximum wait period; nodes could be available sooner depending on currently queued/running jobs. Note also that the 24 hour availability does NOT mean that every job submitted to these Research Group Queues will start within 24 hours because the nodes/cores may be in use by other jobs submitted by the same group.

    • 24hour Queue

      The purpose of this queue is both to harvest unused cycles and to allow members of research groups that have contributed nodes to run on the extended shared Hoffman2 Cluster.

      The 24 hour queue has access to ATS-contributed nodes/cores from the Base Shared Cluster and research group equivalent cores that are not currently running jobs.

      Only those research groups that have contributed nodes to the shared Hoffman2 Cluster can take advantage of the 24 Hour Queue. This queue is not open to users of the Campus General Purpose Cluster.

      Queue Properties:

      • 1 GB per processor.
      • The time limit in this queue is 24 hours.
      • There is no guaranteed start time in this queue. Start time is subject to overall cluster utilization and the number of equivalent cores requested.
      • Priority in this queue will be given based on the number of nodes purchased for the shared Hoffman2 Cluster by the research group.
      • Jobs submitted to the 24 hour queue will run on ANY idle nodes. Members of a research group that has contributed nodes to the Shared Cluster that are fundamentally different from the other nodes, or that have distinguishing characteristics, therefore cannot make use of those characteristics for jobs run in this queue.

  • For Users of the Campus General Purpose nodes of the shared Hoffman2 Cluster:
    • Campus Queue

      The campus queue is intended for parallel jobs submitted by those members of the UCLA community who have access to the Campus General Purpose Cluster. It is limited to the number of processors in that part of the Cluster.

      Queue Properties:

      • 1 GB per processor.
      • The time limit in this queue is 24 hours.
      • There is no guaranteed start time in this queue. Start time is subject to overall cluster utilization and the number of cores requested.

      If your program absolutely requires more than 24 hours to run and cannot be stopped and restarted within the 24 hour time frame, you can make a special request to have it run for up to either 3 or 5 days. Send your request by email to atshpc@ucla.edu. Include the following in your request:

      • Your name
      • Your sponsor's/PI's name
      • Your login id
      • An explanation as to why access to a longer duration queue is critical for your work
      • Which queue you will need (3 days or 5 days) and the duration of your request (i.e., how many days or weeks you will need access to this queue)

      ATS staff will respond to requests during normal business hours.

  • For all Users:
    • Application Queue

      The application queue is intended for serial jobs and for jobs that run the licensed applications which ATS has purchased for campus use. The nodes of the application cluster differ from those of the rest of the shared Hoffman2 Cluster in the number of cores they have per node and in that they do not have InfiniBand interconnects. They are Intel 64-bit nodes, whereas all the other nodes are AMD 64-bit nodes.

      Queue Properties:

      • 2 GB per processor.
      • The time limit in this queue is 24 hours.
      • There is no guaranteed start time in this queue. Start time is subject to overall cluster utilization and the number of cores requested.

Submitting Jobs to Run

The Sun Grid Engine (SGE) is the job management system used on the Hoffman2 Cluster to ensure balanced use of resources by matching job needs to available resources. SGE serves as the job scheduler. SGE knows which users are in which groups and enforces the queueing policies. When you specify the time limit for the job, SGE will place your job in the correct queue even if you do not name a queue yourself.
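
To illustrate requesting a time limit without naming a queue, here is a minimal sketch that builds a qsub command line. It assumes the cluster uses SGE's standard `h_rt` resource for wall clock time; the exact resource names accepted on Hoffman2 are not stated here, so treat that as an assumption. The helper names and the script name `myjob.sh` are hypothetical. The command is only printed, not executed.

```python
import shlex


def format_h_rt(hours):
    """Format a whole number of hours as an H:MM:SS string for h_rt."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return f"{hours:d}:00:00"


def build_qsub_command(script, hours):
    """Return the qsub argument list for a job needing `hours` of wall clock time.

    No queue is named: the scheduler is left to pick one from the time request.
    Assumption: h_rt is the resource the site uses for the time limit.
    """
    return ["qsub", "-l", f"h_rt={format_h_rt(hours)}", script]


# Hypothetical job script name, used only for illustration.
cmd = build_qsub_command("myjob.sh", 24)
print(shlex.join(cmd))
```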

When you submit a job by any of the available methods (from the UCLA Grid Portal, via the queue scripts, or with the qsub command), just request the number of wall clock hours of execution required and any needed applications. Your job will automatically be assigned to a queue as follows:

  • Queue a job will run in for a member of a research group that has contributed nodes to the shared Hoffman2 Cluster:

    Is the number of cores requested by the job greater than the number contributed by the research group to the shared Hoffman2 Cluster?

    Answer   <= 24 hours requested                                             > 24 hours requested
    No       Research group or 24 hour queue, whichever starts the job first   researchGroupName.q
    Yes      The 24 hour queue                                                 This job can never run

    Is this job asking for a licensed application that the research group does not have licenses for?

    Answer   <= 24 hours requested   > 24 hours requested
    Yes      The application queue   This job can never run

  • Queue a job will run in for a campus user. Note that campus users are limited to 24 hours. Jobs requesting more than 24 hours may generate a qsub error message and not be submitted.

    Is this job asking for a licensed application ATS is providing?

    Answer   Queue
    No       The campus queue
    Yes      The application queue
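
The assignment rules above can be sketched as a pair of functions. This is an illustration of the tables, not the scheduler's actual logic. The function and parameter names are hypothetical, and checking the licensed-application question before the core count is an assumption, since the order in which the two questions are applied is not stated.

```python
def queue_for_group_member(hours, cores_requested, cores_contributed,
                           needs_unlicensed_app=False,
                           group="researchGroupName"):
    """Queue for a member of a research group that has contributed nodes.

    Returns a description of the queue, or None if the job can never run.
    """
    short_job = hours <= 24
    # Assumption: the licensed-application question is checked first.
    if needs_unlicensed_app:
        return "application queue" if short_job else None
    if cores_requested > cores_contributed:
        return "24 hour queue" if short_job else None
    if short_job:
        return f"{group}.q or 24 hour queue, whichever starts the job first"
    return f"{group}.q"


def queue_for_campus_user(hours, needs_ats_app=False):
    """Queue for a campus user; None means the job may be rejected by qsub."""
    if hours > 24:
        return None
    return "application queue" if needs_ats_app else "campus queue"


# A group that contributed 32 cores asks for 64 cores for 12 hours.
print(queue_for_group_member(hours=12, cores_requested=64, cores_contributed=32))
# A campus user asks for 48 hours.
print(queue_for_campus_user(hours=48))
```

Returning None rather than raising an error keeps the "can never run" cases easy to test for in a submission script.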

Checkpointing

Programs that require more than 24 hours to complete and which have to be run in queues limited to 24 hours should checkpoint before the 24 hours are up so that they can be continued later.
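
A minimal sketch of the checkpoint-and-restart pattern follows, using a toy job that sums integers. It is one possible approach, not a prescribed method: the file format (JSON), the function names, and the `max_steps_this_run` parameter, which stands in for the queue's time limit, are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint.

    The rename replaces the file in one step, so a job killed mid-write
    leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh starting state if none exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every=100, max_steps_this_run=None):
    """Sum the integers 0..n_steps-1, resuming from the checkpoint at `path`.

    The function saves its state and returns early once
    `max_steps_this_run` steps have been done in this invocation.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if max_steps_this_run is not None and done_this_run >= max_steps_this_run:
            break
        state["total"] += state["step"]  # the "real work" of this toy job
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always save on the way out
    return state
```

Calling `run` a second time with the same checkpoint path picks up at the saved step rather than starting over, which is what lets a long computation be split across several 24 hour jobs.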