Cluster Hosting

IDRE Cluster Hosting Program - The Hoffman2 Shared Cluster

The Shared Hoffman2 Cluster is made up of two main virtualized clusters that have been optimized for different research needs. The Research Virtual Shared Cluster is made up of Contributed cores purchased by individual research groups and Base cores purchased by IDRE to augment the Contributed cores. One benefit of contributing cores to the shared cluster is that a research group is guaranteed use of the number of cores it contributed, along with the ability to use surplus cores from across the entire Hoffman2 Cluster. Other benefits provided to research groups when they join the shared cluster include:

  1. Complete system administration for contributed cores.
  2. Cluster access through a 10Gb network interconnect to the campus backbone.
  3. High-performance home and scratch storage space.
  4. A dedicated data center facility for housing the cluster. This eliminates the need to perform expensive space, cooling, and electrical modifications to existing office or lab space.
  5. The capability to run large parallel jobs that can take advantage of the cluster's InfiniBand interconnect.

Research groups who have contributed cores to the Research Virtual Shared Cluster also have access to the features of the General Purpose Cluster. This gives them:

  1. Access to pooled licenses, allowing researchers to run larger commercial applications without the cost of buying additional licenses,
  2. Access to additional commercial and open source applications, and
  3. Web access to the Hoffman2 Cluster through the UCLA Grid Portal.

Base and Contributed Equipment Standards and Policies

All contributed hardware must be compatible with the base core architecture, processor type and speed, memory, disk space, and interconnect. This maximizes the effective management of the Hoffman2 Cluster so that it can provide the highest level of computing services to shared cluster customers. IDRE provides full support to help researchers specify and purchase cores that meet these standards at the optimal price/performance.

Once contributed, these cores become part of the entire Hoffman2 Cluster and are no longer physically linked to a given research group. Because cycles are pooled across all Base and Contributed cores, which may be in use by others, a number of cores equivalent to those contributed is made available within 24 hours of a request. In practice, the number of cores contributed by a research group is generally available much sooner. Jobs that run on the Research Virtual Shared Cluster have a 14-day upper limit (with appropriate notification, longer runs may be accommodated).

While it is hard to give an exact number of additional cores available, in practice there are unused cores that can be made available within a reasonable period of time for researchers who need cores in addition to those they contributed.

With advance agreement, a very large job that requires a large segment of the entire shared cluster (the cores connected through the InfiniBand interconnect) can be accommodated, depending on current cluster usage and the consent of affected research groups.

Research Virtual Shared Cluster Hosting Costs

Research groups that contribute cores to the Hoffman2 Cluster agree to contribute their unused cycles to other researchers. They can regain full use of their contributed cores within 24 hours of submitting a job.

Users of the Research Virtual Shared Cluster and users of the General Purpose Cluster have the option of paying a one-time, per-terabyte charge for storage on the BlueArc storage system. This is a particularly important option for those who need more than the 20 GB of directory space per user that is standard on the Research Virtual Shared and General Purpose Clusters, or who want increased permanent space for large data sets to avoid recurring upload and transfer times. Please see the IDRE Shared HPC Storage Program for further information.

Base and Contributed Equipment Renewals

After a period of three years, all hardware within the shared cluster is evaluated for retention based on the condition of the equipment, the cost to maintain it, its relative compute power, and the ability to backfill with new systems. This is done to maintain a high-performance, low-maintenance system while maximizing the utilization of data center space.

If the contributed cores can still be effectively maintained, those cores will remain inside the Hoffman2 Cluster and continue to be reevaluated on an annual basis. If the contributed cores can no longer be effectively maintained, upon mutual agreement, they will be redeployed for other uses or decommissioned.

The Campus General Purpose Cluster

UCLA faculty (and their students) who have not contributed cores run parallel jobs on the General Purpose Cluster.

The Campus General Purpose Cluster is the part of the Hoffman2 Cluster System provided as a high performance computing resource for the entire UCLA campus. It is available to UCLA students and faculty who:

  • Run primarily commercial applications and/or user-written, discipline-specific applications,
  • Have low-level or sporadic usage, and
  • Require a specific application, compiler, or visualization tool available only on the General Purpose or Applications Clusters.

Because resources in the General Purpose Cluster are limited, there are restrictions on the jobs that can be run:

  • The maximum run time for a job is 24 hours. Jobs running longer than 24 hours will be killed by the scheduler.
  • Jobs are limited to a maximum of 128 cores. Jobs requesting more than 128 cores will not schedule.
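The two General Purpose Cluster limits above can be expressed as a simple pre-submission check. The following is a minimal Python sketch, assuming a job is described only by its requested run time in hours and its core count. The function name `check_general_purpose_request` and its return format are hypothetical and are not part of the Hoffman2 scheduler or any IDRE tool; treating exactly 24 hours and exactly 128 cores as allowed is also an assumption about how the boundaries are applied.

```python
# Hypothetical pre-submission check against the General Purpose Cluster limits
# described above. The two constants come from this page; the function name and
# return format are illustrative assumptions, not part of any Hoffman2 tool.

GP_MAX_RUNTIME_HOURS = 24  # jobs running longer are killed by the scheduler
GP_MAX_CORES = 128         # jobs requesting more cores will not schedule


def check_general_purpose_request(runtime_hours: float, cores: int) -> list:
    """Return a list of problems with a job request; an empty list means it fits."""
    problems = []
    if cores < 1:
        problems.append("core count must be at least 1")
    elif cores > GP_MAX_CORES:
        problems.append(
            f"{cores} cores exceeds the {GP_MAX_CORES}-core limit; the job will not schedule"
        )
    if runtime_hours > GP_MAX_RUNTIME_HOURS:
        problems.append(
            f"{runtime_hours} h exceeds the {GP_MAX_RUNTIME_HOURS} h limit; the job would be killed"
        )
    return problems


if __name__ == "__main__":
    print(check_general_purpose_request(12, 64))   # []
    print(check_general_purpose_request(36, 256))  # two problems reported
```

A request that fails this check would need to be shortened, split into smaller pieces, or run on the Research Virtual Shared Cluster, where the 14-day limit applies instead.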

The Shared Hoffman2 Hardware and Software

The Hoffman2 Cluster has 64-bit nodes with an Ethernet network and InfiniBand interconnect, with the following standard software suite:

  • Scheduler
  • Compilers: GCC, plus the best-performing compiler on the current Shared Cluster architecture for C, C++, and Fortran 77, 90, and 95
  • Applications and Libraries in the Basic Software Suite

Certain applications are provided for a base level of cluster usability. Every effort is made to maximize application usage to the extent permitted by license agreements. Where possible, software is provided that would not make sense for an individual research group to purchase on its own.

In addition to the Base and Contributed cores, the Hoffman2 Cluster includes the login nodes and the storage server. The Hoffman2 Cluster has both InfiniBand and gigabit Ethernet network switches and interconnect. The Ethernet interconnect is dedicated to traffic in and out of the storage system and to various administrative functions, and it also serves as the interconnect for the Applications Cluster. To maintain maximum parallel performance, InfiniBand is used strictly for inter-node, MPI-type communication across the Research Virtual Shared Cluster and the General Purpose Cluster.