Cluster Hosting

IDRE Cluster Hosting Program

The Cluster Hosting Program consists of the Shared Hoffman2 Cluster and private clusters for individual research groups. Individually hosted clusters are slowly being phased out in favor of the new shared cluster strategy because the researcher retains most of the benefits of an individual cluster, with the added benefit of accessing more resources than those contributed. The shared approach also lets IDRE make better use of limited data center space and of the personnel required to maintain the cluster. Requests by researchers that require specialized services (software or operating conditions that preclude them from being included in the Research Virtual Shared Cluster) will be considered only for independent stand-alone clusters of 100 cores or more.

Shared Hoffman2 Cluster

The Shared Hoffman2 Cluster is logically organized into several component clusters that have been optimized for different research needs. The shaded areas of the diagram represent the Research Virtual Shared Cluster. This cluster is made up of Contributed cores purchased by individual research groups and Base cores purchased by IDRE to augment the Contributed cores. One benefit of contributing cores to the shared cluster is that a research group is guaranteed use of the number of cores contributed, with the ability to use surplus cores from the entire Hoffman2 Cluster. Other benefits provided to research groups when they join the shared cluster include:

  1. Complete system administration for contributed cores,
  2. Use of a high performance network interconnect,
  3. Home and scratch storage space,
  4. A dedicated data center facility for housing the cluster. This eliminates the need to perform expensive space, cooling, and electrical modifications to existing office or lab space.

With IDRE, all participating researchers in the shared cluster program have a voice in operational and policy decisions for this cluster.

 

Diagram of the Shared Hoffman2 Cluster

 

Research groups who have contributed cores to the Research Virtual Shared Cluster also have access to the General Purpose and Application Clusters. This gives them:

  1. Access to pooled licenses, allowing researchers to run larger commercial applications without the cost of buying additional licenses,
  2. Access to additional commercial and open source applications,
  3. The ability to run massively parallel simulations that can use all the cores of the Hoffman2 Cluster,
  4. Web access to the Hoffman2 Cluster through the UCLA Grid Portal,
  5. The capability to run massively parallel simulations for applications that can take advantage of the InfiniBand network.

Base and Contributed Equipment Standards and Policies

All contributed hardware must be compatible with the Base core architecture, processor type and speed, memory, disk space, and interconnect. This compatibility lets IDRE manage the Hoffman2 Cluster effectively and provide the highest level of computing services to shared cluster customers. IDRE provides full support in helping researchers specify and purchase cores that meet these standards at optimal price/performance.

Once contributed, these cores become part of the entire Hoffman2 Cluster and are no longer physically linked to a given research group. Because cycles are pooled across all Base and Contributed cores, which may be in use by others, a number of cores equal to those contributed is made available within 24 hours of a request. In practice, the number of cores contributed by a research group is generally available much sooner. Jobs that run on the Research Virtual Shared Cluster have a 14-day upper limit (with appropriate notification, longer runs may be accommodated).
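The availability rules above can be expressed as a small check. The following is a minimal Python sketch, not IDRE's actual scheduler logic: the names `JobRequest` and `classify_request` are hypothetical, and only the 24-hour guarantee and the 14-day limit are taken from the text.

```python
from dataclasses import dataclass
from typing import List

GUARANTEE_HOURS = 24    # contributed cores are made available within this window
MAX_RUNTIME_DAYS = 14   # upper limit for jobs on the Research Virtual Shared Cluster

GUARANTEED = "guaranteed within 24 hours"
SURPLUS = "uses surplus cores: no guaranteed start time"
LONG_RUN = "exceeds 14-day limit: requires advance notification"


@dataclass
class JobRequest:
    """Hypothetical description of a job, used only for illustration."""
    cores: int
    runtime_days: float


def classify_request(job: JobRequest, contributed_cores: int) -> List[str]:
    """Return the policy notes that apply to a job from a contributing group."""
    notes = []
    if job.cores <= contributed_cores:
        # Within the group's contribution, so the 24-hour guarantee applies.
        notes.append(GUARANTEED)
    else:
        # Beyond the contribution, the job depends on unused cores elsewhere.
        notes.append(SURPLUS)
    if job.runtime_days > MAX_RUNTIME_DAYS:
        notes.append(LONG_RUN)
    return notes


if __name__ == "__main__":
    # A group that contributed 64 cores asks for 128 cores for 20 days.
    for note in classify_request(JobRequest(cores=128, runtime_days=20), 64):
        print(note)
```

A request at or below the contributed core count falls under the guarantee; anything larger relies on surplus capacity, for which the text gives no fixed figure.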

While it is hard to give an exact number of additional cores available, in practice there are unused cores that can be made available within a reasonable period of time for researchers who require cores in addition to those contributed.

With advance agreement, a very large job that requires a large segment of the entire shared cluster (those cores connected through the InfiniBand network) can be accommodated, depending on current cluster usage and the consent of affected research groups.

Research Virtual Shared Cluster Hosting Costs

Research groups that contribute cores to the Hoffman2 Cluster agree to contribute their unused cycles to other researchers. They can regain full use of their Contributed cores within 24 hours of submitting a job.

Users of the Research Virtual Shared Cluster have the option of paying a one-time, per-terabyte charge for storage on the BlueArc storage system. This is a particularly important option for those who need more than the 10 GB of directory space per user that is standard on the Research Virtual Shared Cluster, or who want more permanent space for large data sets to avoid recurring upload and transfer times.

The current price for storage, which covers space, administration, and backup, is $3,000 per terabyte.
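As a rough illustration of the one-time charge, the sketch below estimates the cost of storing a data set of a given size. It assumes that storage is purchased in whole terabytes and that 1 terabyte equals 1,000 GB; neither assumption is stated in the text, and the function name `storage_cost_usd` is hypothetical.

```python
import math

PRICE_PER_TB_USD = 3000  # one-time charge per terabyte, as quoted in the text
GB_PER_TB = 1000         # assumption: decimal terabytes


def storage_cost_usd(dataset_gb: float) -> int:
    """Estimate the one-time cost of storing `dataset_gb` gigabytes.

    Assumption: space is bought in whole terabytes, so the size is rounded up.
    """
    if dataset_gb < 0:
        raise ValueError("dataset size cannot be negative")
    terabytes = math.ceil(dataset_gb / GB_PER_TB)
    return terabytes * PRICE_PER_TB_USD


if __name__ == "__main__":
    # A 2,500 GB data set rounds up to 3 terabytes.
    print(storage_cost_usd(2500))
```

If fractional terabytes can in fact be purchased, the rounding step would be dropped and the cost would scale linearly with size.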

Base and Contributed Equipment Renewals

After a period of three years, all hardware within the shared cluster is evaluated for retention based on the condition of the equipment, the cost to maintain it, its relative compute power, and the ability to backfill with new systems. This is done to maintain a high-performance, low-maintenance system while maximizing the utilization of data center space.

If the contributed cores can still be effectively maintained, those cores will remain inside the Hoffman2 Cluster and continue to be reevaluated on an annual basis. If the contributed cores can no longer be effectively maintained, upon mutual agreement, they will be redeployed for other uses or decommissioned.
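The renewal policy reduces to a short decision rule, sketched below in Python. This is an illustration only: the function name `renewal_action` and the single `maintainable` flag are hypothetical simplifications of the evaluation criteria (equipment condition, maintenance cost, relative compute power, and backfill options).

```python
FIRST_REVIEW_YEARS = 3  # hardware is first evaluated after three years


def renewal_action(age_years: float, maintainable: bool) -> str:
    """Return the policy outcome for contributed cores of a given age."""
    if age_years < FIRST_REVIEW_YEARS:
        # Hardware younger than three years has not yet been evaluated.
        return "in service: first evaluation at three years"
    if maintainable:
        # Retained hardware is reviewed again every year.
        return "retain: reevaluate annually"
    # Removal happens only upon mutual agreement with the research group.
    return "redeploy or decommission upon mutual agreement"


if __name__ == "__main__":
    print(renewal_action(age_years=4, maintainable=True))
```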

The Campus General Purpose and Application Clusters

UCLA faculty (and their students) who have not contributed cores run parallel jobs on the General Purpose Cluster, and serial jobs and commercial applications on the Application Cluster.

The Campus General Purpose Cluster is the part of the Hoffman2 Cluster System provided as a high performance computing resource for the entire UCLA campus. It is used to run only parallel applications, whether commercial or user-written. It is available to UCLA students and faculty who:

  • Run primarily parallel commercial applications and/or user-written, discipline-specific parallel applications,
  • Have low-level or sporadic usage, and
  • Require a specific application, compiler, or visualization tool available only on the General Purpose or Application Clusters.

Similarly, the Application Cluster is available for running serial jobs and commercial applications, both serial and parallel.

The Shared Hoffman2 Hardware and Software

The Hoffman2 Cluster has 64-bit nodes with an Ethernet interconnect and the following standard software suite:

  • Scheduler
  • Compilers: GCC and the best-performing compiler for C, C++, and Fortran 77, 90, and 95 on the current Shared Cluster architecture
  • Applications and Libraries in the Basic Software Suite

Certain applications are provided for a base level of cluster usability. Every effort is made to maximize application usage to the extent permitted under license agreements. Where possible, software is provided that would not make sense for an individual research group to purchase on its own.

In addition to the Base and Contributed cores, the Hoffman2 Cluster includes the head nodes and the storage server. The Hoffman2 Cluster has both InfiniBand and gigabit Ethernet network switches and interconnects. The Ethernet interconnect is dedicated to traffic in and out of the storage system and to various administrative functions, and it also serves as the interconnect for the Application Cluster. To maintain maximum parallel performance, InfiniBand is used strictly for inter-node, MPI-type communication across the Research Virtual Shared Cluster and the General Purpose Cluster.
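The division of traffic between the two networks can be summarized in a small lookup, sketched below in Python. The traffic and cluster labels (`"mpi"`, `"storage"`, `"admin"`, `"research_virtual_shared"`, and so on) are hypothetical names chosen for this illustration and do not correspond to any actual Hoffman2 configuration.

```python
ETHERNET = "gigabit Ethernet"
INFINIBAND = "InfiniBand"

# Clusters whose inter-node MPI traffic travels over InfiniBand, per the text.
INFINIBAND_CLUSTERS = {"research_virtual_shared", "general_purpose"}


def interconnect_for(traffic: str, cluster: str) -> str:
    """Return the network that carries a given kind of traffic.

    traffic: "mpi", "storage", or "admin" (illustrative labels)
    cluster: "research_virtual_shared", "general_purpose", or "application"
    """
    if traffic == "mpi" and cluster in INFINIBAND_CLUSTERS:
        # InfiniBand is reserved strictly for inter-node MPI-type communication.
        return INFINIBAND
    # Storage and administrative traffic, and all Application Cluster traffic,
    # use the Ethernet interconnect.
    return ETHERNET


if __name__ == "__main__":
    print(interconnect_for("mpi", "general_purpose"))
    print(interconnect_for("storage", "general_purpose"))
```

Keeping storage and administrative traffic off InfiniBand is what preserves its bandwidth for parallel communication.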

Individually Hosted Clusters

At IDRE, individually hosted clusters are slowly being phased out in favor of the shared cluster approach. Requests by researchers that require specialized services (software or operating conditions that preclude them from being included in the Research Virtual Shared Cluster) will be considered only for independent stand-alone clusters of 100 cores or more. There is an ongoing recharge for individually hosted clusters.