Cluster Hosting

IDRE Cluster Hosting Program

Working under the auspices of IDRE, ATS hosts computational clusters belonging to UCLA entities in its data centers. The cluster hosting program provides hosting services to campus researchers in a way that effectively manages the limited high-end data center space on campus. It maximizes the number of supported customers and minimizes the labor needed to support a given cluster, while providing a rich and robust set of hardware, software, application and support services. The program balances the needs of research teams that require many closely coupled nodes with those of teams that need support in housing and maintaining smaller clusters.

View or download the IDRE Cluster Hosting Program Document.

Shared Cluster Components

The cluster hosting program consists of:

A Private Cluster System
Hosts larger clusters, consisting of 100 or more nodes, that require specialized services (software or operating conditions that preclude them from being included in the Shared Cluster System) as independent stand-alone clusters.

A Shared Cluster System
All other clusters are combined into a single shared cluster, the Hoffman2 Cluster. The shared Hoffman2 cluster consists of:
Research Group Virtual Clusters
Groups of compute nodes that have been purchased by individual researchers or research groups. Nodes must meet the standards for inclusion in the shared cluster. Once funded, these nodes become part of the Hoffman2 Cluster and are no longer physically linked to a given research group. However, the resources represented by the purchased nodes remain available to the research group.

Because unused cycles are pooled and may be in use by others, nodes provided by research groups are available within 24 hours of being requested. Note that 24 hours is a maximum wait period; nodes could be available sooner depending on currently queued and running jobs. Individual research groups can run jobs for as long as 14 days, provided that they do not use more than the number of nodes they contributed at any given time. Research groups can also run on more nodes than they contributed, but jobs running on these extended nodes are limited to 24 hours.
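The runtime limits described above can be illustrated with a short Python sketch. This is not the scheduler's actual configuration; the function name and the way nodes are counted (nodes already in use by the group plus nodes requested by the new job) are assumptions made for illustration only.

```python
# Illustrative sketch only: names and the node-counting rule are assumptions,
# not the Hoffman2 scheduler's real configuration.

MAX_HOURS_WITHIN_CONTRIBUTION = 14 * 24  # 14 days, stated in the policy
MAX_HOURS_ON_EXTENDED_NODES = 24         # 24 hours, stated in the policy


def max_runtime_hours(contributed_nodes: int, nodes_in_use: int, nodes_requested: int) -> int:
    """Return the runtime limit, in hours, for a new job from a research group.

    Assumes a job counts as "within contribution" when the group's total
    node usage, including the new job, does not exceed what it contributed.
    """
    if contributed_nodes < 0 or nodes_in_use < 0 or nodes_requested <= 0:
        raise ValueError("node counts must be non-negative and the request positive")
    if nodes_in_use + nodes_requested <= contributed_nodes:
        return MAX_HOURS_WITHIN_CONTRIBUTION
    return MAX_HOURS_ON_EXTENDED_NODES


if __name__ == "__main__":
    # A group that contributed 16 nodes and is already using 8 of them.
    print(max_runtime_hours(16, 8, 8))  # stays within the contribution
    print(max_runtime_hours(16, 8, 9))  # spills onto extended nodes
```

Under these assumptions, the first call falls under the 14-day limit and the second under the 24-hour limit.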

Base Shared Cluster
The nodes in the Base Shared Cluster augment the nodes contributed by research groups. These nodes are available to those research groups which have provided equipment under the program. Jobs run on extended nodes are limited to 24 hours.

The Campus General Purpose Cluster
The Campus Cluster is the part of the Hoffman2 Cluster System that is maintained by ATS as an HPC resource for the UCLA campus. It is intended to be used by:
  • Massively parallel simulations that need additional nodes.
  • Students and faculty members who do not have accounts on any other campus clusters and are not part of a research group which has contributed nodes to the Hoffman2 Cluster.
  • Those with low-level or sporadic usage requirements.
  • Users that need a specific application, compiler, or visualization tool available on the Hoffman2 Cluster.
The Application Cluster
In addition to the Shared Cluster, a separate Application Cluster is available for serial jobs and commercial applications, both serial and parallel. The goal is to reserve the Campus General Purpose Cluster for parallel jobs, especially user-written and non-commercial, discipline-specific parallel applications that can take advantage of the InfiniBand network available on the Shared Cluster System. The Application Cluster consists of 64-bit nodes with an Ethernet interconnect. See the Standard Software Suite section below for the applications available.

Hardware Provided by ATS

ATS provides the head node, the storage server, the network switches, the base shared cluster nodes and the campus general purpose cluster.

A high-performance, fault-tolerant storage server provides storage for the Hoffman2 Cluster. Users of the Hoffman2 Cluster pay a one-time, per-terabyte fee for storage. This fee pays for storage space, administration of the storage system and backup services for a mutually agreed-upon amount of data.

The Hoffman2 Cluster uses both an InfiniBand and a gigabit Ethernet interconnect. The Ethernet interconnect is dedicated to traffic in and out of the storage system as well as various administrative functions. InfiniBand is used strictly for inter-node, MPI-type communication. Separating the two kinds of traffic in this way maximizes performance within the cluster.

Equipment Standards and Obsolescence

All hardware must be compatible with the Base Shared Cluster architecture. Compute nodes must conform to the minimum standards, including architecture, processor type and speed, memory, disk space and interconnect. This allows the Hoffman2 Cluster to be managed effectively and provides the highest level of computing services to shared cluster customers.

After a period of three years, all hardware within the shared cluster will be evaluated for retention based on condition of equipment, cost to maintain, relative compute power and the ability to backfill with new systems. This is done to maintain a high-performance, low-maintenance system and to use scarce data center space efficiently.

If it is deemed cost-effective to maintain the equipment, it will remain inside the Hoffman2 Cluster and continue to be reevaluated on an annual basis. If it is not retained in the Hoffman2 Cluster, it will either be redeployed for other uses or, if it is not suitable for redeployment, ATS will make arrangements with the equipment owners for its disposal.
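As a rough illustration of the evaluation schedule described above, the following Python sketch lists the years in which a piece of equipment would be reviewed. It assumes the equipment is retained at every review and works in whole calendar years; the function name and parameters are hypothetical.

```python
# Illustrative sketch only: assumes the equipment passes every review
# and that reviews are tracked by calendar year.

FIRST_REVIEW_AFTER_YEARS = 3  # first evaluation after three years, per the policy


def review_years(install_year: int, last_year: int) -> list[int]:
    """Return the review years from the first evaluation through last_year.

    The first review falls three years after installation; later reviews
    are annual, as long as the equipment remains in the cluster.
    """
    first_review = install_year + FIRST_REVIEW_AFTER_YEARS
    return list(range(first_review, last_year + 1))


if __name__ == "__main__":
    print(review_years(2010, 2015))
```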

Standard Software Suite Provided by ATS

Under the Shared Cluster Hosting Model, certain software and applications are provided by ATS for a base level of cluster usability. Every effort is made to maximize application usage to the extent permitted by license agreements. Where possible, given budgetary constraints, software is provided that would not make sense for an individual research group to purchase on its own.

The Standard Software Suite consists of the following:

  • Scheduler
  • Compilers: GCC and the best-performing compiler for C, C++, and Fortran 77, 90 and 95 on the current Shared Cluster architecture. Details on compilers are available here.
  • Applications and Libraries: A current set of applications and libraries in the Basic Software Suite is available here.

Large Shared Cluster Runs

By prior special agreement, a run using a large segment of the entire shared cluster may be possible, subject to prior commitments, current cluster usage and consent by affected research groups. The larger your request for additional nodes, the longer it will take to schedule your run. Note that extremely large runs may not be feasible, depending on shared cluster usage.

Shared Cluster Hosting Fee

Research groups that contribute nodes to the shared Hoffman2 Cluster must make their unused cycles available for other users' jobs. They can regain full use of the number of nodes that they contributed within 24 hours of submitting jobs to run on them.

Fee for Disk Storage

Users of the Shared Cluster System pay a one-time, per-terabyte fee for storage on the BlueArc Titan 2200. This fee pays for storage space, administration of the storage system and backup services for a mutually agreed-upon amount of data. The current price for storage is $3,000 per terabyte.
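For budgeting purposes, the one-time storage fee can be estimated with a short calculation. The Python sketch below uses the $3,000 per terabyte price stated above; the function name is hypothetical, and it assumes storage is purchased in whole terabytes, which this page does not specify.

```python
# Illustrative sketch only: assumes whole-terabyte purchases at the
# price quoted on this page.

PRICE_PER_TERABYTE_USD = 3000  # current price stated in the program description


def storage_fee_usd(terabytes: int) -> int:
    """Return the one-time storage fee in US dollars for a whole number of terabytes."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return terabytes * PRICE_PER_TERABYTE_USD


if __name__ == "__main__":
    # Example: a research group requesting 5 TB of storage.
    print(storage_fee_usd(5))
```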