Frequently Asked Questions about the Hoffman2 Cluster

Questions in this FAQ:

  1. Which password do I use to login?
  2. I'm still having password problems.
  3. My program writes a lot of scratch files in my home directory. This results in exceeding my disk space quota. What is the solution?
  4. How do I transfer my files from the Hoffman2 Cluster to my machine?
  5. Is there a simpler way to copy all my files to my new Hoffman2 account?
  6. An ATS consultant sent me an email about a lot of leftover jobs running under my login ID. How do I delete them?
  7. I have a lot of jobs in error state E. How do I find out what the problem is?
  8. How do I print my output?
  9. What queues can I run my jobs in?
  10. When will my job run?
  11. What is my disk storage quota and usage?
  12. Re-compiling for CentOS 6?

Questions and Answers

Which password do I use to login?

As a user of an ATS-Hosted Cluster, you will have the following passwords:

  • For each cluster you can access you will have a separate login ID and password.
  • You will have a single username and password that you can use to login to both the UCLA Grid Portal and the UC Grid Portal.

Your cluster login IDs and passwords are independent of each other and of your grid portal username/password. For example, when you change your password on one of the ATS-Hosted Clusters, it changes on that cluster and that cluster only. Your passwords on the clusters can be, and probably are, different. There is only one grid portal password which is used by both the UCLA Grid Portal and the UC Grid Portal. If you request that the password you use for one of the grid portals be changed, you will have to use your new password when you login to either grid portal.

In addition to these passwords, everyone affiliated with UCLA has a UCLA Logon ID and Password. You are sometimes asked to authenticate with your UCLA Logon ID and Password when requesting services via the web, even from ATS web sites. The UCLA Logon ID and Password are independent of any login ID/password or username/password combinations that ATS has issued to you.

I'm still having password problems

Please see How to Change your Cluster Password. If that doesn't fix it, please send email to accounts @ ats.ucla.edu

My program writes a lot of scratch files in my home directory. This results in exceeding my disk space quota. What is the solution?

There are several things you can do:

  • If you are a member of a research group which has contributed nodes to the Hoffman2 Cluster, your PI can purchase additional disk space for use by the members of your group.
  • Each process in your parallel program can write to the local /work on the node it is running on. When the program finishes, you can copy the files off to a place where you have more space. Since /work is local to the nodes, using it is very efficient.
  • You can write to /u/scratch and you have 7 days after the job completes to copy the files somewhere else.
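
As an illustration of the last two options, a job script could collect its scratch output at the end of a run along these lines. This is only a sketch: the function name stage_out and the directories in the usage note are assumptions, not part of the cluster's tooling.

```shell
# Copy everything from a scratch directory to a destination with more
# space, then remove the scratch copy only if the copy succeeded.
stage_out() {
    scratch_dir=$1   # e.g. a directory under /work or /u/scratch
    dest_dir=$2      # a place where you have more space
    mkdir -p "$dest_dir" &&
    cp -rp "$scratch_dir"/. "$dest_dir"/ &&
    rm -rf "$scratch_dir"
}
```

A call such as stage_out /u/scratch/myrun "$HOME/results/myrun" (both paths hypothetical) would go at the end of the job script, after the program has finished.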

How do I transfer my files from the Hoffman2 Cluster to my machine?

If the size of an individual file does not exceed 100 MB, you can download it to your local machine, or transfer it to another cluster that you can access at UCLA from the UCLA Grid Portal.

For any size file, you can use the scp command to transfer a file or directory from one machine or system to another. For safety reasons, as outlined in the Security Policy for ATS-Hosted Clusters, always issue the scp command from your local machine, whether you are copying files to the ATS-Hosted cluster or from it. NEVER run scp on the ATS-Hosted cluster to connect back to your local machine.
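
As a sketch of what this looks like in practice, both directions of transfer can be started from the local machine. The function names h2_get and h2_put are hypothetical; the host name is the one used elsewhere in this FAQ.

```shell
# Run these on your LOCAL machine, never on the cluster.
H2_HOST=hoffman2.idre.ucla.edu

# Download a file or directory from Hoffman2.
# $1 = loginid, $2 = path on Hoffman2, $3 = local destination (default: .)
h2_get() {
    scp -r "$1@$H2_HOST:$2" "${3:-.}"
}

# Upload a local file or directory to Hoffman2.
# $1 = local path, $2 = loginid, $3 = destination path on Hoffman2
h2_put() {
    scp -r "$1" "$2@$H2_HOST:$3"
}
```

For example, h2_get loginid results/output.dat would copy results/output.dat from your Hoffman2 home directory into the current local directory (the file name is a placeholder).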

Is there a simpler way to copy all my files to my new Hoffman2 account?

Once you have been notified that your login ID has been added to the Hoffman2 Cluster, login to your local machine and from your local machine's home directory enter the command:

tar -clpzf - * | ssh loginid@hoffman2.idre.ucla.edu tar -xpzf -

Replace loginid with your Hoffman2 Cluster loginid.

Note that this transfer will not copy any of the hidden (dot) files from your local home directory to your new home directory on the Hoffman2 Cluster. Since many of the dot files in your home directory are operating system version specific, it would not be appropriate or useful to transfer these files.
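
To check in advance what the * in the command above will pick up, you can build the same archive locally and list it instead of sending it. This is a sketch under the assumption that your local tar accepts the usual -c, -t and -f options; the function name preview_transfer is hypothetical.

```shell
# List what `tar ... *` would pick up from the current directory.
# The shell expands * without top-level dot files, so they never reach
# tar; dot files inside subdirectories are still included.
preview_transfer() {
    tar -cf - * | tar -tf -
}
```

Run it from your local home directory. It reads every file it lists, so it can take a while on a large home directory.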

An ATS consultant sent me an email about a lot of leftover jobs running under my login ID. How do I delete them?

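You can get the process IDs with the ps command, filter them with the grep command to select only the jobs you want to delete, and feed the result to the kill command. The first command below only lists the matching process IDs so that you can check them; the second one kills them.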

ps -u loginid | grep myjob | awk '{print $1}' | xargs
ps -u loginid | grep myjob | awk '{print $1}' | xargs kill

Replace loginid with your loginid and myjob with the executable name.
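
Because grep matches myjob anywhere in the line, the pipeline above can also select processes whose names merely contain that string (for example myjob2). A stricter variant, sketched below, compares the command column exactly. It assumes the default four-column output of ps -u (PID, TTY, TIME, CMD); the function name pids_for is hypothetical.

```shell
# Read `ps -u loginid` output on stdin and print the PIDs whose command
# name (4th column) is exactly the name given as $1. The header line
# is skipped.
pids_for() {
    awk -v name="$1" 'NR > 1 && $4 == name { print $1 }'
}
```

Use ps -u loginid | pids_for myjob to preview the process IDs, and append | xargs kill once the list looks right.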

I have a lot of jobs in error state E. How do I find out what the problem is?

When the myjobs script or qstat -u loginid shows you have jobs in an error state ("E", "Eqw", etc.) you can use the error_reason script to show you why. It will print the error reason line from qstat -j jobid output for all of your jobs that are in an error state.

error_reason -u loginid

Replace loginid with your loginid.

How do I print my output?

There is no printer directly associated with the Hoffman2 Cluster. If you have a printer attached to your local desktop machine, you can copy your file to your local machine and print your file locally. Recall that for security reasons you should issue the scp command from your local machine, and not from the Hoffman2 command line.

Here is a little script that you could save on a Unix/Linux machine that might make printing a text file easier. You might name this script h2print:

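#!/bin/sh
# Copy one file from Hoffman2 to the current directory, then print the local copy.
scp "loginid@hoffman2.idre.ucla.edu:$1" .
lpr "$(basename "$1")"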

where loginid is your Hoffman2 Cluster login ID. You can omit loginid@ if your userid on your local machine is the same as your Hoffman2 Cluster login ID. Note the period (.) at the end of the scp command line. Mark the script as executable with the chmod command:

chmod +x h2print

To print a Hoffman2 text file in your home directory, from your local machine's command prompt, enter:

h2print hoffman2_filename

where hoffman2_filename is the name of your text file on the Hoffman2 Cluster that you want to print.

The scp command will prompt you for your Hoffman2 Cluster password, unless you have previously set up an RSA key pair on your local machine with the ssh-keygen -t rsa command, and appended a copy of the public key (id_rsa.pub) to ~/.ssh/authorized_keys on your Hoffman2 Cluster account.
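
The key setup described above could be done from the local machine roughly as follows. This is a sketch, not an official procedure: the function name h2_install_key is hypothetical, and the mkdir and chmod steps are assumptions based on common OpenSSH practice rather than something this FAQ prescribes.

```shell
# Append a local public key to ~/.ssh/authorized_keys on Hoffman2.
# $1 = loginid, $2 = public key file (default: ~/.ssh/id_rsa.pub)
# Generate the key pair first with: ssh-keygen -t rsa
h2_install_key() {
    loginid=$1
    pubkey=${2:-$HOME/.ssh/id_rsa.pub}
    # The key is sent on stdin and appended by cat on the remote side.
    ssh "$loginid@hoffman2.idre.ucla.edu" \
        'mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys' \
        < "$pubkey"
}
```

You will be asked for your Hoffman2 Cluster password one last time when running h2_install_key loginid; later scp and ssh commands from the same local account should then use the key.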

What queues can I run my jobs in?

The qquota command reports which of the resources available to your userid are in use at the moment it is run. It is not meant to provide a complete list of the resources available to your userid: if no resources are in use at that moment, qquota returns no information.

For example:

resource quota rule limit                filter
--------------------------------------------------------------------------------
rulset1/10         slots=123/256        users @campus hosts @idre-amd_01g

"slots=123/256" means that your group is currently using 123 slots (cores) out of its total allocation of 256. Enter man qquota at the shell prompt for more information.

The show_slots script will list the number of available and used slots for each queue or type of job (interactive, parallel 24 hours, parallel 14 days, serial 24 hours, serial 14 days, etc.). The queues are grouped by data center. Example:

IDRE             Available    Used 759, Total 2024
  interactive         1177       8
  parallel 24hours    1081     469
  parallel 14days      881      98
  serial 24hours      1081      66
  serial 14days        889     118

MSA              Available    Used 1132, Total 1512
  interactive          340       0
  parallel 24hours     240     360
  parallel 14days      148     536
  serial 24hours       140       8
  serial 14days        140     228

Not all of the slots shown as available may be usable by your jobs. Use the qquota command to see your group's used/total allocation. Note that the total number of slots in a data center also includes those which are disabled, in an alarm state, or otherwise not ready to accept jobs.
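
The used/total figures in qquota output can be turned into a short summary with a small filter. The sketch below assumes the slots=USED/TOTAL field format shown in the qquota example earlier in this answer; the function name slots_summary is hypothetical.

```shell
# Read qquota output on stdin and report used, total and free slots
# for every rule line that carries a slots=USED/TOTAL field.
slots_summary() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^slots=[0-9]+\/[0-9]+$/) {
                # Drop the leading "slots=" and split USED/TOTAL.
                split(substr($i, 7), n, "/")
                printf "%s: %d of %d slots in use, %d free\n", $1, n[1], n[2], n[2] - n[1]
            }
    }'
}
```

Run it as qquota | slots_summary. For the sample qquota line in this answer it prints rulset1/10: 123 of 256 slots in use, 133 free.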

When will my job run?

The qstat command will list all the jobs which are running (r) or waiting to run (qw), in order by priority ("prior" column). If all jobs requested the same resources, this would also be the order in which they start running. In reality, some jobs request more nodes or a longer run time than is presently available, so SGE will "back-fill": it tries to start jobs further down the list that require fewer resources and will complete without delaying the start of a job higher in the list.

If you are in a research group which has purchased nodes for the Hoffman2 Cluster, you can use the highp complex to request that your job run on your group's highp resources. It is guaranteed that some job submitted by someone in your research group will start within 24 hours. To see where your highp job is with respect to the waiting jobs that everyone else in your group has submitted, you can use the groupjobs script. It will display a list of pending jobs, or pending and running jobs, similar to regular qstat output but only for everyone in your SGE group. The job at the top of the list will in most cases start running before those later in the list. For help and a list of options, enter groupjobs -h

What is my disk storage quota and usage?

From the UCLA Grid Portal, you can use its "Disk Usage on Hoffman2" application. Click:

Job Services
Applications
Disk Usage on Hoffman2
Submit Job button

You do not have to make any changes on the application form in order for it to report on your home directory usage. View your job results as usual. Click:

Job Services
Job Status

After your job has completed and its status is Done, click the Stdout link in the Output column for your job. Your request runs as a job on Hoffman2 and will send you standard Sun Grid Engine job status email.

From the Hoffman2 Cluster login nodes, at the shell prompt, enter:

myquota

The myquota command will report the usage and quota for filesystems where your userid has saved files, including /u/scratch as well as your home directory. Use the myquota command instead of the quota command. The myquota command supports the BlueArc storage system used by the Hoffman2 Cluster.

Re-compiling for CentOS 6?

The new OS includes a new version of the GNU compiler (gcc v. 4.4.4) and of Python (v. 2.6.5). Accordingly, any executable built against, or depending in any way on, gcc or Python may need to be recompiled. Our default compiler is Intel, but if you depend on gcc, be aware that we now support only version 4.4.4 (and the openmpi libraries version 1.4.4 built with this compiler). We also now support only Python version 2.6.5, and most of the third-party extension packages are being recompiled accordingly. If you need a specific Python module that is not present, let us know. We have attempted to keep the system as close as possible to what it was; however, you can expect some library dependencies to be broken, as most libraries have changed substantially in this new OS version.
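
One way to spot broken library dependencies in an existing executable is the ldd command, which lists the shared libraries a program needs and marks those it cannot resolve as "not found". A minimal sketch; the function name missing_libs and the program name in the usage note are placeholders.

```shell
# Print the shared libraries an executable needs but the loader
# cannot find. No output means every dependency was resolved.
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found'
}
```

For example, missing_libs ./myprog lists any unresolved libraries for myprog. If it prints anything, recompiling the program on the new OS is the likely fix. gcc --version and python -V show the versions currently in use.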


November 2011