Using TACC

From Poldracklab Wiki

TACC (Texas Advanced Computing Center) provides high-performance computing to campus, and is the system on which we perform all of our main data analysis work. We primarily rely upon the new Lonestar cluster; however, for single jobs that require more than about 23 GB of RAM, you can also use the Ranger cluster, which has up to 32 GB of RAM available for a single process.

Getting an account on TACC

- Go to the TACC Portal Account Request Page, click through the various pages, and then fill out the form. For those of you affiliated with the IRC, it is listed under "Office of Research" in the UT Department menu.

- It will take about 24 hours for your account to be activated. Once it has been activated, email Russ to let him know and he will have your account linked to the Poldracklab resources.

- More information for new users is available on the TACC website.

Your TACC preferences

Once you ssh onto the poldracklab login node at Lonestar (ssh <user>@poldrack.lonestar.tacc.utexas.edu), you should first change your login shell to /bin/bash, which is the shell that we support for our lab. (If you insist on using csh, you will be on your own in terms of configuration.) Change the shell as follows:


login4$ /usr/local/bin/chsh
Changing shell for poldrack.
Password: <enter your password here>
New shell [/bin/tcsh]: /bin/bash
Shell changed.


Once you have done this, you should edit your ~/.profile_user file, adding the following line:

. /work/01329/poldrack/software_lonestar/profile_poldracklab

This will set all the paths you need for FSL, Freesurfer etc, as well as properly setting your umask (which by default is overly restrictive).

You can also add aliases and other paths/preferences to your ~/.profile_user at your discretion.

NOTE: Whenever you make changes to your startup scripts, it is always a good idea to test them using a different terminal window before you log out of the window in which you are editing. That way, in the unlikely event that there is a problem with the startup file that prevents you from logging in, you can fix it and try again rather than being locked out.
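
As a quick extra safeguard, you can also ask bash to parse the file without running it. The sketch below is only a syntax check (it will not catch a bad path or a command that fails at runtime), so it complements the second-terminal test rather than replacing it. The helper name check_startup is made up for this example.

```shell
#!/bin/bash
# Sketch: syntax-check a startup file before logging out.
# check_startup is a hypothetical helper name, not a TACC or lab tool.
check_startup() {
    local file="$1"
    # bash -n parses the file without executing any of its commands
    if bash -n "$file" 2>/dev/null; then
        echo "OK: $file parses cleanly"
    else
        echo "SYNTAX ERROR in $file: fix it before logging out" >&2
        return 1
    fi
}

# Example: check_startup ~/.profile_user
```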

Some ground rules for using TACC

Using TACC is unlike most other systems that you may have used. All serious computing must be done using a grid computing model, rather than interactively. This takes some time to get used to, but once you learn to deal with it you will find that it greatly increases your productivity for large analysis jobs.

The cluster that we are currently using for our work is Lonestar. To access it, ssh to poldrack.lonestar.tacc.utexas.edu - this is a private login node that is only available to members of our lab. In general, you should try not to run any computationally intensive jobs on the login node. Instead, all jobs should be submitted to the grid using the launch script. There are two ways to do this, which depend on the kind of job you are running.

Individual commands

To run a single individual command, simply type

launch <command> -j <projectname>

This will submit the job to the grid; you can watch its status using the qstat command.

Multiple independent commands

If you have a large number of jobs to run that are independent of one another (e.g., motion correction on a large number of subjects), then you can use the launch command to run them in parallel:

launch -s <script to run>  -j <projectname>

The script to be run should contain a single command on each line. These commands MUST be independent of one another, since several of them will be executed in parallel.

In either case, the launch script creates an output file that contains any output from the job. It will be called launch.o<ID>, where <ID> is the job number of the grid job (unless you set the job name as described below, in which case it will be <jobname>.o<ID>).
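
For the multiple-command case, the command file is usually easiest to generate with a small loop. Here is a minimal sketch, assuming motion correction with FSL's mcflirt; the subject IDs, the bold.nii.gz file layout, and the helper name make_command_file are all made up for illustration.

```shell
#!/bin/bash
# Sketch: build a command file for "launch -s", one independent command per line.
# The subject IDs, file layout, and output file name are hypothetical.
make_command_file() {
    local outfile="$1"; shift
    : > "$outfile"                      # start with an empty file
    local subj
    for subj in "$@"; do
        # each line is a complete command that does not depend on any other line
        echo "mcflirt -in $subj/bold.nii.gz -out $subj/bold_mcf" >> "$outfile"
    done
}

# Example:
#   make_command_file mcflirt_cmds.txt sub001 sub002 sub003
#   launch -s mcflirt_cmds.txt -j <projectname>
```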

Options for launch script

There are several important options for the launch script that may be helpful for you.

-r RUNTIME - this specifies the maximum runtime for the job, in hours:mins:secs (e.g., 10:00:00 for 10 hours). The default is one hour; it is best to set this number as small as possible while still being long enough for your job to finish, because longer jobs will wait in the queue for longer before starting.

-p NCORES - this is the number of cores that will be used for the job. It is automatically estimated by the launch script; for a single command it is 12 (because it must be a multiple of 12), whereas for a script it is the number of commands in the script (unless it exceeds the maximum # of cores, which is currently 2052).

-e PARENV - this is the name of the parallel environment, which basically determines how many cores on each processor one wants to use. The default is 12way, which means that you will use all of the cores on the chip. The only time you need to modify this is when you have a job that needs more than 2 GB of memory; in that case, you should reduce the number of cores to 6way, 2way, or 1way (which will give you 4 GB, 12 GB, or 24 GB of RAM respectively, minus a small amount for the OS).

-n JOBNAME - this lets you provide a name for your job, which will be used for the naming of the automatically generated qsub script and output files.

-j PROJNAME - each user needs access to a project that provides an allocation of time on the cluster. You can find out which projects you are associated with by going to http://portal.tacc.utexas.edu and checking under allocations.

-q QUEUE - in general one should leave this as the default (which is "serial" for single jobs and "normal" for parallel jobs)

-m EMAIL - you can specify an email address here that will receive messages when your job starts and finishes.

-k - set this to keep the qsub file (which is deleted by default)
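
The memory figures given for -e follow from splitting roughly 24 GB across the cores in use (24 GB / 12 = 2 GB, 24 GB / 6 = 4 GB, and so on). The sketch below turns that arithmetic into a lookup. The helper name pick_parenv is made up, and the thresholds are the nominal figures from the text; since the OS takes a small share, round your estimate up if a job sits right at a boundary.

```shell
#!/bin/bash
# Sketch: choose a parallel environment from the memory (in whole GB) one process needs.
# pick_parenv is a hypothetical helper; the thresholds are the nominal values
# quoted above (12way=2 GB, 6way=4 GB, 2way=12 GB, 1way=24 GB).
pick_parenv() {
    local need_gb="$1"
    if   [ "$need_gb" -le 2 ];  then echo 12way
    elif [ "$need_gb" -le 4 ];  then echo 6way
    elif [ "$need_gb" -le 12 ]; then echo 2way
    elif [ "$need_gb" -le 24 ]; then echo 1way
    else
        echo "more than 24 GB requested: no setting listed above covers this" >&2
        return 1
    fi
}

# Example: a script whose commands each need about 3 GB, with a 2-hour limit
#   launch -s cmds.txt -r 02:00:00 -e "$(pick_parenv 3)" -n myjob -j <projectname>
```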

Overriding default settings

If there are settings that you would like to always be set, you can add them to ~/.launch_user, which will be read whenever you run the launch script. Each option should be on a separate line:

-m joe@gmail.com
-j myProjectName

Using the grid with data from corral

TACC has asked that we not run grid jobs using the Corral directories. Although Corral is mounted on Lonestar, it is not part of the parallel file system on Lonestar. To use data from Corral, you should cp or rsync (rsync -tvr has been suggested) the data you need to your $SCRATCH directory at the beginning of the job, and then copy the results back to Corral at the end of the job.

If the mount to Corral becomes temporarily unavailable while your executable is accessing data on Corral, SGE may not be able to kill the job once its time is up. Basically, when this happens you’ll get busted by TACC and asked not to do it again. Even if you catch the job that is hung up before they do, they’ll find out about it and email you.

Information about $SCRATCH

  • There is unlimited space for you to use
  • After 10 days, files in $SCRATCH become eligible for deletion, so you should back up your $SCRATCH files on Corral as often as possible
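
To see which files are getting close to the 10-day mark, something like the following can help. This is a sketch that assumes the purge is based on modification time (it may actually use access time or something else, so check with TACC); the helper name list_aging_files and the 7-day cutoff are made up for illustration.

```shell
#!/bin/bash
# Sketch: list files that have not been modified for more than N days.
# list_aging_files is a hypothetical helper; using modification time as the
# purge criterion is an assumption, not a documented TACC rule.
list_aging_files() {
    local dir="$1" days="$2"
    # -mtime +N matches files last modified more than N full days ago
    find "$dir" -type f -mtime +"$days"
}

# Example: list_aging_files "$SCRATCH" 7
```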

Easiest way to keep things organized

  • Try to design your analysis directories so that copying can be achieved by copying a single directory containing your entire analysis or subdirectories. For example, if you’ve run your first and second level FEAT analyses and are working on the group analysis, you only need the second levels. Since copying ALL of the data will take a very long time, you might benefit from putting your second levels in a SECOND_LEV directory, separate from the first levels. This would be a departure from how we’ve stored data in the past, since we typically had a subject directory with first levels and second levels.
  • Use rsync -tvr to copy files, as it will only update new files instead of copying and overwriting old files. There may be other flag combinations that work as well, so check the man page for rsync to verify that this will do what you want it to do.
  • You are supposed to use the login node for these copying jobs. Of course using the grid would be much faster, but is against the rules since we are not supposed to read or write to corral from the grid.
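
The copy step described above can be wrapped in a small function and run from the login node before submitting and again after the job finishes. This is a sketch only: stage_copy is a made-up helper name, and <corral_dir> in the example stands for wherever your data lives on Corral.

```shell
#!/bin/bash
# Sketch: copy a directory tree between Corral and $SCRATCH from the login node.
# stage_copy is a hypothetical helper; <corral_dir> below is a placeholder.
stage_copy() {
    local src="$1" dest="$2"
    mkdir -p "$dest"
    # -t keeps modification times, -v is verbose, -r recurses into subdirectories;
    # the trailing slash on src copies its contents into dest
    rsync -tvr "$src"/ "$dest"/
}

# Before submitting:     stage_copy <corral_dir>/SECOND_LEV "$SCRATCH/SECOND_LEV"
# After the job is done: stage_copy "$SCRATCH/SECOND_LEV" <corral_dir>/SECOND_LEV
```

Because -t preserves modification times, a repeated run can skip files whose size and timestamp already match, which is what makes re-running the copy cheap.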


Other things to consider

  • Let’s say you need to run a simple fslmaths command on 1000 4D NIFTI files. Copying actually may take longer than running the fslmaths command. So, it seems reasonable that if all of the copying is going to be taking up the login node, you may as well just run your script from the login node. Granted, this won’t be a job you could run in parallel, but neither is the copying. Please comment if you do not think this is the right way to go about this.
  • If you are running something on multiple files, always write a script that checks that all of your output files were created. Alternatively, you can grep your launch.o file for the word “Abort” to see if any jobs aborted. In one case, I submitted 1388 jobs and, for no apparent reason, only 1386 ran, even though plenty of time and plenty of nodes were allotted for the job. For a first-level FEAT analysis, I believe the ‘reg’ directory is one of the last things created, so checking whether it exists is a good way to troubleshoot first-level FEATs. For higher-level FEAT analyses, search for your thresholded stats (threshzstat1.nii.gz), as these are among the last files to be created.
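
A checking script along those lines can be very short. The sketch below prints every directory that lacks a given marker file or directory; the helper name missing_outputs and the example directory pattern are made up, and the markers are the ones suggested above.

```shell
#!/bin/bash
# Sketch: report analysis directories that lack an expected output.
# missing_outputs is a hypothetical helper. Pass the marker to look for
# ("reg" for first levels, "threshzstat1.nii.gz" for higher levels, as suggested above).
missing_outputs() {
    local marker="$1"; shift
    local dir status=0
    for dir in "$@"; do
        if [ ! -e "$dir/$marker" ]; then
            echo "$dir"        # print each directory with no marker
            status=1
        fi
    done
    return $status             # non-zero if anything was missing
}

# Example: missing_outputs reg sub*/model1.feat
# Example: grep -c Abort launch.o<ID>
```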

Visualization on TACC

VNC on the Lonestar Poldrack Login Node

VNC is available on Lonestar's Poldrack login node, but must be tunnelled over SSH and requires some setup. First make absolutely sure you have completed the instructions above titled Your TACC preferences.

You will then need to know your TACC VNC server number, which can be determined by running the "whichvnc" command on Workerbee. Increment your Workerbee VNC server number by 1 to get your TACC VNC server number - this is shown in the command output on Workerbee:

whichvnc
Your Workerbee VNC Server number is :22
Your TACC VNC Server number is :23

Now complete the following steps - remember that "23" is the TACC VNC server number in this example, but should be replaced with your own.

Login via SSH to the login node:

ssh username@poldrack.lonestar.tacc.utexas.edu

Define a VNC password by running:

vncpasswd

Create a file named ~/.vnc/servernum, containing your TACC VNC server number:

echo 23 > ~/.vnc/servernum

Initialize the file .bashrc:

cp $SWDIR/local/conf/.bashrc ~

Initialize the file .vnc/xstartup:

cp $SWDIR/local/conf/xstartup ~/.vnc

Next, on your local system, create a script called "taccvnc" that you will run when you want to use VNC on the login node. A starter version of the script can be found at: taccvnc.sh

Edit this script, changing the REMOTEUSER variable to your TACC username, and the REMOTEDISPLAY variable to your TACC VNC server number. To connect, perform the following steps:

  1. On your local system, run the taccvnc.sh script
  2. Provide your TACC password or passphrase when prompted
  3. On your local system, start VNC Viewer
  4. Connect to "localhost" using your TACC VNC server number, i.e. "localhost:23" in this example
  5. Enter your VNC password

At this point, the Linux desktop should be shown.
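
If you are curious what the tunnel amounts to, the sketch below shows the general idea. It is not the lab's taccvnc.sh: it only reuses the REMOTEUSER and REMOTEDISPLAY variable names mentioned above and relies on the usual VNC convention that display :N listens on TCP port 5900 + N. The exact ssh options are an assumption, and the script prints the command rather than running it.

```shell
#!/bin/bash
# Sketch: the kind of SSH tunnel needed for VNC display N.
# This is NOT the lab's taccvnc.sh; vnc_port is a hypothetical helper and the
# ssh options shown are an assumption.
vnc_port() {
    # VNC display :N conventionally listens on TCP port 5900 + N
    echo $((5900 + $1))
}

REMOTEUSER="username"        # your TACC username
REMOTEDISPLAY=23             # your TACC VNC server number
PORT=$(vnc_port "$REMOTEDISPLAY")

# Print the tunnel command instead of running it
echo "ssh -N -L ${PORT}:localhost:${PORT} ${REMOTEUSER}@poldrack.lonestar.tacc.utexas.edu"
```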

VNC on Spur

You can run a VNC session on spur, but there is a time limit as it sends the VNC server to run on a node to which you have some allocated time. It is useful if you have lots of visualization to do (to check FSL or Freesurfer output).

Instructions on how to run VNC on spur are here

The short version is as follows (tailored for mac users):

ssh to spur:

ssh <username>@spur.tacc.utexas.edu

If this is your first time connecting to spur, you must run

vncpasswd

Use a different password than your tacc login password. You only do this the first time you ssh to spur. Remember this password!

Launch a VNC server by running:

qsub /share/sge6.2/default/pe_scripts/job.vnc 

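I'd suggest making an alias for this if you plan on doing this often. Then watch the server's output file, which will show your display number once the VNC job has started: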

touch ~/vncserver.out
tail -f ~/vncserver.out 

Open a new terminal window and type

ssh -f -N -L 59xx:spur.tacc.utexas.edu:59xx <username>@spur.tacc.utexas.edu

where xx is the display number tail -f ~/vncserver.out gave you.

If you don't have it installed, download and install Chicken of the VNC.

Open Chicken of the VNC. In host put "localhost", in display put xx (the display number from the tail command), and enter your VNC password (not your TACC password).

Voila! You should be all set. If you x out of Chicken of the VNC, you should be able to reconnect to it and have everything as you left it, as long as it is within the time allocated to you for that node you're running the vnc server on (the default is 4 hours). Also, you have to remain ssh'ed into spur, so don't close that terminal window while using VNC.

Adding modules

You have to add certain modules in order to use them. So, for example:

module add R

will add the R module, so now you can use R. If you use R often, I suggest making an alias for this, or simply adding it to your startup file (~/.profile_user if you use bash as recommended above, or ~/.cshrc_user if you use csh).

A complete list of the modules on tacc can be found here

More about modules can be found here

Using R on TACC

Installing a new library on lonestar

To add a new library, use the install.packages command from the R prompt in an interactive R session. For example, if you want to install the pls library you would use:

install.packages('pls', repos="http://cran.us.r-project.org", dependencies=TRUE)

The "repos" argument tells R where to get the library, and the "dependencies" argument will install any additional libraries that the library you are installing depends upon. In my experience with Lonestar it isn't necessary to set a destination directory for the library files; the one it uses by default will be found by R later when loading libraries. After running the command, double-check that the installation worked by loading the library into R. So for the pls library you'd use:

library(pls)

If you get any error messages you'll need to try installing it again.

Running jobs in batch mode

In order to run R jobs in batch on the grid, you must first add the "module add R" command to your startup file (~/.profile_user for bash, or ~/.cshrc_user for csh) as described above. If you have a batch R script called rscript.R, in general the command for running an R job in batch mode is

R CMD BATCH --no-save rscript.R

The "--no-save" flag is especially important because it will prevent R from saving the workspace upon closing. There are two reasons why you would want to suppress this operation. First, it saves the workspace in a hidden file, ".RData", which may be very large and take up a lot of disk space, depending on what your script did. For example, if you created huge data matrices, these will be saved with all other variables in .RData, and since the file is hidden you may not realize you're using up a bunch of disk space. Second, if R is opened from that directory in the future it will reload all of the variables from that previous analysis. This can take a lot of time and, if you're using similar variable names in your new analysis, can lead to confusion and mistakes in your code (old variables may be used instead of the new).

To run your R job on the grid you would use

launch R CMD BATCH --no-save rscript.R

R and any libraries that are installed are compiled with the gcc compiler, but the default compiler on Lonestar is intel. For some libraries to load properly, the intel compiler must be swapped out for the gcc compiler. If you get an error message about a library not loading properly, add the flag -c gcc to your launch command to swap the compilers. R automatically creates an output file with the same name as your script, but with a .Rout extension. So, in this case I would look in rscript.Rout to see if there were any issues in loading any libraries, and then I would add the -c gcc flag if needed.
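
Checking the .Rout file can be scripted as well. The sketch below relies on R's usual batch-mode behaviour of printing a line beginning with "Error" and then "Execution halted" when a script stops early; the helper name rout_failed is made up, and a match only tells you that something failed, not that the compiler is the cause.

```shell
#!/bin/bash
# Sketch: detect a failed R batch run from its .Rout file.
# rout_failed is a hypothetical helper; it succeeds (exit 0) when the file
# contains a line starting with "Error" or the text "Execution halted".
rout_failed() {
    local rout="$1"
    grep -qE '^Error|Execution halted' "$rout"
}

# Example:
#   if rout_failed rscript.Rout; then
#       echo "R job failed; read rscript.Rout and consider adding -c gcc to launch"
#   fi
```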
