# Storage¶

Announcement

• December 13, 2022: Clean up of temp-scratch must be completed by all researchers.
• December 16, 2022: Access to temp-scratch will be removed for all researchers.
Local Scratch Small files created during jobs. Must be deleted at the end of a job. Small All users $LOCAL_SCRATCH Fastest ## User Space¶ Each user has personal directories found at /home/$USER (or $HOME) and /data/user/$USER (or $USER_DATA). These two locations are meant to store general data long-term and can be used during active analysis or for archiving small projects. Traditionally, $HOME is meant to store scripts and supporting files and toolboxes such as Anaconda virtual environments or R packages while $USER_DATA can store datasets and results for a user's projects. The owner ($USER) of both directories can read, write/delete, and list files. No other users or groups have permissions to this directory. While it is possible to share files in your personal space with other users, a more practical solution may be to use a project directory instead.

A user is limited to 5 TB of data across both $HOME and $USER_DATA combined.

Note

The home and user data directories are mirrored across storage locations to allow for emergency backup in case some of the drives fail. This is not meant to be a long-term backup solution as any data deleted by a user is deleted on the main drive and the mirrored drive.

## Project Directory¶

Shared data can be stored in a /data/project/<project_name> directory. The default storage size for a new project is 50TB. If you need less than 5TB or your need for shared space is short term, please request a Sloss space instead. Project storage can be helpful for teams of researchers who need access to the same data.

All project spaces must be owned by a principal investigator (PI) who is an employee of UAB with a legitimate research interest. The PI takes responsibility for data in the space, as well as access control of all files and directories under the parent directory. As with all data on Cheaha, backups and archival services are not provided, and are the responsibility of the respective data owners.

The PI and all members with access to the project directory can read, write/delete, and list files within the top-level directory, and all other subdirectories by default. Other people on the system have no ability to access the project space. Access control for directories and files within the project space can be implemented via access control lists. Please see the bash commands setfacl and getfacl) for more information. Access control within the project directory are the responsibility of the project owner. However, we respect that access control lists can be tricky, so please feel free to contact us for assistance.

To create a project directory, or change access to or ownership of a project directory, the PI should follow the instructions at How Do I Request Or Change A Project Space?

### Sloss¶

A special location under /data/project/sloss to store projects that are at most 5 TB. In keeping with the name Sloss, these spaces are intended as a foundry for experimental or temporary project spaces that have potential to grow. Otherwise, they are treated like any other project space.

### Project Directory Permissions¶

Every project directory has a group that is unique system-wide, and not used anywhere else on the filesystem. The unique project group will be referred to as <grp> and generally has the same name as the top level project directory.

Note

Some early group names may not match their project directory, but should be reasonably close.

Members of the project directory group have permissions to access that project directory. Adding and removing members from the project directory group is how Research Computing controls access to, and ownership of, project directories. We do not use access control lists (ACLs) to manage permissions ourselves, but use of ACLs is allowed and encouraged for PIs and project administrators who want more fine-grained control. Please see our section on ACLs for more information.

Be default, project space permissions are set up in the following way:

Top level directory Newly created files Newly created directories
Numeric Permissions 2770 0664 2775
Symbolic Permissions drwxrws--- -rw-rw-r-- drwxrwsr-x
setgid enabled Yes Yes
Group owner <grp> <grp> <grp>

Having setgid enabled on directories means new files and directories created within will inherit group ownership and the setgid bit. The setgid bit is reflected by the 2 in the numeric permissions and the s in the symbolic permissions. The setgid bit and per-directory project groups is how Research Computing controls access to each project directory.

There are some known issues surrounding project directory permissions when files are put into the project directory. Different commands have different behaviors. The following list describes the behaviors of various commands used to move and copy data, as well as good practices.

• mv maintains all permissions and ownerships of the source file or directory.
• For files and directories created outside the project directory, avoid using mv, prefer cp or similar instead. See below for alternatives.
• For files and directories created within the project directory, mv may work, but be sure the file has correct permissions and group ownership.
• cp, tar -x, rsync, rclone, sftp and Globus all behave as though creating a new file at the target location, by default. Prefer these commands when it is sensible to do so.
• Avoid using the -p flag with cp, tar, rsync and sftp.
• When using the -p flag, files and directories will retain their source permissions.
• Retaining source permissions in project directories is undesirable behavior and can create headaches for you, your colleagues, and your project directory administrators and PIs.

• Please educate your staff and collaborators about the above permission setups, and any additional ACLs you may have in place, to minimize future challenges.
• If you have issues with permissions, please contact Support. We can guide you through Managing Permissions and Managing Group Ownership.

## Scratch¶

Two types of scratch space are provided for analyses currently being ran, network-mounted and local. These are spaces shared across users (though one user still cannot access another user's files without permission) and as such, data should be moved out of scratch when the analysis is finished.

Important

Starting January 2023, scratch data will have limited retention. See Scratch Retention Policy for more information.

### User Scratch¶

All users have access to a large, temporary, work-in-progress directory for storing data, called a scratch directory in /data/scratch/$USER or $USER_SCRATCH. Use this directory to store very large datasets or temporary pipeline intermediates for a short period of time while running your jobs. The maximum amount of data a single user can store in network scratch is 100 TB at once.

Network scratch is available on the login node and each compute node. This storage is a GPFS high performance file system providing roughly 1 PB of storage. If using scratch, this should be your jobs' primary working directory, unless the job would benefit from local scratch (see below).

Warning

Research Computing expects each user to keep their scratch areas clean. The cluster scratch areas are not to be used for archiving data. In order to keep scratch clear and usable for everyone, files older than 30 days will be automatically deleted.

### Local Scratch¶

Each compute node has a local scratch directory that is accessible via the variable $LOCAL_SCRATCH. If your job performs a lot of file I/O, the job should use $LOCAL_SCRATCH rather than $USER_SCRATCH to prevent bogging down the network scratch file system. It's important to recognize that most jobs run on the cluster do not fall under this category. Important $LOCAL_SCRATCH is only useful for jobs in which all processes run on the same compute node, so MPI jobs are not candidates for this solution. Use the #SBATCH --nodes=1 slurm directive to specify that all requested cores are on the same node.

## Temporary Files (tmp)¶

Please do not use the directory tmp as storage for temporary files. The tmp directory is local to each node, and a full tmp directory harms compute performance on that node for all users. Instead, please use $LOCAL_SCRATCH for fast access and $USER_SCRATCH for larger space.

Some software defaults to using tmp without any warning or documentation, especially software designed for personal computers. We may reach out to inform you if your software fills tmp, as it can harm performance on that compute node. If that happens we will work with you to identify ways of redirecting temporary storage to one of the scratch spaces.

### Software Known to Use tmp¶

The following software are known to use tmp by default, and can be worked around by using the listed flags.

• Java: java * -Djava.io.tmpdir=$LOCAL_SCRATCH • UMI Tools: umi_tools * --temp-dir=$LOCAL_SCRATCH

## How much space do I have left?¶

• Personal: use the command quota-report to see usage in /data/user/$USER and /scratch/$USER.
• Project: use the command proj-quota-report <project>. Replace <project> with the appropriate project directory name, i.e., /data/project/<project>. Be sure to not use a trailing slash. Use proj-quota-report mylab not proj-quota-report mylab/.

Both quota reports are updated nightly, so they may be out of date if you move data around before running these commands.

Tip

Running out of space? Can't afford to remove any data? Please consider using our Long Term Storage (LTS) system.

## Data Policies¶

### Replication¶

Data on GPFS are replicated using a RAID configuration which is intended to decrease the risk of data loss in the event of disk failures. This is standard good practice for researcher computing data centers.

### Backups¶

Important

Backups of data are the responsibility of researchers using Cheaha.

A good practice for backing up data is to use the 3-2-1 rule, as recommended by US-CERT:

• 3: Keep 3 copies of important data. 1 primary copy for use, 2 backup copies.
• 2: Store backup copies on 2 different media types to protect from media-specific hazards.
• 1: Store 1 backup copy offsite, located geographically distant from the primary copy.

What hazards can cause data loss?

• Accidental file deletion.
• Example: mistakenly deleting the wrong files when using the shell command rm.
• Files deleted with rm or any similar command can not be recovered by us under any circumstances.
• Please restore from a backup.
• Natural disasters.
• All of our data sits in one geographical location at the UAB Technology Innovation Center (TIC).
• Plans to add geographical data redundancy are being considered.
• Please restore from an offsite backup.
• Unusable backups.
• Examples: backup software bug; media destroyed; natural disaster at offsite location.
• Regularly test data restoration from all backups.

How can I ensure data integrity?

• Regularly back up your (and your lab's) data in an offsite location.
• S3 based long-term storage (LTS) can be used for short-term onsite backup.

### HIPAA Compliance¶

As of December 2019, Cheaha is HIPAA compliant and so PHI can be stored on it. Currently, long-term storage is NOT HIPAA compliant but will be in the future.

Important

It is the responsibility of the user to make sure PHI is only accessible by researchers on the IRB. If PHI is being stored in a project folder and some researchers are not on an IRB, their access to those files should be restricted using Access Control Lists (ACLs). If you need assistance setting up ACLs properly, please contact us.

### Scratch Retention Policy¶

Starting January 2023, data stored in /scratch will be subject to two limited retention policies.

• Each user will have a quota of 50 TB of scratch storage.
• Files will be retained for 30 days.

## Directory Permissions¶

Default file permissions are described for each directory above. Additional background on Linux file system permissions can be found here:

Last update: November 29, 2022