# Storage

**Announcement** (updated 2023-01-09)

Please note upcoming changes!

- January 13, 2023: `$USER_SCRATCH` will point to `/scratch/$USER`.
- January 27, 2023: Clean up of `temp-scratch` must be completed by all researchers. Access to `temp-scratch` will be removed for all researchers.
- February 13, 2023: New `$SCRATCH` storage limited-retention policies will start.
## What Type of Storage Do I Need?

There are multiple locations for data storage, both on and off Cheaha, each with a specific purpose. Use the table below to find the storage platform we provide that best matches your use case.

| Platform | Long-term Storage | User Data and Home Directories | Project Directories | Network Scratch | Local Scratch |
|---|---|---|---|---|---|
| Data Use-Case | Rarely changing or immutable files. Data not directly usable in Cheaha jobs. | General-purpose medium- to long-term storage. | Data to be shared among a single or multiple labs. | Temporary files created during analysis. Rolling file deletion over time. | Small files created during jobs. Must be deleted at the end of a job. |
| Default Quota | 75 TB | 5 TB | 25 TB | 100 TB | Small |
| Owner | Lab PI | Individual Researchers | Lab PI | Individual Researchers | Individual Researchers |
| Accessible From | Cheaha, Globus | Cheaha, Globus | Cheaha, Globus | Cheaha | Cheaha |
| Cheaha Path | No path | `/home/<blazerid>` (`$HOME`) and `/data/user/<blazerid>` (`$USER_DATA`) | `/data/project/<name>` | `/scratch/<blazerid>` (`$USER_SCRATCH`) | `/scratch/local` (`$LOCAL_SCRATCH`) |
| Read/Write (IO) Speed | Slow | Fast | Fast | Fast | Fastest |
| Policies | | | | Data deleted after 30 days. | Data deleted as needed. |
| How to Access | Contact Support | Comes with Cheaha account. | See Request Project Space | Comes with Cheaha account. | Comes with Cheaha account. |
## User Space

Each user has personal directories found at `/home/$USER` (or `$HOME`) and `/data/user/$USER` (or `$USER_DATA`). These two locations are meant to store general data long-term and can be used during active analysis or for archiving small projects. Traditionally, `$HOME` stores scripts, supporting files, and toolboxes such as Anaconda virtual environments or R packages, while `$USER_DATA` stores datasets and results for a user's projects. A sketch of this convention is shown below.

The owner (`$USER`) of both directories can read, write/delete, and list files. No other users or groups have permissions to these directories. While it is possible to share files in your personal space with other users, a more practical solution may be to use a project directory instead.

A user is limited to 5 TB of data across both `$HOME` and `$USER_DATA` combined.
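As a rough illustration of that convention (the environment directory and project name below are hypothetical, not required):

```bash
# Environments and scripts live in $HOME:
ls "$HOME/.conda/envs"                           # e.g., Anaconda environments

# Datasets and results live in $USER_DATA:
mkdir -p "$USER_DATA/my_project"/{data,results}  # project inputs and outputs
```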
**Note**

The home and user data directories are mirrored across storage locations to allow for emergency recovery in case some of the drives fail. This is not meant to be a long-term backup solution, as any data deleted by a user is deleted on both the main drive and the mirrored drive.
## Project Directory

Shared data can be stored in a `/data/project/<project_name>` directory. The default storage size for a new project is 50 TB. If you need less than 5 TB, or your need for shared space is short-term, please request a Sloss space instead. Project storage can be helpful for teams of researchers who need access to the same data.

All project spaces must be owned by a principal investigator (PI) who is an employee of UAB with a legitimate research interest. The PI takes responsibility for data in the space, as well as access control of all files and directories under the parent directory. As with all data on Cheaha, backups and archival services are not provided and are the responsibility of the respective data owners.

The PI and all members with access to the project directory can read, write/delete, and list files within the top-level directory and all subdirectories by default. Other people on the system have no ability to access the project space. Access control for directories and files within the project space can be implemented via access control lists; please see the bash commands `setfacl` and `getfacl` for more information. Access control within the project directory is the responsibility of the project owner. However, we respect that access control lists can be tricky, so please feel free to contact us for assistance. An illustrative example is shown below.
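As a sketch of how a PI might grant one collaborator read-only access to a single subdirectory (the project path and BlazerID here are hypothetical placeholders):

```bash
# Grant read-only access recursively; the capital X grants execute only on
# directories, so the collaborator can traverse them but not run files:
setfacl -R -m u:collaborator:rX /data/project/mylab/shared_data

# Also set a default ACL so newly created files inherit the same entry:
setfacl -R -d -m u:collaborator:rX /data/project/mylab/shared_data

# Verify the resulting ACL:
getfacl /data/project/mylab/shared_data
```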
To create a project directory, or change access to or ownership of a project directory, the PI should follow the instructions at How Do I Request Or Change A Project Space?
### Sloss

A special location under `/data/project/sloss` exists to store projects that are at most 5 TB. In keeping with the name Sloss, these spaces are intended as a foundry for experimental or temporary project spaces that have potential to grow. Otherwise, they are treated like any other project space.
### Project Directory Permissions

Every project directory has a group that is unique system-wide and not used anywhere else on the filesystem. The unique project group will be referred to as `<grp>` and generally has the same name as the top-level project directory.

**Note**

Some early group names may not match their project directory, but should be reasonably close.

Members of the project directory group have permissions to access that project directory. Adding and removing members from the project directory group is how Research Computing controls access to, and ownership of, project directories. We do not use access control lists (ACLs) to manage permissions ourselves, but use of ACLs is allowed and encouraged for PIs and project administrators who want more fine-grained control. Please see our section on ACLs for more information.

By default, project space permissions are set up in the following way:

| | Top-level directory | Newly created files | Newly created directories |
|---|---|---|---|
| Numeric Permissions | `2770` | `0664` | `2775` |
| Symbolic Permissions | `drwxrws---` | `-rw-rw-r--` | `drwxrwsr-x` |
| `setgid` enabled | Yes | No | Yes |
| User owner | PI/Admin | Creator | Creator |
| Group owner | `<grp>` | `<grp>` | `<grp>` |
Having `setgid` enabled on directories means new files and directories created within will inherit group ownership and the `setgid` bit. The `setgid` bit is reflected by the `2` in the numeric permissions and the `s` in the symbolic permissions. The `setgid` bit and per-directory project groups are how Research Computing controls access to each project directory. A brief illustration is shown below.
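To see the inheritance in action (the project name here is a placeholder):

```bash
# Create a directory inside a project space and inspect it:
mkdir /data/project/mylab/new_analysis
ls -ld /data/project/mylab/new_analysis
# Expected output resembles:
#   drwxrwsr-x 2 <blazerid> mylab 4096 Jan  9 10:00 new_analysis
# The 's' in the group position is the inherited setgid bit, and the group
# owner is the project group rather than your personal group.
```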
There are some known issues surrounding project directory permissions when files are put into the project directory. Different commands have different behaviors. The following list describes the behaviors of various commands used to move and copy data, as well as good practices.
- `mv` maintains all permissions and ownerships of the source file or directory.
    - For files and directories created outside the project directory, avoid using `mv`; prefer `cp` or similar instead. See below for alternatives.
    - For files and directories created within the project directory, `mv` may work, but be sure the file has correct permissions and group ownership.
- `cp`, `tar -x`, `rsync`, `rclone`, `sftp`, and Globus all behave as though creating a new file at the target location, by default. Prefer these commands when it is sensible to do so (see the sketch after this list).
    - Avoid using the `-p` flag with `cp`, `tar`, `rsync`, and `sftp`.
    - When using the `-p` flag, files and directories will retain their source permissions.
    - Retaining source permissions in project directories is undesirable behavior and can create headaches for you, your colleagues, and your project directory administrators and PIs.
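A minimal sketch of the difference between copying and moving into a project space (the file paths and project name are hypothetical):

```bash
# cp creates a new file at the destination, so it picks up the project
# group and default permissions via the setgid directory:
cp ~/results.csv /data/project/mylab/

# mv preserves the source's owner, group, and mode, which can lock out
# collaborators. If you must mv, repair group and permissions afterward:
mv ~/other.csv /data/project/mylab/
chgrp mylab /data/project/mylab/other.csv   # restore the project group
chmod 664 /data/project/mylab/other.csv     # match the default file mode
```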
For PIs and project administrators:
- Please educate your staff and collaborators about the above permission setups, and any additional ACLs you may have in place, to minimize future challenges.
- If you have issues with permissions, please contact Support. We can guide you through Managing Permissions and Managing Group Ownership.
## Scratch

Two types of scratch space are provided for analyses currently being run: network-mounted and local. These spaces are shared across users (though one user still cannot access another user's files without permission), and as such, data should be moved out of scratch when the analysis is finished.

**Important**

Starting January 2023, scratch data will have limited retention. See Scratch Retention Policy for more information.
### User Scratch

All users have access to a large, temporary, work-in-progress directory for storing data, called a scratch directory, at `/scratch/$USER` (or `$USER_SCRATCH`). Use this directory to store very large datasets or temporary pipeline intermediates for a short period of time while running your jobs. The maximum amount of data a single user can store in network scratch at once is 100 TB.

Network scratch is available on the login node and each compute node. This storage is a GPFS high-performance file system providing roughly 1 PB of storage. If using scratch, this should be your jobs' primary working directory, unless the job would benefit from local scratch (see below).
**Warning**

Research Computing expects each user to keep their scratch areas clean. The cluster scratch areas are not to be used for archiving data. To keep scratch clear and usable for everyone, files older than 30 days will be automatically deleted. A sketch for finding files at risk of deletion follows.
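For example, to see which of your scratch files are approaching the 30-day limit (a hedged one-liner; adjust the age threshold as needed):

```bash
# List scratch files not modified in the last 25 days, oldest first,
# so you can archive anything you need before automatic deletion:
find "$USER_SCRATCH" -type f -mtime +25 -printf '%T@ %p\n' | sort -n
```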
### Local Scratch

Each compute node has a local scratch directory that is accessible via the variable `$LOCAL_SCRATCH`. If your job performs a lot of file I/O, the job should use `$LOCAL_SCRATCH` rather than `$USER_SCRATCH` to avoid bogging down the network scratch file system. It is important to recognize that most jobs run on the cluster do not fall under this category.

**Important**

`$LOCAL_SCRATCH` is only useful for jobs in which all processes run on the same compute node, so MPI jobs are not candidates for this solution. Use the `#SBATCH --nodes=1` Slurm directive to specify that all requested cores are on the same node.
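A minimal job script sketch using `$LOCAL_SCRATCH` (the resource values, input file, and analysis command are placeholders, not recommendations):

```bash
#!/bin/bash
#SBATCH --job-name=local-scratch-demo
#SBATCH --nodes=1                 # required: keep all processes on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

# Stage input onto fast node-local storage:
WORKDIR="$LOCAL_SCRATCH/$SLURM_JOB_ID"
mkdir -p "$WORKDIR"
cp "$USER_SCRATCH/input.dat" "$WORKDIR/"   # hypothetical input file

# Do the I/O-heavy work locally:
cd "$WORKDIR"
./my_analysis input.dat > output.dat       # hypothetical analysis step

# Copy results back to network storage and clean up local scratch:
cp output.dat "$USER_SCRATCH/"
rm -rf "$WORKDIR"
```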
## Temporary Files (`tmp`)

Please do not use the directory `tmp` as storage for temporary files. The `tmp` directory is local to each node, and a full `tmp` directory harms compute performance on that node for all users. Instead, please use `$LOCAL_SCRATCH` for fast access and `$USER_SCRATCH` for larger space.

Some software defaults to using `tmp` without any warning or documentation, especially software designed for personal computers. We may reach out to inform you if your software fills `tmp`, as it can harm performance on that compute node. If that happens, we will work with you to identify ways of redirecting temporary storage to one of the scratch spaces.
### Software Known to Use `tmp`

The following software is known to use `tmp` by default; this can be worked around by using the listed flags.

- Java: `java * -Djava.io.tmpdir=$LOCAL_SCRATCH`
- UMI Tools: `umi_tools * --temp-dir=$LOCAL_SCRATCH`
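Many other tools respect the `TMPDIR` environment variable; this is an assumption worth verifying for your specific software, but when it holds, one line near the top of a job script is enough:

```bash
# Point POSIX-style temporary-file APIs at node-local scratch:
export TMPDIR="$LOCAL_SCRATCH"
```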
## How much space do I have left?

- Personal: use the command `quota-report` to see usage in `/data/user/$USER` and `/scratch/$USER`.
- Project: use the command `proj-quota-report <project>`. Replace `<project>` with the appropriate project directory name, i.e., `/data/project/<project>`. Be sure not to use a trailing slash: use `proj-quota-report mylab`, not `proj-quota-report mylab/`.
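For example (the project name is hypothetical):

```bash
quota-report              # personal usage: /data/user/$USER and /scratch/$USER
proj-quota-report mylab   # project usage; note: no trailing slash
```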
Both quota reports are updated nightly, so they may be out of date if you move data around before running these commands.
**Tip**

Running out of space? Can't afford to remove any data? Please consider using our Long Term Storage (LTS) system.
## Data Policies

### Replication

Data on GPFS are replicated using a RAID configuration, which is intended to decrease the risk of data loss in the event of disk failures. This is standard good practice for research computing data centers.
### Backups

**Important**

Backups of data are the responsibility of researchers using Cheaha.

A good practice for backing up data is to use the 3-2-1 rule, as recommended by US-CERT:

- 3: Keep 3 copies of important data: 1 primary copy for use, 2 backup copies.
- 2: Store backup copies on 2 different media types to protect from media-specific hazards.
- 1: Store 1 backup copy offsite, located geographically distant from the primary copy.
What hazards can cause data loss?

- Accidental file deletion.
    - Example: mistakenly deleting the wrong files when using the shell command `rm`.
    - Files deleted with `rm` or any similar command cannot be recovered by us under any circumstances.
    - Please restore from a backup.
- Natural disasters.
    - Examples: tornado; hurricane.
    - All of our data sits in one geographical location at the UAB Technology Innovation Center (TIC).
    - Plans to add geographical data redundancy are being considered.
    - Please restore from an offsite backup.
- Unusable backups.
    - Examples: backup software bug; media destroyed; natural disaster at offsite location.
    - Regularly test data restoration from all backups.
How can I ensure data integrity?

- Regularly back up your (and your lab's) data in an offsite location.
- S3-based long-term storage (LTS) can be used for short-term onsite backup, as sketched below.
- CrashPlan licenses are available for automatic offsite backups; please contact Support for more information.
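As an illustration of onsite backup to LTS, assuming you have already configured an S3 client such as rclone with a remote (the remote name `lts` and bucket name below are hypothetical):

```bash
# One-way sync of a project folder to an S3 bucket on LTS.
# Run with --dry-run first to preview what would be transferred:
rclone sync "$USER_DATA/my_project" lts:my-backup-bucket/my_project --dry-run
rclone sync "$USER_DATA/my_project" lts:my-backup-bucket/my_project
```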
### HIPAA Compliance

As of December 2019, Cheaha is HIPAA compliant, so PHI can be stored on it. Currently, long-term storage is NOT HIPAA compliant, but will be in the future.

**Important**

It is the responsibility of the user to make sure PHI is only accessible by researchers on the IRB. If PHI is being stored in a project folder and some researchers are not on an IRB, their access to those files should be restricted using access control lists (ACLs). If you need assistance setting up ACLs properly, please contact us. An illustrative restriction is sketched below.
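One hedged way to restrict a specific user, with hypothetical names (this does not replace a review of all other access paths to the data):

```bash
# Explicitly deny a user who is not on the IRB all access to a PHI folder;
# a named-user ACL entry with empty permissions overrides group-based access:
setfacl -R -m u:notonirb:--- /data/project/mylab/phi_data
getfacl /data/project/mylab/phi_data   # verify the entry
```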
### Scratch Retention Policy

Starting January 2023, data stored in `/scratch` will be subject to two limited-retention policies:

- Each user will have a quota of 50 TB of scratch storage.
- Files will be retained for 30 days.
### Directory Permissions
Default file permissions are described for each directory above. Additional background on Linux file system permissions can be found here:
- https://its.unc.edu/research-computing/techdocs/how-to-use-unix-and-linux-file-permissions/
- https://www.rc.fas.harvard.edu/resources/documentation/linux/unix-permissions/
- https://hpc.nih.gov/storage/permissions.html