UAB IT Research Computing (UABRC) Resources and Cybersecurity Facilities Document
Last Updated 2023-10-01

INTRODUCTION

UAB IT Research Computing (UABRC) maintains the Research Computing System (RCS), an integrated computing platform that provides access to enhanced compute, storage, and network capacity for UAB investigators and their collaborators. The RCS compute resources include a high-performance computing (HPC) cluster for large-scale modeling and analysis workloads, an on-site cloud platform for highly customizable virtual machine (VM) based workloads, and a container orchestration platform for cloud-native workloads. The RCS storage resources include a high-speed GPFS file system attached to the HPC cluster to support data analysis workloads and large-capacity block and object storage to support the cloud and container workloads. The RCS networking infrastructure connects the computing and storage systems via a 200 Gbps Ethernet interconnect that provides ample capacity for data access and movement between the compute and storage resources. RCS networking also includes connectivity to the UAB campus network, a dedicated EDR/HDR InfiniBand fabric for low-latency data exchange on the HPC cluster, and a 40 Gbps Science DMZ for high-speed data transfers with national research and education networks. These RCS resources combine to provide a low-friction application hosting platform that enables research teams to build and deploy preferred tools without enforced refactoring to adapt applications to campus resources.

The RCS resources are deployed across two data centers. The on-campus data center is in the recently constructed Technology Innovation Center (TIC). The off-campus data center is located at a nearby (less than 10 km) commercial facility operated by DC BLOX, a regional data center provider offering a Tier III colocation facility in Birmingham with adequate power to support the high-density power requirements for expanding the compute capacity of the RCS and a resilient physical infrastructure designed to withstand natural disasters like tornadoes, which are common in the region. The commercial facility is connected to the campus data center via a dedicated, diverse fiber path lit with dual 100 Gbps optics that leverages the University of Alabama System Regional Optical Network (UASRON).

UABRC designs and maintains the RCS resources in open collaboration with the campus research community to ensure that the system addresses research needs and has the capacity to meet research demand. To support the growth represented by ongoing research activities, major updates to the RCS platform are planned over 2023 and 2024, including upgrades to the GPU resources on the HPC cluster, upgrades to the inter-data center network to dual 400 Gbps optics, the addition of four 34 kW rack leases at DC BLOX, and upgrades to the HPC compute resources and GPFS file system.

HIGH-PERFORMANCE COMPUTING (Cheaha) RESOURCES

Cheaha is a campus high-performance computing (HPC) resource dedicated to enhancing research computing productivity at UAB. Cheaha is managed by UAB IT Research Computing (UABRC) and is available to members of the UAB community in need of increased computational capacity. Cheaha supports both high-performance computing (HPC) and high-throughput computing (HTC) paradigms. Cheaha provides 10,752 CPU cores, 112 GPUs, and 88 TB of memory across 159 compute nodes interconnected via an EDR/HDR InfiniBand network, providing over 1.1 PFLOP/s of aggregate theoretical peak performance.
A high-performance, 7 PB GPFS parallel filesystem running on DDN SFA14KX hardware is connected to these HPC compute nodes via the InfiniBand fabric. Cheaha provides researchers with both a web-based interface, via Open OnDemand, and a traditional command-line interactive environment, via SSH. These interfaces provide access to many scientific tools that can leverage a dedicated pool of on-premises compute resources via the SLURM batch scheduler.

The on-premises compute pool provides access to five recent generations of hardware based on the x86 64-bit architecture. Gen7 (installed 2017) is composed of 18 nodes with an EDR InfiniBand interconnect: 2x14-core (504 cores total) 2.4 GHz Intel Xeon E5-2680 v4 compute nodes, each with 256 GB RAM and four NVIDIA Tesla P100 16 GB GPUs. Gen8 (2019) is composed of 35 nodes with an EDR InfiniBand interconnect: 2x12-core (840 cores total) 2.60 GHz Intel Xeon Gold 6126 compute nodes, with 21 nodes at 192 GB RAM, 10 nodes at 768 GB RAM, and 4 nodes at 1.5 TB RAM. Gen9 (Q2 2021) is composed of 52 nodes with an EDR InfiniBand interconnect: 2x24-core (2,496 cores total) 3.0 GHz Intel Xeon Gold 6248R compute nodes, each with 578 GB RAM. Gen10 (Q4 2021) is composed of 34 nodes with an EDR InfiniBand interconnect: 2x64-core (4,352 cores total) 2.0 GHz AMD Epyc 7713 Milan compute nodes, each with 512 GB RAM. Gen11 (Q4 2023) is composed of 20 nodes with an EDR InfiniBand interconnect: 2x64-core (2,560 cores total) 2.0 GHz AMD Epyc 7713 Milan compute nodes, each with 512 GB RAM and two NVIDIA A100 80 GB GPUs. The compute nodes combine to provide over 1.1 PFLOP/s of dedicated computing power. In Q1 2024 the GPFS file system will be upgraded from version 4 to version 5 and migrated to a new hardware platform with expanded SSD capacity to improve data loading performance and enable expanded capacity through policy-based integration of object stores.

HIGH-PERFORMANCE COMPUTING (Cheaha) SOFTWARE TOOLS

General research computing and scientific programming software is available on Cheaha, including Anaconda, R and RStudio, and MATLAB, through the Lmod environment module system. RStudio, MATLAB, Jupyter Notebook server, and JupyterLab are all available on our Open OnDemand web portal as interactive applications, along with a general-use desktop environment via noVNC, directly in the browser. Researchers can develop and share their own custom interactive applications through a sandbox application feature within Open OnDemand.

The UAB Center for Clinical and Translational Science (CCTS) Informatics group has installed and supports a variety of bioinformatics tools that are available to be run on Cheaha. Standalone packages are available for quality control (FastQC, Picard Tools), assembly and alignment (ABySS, Velvet, BWA, Bowtie), visualization (IGV), variant calling (GATK, SnpEff, ANNOVAR), RNA-seq (Cufflinks, Cuffdiff, TopHat), and microbiome and metagenomic analysis (QIIME, HUMAnN, MEGAN). Additional scientific domain-specific software is also available, including Relion for cryo-electron microscopy analysis, AFNI for fMRI analysis, and ANSYS for simulations supporting research efforts of the UAB School of Engineering. Many other software packages are installed and maintained, and we encourage and help researchers install their own additional software using Anaconda, R, and MATLAB package management where possible.
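As an illustration of how the SLURM batch scheduler and Lmod modules described above are typically used, the following minimal Python sketch writes a batch script and submits it with sbatch. The partition and module names are placeholders rather than actual Cheaha values; the Cheaha documentation lists the correct names.

    import subprocess

    # Minimal SLURM batch script; the partition and module names below are
    # illustrative placeholders, not actual Cheaha values.
    job_script = """#!/bin/bash
    #SBATCH --job-name=example-analysis
    #SBATCH --partition=express        # placeholder partition name
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=16G
    #SBATCH --time=01:00:00

    module load Anaconda3              # placeholder module name; see 'module avail'
    python analyze.py
    """

    # Write the script to disk and submit it to the scheduler.
    with open("example_job.sh", "w") as handle:
        handle.write(job_script)

    result = subprocess.run(["sbatch", "example_job.sh"], capture_output=True, text=True)
    print(result.stdout.strip())   # e.g., "Submitted batch job <id>"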
ON-PREMISES CLOUD (cloud.rc) RESOURCES

UABRC operates a production private cloud called cloud.rc based on OpenStack, which echoes the research workload support goals of the NSF's Jetstream2 resource, part of the ACCESS network. The cloud.rc platform has been used to support application development and DevOps processes for research labs across campus and is increasingly being leveraged to support many aspects of research IT operations. This fabric is composed of five Dell R640 compute nodes, each with 48 cores and 192 GB RAM, for a total of 240 cores and 960 GB of standard cloud compute resources. In addition, the fabric features four NVIDIA DGX A100 nodes, each with eight A100 GPUs and 1 TB of RAM. This resource pool will be expanded with additional capacity via a backfill of the oldest generation of HPC nodes (Gen7). These resources are available to the research community for provisioning on demand via the OpenStack services (Ussuri release). The production implementation further supports researchers by making their hosted services available beyond campus, while adhering to standard campus network security practices. Scientific software developers have access to the full stack, with a frictionless migration path to public cloud providers as needed for specific research projects. A Kubernetes environment was deployed in Q3 2022 to allow for development workflows using containers. The compute resources of the Kubernetes environment duplicate those of the cloud platform. The OpenStack and Kubernetes resources are deployed via Canonical's Charmed operations stack, enabling node migration between platforms in response to capacity demand.
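To illustrate the on-demand provisioning workflow against the OpenStack services, the following minimal sketch uses the openstacksdk Python client to launch a VM. The cloud profile, image, flavor, and network names are hypothetical placeholders rather than actual cloud.rc values.

    import openstack

    # Connect using a profile from clouds.yaml; "cloud-rc" is a placeholder name.
    conn = openstack.connect(cloud="cloud-rc")

    # Look up an image, flavor, and network; these names are illustrative only.
    image = conn.compute.find_image("Ubuntu-22.04")
    flavor = conn.compute.find_flavor("m1.medium")
    network = conn.network.find_network("lab-network")

    # Launch a VM and wait until it becomes active.
    server = conn.compute.create_server(
        name="analysis-vm",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    server = conn.compute.wait_for_server(server)
    print(server.status)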
STORAGE RESOURCES

The compute nodes on Cheaha are backed by high-performance, 7 PB GPFS storage on DDN SFA14KX hardware connected via an EDR/FDR InfiniBand fabric. Additional storage systems are available to support research operations and application design. They are based on the Ceph storage technology and provide different hardware configurations to address different usage scenarios. These fabrics include a 6.9 PB archive storage fabric for long-term storage (LTS) built using 12 Dell DSS7500 nodes, an 11 PB nearline storage fabric built with 14 Dell R740xd2 nodes, and a 248 TB SSD cache storage fabric (Q3 2023) built with 8 Dell 840 nodes. These storage systems provide block and object storage services to the OpenStack and Kubernetes platforms. Additionally, the object storage services provide research applications with cloud-native data management and availability capabilities.

NETWORK RESOURCES

The RCS networking infrastructure connects the on- and off-campus computing and storage systems via a 200 Gbps Ethernet interconnect that provides capacity for data access and movement between the compute and storage resources. RCS networking also includes a dedicated EDR/HDR InfiniBand fabric for the HPC platform and a 40 Gbps Science DMZ for high-speed data transfers with national research and education networks. The Science DMZ supports direct connection to campus and high-bandwidth regional networks via 40 Gbps Globus Data Transfer Nodes (DTNs), providing the capability to connect data-intensive research facilities directly with the high-performance computing and storage services of the RCS. This network supports very high-speed, secure connectivity between its nodes for transferring very large data sets without interfering with other traffic on the campus backbone, ensuring predictable latencies. The Science DMZ also includes perfSONAR measurement nodes and a Bro security node connected directly to the border router, which together provide a "friction-free" pathway to external data repositories as well as computational resources.

The UAB campus network backbone is based on a 40 Gbps redundant Ethernet network with 480 Gbps backplanes on the core L2/L3 switch/routers. For efficient management, a collapsed backbone design is used. Each campus building is connected using 10 Gbps Ethernet links over single-mode optical fiber, and desktops are connected at 1 Gbps. The campus wireless network blankets classrooms, common areas, and most academic office buildings. UAB connects to the Internet2 high-speed research network via the University of Alabama System Regional Optical Network (UASRON), a University of Alabama System owned and operated DWDM network offering 100 Gbps Ethernet to the Southern Light Rail (SLR)/Southern Crossroads (SoX) in Atlanta, GA. The UASRON also connects UAB to UA and UAH, the other two University of Alabama System institutions, and to the Alabama Supercomputer Center. UAB is also connected to other universities and schools through the Alabama Research and Education Network (AREN).

DATA MANAGEMENT AND TRANSFER RESOURCES

A traditional POSIX filesystem is available on all Cheaha HPC nodes through GPFS for data requiring computational analysis, with separate data, scratch, and shared storage. Object storage is available via our Long-Term Storage (LTS) S3 interface. A REST endpoint is provided for LTS and exposed to the Internet to facilitate hosting research data products for external use. Block storage is available to support storage needs for our cloud and Kubernetes fabrics. All faculty, staff, and students who create a Research Computing account have immediate access to 5 TB of personal GPFS storage and may request an additional 5 TB of LTS storage. Research PI groups, core facilities, and other research groups at UAB may request up to 25 TB of GPFS storage and 75 TB of LTS storage for shared collaboration spaces.

Globus High-Assurance (HA) endpoints are maintained on the RCS platform to facilitate internal and external data transfers. Connectors to our enterprise Box.com instance and our LTS S3 interface are made available as part of our Globus subscription. A controlled-access Science DMZ partition of our hardware is available to facilitate high-throughput, parallel batch data transfers over the available 40 Gbps connection to the external Internet. Standard data transfer software such as Rclone, the AWS CLI, and s3cmd, along with UAB-specific documentation and support, is provided to researchers to facilitate data transfers whenever Globus is infeasible.
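As an example of scripted transfers to the LTS S3 interface when Globus is infeasible, the following minimal sketch uses the boto3 Python library. The endpoint URL, credentials, and bucket name are placeholders for illustration; actual values are issued through UABRC and documented on our website.

    import boto3

    # Endpoint URL, credentials, and bucket name are placeholders for illustration only.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://lts.example.edu",        # placeholder LTS endpoint
        aws_access_key_id="YOUR_ACCESS_KEY",
        aws_secret_access_key="YOUR_SECRET_KEY",
    )

    # Upload a local result file to an LTS bucket, then list the bucket's contents.
    s3.upload_file("results/summary.csv", "my-lab-bucket", "project-a/summary.csv")
    for obj in s3.list_objects_v2(Bucket="my-lab-bucket").get("Contents", []):
        print(obj["Key"], obj["Size"])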
FACILITATION, OUTREACH, DOCUMENTATION, AND SUPPORT

UABRC provides research computing facilitation services for researchers using the RCS. These services exist to reduce friction for researchers seeking to scale workflows from the desktop and workstation scale up to HPC scale. Facilitation services also include computational outreach efforts within UAB, such as facilitating lesson design for courses making use of our platform, teaching a Data Science Journal Club course, providing how-to-use-HPC lessons at university events, and proactively identifying opportunities for education and efficiency improvements using our internal observability stack.

Extensive documentation of computational capabilities, good practices for system use, references, and tutorials is available on our documentation website, which is publicly available on the Internet. Links to common, standard external educational resources are provided and encouraged where applicable, including the Software Carpentry and Data Carpentry lessons and GitHub and GitLab version control documentation.

We provide coverage for support requests during standard business hours, and greater coverage for outages and security incidents. Tickets are tracked using ServiceNow software, and most requests are addressed within 1-2 business days, with faster responses for critical incidents. Support requests covered include software installation and update requests, new researcher training, hardware and software errors, data transfer and shared storage requests, and facilitation of collaborative, grant-funded research computation projects ranging from individuals, through labs, to core facilities.

RESEARCH COMPUTING PERSONNEL

UAB IT Research Computing currently maintains a support staff of 13 led by the Assistant Vice President for Research Computing, and includes one HPC Architect-Manager, one Research Facilitation and Data Manager, four software developers, two Research Facilitation Scientists, three system administrators, and a project coordinator.

UAB CYBERSECURITY POLICIES AND PRACTICES

UAB IT maintains a unified and comprehensive privacy and information security program that preserves and protects the confidentiality, availability, and integrity of all information assets, including patient, research, customer, and business data. The integrated security program upholds these values and provides high standards of service, trust, confidentiality, and responsiveness to patients, customers, employees, and business associates. The IT security program addresses regulatory requirements through a practical approach that preserves and protects the confidentiality, availability, and integrity of information assets. The UAB IT security program includes the following:

- IT security policies designed to ensure a secure state of operations and information management.
- Technical security standards that document baseline security requirements for key technologies and platforms such as major operating systems, databases, network device operating systems, firewalls, web-server security, email, encryption, secure file transfer protocols, virus defense, media reuse, and media disposal.
- A comprehensive qualitative risk management program.
- A computer security incident response plan that is supported by cross-functional response and recovery teams.
- User system access that is tightly controlled and meets standards required by various regulations and accrediting agencies, including HIPAA, JCAHO, and CAP. Two-factor authentication is utilized for access to all shared systems. Users must agree to maintain password confidentiality, log off terminals at the end of each user session, and alert management when security violations become known. We also routinely demonstrate compliance with the security requirements of federal granting agencies and frameworks such as the NIH, the VA, and FISMA.
- An institutional firewall for perimeter and layered protection.
- Network Intrusion Detection Systems (NIDS) strategically deployed to continuously monitor Internet, extranet, and internal communications.
- Zero-day desktop and server centralized Microsoft patch update services.
- A centralized, firewall-protected Microsoft Exchange email system with spam scoring and virus scanning.
- 168-bit 3DES-encrypted IPsec tunnels for business associate, staff remote access, or partner VPN connectivity.
- Capability to support encrypted secure file transfers.
- Virus protection agents and comprehensive patch management programs installed on all computer workstations and servers to protect against malware infections.
- PGP whole-disk encryption software required for all portable and high-risk devices.
- In-depth security training provided for all faculty, staff, and students.

UAB has an extensive infrastructure to secure HIPAA-defined Electronic Protected Health Information (ePHI) from its creation and throughout its lifecycle. Secure web portals are utilized to make the required information accessible only to those who need access. The existing wireless infrastructure and secure VLAN architecture make the required ePHI portable but secure. Portable devices do not cache data locally, and transmissions are encrypted.

UAB applications are designed and developed using a comprehensive set of security standards. Areas addressed within the application security standards include password construction, strength, and control; browser technologies; authentication and access control; security administration; and logging, auditing, and monitoring. Internet applications mandate TLS encryption with strong cipher suites for the transmission of any sensitive data. Before going into production, all new Internet applications must be submitted for security testing. All identified security issues that could impact the confidentiality or integrity of our data must be corrected prior to production release. Applications are retested on a regular schedule that coincides with major release cycles. A comprehensive change management system is utilized for updates, production changes, quality control, and revision management.