BwUniCluster Hardware and Architecture
1 Architecture of bwUniCluster
The bwUniCluster (uc1) is a parallel computer with distributed memory. Each node of uc1 has at least two Intel Xeon processors, local memory, disks and network adapters. All nodes are connected by a fast InfiniBand 4X FDR interconnect. In addition, the parallel file system Lustre is attached to bwUniCluster (uc1) by coupling the InfiniBand of the file server with the InfiniBand switch of the compute cluster, which provides a fast and scalable parallel file system.
The basic operating system on each node is Red Hat Enterprise Linux (RHEL) 7.4. On top of this operating system a set of software components, such as MOAB, has been installed. Some of these components are of special interest to end users and are briefly discussed in this document. Others, which are of greater importance to system administrators, are not covered here.
Nodes of uc1 may have different roles. According to the services supplied by the nodes, they are separated into disjoint groups. From an end user's point of view the different groups of nodes are login nodes, compute nodes, file server nodes and administrative server nodes.
Login Nodes
The login nodes are the only nodes that are directly accessible by end users. These nodes are used for interactive login, file management, program development and interactive pre- and postprocessing. Several nodes are dedicated to this service, but they are all accessible via one address; a DNS round-robin alias distributes the login sessions to the different login nodes.
Compute Nodes
The majority of nodes are compute nodes, which are managed by a batch system. Users submit their jobs to the batch system MOAB, which uses SLURM as its underlying resource manager. A job is executed, depending on its priority, when the required resources become available.
File Server Nodes
The hardware of the parallel file system Lustre incorporates some file server nodes; the file system Lustre is connected by coupling the InfiniBand of the file server with the independent InfiniBand switch of the compute cluster. In addition to shared file space there is also local storage on the disks of each node (for details see chapter "File Systems").
Administrative Server Nodes
Some other nodes deliver additional services such as resource management, external network connection and administration. These nodes can be accessed directly by system administrators only.
2 Components of bwUniCluster
|Property||Compute nodes "Thin"||Compute nodes "Fat"||Compute nodes "Broadwell"||Login nodes "Thin / Fat"||Login nodes "Broadwell"||Remote Vis Nodes||Service nodes|
|Number of nodes||512||8||352||2||2||9||10|
|Processors||Intel Xeon E5-2670 (Sandy Bridge)||Intel Xeon E5-4640 (Sandy Bridge)||Intel Xeon E5-2660 v4 (Broadwell)||Intel Xeon E5-2670 (Sandy Bridge)||Intel Xeon E5-2630 v4 (Broadwell)||Intel Xeon E5-2637 v2 (Ivy Bridge)||Intel Xeon E5-2670 (Sandy Bridge)|
|Processor frequency (GHz)||2.6||2.4||2.0||2.6||2.2||3.5||2.6|
|Number of sockets||2||4||2||2||2||2||2|
|Total number of cores||16||32||28||16||20||8||16|
|Main memory||64 GB||1024 GB||128 GB||64 GB||128 GB||256 GB||64 GB|
|Local disk||2 TB||7 TB||480 GB||4 TB||480 GB||1 TB||1 TB|
|Cache per socket||Level 1: 8x64 KB Level 2: 8x256 KB Level 3: 20 MB||Level 1: 8x64 KB Level 2: 8x256 KB Level 3: 20 MB||Level 1: 14x64 KB Level 2: 14x256 KB Level 3: 35 MB||Level 1: 8x64 KB Level 2: 8x256 KB Level 3: 20 MB||Level 1: 10x64 KB Level 2: 10x256 KB Level 3: 25 MB||Level 1: 4x64 KB Level 2: 4x256 KB Level 3: 15 MB||Level 1: 8x64 KB Level 2: 8x256 KB Level 3: 20 MB|
- 512 16-way Intel Xeon compute nodes. Each of these nodes contains two Octa-core Intel Xeon processors E5-2670 (Sandy Bridge) which run at a clock speed of 2.6 GHz and have 8x256 KB of level 2 cache and 20 MB level 3 cache. Each node has 64 GB of main memory, local disks with 2 TB and an adapter to connect to the InfiniBand 4X FDR interconnect.
- 8 32-way Intel Xeon compute nodes. Each of these nodes contains four Octa-core Intel Xeon processors E5-4640 (Sandy Bridge) which run at a clock speed of 2.4 GHz and have 8x256 KB of level 2 cache and 20 MB level 3 cache. Each node has 1024 GB of main memory, local disks with 7 TB and an adapter to connect to the InfiniBand 4X FDR interconnect.
- 352 28-way Intel Xeon compute nodes. Each of these nodes contains two Fourteen-core Intel Xeon processors E5-2660 v4 (Broadwell) which have 14x256 KB of level 2 cache and 35 MB level 3 cache. The CPUs are guaranteed to reach 1.7 GHz with Floating point intense workloads, 2.0 GHz with normal workloads, and can clock up to 2.4 GHz when using all cores or 3.2 GHz when using one or two cores per socket. Each node has 128 GB of main memory, a local solid state disk with 480 GB and a Connect-IB adapter to connect to the InfiniBand 4X FDR interconnect.
- 2 16-way Intel Xeon login nodes. Both nodes contain two Octa-core Intel Xeon processors E5-2670 (Sandy Bridge) which run at a clock speed of 2.6 GHz and have 8x256 KB of level 2 cache and 20 MB level 3 cache. Each node has 64 GB of main memory, local disks with 4 TB and an adapter to connect to the InfiniBand 4X FDR interconnect.
- 2 20-way Intel Xeon login nodes. Both nodes contain two Ten-core Intel Xeon processors E5-2630 v4 (Broadwell) which have 10x256 KB of level 2 cache and 25 MB level 3 cache. The CPUs are guaranteed to reach 1.8 GHz with Floating point intense workloads, 2.2 GHz with normal workloads, and can clock up to 2.4 GHz when using all cores or 3.1 GHz when using one or two cores per socket. Each node has 128 GB of main memory, a local solid state disk with 480 GB and a ConnectX-3 adapter to connect to the InfiniBand 4X FDR interconnect.
- 9 8-way Intel Xeon remote visualization nodes. Each of these nodes contains two Quad-core Intel Xeon processors E5-2637 v2 (Ivy Bridge) which run at a clock speed of 3.5 GHz and have 4x256 KB of level 2 cache and 15 MB level 3 cache. Each node has 256 GB of main memory, a local hard disk with 1 TB and an adapter to connect to the InfiniBand 4X FDR interconnect. Additionally, these nodes are equipped with two NVIDIA Quadro K6000 GPUs for accelerating remote rendering.
- 10 16-way Intel Xeon service nodes. Each of these nodes contains two Octa-core Intel Xeon processors E5-2670 (Sandy Bridge). Each node has 64 GB of main memory, one InfiniBand adapter and local disks.
- File server nodes that are part of the scalable, parallel file system Lustre that is tied to bwUniCluster via a separate InfiniBand network. The global shared storage of the file system is subdivided into a part used for home directories and a larger part for non permanent files. The directories in the larger part of the file system are often called work-directories. The details are described in the next chapter.
An important component of bwUniCluster is the InfiniBand 4X FDR interconnect. All nodes are attached to this interconnect which is characterized by its very low latency of about 1 microsecond and a point to point bandwidth between two nodes of more than 6000 MB/s. Especially the very short latency makes the parallel system ideal for communication intensive applications and applications doing a lot of collective MPI communications.
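To get a feeling for what these two figures mean in practice, the following sketch estimates the transfer time of a single point-to-point message as latency plus message size divided by bandwidth. It is only a rough model: it assumes the nominal values quoted above (about 1 microsecond, 6000 MB/s with 1 MB taken as 10^6 bytes) and ignores protocol overhead and contention. The function name estimate_transfer_us is made up for this illustration.

```shell
#!/bin/bash
# Rough model: time = latency + bytes / bandwidth.
# Assumed values from the text: ~1 us latency, 6000 MB/s (1 MB = 10^6 bytes).
estimate_transfer_us() {
    local bytes="$1"
    awk -v b="$bytes" 'BEGIN {
        latency_us   = 1.0      # about 1 microsecond
        bytes_per_us = 6000.0   # 6000 MB/s equals 6000 bytes per microsecond
        printf "%.1f\n", latency_us + b / bytes_per_us
    }'
}

estimate_transfer_us 8          # small message: dominated by latency
estimate_transfer_us 6000000    # 6 MB message: dominated by bandwidth
```

For small messages the latency term dominates, which is why applications with many small or collective communications benefit most from the low latency.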
With these types of nodes bwUniCluster can meet the requirements of a broad range of applications:
- applications that are parallelized by the message passing paradigm and use high numbers of processors will run on a subset of the 512 sixteen-way nodes and exchange messages over the InfiniBand interconnect,
- applications that are parallelized using shared memory either by OpenMP or explicit multithreading with pthreads can run within the 16-way or 32-way nodes.
3 File Systems
On bwUniCluster (uc1) the parallel file system Lustre is used for globally visible user data. Lustre is open source and Lustre solutions and support are available from different vendors. Nowadays, most of the biggest HPC systems are using Lustre. Initial directories on the Lustre file systems are created for each user, and environment variables $HOME and $WORK point to these directories. Within a batch job a further directory $TMP is available which is only visible on the local node and is located on the local disk(s).
Some of the characteristics of the file systems are shown in Table 1.
|Property||$TMP||$HOME||$WORK / workspace|
|Lifetime||batch job walltime||permanent||min. 7 days / max. 240 days|
|Disk space||2 TB @ thin nodes, 7 TB @ fat nodes, 4 TB @ login nodes, 480 GB @ broadwell nodes||427 TiB for KIT, 427 TiB for bw-users||853 TiB / 853 TiB|
|Quotas||no||yes, per group||(currently) no|
|Read perf./node||280 MB/s @ thin node, 593 MB/s @ fat node, 416 MB/s @ login node, 420 MB/s @ broadwell node||1 GB/s||1 GB/s|
|Write perf./node||270 MB/s @ thin node, 733 MB/s @ fat node, 615 MB/s @ login node, 380 MB/s @ broadwell node||1 GB/s||1 GB/s|
|Total read perf.||n*280 or n*593 MB/s||8 GB/s||16 GB/s|
|Total write perf.||n*270 or n*733 MB/s||8 GB/s||16 GB/s|
global : all nodes of uc1 access the same file system; local : each node has its own file system; permanent: files are stored permanently; batch job : files are removed at end of the batch job.
Table 1: File Systems and Environment Variables
3.1 Selecting the appropriate file system
In general, you should separate your data and store it on the appropriate file system. Permanently needed data like software or important results should be stored below $HOME but capacity restrictions (quotas) apply. In case you accidentally deleted data on $HOME you can usually restore it from backup. Permanent data which is not needed for months or exceeds the capacity restrictions should be sent to bwFileStorage or to the archive and deleted from the file systems. Temporary data which is only needed on a single node and which does not exceed the disk space shown in the table above should be stored below $TMP. Temporary data which is only needed during job runs or which can be easily recomputed or which is the result of one job and input for another job should be stored below $WORK or in so-called workspaces. Data below $WORK has a guaranteed lifetime of only 7 days and data older than 28 days (according to the modification time) will be automatically deleted. The lifetime of data in workspaces is also limited and depends on the lifetime of the workspace, i.e. the data can usually be kept for a longer time and workspaces are deleted as a whole.
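The separation described above can be sketched as a small shell helper for use inside a job script: scratch data stays on the node-local $TMP and only the results are copied to a longer-lived directory such as one below $WORK. This is a minimal sketch, not an official cluster tool; the function name run_with_local_scratch and the paths in the usage comment are hypothetical.

```shell
#!/bin/bash
# Sketch: run a command in a scratch directory below $TMP and copy
# everything it produced to a result directory (e.g. below $WORK).
# run_with_local_scratch is a hypothetical helper name.
run_with_local_scratch() {
    local result_dir="$1"
    shift
    local scratch
    # Fall back to /tmp when $TMP is not set (e.g. outside a batch job).
    scratch="$(mktemp -d "${TMP:-/tmp}/scratch.XXXXXX")" || return 1
    # Run the given command with the scratch directory as working directory.
    ( cd "$scratch" && "$@" )
    local status=$?
    # Copy the produced files to the longer-lived location, then clean up.
    mkdir -p "$result_dir" && cp -r "$scratch"/. "$result_dir"/
    rm -rf "$scratch"
    return "$status"
}

# Usage (hypothetical application and paths):
#   run_with_local_scratch "$WORK/run1" "$HOME/bin/my_app" "$WORK/input.dat"
```

Because the command runs with the scratch directory as its working directory, programs and input files should be passed with absolute paths.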
The most efficient way to transfer data to or from other HPC file systems or bwFileStorage is to use the tool rdata.
In case you are working on different HPC systems (IC2, ForHLR I and ForHLR II), the only file system which is visible on all systems is $HOME. However, you should not use $HOME for temporary data; instead, distribute your data across the file systems as described above. For moving temporary data to other file systems please use the tool rdata.
For further details please check the chapters below.
3.2 $HOME
The home directories of bwUniCluster (uc1) users are located in the parallel file system Lustre. You have access to your home directory from all nodes of uc1 and also from all nodes of ForHLR I and ForHLR II. A regular backup of these directories to the tape archive is done automatically. The directory $HOME is used to hold files that are permanently used, such as source codes, configuration files and executable programs. The home directories ($HOME) are located on the PFS2 (Parallel File System 2), i.e. the $HOME directories of ForHLR I and ForHLR II are the same. For each user group (i.e. one institute or university in Baden-Württemberg) a fixed amount of disk space for the home directories is granted and enforced by so-called quotas. You can find out the disk usage of the users in your group with the command
$ less $HOME/../diskusage
Note that, following a decision of Tübingen's site contacts, users of the University of Tübingen intentionally do not have read permission for this file. Also note that for non-KIT users who also work on ForHLR I the user values in the file might be higher than expected since they also include backup data of $PROJECT. The file is regenerated only once a day, so its contents may not be fully up to date. The current disk usage of your group can be displayed with the command
$ lfs quota -g $(id -ng) $HOME
and your currently used disk capacity with the command
$ lfs quota -u $(whoami) $HOME
Note that for non-KIT users who also work on ForHLR I the displayed values might be higher than expected since they also include backup data of $PROJECT.
3.3 $WORK
On bwUniCluster (uc1) there is additional file space that can be accessed using the environment variable $WORK. The work directories are used for files that have to be available for a certain period, currently 28 days. These are typically restart files or output data that have to be postprocessed. All users can create large temporary files. However, in order to be fair to your colleagues who also want to use this file system, large files which are no longer needed should be removed. SCC automatically removes files in this file system which are older than 28 days (according to the modification time). However, the guaranteed lifetime for files on $WORK is only 1 week. The file system used for the $WORK directories is also the parallel file system Lustre. This file system is especially designed for parallel access and for high throughput to large files. The work file systems show high data transfer rates of up to 16 GB/s write and read performance when the data are accessed in parallel. You can find out your disk usage of $WORK with the command
$ lfs quota -u $(whoami) $WORK
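The quota commands shown for $HOME and $WORK can be collected in one small helper, sketched below. It only wraps the lfs quota calls from the text; the function name show_lustre_usage is hypothetical, and the group query on $WORK is added here purely for symmetry.

```shell
#!/bin/bash
# Print user and group usage for each given directory
# (default: $HOME and $WORK). show_lustre_usage is a hypothetical name.
show_lustre_usage() {
    if ! command -v lfs >/dev/null 2>&1; then
        echo "lfs not found; run this on a cluster node" >&2
        return 1
    fi
    # Without arguments, report on the two Lustre-backed directories.
    [ "$#" -gt 0 ] || set -- "$HOME" "${WORK:-}"
    local dir
    for dir in "$@"; do
        echo "== $dir =="
        lfs quota -u "$(whoami)" "$dir"   # usage of the current user
        lfs quota -g "$(id -ng)" "$dir"   # usage of the user's primary group
    done
}

# Usage on a cluster node:
#   show_lustre_usage
#   show_lustre_usage "$HOME"
```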
3.4 Workspaces
In contrast to the 28-day lifetime of single files under $WORK, workspace directories expire as a whole after a fixed period. The maximum lifetime of a workspace is 60 days, but it can be renewed at the end of that period up to 3 times, giving a total maximum of 240 days after workspace generation.
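The total of 240 days follows from the initial period plus the renewals, as the following small calculation illustrates. The function name max_workspace_days is made up for this example; the values 60 and 3 are the ones stated above.

```shell
#!/bin/bash
# Maximum workspace lifetime = period * (1 + number of renewals).
max_workspace_days() {
    local period_days="$1" renewals="$2"
    echo $(( period_days * (1 + renewals) ))
}

max_workspace_days 60 3    # prints 240
```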
Creating, deleting, finding and extending workspaces is explained on the workspace page.
3.4.1 Reminder for workspace deletion
You can send yourself a calendar entry which reminds you when a workspace will be automatically deleted:
$ ws_send_ical.sh <workspace> <email>
3.5 Improving Performance on $HOME, $WORK and workspaces
The following recommendations might help to improve throughput and metadata performance on Lustre filesystems.
3.5.1 Improving Throughput Performance
Depending on your application some adaptations might be necessary if you want to reach the full bandwidth of the filesystems. Parallel filesystems typically stripe files over storage subsystems, i.e. large files are separated into stripes and distributed to different storage subsystems. In Lustre, the size of these stripes (sometimes also called chunks) is called the stripe size and the number of storage subsystems used is called the stripe count.
When you are designing your application you should consider that the performance of parallel filesystems is generally better if data is transferred in large blocks and stored in few large files. In more detail, to increase the throughput performance of a parallel application the following aspects should be considered:
- collect large chunks of data and write them sequentially at once,
- use moderate stripe count (not only stripe count 1) if only one task is doing IO in order to reach possible client bandwidth (usually limited by internal bus or network adapter),
- to exploit complete filesystem bandwidth use several clients,
- avoid competitive file access or use blocks with boundaries at stripe size (default is 1MB),
- if many tasks use few huge files set stripe count to -1 in order to use all storage subsystems (see below for an example),
- if files are small enough for the local hard drives and are only used by one process store them on $TMP.
The storage systems of $HOME and $WORK are DDN SFA12K RAID systems. The file system $HOME uses 20 volumes. By default, files of $HOME are striped across 1 volume. The file system $WORK uses 40 volumes. By default, files of $WORK are striped across 2 volumes.
However, you can change the stripe count of a directory and of newly created files. New files and directories inherit the stripe count from the parent directory. E.g. if you want to enhance throughput on a single file which is created in the directory $WORK/my_output_dir you can use the command
$ lfs setstripe -c8 $WORK/my_output_dir
to change the stripe count to 8. If the single file is accessed from one task it is not beneficial to further increase the stripe count because the local bus and the interconnect will become the bottleneck. If many tasks and nodes use the same output file you can further increase the throughput by using all available storage subsystems with the following command:
$ lfs setstripe -c-1 $WORK/my_output_dir
Note that the stripe count parameter -1 indicates that all available storage subsystems should be used. If all tasks write to the same file you should make sure that overlapping file parts are seldom used; it is most beneficial if each task uses blocks whose size is a multiple of 1 MB (1 MB is the default stripe size).
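The idea of non-overlapping blocks with boundaries at the stripe size can be illustrated with dd. In the sketch below each task writes its own region of a shared file, starting at an offset that is a multiple of 1 MiB. This is only an illustration under simplifying assumptions: /dev/zero stands in for real data, the task rank is passed as a plain number, and write_aligned_block is a hypothetical name. A real parallel application would typically do this through its I/O library.

```shell
#!/bin/bash
# Each task writes 'blocks' blocks of 1 MiB into a shared file, starting
# at offset rank * blocks * 1 MiB, so regions of different tasks never
# overlap and always start on a 1 MiB boundary.
write_aligned_block() {
    local out="$1" rank="$2" blocks="$3"
    # conv=notrunc keeps the data already written by other tasks.
    dd if=/dev/zero of="$out" bs=1048576 count="$blocks" \
       seek=$(( rank * blocks )) conv=notrunc 2>/dev/null
}

# Usage (hypothetical file name): task 2 writes 16 MiB at offset 32 MiB.
#   write_aligned_block "$WORK/my_output_dir/shared.dat" 2 16
```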
If you change the stripe count of a directory the stripe count of existing files inside this directory is not changed. If you want to change the stripe count of existing files, change the stripe count of the parent directory, copy the files to new files, remove the old files and move the new files back to the old name. In order to check the stripe setting of the file my_file use
$ lfs getstripe my_file
Also note that changes to the striping parameters (e.g. stripe count) are not saved in the backup, i.e. if directories have to be recreated this information is lost and the default stripe count will be used. Therefore, you should note down for which directories you changed the striping parameters so that you can repeat these changes if required.
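The procedure for changing the stripe count of an existing file (set the stripe count on the parent directory, copy the file, replace the original) can be sketched as follows. The function name restripe_file is hypothetical. Note that the sketch changes the default stripe count of the whole parent directory and that it replaces the old file with a single mv instead of a separate remove and move.

```shell
#!/bin/bash
# Rewrite an existing file so that it picks up a new stripe count.
# restripe_file is a hypothetical helper name.
restripe_file() {
    local file="$1" count="$2"
    local dir tmp
    dir="$(dirname "$file")"
    tmp="$file.restripe.$$"
    # New files inherit the stripe count of the parent directory.
    lfs setstripe -c "$count" "$dir" || return 1
    # The copy is a new file and is therefore created with the new layout.
    cp -p "$file" "$tmp" || return 1
    # Replace the old file with the copy under the old name.
    mv "$tmp" "$file"
}

# Usage (hypothetical file name), followed by a check of the result:
#   restripe_file "$WORK/my_output_dir/big.dat" 8
#   lfs getstripe "$WORK/my_output_dir/big.dat"
```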
3.5.2 Improving Metadata Performance
Metadata performance on parallel file systems is usually not as good as on local filesystems. In addition, it is usually not scalable, i.e. it is a limited resource. Therefore, you should avoid metadata operations whenever possible. For example, it is much better to have few large files than lots of small files. In more detail, to increase the metadata performance of a parallel application the following aspects should be considered:
- avoid creating many small files,
- avoid competitive directory access, e.g. by creating files in separate subdirectories for each task,
- if lots of files are created use stripe count 1,
- if many small files are only used by one process store them on $TMP,
- change the default colorization setting of the command ls (see below).
On modern Linux systems, the GNU ls command often uses colorization by default to visually highlight the file type; this is especially true if the command is run within a terminal session. This is because the default shell profile initializations usually contain an alias directive similar to the following for the ls command:
$ alias ls='ls --color=tty'
However, running the ls command in this way for files on a Lustre file system requires a stat() call to be used to determine the file type. This can result in a performance overhead, because the stat() call always needs to determine the size of a file, and that in turn means that the client node must query the object size of all the backing objects that make up a file. As a result of the default colorization setting, running a simple ls command on a Lustre file system often takes as much time as running the ls command with the -l option (the same is true if the -F, -p, or the --classify option, or any other option that requires information from a stat() call, is used). To avoid this performance overhead when using ls commands, add an alias directive similar to the following to your shell startup script:
$ alias ls='ls --color=none'
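The recommendation above to avoid competitive directory access by giving each task its own subdirectory can be sketched like this. The function name task_output_dir and the directory naming scheme are assumptions made for this example; how a task learns its rank depends on the launcher and is not covered here.

```shell
#!/bin/bash
# Create (if needed) and print a per-task output directory such as
# <base>/task_00007, so that tasks do not create files in one shared
# directory. The rank must be a plain decimal number without leading zeros.
task_output_dir() {
    local base="$1" rank="$2" dir
    dir="$(printf '%s/task_%05d' "$base" "$rank")"
    mkdir -p "$dir" && echo "$dir"
}

# Usage (hypothetical base directory and rank variable):
#   outdir="$(task_output_dir "$WORK/my_output_dir" "$rank")"
#   echo "result" > "$outdir/result.txt"
```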
3.6 Collecting job I/O statistics on Lustre file systems
SCC implemented a new tool which collects I/O statistics on Lustre file systems. When a batch job is submitted, a user can request that I/O statistics be collected for dedicated file systems. After the job has completed the tool will send an email with the I/O statistics. Internally, the tool uses the so-called Lustre jobstats. We recommend using this feature at least once for each job type in order to find out how your job's I/O ends up on the Lustre servers.
3.6.1 Collecting job I/O statistics step by step
In order to determine the Lustre file system name of a directory for which I/O statistics should be collected, just use the following command:
$ df <directory> | sed -ne "s|.*o2ib:/\([a-z,0-9]*\).*|\1|p"
Use the file system name during job submission with the following msub option:
-W lustrestats:<file system name>[,<file system name>]...
You can also use msub option -M in case the email with the results should be sent to a non-standard email address. Job I/O statistics are collected every hour. The statistics in the email should be self-explanatory.
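The two steps above (determine the file system name, then pass it to msub) can be combined in a small sketch. The sed expression is the one shown above; the function names and the file system names in the usage comment are hypothetical.

```shell
#!/bin/bash
# Extract the Lustre file system name from df output read on stdin.
parse_lustre_fs_name() {
    sed -ne 's|.*o2ib:/\([a-z,0-9]*\).*|\1|p'
}

# Determine the Lustre file system name of a directory.
lustre_fs_name() {
    df "$1" | parse_lustre_fs_name
}

# Build the msub option from one or more file system names.
lustrestats_option() {
    local IFS=,
    echo "-W lustrestats:$*"   # "$*" joins the names with commas
}

# Usage (job.sh is a hypothetical job script):
#   msub $(lustrestats_option "$(lustre_fs_name "$WORK")") job.sh
```

The command substitution in the usage line is deliberately left unquoted so that -W and its argument are passed to msub as two separate words.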
There is just one limitation: the values could be lower than expected. On the one hand, Lustre clients use normal Linux VFS caching, i.e. read requests might be served from this cache and will not cause additional I/O on the servers. On the other hand, in order to reduce the amount of required memory, Lustre only holds statistics for each device (e.g. OST) and job which was active during the last hour (this period could be changed). In case you are doing I/O at the beginning of your job and afterwards there is no I/O for a long time, it might happen that the statistics do not include the I/O from the beginning.
3.7 $TMP
While all tasks of a parallel application access the same $HOME and $WORK directories, the $TMP directory is local to each node of bwUniCluster (uc1). Tasks of a parallel application that run on different nodes therefore use different $TMP directories. This directory should be used for temporary files that are accessed by single tasks. On the "thin" compute nodes the underlying hardware is 2x1 TB disks per node. On the "fat nodes" there are 8x1 TB disks per node in a RAID 5 configuration. The Broadwell nodes have a local solid state disk with 480 GB. This directory should also be used for the installation of software packages: the software package to be installed should be unpacked, compiled and linked in a subdirectory of $TMP, while the actual installation of the package (e.g. make install) should be made into the Lustre filesystem. Each time a batch job is started, a subdirectory is created on each node assigned to the job. The name of the subdirectory contains the job ID and the starting time, so that it is unique for each job. This unique name is then assigned to the environment variable $TMP within the job. At the end of the job the subdirectory is removed.
3.8 Access to other HPC-Filesystems
Users of several SCC HPC-Clusters and users of the bwFileStorage have the possibility to transfer data over the "data mover" nodes via the tool "rdata".
The command rdata executes the filesystem operations on special "data mover" nodes and distributes the load. Examples for the command are:
$ rdata "ls $BWFILESTORAGE/*.c"
$ rdata "cp foo $IC2WORK"
$ man rdata
shows how to use the command rdata.
3.8.1 $WORK of SCC HPC-Clusters
Users of InstitutsCluster II (ic2), ForHLR I and ForHLR II can transfer data of the $WORK filesystem to the bwUniCluster via the tool "rdata". For this purpose the environment variable $IC2WORK is set.
3.8.2 $PROJECT of the ForHLR I
Users of ForHLR I and ForHLR II can transfer data of the $PROJECT file system to the bwUniCluster via the tool "rdata".
3.9 Backup and Archiving
There are regular backups of all data in the home directories; ACLs and extended attributes, however, are not backed up or archived. With the following commands you can access the saved data:
|tsm_q_backup||shows one, multiple or all files stored in the backup device|
|tsm_restore||restores saved files|
Files of the directories $HOME and $WORK can be archived (KIT users only). With the following commands you can use the archive:
|tsm_d_archiv||deletes files from the archive|
|tsm_q_archiv||shows files in the archive|
|tsm_retrieve||retrieves archived files|
The option -h shows how to use the commands.