Skip to main content
How to set up a minimal Slurm cluster
"SLURM Multinode on AWS" tutorial by OpenFlightHPC

"SLURM Multinode on AWS" tutorial by OpenFlightHPC

Creative Commons License

How to set up a minimal Slurm cluster

This is not a requirement for the High Performance Computing theme. It is a tutorial for teachers & course runners to set up a training environment - if that is not relevant to you then please ignore!
The following is a tutorial for setting up a minimal Slurm cluster, using AWS and the Flight Solo image from OpenFlightHPC, for trainees to use in the rest of the High Performance Computing (HPC) theme. In particular, this is aimed at getting trainees an environment for running the Intro to HPC course and should be followed by the trainers before the course is taught if the trainees have no other access to an HPC environment on which to run simple commands.
For this task we will be using Flight Solo, an open source image for setting up an HPC environment for research and scientific computing, including SLURM and HPC package management using spack (among other things). We will be setting up a minimal training cluster consisting of:
  • 1 login node
  • 2 compute nodes
with SLURM and OpenMPI set up to run jobs on the 2 compute nodes. As we will not be using this for serious computation, these can all run on pretty small machines within Amazon's EC2 System – in our case we opted for T3.medium as it was the smallest available in our region.

Spinning up the Nodes

Fortunately, Flight Solo comes with good tutorials for setting up the image on AWS (and other cloud platforms) with detailed, step-by-step instructions for getting the machines spun up. Therefore, our first advice is to follow the tutorial instructions.
If that works, great! You can carry on to the next step: setting up packages with spack. However, if, like for us, this did not work first time, you can try the following modified steps. During the first stage, i.e. "Launch Login Node", we instead did this via the EC2 console panes (see step f in "Launch Compute Nodes") and added the following YAML config under advanced-details.user-data:
#cloud-config write_files: - content: | SHAREPUBKEY=true path: /opt/flight/cloudinit.in permissions: '0600' owner: root:root
This is basically just configuring Flight Solo by setting SHAREPUBKEY to true in a config file, which shares the public key from the login key pair to the compute nodes and allows them to be found and configured by Flight. Note that we also created the login node with additional storage space (30GB, rather than 10GB) as we were struggling with getting big packages installed otherwise.
At this point we should be able to SSH into the login node with the key pair generated on the AWS interface and the public IP address of the login node EC2 instance, i.e.
ssh -i path/to/keyfile.pem flight@$PUBLIC_IP_ADDRESS
This should bring you to a login page where you can start configuring the cluster with Flight.
You can then follow the instructions the rest of the way, including leaving most of the configuration set up during flight profile configure as the default. What we did specifically:
  • Cluster type: OpenFlight Slurm Multinode
  • Cluster name: slurm-multinode (you can do what you want here)
  • Setup Multi User Environment with IPA?: none
  • Local user login: flight
  • Set local user password to: (we left this as default)
  • IP or FQDN for Web Access: ec2-13-43-90-160.eu-west-2.compute.amazonaws.com (left as default)
  • IP Range of Compute Nodes: 172.31.32.0/20
Note that the IP or FQDN for Web Access was left as the default as we didn't try configuring Flight's web interface, the IP range of the compute nodes was calculated automatically so there was no need to change it, and the password was left as the default but changed after successfully applying profiles.
Profiles were then applied as specified in the Flight Tutorial.

Spack and Modules

Flight has some environments available for installing system-level packages, we opted for spack for no particular reason other than it is well-regarded, and one of our requirements is for the cluster to have modules, which is easily achievable with spack. First, though, you have to create a global spack flight-environment with
flight env create -g spack
We recommend -g (global) so every other user can access the installed modules but not install their own. This will need to be done as the root user, though, which you can escalate to while logged in as the user flight with sudo -s. Once finished, you can then activate the spack environment with:
flight env activate spack
After which your regular spack commands should work. More info on the spack flight-environment can be found in the Flight docs.
To get the module files working, you can follow the instructions on the spack docs, but to summarise what we did:
  1. Enable tcl module files
    spack config add "modules:default:enable:[tcl]"
  2. Install lmod with spack
    spack install lmod
  3. Make the module tool available to the current shell
    . $(spack location -i lmod)/lmod/lmod/init/bash
  4. Install and add a new compiler
    spack install gcc@12.3.0 spack compiler add
  5. Install any new modules with this new compiler
    spack install {module_name} %gcc@12.3.0
The compiler part isn't strictly necessary, so it can be ignored, but it does make formatting the module list a bit more straightforward so we still recommend it. We also found that the gcc@11 that came pre-installed on Flight didn't have Fortran compilers installed, so a fresh compiler install was necessary for MPI, though your mileage may vary.

Formatting the module list

This then gives us a large list of all of the dependencies spack downloaded and installed for each of these newly installed modules, which we can leave be if you like, or we can configure down to a nice, minimalist list. To get around this we added a config file, following the advice of the aforementioned tutorial, to $SPACK_ROOT/etc/spack/modules.yaml containing the following:
modules: default: tcl: hash_length: 0 include: - gcc {PACKAGE_LIST} exclude: - '%gcc@11' - '%gcc@12' all: conflict: - '{name}' projections: all: '{name}/{version}'
Here you'll need to replace {PACKAGE_LIST} with the YAML-formatted list of packages you specifically want to include (to see the full list of packages spack has installed, simply run spack find). After creating/editing this file you'll have to run
spack module tcl refresh --delete-tree -y
This is needed for the changes to be reflected in the list of available modules. This will only show specific packages (and dependencies) you installed with spack and specified in the include section. This can be a little limiting if you're installing new packages frequently, so the {list of packages} and "%gcc@12" can be removed for all of the spack packages and dependencies to be included in the module avail command.

Some sysadmin

The above approach doesn't persist after leaving the shell instance, so we put the following into /etc/profile:
flight env activate spack . $(spack location -i lmod)/lmod/lmod/init/bash flight env deactivate
This leaves the paths in place for any user to be able to call module commands (e.g. module avail) but not install new spack packages. You might be able to copy the output of $(spack location -i lmod) and hardcode it to avoid having to activate the flight environment.
One final bit of sysadmin involved activating password authentication for sshd so that users could log in with a password and then add their own SSH file, as per the course. This just involves uncommenting the line PasswordAuthentication yes in /etc/ssh/sshd_config, removing any overriding references to this option in /etc/ssh/sshd_config.d, and then restarting the sshd service with
sudo systemctl restart sshd
After which you should be able to log in to the login node with just a password.

OpenMPI Installation

The training material also requires the use of srun, mpirun, and mpiexec, for which some installation of MPI is required. We went for OpenMPI, which we had to install with
spack install openmpi ~legacylaunchers schedulers=slurm
Here schedulers=slurm tells it to compile with Slurm compatibility and ~legacylaunchers tells it not to delete the mpirun and mpiexec binaries. There are good reasons to delete them for a proper, production install, but for our training purposes having them is preferable.

Summary

At this point we should now have a working SLURM cluster on AWS which we can ssh into, submit jobs on with sbatch, and generally treat like a proper HPC environment. Feel free at this point to take it for a spin – we ran through the HPC introduction course but you may wish to try something a bit more involved.