Running a parallel job
We now have the tools we need to run a multi-processor job.
This is a very important aspect of HPC systems, as parallelism is one of the primary ways we can improve the performance of computational tasks.
If you disconnected, log back in to the cluster.
local$ ssh user@cluster.name
Install the Amdahl Program
With the Amdahl source code on the cluster, we can install it, which will give us access to the amdahl executable.

Amdahl is Python Code
The Amdahl program is written in Python, so installing or using it requires locating the python3 executable on the login node.

Move into the extracted directory, then use the Package Installer for Python, or pip, to install it in your ("user") home directory. To do this, we'll need to make sure the required modules are loaded:

remote$ module load python/3.11
remote$ module load openmpi/2.1.1
remote$ cd amdahl
remote$ python3 -m pip install --user .
Dependencies and Clusters
Because they are shared computing environments and a potential target for attack, many high-performance clusters block the kind of automatic dependency resolution you're used to on regular machines. You can see this if you call python3 -m pip install . and it hangs for a while before reporting that it can't find setuptools (or another package). To reduce the risk of malware sneaking onto the cluster, you may have to rely on the site's 'walled garden' of resources, or install any extras manually.
If python3 -m pip install --user . fails, you have a few options for managing the dependencies:

Use a module
The easiest fix is to look for an mpi4py module already available on your cluster. Try module avail mpi4py, and load it if you find one.

Use Anaconda
Some clusters prefer you to use Anaconda, a heavier-weight package and environment manager for Python that has a vetted list of packages. Your system might have it listed as conda, miniconda or anaconda - try module avail conda to search for it:

remote$ module avail conda

------------------------------------------------- /local/modules/apps --------------------------------------------------
   anaconda/py3.10    conda/py2-latest    conda/py3-latest (D)
Conda requires you to make some modifications to your .bashrc file, then re-load it, before you can work with environments. We'll unload python and load conda instead:

remote$ module unload python
remote$ module load conda
remote$ conda init
remote$ source ~/.bashrc
Once you've done that, you can create an amdahl environment and install its prerequisites using conda instead of pip:

remote$ conda create --name amdahl
remote$ conda activate amdahl
remote$ conda install --yes --file requirements.txt
Install manually
You can try to install mpi4py from its source code, just as we're doing with amdahl. Download it and copy it across in the same way:

local$ wget -O mpi4py.tar.gz https://github.com/mpi4py/mpi4py/releases/download/3.1.4/mpi4py-3.1.4.tar.gz
local$ scp mpi4py.tar.gz user@cluster.name:
# or
local$ rsync -avP mpi4py.tar.gz user@cluster.name:
Then install it on the cluster:
remote$ tar -xvzf mpi4py.tar.gz    # extract the archive
remote$ mv mpi4py* mpi4py          # rename the directory
remote$ cd mpi4py
remote$ python3 -m pip install --user .
remote$ cd ../amdahl
remote$ module load openmpi/2.1.1
remote$ python3 -m pip install --user .
You might get some errors during the install, including Could not find "Python.h". This happens if the 'development version' of Python isn't installed - you might be able to find a python3-dev module, or you may have to use Anaconda instead.

Finally, install amdahl
Once your dependencies are sorted, you can install amdahl itself without pip consulting the Python Package Index (PyPI):

remote$ python3 -m pip install --user --no-index .
Binaries and PATHs
pip may warn that your user package binaries are not in your PATH:

WARNING: The script amdahl is installed in "${HOME}/.local/bin" which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning,
use --no-warn-script-location.
To check whether this warning is a problem, use which to search for the amdahl program:

remote$ which amdahl

If the command returns no output, displaying a new prompt, it means the file amdahl has not been found. You must update the environment variable named PATH to include the missing folder. Edit your shell configuration file as follows, then log off the cluster and back on again so the change takes effect:

remote$ nano ~/.bashrc
remote$ tail ~/.bashrc
export PATH=${PATH}:${HOME}/.local/bin
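If you would rather not log out and back in straight away, you can usually apply the change to your current session by re-reading the file, just as we did after conda init earlier:

remote$ source ~/.bashrc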
After logging back in to the cluster, which should be able to find amdahl without difficulty. If you had to load a Python module, load it again.

Help
Many command-line programs include a "help" message. Try it with amdahl:

remote$ amdahl --help

usage: amdahl [-h] [-p [PARALLEL_PROPORTION]] [-w [WORK_SECONDS]] [-t] [-e]
              [-j [JITTER_PROPORTION]]

optional arguments:
  -h, --help            show this help message and exit
  -p [PARALLEL_PROPORTION], --parallel-proportion [PARALLEL_PROPORTION]
                        Parallel proportion: a float between 0 and 1
  -w [WORK_SECONDS], --work-seconds [WORK_SECONDS]
                        Total seconds of workload: an integer greater than 0
  -t, --terse           Format output as a machine-readable object for easier analysis
  -e, --exact           Exactly match requested timing by disabling random jitter
  -j [JITTER_PROPORTION], --jitter-proportion [JITTER_PROPORTION]
                        Random jitter: a float between -1 and +1
This message doesn't tell us much about what the program does,
but it does tell us the important flags we might want to use when launching it.
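For example, based on the flags listed above, a shorter test workload that is only half parallel could be requested like this (we'll launch amdahl through the scheduler below, so there is no need to run it by hand now):

amdahl --work-seconds 10 --parallel-proportion 0.5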
Running the Job on a Compute Node
Create a submission file, requesting one task on a single node, then launch it.
remote$ nano serial-job.sh
remote$ cat serial-job.sh

#!/bin/bash
#SBATCH -J solo-job
#SBATCH -p cpubase_bycore_b1
#SBATCH -N 1
#SBATCH -n 1

# Load the computing environment we need
module load python

# Execute the task
amdahl
Dependency Variants
If you weren't able to use pip install to handle all of amdahl's dependencies, you'll need to do something slightly different in the submission script.

Using a Module
# Load the computing environment we need
module load python/3.11
module load mpi4py
Using Anaconda
# Load the computing environment we need
module load conda
conda init
conda activate amdahl
remote$ sbatch serial-job.sh
As before, use the Slurm status commands to check whether your job is running and when it ends:
remote$ squeue -u yourUsername
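If the job has already finished, it will no longer appear in squeue. On sites where Slurm job accounting is enabled, you can usually review completed jobs with sacct, for example:

remote$ sacct -u yourUsername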
Read the Job Output
Use ls to locate the output file. The -t flag sorts in reverse-chronological order: newest first.
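For example (the job ID below is a placeholder; use the number from your own output file name):

remote$ ls -t
remote$ cat slurm-<job_id>.out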
What was the output?

As we saw before, two of the amdahl program flags set the amount of work and the proportion of that work that is parallel in nature.
Based on the output, we can see that the code uses a default of 30 seconds of work that is 85% parallel.
The program ran for just over 30 seconds in total, and if we run the numbers,
it is true that 15% of it was marked 'serial' and 85% was 'parallel'.

Since we only gave the job one CPU, this job wasn't really parallel:
the same processor performed the 'serial' work for 4.5 seconds, then the 'parallel' part for 25.5 seconds, and no time was saved.
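Those durations follow directly from the defaults. Using Python as a quick calculator (the same trick we'll use again below):

remote$ python3 -c "print(0.15 * 30, 0.85 * 30)"

This should print values of roughly 4.5 and 25.5.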
The cluster can do better, if we ask.
Running the Parallel Job
The amdahl program uses the Message Passing Interface (MPI) for parallelism -- this is a common tool on HPC systems.

What is MPI?
The Message Passing Interface is a set of tools which allow multiple tasks
running simultaneously to communicate with each other.
Typically, a single executable is run multiple times, possibly on different
machines, and the MPI tools are used to inform each instance of the
executable about its sibling processes, and which instance it is.
MPI also provides tools to allow communication between instances to
coordinate work, exchange information about elements of the task, or to
transfer data.
An MPI instance typically has its own copy of all the local variables.
While MPI-aware executables can generally be run as stand-alone programs,
in order for them to run in parallel they must use an MPI run-time environment,
which is a specific implementation of the MPI standard.
To activate the MPI environment, the program should be started via a command such as mpiexec (or mpirun, or srun, etc., depending on the MPI run-time you need to use), which will ensure that the appropriate run-time support for parallelism is included.

MPI Runtime Arguments
On their own, commands such as mpiexec can take many arguments specifying how many machines will participate in the execution, and you might need these if you would like to run an MPI program on your own (for example, on your laptop).
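For instance, assuming you had an MPI runtime and the amdahl package installed on your own machine, launching four copies of the program might look something like this:

local$ mpiexec -n 4 amdahl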
In the context of a queuing system, however, it is frequently the case that the MPI run-time will obtain the necessary parameters from the queuing system, by examining the environment variables set when the job is launched.

Let's modify the job script to request more cores and use the MPI run-time:
remote$ cp serial-job.sh parallel-job.sh
remote$ nano parallel-job.sh
remote$ cat parallel-job.sh

#!/bin/bash
#SBATCH -J parallel-job
#SBATCH -N 1
#SBATCH -n 4

# Load the computing environment we need
module load python/3.11

# Execute the task
mpiexec amdahl
Dependency Variants: Using Anaconda
If you had to install your code using conda, then as part of installing mpi4py it will have built its own version of mpiexec, which lives in your conda environment directory. Try which mpiexec to see what your default version of mpiexec is. If the path it shows doesn't include .conda/envs, you'll have to call the correct version explicitly in your script:

# Execute the task
~/.conda/envs/amdahl/bin/mpiexec amdahl
Then submit your job. Note that the submission command has not really changed from how we submitted the serial job:
all the parallel settings are in the batch file rather than the command line.
remote$ sbatch parallel-job.sh
As before, use the status commands to check when your job runs.
remote$ ls -t
slurm-347178.out parallel-job.sh slurm-347087.out serial-job.sh amdahl README.md LICENSE.txt
remote$ cat slurm-347178.out
Doing 30.000 seconds of 'work' on 4 processors,
which should take 10.875 seconds with 0.850 parallel proportion of the workload.

Hello, World! I am process 0 of 4 on <<node1>>. I will do all the serial 'work' for 4.500 seconds.
Hello, World! I am process 2 of 4 on <<node1>>. I will do parallel 'work' for 6.375 seconds.
Hello, World! I am process 1 of 4 on <<node1>>. I will do parallel 'work' for 6.375 seconds.
Hello, World! I am process 3 of 4 on <<node1>>. I will do parallel 'work' for 6.375 seconds.
Hello, World! I am process 0 of 4 on <<node1>>. I will do parallel 'work' for 6.375 seconds.

Total execution time (according to rank 0): 10.888 seconds
Is it 4× faster?
The parallel job received 4× more processors than the serial job:
does that mean it finished in ¼ the time?
How Much Does Parallel Execution Improve Performance?
In theory, dividing up a perfectly parallel calculation among n MPI processes
should produce a decrease in total run time by a factor of n.
As we have just seen, real programs need some time for the MPI processes to
communicate and coordinate, and some types of calculations can't be subdivided:
they only run effectively on a single CPU.
Additionally, if the MPI processes operate on different physical CPUs in the
computer, or across multiple compute nodes, even more time is required for
communication than it takes when all processes operate on a single CPU.
In practice, it's common to evaluate the parallelism of an MPI program by
- running the program across a range of CPU counts,
- recording the execution time on each run,
- comparing each execution time to the time when using a single CPU.
Since "more is better" - improvement is easier to interpret from increases in
some quantity than decreases - comparisons are made using the speedup factor
S, which is calculated as the single-CPU execution time divided by the multi-CPU
execution time. For a perfectly parallel program, a plot of the speedup S
versus the number of CPUs n would give a straight line, S = n.
Let's run one more job, so we can see how close to a straight line our amdahl code gets.

remote$ nano parallel-job.sh
remote$ cat parallel-job.sh

#!/bin/bash
#SBATCH -J parallel-job
#SBATCH -N 1
#SBATCH -n 8

# Load the computing environment we need
module load python/3.11

# Execute the task
mpiexec amdahl
Dependency Variants
As before, you'll need to modify this script if you used a module or Anaconda.
Then submit your job.
Note that the submission command has not really changed from how we submitted the serial job:
all the parallel settings are in the batch file rather than the command line.
remote$ sbatch parallel-job.sh
As before, use the status commands to check when your job runs.
remote$ ls -t
slurm-347271.out parallel-job.sh slurm-347178.out slurm-347087.out serial-job.sh amdahl README.md LICENSE.txt
remote$ cat slurm-347271.out
Doing 30.000 seconds of 'work' on 8 processors,
which should take 7.688 seconds with 0.850 parallel proportion of the workload.

Hello, World! I am process 4 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.
Hello, World! I am process 0 of 8 on <<node1>>. I will do all the serial 'work' for 4.500 seconds.
Hello, World! I am process 2 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.
Hello, World! I am process 1 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.
Hello, World! I am process 3 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.
Hello, World! I am process 5 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.
Hello, World! I am process 6 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.
Hello, World! I am process 7 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.
Hello, World! I am process 0 of 8 on <<node1>>. I will do parallel 'work' for 3.188 seconds.

Total execution time (according to rank 0): 7.697 seconds
Non-Linear Output
When we ran the job with 4 parallel workers, the process doing the serial work wrote its output first, then the parallel processes wrote theirs, with process 0 coming in first and last.
With 8 workers, this is not the case: since the parallel workers take less time than the serial work, it is hard to say which process will write its output first, except that it will not be process 0!
Now, let's summarize the amount of time it took each job to run:
| Number of CPUs | Runtime (sec) |
| --- | --- |
| 1 | 30.033 |
| 4 | 10.888 |
| 8 | 7.697 |
Then, use the first row to compute speedups S, using Python as a command-line calculator:
remote$ for n in 30.033 10.888 7.697; do python3 -c "print(30.033 / $n)"; done
| Number of CPUs | Speedup | Ideal |
| --- | --- | --- |
| 1 | 1.0 | 1 |
| 4 | 2.75 | 4 |
| 8 | 3.90 | 8 |
The job output files have been telling us that this program is performing 85% of its work in parallel, leaving 15% to run in serial.
This seems reasonably high, but our quick study of speedup shows that in order to get a 4× speedup, we have to use 8 or 9 processors in parallel.
In real programs, the speedup factor is influenced by
- CPU design
- communication network between compute nodes
- MPI library implementations
- details of the MPI program itself
Using Amdahl's Law, you can prove that with this program, it is impossible to reach 8× speedup, no matter how many processors you have on hand.
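As a quick check using the same Python-as-calculator trick as above: Amdahl's Law predicts a speedup of 1 / ((1 - p) + p / n) for a parallel fraction p on n processors, which approaches 1 / (1 - p) as n grows.

remote$ python3 -c "print(1 / (0.15 + 0.85 / 8))"   # predicted speedup on 8 CPUs, about 3.9
remote$ python3 -c "print(1 / 0.15)"                # upper limit as n grows, about 6.7

With 85% of the work parallel, the speedup can never exceed roughly 6.7, so an 8× speedup is indeed out of reach.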
Details of that analysis, with results to back it up, are left for the next section of the material, Scalability Profiling.