Synchronisation and Race Conditions

In the previous episode, we saw how to use parallel regions, and the parallel for shortcut, to split work across multiple threads. In this episode, we will learn how to synchronise threads and how to avoid the data inconsistencies caused by unsynchronised threads.
We've seen just how easy it is to write parallel code using OpenMP, but now we need to make sure that the code we're writing is both efficient and correct. To do that, we need some additional knowledge about thread synchronisation and race conditions. In the context of parallel computing, thread (or rank, in the case of MPI) synchronisation plays a crucial role in guaranteeing the correctness of our program, particularly with regard to data consistency and integrity.

What is code correctness?

Code correctness in parallel programming is the guarantee that a program operates as expected in multi-threaded and multi-process environments, providing both consistent and valid results. This usually comes down to how a parallel algorithm handles data access and modification, minimising the chance of data inconsistencies creeping in.
So what is thread synchronisation? Thread synchronisation is the coordination of threads, usually done to avoid conflicts caused by multiple threads accessing the same piece of shared data. It's important to synchronise threads properly so they are able to work together without interfering with data another thread is accessing or modifying. Synchronisation also matters where there are data dependencies, to make sure a thread has access to the data it requires; this is particularly important for iterative algorithms.
The synchronisation mechanisms in OpenMP are incredibly important tools, as they are used to avoid race conditions. Race conditions have to be avoided, as any that creep into our program can result in data inconsistencies. A race condition happens when two or more threads access and modify the same piece of data at the same time. To illustrate this, consider the diagram below:
Race conditions
Two threads access the same shared variable (with the value 0) and increment it by 1. Intuitively, we might expect the final value of the shared variable to be 2. However, due to the potential for concurrent access and modification, we can't actually guarantee what it will be. If both threads access and modify the variable concurrently, the final value will be 1. That's because both threads read the initial value of 0, increment it by 1, and write the result back to the shared variable.
In this case, it doesn't matter whether the writes themselves happen concurrently; the inconsistency stems from the value each thread initially read. If, on the other hand, one thread manages to access and modify the variable before the other thread reads it, then we get the value we expect (2). For example, if thread 0 increments the variable before thread 1 reads it, then thread 1 reads a value of 1 and increments it to 2, giving us the correct result. This is why it's called a race condition: threads race each other to access and modify data before another thread can!

Analogy: Editing a document

Imagine two people trying to update the same document at the same time. If they don't communicate what they're doing, they might edit the same part of the document and overwrite each other's changes, ending up with a messy and inconsistent document (e.g. when they come to merge changes later). This is just like what happens with a race condition in OpenMP: different threads access and modify the same part of memory, resulting in messy, inconsistent memory operations and, most likely, an incorrect result.

Identifying race conditions

Take a look at the following code example. What's the output when you compile and run this program? Where do you think the race condition is?
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_TIMES 10000

int main(void)
{
    int value = 0;

    #pragma omp parallel for
    for (int i = 0; i < NUM_TIMES; ++i) {
        value += 1;
    }

    printf("The final value is: %d\n", value);

    return EXIT_SUCCESS;
}

Synchronisation mechanisms

Synchronisation in OpenMP is all about coordinating the execution of threads, especially when there is data dependency in your program or when uncoordinated data access will result in a race condition. The synchronisation mechanisms in OpenMP allow us to control the order of access to shared data, coordinate data dependencies (e.g. waiting for calculations to be complete) and tasks (if one task needs to be done before other tasks can continue), and to potentially limit access to tasks or data to certain threads.

Barriers

Barriers are the most basic synchronisation mechanism. They are used to create a waiting point in our program. When a thread reaches a barrier, it waits until all other threads have reached the same barrier before continuing. To add a barrier, we use the #pragma omp barrier directive. In the example below, we have used a barrier to synchronise threads such that they don't start the main calculation of the program until a look-up table has been initialised (in parallel), as the calculation depends on this data (See the full code here).
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();

    /* The initialisation of the look-up table is done in parallel */
    initialise_lookup_table(thread_id);

    /* As all threads depend on the table, we have to wait until all threads
       are done and have reached the barrier */
    #pragma omp barrier

    /* Each thread then proceeds to its main calculation */
    do_main_calculation(thread_id);
}
Similarly, in iterative tasks like matrix calculations, barriers help coordinate threads so that all updates are completed before moving to the next step. For example, in the following snippet, each thread updates its assigned row of a matrix using data from its current row and the row above (except for the first row, which has no dependency; see the full code here). A barrier ensures that all threads finish updating their rows in next_matrix before the values are copied back into current_matrix. Without this barrier, threads might read outdated or partially updated data, causing inconsistencies.
Here, the number of rows (nx) is dynamically determined at runtime using omp_get_max_threads(). This function provides the maximum number of threads OpenMP can use in a parallel region, based on the system's resources and runtime configuration. Using this value, we define the number of rows in the matrix, with each row corresponding to a potential thread. This setup ensures that both the current_matrix and next_matrix provide rows for the maximum number of threads allocated during parallel execution.
...... int main() { int nx = omp_get_max_threads(); double current_matrix[nx][NY]; double next_matrix[nx][NY]; /* Initialise the current matrix */ initialise_matrix(current_matrix, nx); for (int iter = 0; iter < NUM_ITERATIONS; ++iter) { #pragma omp parallel { int thread_id = omp_get_thread_num(); /* Update next_matrix based on current_matrix */ iterate_matrix_solution(current_matrix, next_matrix, thread_id, nx); #pragma omp barrier /* Synchronise all threads before copying */ } /* Copy the next_matrix into current_matrix for the next iteration */ copy_matrix(next_matrix, current_matrix, nx); } /* Print the final matrix */ print_matrix("Final Matrix", current_matrix, nx); }
OpenMP does not allow barriers to be placed directly inside #pragma omp parallel for loops due to restrictions on closely nested regions. To coordinate threads effectively in iterative tasks like this, we use a #pragma omp parallel construct, which gives explicit control over the loop and allows proper barrier placement.
Barriers introduce additional overhead into our parallel algorithms, as some threads will be idle whilst waiting for other threads to catch up. There is no way around this synchronisation overhead, so we need to be careful not to overuse barriers or have an uneven amount of work between threads. This overhead increases with the number of threads in use, and becomes even worse when the workload is uneven, hurting the parallel scalability of our program.
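To get a feel for this overhead, the sketch below (a hypothetical illustration, not part of the lesson's examples) gives each thread a deliberately uneven amount of simulated work and uses omp_get_wtime() to measure how long each thread sits idle at a barrier:

#include <omp.h>
#include <stdio.h>
#include <unistd.h> /* usleep() is POSIX, used here only to simulate work */

int main(void)
{
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();

        /* Simulate an uneven workload: thread 0 "works" for 0.1 s, thread 1 for 0.2 s, ... */
        usleep((thread_id + 1) * 100000);

        double arrived = omp_get_wtime();
        #pragma omp barrier
        double released = omp_get_wtime();

        /* Threads with less work spend the difference idling at the barrier */
        printf("Thread %d waited %.2f s at the barrier\n", thread_id, released - arrived);
    }

    return 0;
}

The earlier a thread arrives at the barrier, the longer it waits; in a real application all of that idle time is wasted compute.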

Blocking thread execution and nowait

Most worksharing constructs in OpenMP will synchronise threads at the end of the construct with an implicit barrier. For example, consider a parallel for loop: if one thread finishes its work before the others, it doesn't carry on to the next bit of code. It is forced to wait for the other threads to finish, because OpenMP synchronises them at the end of the loop.
This isn't ideal if the next bit of work is independent of the work just finished. To avoid wasting CPU time on this synchronisation, we can use the nowait clause, which removes the implicit barrier and allows a "finished" thread to continue straight on to its next chunk of work. In the example below, a nowait clause is used with a parallel for.
#pragma omp parallel
{
    #pragma omp for nowait /* with nowait the loop executes as normal, but... */
    for (int i = 0; i < NUM_ITERATIONS; ++i) {
        parallel_function();
    }

    /* ...if a thread finishes its work in the loop, then it can move on
       immediately to this function without waiting for the other threads
       to finish */
    next_function();
}

Synchronisation regions

A common challenge in shared memory programming is coordinating threads to prevent multiple threads from concurrently modifying the same piece of data. One mechanism in OpenMP to coordinate thread access is the synchronisation region, which is used to prevent multiple threads from executing the same piece of code at the same time. When a thread reaches one of these regions, it queues up and waits its turn to access the data and execute the code within the region. Below are the types of synchronisation region in OpenMP and their directives.
critical (#pragma omp critical)
Only one thread is allowed in the critical region at a time; threads have to queue up to take their turn. When a thread has finished in the critical region, it immediately proceeds to the next chunk of code (outside the critical region) without having to wait for other threads to finish.

single (#pragma omp single)
Single regions are used for code which needs to be executed by only one thread, such as I/O operations. The first thread to reach the region executes the code, whilst the other threads behave as if they've reached a barrier until the executing thread is finished.

master (#pragma omp master)
A master region is identical to a single region, except that execution is always done by the designated master thread (thread 0) and there is no implied barrier at the end of the region.
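To see how these three directives look in code, here is a minimal, self-contained sketch (not taken from the lesson's example code) that uses each of them within a single parallel region:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0;

    #pragma omp parallel
    {
        /* critical: every thread executes this block, but only one at a time */
        #pragma omp critical
        {
            counter += 1;
        }

        /* single: exactly one thread (whichever arrives first) executes this block;
           the others wait at the implicit barrier at the end of the region */
        #pragma omp single
        {
            printf("Counter so far: %d\n", counter);
        }

        /* master: only thread 0 executes this block, and there is no implied barrier */
        #pragma omp master
        {
            printf("The master thread is thread %d\n", omp_get_thread_num());
        }
    }

    printf("Final counter: %d\n", counter);

    return 0;
}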
The next example builds on the previous example which included a lookup table. In the modified code, the lookup table is written to disk after it has been initialised. This happens in a single region, as only one thread needs to write the result to disk (See the full code here).
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();

    initialise_lookup_table(thread_id);

    /* Add a barrier to ensure the lookup table is ready to be written to disk */
    #pragma omp barrier

    /* We don't want multiple threads trying to write to file -- this could
       also be a master region */
    #pragma omp single
    {
        write_table_to_disk();
    }

    do_main_calculation(thread_id);
}
OpenMP has a restriction: you cannot use #pragma omp single or #pragma omp master directly inside a #pragma omp parallel for loop. If you attempt this, you'll encounter an error because OpenMP does not allow these regions to be "closely nested" within a parallel loop. However, there's a useful workaround: move the single or master region into a separate function and call that function from within the loop. This works because the region is then no longer explicitly part of the loop structure.
If we wanted to sum up something in parallel (e.g. a reduction operation like summing an array), we would need to use a critical region to prevent a race condition when threads update the reduction variable (the shared variable that stores the final result). In the 'Identifying race conditions' challenge earlier, we saw that multiple threads updating the same variable (value) at the same time caused inconsistencies: a classic race condition. This problem can be fixed by using a critical region, which allows threads to update value one at a time. For example:
int value = 0;

#pragma omp parallel for
for (int i = 0; i < NUM_TIMES; ++i) {
    #pragma omp critical /* Now only one thread at a time can read and modify `value` */
    {
        value += 1;
    }
}
However, while this approach eliminates the race condition, it introduces synchronisation overhead. For lightweight operations like summing values, this overhead can outweigh the benefits of parallelisation.

Reduction Clauses

A more efficient way to handle tasks like summing values is to use OpenMP's reduction clause. Unlike the critical region approach, the reduction clause avoids explicit synchronisation by allowing each thread to work on its own private copy of the variable. Once the loop finishes, OpenMP combines these private copies into a single result. This not only simplifies the code but also avoids delays caused by threads waiting to access the shared variable.
For example, instead of using a critical region to sum values, we can rewrite the code with a reduction clause as shown below:
#include <omp.h>
#include <stdio.h>

#define NUM_TIMES 10000

int main()
{
    int value = 0;

    #pragma omp parallel for reduction(+:value)
    for (int i = 0; i < NUM_TIMES; ++i) {
        value += 1;
    }

    printf("Final value: %d\n", value);

    return 0;
}
Here, the reduction(+:value) clause does the work for us. During the loop, each thread maintains its own private copy of value, avoiding any chance of a race condition. When the loop ends, OpenMP automatically sums the private copies into the shared variable value. This means the output will always be correct: in this case, 10000.
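The + operator is not the only one available: OpenMP also supports reduction operators such as *, max and min (max and min are available for C from OpenMP 3.1 onwards). As an illustration (this sketch is not part of the lesson's code), the loop below combines two reductions over some arbitrary test data:

#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    int values[N];
    for (int i = 0; i < N; ++i) {
        values[i] = (i * 37) % 101; /* arbitrary test data */
    }

    int sum = 0;
    int maximum = 0;

    /* Each thread keeps private copies of sum and maximum, which OpenMP
       combines (with + and max respectively) once the loop finishes */
    #pragma omp parallel for reduction(+:sum) reduction(max:maximum)
    for (int i = 0; i < N; ++i) {
        sum += values[i];
        if (values[i] > maximum) {
            maximum = values[i];
        }
    }

    printf("Sum: %d, Maximum: %d\n", sum, maximum);

    return 0;
}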

Reporting progress

The code below attempts to track the progress of a parallel loop using a shared counter, progress. However, it has a problem: the final value of progress is often incorrect, and the progress updates might be inconsistent.
  1. Can you identify the issue with the current implementation?
  2. How would you fix it to ensure the progress counter works correctly and updates are synchronised?
  3. After fixing the issue, experiment with different loop schedulers (static, dynamic, guided and auto) to observe how they affect progress reporting.
    • What changes do you notice in the timing and sequence of updates when using these schedulers?
    • Which scheduler produces the most predictable progress updates?
#include <math.h>
#include <omp.h>
#include <stdio.h>

#define NUM_ELEMENTS 10000

int main(int argc, char **argv)
{
    int array[NUM_ELEMENTS] = {0};
    int progress = 0;

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NUM_ELEMENTS; ++i) {
        array[i] = log(i) * cos(3.142 * i);

        if (i % (NUM_ELEMENTS / 10) == 0) {
            printf("Progress: %d%%\n", (i * 100) / NUM_ELEMENTS);
        }

        progress++;
    }

    printf("Final progress: %d (Expected: %d)\n", progress, NUM_ELEMENTS);

    return 0;
}
NB: To compile this you’ll need to add -lm to inform the linker to link to the math C library, e.g.
gcc counter.c -o counter -fopenmp -lm

Preventing race conditions

A large amount of the time spent writing a parallel OpenMP application is usually spent preventing race conditions, rather than on the parallelisation itself. Earlier in the episode, we looked at critical regions as a way to synchronise threads and explored how they can be used to prevent race conditions in the previous exercise. In the rest of this section, we will look at the other mechanisms which can prevent race conditions, such as setting locks or using atomic operations.

Locks

Critical regions provide a convenient and straightforward way to synchronise threads and guard data access to prevent race conditions. But in some cases, critical regions may not be flexible or granular enough, leading to an excessive amount of serialisation. If this is the case, we can use locks instead to achieve the same effect as a critical region. Locks are a mechanism in OpenMP which, just like critical regions, create regions in our code that only one thread can be in at a time. The main advantage of locks, compared to critical regions, is that they provide more granular control over thread synchronisation by protecting differently sized or fragmented regions of code, giving us greater flexibility. Locks also make it easier to keep our code modular, as it is possible to nest locks and to use them to guard access to global variables wherever they are modified.
In comparison to critical regions, however, locks are more complicated and difficult to use. Instead of using a single #pragma, we have to initialise and free resources used for the locks, as well as set and unset where locks are in effect. If we make a mistake and forget to unset a lock, then we lose all the parallelism and could potentially create a deadlock!
To create a lock and delete a lock, we use omp_init_lock() and omp_destroy_lock() respectively.
omp_lock_t lock; /* Locks are tracked via a lock variable, sometimes you'll create a lock for large regions of code or sometimes locks for individual variable updates */ omp_init_lock(&lock); /* Allocate resources for a lock using omp_init_lock */ /* The rest of our parallel algorithm goes here */ omp_destroy_lock(&lock); /* Deallocate resources for a lock */
To set and unset a lock, we use the omp_set_lock() and omp_unset_lock() functions.
omp_set_lock(&lock); /* When a thread reaches this function, it will only return from it when it can progress into the lock region */ shared_variable += 1; omp_unset_lock(&lock); /* By unsetting a lock, we signal that the next thread can start */
Putting it all together, using a lock should look something like the example below.
#include <omp.h>

omp_lock_t lock; /* In practice, you should use a better name for your lock */

omp_init_lock(&lock); /* Allocate resources for the lock, before the parallel region */

int shared_variable = 0;

#pragma omp parallel
{
    omp_set_lock(&lock);   /* Set the lock, so only one thread can update shared_variable at once */
    shared_variable += 1;
    omp_unset_lock(&lock); /* Remember to unset the lock, to signal the next thread can enter the lock region */
}

omp_destroy_lock(&lock); /* Deallocate lock resources */
To recap, the main advantage of locks is the increased flexibility and granularity they give us for preventing race conditions. The main disadvantage is the additional code complexity and the potential for deadlocks and poor parallel performance if we forget to unset a lock, or unset it in the wrong place.
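One example of that extra flexibility is omp_test_lock(), which tries to set a lock without blocking: if another thread already holds the lock, the function returns 0 immediately and the thread can do something else before trying again. The sketch below is a hypothetical illustration of this pattern, not code from the lesson:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_lock_t lock;
    omp_init_lock(&lock);

    int shared_counter = 0;

    #pragma omp parallel
    {
        int other_work = 0;

        /* Keep trying to acquire the lock; while it's unavailable, do some
           independent work instead of sitting idle */
        while (!omp_test_lock(&lock)) {
            other_work += 1; /* stand-in for useful work that doesn't need the lock */
        }

        /* We now hold the lock, so it's safe to update the shared counter */
        shared_counter += 1;
        omp_unset_lock(&lock);

        printf("Thread %d did %d units of other work while waiting\n",
               omp_get_thread_num(), other_work);
    }

    omp_destroy_lock(&lock);

    printf("Final counter: %d\n", shared_counter);

    return 0;
}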

Atomic operations

Another mechanism is atomic operations. In computing, an atomic operation is an operation which is performed without interruption, meaning that once initiated, it is guaranteed to execute without interference from other operations. In OpenMP, this refers to operations that execute without interference from other threads. If we make an operation modifying a value in an array atomic, the compiler guarantees that no other thread can read or modify that value until the atomic operation is finished. You can think of it as a thread having temporary exclusive access to something in our program, similar to a 'one at a time' rule for accessing and modifying parts of the program.
To perform an atomic operation, we put the #pragma omp atomic directive before the operation we want to make atomic.
#include <stdio.h>
#include <omp.h>

int main()
{
    int shared_variable = 0;
    int shared_array[4] = {0, 0, 0, 0};

    /* Put the pragma before the shared variable update */
    #pragma omp parallel
    {
        #pragma omp atomic
        shared_variable += 1;

        printf("Shared variable updated: %d\n", shared_variable);
    }

    /* Can also be used in a parallel for */
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i) {
        #pragma omp atomic
        shared_array[i] += 1;

        printf("Shared array element %d updated: %d\n", i, shared_array[i]);
    }
}
Atomic operations are meant for single statements or simple updates. As in the example above, we can use an atomic operation when updating a variable, but we can also perform other operations atomically, such as reads and writes. Atomic operations are often less expensive than critical regions or locks, so they should be preferred when they can be used. However, it's still important not to be over-zealous with atomic operations, as they still introduce synchronisation overhead which can damage parallel performance.
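OpenMP distinguishes between a few forms of atomic operation (update, read, write and capture). The short sketch below (a hypothetical illustration, not from the lesson) shows an atomic update alongside an atomic read, which takes a consistent snapshot of a shared variable:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int shared_value = 0;

    #pragma omp parallel
    {
        int snapshot; /* private to each thread */

        /* Atomic update: threads increment shared_value one at a time */
        #pragma omp atomic update
        shared_value += 1;

        /* Atomic read: take a consistent snapshot of shared_value, even if
           another thread is updating it at the same time */
        #pragma omp atomic read
        snapshot = shared_value;

        printf("Thread %d saw a value of at least %d\n", omp_get_thread_num(), snapshot);
    }

    printf("Final value: %d\n", shared_value);

    return 0;
}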

When should I prefer to use a critical region? Or an atomic operation, or a lock?

There are three mechanisms we can use to prevent race conditions: critical regions, locks and atomic operations. The question then is: when should we use which mechanism? The choice depends mainly on the specific requirements of your algorithm, and is also found partly through trial and error.
Critical regions and locks are more appropriate when:
  • You have some complex code which needs thread synchronisation, possibly with a high level of granularity.
  • When there is high contention, such as when multiple threads will frequently be accessing the same shared data.
  • There is some degree of error handling or more advanced synchronisation patterns.
Atomic operations are good when:
  • The operation which needs synchronisation is simple, such as needing to protect a single variable update in the parallel algorithm.
  • There is low contention for shared data.
  • When you need to be as performant as possible, as atomic operations generally have the lowest performance cost.
When comparing critical regions and locks, it is often better to use a critical region instead of a lock due to the simplicity of using a critical region.

Remove the race condition

In the following program, an array of values is created and then summed together using a parallel for loop.
#include <math.h>
#include <omp.h>
#include <stdio.h>

#define ARRAY_SIZE 524288

int main(int argc, char **argv)
{
    float sum = 0;
    float array[ARRAY_SIZE];

    omp_set_num_threads(4);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ARRAY_SIZE; ++i) {
        array[i] = cos(M_PI * i);
    }

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < ARRAY_SIZE; i++) {
        sum += array[i];
    }

    printf("Sum: %f\n", sum);

    return 0;
}
When we run the program multiple times, we expect the output sum to have the value of 0.000000. However, due to an existing race condition, the program can sometimes produce wrong output in different runs, as shown below:
1. Sum: 1.000000
2. Sum: -1.000000
3. Sum: 2.000000
4. Sum: 0.000000
5. Sum: 2.000000
Find and fix the race condition in the program using both an atomic operation and locks.