Skip to main content

Software Architecture and Design

Procedural Program...

Arrays [python]

Technology and Tooling

Packaging and Depe...

Making Packages [python, setuptools]

Packaging Python Projects
"Python Packaging" course developed by Thibault Lestang and the Oxford Research  Software Engineering group

"Python Packaging" course developed by Thibault Lestang and the Oxford Research Software Engineering group

Creative Commons License
This course material was developed as part of UNIVERSE-HPC, which is funded through the SPF ExCALIBUR programme under grant number EP/W035731/1

This course material was developed as part of UNIVERSE-HPC, which is funded through the SPF ExCALIBUR programme under grant number EP/W035731/1

Creative Commons License

Packaging Python Projects

In this workshop, you are going to learn how to organise your Python software into packages. Doing so, you will be able to
  • Have your software clearly organised in a way that is standard among Python developpers, making your software easier to understand, test and debug.
  • Reuse your code across your research projects and analyses. No more copying and pasting blocks of code around: implement and test things once.
  • Easily share your software, making everybody (including yourself) able to pip install your package!
The plan is the following: we are going to start from a couple of rather messy python scripts and gradually transform them into a full-blown python package. At the end of this workshop, you'll know:
  • What is a Python package and how to create one (and why!).
  • How to share your packages across several of your projects.
  • Maintain packages independantly from your research projects and analyses.
  • What are virtual environments and how to use them to install different versions of a package for different analyses.
  • How to share your package on the Python Package Index (PyPI), effectively making it straightforward for anyone to install your package with the pip package manager (and much more!).
  • Where to go next.
Sounds interesting? Good! Get a cup of your favorite beverage and let's get started.

Materials for this course

This course assumes that you have a local copy of the materials repository. To make it, you can simply clone the repository using git:
git clone https://github.com/OxfordRSE/python-packaging-course
For non-git users, you can visit https://github.com/OxfordRSE/python-packaging-course and download the materials as a ZIP archive ("code" green button on the top right corner).

Two scripts to analyse a timeseries

Our starting point for this workshop is the script analysis.py. You'll find it in the course/initial_scripts/ directory at the root of the repository.
This script perform operations on a timeseries dataset, located in course/initial_scripts/data/brownian.csv and describes the (simulated) one-dimensional movement of a particle undergoing brownian motion.
0.0,-0.2709970143466439 0.1,-0.5901672546059646 0.2,-0.3330040851951451 0.3,-0.6087488066987489 0.4,-0.40381970872171624 0.5,-1.0618436436553174 ...
The first column contains the various times when the particle's position was recorded, and the second column the corresponding position.
Let's have a quick overview of these scripts, but don't try to understand the details, it is irrelevant to the present workshop. Instead, let's briefly describe their structure.

Overview of analysis.py

After reading the timeseries from the file brownian.csv, this script base.py does three things:
  • It computes the average value of the particle's position over time and the standard deviation, which gives a measure of the spread around the average value.
  • It plots the particle's position as a function of time from the initial time until 50 time units.
  • Lastly, it computes and plots the histogram of the particle's position over the entirety of the timeseries. In addition, the theoritical histogram is computed and drawn as a continuous line on top of the measured histogram. For this, a function get_theoritical_histogram is defined, resembling the numpy function histogram.
You're probably familiar with this kind of script, in which several independant operations are performed on a single dataset. It is the typical output of some "back of the envelope", exploratory work that is common in research. Taking a step back, these scripts are the reason why high-level languages like Python are so popular among scientists and researchers: got some data and want to quickly get some insight into it? Let's just jot down a few lines of code and get some numbers, figures and... ideas!
Whilst great for short early research phases, this "back of the envelope scripting" way of working can quickly backfire if maintained over longer periods of time, perhaps even over your whole research project. Going back to analysis.py, consider the following questions:
  • What would you do if you wanted to plot the timeseries over the last 50 time units instead of the first 50?
  • What would you do if you wanted to visualise the Probablity Density Function (PDF) instead of the histogram (effectively passing the optional argument density=true to numpy.histogram).
  • What would you do if you were given a similar dataset to brownian.csv and asked to compute the mean, compute the histogram along with other things not implemented in analysis.py ?
In the interest of time, you are likely to end up modifying some specific lines (to compute the PDF instead of the histogram for example), or/and copy and paste of lot of code. Whilst convenience on a short term basis, is it going to be increasingly difficult to understand your script, track its purpose, and test that its results are correct. Three months later, facing a smilar dataset, would you not be tempted to rewrite things from scratch? It doesn't have to be this way! As you're going to learn in this ourse, organising your Python software into packages alleviates most of these issues.

Separating methods from parameters and data

Roughly speaking, a numerical experiment is made of three components:
  • The data (dataset, or parameters of simulation).
  • The operations performed on this data.
  • The output (numbers, plots).
As we saw, analysis.py mixes the three above components into a single .py file, making the analysis difficult (sometimes even risky!) to modify and test. Re-using part of the code means copying and pasting blocks of code out of their original context, which is a dangerous practice. You might be thinking (and you would be right) that this statement is an exaggeration for a script of this size, but it is not uncommon to see scripts of several hundreds of lines of code, mixing data, operations and output.
The operations performed on the timeseries brownian.csv are independant from it, and could very well be applied to another timeseries. In this workshop, we're going to extract these operations (computing the mean, the histogram, visualising the extremes...), and formulate them as Python functions, grouped by theme inside modules, in a way that can be reused across similar analyses. We'll then bundle these modules into a Python package that will make it straightfoward to share them across different analysis, but also with other people.
A script using our package could look like this:
import numpy as np import matplotlib.pyplot as plt import my_pkg timeseries = np.genfromtxt("./data/my_timeseries.csv", delimiter=",") mean, var = my_pkg.get_mean_and_var(timeseries) fig, ax = my_pkg.get_pdf(timeseries) threshold = 3*np.sqrt(var) fig, ax = my_pkg.show_extremes(timeseries, threshold)
Compare the above to analysis.py: it is much shorter and easier to read. The actual implementation of the various operations (computing the mean and variance, computing the histogram...) is now encapsulated inside the package my_pkg. All that remains are the actual steps of the analysis.
If we were to make changes to the way some operations are implemented, we would simply make changes to the package, leaving the scripts unmodified. This reduces the risk of messing of introducing errors in your analysis, when all what you want to do is modyfying some opearation of data. The changes are then made available to all the programs that use the package: no more copying and pasting code around.
Taking a step back, the idea of separating different components is pervasive in software developemt and software design. Different names depending on the field (encapsulation, separation of concerns, bounded contexts...).