Healthy code, healthy patients: coding best practices in medical Data Science (Part 1)
By Michele Tonutti, Data Scientist at Pacmed
How would you feel knowing that the quality of every single line of your code will directly impact the lives of thousands of people?
Anyone who has ever coded even a simple script has likely experienced the pure excitement of seeing their program run for the first time without errors. For Data Scientists in particular, the satisfaction of successfully building and training a machine learning model is probably unrivalled.
This is particularly true in medical data science: the thrill of data-driven problem solving is exponentially amplified by the awareness that the predictions of the model might actually help doctors make more informed and personalized decisions. However, it is no secret that applying data science in medicine can be intimidating at first. When the lives of patients are on the line, any program built to be used in a clinical setting –whether it is a prediction model or a simpler analytical tool– must be of the highest possible quality. This is a big challenge, especially in diverse teams made up of developers, doctors, and data scientists, all with different coding skills.
At Pacmed, the job of the Data Science team is to develop, test, and prototype the machine learning models which drive our clinical decision support tools. Our end products are meant to be implemented and used in practice in clinics and hospitals all around the Netherlands, with the concrete goal of helping doctors provide the best possible treatment to their patients. For our software to be as accurate, safe, and robust as possible, the code behind it must be absolutely flawless. In addition, our software development and quality assurance processes must adhere to international standards and medical device regulations, in order to receive the certifications necessary for the software to be used in a clinical setting.
In this article I will share some insight into what it takes to write high-quality code for medical applications, specifically keeping in mind the challenges of working in a heterogeneous team. Using the (fun!) day-to-day life of a Data Scientist at Pacmed as a case study, I will outline the main steps we take to ensure that the code driving our products is of the highest quality, and that the software development process, from inception to end-user studies, is smooth and foolproof.
In this first part we will cover repository structure, IDEs, version control, and virtual environments. The second part will focus on how to write good code.
Note: Python is our language of choice, but the concepts will apply to any programming language you may prefer.
Step #0: Structuring repositories
Messy code and folders are the number one killer of efficiency and team productivity. The first thing to do before starting any project is therefore to make sure the repository –meaning the folders where the code will be stored– has a sensible structure.
At Pacmed, we use the Cookiecutter Data Science format, which is standardized and logical, but also very flexible. Flexibility is quite important, as we do not want to be completely constrained by boundaries that do not fit our goals. On the other hand, we want to make sure that all projects follow the same basic organizational approach; this makes it easier for people who do not work regularly on a project to find what they are looking for, or to go back to old code and know immediately what is where. For instance, the code to initialize, train, and test the prediction models will be in the models folder; the code to create features from raw data will be in features; the functions to plot the data for analysis will be in visualizations; and so on.
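For reference, here is a trimmed-down sketch of the standard Cookiecutter Data Science layout; the exact folder names and contents may vary slightly from project to project:

```
├── data               <- Raw and processed data (typically kept out of version control)
├── models             <- Trained and serialized models
├── notebooks          <- Jupyter notebooks for exploration and prototyping
├── reports            <- Generated analyses and figures
└── src
    ├── data           <- Code to load, download, or generate data
    ├── features       <- Code to turn raw data into model features
    ├── models         <- Code to train models and make predictions
    └── visualization  <- Code to create exploratory and result plots
```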
A clean repository policy is even better than a clean desk policy: both will make working much more pleasant, but an organized repository will magically make your code much better in the long run!
A good IDE is a good idea
Another simple trick for maximizing efficiency and eradicating frustration is choosing the right integrated development environment (IDE). Simply put, an IDE is the program or application you use to write code.
A good IDE is supposed to be an all-around handyman, serving as a spell-checker, proof-reader, copywriter, prompter, and look-up guide. It will provide you with automatic formatting, auto-complete suggestions, documentation, and countless shortcuts to enhance productivity and minimize the chance of mistakes. Our personal picks are PyCharm and Atom. For exploratory modelling and prototyping we also extensively use Jupyter notebooks, which are fantastic for interactive and collaborative coding.
In general, it’s probably best to have access to a combination of the above-mentioned tools: for instance, using Jupyter to code an exploratory data pipeline, while keeping PyCharm or Atom open on the side to efficiently browse through the source code of your repository.
Version control: Git ’em all
If working in a team can be frustrating for most tasks, coding together can take frustration and friction to a whole new level. To make sure that collaborative coding becomes an asset, rather than a burden, version control is the solution. Version control tools make it possible to track changes to code and files over time, so that everything can be reverted to a specific version at any point in the future, and changes do not get overwritten by accident. The same project can have parallel branches, where different people can work on the same code simultaneously.
There are many different services available for this purpose, most of which rely on the open-source tool Git. Since we are particularly patriotic at Pacmed, we use GitLab, born and bred in the Netherlands.
Regardless of the tool, proper use of version control allows us to implement a number of cornerstones of software development:
- The four-eyes principle: code can only be merged into the main branch of the repository if it has been reviewed and approved by at least one other person. As an extra perk, having fellow coders review your work also helps improve your own coding skills.
- Versioning: each time the master (main) branch of the codebase is changed, it gets a version number. At any point in the future, we will know exactly which version of the code each project was using when it was released to production. This also works on a smaller scale: through the use of ‘checkpoints’ called commits, costly mistakes can be reverted relatively painlessly.
- Continuous integration and continuous delivery (CI/CD): a set of principles to automate the way our tools are built, tested, updated, and deployed to production environments. Whenever we push code to Git, a series of automated tests make sure that our code does not fail and does not break existing functionality or products (a minimal sketch of such a test is shown below).
…and many more benefits. No more getting angry at your teammate for changing a crucial piece of code! No more headaches trying to figure out how to fix a small mistake without losing a whole day of work! Sure, if you are not familiar with version control tools, they can seem scary and overly complicated; but they will skyrocket your productivity, and you might discover that coding in a team can even be fun (…most of the time).
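To give a flavor of what those automated checks look like in practice: at their core they are ordinary test functions that the CI pipeline runs on every push. The snippet below is a minimal, hypothetical sketch (the feature function, file name, and column names are made up for illustration), runnable with pytest:

```python
# test_features.py -- a hypothetical example of the kind of check a CI pipeline
# could run automatically on every push (function and column names are made up).
import pandas as pd


def add_bmi_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the dataframe with a BMI column from weight (kg) and height (m)."""
    df = df.copy()
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
    return df


def test_add_bmi_column_computes_expected_values():
    # A tiny, fully synthetic input frame.
    patients = pd.DataFrame({"weight_kg": [70.0, 90.0], "height_m": [1.75, 1.80]})
    result = add_bmi_column(patients)
    expected = pd.Series([70.0 / 1.75 ** 2, 90.0 / 1.80 ** 2], name="bmi")
    pd.testing.assert_series_equal(result["bmi"], expected)
```

In a typical setup, if a change accidentally breaks this behaviour, the pipeline fails and the problem surfaces before the code ever reaches the master branch.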
Protect the (virtual) environment
If you are going to collaborate and work on multiple projects at once, you might find that the software requirements for each of them differ widely. Virtual environments are a great way to manage Python packages independently for each project, by creating isolated directory trees that contain an installation of a specific version of Python. Different versions of the same package (or even of Python itself) can be installed for different projects, and for each project you only need to install the strictly necessary packages. Environments can, and should, be shared amongst team members, so that one’s results can be reproduced easily. Another nice thing about environments is that they can be contained in versionable files, creating a wonderful synergy with Git: each version of a project is accompanied by its own environment file, with the exact versions of the packages used at that point in time. No more incompatibility and installation problems!
Our tool of choice for creating and managing virtual environments is Conda, since it’s very flexible, supports other languages (such as R), and plays particularly well with Jupyter. Other options are pipenv and virtualenv, both of which are well known and established in the Python community.
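As an illustration, a Conda environment file is just a small, human-readable file that lives in the repository next to the code. The example below is hypothetical and heavily trimmed; the project name, package names, and versions are purely illustrative:

```yaml
# environment.yml -- a minimal, illustrative Conda environment file.
# Recreate the environment with:   conda env create -f environment.yml
# Update the file after changes:   conda env export > environment.yml
name: my-project            # hypothetical project name
channels:
  - conda-forge
dependencies:
  - python=3.8              # pin the Python version itself
  - pandas=1.0.3            # illustrative package versions
  - scikit-learn=0.22.2
  - jupyter
```

Because a file like this is committed together with the code, checking out an old version of a project also gives you the exact package versions it was developed with.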
Ready to code!
Up to this point we have covered the bases needed to set up efficient and productive collaborative coding, and to start developing solid programs for medical applications. In the second part of the article we will talk about the actual process of writing code.
In particular, we’ll touch upon the best way to structure your functions in order to enhance maintainability; we’ll explore how to use the correct syntax to ensure maximum readability; and we’ll look at how to write good documentation and proper unit tests, amongst other things.
Happy coding!