Healthy code, healthy patients: coding best practices in medical Data Science (Part 2)

The only useful code is production-ready code.

In a nutshell, production-ready code means code that is bug-free, scalable, well documented, easily readable, reusable, and reproducible. Following this principle will save you endless time, costs, and frustration, and it will ensure that the right results are obtained from the very beginning of a project.

Abstract code yields concrete results

During the long Dutch winters, staying DRY does not only refer to needing a raincoat while biking in the rain. In the programming world, it stands for Don’t Repeat Yourself, and it should be…well, repeated like a mantra. The concept of abstraction is a cornerstone for scalable software: each distinct functional operation should be present in just one place in the source code, usually in the form of functions or classes. When similar tasks are carried out by different functions, they should be combined by abstracting them out.

An example of code abstraction applied to Intensive Care data: to calculate the maximum value of each measurement for a patient, we can write just one function the takes two inputs, and call it repeatedly.

Sharing is caring

To achieve efficient code sharing, it is important to have one or more central repositories which are well maintained, clean, and readable. At Pacmed we have our own general code repository, fittingly called PacMagic, which is effectively a custom Python package that contains all the functions needed for any step of a data science pipeline, from preprocessing to modeling, from data analysis to visualization.

Splitting data, scaling it, and plotting it can take as few as four lines of code using a shared code repository.

Readability means reliability

As already mentioned above, readability is a necessary condition for code to be shared and reused across projects and even within teams. It is pointless to spend time writing modular and abstract functions if the next person is not going to be able to understand how to use them. To write readable code, it is important to properly document it, comment it, and most of all use good syntax and style.

Good documentation saves lives…

In Python especially, where functions are ubiquitous, docstringsshort for documentation strings– are the main and most efficient approach to documenting code. They are small pieces of text that explain what a function does (not how!), and should include a list, description, and data type of every parameter of the function. The Python community has come up with a few handy conventions for writing docstrings; it’s good to pick one of those and stick to it.

An example of a function with a docstring written in numpy style, which is one of the styles natively supported and recognized by PyCharm. Just by looking at the docstring, one is able to fully understand what the function does and what kind of parameters must be used as an input.

…but good code should explain itself

At a certain point in their professional life, every programmer is taught that good code is well-commented code. While this is certainly true to some extent, the reality is that the best code is self-explanatory code. The name of variables, functions, classes, and even files should describe exactly what each of them does. In the end, one should be able to understand what a piece of code does without the need for explanation. This is true especially for Python, whose syntax has the advantage of being particularly human-readable.

Static typing

As enthusiastic Pythonistas, at Pacmed we have whole-heartedly welcomed the recent addition of static type checking, meaning the possibility to specify the data type a variable is supposed to hold when it is initialized, yielding more understandable code, quicker computations, and allowing for automatic error checking in more advanced IDEs. While it may seem a bit cumbersome and unfamiliar at first, I recommend to take a look at this tutorial, which will clear things up and introduce you the magical world of not having to worry about TypeErrors.

The beauty of static typing: the function signature now specifies that the input “df” must be a pandas DataFrame, and PyCharm will issue a warning if a Series is used instead.

No test left unturned

Great! Now we know how to write readable, modular, and abstract code. But we still have no idea whether that code is going to output what we intended it to. Knowing that the code you write and use works as expected is one of the most crucial parts of software development ­­– this can, and must be done through careful testing.

KISS goodbye to complexity

Ultimately, test everything and test often! In fact, some say it’s even better to write tests before writing the code itself: this forces you to think about what you want your code to do, what you don’t want it to do, and how you want to achieve that.

Conclusion

Data science is, by all means, a lot of fun; creativity and curiosity play a huge role in building successful models and getting great results. When building medical software, structured and organized coding practices are paramount in order to obtain results that will make a real impact on the lives of patients.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Pacmed

Pacmed

Pacmed builds decision support tools for doctors based on machine learning that makes sure patients only receive care that has proven to work for them!