Weeknotes: Building data-science pipelines on shifting sands
25 Mar 2026
There is a joke I was told over twenty years ago, during my first stint working for Cambridge University, which goes roughly:
Q: How many Cambridge Dons does it take to change a lightbulb?
A: CHANGE??!?
I guess any institution that has been around longer than a lot of nations will get a reputation for being resistant to change, but in fact most people I know at the university are naturally progress-focussed: that is the point of doing research and of wanting to teach the next generation.
However, I did feel a bit seen by this joke this week[1] when I received a report that yet another package dependency update had broken one of my data-science pipelines, and I had to scramble to unblock said reporter. Open-source software is a blessing of untold magnitude: we stand on the shoulders of not just one giant, but many. Looking at the two big pipelines I work on regularly, going by Python packages alone, STAR has 149 dependencies, of which 19 are direct, and LIFE has 130, of which 30 are direct; beyond this the work is built on many more explicit and implicit dependencies in the form of C libraries and the operating system itself. Obviously, the more dependencies you have, the more likely it is that one will update, and occasionally an update will happen to break your software, at which point you need to roll up your sleeves and fix it.
I could just pin every dependency at a specific version, so that I'd not pick up updates automatically when the software is rebuilt unless I explicitly asked for them. But I find that unless you're really on top of this, people tend not to update the version numbers (why would they, when it works for them?), and that means you miss out on bug fixes and security updates.
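For illustration, here is what the two extremes look like in a pip requirements file (the version numbers and the non-Snakemake package are placeholders, not my actual dependency list):

```text
# Fully pinned: rebuilds are reproducible, but you only get bug fixes
# and security updates when you bump the numbers by hand.
snakemake==9.16.1
pandas==2.2.3

# Loosely specified: you pick up fixes automatically on rebuild,
# and occasionally a breaking release too.
snakemake>=9.16
pandas>=2.2
```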
In a production world I'd operate in a hybrid mode: the main software code would carry a loosely defined list of dependencies, and when the software was QA'd, in both automated and manual testing, we'd snapshot the dependency list once it passed. Anything that went to customers based on that release of the software (e.g., if we had to do a patch release with a bug fix) would then always use that exact dependency set. This works well, but again it is a lot of effort to maintain.
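The snapshot step can be sketched in a few lines of Python using the standard library's importlib.metadata; the `requirements.lock` filename is my own convention here, not something any tooling mandates:

```python
from importlib import metadata


def snapshot_dependencies() -> list[str]:
    """Return every installed distribution pinned to its exact version."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
    )


def write_lockfile(path: str = "requirements.lock") -> None:
    """Freeze the environment that passed QA so later patch releases
    can rebuild against exactly the same dependency set."""
    with open(path, "w") as fh:
        fh.write("\n".join(snapshot_dependencies()) + "\n")
```

In practice `pip freeze > requirements.lock` does much the same job; the point is that the pinned list is generated from the environment that passed QA, rather than maintained by hand.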
Which is why in a research environment I don't bother, and then occasionally get burnt, like I did this week when Snakemake made a minor release (from 9.16 to 9.17) that in turn broke another package I depend on, snakemake-argparse-bridge. So whilst the bug wasn't in my code directly, it was still my problem to try to fix. That at least is a positive of open source: I can try to fix things myself.
I guess mostly I present this as one of the more tedious sides of modern software engineering; indeed, dependency management is a whole research area that my colleagues Ryan and Anil have been working on. In a practical sense though, there is no right answer to the problem I faced this week: it is just a risk/benefit trade-off around how much you invest in defending against this issue whilst keeping up to date. But it is a decision you should consciously make, rather than waiting until disaster strikes at some point down the line, when you were hoping to be focussing on other matters.
-
To be clear, whilst I work for Cambridge University again, I'm not what anyone would consider a don; but my job title is now Assistant Research Professor, which is close enough to feel seen by the joke.
Tags: star, python, dependencies