Weeknotes: Trying out Snakemake incrementally on an existing project
I've been playing a bit more with Snakemake, which is a build tool like Make, but designed to be more accessible and aimed at people building scientific data-science pipelines, particularly in Python, with which it has deep integration. Although I still find myself falling back to the OG Make whenever I want to string together some commands with dependencies between them, I have to admit Make is far from an accessible tool, and Snakemake is trying to serve the vernacular software developer by being a little more approachable.
I've been wanting to experiment with a tool like this for a while now, ever since I was forced to abandon the prototype Shark tool that [Patrick][patrick] and I worked on a couple of years ago (abandoned not because I thought it was a bad idea, but because with Patrick moving on to other things in his PhD I didn't have the bandwidth to build both a production-grade data-science pipeline for LIFE and a production-grade tool for running data-science pipelines). As a stopgap I've been using increasingly complicated shell scripts (e.g. this one for STAR and this one for LIFE), as I figured that data scientists at least seem to write those, but it's clearly the wrong tool for the job.
So why try Snakemake? Several people at the Nordic-RSE conference I attended last year, all of whom had used it in anger, recommended it, and it is primarily aimed at Python projects like the ones I maintain, so I thought I'd give it a spin. It's early days in my evaluation: I have a checklist of things I want to see if it can do (having tried to build a similar tool myself, I have a sense of where the pain points might be), but the first challenge was just integrating it with how I structure my code.
For the pipelines I write, each stage is its own Python script, which should be callable from the command line as a standalone tool, easy to use in a larger shell script (mostly down to consistent argument names), and callable from Python if you'd rather write one large Python wrapper around the stages than a shell script. So a typical script for me starts out as:

    import argparse
    from pathlib import Path


    def do_the_thing(
        first_input_file: Path,
        second_input_file: Path,
        some_value: float,
        output_file: Path,
    ) -> None:
        # Do the thing here...
        ...


    def main() -> None:
        parser = argparse.ArgumentParser(description="My script.")
        parser.add_argument(
            "--first",
            type=Path,
            help="Path of first input",
            required=True,
            dest="first_input_path",
        )
        parser.add_argument(
            "--somevalue",
            type=float,
            help="Some value we need",
            required=True,
            dest="some_value",
        )
        # ... (the second input and the output file are declared the same way)
        args = parser.parse_args()

        # Attribute names on args must match the dest= values above.
        do_the_thing(
            args.first_input_path,
            args.second_input_path,
            args.some_value,
            args.output_file,
        )


    if __name__ == "__main__":
        main()

This has worked well for a while, but Snakemake is designed to integrate more tightly with Python scripts as part of its dependency tracking and management. Whilst you can invoke arbitrary shell commands from Snakemake, if it knows you're running a Python script it will inject a `snakemake` object into the script's environment and pass the inputs, outputs, and parameters in that way, which I can see avoids a lot of nonsense with argparse, a big win I imagine for most vernacular developers. I also think it encourages better code structure by nudging people away from having random file-system side effects scattered through their code, but that's a whole other blog post for another day.
Back to running my script. I will have a Snakemake rule like this:

    rule do_the_thing:
        input:
            first="foo.tif",
            second="bar.tif",
        params:
            value=4.0,
        output:
            "output.tif",
        script:
            "do_the_thing.py"

And to run this way your script would look more like this:

    from pathlib import Path


    def do_the_thing(
        first_input_file: Path,
        second_input_file: Path,
        some_value: float,
        output_file: Path,
    ) -> None:
        # Do the thing here...
        ...


    def main() -> None:
        # `snakemake` is not imported: Snakemake defines it as a global
        # when it runs this file via the rule's script: directive.
        do_the_thing(
            Path(snakemake.input.first),
            Path(snakemake.input.second),
            snakemake.params.value,
            Path(snakemake.output[0]),
        )


    if __name__ == "__main__":
        main()

Certainly a lot more succinct, though I don't like how Snakemake forces you to write your scripts for it, and if you don't want to use Snakemake down the line you have a bunch of work to do (something we were keen to avoid with our Shark tool).
However, in my case I already have all the argparse code and I don't want to lose the ability to run my scripts directly, which I find very handy for testing. You can invoke things one rule at a time in Snakemake from the command line, but I like to have as few dependencies as possible when testing, so I want to keep my argparse code. More importantly, I don't know yet whether I really want to use Snakemake in the long run, so I don't want to commit to it fully.
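For a sense of what keeping both entry points by hand might look like, here is a minimal sketch. It assumes the script checks for the global `snakemake` object and falls back to argparse when it is absent. The `--second` and `--output` flag names, the helper function names, and the `fake_snakemake` stand-in are all made up for illustration.

```python
import argparse
from pathlib import Path
from types import SimpleNamespace


def args_from_cli(argv=None) -> argparse.Namespace:
    # --second and --output are hypothetical flag names for this sketch.
    parser = argparse.ArgumentParser(description="My script.")
    parser.add_argument("--first", type=Path, required=True, dest="first_input_path")
    parser.add_argument("--second", type=Path, required=True, dest="second_input_path")
    parser.add_argument("--somevalue", type=float, required=True, dest="some_value")
    parser.add_argument("--output", type=Path, required=True, dest="output_file")
    return parser.parse_args(argv)


def args_from_snakemake(sm) -> argparse.Namespace:
    # Build the same shape argparse would, doing the type casts by hand.
    return argparse.Namespace(
        first_input_path=Path(sm.input.first),
        second_input_path=Path(sm.input.second),
        some_value=float(sm.params.value),
        output_file=Path(sm.output[0]),
    )


def load_args(argv=None, sm=None) -> argparse.Namespace:
    # Snakemake defines a global called `snakemake` in scripts it runs;
    # when the script is run directly that name is simply absent.
    if sm is None:
        sm = globals().get("snakemake")
    return args_from_snakemake(sm) if sm is not None else args_from_cli(argv)


# Stand-in for the object Snakemake would inject, so the sketch runs anywhere.
fake_snakemake = SimpleNamespace(
    input=SimpleNamespace(first="foo.tif", second="bar.tif"),
    params=SimpleNamespace(value=4.0),
    output=["output.tif"],
)

from_cli = load_args(
    ["--first", "foo.tif", "--second", "bar.tif", "--somevalue", "4.0", "--output", "output.tif"]
)
from_snakemake = load_args(sm=fake_snakemake)
print(from_cli == from_snakemake)  # compares the two Namespace objects
```

It works, but every script ends up carrying two copies of its argument list that have to be kept in sync, which is exactly the kind of boilerplate I'd rather not maintain.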
Thankfully I found a solution thanks to an argparse-snakemake bridge, written by Cade Mirchandani, that lets you just add a decorator to the function containing your argparse code. So I was able to do something like:

    ...
    from snakemake_argparse_bridge import snakemake_compatible
    ...


    # Keys are argparse dest= names; values say where to find them on `snakemake`.
    @snakemake_compatible(mapping={
        "first_input_path": "input.first",
        "second_input_path": "input.second",
        "output_file": "output[0]",
        "some_value": "params.value",
    })
    def main() -> None:
        parser = argparse.ArgumentParser(description="My script.")
        parser.add_argument(
            "--first",
            type=Path,
            help="Path of first input",
            required=True,
            dest="first_input_path",
        )
        parser.add_argument(
            "--somevalue",
            type=float,
            help="Some value we need",
            required=True,
            dest="some_value",
        )
        # ... (the second input and the output file are declared the same way)
        args = parser.parse_args()

        do_the_thing(
            args.first_input_path,
            args.second_input_path,
            args.some_value,
            args.output_file,
        )

    ...

I used to use decorators a lot back in my Django days, but it's been a while since I wrote one, and I'd forgotten how neat they are for tricks like this. Now I have a script that works both in Snakemake and within my existing pipelines, which makes it easier for me to migrate over in an exploratory way, and that is great. I hit one small bump: I always have argparse cast filenames to pathlib's Path type, and the bridge didn't apply the argparse types to the values coming from Snakemake, but I made a small PR to fix that, which Cade accepted and merged.
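To illustrate the general decorator trick, and why the type casting matters, here is a toy sketch. It is not how the bridge is implemented: `toy_bridge`, `lookup`, and `fake_snakemake` are hypothetical names, the `--output` flag is made up, and unlike the bridge's mapping this one is keyed by command-line flag rather than by `dest`. The idea is to turn the Snakemake values into a fake command line, so that argparse's own `type=` converters do the casting.

```python
import argparse
import functools
import sys
from pathlib import Path
from types import SimpleNamespace
from unittest import mock


def lookup(obj, spec: str):
    """Resolve a spec such as 'input.first' or 'output[0]' against obj."""
    value = obj
    for part in spec.split("."):
        name, bracket, index = part.partition("[")
        value = getattr(value, name)
        if bracket:
            value = value[int(index.rstrip("]"))]
    return value


def toy_bridge(flags: dict, source=None):
    """Hypothetical decorator: feed Snakemake values to argparse as CLI flags."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Snakemake defines a global named `snakemake` in scripts it runs.
            sm = source if source is not None else func.__globals__.get("snakemake")
            if sm is None:
                return func(*args, **kwargs)  # plain command-line run
            argv = [sys.argv[0]]
            for flag, spec in flags.items():
                argv += [flag, str(lookup(sm, spec))]
            # parse_args() reads sys.argv when called, so argparse's own
            # type= converters turn the strings into Path, float, and so on.
            with mock.patch.object(sys, "argv", argv):
                return func(*args, **kwargs)

        return wrapper

    return decorate


# Stand-in for the object Snakemake would inject, so the sketch runs anywhere.
fake_snakemake = SimpleNamespace(
    input=SimpleNamespace(first="foo.tif"),
    params=SimpleNamespace(value=4.0),
    output=["output.tif"],
)


@toy_bridge(
    {"--first": "input.first", "--somevalue": "params.value", "--output": "output[0]"},
    source=fake_snakemake,
)
def main() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="My script.")
    parser.add_argument("--first", type=Path, required=True, dest="first_input_path")
    parser.add_argument("--somevalue", type=float, required=True, dest="some_value")
    parser.add_argument("--output", type=Path, required=True, dest="output_file")
    return parser.parse_args()


args = main()
print(args)
```

Round-tripping through strings like this is crude (a negative number would look like a flag to argparse, for instance), but it shows why the decorated `main` can stay exactly as it was.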