Weeknotes: 22nd April 2025
Last Week
Yirgacheffe MLX version
I spent a good chunk of the week continuing to integrate a first cut of MLX support into Yirgacheffe, enabling it to take advantage of certain GPUs for calculation work. Most of this has been plumbing work, as going from one backend to two has been complicated, but I hope adding CUDA GPU support via cupy at a later date will now be relatively trivial. It's not perfect, but I at least shipped it.
I have to confess to being a little disappointed with the speedups for my AoH pipeline, only giving me around a 10% boost over a thousand AoHs. This is a lot less than I expected based on having experimented with CUDA support a couple of years ago, but since then I've reworked the pipeline a lot, and it seems the dominant costs are now LZW compression in the GeoTIFFs and the rasterisation of the polygons, which as I noted last week are for some species excessively detailed. So I guess yay me for speeding things up since then generally, but it was a slightly disappointing note that I can't expect to migrate from the AMD EPYC server I use for my LIFE and STAR runs currently over to my M3 MacBook Pro. At least not until I've solved some of these other bits.
I think there's definitely some performance still on the table regarding the MLX support, but I don't think it's going to make a night and day difference right now, so I may well park this - particularly as I keep saying I want to stop writing Python for my day job again. But at least I've crossed this experiment off the todo list. Not much to show for a week of work, but that's how it goes sometimes, and at least I now know where to start looking for other performance wins (see the "this week" section for more on that).
Outreachy
I'm mentoring in Outreachy for the summer 2025 batch, and the contribution phase ended this week. I'll try to write a summary of this later in the week, but the last push did involve me messing around with dune pinning and LZW compression more than I anticipated. Hopefully at some point I can contribute my LZW code to Patrick's OCaml GeoTIFF library and justify the time somewhat.
The dune pinning was a little interesting. Most of the packages I generate I don't bother to put into Opam, as they're not really that interesting to others, and I find it hard to justify the effort of getting them documented etc. to the point where they'd be suitable, and so instead I've just always used the vendor directory as a way of collecting all the parts I'm working on. However, this is confusing to new users, and Steve, who works on dune and is also an Outreachy mentor, showed me how to use dune pinning as a way around this. The one downside I've discovered to this is that the first time you build the project it is very slow, as it has to build all the packages for your project, rather than using the opam versions, which are built when you run opam install. This makes the out of box experience for working on Claudius, if you have a slow computer (as many Outreachy candidates do), a bit pants. I guess next time I do something like this I need to make the effort to get my stuff into opam after all.
Den Stora Älgvandringen
This year SVT's Great Moose Migration kicked off a week earlier than expected, and so I've had a third screen up all week with the northern Swedish landscape playing, and the occasional moose crossing. Apparently this year is the opposite of 2023: that year spring was so late that they had to extend it a week to get any meese crossing, but this year spring is so early that they had to start the stream a week ahead of schedule, lest all the meese have gone by the time it started.
This Week
Occurrence data
For the plants project we're working on, it seems I now need to become an expert in GBIF data, something I keep kicking down the road a bit, if I'm honest, as my head is already quite full. I think my starting point on this will be to finally complete the second half of the Dahal et al validation methodology for AoHs. I implemented the statistical modelling half of that methodology a while ago, but I never got around to doing the occurrence based part, and I think it's finally time.
This is in part motivated by the performance of rasterising vectors. In theory I feel I can safely downsample the vectors in many cases, and I know from testing last week that this gives quite a good performance boost for the slowest species we process. However, how safe is that to do in terms of accuracy? Well, to know that I need to run the Dahal et al validation tests!
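To make the downsampling idea concrete, here is a minimal sketch of one common way to thin out an overly detailed outline: Douglas-Peucker simplification. This is an illustration only, not necessarily the method used in the AoH pipeline; the function names and the toy coordinates are made up for the example, and it treats the outline as a simple open line of (x, y) tuples rather than a full polygon with holes.

```python
import math


def _distance_to_chord(point, start, end):
    """Perpendicular distance from point to the line through start and end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    if dx == 0 and dy == 0:
        # Degenerate chord (start == end): fall back to point distance.
        return math.hypot(px - sx, py - sy)
    return abs(dy * px - dx * py + ex * sy - ey * sx) / math.hypot(dx, dy)


def simplify(points, tolerance):
    """Douglas-Peucker: drop vertices that lie within tolerance of the chord."""
    if len(points) < 3:
        return list(points)
    # Find the vertex furthest from the chord joining the two endpoints.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_chord(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        # Everything in between is close enough to the chord to discard.
        return [points[0], points[-1]]
    # Otherwise keep that vertex and simplify each side of it separately.
    left = simplify(points[:index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right


# Hypothetical toy outline: one tiny wiggle and one real feature.
outline = [(0, 0), (1, 0.01), (2, 0), (3, 2), (4, 0)]
print(simplify(outline, 0.1))
```

With a tolerance of 0.1 the tiny wiggle at (1, 0.01) is dropped while the peak at (3, 2) survives. Choosing that tolerance is exactly the accuracy question: too large and real range boundaries shift, which is what a validation run would need to catch. In practice a geometry library would do this job, for example Shapely's `simplify` method on geometries.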
LIFE
I realise I still owe some stats on various maps that I promised to some folk on the LIFE team, so I shall do that, and find time to perhaps start to look at the extensions Ali needs for her work in progress paper.