Weeknotes: 27th March 2023
Week in review
Non-Python concurrency testing
Coming off the back of being unwell last week, I wanted to treat myself to a little coding to get back into the swing of things, so I picked something fun off the stack of things I’ve wanted to do for a while but hadn’t got around to.
I still want to do some experiments with moving from the littlejohn/littlejane model of concurrency management to potentially hiding the concurrency within Yirgacheffe. The problem is that Yirgacheffe is written in Python, and whilst people have done good work on improving Python concurrency (e.g., PyTorch have made some extensions to multiprocessing so that it makes better use of shared memory for passing large amounts of data back and forth), it’s still not great.
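To give a flavour of why, here is a minimal sketch of passing a large array to worker processes via shared memory. It uses the standard library's `multiprocessing.shared_memory` rather than PyTorch's extensions, and the function names (`make_block`, `sum_slice`, `parallel_sum`) are made up for illustration. Note how much bookkeeping (names, offsets, closing, unlinking) is left to the caller.

```python
import struct
from multiprocessing import Pool, shared_memory


def make_block(values):
    """Copy a non-empty list of floats into a new shared memory block."""
    shm = shared_memory.SharedMemory(create=True, size=8 * len(values))
    struct.pack_into(f"{len(values)}d", shm.buf, 0, *values)
    return shm


def sum_slice(args):
    """Worker: attach to the block by name and sum values[start:stop]."""
    name, start, stop = args
    shm = shared_memory.SharedMemory(name=name)
    try:
        # Only the block's name crossed the process boundary, not the data.
        values = struct.unpack_from(f"{stop - start}d", shm.buf, 8 * start)
        return sum(values)
    finally:
        shm.close()


def parallel_sum(values, workers=2):
    """Sum a non-empty list of floats using a pool of worker processes."""
    shm = make_block(values)
    try:
        step = -(-len(values) // workers)  # ceiling division
        tasks = [
            (shm.name, start, min(start + step, len(values)))
            for start in range(0, len(values), step)
        ]
        with Pool(workers) as pool:
            return sum(pool.map(sum_slice, tasks))
    finally:
        # The creator is responsible for freeing the block.
        shm.close()
        shm.unlink()
```

As with anything built on `multiprocessing`, `parallel_sum` should be called from under an `if __name__ == "__main__":` guard when run as a script.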
So I wondered what it’d take to get a minimal version of the AoH calculator running in Swift, as Swift is a language I know better than, say, OCaml (which my CS colleagues use), and neither has particularly good support for GDAL :) I perhaps should have used Go, but the Go team’s recent stance on telemetry has soured me on the language a little, and I quite like Swift.
I had a look to see if there were Swift bindings for GDAL, but there is the usual array of people who have started to make GDAL bindings and then abandoned them, a pattern that seems to repeat every two years or so. Nothing up to date and useable.
All of which is a poor justification for the rabbit hole I ended up down, trying to write Swift wrappers for GeoTIFF and GPKG files. It turned out the libtiff wrapper for Swift, which is otherwise quite good, assumed that you’d just load entire images into memory when you opened them, so I had to tweak it to load lazily. This meant understanding the behaviour of Swift’s unsafe memory accessors as data goes between Swift and the libtiff C library. In the end I just made a nice little test case repository for this, with a unit test for each access type I came across.
I also found a few issues with the existing Swift wrapper and made some upstream PRs to fix those, all of which were accepted over the weekend, so yay for making the world a slightly better place.
I’m currently reading through the GeoPackage spec, working out the fastest path to rasterising vectors from a GeoPackage, as that’s the other common bit of GDAL usage Yirgacheffe makes use of. A GeoPackage turns out to be just a SQLite database, with some well-defined tables and fields, and some binary blob columns to store geometry.
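A rough sketch of what that looks like using only Python's standard library, going by the spec's `gpkg_geometry_columns` table and its geometry blob header (a `GP` magic, a version byte, a flags byte, an SRS id, an optional envelope, then standard WKB). The function names are made up for illustration, and the sketch stops at the raw WKB rather than decoding it or handling extended geometry types.

```python
import sqlite3
import struct

# Envelope size in bytes, keyed by the envelope code held in the flags byte.
ENVELOPE_BYTES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def split_gpkg_blob(blob):
    """Split a GeoPackage geometry blob into (srs_id, envelope, wkb)."""
    if blob[0:2] != b"GP":
        raise ValueError("not a GeoPackage geometry blob")
    flags = blob[3]
    # Bit 0 of the flags gives the byte order of the header values.
    endian = "<" if flags & 0x01 else ">"
    # Bits 1 to 3 say which envelope, if any, follows the SRS id.
    envelope_code = (flags >> 1) & 0x07
    if envelope_code not in ENVELOPE_BYTES:
        raise ValueError(f"invalid envelope code {envelope_code}")
    (srs_id,) = struct.unpack_from(endian + "i", blob, 4)
    size = ENVELOPE_BYTES[envelope_code]
    envelope = struct.unpack_from(f"{endian}{size // 8}d", blob, 8)
    # Everything after the header is standard WKB.
    return srs_id, envelope, blob[8 + size:]


def iter_geometries(path):
    """Yield (table, srs_id, envelope, wkb) for every feature geometry."""
    db = sqlite3.connect(path)
    try:
        layers = db.execute(
            "SELECT table_name, column_name FROM gpkg_geometry_columns"
        ).fetchall()
        for table, column in layers:
            for (blob,) in db.execute(f'SELECT "{column}" FROM "{table}"'):
                if blob is not None:
                    yield (table, *split_gpkg_blob(blob))
    finally:
        db.close()
```

The envelope is handy for rasterising, as it gives a bounding box per feature without parsing the geometry itself.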
Understanding more ecology processes
I helped one of the ecology team try to understand why their crop processing code was falling over on our compute server. The week before they had had issues with concurrency in R, and Patrick had helped to move them over to using Littlejohn to manage the outer concurrency, running many copies of the R script at once.
This initially seemed to work well, but at some point in the process the scripts would still fall over. I had a look with the author of the scripts, and the problem was that although on average each process was using 10 GB of RAM, so in theory you could run all dozen iterations in parallel, over time each process’s memory usage would spike to over 100 GB, at which point the total exceeds even our compute server’s generous memory allocation (1 TB). Limiting the concurrency to just 5 scripts kept things just within the threshold, but that is obviously an inefficient use of a compute server with 256 cores.
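The arithmetic behind that is simple enough to sketch. The function name is made up, 1 TB is taken as 1000 GB, and the 200 GB figure is an assumption: the scripts were only observed to peak at "over 100 GB", and 200 GB is simply the peak that would make five the limit on this machine.

```python
def max_parallel_jobs(total_ram_gb, peak_job_ram_gb, cores):
    """Number of jobs that fit in RAM at their peak, capped at the core count.

    Always returns at least 1, on the basis that a single job is still
    worth attempting even if its peak is larger than the machine.
    """
    if peak_job_ram_gb <= 0:
        raise ValueError("peak_job_ram_gb must be positive")
    by_memory = int(total_ram_gb // peak_job_ram_gb)
    return max(1, min(cores, by_memory))


# Sizing on the 10 GB average: 100 jobs fit, so a dozen looks easy.
print(max_parallel_jobs(1000, 10, 256))
# Sizing on a 100 GB peak: only 10 fit, so a dozen at once falls over.
print(max_parallel_jobs(1000, 100, 256))
# Sizing on an assumed 200 GB peak: 5 fit, leaving 251 cores idle.
print(max_parallel_jobs(1000, 200, 256))
```

The point is that the limit has to be sized on peak rather than average memory use, which is why memory and not the 256 cores ends up being the constraint.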
This particular script they are running doesn’t need to exist beyond this current task, so I don’t think we’ll spend time fixing it, but we had a chat about strategies that would have let them better utilise the resources we have in future - mainly fragmenting the algorithm into smaller scripts that would allow for even more CPU level concurrency whilst better managing memory by forcing cached data into files rather than just leaving it in memory.
For me it was useful to see some more real examples of what ecologists do, as I can feed that into Yirgacheffe. The ecologist also agreed to let me in on the next big processing job early, so I can help guide the resource scaling parts, and I get to see yet more ecology problems.
Biodiversity
I re-established the Postgres database I’d made for Alison so we could try to use it for the H3 data. I also helped Ali get some changes into the IUCN modlib library Daniele wrote, so that it tolerates datasets that don’t populate all fields (because not all fields are needed for a given experiment).
Containers for ecology computing
I did some more reading up on Docker volumes and found them lacking for our Ark use case, but Anil gave me some pointers I need to chase up. The main issue is that in the Ark workflow we have this graph of compute and storage nodes, e.g., this example of the biodiversity calculation:
If this is in a CI-style Ark system, each storage stage in my mind results in an on-disk blob that can be mounted in downstream stages, but is also something you can move, copy, host with IPFS, etc. But Docker volumes don’t give you this naturally: they just mount folders from the host’s view of the file system. So we’ll need to make our own thing that overlays a blob onto the local file system, and then have a Docker volume plugin tell Docker where that is. This to me feels like quite a key thing for Ark to get right.
Coming Week
- I had a catch up with Anil at the weekend, and we want to get Anil, Patrick, and myself around a whiteboard to do some Ark prototyping planning.
- I need to draw some conclusions about this non-Python geospatial code asap.
- I still need to find a way to get people using Python/GDAL containers on our compute servers safely.
- I need to catch up with Robin about his file system provider work.
Interesting Links
This year’s Great Moose Migration stream will kick off tomorrow with a setup program, showing how they install a dozen TV-studio grade PTZ cameras around a forested river bend in winter conditions. It’ll be in Swedish, but I think it had subtitles in previous years, and it’s an interesting logistical effort: https://www.svtplay.se/den-stora-algvandringen Obviously, though, just watching the Swedish wilderness 24/7 for two to three weeks is the real gem that follows.
Tags: weeknotes, life, yirgacheffe, littlejohn, swift, containers