Weeknotes: 31st March 2025
Last Week
I was still coming off the back of being unwell the previous week, so it was not the most productive week.
STAR
I got the full STAR calculation running, which means I now have two full biodiversity impact pipelines built on the same AoH code underpinnings. I checked in with Chess, and it looks like I'm mostly there, but I just want to check some of the details: it looks like I'm losing some coastal pixels compared to her results, which makes me suspect my clipping for the marine layer might be out.
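The kind of check I have in mind is a simple raster diff between my result and hers. A minimal sketch, assuming both results are loaded as equal-shaped 2D lists of floats sharing a nodata value (the function name and nodata value here are hypothetical, not from the actual pipeline):

```python
NODATA = -9999.0  # hypothetical nodata sentinel

def diff_pixels(mine, hers, nodata=NODATA):
    """Return (row, col) indices where one raster has data and the
    other doesn't, or where both have data but the values disagree."""
    mismatches = []
    for r, (row_a, row_b) in enumerate(zip(mine, hers)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            a_has = a != nodata
            b_has = b != nodata
            if a_has != b_has or (a_has and abs(a - b) > 1e-6):
                mismatches.append((r, c))
    return mismatches

# Toy example: my result is missing the right-hand "coastal" column
mine = [[1.0, NODATA],
        [2.0, NODATA]]
hers = [[1.0, 0.5],
        [2.0, 0.7]]
print(diff_pixels(mine, hers))  # [(0, 1), (1, 1)]
```

If the mismatches cluster along the land/sea boundary, that would support the marine-clipping theory.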
LIFE
I started to process a bunch of high-resolution data I've been given for Brazil to make an application-specific version of the LIFE metric. My intent had been to try to code something to make it more efficient to work with a mix of high- and low-resolution maps, rather than doing a bunch of work at the higher resolution unnecessarily. But in practice, until I know this pipeline is one people will actually use, that's not a good use of my time right now, and I have a few weeks to generate the results, so I'm just doing it the naive way for now, and I'll see just how slow it is.
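The naive way here amounts to upsampling every low-resolution layer to the high-resolution grid before combining, which is wasteful but simple. A toy sketch of nearest-neighbour upsampling by an integer factor (assuming the resolutions divide evenly, which won't always hold for real rasters):

```python
def upsample(grid, factor):
    """Nearest-neighbour upsample a 2D grid by an integer factor:
    each source pixel becomes a factor x factor block of copies."""
    out = []
    for row in grid:
        # widen the row, then repeat it vertically
        wide = [v for v in row for _ in range(factor)]
        out.extend([wide[:] for _ in range(factor)])
    return out

low = [[1, 2],
       [3, 4]]
print(upsample(low, 2))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

The cost is that every downstream operation now runs over factor² as many pixels, which is exactly the slowness I'm braced for.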
Elevation maps
One of the discussions we've been having in the STAR context (which will hopefully be something I move over to LIFE in the near future) has been around issues with the FABDEM elevation map: it has a hole in it around Azerbaijan, due to that data being withheld in the Copernicus elevation data when FABDEM was made. That Copernicus data has since been released, but the FABDEM authors have moved on to other things, so it's unlikely to be updated. For now we've patched the hole with other data, but it's not ideal to have to do that.
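The patching itself is conceptually simple: wherever the primary DEM has nodata, take the value from the fallback layer. A minimal sketch, assuming both rasters are already aligned to the same grid (the fiddly reprojection and resampling needed to get there is elided, and the nodata value is a hypothetical one):

```python
NODATA = -32768  # hypothetical nodata value for the DEM

def patch_dem(primary, fallback, nodata=NODATA):
    """Fill nodata holes in the primary DEM from an aligned fallback DEM."""
    return [
        [f if p == nodata else p for p, f in zip(prow, frow)]
        for prow, frow in zip(primary, fallback)
    ]

primary  = [[100, NODATA], [NODATA, 120]]
fallback = [[101, 95], [110, 119]]
print(patch_dem(primary, fallback))  # [[100, 95], [110, 120]]
```

The reason it's "not ideal" is visible even in this sketch: the filled pixels come from a map built with different methods, so there can be seams along the hole boundary.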
FABDEM was an improvement over just using SRTM data or Copernicus elevation data, as it used modelling with other data inputs to adjust for forests and buildings, giving you a map of the terrain rather than just what the space lidar saw. The FABDEM team have made a newer map, FATHOMDEM, that they are pushing as a replacement, but that has unfortunate viral licensing terms that make it practically impossible to use in this research domain: the license requires the same terms of all other data sets it is combined with, which would not work with, say, the IUCN licensing terms, and IUCN species data is much harder to swap out than an elevation map is.
One possible future alternative I spotted this week is this work, which is in pre-print. They use data sources like GEDI to also convert the Copernicus elevation map into a terrain map, and it has better licensing terms that align with the IUCN terms (it is CC-BY, just requiring attribution, whereas FATHOMDEM has a "share-alike" clause that means you can't easily mix it with other data sources that you don't have permission to share in the same way).
Nordic-RSE
I submitted three entries to Nordic-RSE, and all were accepted, so they asked me to pick just one to actually present :) I've opted to go with my discussion session proposal: I'm going to run a session on data lineage, to try and find out what other people are doing in this space for their software pipelines.
The conference is in May in Göteborg, so I can also practise my Swedish while I'm there. 🇸🇪
ATProto
I know the cool kids in the group have been experimenting with ATProto, the protocol that underpins the Blue Sky social network. I finally got around to reading the paper on ATProto and watching a recent overview talk on it. I've no need for yet another social network, but as a way of connecting federated parts of, say, a scientific workflow, it is interesting.
One thing that concerned me is how, despite it being a supposedly federated protocol, all the discourse I'd seen on the topic still required Blue Sky, or seemed to require raising $30 million to do otherwise. The main alternative, ActivityPub, is known to have poor performance characteristics in that it is a very chatty protocol, but it is at least more obviously federated: I already host my own independent ActivityPub instance that talks to thousands of other similarly independent instances. Another criticism of ActivityPub is that your identity is tied to the federated instance through which you access the network, and it's hard to move. I can attest to that, as I did migrate my social media from one of the main UK shared instances to my own self-hosted one (just because I'm a nerd and thought it was fun, not for any technical reason), and I effectively had to set up a new account to do so, losing all my message history.
That identity problem is the main one ATProto seems to have tried to tackle, which makes it ironic that anything you read on ATProto starts with you making an account on Blue Sky - not very federated! But that actually just seems to be a problem with the infrastructure set up to date and people taking the path of least resistance, rather than anything inherent in the protocol. In a simplistic view, there are three components to ATProto:
- Identity
- Personal data
- The application that uses those
There are technically many more parts, but let's just work with this simple model for now. There are several ways identity can be implemented, though Blue Sky only supports two of them currently. Ultimately, my understanding is that I can have my own identity by attaching it to my domain name, and Blue Sky itself allegedly supports this, but it seems very few people go down this route - so perhaps I've misunderstood this part. So (at least under my current understanding) that's a check for federation.
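For what it's worth, the domain-based identity works by resolving your handle to a DID, either via a DNS TXT record at `_atproto.<handle>` whose value has the form `did=did:...`, or via an HTTPS fetch of `https://<handle>/.well-known/atproto-did`. The record format is from the ATProto handle resolution spec; the function below is just my own sketch of parsing the TXT form:

```python
def did_from_txt(txt_value):
    """Extract the DID from an _atproto TXT record value of the
    form 'did=<did>'. Returns None if the record doesn't match."""
    prefix = "did="
    if txt_value.startswith(prefix):
        did = txt_value[len(prefix):]
        if did.startswith("did:"):
            return did
    return None

print(did_from_txt("did=did:plc:abc123xyz"))  # did:plc:abc123xyz
print(did_from_txt("v=spf1 ..."))             # None
```

So pointing a TXT record on a domain you control at your DID is all the "attaching" involved, which is why it's surprising so few people do it.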
The personal data is the one bit that is more obviously federated: anyone can host their own Personal Data Server (PDS) and have that be where your social media posts etc. go, and Blue Sky will use this. I'm a little surprised at how few PDS implementations there are for people to run. Most people seem to use the Blue Sky Docker container version. I found a work-in-progress Python implementation called millipds, and some notes on how to use the Blue Sky JavaScript implementation that is what's inside their container. Still, that's more than one, so another check for federation.
The bit that isn't federated, and deliberately so, is the apps. Unlike ActivityPub, where the aim is that every node will run an instance of an app (say a social media site like Mastodon), in ATProto the apps are centralised, with the idea that if you don't like an app, you can move to another app using the same data. Effectively this bit is still a single silo, it's just that you can walk away with your data if you want. If someone else sets up another social media service on ATProto, then it is just another app using the protocol, not something that Blue Sky needs to worry about. Whilst I suspect you could do a better job of federating here than I've painted, this is certainly the intent I got from the talk I watched by the Blue Sky engineer.
I did intend to test my understanding by running my own identity and PDS, but I failed to get a PDS set up on my current self-hosting machine due to library issues trying to run millipds: it relies on CBRRR, which isn't 32-bit clean, and my self-hosted Pi is running Raspbian, which is a 32-bit platform even though the hardware knows how to use 64 bits of address space. I also didn't want to fight with Docker on this node (for instance, Docker and the firewall I use, UFW, are not compatible). So I've had to pause this for now.
Self Hosting
I finally moved all my self-hosted eggs into a single Mythic Beasts hosted Raspberry Pi, which now includes my GoToSocial fediverse instance and my Matrix homeserver (along with the Postgres database they require). The only struggle I had was with the Matrix homeserver, which I'd forgotten uses a non-standard port for federation with other servers, and my Pi is behind an IPv4-to-IPv6 proxy that only handles standard HTTP(S) ports. Thankfully, after a suspiciously quiet afternoon (i.e., I didn't notice for a bit), I managed to bodge around that by using this config option.
I picked a Raspberry Pi over a VPS on a whim, as I thought it would be interesting to try. I regret it now, but I paid for a year in advance, and I've no wish to re-migrate all my stuff yet again, so I'll try to live with it. Main pain points are:
- Image processing is painfully slow, and my websites which are image heavy are struggling. Thankfully they're not that popular!
- I'd naively assumed that as the ARM chip on the Pi is 64-bit, the OS would be too, but it's not, and that's already causing problems (and is presumably linked to the above too).
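If you want to check what you're actually running on before assuming, the pointer size reported by the interpreter is a quick tell; a small sketch:

```python
import struct
import platform

# Size of a C pointer in the running interpreter: 4 bytes on a
# 32-bit userland, 8 on a 64-bit one -- regardless of what the
# underlying CPU is capable of.
bits = struct.calcsize("P") * 8
print(f"{bits}-bit userland on {platform.machine()}")
```

On a 64-bit-capable Pi running 32-bit Raspbian this reports 32, which is exactly the trap I fell into.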
Failed objectives
I'd set myself the Q1 goal of trying to move from using Python so much to doing more OCaml work, and given I've written no OCaml in a work capacity these last couple of months, and this coming week I will be writing more Python, I think we can agree I failed on that objective.
Python is fine, but I don't enjoy working in that language, and so there needs to be a good reason to do so. (I have in the past quit a job because it was all Python and I found it so very dull; I feel I'm starting to reach that threshold again, and I'd like to keep working in this domain if I can.) When I started, I thought that if I built all these pipelines in Python it would be great, as the ecologists could take my code and work with it once done, but that hasn't happened, and so I'm making this compromise for no real benefit. I think the problem is that Python written by a professional software engineer attempting to ensure the code is performant looks nothing like the sort of Python a data scientist, a domain expert in their own field, writes based on a short course in Python and pandas. It looks like we're speaking the same language because we're both using Python, but we're really not.
It's not that I want to abandon the things I've worked on - I'm happy to keep supporting them - but I don't want whatever the next big project is to lead me to writing yet more Python: I need to head that off. I have managed to just about wrap up the STAR pipeline, and the LIFE pipeline is in good shape, but the challenge is that I have so much tooling built in Python currently that when someone demands new results from me, it's still the path of least resistance. I need to find a way to carve out time to build up that same capacity in OCaml. We have had a final-year undergrad do some great work on kickstarting a Yirgacheffe-like library in OCaml, so once they're done I'll try to build on that, and I need to invest some time in playing with Owl to see how useful it actually is as a numpy/pandas replacement.
Next Week
- Process the custom LIFE run - we have a meeting on this the following week, and I need maps for Ali in advance of that
- Look at my STAR code and work out if I need to fix things before I ask Chess to review my results again
- Not mess around with my self-hosting set up