Weeknotes: 20th January 2025

Last week

Debugging maps

Ali and I spent a bunch of time debugging an issue in some maps I'd made that seemed to have oddities in them. I'll do a proper deep dive on this once I know we've resolved all the issues (hopefully in next week's notes, all being well), as I think what went wrong is interesting, but for now I've hopefully resolved the problem and sent some maps over to Ali for review.

But I will say that it makes me keener to follow up on some conversations I had last year with various people, notably Amelia and Jody, about how to make it easier to detect issues in large aggregate datasets. The challenge is that you end up processing terabytes of data that is then reduced to a map on a screen that people think "looks okay" but might be wrong. In particular, these maps are generated by large pipelines of scripts, and visibility into those pipelines is limited, either by the size of the outputs (e.g., intermediary maps that are hundreds of gigabytes, so aren't easily viewable on your local device), or by their number (e.g., processing tens of thousands of species, which makes it impractical to inspect each and every one on every update).

Jody has done some fun work on one of the pipelines I worked on a while back, improving it to not just spit out the files, but also to produce an interactive summary document: an HTML page with embedded maps and tables summarising the run. In geospatial work, bugs often become evident very quickly once you can see the data, so this is a great tool, and something I'd like to work out how to automate in more cases. Jody's reports were single HTML pages with all the data embedded in them, which works well for some results, but not for others (e.g., those where the output is a large geotiff raster).
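For the single-file style of report, something like the following sketch is roughly what I have in mind (this isn't Jody's actual tooling, and the file names and statistics are hypothetical): rendered maps get base64-encoded straight into the page so the whole summary travels as one file.

```python
import base64
from pathlib import Path

def embed_image(path: Path) -> str:
    """Return an <img> tag with the PNG data inlined as a data URI."""
    data = base64.b64encode(path.read_bytes()).decode("ascii")
    return f'<img src="data:image/png;base64,{data}" alt="{path.stem}"/>'

def write_summary(images: list[Path], stats: dict[str, float], out: Path) -> None:
    """Write a single self-contained HTML page with a stats table and inline maps."""
    rows = "".join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in stats.items())
    imgs = "".join(embed_image(p) for p in images)
    out.write_text(f"<html><body><table>{rows}</table>{imgs}</body></html>")

# Hypothetical example: one rendered map plus a couple of headline numbers.
write_summary(
    [Path("species_1234_aoh.png")],
    {"total_area_km2": 5321.0, "pixels_processed": 1.2e9},
    Path("summary.html"),
)
```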

I think an extension to that for me would be working out what other projections we can do of the data to make "at a glance" assessments. Can we throw in some histograms of rasters to show distributions of pixels against some nominal norm? Can we make thresholded maps that aren't suitable as a "proper" result but show trends at a glance? Having these generated by the pipeline as a matter of course, rather than as something a researcher does ad hoc when they want to check something, is again something I think should be standard (which means time has to be allowed for it when planning a project).
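As a rough sketch of the histogram idea (the file name, band, and expected bounds here are all made up for illustration), something like this could be dropped into a pipeline step:

```python
import matplotlib.pyplot as plt
import rasterio

# Hypothetical intermediary raster; band 1 holds e.g. a habitat fraction per pixel.
with rasterio.open("intermediary_aoh.tif") as src:
    band = src.read(1, masked=True)  # masked array with nodata pixels excluded

values = band.compressed()  # just the valid pixels as a flat array

fig, ax = plt.subplots()
ax.hist(values, bins=100)
ax.axvline(0.0, color="red")  # a fraction shouldn't go below 0...
ax.axvline(1.0, color="red")  # ...or above 1, so outliers show up immediately
ax.set_xlabel("pixel value")
ax.set_ylabel("pixel count")
fig.savefig("intermediary_aoh_histogram.png")
```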

In a conversation I had with Amelia just before Christmas we ended up coming up with the idea of geospatial asserts: a way to specify that at certain points in the map I know with certainty what the data should be, checked in the middle of the pipeline where no one normally looks. This point is in the sea, so there should be no trees there, and so on. As silly and obvious as that sounds, in pipelines where you can't inspect the intermediaries, and where a run can take days, this could be a huge time saver, so I wonder if I can build something like that into yirgacheffe.
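To make the idea concrete, here's a sketch of what such an assert might look like, written against rasterio rather than yirgacheffe (it's not an existing yirgacheffe feature), with a hypothetical coordinate and expected value:

```python
import rasterio

def assert_pixel(raster_path: str, lon: float, lat: float,
                 expected: float, tolerance: float = 1e-6) -> None:
    """Fail loudly if the pixel at (lon, lat) doesn't hold the value we know it must.

    Coordinates are assumed to be in the raster's CRS (plain lat/lon here).
    """
    with rasterio.open(raster_path) as src:
        value = next(src.sample([(lon, lat)]))[0]  # first band at that location
    if abs(value - expected) > tolerance:
        raise AssertionError(
            f"{raster_path} at ({lon}, {lat}): expected {expected}, got {value}"
        )

# Mid-pipeline check: this coordinate is open ocean, so tree cover must be zero.
assert_pixel("intermediary_tree_cover.tif", lon=-30.0, lat=45.0, expected=0.0)
```

The appeal is that it runs mid-pipeline and fails fast, rather than waiting days for an output someone might eventually eyeball.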

BCS Coventry talk

I gave my talk on computer science vs climate science to BCS Coventry last week. It seemed to go well, and I have a couple of emails I need to follow up on. It was also my first fully remote talk, which was quite an odd experience; I can't believe I've managed to go this long without having done one before. The main issue is the lack of visual cues to tell whether you've lost your audience. I usually use (bad) humour to get a little reaction from the crowd, just to gauge whether they're following along and whether I need to adjust my pitch, but that just doesn't work when most people are muted.

Next time I'll make it clear up front that the audience should feel free to interject during the talk rather than waiting for questions at the end, just to help me get a feel for where they are. The point of a talk is to convey information, so I think it's really important that a speaker can work out whether their current level is working or not.

Book: Five Times Faster

On Anil's recommendation I've been listening to the audiobook of Five Times Faster whilst I've had to drive around the country. I'm five chapters in so far, and enjoying it. There's been quite a bit so far on the role of science vs government, and how we on the science side present our work. Simon Sharpe, the author, argues that science reports should read more like risk assessments rather than the somewhat abstract "we have a confidence that this might happen if some other thing were to happen" style, arguing that this would be closer to the language government is used to hearing from military and medical departments. It's an interesting idea (I've done small-scale risk assessments before, so have some familiarity with them), but I suspect it'd need some extra training on the science side to ensure they're approached correctly, and would require a lot of change in academic publishing requirements. That's not to say I'm against it; I like the idea, it's just that there's inertia in the system that would need to be overcome.

He also suggests that, as with a wartime footing, in times of crisis scientists should trade novelty for more utilitarian research. I think that's preaching to the choir in the little corner of academia I'm in, but it would certainly help given that the IPCC requires two papers on a topic before they'll add it to their findings, and in general academia does not incentivise people to make a second publication on a given finding. But then in systems research the idea of repeatable research is not new, so I guess I'm already in the choir for this one too.

Sysadmin

We ran out of disk space on one of our cluster machines, which required me to work out why and chase up those responsible (and in the interests of transparency, sometimes it's me that's responsible, just not this time). The interesting thing here is that we have a bunch of different volumes mapped to a single ZFS pool, so the alert was that home was full, but actually no new data had gone into home; it was on one of the other partitions, and it took me a while to twig what was going on. I share this mostly as a reminder to future me to check not just the percentages used in the output of df, but also the absolute values.
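Because the datasets share the pool's free space, the "size" df reports for each of them shrinks as the pool fills, so a dataset's percentage can jump even when its own usage hasn't changed. A tiny sketch of the kind of check I mean (the mount points are hypothetical), reporting absolute numbers alongside the percentage:

```python
import shutil

# Hypothetical mount points; on our machine these are datasets in one ZFS pool.
for mount in ["/home", "/data", "/scratch"]:
    usage = shutil.disk_usage(mount)
    used_tb = usage.used / 1e12
    total_tb = usage.total / 1e12
    pct = 100 * usage.used / usage.total
    print(f"{mount}: {used_tb:.2f} TB used of {total_tb:.2f} TB ({pct:.0f}%)")
```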

This week

LIFE

Continue trying to get Ali the data she needs for the LIFE follow-on paper.

STAR

I said in last week's notes that I wanted to try to pick up the work on my STAR metric implementation, but other things got in the way of that.

Opam

I have been prodding one of my project students to get their code out in the open more, but I need to do this myself too. I have a series of OCaml packages I've either made myself or updated after their original owners abandoned them, and I should try to get some of these into Opam, the OCaml package system, so they're something I can more readily share. This week I got a nice email from the original maintainer of one of the packages I'd picked up, saying they're happy for me to take it over, which makes things there a little easier.

Talks

I must write my Geomob London talk.

Tags: weeknotes, ocaml, life