Weeknotes: 27th February 2023

Week in review

MVP related tasks

The prior week I had worked mostly on MVP-related things that were then waiting for feedback from others, and last week started with responding to said feedback:

Now staging is running the full stack as we envisaged it (at least from a Tezos/key management perspective), which is a good point to have reached.

On Thursday Anil, Patrick, and I had a talk about whether we were ready to unleash the X4C stack on the University as a whole, and we decided we're not. As a personal reflection, I think we were a bit naive to think we could go from zero users to a working University deployment with no stages in between - I can't imagine doing that in a start-up, but because the University feels like "inside" it felt like a trial deployment, when in reality it would have come with the support load of a real customer, so I'm glad we flinched. The compromise position - a fixed-time, focussed trial in the Computer Lab - makes much more sense to me, so I'm fairly pleased with that (as an engineer; other 4C members may be more or less disappointed :).

The other good outcome of that meeting was we went over the good/bad/ugly list that we’d written the previous week and found the implicit narrative there, which was about the maturity of the Tezos stack when you’re just trying to be a consumer rather than a project trying to innovate the blockchain. I agreed with Anil that I’d take a stab at doing a “long form writing” version of this for the Tezos foundation report.

Geospatial computing I

I took a look at some of the issues Alison had hit on her last attempt at generating the hex tile values for the biodiversity work, as she'd flagged a small number of failures.

The catastrophic failures (ones where we ended up with an unhandled exception) I couldn't reproduce, so I'm going to put that down to Alison running an older version of the hex tile software with a newer version of Yirgacheffe, at least until I have more data. The errors were memory related, but I was keeping an eye on Kinabalu at the time, and I don't think other people were particularly hammering it. I'll pick these up again when Alison returns from PTO.

However, I did find a more subtle and annoying bug, which means she'll need to re-run the hex-tile code from scratch, which is a bit of a pain, and highlights a testing gap I have with these large data systems. We only found it thanks to a final sanity check I put into the hex-tile calculator, which verifies that the area of habitat attributed to hex tiles is within 0.00001% of the originally calculated AoH, to allow for floating point issues. A few species showed results above this threshold (less than 0.01% in the worst case, but that's unexpectedly high), and more alarmingly they had a +ve error, meaning we ended up with more area than we started with, which is usually a signifier of double counting pixels. I dug into this, and indeed at some point I had introduced a regression to yirgacheffe whereby some hex tiles were being offset by a pixel, once again due to code being overly sensitive to rounding errors, as I explained a few weeks back.
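
To make that check concrete, here's a minimal sketch of the kind of test I mean (the function and variable names are mine, not the actual hex-tile code):

```python
# Sketch only: the names here are made up, not the actual hex-tile code.
# The habitat area attributed to the hex tiles should sum to the original
# AoH within a tiny tolerance; a positive discrepancy is the worrying case,
# as it suggests pixels have been counted in more than one tile.
TOLERANCE = 0.0000001  # 0.00001% as a fraction, to allow for floating point noise

def check_species_total(original_aoh: float, per_tile_areas: list[float]) -> None:
    tiled_total = sum(per_tile_areas)
    relative_error = (tiled_total - original_aoh) / original_aoh
    if abs(relative_error) > TOLERANCE:
        direction = "over" if relative_error > 0 else "under"
        raise ValueError(f"Hex tiles {direction}-count AoH by {relative_error:.6%}")
```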

You can see the issue in the following image, where all the tiles are given a uniform pixel value and then added together.

You should see just a single uniform coloured blob, but instead there are gaps and overlaps, and this is what was causing the errors I was seeing.

[Image: Screenshot from 2023-02-21 15-02-07.png - summed hex tiles showing gaps and overlaps]
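
The check behind that picture is simple enough to sketch in a few lines of Python (a hypothetical helper, not the real tiling code):

```python
import numpy as np

# Sketch only, not the real tiling code: rasterise each hex tile as a mask
# of 1s on the same pixel grid, sum them, and look for pixels covered by no
# tile (gaps) or by more than one tile (overlaps, i.e. double counting).
def check_tiling(tile_masks: list[np.ndarray]) -> tuple[int, int]:
    coverage = np.zeros_like(tile_masks[0], dtype=np.int32)
    for mask in tile_masks:
        coverage += mask
    gaps = int(np.count_nonzero(coverage == 0))
    overlaps = int(np.count_nonzero(coverage > 1))
    return gaps, overlaps
```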

Bugs happen, but what's particularly vexing here is that I did have some tests to try to catch this sort of thing, which I added when we wrote the original hex-tile code. The problem was that the test points I picked didn't happen to exhibit this particular regression - as ever with geospatial code, it's somewhat location specific. And this is the interesting bit to me: with geospatial work it's not practical to have a unit test that checks every hex tile at every zoom level, it'd be too slow, but clearly I've let an expensive mistake propagate because of my limited sampling approach. Randomisation in testing might have caught it, but as a general rule you NEVER EVER want randomisation in unit tests, as that leads to intermittent test failures, which are a considerable source of friction in production software teams. The solution, I think, which I'll need to talk to Patrick about, is having a set of unit tests that we tag as only being run as part of a cron-scheduled job, say once a day or once a week, where we do the full, slow, exhaustive test run. That keeps the PR/commit unit tests fast, but gives us better confidence that if you pick a build of yirgacheffe that's been through the exhaustive run then you're good to spend a couple of weeks running that code.
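
For what it's worth, pytest already has the machinery for this sort of split; a minimal sketch (the "exhaustive" marker name is my own invention) would be:

```python
# Sketch only: tagging slow tests so they only run on a scheduled job.
# The "exhaustive" marker name is made up here, and would need registering
# (e.g. in pytest.ini under "markers =") to avoid warnings.
import pytest


@pytest.mark.exhaustive
def test_every_hex_tile_at_every_zoom_level():
    ...  # the full, slow run the cron job would do


def test_sampled_hex_tiles():
    ...  # the fast sampled check that runs on every PR/commit

# PR/commit CI would run:      pytest -m "not exhaustive"
# The scheduled job would run: pytest
```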

In lieu of that I added a few tests that exercise this particular failure and then fixed the bug, so AFAICT Alison is good to go again.

Note that if we had Ark^WUntitled Goose Platform, then we’d be able to work out which results were impacted by my contaminated code, and only re-run the necessary sets of data that might be impacted by it :)

Geospatial computing II

I helped Kate Dewally with a small computation problem that was running slowly in R. I turned it into a trivial program with Yirgacheffe which, admittedly to my surprise, completed much faster than the R version, which made me happy and gave Kate a result she could use. I say surprise, as it wasn't like Yirgacheffe was being particularly clever in this instance, but I suspect it was being slightly more aggressive with memory than R, better utilising the RAM on sherwood. However, my sense of victory was short lived: I felt this was a good chance to move another team workflow to Yirgacheffe to help me get more use-case exposure, but the next stage in Kate's pipeline involves resampling the images, and that's still on my Yirgacheffe todo list. There's never enough hours in the day!

I thought about adding resampling to Yirgacheffe as is, which I feel we could do automatically given Yirgacheffe knows both the input geo_transform and the output geo_transform, but I'm starting to feel that at some point I need to part with the pure Python implementation and move to a better backend that'll cope with concurrency etc. My current plan here is to take advantage of the fact that Yirgacheffe builds up an AST which it then executes when the user requests the result (rather than carrying out work incrementally), and to export that expression to another backend written in a better-suited language. The obvious choice, given the team's talents, is that I write the backend in OCaml, and I had a chat about how I might do that, but the one thing that nags me is the lack of GPU support in OCaml, which is why I've been dragging my feet on abandoning Python, as CuPy makes it really easy to add GPU support in Python itself.
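
To make that concrete, here's a toy sketch of the idea - nothing like the real Yirgacheffe internals - of serialising a lazily built expression tree to an s-expression that an OCaml backend could then parse and evaluate:

```python
from dataclasses import dataclass

# Toy sketch only, not the real Yirgacheffe internals: a lazily built
# expression tree serialised to an s-expression that another backend
# (say OCaml) could parse and evaluate over the actual raster data.
@dataclass
class Node:
    op: str      # e.g. "add", "mul", or "layer" for a leaf
    args: tuple  # child Nodes, or a filename for a "layer" leaf

    def to_sexpr(self) -> str:
        if self.op == "layer":
            return f'(layer "{self.args[0]}")'
        return "(" + self.op + " " + " ".join(a.to_sexpr() for a in self.args) + ")"

# Prints: (mul (layer "habitat.tif") (layer "range.tif"))
expr = Node("mul", (Node("layer", ("habitat.tif",)), Node("layer", ("range.tif",))))
print(expr.to_sexpr())
```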

The coming week

  • Seems I was wrong that I'd escape the Tezos stuff this coming week, as I now have this writing to do for the Tezos foundation report. I did a quick draft of that last Friday, just to check with Anil that I was heading in the right direction, and now he's okayed it I'll try and get it done ASAP, so I can get some clean space to work on the Ark stuff, which is at the point where I just need to find a few solid days to make progress on it, rather than trying to fit it in around other pieces.
  • I need to sync up with Alison re the hex-tile work to work out what our next stages are, and to catch up on the graphing of the end-to-end workflow that I mentioned the week before last.
  • I have a short week, as I'm on PTO on Friday, but any time not spent on the above will probably go on getting Yirgacheffe to spit out s-expressions and then executing them in OCaml, to get a proof-of-concept of that versus continuing to add yet more to the Python version.

Interesting links

Tags: weeknotes, life, tezos