Weeknotes: 23rd January 2023
Things I did last week
Rounding bugs in the AoH calculations
Alison wanted to do another run of the original AoH calculations, but ran into trouble because Yirgacheffe was complaining that the layers being compared were mismatched despite the data being essentially the same as the last time we did this. This struck me as particularly odd, as that test is one of the very first things we do, and the failure was much deeper in the code.
After a bunch of digging it turns out that there was a subtle difference in two of the GeoTIFF layers, and by subtle I mean less than 24 picometers per pixel error. If you ask gdalinfo at the command line it does some rounding of the numbers so they look identical, but using both GDAL and a plain TIFF library I confirmed the number was out. The reason the early sanity check code in Yirgacheffe didn’t catch it was that the error was lower than machine epsilon, and so was being ignored. Whilst this error is small, by the time you multiply it across tens of thousands of pixels it’s enough that it’s now above machine epsilon and starts causing trouble with rounding of numbers.
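To illustrate how a difference that slips under an epsilon check can grow once it is scaled up, here is a minimal sketch. The numbers and the `naive_match` helper are hypothetical, chosen only for illustration; they are not the values from the real layers, nor the actual check in Yirgacheffe.

```python
import sys

# Hypothetical numbers for illustration only, not values from the real layers.
pixel_scale_a = 0.0008983152841195215   # degrees per pixel
pixel_scale_b = pixel_scale_a + 1e-16   # differs by less than machine epsilon
pixel_count = 40_000                    # "tens of thousands of pixels"


def naive_match(a, b):
    """Absolute comparison against machine epsilon (an assumed form of the check)."""
    return abs(a - b) < sys.float_info.epsilon


# The per-pixel scales look identical to the naive check...
per_pixel_ok = naive_match(pixel_scale_a, pixel_scale_b)

# ...but once multiplied across the width of the layer the gap is visible.
extent_ok = naive_match(pixel_scale_a * pixel_count, pixel_scale_b * pixel_count)

print(per_pixel_ok, extent_ok)
```

The same tiny per-pixel difference passes the check on its own, but fails it once it has been multiplied across the whole layer.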
The solution, after a chat with Alison, was to pick a reasonable resolution tolerance (1 meter, given the Jung data is accurate to 100 meters at the equator), and to apply appropriate rounding checks before applying the usual math ceiling functions. This was particularly poignant as just last week I was sharing a link to an interesting page on floating point errors, and even more poignant was Patrick’s response with this XKCD cartoon, which practically explains the bug we hit.
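A minimal sketch of the idea of rounding within a tolerance before taking the ceiling is below. The function name `pixels_for_extent`, the metres-per-degree constant, and the example values are all assumptions made for illustration; this is not the code that went into Yirgacheffe.

```python
import math

METRES_PER_DEGREE = 111_320.0  # rough figure at the equator (an assumption)
TOLERANCE_METRES = 1.0         # the 1 meter tolerance mentioned above


def pixels_for_extent(extent_degrees, pixel_scale_degrees):
    """Number of pixels needed to cover an extent, tolerant of tiny float errors."""
    raw = extent_degrees / pixel_scale_degrees
    # Express the tolerance as a fraction of one pixel.
    tolerance_pixels = (TOLERANCE_METRES / METRES_PER_DEGREE) / pixel_scale_degrees
    nearest = round(raw)
    # If we are within tolerance of a whole pixel, treat it as that whole pixel
    # rather than letting ceil() push us up to the next one.
    if abs(raw - nearest) < tolerance_pixels:
        return int(nearest)
    return math.ceil(raw)


# Hypothetical example: an extent that should be exactly 100 pixels wide,
# but carries a tiny floating point error.
scale = 0.001
noisy_extent = 0.1 + 1e-12

naive = math.ceil(noisy_extent / scale)
tolerant = pixels_for_extent(noisy_extent, scale)
print(naive, tolerant)
```

A plain ceiling turns the tiny error into a whole extra pixel, whereas the tolerant version snaps back to the intended size and still rounds up when the extent really does spill into another pixel.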
Performance of GDAL vs direct TIFF processing
I picked up the thread I started two weeks ago on trying to understand the performance characteristics I was seeing in reading GeoTIFF data with GDAL, and comparing it with directly accessing the TIFF file using a straight image library. I started out by just doing some comparison of reading the file in “chunks”, but was surprised by the results. If uncompressed TIFF files were stored linearly, then I’d expect to see a performance increase as I read the data in larger chunks. As it turned out when I dug into the TIFF format, TIFFs store data in indexed blocks, and the example GeoTIFFs from Alison’s work I was reading had their blocks in a random order, so I assumed I’d just see linear performance using a chunk size that is a multiple of the block size. What I didn’t expect was performance to get worse:
 
Since then I’ve written a better test script to exercise this behaviour, and I’ve been running it across various machines. It looks like that initial experiment might have just been “bad luck” to have it so bad, but I still see a fall off at some point on Kinabalu, one of our AMD EPYC servers, as I increase chunk size.
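The test script itself isn't shown here, but a minimal sketch of this kind of chunk-size timing is below. It assumes plain unbuffered file reads rather than a TIFF library, and the function names `time_chunked_read` and `sweep` are hypothetical. Note that the operating system's page cache will affect repeated runs over the same file, so numbers from something this simple should be treated with caution.

```python
import time


def time_chunked_read(path, chunk_size, runs=5):
    """Mean seconds taken to read the whole file in chunk_size pieces."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # buffering=0 so Python's own buffering doesn't hide the chunk size.
        with open(path, "rb", buffering=0) as f:
            while f.read(chunk_size):
                pass
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def sweep(path, chunk_sizes, runs=5):
    """Time a read of the file at each chunk size, averaged over several runs."""
    return {size: time_chunked_read(path, size, runs) for size in chunk_sizes}
```

Calling `sweep("example.tif", [4096, 65536, 1048576])` (with a hypothetical file name) returns a dictionary mapping each chunk size to its mean read time, which can then be plotted.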

On my desktop machine, where I’m using a local SSD, I see more linear behaviour, though I still do see a fall off at the largest chunk sizes. (Both tests were averaged over five runs.)

At this stage results are inconclusive - it could just be that other loads on Kinabalu are tripping me up, or other loads on the lab network, etc.
Whilst I’ll dig more into this, all results do show better performance by not using GDAL to access the data, which is important: GDAL’s Python bindings are very memory leaky (something Amelia has echoed in her work), and it’s for this reason that, to date, when doing concurrency in our workloads I’ve had to ensure that processes that access data via GDAL are short lived. Instead I can both get a small access performance win by ditching GDAL, and potentially make it easier to scale up tasks by pushing concurrency management down into Yirgacheffe, hiding the scheduling issues from the ecologists (currently Ali has to deal with littlejohn (as per the AoH workflow), or she has to rely on me doing funky things with Python multiprocessing to ensure we don’t run out of memory (as per the H3 tile workflow)).
All of which is part of the argument that there’s a systems research task in building 4CEE to take care of all this resource management to let the ecologists save the planet, which segues nicely into…
Research directions
Anil and I had a chat on Friday about possible papers to submit, and we discussed trying to turn some of the 4CEE things into a HotOS position paper. I have some notes on this that I need to turn into a paper skeleton for us then to flesh out. I shall try and channel the guidance on paper structure Keshav imparted at this week’s MVP meeting.
Misc other
I reviewed Robin’s PR on the smart contracts to add more details on the emits when retiring - it’s great to see someone other than me working on the contract code :) It did also remind me I have an old PR that I need to tidy up and get merged.
Aims for the coming week
Writing
A bunch of word writing has appeared before me, which is good as I’m working from Liverpool at the start of the week due to family commitments.
- I owe Sadiq a list of deployment todos on the smart contracts for the 4C MVP so he can turn those into tickets to push us to our first proper deployment.
- Aligned with that Anil asked me to write up the key management side of the deployment, which I shall also aim to do this week. Thankfully we’re not reinventing anything here, just following best practice, so hopefully this will not be an epic document, but it’ll be good to have this down.
- I want to get the outline of the HotOS systems paper that Anil and I discussed down before I forget our grand plans.
Less pressing, but I’d like to also write up the floating point snafu as a blog post, as I think it makes a nice example of yet another computer science problem ecologists have to deal with.
I also want to learn how to use something other than Excel to plot my graphs. Amelia recommended Seaborn, so I’ll start there.
Sysadmin, kernel versions
I was reading up on things improved in the 6.0 kernel series, and noted that AMD EPYC performance and power consumption improvements are listed as major features. Both our big compute servers are EPYC systems, but are currently stuck on the 5.x series kernel provided by Ubuntu 22.04 LTS. I’ll look into how we can upgrade them to 6.x smoothly without having to wait until Ubuntu’s 23.04 release, which won’t be until April.
Biodiversity data processing proposal
Tom gave some feedback on this which I need to digest, and then I think we need to work out what bits of this we want to pick off in the near future, and particularly which bits will lead to useful publications vs which bits form the longer term 4C platform ideal.
Links of interest
I found this paper looking at HPC myths in 2022 a useful crash course into what’s hot or not in HPC research. As someone who was in the early wave of FPGA research at the turn of the millennium I was pleased to see the FPGA myth was still there :) But for 4CEE I was interested to see the discussion about DSLs and workload styles suited for CPU vs GPU in there. Not that I think what we’re building is classic HPC, but I do think this is useful as we share a lot of the same issues around workloads in the end.
Tags: weeknotes, life, amd, yirgacheffe, gdal