Weeknotes: 29th January 2023

The week past

Paper prototyping

The main focus has been pulling together strands for a systems paper. I think this has shown up some differences between Anil and me in what we find interesting, mostly because we tackle the “how do compscis help ecologists” question from different directions. Still, this is useful as a forcing function to get Anil and me onto the same page, if nothing else. I did a couple of iterations on the paper and left it with Anil over the weekend.

GDAL versus TiffFile

On the technical side I was mostly looking at the performance of GDAL versus TiffFile, and of GDAL today versus GDAL half a year ago.

GDAL memory leaks resolved?

Half a year ago I had terrible memory leaks using GDAL in the original AoH calculations, such that I couldn’t run more than one species per process: even running species serially in the same script suffered a significant performance drop-off after a few of them. I thought one of the benefits of moving to TiffFile would be freeing me from this constraint, as the original leak seemed to be in GDAL’s SWIG layer.

To convince myself TiffFile didn’t suffer the same problem I built a test case to reproduce the issue, but I can’t actually reproduce it: GDAL seems to work fine now. I’m not sure what’s changed, whether GDAL has had an update under the hood, or whether it’s because we now hold GDAL differently thanks to Yirgacheffe and so the memory lifecycles are quite different, but effectively the thing I’ve been scared of with GDAL for the last six months has vanished. Still, good to know I guess.
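
For the record, the test case is just the shape below: open and read a pile of rasters serially in one process and watch the resident set size, which with the old leak climbed on every iteration. This is a minimal sketch of that idea, with hypothetical file names rather than the actual AoH inputs.

    import resource

    from osgeo import gdal

    def read_raster(path):
        # Open, read, and drop a raster, as the per-species loop would.
        dataset = gdal.Open(path)
        band = dataset.GetRasterBand(1)
        _ = band.ReadAsArray()
        dataset = None  # release the GDAL handle explicitly

    def run(paths):
        for count, path in enumerate(paths, start=1):
            read_raster(path)
            # ru_maxrss is the peak resident set size so far (KB on Linux);
            # with the old leak this grew with every raster processed.
            peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            print(f"after {count} rasters: peak RSS {peak} KB")

    if __name__ == "__main__":
        # Hypothetical inputs; in practice this was one raster per species.
        run(["species_1.tif", "species_2.tif", "species_3.tif"])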

This frees me up to add parallelism to Yirgacheffe without waiting until I swap the backend out for TiffFile, which is useful as I’d like to do that swap this week.

GDAL vs TiffFile chunk performance

I did a bunch more profiling to see if I could shed light on the problem whereby reading large chunks of data, with either GDAL or TiffFile, was slower than working with smaller chunks. I don’t have any new insights, other than that under Windows something really odd goes on with my code using TiffFile:

[Chart: read performance by chunk span under Windows, GDAL vs TiffFile]

The 2- and 4-span chunks work exceptionally well with TiffFile for some reason. I need to actually look at the call trace for those and see where the difference comes from. I repeated this with different disc configurations on my Windows box and the pattern was consistent.

Still, the performance fall-off is quite small (compared with the very first tests I did when I kicked this off last week, where it was more dramatic), so I plan to shelve worrying about this for now. I would like to do one last test where I do this outside Python, as my current educated guess is that it’s just the Python memory allocator getting slower as the working set of data in memory grows. That’s a pure guess, but it’d be interesting to have a non-Python control that uses GDAL to check.
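
For reference, the profiling loop is essentially timing window reads at a range of chunk sizes; below is a minimal sketch of that shape using GDAL’s windowed ReadAsArray and tifffile’s memmap slicing. The file name and the row-count chunks are placeholders (they don’t necessarily match how the spans in the chart above are defined), and the tifffile path assumes an uncompressed, memory-mappable GeoTIFF.

    import time

    import tifffile
    from osgeo import gdal

    PATH = "habitat_map.tif"  # placeholder raster

    def time_gdal(path, chunk):
        dataset = gdal.Open(path)
        band = dataset.GetRasterBand(1)
        start = time.perf_counter()
        for yoff in range(0, dataset.RasterYSize, chunk):
            rows = min(chunk, dataset.RasterYSize - yoff)
            _ = band.ReadAsArray(0, yoff, dataset.RasterXSize, rows)
        return time.perf_counter() - start

    def time_tifffile(path, chunk):
        data = tifffile.memmap(path, mode="r")  # memory-mapped view of the pixels
        start = time.perf_counter()
        for yoff in range(0, data.shape[0], chunk):
            _ = data[yoff:yoff + chunk].copy()  # copy forces the actual read
        return time.perf_counter() - start

    for chunk in (1, 2, 4, 8, 16, 32):
        print(chunk, time_gdal(PATH, chunk), time_tifffile(PATH, chunk))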

Actions based on all this

The main thing was that I started to abstract the file layer in Yirgacheffe. Partly this is to let me swap GDAL for TiffFile eventually, as all the data shows TiffFile is consistently faster, but it’s also because to do parallelism in Python you need to use multiprocessing, where Python fakes threads with processes to avoid the GIL (Python’s global lock). That means I need an abstraction that deals with re-opening files transparently in the child processes and so on. Tedious things one has to do for Python - one day I’ll upgrade this to a better language :)
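
To make that concrete, here’s a minimal sketch of the pattern (not Yirgacheffe’s actual code): the parent only hands each child a file path and a window, and the child re-opens the dataset itself, because GDAL dataset handles can’t be pickled and passed across process boundaries.

    from multiprocessing import Pool

    from osgeo import gdal

    def _sum_window(args):
        # Runs in a child process: re-open the file here rather than
        # inheriting the parent's GDAL dataset handle.
        path, yoff, rows = args
        dataset = gdal.Open(path)
        band = dataset.GetRasterBand(1)
        return band.ReadAsArray(0, yoff, dataset.RasterXSize, rows).sum()

    def parallel_sum(path, chunk_rows=512, workers=4):
        dataset = gdal.Open(path)
        windows = [
            (path, yoff, min(chunk_rows, dataset.RasterYSize - yoff))
            for yoff in range(0, dataset.RasterYSize, chunk_rows)
        ]
        dataset = None  # close the parent's handle before forking
        with Pool(processes=workers) as pool:
            return sum(pool.map(_sum_window, windows))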

Tezos Upgrades blog post

I published the Tezos Upgrades blog post - not much new there, other than I filled in some costs for the different upgrade mechanisms. I also started writing the security story document, but that’s currently best effort against other things, and something I’ll chip away at over the next couple of weeks. I have limited wordsmithing capacity, and the proto-HotOS paper has consumed most of it this week.

The coming week

  • More paper writing assuming Anil thinks there is enough there to warrant trying for a submission.
  • Add parallelism to Yirgacheffe and benchmark it against the old single-threaded code to check that I’ve helped and not hindered.

Interesting links

  • I re-watched this talk by Timothy Roscoe from OSDI 2021, where he talks about why OSDI’s papers miss a chunk of the interesting problems with modern hardware platforms. It’s a fun reminder of how complex modern hardware is beneath the OS, and it’s worth it mostly just for the nice definition early in the talk of what an OS is.
  • Part of my goal with Yirgacheffe has been to hide parallelism behind the simple declarative interface, and I was chatting to my partner about this and she pointed me at Dask, which is a common way to scale up numpy-centric code over many cores. I need to play with it to see how it maps to the kind of challenges we have. Anil already pointed out it doesn’t have a GPU story (though they do talk about CuPy?), and I suspect it assumes you’ve already sorted out data-alignment issues, and I’m not sure how it deals with vector layers. Still, given how much it overlaps in theory, I should understand what it could do for us and where it falls short (there’s a quick sketch of what I mean just after this list).
  • I came across Geo-Zarr, an alternative data storage format to GeoTIFF that has been specified and, I gather, is being actively used. The way it chunks data into 2D grids certainly looks to make a lot more sense than GeoTIFF’s stripe-based approach, which is something I’ve started to hit up against in trying to speed up the biodiversity work.
  • One thing I’ve been interested in is having some built-in traceability in our result files, say including the hash of the source data and the code, plus other details like what hardware was used, to help with reproducibility and verifiability. Currently I stash such things into the metadata section of Parquet files when I generate them (having ditched CSV for this reason); the rough shape of that is also sketched after this list. I found the technical term for this is Data Lineage, and I’ve been reading various articles to find out what others do in this space.
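
On the Dask point, the sort of thing I have in mind is below: wrap rasters in chunked dask arrays and let Dask schedule the arithmetic over cores. This is a hypothetical sketch, not anything wired into Yirgacheffe, and it assumes the layers are already aligned and small enough to load as plain numpy arrays.

    import dask.array as da
    import tifffile

    # Hypothetical, already-aligned rasters.
    habitat = tifffile.imread("habitat_map.tif")
    elevation = tifffile.imread("elevation_map.tif")

    # Chunk the arrays so Dask can schedule per-chunk work across cores.
    habitat_da = da.from_array(habitat, chunks=(1024, 1024))
    elevation_da = da.from_array(elevation, chunks=(1024, 1024))

    # A toy AoH-style calculation: suitable habitat within an elevation band.
    suitable = (habitat_da == 1) & (elevation_da > 200) & (elevation_da < 2000)
    print(suitable.sum().compute())  # compute() runs the chunked graph in parallel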
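
And for the Parquet metadata stashing, this is roughly what I mean, using pyarrow; the table, key names, and values here are illustrative rather than the ones in the actual results pipeline.

    import hashlib

    import pyarrow as pa
    import pyarrow.parquet as pq

    def file_sha256(path):
        with open(path, "rb") as handle:
            return hashlib.sha256(handle.read()).hexdigest()

    # Illustrative results table.
    table = pa.table({"species_id": [101, 102], "aoh_km2": [1234.5, 678.9]})

    # Merge lineage information into the schema-level metadata.
    lineage = {
        b"source_data_sha256": file_sha256("habitat_map.tif").encode(),
        b"code_git_commit": b"abc1234",  # placeholder commit hash
        b"hardware": b"workstation-1",
    }
    existing = table.schema.metadata or {}
    table = table.replace_schema_metadata({**existing, **lineage})
    pq.write_table(table, "results.parquet")

    # The metadata travels with the file and can be read back later:
    print(pq.read_schema("results.parquet").metadata)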

Tags: gdal, tezos, yirgacheffe