GDAL and FILE* vs macOS
Before xmas I did some experiments with my STAR biodiversity pipeline with respect to the validation stages. Normally STAR takes about 9k terrestrial vertebrates from the IUCN Red List and generates Area of Habitat (AOH) maps, which are then validated using the methods in Dahal et al. That paper describes two methods, one of which uses a statistical model to compare the AOHs relative to each other. In a discussion we had with some of the authors of that paper in December, it turned out that to improve accuracy we should actually be running the model on as many AOHs as we can, not just the ones that we selected specifically for inclusion in the STAR metric.
So in December I tweaked my pipeline to generate AOHs for all 34K terrestrial vertebrates on the IUCN Red List, and we did see a small but not insignificant change in the set of AOHs flagged as potential outliers by the statistical method. Given that this is what the authors of the Dahal et al method intended, and given that computing AOHs in my pipeline is relatively cheap, after chatting with the STAR authors I wanted to commit this change to the pipeline, which seemed like a quick and easy "first week back after two weeks off" type of task.
So, so naive. I was left with two mysteries, one of which I managed (with help) to get to the bottom of, and one still outstanding.
The first issue arose in generating a species richness map, where you add together all the AOHs as binary layers to get a count in each pixel of how many species the habitat in that pixel could potentially sustain. I was now getting an error that some map in the set of AOHs had the wrong map projection. This shouldn’t be the case: the map built fine when we had fewer AOHs, and when I attempted the same thing limited to a single taxonomic group (birds, mammals, etc.) the map also built fine. It seems there’s something about trying to add all of them together that causes the issue. In an attempt to debug this I then tried to load all the AOHs at once and do a brute-force comparison of their map projections to find where the mismatch was happening, and that’s when I hit the second issue.
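For the curious, a species richness map is conceptually just a per-pixel sum of presence/absence layers. Here is a minimal pure-Python sketch of that idea — not the actual pipeline code, which works on georeferenced rasters, just the arithmetic it performs:

```python
def species_richness(aoh_layers):
    """Sum binary presence/absence layers, pixel by pixel, into a species count.

    Each layer is a 2D grid where a truthy value means the habitat in that
    pixel could sustain the species. All layers must share the same shape
    (and, in the real pipeline, the same map projection -- which is exactly
    the check that was failing).
    """
    rows = len(aoh_layers[0])
    cols = len(aoh_layers[0][0])
    richness = [[0] * cols for _ in range(rows)]
    for layer in aoh_layers:
        for r in range(rows):
            for c in range(cols):
                if layer[r][c]:
                    richness[r][c] += 1
    return richness

# Toy 2x2 AOHs for two hypothetical species:
a = [[1, 0], [1, 1]]
b = [[1, 1], [0, 1]]
print(species_richness([a, b]))  # [[2, 1], [1, 2]]
```

In practice each layer is a multi-gigabyte GeoTIFF rather than a list of lists, which is why all 34K of them end up open at once.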
The second issue was that GDAL seemed to be running out of file descriptors, despite my setting up the environment appropriately. By default most operating systems set a fairly conservative limit on how many files a process can open (the default is 256 on macOS, and 1024 on the Linuxes I use), but for the data-science pipelines I work with these are way too small, so I set them to millions of files. Yet despite this I was still getting this error from GDAL. I tried to reproduce the problem on Linux, and all was fine there, so it was a macOS-specific issue.
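For reference, the per-process limit in question is `RLIMIT_NOFILE`, which you can inspect and raise from Python with the standard `resource` module. A sketch of the kind of setup I mean (note that on macOS the kernel additionally caps this via `kern.maxfilesperproc`, so you can't just ask for infinity):

```python
import resource

# Current soft (enforced) and hard (ceiling) limits on open files.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# Raise the soft limit as far as the hard limit allows for this process.
# (Raising the hard limit itself generally needs root, or on macOS a
# sysctl change to kern.maxfilesperproc.)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
new_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
```

The point of the bug is that even with these limits set to millions, GDAL on macOS was failing long before reaching them.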
To cut a long afternoon short, I eventually had to concede defeat and filed an issue on GDAL, to which I got a swift reply, and eventually we traced the problem back to a limit in the standard C library on macOS. I’m so used to dealing with POSIX APIs directly that I’d forgotten C has its own library methods for working with files, and it was in this, rather than the operating system limits, that the problem was occurring. It turns out that on macOS libc can only allocate 32K file pointers, and it was this limit that GDAL was bumping up against. Finding this involved a fun read through the Apple libc source code, which is something I used to do somewhat regularly when I worked for Bromium, and so it was a nice little nostalgia trip.
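The distinction matters because a POSIX file descriptor and a C stdio stream are different resources: the former is a small integer handed out by the kernel subject to `RLIMIT_NOFILE`, while the latter is a `FILE*` structure that libc itself allocates and tracks, with its own internal bookkeeping (and, on macOS, its own ~32K ceiling). A small illustrative sketch, using `ctypes` to call libc's `fopen` directly from Python:

```python
import ctypes
import os

# Load the C library already linked into this process.
libc = ctypes.CDLL(None, use_errno=True)
libc.fopen.restype = ctypes.c_void_p
libc.fclose.argtypes = [ctypes.c_void_p]

# os.open gives a raw POSIX file descriptor, limited by RLIMIT_NOFILE...
fd = os.open("/dev/null", os.O_RDONLY)

# ...while C's fopen returns a FILE* stream, which libc allocates and
# tracks itself -- it is this internal table, not the kernel limit, that
# tops out on macOS.
fp = libc.fopen(b"/dev/null", b"r")

os.close(fd)
libc.fclose(fp)
```

So a process can have plenty of file-descriptor headroom and still fail in `fopen` once libc's own pool of stream structures is exhausted.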
Thankfully Even Rouault of the GDAL team made a patch over the weekend to migrate from FILE* to POSIX file descriptors, which should solve this issue - which is amazing work. It’ll be a while before it’s in mainline GDAL I’m sure, but still, it was an impressive response from what must be a busy team.
Unfortunately I ran out of time last week to look into the original issue. When I had the idea that GDAL might be running out of file descriptors, I thought perhaps the same fate was hitting pyproj, as I know that when you look up a map projection it checks a database, which might well involve opening a file. So now I need to do some digging in that codebase and check whether it’s also using C file pointers under the hood.
Obviously I could just not open 34K files at once when building the species richness map. That’s not an unreasonable suggestion; however, if I don’t get to the bottom of these issues on a problem as simple and obvious as building a species richness map, then I fear the same issues will crop up again in some other, less obvious and harder-to-debug part of the pipeline. We already found one problem, down to platform library behaviour, that we helped GDAL become aware of, so as tedious as it can feel, I think this type of spelunking is definitely worth it.
Tags: weeknotes, gdal, yirgacheffe, star