Weeknotes: Automatic reprojection of geospatial data in Yirgacheffe.
Yirgacheffe projection fun
In the previous set of weeknotes I wrote about how I was investigating integrating map data at different resolutions: effectively, how I can increase the resolution of a coarse dataset using some other, higher-resolution dataset. This project threw up a new use case for Yirgacheffe (my declarative geospatial library for Python), as I was doing more exploratory work rather than building a relatively well-defined pipeline, and it exposed a weird bug in projections.
In a large pipeline like STAR or LIFE, where I will process data for tens of thousands of species, I'll reproject the input maps into a common projection before I start, typically just using an external tool like GDAL's warp. The input layers for these pipelines are hundreds of gigabytes in size, and so whilst I might not need all the data in a given map, because it's used so frequently, reprojecting it all once at the start of day is a big saving. You can see it here in STAR and here in LIFE, for example.
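For context, that kind of one-off, start-of-pipeline reprojection looks roughly like this when driven from Python via GDAL's own bindings. This is just a sketch: the filenames and the choice of Mollweide are illustrative, not taken from the actual STAR or LIFE pipelines.

# Reproject a large input layer once, up front, into the projection the
# rest of the pipeline will use.
from osgeo import gdal

gdal.UseExceptions()

gdal.Warp(
    "habitat_map_mollweide.tif",   # output, written once at the start
    "habitat_map_wgs84.tif",       # large input layer in its native projection
    dstSRS="ESRI:54009",           # target projection shared by the pipeline
    resampleAlg="near",            # nearest neighbour, safe for categorical data
    creationOptions=["COMPRESS=LZW", "TILED=YES"],
)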
However, in exploratory work I might not yet know which projections are useful or which data layers I want, and so it'd be useful to have Yirgacheffe just make this problem go away: it knows how to use GDAL, and it knows the projections of all the inputs, so it's not a hard leap to see that it could just pick a common projection, reproject data on demand, and generate the output.
Unfortunately, there are two reasons why my dream of magicking it all away is unreasonable.
Firstly, reprojection of geospatial data is lossy: you will lose information in the process. If I project a map from projection A to projection B and back again, it's highly unlikely I'll get exactly the same map back. And so automatic reprojection under the hood might cause unexpected artefacts or precision loss, which is unacceptable in science, particularly in large data-science pipelines where the loss might go unnoticed as we aggregate lots of results together.
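As a rough, standalone illustration of that round-trip loss (mine, not from either pipeline; filenames are placeholders), you can warp a raster out to Mollweide and back with GDAL and compare the result with the original:

from osgeo import gdal
import numpy as np

gdal.UseExceptions()

# Project A (WGS84) -> B (Mollweide) -> back to A.
gdal.Warp("roundtrip_mollweide.tif", "original_wgs84.tif", dstSRS="ESRI:54009")
gdal.Warp("roundtrip_wgs84.tif", "roundtrip_mollweide.tif", dstSRS="EPSG:4326")

original = gdal.Open("original_wgs84.tif").ReadAsArray()
roundtrip = gdal.Open("roundtrip_wgs84.tif").ReadAsArray()

# In general neither the pixel grid nor the values survive the round trip:
# the shapes may differ, and even on a matching grid the resampling will
# have smeared values across pixel boundaries.
print(original.shape, roundtrip.shape)
print(original.shape == roundtrip.shape and np.array_equal(original, roundtrip))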
Secondly, you also want to specify the algorithm used for reprojection, as which technique is appropriate will depend on the data. If I'm using an elevation map where every pixel is a height in metres, then an averaging technique might be fine. But if I have a land cover map where each pixel is an enumeration (1 = forest, 2 = sand, 3 = water, etc.) then averaging is not very helpful, and you'd want to select the most frequently occurring value or the nearest neighbour instead.
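As a sketch of what that choice looks like in practice, again using GDAL's Warp directly and with illustrative filenames:

from osgeo import gdal

gdal.UseExceptions()

# Continuous data (e.g. elevation in metres): averaging the contributing
# pixels is a reasonable choice.
gdal.Warp("elevation_mollweide.tif", "elevation_wgs84.tif",
          dstSRS="ESRI:54009", resampleAlg="average")

# Categorical data (1 = forest, 2 = sand, 3 = water, ...): averaging would
# invent meaningless in-between classes, so take the most frequently
# occurring value ("mode") or the nearest neighbour ("near") instead.
gdal.Warp("landcover_mollweide.tif", "landcover_wgs84.tif",
          dstSRS="ESRI:54009", resampleAlg="mode")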
As a result, the data scientist has to be aware of the conversions and has to take part in selecting how reprojections occur. So in Yirgacheffe the reprojection becomes an explicit step in the calculation:
import yirgacheffe as yg

with (
    yg.read_raster("data_in_wgs84.tif") as layer1,
    yg.read_raster("data_in_mollweide.tif") as layer2,
):
    # Lazily reproject layer1 into layer2's projection using nearest
    # neighbour, then multiply the two aligned layers together.
    result = layer1.reproject_like(layer2, method=yg.Nearest) * layer2
    result.to_tiff("output_in_mollweide.tif")
It's a little ugly as a syntax right now, but that reprojection will be evaluated lazily when the calculation takes place, and Yirgacheffe will do all of the coordinate conversions required to work out which area of the other projection to read so that the layers align, do any required buffering, and do all the usual chunking of large datasets and parallelisation it would otherwise do.
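To give a flavour of that coordinate conversion step (this is my own sketch of the general idea, not Yirgacheffe's actual code), working out which area of a WGS84 layer corresponds to a chunk in Mollweide looks something like this with GDAL's osr module:

from osgeo import osr

wgs84 = osr.SpatialReference()
wgs84.ImportFromEPSG(4326)
wgs84.SetAxisMappingStrategy(osr.OAMS_TRADITIONAL_GIS_ORDER)  # keep lon/lat order
mollweide = osr.SpatialReference()
mollweide.ImportFromProj4("+proj=moll +datum=WGS84")

# Take the corners of a chunk in Mollweide coordinates (metres) and transform
# them back into WGS84; the area to read from the WGS84 layer is then the
# bounding box of those corners, padded a little to buffer the resampling.
to_wgs84 = osr.CoordinateTransformation(mollweide, wgs84)
corners_m = [(0.0, 0.0), (100000.0, 0.0), (0.0, -100000.0), (100000.0, -100000.0)]
corners_deg = [to_wgs84.TransformPoint(x, y)[:2] for x, y in corners_m]
xs, ys = zip(*corners_deg)
print(min(xs), min(ys), max(xs), max(ys))  # west, south, east, north in degrees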
I have most of this in a PR, but I want to do a bunch more testing before I merge it into main. That said, I really am pleased with how quickly it came together. The main limitation compared with gdalwarp (which, to be clear, is what Yirgacheffe calls under the hood) is that some of gdalwarp's algorithms rely on global state, and because Yirgacheffe chunks up your calculation to optimise performance and resource usage, those algorithms can't be used.
Still, I think it's an exciting addition to the Yirgacheffe functionality.
Tags: yirgacheffe, weeknotes