Pandas vs Efficiency

7 Aug 2023

Tags: numpy, pandas, profiling, python

As part of my role at the Cambridge Center for Carbon Credits (4C), working closely with ecologists to process their data, I’ve become an accidental data scientist, by which I mean instead of loading CSV files directly I now use pandas. Pandas is a popular library that makes working with large sets of data easier. Unlike numpy, which is there to make it easy to process data in multidimensional arrays, pandas works one abstraction higher: it gives columns names and treats the data as a table, rather than just raw numbers. But, similar to numpy, it provides nice syntax to help you work with that data without worrying about individual elements; you tend to work at that tabular level. That makes it particularly powerful for someone who isn’t a native programmer, but rather a domain expert in whatever the data describes, trying to process it to derive new insight.

I’m quite a fan of numpy: it makes it simple to reason about large amounts of data without worrying about each element, and at the same time it is really quite efficient at doing so. I recently tried rewriting some Python code that used numpy in a compiled language, thinking it’d be faster, but under the hood numpy already uses compiled code for its vectorized operations, and so my native version ended up no faster and harder to read.
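To give a feel for why, here is a tiny, purely illustrative comparison (not code from my project) between looping over values in Python yourself and letting numpy do the same work as one vectorized operation; the exact timings vary by machine, but the gap is typically a couple of orders of magnitude:

import timeit

import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# Pure Python: we loop over every element ourselves.
python_loop = timeit.timeit(lambda: [v * 2.0 for v in values], number=10)

# Numpy: one expression, and the loop happens in compiled code.
vectorized = timeit.timeit(lambda: values * 2.0, number=10)

print(f"python loop: {python_loop:.3f}s, numpy: {vectorized:.3f}s")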

So given its popularity, and the fact that it uses numpy under the hood, I’d assumed that pandas would similarly provide that double win of simplicity of expression and efficiency of computation, but I was mistaken: using pandas to process data turned out to be very inefficient. In this post I’m going to walk through a particular problem I was trying to solve, look into how I managed to speed it up, and then worry about what this means for the regular data scientist who isn’t also steeped in computer-science knowledge.

The problem spec

I was implementing some code that tried to find pairings between two sets of data, which we shall refer to as Set K and Set S (as that’s what they were called in my code :). The idea is that for every element in K, we want to find the closest match in S, based on two criteria. For certain properties of our element in K there must be a direct match on the corresponding property of the candidate element in S, as in they must have the same value. Then for the other properties we just want to find the closest approximation.

To make that more concrete, the data I’m dealing with are points of ecological interest, so set K is a set of points in a region of interest, and I’m trying to find the closest match for each in some other area so I can then do other comparisons later. Certain properties must match absolutely, such as the type of land for the pixel (land use class) and the regional biome class (ecoregion), but for other properties, like elevation and distance from population, I’m only interested in finding a rough match. For that rough match, because you might get conflicting nearnesses across, say, elevation and distance from population, we’re going to use an off-the-shelf distance function that takes multiple variables and gives you a single distance value, called the Mahalanobis distance. It doesn’t really matter for this discussion exactly what that is; just know that when you see it in the rest of this document, that’s what it’s doing.
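(If you want to see it in the flesh, scipy ships an implementation. The sketch below, using made-up values, shows the two ingredients that appear in the code later: an inverse covariance matrix computed from set S, and the distance call itself.)

import numpy as np
from scipy.spatial.distance import mahalanobis

# Made-up observations for set S: each row is a point, each column one of
# the variables we want in the distance (elevation, slope, population).
s_values = np.array([
	[120.0, 3.0, 25.0],
	[300.0, 8.0, 10.0],
	[150.0, 5.0, 40.0],
	[210.0, 2.0, 15.0],
])

# The "relationship between the variables" is their covariance matrix;
# Mahalanobis wants its inverse.
invconv = np.linalg.inv(np.cov(s_values, rowvar=False))

# Distance from a point in K to one candidate point in S.
k_point = np.array([140.0, 4.0, 30.0])
print(mahalanobis(k_point, s_values[0], invconv))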

Now, there are many ways to frame this problem, but I’m going to start with the naive pandas implementation, as I think it does a very good job on the simplicity-of-expression side of things.

import numpy as np
import pandas as pd
from pandas import Series

# I'm assuming scipy's implementation of the distance function here
from scipy.spatial.distance import mahalanobis

# Let us load our two sets
k_set = pd.read_parquet(k_parquet_filename)
s_set = pd.read_parquet(s_parquet_filename)

# For Mahalanobis we need to calculate the relationship between the
# variables we want as part of the distance function.
invconv = calculate_coveriance_inverse(s_set)

# This statement does the first half of the matching!
# The result is a new table that has every match between K and S on it
# with the values from both, so can be quite large!
k_and_s_joined = k_set.merge(
	s_set,
	on=['ecoregion', 'luc10', 'luc5', 'luc0'],
	how='inner',
	suffixes=['_k', '_s']
)

# This is our function to apply to each of our initial matches.
# Note that each row contains both the data of K and S, which is
# why this just takes one row as a parameter.
def distance(row: Series, iconv: np.ndarray) -> float:
	return mahalanobis(
		(row.elevation_k, row.slope_k, row.population_k),
		(row.elevation_s, row.slope_s, row.population_s),
		iconv
	)

# Now we make a new column of data, which is the distance in each
# row between the K data and S data, and add that back into our table
k_and_s_joined['distance'] = k_and_s_joined.apply(
	distance,
	axis=1,
	args=(invconv,)
)

# Now we just want, for each element in K, the one result where the
# distance value is the lowest, so we group the results based on K's
# lat/lng, and keep the row with the smallest distance in each group
result = k_and_s_joined.loc[
	k_and_s_joined.groupby(['lat_k', 'lng_k'])['distance'].idxmin()
]

# Finally we can save that result!
result.to_parquet('result.parquet')

I mostly like this code. There is certainly some algorithmic know-how required, around the idea of the merge/join and the groupby/min, but if you’ve taken the time to learn pandas, this is a nice succinct way to record what’s going on: the code does not obfuscate the methodology unnecessarily.

Unfortunately, in terms of performance this code is terrible.

It is both slow to execute (I have to confess, I never let this version finish, as with my data it took more than a few hours), and very memory hungry. I’m now going to move through a few versions where I rework this to get it to a good place in terms of performance, all of which will come at the expense of clarity of intent in the code.

Too much data

The first thing I want to tackle is the memory usage. For me, the sets S and K usually contain tens of thousands of values each. If we assume a fairly high hit rate on the first-stage matching, the table k_and_s_joined is going to contain tens of millions of rows, which is unfortunate, as most of that data will be thrown away: ultimately we want just one match per element in K.
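A rough back-of-envelope estimate, with purely illustrative numbers, shows how quickly an inner join like this balloons:

# Illustrative numbers only: suppose both sets have ~30,000 rows and
# 10% of all K x S pairs survive the exact-match join.
k_rows = 30_000
s_rows = 30_000
hit_rate = 0.1
joined_rows = k_rows * s_rows * hit_rate  # 90 million rows

# If each joined row carries ~20 numeric columns at 8 bytes each, the raw
# values alone are over 14GB, before any pandas bookkeeping on top.
print(joined_rows * 20 * 8 / 1e9, "GB")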

When I ran this with my dataset the Python process sat at around 60GB, which is quite a lot of memory to be using - it wouldn’t fit on most personal computers and laptops, for instance. We have some large compute servers where this is not an issue, but using that much memory means I can’t run many copies of this code at once, so most of the many CPU cores we have sit idle.

So the first thing I’m going to do is not merge K and S with a join, but split this into a set of loops, so that we only have one copy of each set in memory, rather than the product of the two:

# Let us load our two sets
k_set = pd.read_parquet(k_parquet_filename)
s_set = pd.read_parquet(s_parquet_filename)

# For Mahalanobis we need to calculate the relationship between the
# variables we want as part of the distance function.
invconv = calculate_coveriance_inverse(s_set)

results = []
for _, k_row in k_set.iterrows():

	# Create a new set which is just the rows in s_set
	# that have a direct match to the row in K we're
	# dealing with - equivalent to k_and_s_joined for
	# a single row of K
	filtered_s = s_set[
		(s_set.ecoregion == k_row.ecoregion) &
		(s_set.luc10 == k_row.luc10) &
		(s_set.luc5 == k_row.luc5) &
		(s_set.luc0 == k_row.luc0)
	]

	# If there are no matches move on. This isn't just an
	# optimisation, it's to avoid exceptions when later on
	# we try to take a result!
	if len(filtered_s) == 0:
		continue

	# This is our function to apply to each of our initial matches.
	# Note that each row contains both the data of K and S, which is
	# why this just takes one row as a parameter.
	def distance(s_row: Series, k_row: Series, iconv: np.ndarray) -> float:
		return mahalanobis(
			(k_row.elevation, k_row.slope, k_row.population),
			(s_row.elevation, s_row.slope, s_row.population),
			iconv
		)

	# Now we make a new column of data, which is the distance in each
	# row between the K data and S data, and add that back into our table
	filtered_s['distance'] = filtered_s.apply(
		distance,
		axis=1,
		args=(k_row, invconv,)
	)

	# Now find the one result where the distance value is the lowest.
	minimum_row = filtered_s[filtered_s.distance==filtered_s.distance.min()].iloc[0]
	results.append(minimum_row)

# Finally we can save that result!
pd_results = pd.DataFrame(results)
pd_results.to_parquet('result.parquet')

On the plus side, this code now uses much less memory! With the same sample data I’m now only using around 5GB of memory, which means we’re now into the realm of being able to run this on a personal computer, or I can run ten times as many instances of this process concurrently on my server. Not only that, but this version runs faster too - completing in around 75 minutes on my dataset.

The cost is that the code is now further away from the methodology: it’s harder to tell at a glance what it’s doing. I’m having to somewhat micromanage the computer, telling it what to do for each element of the set K rather than letting it figure out what’s best.

This already annoys me - I’ve not really done much, but I’ve already got a huge win in terms of the performance of my code, and I feel I really should have been able to get the computer to figure this out for me. What annoys me more is that, as a computer scientist, I knew to do this; pandas is meant for data scientists who are experts in domains other than computing, and yet here we are requiring them to become people who understand how programs use memory. And for my program to get better, this burden is going to get yet worse.

Why is this taking an hour?

At this point, running on the computer I have access to, with 100s of CPU cores and enough memory that I can use all those CPUs at 5GB per process, I was ready to move on. But then we ran this code on a more reasonable computer and it took three hours for this data set, and longer for the next batch, and so I was forced to go back to the code and wonder: why is it so slow?

Is it because the Mahalanobis calculation is very slow? Is it that filtering pandas data sets is very slow? This code doesn’t really do much, and even if we do need to process tens of millions of rows, computers are really fast these days: a single GHz core will do on the order of a billion simple operations per second, so the math really shouldn’t be what’s slowing it down.
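A quick sanity check, again with purely illustrative numbers, makes the point:

# Rough estimate with illustrative numbers: even being generous about how
# much arithmetic each candidate pair needs, the math alone is seconds,
# not hours.
candidate_pairs = 50_000_000   # "tens of millions" of rows to compare
ops_per_pair = 100             # a generous allowance for the distance math
ops_per_second = 1e9           # roughly one core's worth of simple operations

print(candidate_pairs * ops_per_pair / ops_per_second, "seconds")  # ~5 seconds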

Now, I could start putting print statements with timestamps into the code, but being a computerist I reached for cProfile, Python’s built-in profiling library, and ran my code again. cProfile hooks into every function call and return the program makes, recording how often each function is called and how much time is spent in it, right down to calls that complete very quickly. The downside of all that bookkeeping is that it slows the program down - what took 75 minutes now took almost three hours to run.
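If you’ve not used it before, the shape of the thing is roughly this (a minimal sketch, with a stand-in function rather than my actual matching code); you can get the same effect from the command line with python -m cProfile -o matching.prof yourscript.py:

import cProfile
import pstats

def run_matching():
	# Stand-in for the real matching loop shown above.
	return sum(i * i for i in range(1_000_000))

# Record every function call made while run_matching executes,
# writing the raw profile data to a file...
cProfile.run("run_matching()", "matching.prof")

# ...then print the twenty entries with the largest cumulative time.
pstats.Stats("matching.prof").sort_stats("cumulative").print_stats(20)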

Still, run it did, and I got an output that is just a list of all the function calls made: how often they were called, how much time was spent in them, and how much of that time was spent in the function itself rather than in the functions it called. On one hand, this is just another version of doing data science, only on the program itself; but again, the data scientists I work with are experts in ecology, not computering, and so I’d not say that this sort of program introspection is something they’d readily benefit from.

		 25949560575 function calls (25949402038 primitive calls) in 9805.165 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
	   35    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:100(acquire)
	 32/6    0.000    0.000    0.202    0.034 <frozen importlib._bootstrap>:1022(_find_and_load)
		3    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1038(_gcd_import)
	 1646    0.004    0.000    0.007    0.000 <frozen importlib._bootstrap>:1053(_handle_fromlist)
	   35    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:125(release)
	   32    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:165(__init__)
	   32    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:169(__enter__)
	   32    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:173(__exit__)
	   35    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:179(_get_module_lock)
	   35    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:198(cb)
		3    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:216(_lock_unlock_module)

...1263 lines later...

802394413   88.968    0.000   88.968    0.000 {pandas._libs.lib.is_scalar}
42419    0.008    0.000    0.008    0.000 {pandas._libs.lib.item_from_zerodim}
 6546   18.868    0.003   18.931    0.003 {pandas._libs.lib.maybe_convert_objects}
	2    0.000    0.000    0.000    0.000 {pandas._libs.lib.to_object_array_tuples}
	1    0.000    0.000    0.000    0.000 {pandas._libs.lib.to_object_array}
	1    0.179    0.179    0.179    0.179 {pyarrow._s3fs.ensure_s3_initialized}
	1    0.000    0.000    0.000    0.000 {pyarrow.lib.cpu_count}
   12    0.000    0.000    0.000    0.000 {pyarrow.lib.field}
	2    0.000    0.000    0.000    0.000 {pyarrow.lib.int64}
	2    0.000    0.000    0.000    0.000 {pyarrow.lib.register_extension_type}
	1    0.000    0.000    0.000    0.000 {pyarrow.lib.schema}
	1    0.000    0.000    0.000    0.000 {pyarrow.lib.struct}
	2    0.298    0.149    0.298    0.149 {pyarrow.lib.table_to_blocks}

As an aside, note the top line: 25.9 billion function calls! That’s a lot of function calls, just to process tens of millions of rows of data. But I guess those calls add up quickly when you’re working with data this big.

Anyway, most of the information was not interesting, but two things stood out. The first was this line:

	 1631    0.020    0.000 9792.367    6.004 frame.py:9266(apply)

This tells me that apply is being called 1631 times, which is once per entry in K for this run, which is what I’d expect, but it also tells me that 9792 seconds were spent in apply, which means the apply call is where the code spends almost all of its time! So we have a good clue here: of the two stages of filtering the data, it’s not the explicit matching stage that’s slow, but working out the distances.

The obvious conclusion to jump to then would be that it’s the distance function itself that is slow, but if we find that in the profiler output:

 89146415  468.679    0.000  874.615    0.000 distance.py:979(mahalanobis)

We can see this is called a lot, nearly 90 million times, which is the size of the product of K and S after the first-stage matching filter, but it only accounts for a small fraction of our 9792 seconds. Where is the rest of the time going? So I scrolled on, and then I spotted this:

802376453  829.165    0.000 7993.752    0.000 generic.py:5975(__getattr__)

Now, unless you understand how Python works under the hood, this is just yet another internal call that Python makes that you have no control over, but because this isn’t my first rodeo, I happen to know what it means, and what it is telling me. Python’s __getattr__ is part of how property access on an object is resolved. We know that this is happening inside the apply loop, and we can see that it’s being called a lot, and so from that I can infer that it’s this code here that’s the problem:

	(row.elevation_k, row.slope_k, row.population_k),
	(row.elevation_s, row.slope_s, row.population_s),

The problem is accessing the data on the row by name like this. Pandas has been super helpful and made it possible for us to access the data in each column as if it was a property on the row, but to do this it has to do a bunch of lookup work behind the scenes: going back to the table, finding the column names, checking the one you asked for is valid, then finding the data and passing it back. Any single lookup is fast, but do it hundreds of millions of times and it all adds up.
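You can see the effect in isolation with a tiny, purely illustrative micro-benchmark (not code from the project):

import timeit

import pandas as pd

# A single row with the fields used above.
row = pd.Series({"elevation": 100.0, "slope": 3.0, "population": 25.0})
values = row.to_numpy()

# Building the tuple via named attribute access on the Series...
by_attribute = timeit.timeit(
	lambda: (row.elevation, row.slope, row.population), number=100_000)

# ...versus plain indexing into the underlying numpy array.
by_position = timeit.timeit(
	lambda: (values[0], values[1], values[2]), number=100_000)

print(f"attribute access: {by_attribute:.3f}s, raw array: {by_position:.3f}s")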

In fact, confession time, the code I’m showing you here is a simplified version of the real code, which used a lot more variables, and looked like this:

	# The data for this row will be the same every time,
	# so don't do it in the loop
	k_info = (k_row.elevation, k_row.slope, k_row.population,
		k_row.cpc0_u, k_row.cpc0_d,
		k_row.cpc5_u, k_row.cpc5_d,
		k_row.cpc10_u, k_row.cpc10_d)

	# This is our function to apply to each of our initial matches.
	def distance(s_row: Series, k_info: Tuple, iconv: np.ndarray) -> float:
		return mahalanobis(
			k_info,
			(s_row.elevation, s_row.slope, s_row.population,
				s_row.cpc0_u, s_row.cpc0_d,
				s_row.cpc5_u, s_row.cpc5_d,
				s_row.cpc10_u, s_row.cpc10_d),
			iconv
		)

	# Now we make a new column of data, which is the distance in each
	# row between the K data and S data, and add that back into our table
	filtered_s['distance'] = filtered_s.apply(
		distance,
		axis=1,
		args=(k_info, invconv,)
	)

I’d already hoisted the calculation of the bit of K I needed out of the apply loop, out of habit - as someone who’s coded a bunch, I know that if I can do a thing once and re-use the result it’s almost always better to do so. So my instinct had saved me from being even slower here. And now you can see the numbers add up: we process 90 million rows, and inside that loop we make a tuple from 9 fields, which is our 800 million calls to getattr!

So what can one do about this? Well, for better or worse (better in this case, worse in general), there are multiple ways in pandas to achieve the same thing. Rather than accessing each item on the row as a property of the object, I can just index the row with a list of column names and let pandas narrow things down for me in one go. So now my code is:

# Let us load our two sets
k_set = pd.read_parquet(k_parquet_filename)
s_set = pd.read_parquet(s_parquet_filename)

# For Mahalanobis we need to calculate the relationship between the
# variables we want as part of the distance function.
invconv = calculate_coveriance_inverse(s_set)

results = []
for _, k_row in k_set.iterrows():

	# Create a new set which is just the rows in s_set
	# that have a direct match to the row in K we're
	# dealing with - equivalent to k_and_s_joined for
	# a single row of K
	filtered_s = s_set[
		(s_set.ecoregion == k_row.ecoregion) &
		(s_set.luc10 == k_row.luc10) &
		(s_set.luc5 == k_row.luc5) &
		(s_set.luc0 == k_row.luc0)
	]

	# If there are no matches move on. This isn't just an
	# optimisation, it's to avoid exceptions when later on
	# we try to take a result!
	if len(filtered_s) == 0:
		continue

	# The data for this row will be the same every time,
	# so don't do it in the loop
	k_info = k_row[["elevation", "slope", "population"]]

	# This is our function to apply to each of our initial matches.
	def distance(s_row: Series, k_info: Series, iconv: np.ndarray) -> float:
		return mahalanobis(
			k_info,
			s_row[["elevation", "slope", "population"]],
			iconv
		)

	# Now we make a new column of data, which is the distance in each
	# row between the K data and S data, and add that back into our table
	filtered_s['distance'] = filtered_s.apply(
		distance,
		axis=1,
		args=(k_info, invconv,)
	)

	# Now find the one result where the distance value is the lowest.
	min_distance = filtered_s.distance.min()
	minimum_row = filtered_s[filtered_s.distance==min_distance].iloc[0]
	results.append(minimum_row)

# Finally we can save that result!
pd_results = pd.DataFrame(results)
pd_results.to_parquet('result.parquet')

The code has hardly changed here - we’re just indexing with a list of column names rather than accessing the values on the row one by one - but this dropped the run time of the program with my sample data from 75 minutes to just under 10 minutes!

This is one tiny change, but the method by which I discovered it was not obvious, and I’d argue not easily discoverable by someone who’s an expert ecologist data scientist. Perhaps this tip is listed somewhere and thus they’d know to avoid the problem, but that solution doesn’t scale well: how many other tips are out there that they’re missing? By looking further into the profile output I found some other small performance wins, but what’s interesting isn’t those wins, it’s the level of knowledge of how computer programs work required to know to apply them. Pandas does such a good job of helping at the semantic level, but getting good performance out of it required a whole other level of expertise. This is in contrast to, say, numpy, which (albeit in a different domain) manages to pull off the trick of providing both semantic and computational efficiency. Even numpy will, eventually, break down this way, but the non-computer-domain-expert will get further before they hit that point.

This is another rendition of the tension I highlighted a few posts ago, as captured in the “Worse is Better” trilogy of papers by Richard P. Gabriel, between:

  • “The right thing” - having an elegant interface, with all the icky complexity of the problem hidden away inside the implementation
  • “Worse is better” - having an elegant implementation, which exposes the underlying complexities to the system’s user

At some point “The right thing” will break down, stranding the user, which is what is happening with pandas here. The counter-argument is that you should make the user understand the complexity from the start, so that they’re prepared for this. My personal preference is to try to push “The right thing” as far as you can and then provide ways to flag what’s going wrong - more people are enabled by the former than will succeed at the latter, and I’d rather enable ecologists to save the planet, even if that’s sometimes inefficient. But I digress, as I have one more performance stage to cover, which kinda sidesteps that entire debate.

Using pandas where it’s good, then getting it out the way

Recently I made a (poor) joke to my partner that I realised I’d become a data scientist when I started opening CSV files with pandas rather than just reading the contents and splitting the file up myself, as was my habit before this last year. The nugget of gold in that glib statement is that, despite my lambasting it thus far, pandas really is good when doing its thing. Pandas makes it really easy to reason about tables of data when you’re not worrying about individual values, but it seems to struggle when doing bulk calculations on that data; and I’ve already said that bulk calculation is where numpy is good, so why not let each side do what it’s best at?

Thus, I eventually ran with this code, where I use pandas to do everything up to the point where I have to access discrete values, at which point I move the data wholesale into numpy world:

# Let us load our two sets
k_set = pd.read_parquet(k_parquet_filename)
s_set = pd.read_parquet(s_parquet_filename)

# For Mahalanobis we need to calculate the relationship between the
# variables we want as part of the distance function.
invconv = calculate_coveriance_inverse(s_set)

results = []
for _, k_row in k_set.iterrows():

	# Create a new set which is just the rows in s_set
	# that have a direct match to the row in K we're
	# dealing with - equivalent to k_and_s_joined for
	# a single row of K
	filtered_s = s_set[
		(s_set.ecoregion == k_row.ecoregion) &
		(s_set.luc10 == k_row.luc10) &
		(s_set.luc5 == k_row.luc5) &
		(s_set.luc0 == k_row.luc0)
	]

	# If there are no matches move on. This isn't just an
	# optimisation, it's to avoid exceptions when later on
	# we try to take a result!
	if len(filtered_s) == 0:
		continue

	# The data for this row will be the same every time,
	# so don't do it in the loop
	k_info = np.array(k_row[["elevation", "slope", "population"]].tolist())

	# Select the data we need for the distance calculation, and
	# export that as a large numpy 2D array
	s_subset = filtered_s[["elecation", "slope", "population"]]
	s_subset_raw = s_subset.to_numpy()

	# Now work over the numpy array to find the minimum distance
	min_distance = np.inf
	min_index = None
	for index in range(len(s_subset_raw)):
		s_info = s_subset_raw[index]
		distance = mahalanobis(k_info, s_info, invconv)
		if distance < min_distance:
			min_distance = distance
			min_index = index

	# Now find the corresponding data in the original pandas data
	minimum_row = filtered_s.iloc[min_index]
	results.append(minimum_row)

# Finally we can save that result!
pd_results = pd.DataFrame(results)
pd_results.to_parquet('result.parquet')

The key bits to note here are that I used pandas to take the data I’d filtered at the first stage and select just the columns I need for the distance comparison (a thing pandas is good at), and then convert that data straight into a large numpy array and process it from there (handing over to a thing numpy is good at). I now have to do some more book-keeping as I iterate over the data to find the minimum, but the result was that I dropped from 10 minutes to 6 minutes, faster again and well below 10% of my original run time (not counting the version that never finished!).
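As an aside, once the data is a plain numpy array you could go a step further and hand the inner loop to numpy too. A sketch, with stand-in values rather than code from the project, of computing all the distances in one batched expression:

import numpy as np

# Stand-ins for the values in the loop above.
k_info = np.array([140.0, 4.0, 30.0])
s_subset_raw = np.array([
	[120.0, 3.0, 25.0],
	[300.0, 8.0, 10.0],
	[150.0, 5.0, 40.0],
])
invconv = np.eye(3)  # stand-in for the real inverse covariance matrix

# (u - v)^T VI (u - v) for every row of S at once, then the square root.
diff = s_subset_raw - k_info
distances = np.sqrt(np.einsum("ij,jk,ik->i", diff, invconv, diff))

min_index = int(np.argmin(distances))
print(min_index, distances[min_index])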

The cost is that my code is now definitely very micro-managery, and doesn’t reflect the original methodology very well - it still follows the methodology, but you have to reconstruct it from the code.

Why have you made me read all this?

There are two readings of this post. Firstly, if you’re stuck trying to improve the performance of your pandas code, then consider exporting the data to numpy if you’re doing bulk calculations rather than just manipulating columns and tables: it’ll save you some time and memory, and your electricity bill will be lower. It’d also be valid to say that for this kind of task you might want to look at tools like Spark and Dask, which do some of this lifting for you, at the expense of having to learn yet another framework properly before it’ll really be able to help you.

But secondly, and perhaps more interestingly: if you’re an expert in a domain that isn’t computer science, how do you figure this stuff out? Or, from my perspective as someone making libraries for ecologists to use: how do I make it so they don’t fall into this trap? Perhaps it’d be better if pandas didn’t have the apply function for looping over data, and just had the “dump data to numpy” route instead? Providing nothing would have been fine for me, as I already know numpy, but it might just have put off other data scientists.

Or, put another way: does everyone doing significant data science, in every domain but one, need a part-time computerist on their team? Should we just acknowledge that this stuff requires some under-the-hood systems knowledge to get right, and so the way forward is a pairing of experts? You hope that most of the time the tools do a good job, but at some point you want an expert to review things. I imagine this falls down when it comes to funding - who wants to add another person to the project in the name of efficiency when you can kludge by and your budget is already tight?

I don’t know what the answer is, but I do know that applying me to even a small set of ecologists doesn’t scale, and given the state of the climate we need to be enabling as many ecologists as we can. So with projects like yirgacheffe I plan to keep trying to do “the right thing” to empower and enable my ecologist colleagues, but perhaps I also need to learn to explicitly signal when my way isn’t the best way and expert review is needed.
