Bytes not Bytearrays with Django please
27 Apr 2021
A quick post about something that tripped me up with Django recently. I didn’t find any obvious posts on this already, so here is a quick note to help me the next time I hit it. I was getting some very weird behaviour when passing a bytearray to Django, but not when using bytes. And once I understood why, it made me very sad.
In Python you have two primary types for storing sequential binary data: bytes and bytearray. The primary distinction between the two is that bytes is immutable and bytearray is mutable. Beyond this distinction, if you read the documentation for their APIs, you’ll see that they are largely interchangeable in their usage. However, they only look similar, and this causes issues when you assume they’re interchangeable but the code you’re calling hasn’t anticipated that you might pass either.
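To make the similarity and the difference concrete, here is a small sketch using only built-in types (the variable names are mine, chosen purely for illustration):

```python
# Both types hold the same sequence of byte values and share most methods.
immutable = b'hello'
mutable = bytearray(b'hello')

same_contents = immutable == mutable               # equality compares contents
both_upper = immutable.upper() == mutable.upper()  # same method on both types

# Only bytearray can be changed in place.
mutable += b', world!'

try:
    immutable[0] = 0x48  # bytes does not support item assignment
    bytes_is_mutable = True
except TypeError:
    bytes_is_mutable = False
```

Everything here works on both types except the in-place modification, which is exactly why it is so tempting to treat them as the same thing.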
For example, when sending binary data as an HTTP response in Django.
Here’s a simple artificial example for us to work with:
    import django.http

    def my_view(request):
        payload = 'hello, world!'.encode('utf-8')
        response = django.http.HttpResponse(payload)
        response['Content-Disposition'] = 'attachment; filename="mydata.bin"'
        response['Content-Type'] = 'application/octet-stream'
        return response
Here I’m taking some data and converting it to bytes. I then send that in my HttpResponse and all is well - the file saved by a browser accessing this URL will contain the expected raw bytes that represent the UTF-8 string for ‘hello, world!’.
But what if I do the following instead?
    import django.http

    def my_view(request):
        payload = bytearray()
        payload += 'hello, '.encode('utf-8')
        payload += 'world!'.encode('utf-8')
        response = django.http.HttpResponse(payload)
        response['Content-Disposition'] = 'attachment; filename="mydata.bin"'
        response['Content-Type'] = 'application/octet-stream'
        return response
Here I use a bytearray as I want to construct my response in stages (I appreciate this is an artificial example, but when I hit this, the library I was using to generate the file I wanted the view to return built it in chunks this way). Given that bytes and bytearray have a nearly identical API, I’d expect to get the same result. But no, instead the file mydata.bin now contains the following text:
    104101108108111443211911111410810033
Note that this isn’t what I see when viewing the binary file as hex: the resulting file is a UTF-8 string containing those digits!
Now, some of you might already see that the numbers here are the decimal versions of the UTF-8/ASCII codes for the ‘hello, world!’ string. So it’s like the data was converted to bytes, then each of those bytes turned into a number, and then each of those numbers written out into a file, like some weird party game (perhaps I need to go to different parties…).
So what’s going on?
This all falls over because although bytearray and bytes look very similar to humans based on the API, the type system in Python does not recognise them as similar, as is demonstrated by:
    >>> isinstance(b'hello', bytes)
    True
    >>> isinstance(bytearray(b'hello'), bytes)
    False
And with this, we now have the basis of what we need to know as to what’s going wrong here.
In Django, when you create an HttpResponse, it checks whether the data it is passed is a string or bytes, but it does not consider bytearray, and so we fall onto a separate codepath that seems to exist to try to fudge other types into a valid response - I’m sure there’s a good use case for this, but it certainly isn’t the appropriate response to a bytearray:
    @content.setter
    def content(self, value):
        # Consume iterators upon assignment to allow repeated iteration.
        if hasattr(value, '__iter__') and not isinstance(value, (bytes, str)):
            content = b''.join(self.make_bytes(chunk) for chunk in value)
            if hasattr(value, 'close'):
                try:
                    value.close()
                except Exception:
                    pass
        else:
            content = self.make_bytes(value)
        # Create a list of properly encoded bytestrings to support write().
        self._container = [content]
You can find this code here on GitHub - a hat tip to my colleague Bill, who found the offending bit of code with lightning speed for me. It’s worth noting that this is Django 2.2, which is what most of my clients seem to still be using, but the code is suitably similar in the 3.x branch and the same fate will befall bytearrays in either version.
What happens is that Django decides that whilst bytearray is iterable, it isn’t of type string or bytes, and so it takes each item in the iterable bytearray, calls ‘make_bytes’, and sticks them together. Which is funny, as the item we pass should in theory be a byte. But Python doesn’t have a byte type, only an int type!
    >>> b'asdsad'[0].__class__
    <class 'int'>
    >>> bytearray(b'asdsad')[0].__class__
    <class 'int'>
So if we look at make_bytes (found here on GitHub):
    def make_bytes(self, value):
        """Turn a value into a bytestring encoded in the output charset."""
        # Per PEP 3333, this response body must be bytes. To avoid returning
        # an instance of a subclass, this function returns `bytes(value)`.
        # This doesn't make a copy when `value` already contains bytes.

        # Handle string types -- we can't rely on force_bytes here because:
        # - Python attempts str conversion first
        # - when self._charset != 'utf-8' it re-encodes the content
        if isinstance(value, bytes):
            return bytes(value)
        if isinstance(value, str):
            return bytes(value.encode(self.charset))
        # Handle non-string types.
        return str(value).encode(self.charset)
We see that we pass in an integer that represents one byte of the UTF-8 encoding of our original string, convert that integer to a string (which gives its decimal representation), and then encode that string back to bytes!
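The whole chain can be reproduced without Django installed. The sketch below is my simplified stand-in for the two methods quoted above: the free functions `make_bytes` and `build_content` and the fixed `charset` default are assumptions for illustration, not Django’s API, but they follow the same branching.

```python
def make_bytes(value, charset='utf-8'):
    """Simplified stand-in for the quoted make_bytes method."""
    if isinstance(value, bytes):
        return bytes(value)
    if isinstance(value, str):
        return value.encode(charset)
    # Anything else, including an int, is stringified first.
    return str(value).encode(charset)


def build_content(value):
    """Simplified stand-in for the quoted content setter."""
    if hasattr(value, '__iter__') and not isinstance(value, (bytes, str)):
        # A bytearray lands here: iterating it yields ints, one per byte.
        return b''.join(make_bytes(chunk) for chunk in value)
    return make_bytes(value)


from_bytes = build_content(b'hi')
from_bytearray = build_content(bytearray(b'hi'))
print(from_bytes)      # b'hi'
print(from_bytearray)  # b'104105'
```

The two bytes of ‘hi’ (104 and 105) each become a three-character decimal string, so the two-byte payload turns into six bytes of digits.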
And that is why, if you pass a bytearray to Django’s HttpResponse, you get a sort of double expansion of the data, rather than the bytes you (or at least I) originally expected 🤦‍♀️
The solution to all this is simple: get a better type system. Alternatively, you can just convert your data to bytes before passing it to Django:
    import django.http

    def my_view(request):
        payload = bytearray()
        payload += 'hello, '.encode('utf-8')
        payload += 'world!'.encode('utf-8')
        # Convert to immutable bytes so Django takes the bytes codepath.
        response = django.http.HttpResponse(bytes(payload))
        response['Content-Disposition'] = 'attachment; filename="mydata.bin"'
        response['Content-Type'] = 'application/octet-stream'
        return response
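If several views build their payloads this way, the conversion could live in one place. The following is a minimal sketch: `ensure_bytes` is a hypothetical helper name of my own, not something Django provides. It accepts the common bytes-like types and refuses everything else instead of guessing.

```python
def ensure_bytes(payload):
    """Return payload as immutable bytes, or raise TypeError.

    Hypothetical helper for illustration; not part of Django.
    """
    if isinstance(payload, bytes):
        return payload
    if isinstance(payload, (bytearray, memoryview)):
        # bytes() copies the underlying buffer into an immutable object.
        return bytes(payload)
    raise TypeError(
        'expected a bytes-like object, got ' + type(payload).__name__
    )


buffer = bytearray()
buffer += 'hello, '.encode('utf-8')
buffer += 'world!'.encode('utf-8')
body = ensure_bytes(buffer)
```

The result can then be passed to the response constructor in place of `bytes(payload)`, and a string or other unexpected type fails loudly at the call site rather than producing a mangled download.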
There’s a whole bunch of places you can point the finger here. I’m not a fan of Django’s attempt to be clever if you pass it a data type it doesn’t recognize: I’d much rather Django just gave up and made it the caller’s problem to ensure that data is serialized to bytes. This is at best an attempt to guess, and that’s going to go wrong at some point (as it did for me yesterday).
It is also unfortunate that in Python the mutable and immutable versions of a given type aren’t properly related. But it all adds evidence to my general dislike of Python’s type system, which makes it possible to trip up like this and then makes it far from obvious what you’re doing wrong.