Python Bytes and Characters

Published May 28, 2019

Last week I lost two days of my life fighting with Python, encoding, and Supervisor. It all starts with the fact that Python2 doesn’t handle encodings very well. I’ll leave the discussion of how Python2 handles encodings at that because Python2 is dead at the end of 2019 but suffice it to say that in Python2 bytes and characters were mostly handled the same and it wasn’t ideal because oftentimes characters are made up of multiple bytes.

At work we use Supervisor to control hundreds of programs across about ninety servers. It lets us assign processes to a host and then ensure that they are running. One feature that we take heavy advantage of is the ability to run event listeners inside Supervisor to capture things like logging events from our programs and then run those log lines through a monitoring system and eventually to an event management system.

Until last week we had been running Supervisor version 3 on Python 2. According to the documentation for Supervisor event listeners your event listener communicates with Supervisor by listening on stdin and talking back on stdout. Both stdin and stdout are, by default in Python 2 and Python 3, character streams. In Python2 this matters less but in Python3 this matters more. Supervisor will send new events to your event listener by sending you one line that your listener will read using readline. That one line will have a standard formatting that includes a len field and the len field should, according to the documentation, indicate the number of bytes that you should then read from stdin to get the body of the event. Because Python2 didn’t really distinguish between bytes and characters, this totally worked with Supervisor3 on Python2. My event listener looked roughly like this:

stdin = sys.stdin.buffer
while handle in select.select([stdin], [], [])[0]:
    line = handle.readline()
    if not line:
        raise EOFError("received eof from supervisord")

    # decode the line to utf8
    line = line.decode("utf-8", "backslashreplace")

    # read the parts of the header line
    header = dict([x.split(":") for x in line.split()])
    data_length = int(header.get("len", 0))
    if data_length == 0:
        return (header, None, None)

    # read more to get the payload
    data = handle.read(data_length)
    if not data:
        raise EOFError("received eof from supervisord")

    # decode the data to utf8
    data = data.decode("utf-8", "backslashreplace")

    if '\n' in data:
        # this message has additional data so extract it out
        event, data = data.split('\n', 1)
        event = dict([x.split(":") for x in event.split()])
        return (header, event, data)
    else:
        event = dict([x.split(":") for x in data.split()])
        return (header, event, None)

Basically we use the byte stream version of stdin (stdin.buffer), wait for data to come in to stdin then we readline it, get the payload length, and then read that number of bytes from the buffer and repeat for the next event. With Supervisor3 running on Python2 and our event listener running on Python3 this worked great and was in line with how the documentation for Supervisor said that everything should work.

As said, last week we upgraded Supervisor to version 4 and started running it on Python 3. Things suddenly stopped working but very, very randomly. After a lot of digging we discovered that our event listener was reading in less data than it was supposed to. The event listener would say that ten bytes of data were in the buffer and the listener would read ten bytes and then the next time the listener ran readline it would get garbage data.

It turned out that any time we had a program that logged something that contained variable byte characters (i.e. UTF-8 data) that our event listener would break. After digging through the code for Supervisor it was discovered that this was because Supervisor was sending the listener data that was already encoded in UTF-8 and the length it was giving us was characters and not bytes. UTF-8 is a variable length character encoding. Each character can be one or more bytes. For the vast majority of normal, English language characters (e.g. A through Z, 0 through 9) each character is one byte and so we weren’t experiencing a problem. Some of our programs, however, were logging data that went beyond the normal A through Z and those ended up being more than one byte per character thus why the event listener was reading less data than there actually was.

The solution to this was to turn our byte stream into a character stream so that read would work in characters instead of bytes. Thus we ended up with this solution which does not work and you should keep reading to see why:

stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")
while handle in select.select([stdin], [], [])[0]:
    line = handle.readline()
    if not line:
        raise EOFError("received eof from supervisord")

    # read the parts of the header line
    header = dict([x.split(":") for x in line.split()])
    data_length = int(header.get("len", 0))
    if data_length == 0:
        return (header, None, None)

    # read more to get the payload
    data = handle.read(data_length)
    if not data:
        raise EOFError("received eof from supervisord")

    if '\n' in data:
        # this message has additional data so extract it out
        event, data = data.split('\n', 1)
        event = dict([x.split(":") for x in event.split()])
        return (header, event, data)
    else:
        event = dict([x.split(":") for x in data.split()])
        return (header, event, None)

It is pretty identical except that we wrap stdin in a Python3 built-in that converts everything off of the stream into UTF-8. Then we don’t need to do the conversion later and when the event listener calls read it reads characters and not bytes. Problem solved?

After I implemented this the next day we were having a different problem. Event listeners on various servers started blocking. Instead of declaring that they were reading garbage data they just stopped reading data entirely.

It turns out most of our programs print data the normal Unix way: write some data, add a line feed (aka LF aka \n) to the end, repeat. Some of our programs, though, were echoing to their logs data that they got from remote sources that might include line endings other than a simple line feed and instead might print the Windows standard “carriage return line feed” or “CR LF” or \r\n.

By default, io.TextIOWrapper implements what Python calls universal newlines. This “feature” annoyingly converts anything that even looks like a new line into \n so that \r\n becomes \n. Now when a program prints a log line that is ten characters long and has \r\n in the middle of it the TextIOWrapper converts that \r\n into \n. Subsequently the stream reader in the event listener receives nine characters and blocks forever waiting for a tenth character that never will appear.

Thanks Python for trying to be helpful but ultimately not doing what I expect. The solution is to add the argument newline='\n' to the io.TextIOWrapper line so that the wrapper passes you the raw data and doesn’t try to mess with newlines.

Now I have a bug report open on the Supervisor project to address this either by fixing Supervisor to send bytes like the documentation says or by fixing the documentation to say “characters” like is implemented. I would prefer that they send me bytes and let me do the encoding conversion since maybe my programs are printing something non-UTF-8 like CP1256 or Mac Roman or whatever. Right now if your program spits out something that isn’t valid UTF-8 they will replace your log line with the word “Undecipherable” and followed by some object representation of the bytes and maybe I know what it is and I’d like to convert it to something I find useful. Let’s wait and see what happens!