Screwing up Python compatibility: unicode(), str(), and bytes()

Posted on 2009/07/02 by niemeyer

Backwards and forwards compatibility is an art. In its most basic and generic form, it consists of organizing the introduction of new concepts while allowing people to keep their existing assets working. In some cases the new concepts are disruptive, in the sense that they prevent the original form of the asset from being preserved completely, and then careful consideration has to go into creating a migration path which is technically viable and which, at the same time, helps people keep the process in mind. A great example of what not to do when introducing such disruptive changes has happened in Python recently.

Up to Python 2.5, any string you put within normal quotes (without a leading character marker in front of them) would be considered to be of the type str, which originally was used for both binary data and textual data, but which in modern times came to be seen as the type to be used for binary data only. For textual information, the unicode type was introduced in Python 2.0, and it provides easy access to all the goodness of Unicode. Besides converting to and from str, it’s also possible to use Unicode literals in the code by preceding the quotes with a leading u character.
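To make the conversion concrete, here is a minimal sketch of moving between textual and binary data. It is written to run under Python 3 (where the u prefix is accepted again since 3.3); the sample word and the choice of UTF-8 are assumptions made only for illustration.

```python
# Text: the u prefix marks a Unicode literal in Python 2
# (Python 3.3+ accepts the prefix again as a no-op).
text = u"caf\u00e9"

# Converting text to binary data requires naming an encoding.
data = text.encode("utf-8")

# One character can take several bytes: 4 characters, 5 bytes here.
print(len(text), len(data))

# Decoding with the same encoding recovers the original text.
roundtrip = data.decode("utf-8")
print(roundtrip == text)
```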

This evolution has happened quite cleanly, but it introduced one problem: both types were seen as the main way to input textual data at one point in time, and the language syntax makes it very easy to use either type interchangeably. Sounds good in theory, but the types are not interchangeable, and what is worse: in many cases the problem is only seen at runtime, when incompatible data passes through the code. This is what gives form to the interminable UnicodeDecodeError problem you may have heard about. So what can be done about this? Enter Python 3.0.
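The runtime nature of the problem can be sketched as follows. Python 2 decoded the binary side implicitly with its ASCII default codec whenever the two types were mixed; the helper below (py2_style_concat, a hypothetical name) does that same step explicitly so the sketch runs under Python 3.

```python
def py2_style_concat(text, raw):
    """Hypothetical helper: join text and binary data the way Python 2 did,
    by first decoding the binary side with the ASCII default codec."""
    return text + raw.decode("ascii")

# Pure-ASCII data passes, so the bug can stay hidden for a long time.
greeting = py2_style_concat("Hello, ", b"Gustavo")

# The first non-ASCII byte triggers the failure, and only at runtime.
try:
    py2_style_concat("Hello, ", b"Jos\xc3\xa9")
    failed = False
except UnicodeDecodeError:
    failed = True

print(greeting, failed)
```

Nothing in the source code distinguishes the working call from the failing one; only the data does.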

In Python 3.0 an attempt is being made to sanitize this, by promoting the unicode type to a more prominent position, removing the original str type, and introducing a similar but incompatible bytes type which is more clearly oriented towards binary data.
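A small sketch of the resulting Python 3 behaviour, with variable names chosen only for illustration:

```python
text = "abc"    # str: a sequence of Unicode code points
data = b"abc"   # bytes: a sequence of integers in range(256)

kinds = (type(text).__name__, type(data).__name__)

# There is no implicit conversion: mixing the two is a TypeError.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Crossing the boundary is always explicit.
joined = text + data.decode("ascii")

print(kinds, mixed_ok, joined)
```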

So far so good. The motivation is good, the target goal is a good one too. As usual, the details may complicate things a bit. Before we go into what was actually done, let’s look at an ideal scenario for such an incompatible change.

As mentioned above, when introducing disruptive changes like this, we want a good migration path, and we want to help people keep the procedure in mind, so that they do the right thing even though they’re not spending too many brain cycles on it. Here is a suggested schema of what might have happened to achieve the above goal: in Python 2.6, introduce the bytes type, with exactly the same semantics as what will be seen in Python 3.0. During 2.6, encourage people to migrate str references in their code to either the previously existent unicode type, when dealing with textual data, or to the new bytes type, when handling binary data. When 3.0 comes along, simply kill the old str type, and we’re done. People can easily write code in 2.6 which supports 3.0, and if they see a reference to str they know something must be done. No big deal, and apparently quite straightforward.
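As a rough illustration of the style of code such a schema would encourage, the sketch below keeps binary data at the edges and text in the middle. The function name shout and the UTF-8 default are assumptions for the example, not something taken from Python itself.

```python
def shout(raw, encoding="utf-8"):
    """Hypothetical example: accept binary data, work on text, return binary data."""
    text = raw.decode(encoding)       # binary -> text at the input boundary
    loud = text.upper()               # all processing happens on text
    return loud.encode(encoding)      # text -> binary at the output boundary

result = shout(b"caf\xc3\xa9")
print(result)
```

No reference to the ambiguous str name is needed anywhere, which is the point of the scheme.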

Now, let’s see how to do it in a bad way.

Python 2.6 introduces the bytes type, but it’s not actually a new type. It’s simply an alias to the existing str type. This means that if you write code to support bytes in 2.6, you are actually not writing code which is compatible with Python 3.0. Why on earth someone would introduce an alias in 2.6 which generates code incompatible with 3.0 is beyond me. It must be some kind of anti-migration pattern. Then, Python 3.0 renames unicode to str, and kills the old str. So, the result is quite bad: Python 3.0 has both str and bytes, and they both mean something else than they did in 2.6, which is the first version which supposedly should help migration, and not a single one of the three types from 2.6 got its name and semantics preserved in 3.0. In fact, just unicode exists at all, and it has a different name.
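A short sketch of how the same source gives different answers on the two versions. It runs under Python 3; the comments note what Python 2.6 produces for the same lines.

```python
data = b"abc"

# On Python 2.6 bytes is the very same object as str, so this is True there.
# On Python 3 they are distinct types, so it is False.
alias = bytes is str

# On Python 2.6 this is a no-op and yields 'abc'.
# On Python 3 it yields the repr of the bytes object instead.
rendered = str(data)

print(alias, rendered)
```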

There you go. I’ve heard people learn better from counter-examples. Here we have a good one to keep in mind and avoid repeating.


19 Responses to Screwing up Python compatibility: unicode(), str(), and bytes()

Steve 'Ashcrow' Milner says:

2009/07/02 at 13:05

You are right that it seems like a bad move. I can’t explain it myself. Though, don’t forget that, compared to a lot of other languages Python has done a fantastic job of keeping backwards compatibility. It’s not perfect, and it could be better, but it’s still one of the best :-).
Elvis Pfutzenreuter says:

2009/07/02 at 15:22

Indeed, this Unicode thing in Python is like we say in Portuguese: “a emenda foi pior que o soneto” (a bad thing whose fix is even worse).

Currently the Python developer must deal with three versions: 2.5 (the last “pure 2.x”), 3.0, and 2.6, which is neither 2.x nor 3.x. The worst thing of all is the manpower that all those versions will consume on the Python side.

Instead of trying to improve Python in the right directions (threading, performance, etc.) people kept adding syntactic sugar in 2.4 through 2.6, things that certainly gave fame to the PEP writers but don’t bring real breakthroughs.

That’s why Javascript is probably going to eat our lunch.
Christian Heimes says:

2009/07/02 at 22:17

Another one who didn’t get the idea behind the bytes() alias in Python 2.6 …

The bytes() and b"" aliases in Python 2.6 aid you in the migration to Python 3.x. At some point we thought about adding a separate bytes type to Python 2.6 but it would have broken far too many applications. The aliases act as markers for developers and 2to3. Without the aliases there would have been no clear way to tell 2to3 that a string should be migrated to bytes rather than unicode text.

step 1: Port your application to Python 2.6
step 2: Replace all occurrences of str() and string literals with bytes() and b"" where you mean bytes and not ASCII text.
step 3: Use 2to3 to migrate your code to Python 3.x. 2to3 replaces u"" and unicode() with "" and str(). It leaves b"" and bytes() untouched.
Gustavo Niemeyer says:

2009/07/02 at 23:35

Christian, I think it’s pretty clear from the blog post that I believe that a bytes reference does no good at all if the implementation in 2.6 is incompatible with the one in 3.0, and instead it creates even more confusion. The fact that you mention “another one who didn’t get it” is a great indicator of that. People don’t get it because it’s a bad idea, and that’s exactly why I mention in the post that we should help people keep the process in mind.

Also, it really surprises me that the fact that there is a migration tool like 2to3 is being used as an excuse for introducing gratuitous backwards compatibility mess in the language. That’s actually most probably the reason why we got into this. Too much emphasis was given to the code migration tool, when in certain cases it was straightforward to do things in a better way.
Elvis Pfutzenreuter says:

2009/07/02 at 23:39

“Another one who didn’t get the idea behind …”

Another arrogant Python one.
Christian Heimes says:

2009/07/03 at 09:56

The bytes() alias and b"" literal were only added for the 2to3 migration. It’s the only valid use case for bytes() in 2.6. Together with from __future__ import unicode_literals you can slowly get your Python 2.6 code ready for 2to3 migration. The features were implemented by some people including me after we had some experience with porting Python 2 code to Py3k. The features are motivated by real life experiences and not some crazy blue sky ideas.

If you aren’t working on a migration path to 3.x you can safely ignore the bytes alias and pretend it’s not even there. Just leave it alone.

From http://docs.python.org/whatsnew/2.6.html?highlight=bytes#pep-3112-byte-literals

“The primary use of bytes in 2.6 will be to write tests of object type such as isinstance(x, bytes). This will help the 2to3 converter, which can’t tell whether 2.x code intends strings to contain either characters or 8-bit bytes; you can now use either bytes or str to represent your intention exactly, and the resulting code will also be correct in Python 3.0.”
Gustavo Niemeyer says:

2009/07/03 at 10:48

That makes it more clear that all I pointed out in the post and in the comment above is indeed true. There was an obvious chance to make the migration smoother and straightforward without the help of a code migration tool, and it was dropped in favor of a convoluted choice which breaks the language backwards compatibility in an awkward way gratuitously.

That said, I appreciate your interest in clarifying the history.
Thanks for your time, Christian.
slurm says:

2009/07/03 at 12:18

Fsck Python 2. Python 3 is way better, there was no reason to pollute it like this. The best way to handle it would have been to just let Python 2 programs crash on Python 3, and let the coder fix.
Gustavo Niemeyer says:

2009/07/03 at 12:43

Hello anonymous Conectiva friend,

Unfortunately, that’s exactly what we have right now. Python 2.0 programs will crash on Python 3 in several ways, and the coder will have to fix it. Like you, I also like some of the things introduced in Python 3.0. Unlike you, though, I don’t take compatibility breakage lightly, and even less so when done in a bad way. If you had a very large volume of Python code to maintain, I’m sure you’d not be so careless either.
Allen Short says:

2009/07/03 at 13:41

> There was an obvious chance to make the migration smoother and straightforward without the help of a code migration tool, and it was dropped in favor of a convoluted choice which breaks the language backwards compatibility in an awkward way gratuitously.

While this is true, they’d already made this decision for other areas of the standard library. So doing it again for bytes/unicode doesn’t make things (much) worse.
right says:

2009/07/03 at 16:41

And where and when did it occur to you to try and bring your superior ideas to the attention of python developers and attempt to influence the process?
Gustavo Niemeyer says:

2009/07/03 at 16:55

As you know, I’ve been involved in Python for a while, and I hope you perceive that this post you commented on is an attempt to bring things to the attention of Python developers and influence the process. Since you’re here (in an anonymous attempt, arguably), I guess it’s working.
Robin Munn says:

2009/07/03 at 23:57

Here’s what I don’t understand about your post:

“This means that if you write code to support bytes in 2.6, you are actually not writing code which is compatible with Python 3.0.”

Huh? Explain this to me, because I’m not seeing it.

If you write code that uses bytes in 2.6, you’re clearly not intending to treat it as equivalent to str (i.e., character data), or you’d just use str. Instead, you’re intending to treat it as an 8-bit string, which is the same way it will behave in 3.0.

Now, if your code looks like:

b = bytes()
if isinstance(b, str): print "Let's mess up our forward compatibility! It'll be fun!"

then yes, you’ll have problems. But I cannot come up with a sane use case for this kind of check.

So the thing that you’re decrying as a major problem, I cannot see why it would be a problem. Yes, the type that bytes is an alias for will change, but it retains its semantics: 8-bit strings not intended to be used as character data. So why, exactly, is this change a problem?

P.S. Since tone of voice is hard to communicate, I should state that I’m not being ironic or sarcastic. I’m truly puzzled by why you think this is a problem.
Gustavo Niemeyer says:

2009/07/04 at 08:31

Robin,

There are actually some important differences on the semantics of the “real” bytes when compared to str. Here is a hint:
>>> list(b"asd")
[97, 115, 100]
Michael Foord says:

2009/07/04 at 21:25

I’m afraid Christian is right when he says that you don’t ‘get’ the intent of the bytes literal alias in Python 2.6.

As you have pointed out yourself there are important semantic differences between the bytes type in Python 3.0 and the bytestring in Python 2. Indexing, iterating and the in operator being amongst them.
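The three differences can be sketched as follows. The code runs under Python 3; the comments note what the Python 2 bytestring does for the same expression.

```python
data = b"asd"

# Indexing: an integer on Python 3, the one-character string 'a' on Python 2.
first = data[0]

# Iterating: integers on Python 3, one-character strings on Python 2.
items = list(data)

# The in operator: Python 3 accepts an integer on the left-hand side;
# Python 2 raises TypeError for the same expression.
has_int = 97 in data

# Slicing returns a bytes object of length one on both versions,
# so it is the portable way to get "the first byte".
first_slice = data[0:1]

print(first, items, has_int, first_slice)
```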

If the bytes type were to be fully ported then every builtin function and the builtin types would need to be modified to support it. What is worse, the standard library would also need to be modified, with case-by-case decisions made as to if / how to support bytes.

If it were not done fully and only the basic type backported then you wouldn’t be able to use the bytes type in Python 2.X code as you do in Python 3. This means that 2to3 could no longer reliably convert code using bytes, as you would have to special case it in your Python 2 code.

As the *purpose* of the alias is to be a hint for 2to3 it would negate the purpose entirely and be a much worse change…
Gustavo Niemeyer says:

2009/07/04 at 22:09

Yes, I understand perfectly what Christian nicely pointed out. I just think it’s a bad idea.

bytes is not a marker. It’s the name of a new type in 3.0. If you want a marker, do something like “from __future__ import strtobytes”, or “from 2to3 import bytesmarker” or whatever else.

I’m puzzled to see so many smart people saying that it’s totally fine that we’ll have to explain to people “Oh, yeah, unicode is actually str.. no, I mean, unicode is still unicode in 3.0, but it’s named str, and str in 2.6 is actually what used to be bytes, but bytes was really str, because there was that 2to3 migration thing.”

Sorry, but that mess really wasn’t necessary. Remove str. Add bytes. Go home.
Michael Foord says:

2009/07/05 at 06:58

What is so hard about:

“all strings are unicode in Python 3.0 so the string type is called str. bytes literals are allowed in Python 2.6 but are just an alias for bytestrings and useful for conversion of code by 2to3”

*Anything* can be made confusing if you deliberately belabour a point rather than admit you were wrong…

In Python 3 your facetious “Remove str. Add bytes. Go home.” is exactly what happened. Removing str from Python 2 is obviously not possible (do you honestly not see that?).
Serge says:

2009/07/05 at 08:40

Your plan looks simple, but in reality it’s pretty complex. It’s not enough to just introduce a bytes type into 2.6; you would have to make it actually be accepted everywhere if you want the bytes type to be useful beyond toy applications. Then what about functions that return str in 2.6: would you keep them all unchanged or make them somehow return bytes? In the first case you would end up with some frustrating messy half-str/half-bytes world. In the second case I’m not even sure how you would do it in all cases. I could go on, but the point is that it’s really not simple.
Gustavo Niemeyer says:

2009/07/05 at 12:25

Serge,

Indeed introducing bytes in a good way in 2.6 would take some work, but I disagree that the main points I’m bringing up here were not simple. Renaming unicode to str requires actually more work than just keeping unicode as-is. If the goal of the bytes type in 2.6 was simply to serve as a marker and there’s no one willing to pay the bill for porting bytes properly, then make it more obvious that this is a marker like I suggested above rather than introducing a built-in with the same name as the 3.0 type.

Michael,

“all strings are unicode in Python 3.0 so the string type is called str. bytes literals are allowed in Python 2.6 but are just an alias for bytestrings and useful for conversion of code by 2to3”

No, they’re not all unicode just like they were not all unicode in 2.6. "abc" wasn’t unicode in 2.6. b"abc" is not unicode in 3.0. The renaming from unicode to str is unjustified, the fact that we have a marker with the name of a new type in 3.0 is unjustified. If you want me to admit being wrong, stop ignoring the facts and look for some reasonable argumentation for this mess.

“In Python 3 your facetious “Remove str. Add bytes. Go home.” is exactly what happened. Removing str from Python 2 is obviously not possible (do you honestly not see that?).”

No, that’s not what happened either. Read the post again if you seriously don’t know what happened.