Monthly Archive for July, 2009

Accessing RESTful information efficiently

Alright, so I appreciate the idea of RESTful Web Services, but I’ve got a small dilemma I’d appreciate some opinions on.

In the RESTful Web Services book, by Leonard Richardson and Sam Ruby, there’s emphasis on making the programmable web look like the human web, by following an architecture oriented to having addressable resources rather than oriented to remote procedure calls. Through the book, the RPC (or REST-RPC, when mixed with some RESTful characteristics), is clearly downplayed. In some cases, though, it’s unclear to me what’s the extent of this advice. Humans and computers are of course very different in the nature of tasks they perform, and how well they perform them. To illustrate the point clearly, let me propose a short example.

Let’s imagine the following scenario: we are building a web site with information on a large set of modern books. In this system, we want to follow RESTful principles strictly: each book is addressable at http://example.com/book/<id>, and we can get a list of book URIs by accessing http://example.com/book/list?filter=<words>.

Now, we want to allow people to easily become aware of the newest edition of a given book. To do that, we again follow RESTful characteristics and add a, let’s say, new-editions field to the data which composes a book resource. This field contains a list of URIs of books which are more recent editions of the given book. So far so good. Looks like a nice design.

Now, we want to implement a feature which allows people to access the list of all recent editions of books in their home library, given that they know the URIs for the books because a client program stored the URIs locally in their machines. How would we go about implementing this? We certainly wouldn’t want to do 200 queries to learn about updates for 200 books which a given person has, since that’s unnecessarily heavy on the client computer, on the network, and on the server. It’s also hard to encode the resource scope (as defined in the book) in the URI, since the amount of data to define the scope (the 200 books in our case) can be arbitrarily large. This actually feels like perfectly fit for an RPC: “Hey, server, here are 200 URIs in my envelope.. let me know what are the updated books and their URIs.” I can imagine some workarounds for this, like saving a temporary list of books with PUT, and then doing the query on that temporary list’s URI, but this feels like a considerably more complex design just for the sake of purity.

When I read examples of RESTful interfaces, I usually see examples about how a Google Search API can be RESTful, for instance. Of course, Google Search is actually meant to be operated by humans, with a simple search string. But computers, unlike humans, can thankfully handle a large volume of data for us, and let us know about the interesting details only. It feels a bit like once the volume of data and the complexity of operations on that data goes up, the ability for someone to do a proper RESTful design goes down, and an RPC-style interface becomes an interesting option again.

I would be happy to learn about a nice RESTful approach to solve this kind of problem, though.

Screwing up Python compatibility: unicode(), str(), and bytes()

Backwards and forwards compatibility is an art. In the very basic and generic form, it consists in organizing the introduction of new concepts while allowing people to maintain existing assets working. In some cases, the new concepts introduced are disruptive, in the sense that they prevent the original form of the asset to be preserved completely, and then some careful consideration has to be done for creating a migration path which is technically viable, and which at the same time helps people keeping the process in mind. A great example of what not to do when introducing such disruptive changes has happened in Python recently.

Up to Python 2.5, any strings you put within normal quotes (without a leading character marker in front of it) would be considered to be of the type str, which originally was used for both binary data and textual data, but in modern times it was seen as the type to be used for binary data only. For textual information, the unicode type has been introduced in Python 2.0, and it provides easy access to all the goodness of Unicode. Besides converting to and from str, it’s also possible to use Unicode literals in the code by preceding the quotes with a leading u character.

This evolution has happened quite cleanly, but it introduced one problem: these two types were both seen as the main way to input textual data in one point in time, and the language syntax clearly makes it very easy to use either type interchangeably. Sounds good in theory, but the types are not interchangeable, and what is worse: in many cases the problem is only seen at runtime when incompatible data passes through the code. This is what gives form to the interminable UnicodeDecodeError problem you may have heard about. So what can be done about this? Enter Python 3.0.

In Python 3.0 an attempt is being made to sanitize this, by promoting the unicode type to a more prominent position, removing the original str type, and introducing a similar but incompatible bytes type which is more clearly oriented towards binary data.

So far so good. The motivation is good, the target goal is a good one too. As usual, the details may complicate things a bit. Before we go into what was actually done, let’s look at an ideal scenario for such an incompatible change.

As mentioned above, when introducing disruptive changes like this, we want a good migration path, and we want to help people keeping the procedure in mind, so that they do the right thing even though they’re not spending too many brain cycles on it. Here is a suggested schema of what might have happened to achieve the above goal: in Python 2.6, introduce the bytes type, with exactly the same semantics of what will be seen in Python 3.0. During 2.6, encourage people to migrate str references in their code to either the previously existent unicode type, when dealing with textual data, or to the new bytes type, when handling binary data. When 3.0 comes along, simply kill the old str types, and we’re done. People can easily write code in 2.6 which supports 3.0, and if they see a reference to str they know something must be done. No big deal, and apparently quite straightforward.

Now, let’s see how to do it in a bad way.

Python 2.6 introduces the bytes type, but it’s not actually a new type. It’s simply an alias to the existing str type. This means that if you write code to support bytes in 2.6, you are actually not writing code which is compatible with Python 3.0. Why on earth would someone introduce an alias on 2.6 which will generate incompatible code with 3.0 is beyond me. It must be some kind of anti-migration pattern. Then, Python 3.0 renames unicode to str, and kills the old str. So, the result is quite bad: Python 3.0 has both str and bytes, and they both mean something else than they did on 2.6, which is the first version which supposedly should help migration, and not a single one of the three types from 2.6 got their names and semantics preserved in 3.0. In fact, just unicode exists at all, and it has a different name.

There you go. I’ve heard people learn better from counter-examples. Here we have a good one to keep in mind and avoid repeating.