Archive for the 'C/C++' Category

Removed SRE recursion in Python

SRE is the current regular expression engine used in Python, and the scheme currently in use is highly recursive in some cases. For example, check the result of the following expression, with the current interpreters:

>>> import re
>>> re.match('(x)*','x'*10000)
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/lib/python2.3/sre.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded

Now, my patch to remove the SRE recursion in Python was finally committed to the CVS. Here is the same example with the CVS HEAD, but using a string which is a hundred times larger:

>>> import re
>>> re.match('(x)*','x'*1000000)
<_sre.SRE_Match object at 0x300bb970>
>>>

Unless some serious problem is found in the new algorithm, it will be available in Python 2.4.

This patch also includes the (?(id/name)yes-pattern|no-pattern) expression support I’ve written more than a year ago. This will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. |no-pattern is optional and can be omitted. For example, () is a poor email matching pattern, which will match with ” as well as ‘user@host.com’, but not with ‘

Using coverage checks to improve testing

Today I’ve been working to fix an old problem which beats us, Python users, from time to time: the recursive limitation in the regular expression engine, SRE. I’ll probably talk more about this fix in the future. For now, I’d like to explain a technique I’ve used, following a suggestion by Skip Montanaro (thanks Skip!), to improve the test suite of the SRE engine. This technique consists of using the gcov tool, part of gcc, to obtain a coverage report from the test suite.

Ok.. here we go. The first thing to do is to compile _sre.c with coverage support. Checking the manual of gcov you’ll see that we need the options -fprofile-arcs -ftest-coverage while compiling the _sre.o object, so that it generates the necessary information while Python is running functions inside it. Let’s do it (I got the original command from make output):

gcc -fprofile-arcs -ftest-coverage -pthread  -g -Wall \
    -Wstrict-prototypes -I. -I./Include  -DPy_BUILD_CORE \
    -c ./Modules/_sre.c -o Modules/_sre.o

Besides generating the object with support to coverage tests, it will output the following files:

  • tmp.stdout.22153.bb
  • tmp.stdout.22153.bbg

This is a little bit strange, since the manual and the gcov program says these files should be called _sre.bb and _sre.bbg. Luckily, it was just a matter of renaming these files to the right names.

Now that we have the “right” object, with coverage test support, we can ask make to do the rest of the work for us. After running it you’ll get the python executable we’ll use in our coverage tests.

Ok.. we have the prepared executable, and some of the necessary information files. We need one more information file, before running gcov: the file which is generated when the test suite is run, with the coverage information for _sre.c. To get this file, it’s just a matter of running the test suite with our “magic” python executable:

./python Lib/test/test_re.py

After that, we got the last needed file:

  • tmp.stdout.19080.da

Rename it to _sre.da as you did with the other ones. You’ll probably want to automate some of these steps to achieve a fast test cycle.

Now we’re ready to get our coverage report:

gcov -o . Modules/_sre.c

Bingo! We should have a file named _sre.c.gcov in the current directory. This is a text report which includes the source code with information about how many times each line was executed, and marks like ###### showing which ones weren’t executed at all.

With this information, we are able to improve the test suite, making it cover as much as possible from the module code (after some iterations, of course ;-) ).

Using the auto_ptr concept in C++

In C++, local instances (these not allocated by hand) have their destructors run as soon as they go out of scope. This behavior is explored in a few different schemes, besides its obvious intent of destructing the instance. The most known one is the auto_ptr class, part of the STL. It is merely a memory guard which takes care of deallocating the given pointer once its own scope is over. In the short example below, the s variable won’t leak.

#include <stdlib.h>
#include <memory>
#include <iostream>

using namespace std;

void f()
{
        char *s = strdup("foobar");
        auto_ptr<char> sp(s);
        cout << sp.get() << endl;
}

While this is an interesting place to use this concept, it’s a good algorithm to be aware of generically. A good example of how useful this could be is how the apt-shell (part of APT-RPM) cache guard mechanism was implemented. This program was implemented as a refactoring of apt-get, but they have a fundamental difference: while apt-get is a one-shot program (run just once, and get back to the user), apt-shell maintains the internal cache (the structure where information about the packages is kept) alive for a long time, and changes it in many different ways during its execution time. Thus, I had to implement some way to make it easy to protect this cache against being unadvisedly changed in all these functions coming from apt-get. To do that, I have implemented an AutoRestore class which takes care of restoring the original state of the cache, unless explicitly told to release it. Here is the complete class, from apt-shell.cc:

class AutoRestore
{
   pkgDepCache::State State;
   bool Guarded;
   public:
   inline pkgDepCache::State *operator ->() {return &State;};
   inline pkgDepCache::State *operator &() {return &State;};
   inline void UnGuard() { Guarded = false; };
   AutoRestore(pkgDepCache &Cache)
      : State(&Cache), Guarded(true) {};
   ~AutoRestore() { if (Guarded) State.Restore(); };
};

To use it, it’s just a matter of creating a local instance, passing the cache to the constructor. It will take care of undoing unexpected changes. Really comfortable!

Why Lua was embedded into APT-RPM

Why embedding at all?

APT-RPM is a port of the debian APT tool to RPM systems. Since I’ve started working in the project, I’ve been thinking about integrating a higher level language on it. Actually, it’s pretty easy to explain this intention. How many times have you seen distributions hacking a project to fix some misbehavior specific to their environment, or wanting some very specific functionality, which wouldn’t fit in the general context of the upstream project? Attaching an external language allows you to plug these features, without affecting how the project is conducted. Also, I’m a fan of the productivity and flexibility offered by high level languages.

Why Lua?

One might think I’ve used Lua just because I live in the same country as its core developers (Brazil), but that’s not the case. Indeed, I’ve done a pretty intensive research about embeddable languages before choosing Lua. I was looking for a fast, and small language. When you have a library which has about 500kb, you can’t embed a large interpreter to extend the functionality, otherwise you’d be extending the interpreter, not the library. One might think that the interpreter library would be in the system anyway, so that wouldn’t be a real problem. Unfortunately, that doesn’t apply to APT-RPM, since it is used in small systems, and in installer environments. The current Lua interpreter is still under 100kb, and is very fast if compared to other interpreters. I really think the Lua interpreter has no current competitors in that specific area.

Slot code example

To give you an idea about how comfortable it is to work that way, I’ve recently introduced the possibility of passing generic filenames to the “apt-get install” command. To do that, I’ve introduced a new slot in the APT-RPM core, which is called when the data entered is not found as an available package name. Notice that this slot is generic, and works for other kinds of parameters, besides filenames. In the following slot code, notice that the Lua interface has been wrapped into a more useful API for the APT-RPM environment.

_lua->SetDepCache(Cache);
_lua->SetDontFix();
_lua->SetGlobal("argument", Argument);
_lua->RunScripts("Scripts::Apt::Install::TranslateArg", false);
const char *name = _lua->GetGlobal("translated");
_lua->ResetGlobals();
_lua->ResetCaches();

Plugin code example

With the slot code above, developing the filename translation plugin was very straightforward. Of course, some kind of database containing the filename information was needed. I’ve used a compressed textfile, containing thousands of pairs like “filename packagename”, one per line. And here is the final plugin. Can you imagine how much code it’d take if implemented in C , the core language of APT-RPM?

-- Data sample:
--   argument = "/usr/bin/lua"
--   contents = "/var/state/apt/Contents.gz"
--   translated = "newname" 

if string.sub(argument, 1, 1) == "/" then
    contents = confget("Dir::State::contents/f")
    if string.sub(contents, -3) == ".gz" then
        file = io.popen("zcat "..contents)
    elseif string.sub(contents, -4) == ".bz2" then
        file = io.popen("bzcat "..contents)
    else
        file = io.open(contents)
    end
    len = string.len(argument)
    for line in file:lines() do
        if string.sub(line, 1, len) == argument then
            _, _, path, name = string.find(line, '(%S )%s (%S )')
            if path == argument then
                translated = name
                break
            end
        end
    end
    for line in file:lines() do
        -- nothing, just don't break the pipe
    end
    file:close()
end

More information

For more information, have a look at https://moin.conectiva.com.br/AptRpm and https://moin.conectiva.com.br/AptRpm/Scripting