Removed SRE recursion in Python

SRE is the current regular expression engine used in Python, and the scheme currently in use is highly recursive in some cases. For example, check the result of the following expression, with the current interpreters:

>>> import re
>>> re.match('(x)*','x'*10000)
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/lib/python2.3/sre.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded

Now, my patch to remove the SRE recursion in Python was finally committed to the CVS. Here is the same example with the CVS HEAD, but using a string which is a hundred times larger:

>>> import re
>>> re.match('(x)*','x'*1000000)
<_sre.SRE_Match object at 0x300bb970>
>>> 

Unless some serious problem is found in the new algorithm, it will be available in Python 2.4.

This patch also includes the (?(id/name)yes-pattern|no-pattern) expression support I’ve written more than a year ago. This will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. |no-pattern is optional and can be omitted. For example, () is a poor email matching pattern, which will match with ” as well as ‘user@host.com’, but not with ‘

Posted in C/C++, Patch, Python | Leave a comment

Java bits

For the first time in my life, I’ve been paid to implement something in the Java language. Conectiva has been contracted to implement X500 address book support in a known Brazilian groupware tool, for the Brazilian government. This tool runs on top of the TomCat server, and I must mention that I’ve used an interesting feature of the Java language to debug it remotely: the Java Platform Debugger Architecture (JPDA) (thanks JSwat developers).

Posted in Java, Patch | 1 Comment

Using coverage checks to improve testing

Today I’ve been working to fix an old problem which beats us, Python users, from time to time: the recursive limitation in the regular expression engine, SRE. I’ll probably talk more about this fix in the future. For now, I’d like to explain a technique I’ve used, following a suggestion by Skip Montanaro (thanks Skip!), to improve the test suite of the SRE engine. This technique consists of using the gcov tool, part of gcc, to obtain a coverage report from the test suite.

Ok.. here we go. The first thing to do is to compile _sre.c with coverage support. Checking the manual of gcov you’ll see that we need the options -fprofile-arcs -ftest-coverage while compiling the _sre.o object, so that it generates the necessary information while Python is running functions inside it. Let’s do it (I got the original command from make output):

gcc -fprofile-arcs -ftest-coverage -pthread  -g -Wall  
    -Wstrict-prototypes -I. -I./Include  -DPy_BUILD_CORE  
    -c ./Modules/_sre.c -o Modules/_sre.o 

Besides generating the object with support to coverage tests, it will output the following files:

  • tmp.stdout.22153.bb
  • tmp.stdout.22153.bbg

This is a little bit strange, since the manual and the gcov program says these files should be called _sre.bb and _sre.bbg. Luckily, it was just a matter of renaming these files to the right names.

Now that we have the “right” object, with coverage test support, we can ask make to do the rest of the work for us. After running it you’ll get the python executable we’ll use in our coverage tests.

Ok.. we have the prepared executable, and some of the necessary information files. We need one more information file, before running gcov: the file which is generated when the test suite is run, with the coverage information for _sre.c. To get this file, it’s just a matter of running the test suite with our “magic” python executable:

./python Lib/test/test_re.py 

After that, we got the last needed file:

  • tmp.stdout.19080.da

Rename it to _sre.da as you did with the other ones. You’ll probably want to automate some of these steps to achieve a fast test cycle.

Now we’re ready to get our coverage report:

gcov -o . Modules/_sre.c 

Bingo! We should have a file named _sre.c.gcov in the current directory. This is a text report which includes the source code with information about how many times each line was executed, and marks like ###### showing which ones weren’t executed at all.

With this information, we are able to improve the test suite, making it cover as much as possible from the module code (after some iterations, of course ;-) ).

Posted in C/C++, Python | Leave a comment

Understanding Python’s SRE structure

If you’re a Python developer, you most certainly have used Python’s regular expression for some purpose. Have you ever thought about how it is currently organized?

The SRE engine, written by Fredrik Lundh, is the default regular expression engine since Python 1.6. SRE is internally split in a few different modules, written in Python or C. Here is a quick overview of what is done inside each module.

re.py

This is just a wrapper importing contents from the module sre.py.

sre.py

This module defines the user interface (compile(), match(), etc), mostly importing functionality from other SRE modules.

sre_constants.py

This is where most of the constants used by the SRE engine is kept. If executed, this module will export these constants to sre_constants.h, intended to be used by _sre.c.

sre_parse.py

This module takes care of parsing the regular expression string and generating a pattern list representing the given regular expression. The pattern list is an instance of a class which maintains an intermediate state of the processed regular expression, including a list containing tuples like (operation, args), where operation defines the internal operation, and args is specific to each operation being parsed. To see how the pattern list looks like, use a regular expression string as an argument to the sre_parse.parse() function. Most syntax errors are identified in this module, while the pattern list is being generated.

sre_compile.py

The purpose of this module is to take a pattern list produced by the sre_parse.py module and generate the final pattern code. The pattern code is a kind of bytecode interpreted by the _sre.c module, and it is represented by a list of integers when being generated, and later on converted to a byte stream. To see how the pattern code looks like, you can pass the pattern list returned by the sre_parse.parse() code together with a flags integer (you can use 0) to the sre_compile._code() function.

_sre.c

This is the last module in our overview, and the only C module in the SRE engine. It defines the SRE_Pattern class which is responsible for interpreting the bytecode produced by the sre_compile.py module. You get an instance of this class when you run the sre.compile() function. Indeed, most of the interface offered to the user in the sre.py module is a wrapper on top of the methods defined in this class.

Posted in Python | Leave a comment

Using the auto_ptr concept in C++

In C++, local instances (these not allocated by hand) have their destructors run as soon as they go out of scope. This behavior is explored in a few different schemes, besides its obvious intent of destructing the instance. The most known one is the auto_ptr class, part of the STL. It is merely a memory guard which takes care of deallocating the given pointer once its own scope is over. In the short example below, the s variable won’t leak.

#include <stdlib.h>
#include <memory>
#include <iostream>

using namespace std;

void f()
{
        char *s = strdup("foobar");
        auto_ptr<char> sp(s);
        cout << sp.get() << endl;
}

While this is an interesting place to use this concept, it’s a good algorithm to be aware of generically. A good example of how useful this could be is how the apt-shell (part of APT-RPM) cache guard mechanism was implemented. This program was implemented as a refactoring of apt-get, but they have a fundamental difference: while apt-get is a one-shot program (run just once, and get back to the user), apt-shell maintains the internal cache (the structure where information about the packages is kept) alive for a long time, and changes it in many different ways during its execution time. Thus, I had to implement some way to make it easy to protect this cache against being unadvisedly changed in all these functions coming from apt-get. To do that, I have implemented an AutoRestore class which takes care of restoring the original state of the cache, unless explicitly told to release it. Here is the complete class, from apt-shell.cc:

class AutoRestore
{
   pkgDepCache::State State;
   bool Guarded;
   public:
   inline pkgDepCache::State *operator ->() {return &State;};
   inline pkgDepCache::State *operator &() {return &State;};
   inline void UnGuard() { Guarded = false; };
   AutoRestore(pkgDepCache &Cache)
      : State(&Cache), Guarded(true) {};
   ~AutoRestore() { if (Guarded) State.Restore(); };
};

To use it, it’s just a matter of creating a local instance, passing the cache to the constructor. It will take care of undoing unexpected changes. Really comfortable!

Posted in C/C++ | Leave a comment

Why Lua was embedded into APT-RPM

Why embedding at all?

APT-RPM is a port of the debian APT tool to RPM systems. Since I’ve started working in the project, I’ve been thinking about integrating a higher level language on it. Actually, it’s pretty easy to explain this intention. How many times have you seen distributions hacking a project to fix some misbehavior specific to their environment, or wanting some very specific functionality, which wouldn’t fit in the general context of the upstream project? Attaching an external language allows you to plug these features, without affecting how the project is conducted. Also, I’m a fan of the productivity and flexibility offered by high level languages.

Why Lua?

One might think I’ve used Lua just because I live in the same country as its core developers (Brazil), but that’s not the case. Indeed, I’ve done a pretty intensive research about embeddable languages before choosing Lua. I was looking for a fast, and small language. When you have a library which has about 500kb, you can’t embed a large interpreter to extend the functionality, otherwise you’d be extending the interpreter, not the library. One might think that the interpreter library would be in the system anyway, so that wouldn’t be a real problem. Unfortunately, that doesn’t apply to APT-RPM, since it is used in small systems, and in installer environments. The current Lua interpreter is still under 100kb, and is very fast if compared to other interpreters. I really think the Lua interpreter has no current competitors in that specific area.

Slot code example

To give you an idea about how comfortable it is to work that way, I’ve recently introduced the possibility of passing generic filenames to the “apt-get install” command. To do that, I’ve introduced a new slot in the APT-RPM core, which is called when the data entered is not found as an available package name. Notice that this slot is generic, and works for other kinds of parameters, besides filenames. In the following slot code, notice that the Lua interface has been wrapped into a more useful API for the APT-RPM environment.

_lua->SetDepCache(Cache); 
_lua->SetDontFix(); 
_lua->SetGlobal("argument", Argument); 
_lua->RunScripts("Scripts::Apt::Install::TranslateArg", false); 
const char *name = _lua->GetGlobal("translated"); 
_lua->ResetGlobals(); 
_lua->ResetCaches(); 

Plugin code example

With the slot code above, developing the filename translation plugin was very straightforward. Of course, some kind of database containing the filename information was needed. I’ve used a compressed textfile, containing thousands of pairs like “filename packagename”, one per line. And here is the final plugin. Can you imagine how much code it’d take if implemented in C , the core language of APT-RPM?

-- Data sample: 
--   argument = "/usr/bin/lua" 
--   contents = "/var/state/apt/Contents.gz" 
--   translated = "newname" 
 
if string.sub(argument, 1, 1) == "/" then 
    contents = confget("Dir::State::contents/f") 
    if string.sub(contents, -3) == ".gz" then 
        file = io.popen("zcat "..contents) 
    elseif string.sub(contents, -4) == ".bz2" then 
        file = io.popen("bzcat "..contents) 
    else 
        file = io.open(contents) 
    end 
    len = string.len(argument) 
    for line in file:lines() do 
        if string.sub(line, 1, len) == argument then 
            _, _, path, name = string.find(line, '(%S )%s (%S )') 
            if path == argument then 
                translated = name 
                break 
            end 
        end 
    end 
    for line in file:lines() do 
        -- nothing, just don't break the pipe 
    end 
    file:close() 
end 

More information

For more information, have a look at https://moin.conectiva.com.br/AptRpm and https://moin.conectiva.com.br/AptRpm/Scripting

Posted in C/C++, Lua, Project | Leave a comment