Understanding Python’s SRE structure

If you’re a Python developer, you most certainly have used Python’s regular expression for some purpose. Have you ever thought about how it is currently organized?

The SRE engine, written by Fredrik Lundh, is the default regular expression engine since Python 1.6. SRE is internally split in a few different modules, written in Python or C. Here is a quick overview of what is done inside each module.

re.py

This is just a wrapper importing contents from the module sre.py.

sre.py

This module defines the user interface (compile(), match(), etc), mostly importing functionality from other SRE modules.

sre_constants.py

This is where most of the constants used by the SRE engine is kept. If executed, this module will export these constants to sre_constants.h, intended to be used by _sre.c.

sre_parse.py

This module takes care of parsing the regular expression string and generating a pattern list representing the given regular expression. The pattern list is an instance of a class which maintains an intermediate state of the processed regular expression, including a list containing tuples like (operation, args), where operation defines the internal operation, and args is specific to each operation being parsed. To see how the pattern list looks like, use a regular expression string as an argument to the sre_parse.parse() function. Most syntax errors are identified in this module, while the pattern list is being generated.

sre_compile.py

The purpose of this module is to take a pattern list produced by the sre_parse.py module and generate the final pattern code. The pattern code is a kind of bytecode interpreted by the _sre.c module, and it is represented by a list of integers when being generated, and later on converted to a byte stream. To see how the pattern code looks like, you can pass the pattern list returned by the sre_parse.parse() code together with a flags integer (you can use 0) to the sre_compile._code() function.

_sre.c

This is the last module in our overview, and the only C module in the SRE engine. It defines the SRE_Pattern class which is responsible for interpreting the bytecode produced by the sre_compile.py module. You get an instance of this class when you run the sre.compile() function. Indeed, most of the interface offered to the user in the sre.py module is a wrapper on top of the methods defined in this class.

This entry was posted in Python. Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *