Replacing the old regex implementation

Project Description

The regex implementation in FreeBSD's libc is quite outdated. It is not developed any more, it is slow and it does not support wide characters. We need a modern version with better I18N support, good performance and standard-compliance. The work that has been done on grep also showed that there is quite much to optimize in this area. This project pretends to port the TRE regex implementation and apply some compatibility/optimization changes to suit better our needs. This is the best available option. I have looked at different alternatives. In short, Oniguruma is slow, PCRE is not strict POSIX and the Plan 9 implementation is limited to egrep syntax. TRE is BSD-licensed, (almost) strict POSIX and it has a good performance. The rational of the implementation details is also explained in the author's thesis.

Apart from these common requirements against the regex implementation, we have some specific needs in FreeBSD. These were studied and elaborated during the work on grep. First, the GNU regex implementation has been used in grep for a long time, which is more permissive with the regular expressions than POSIX, e.g. it allows [a|], which is invalid in strict implementations. TRE does not support such non-standard forms but we need it to preserve compatibility. It should to be added as an optional feature, which can be controlled with a custom flag passed to regcomp(). Secondly, the POSIX interface is limiting. GNU grep uses a heuristic for pattern matching, e.g. to match "foo.*bar" or something similar, a very fast fixed string algorithm (like Boyer-Moore) is used then the matching line is separated and the heavy matching algorithm is only applied on the potentially matching line. This is not possible with POSIX regexec(), which works with NUL-terminated strings instead of byte-counted buffers and thus implies char-at-a-time processing. An alternative interface should be provided for this behaviour. This interface could be used internally in the FreeBSD userland to provide really fast pattern matching, while preserving the POSIX compatibility and offering the standard API for third-party applications.

Reference: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html (more specifically http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019358.html)

As a side note, GNU grep does these optimizations internally, which imho is not a good idea and it should better be done in the regex library for the following reasons:

In this way, only grep can benefit from these optimizations while it would be more than useful for another regex users.
This code is not tightly related to grep so it makes the code more difficult to read and maintain. Good software principles dictate loose coupling and strong cohesion, while it is definitely weak cohesion.

Project Deliverables

Fast pattern matching.
Wide character compliance.
All of this with keeping the attractive BSD-license.
If this project succeeds, it will enable BSD grep to be the default and we could potentially drop GNU grep.
Support for more permissive GNU-ish regex syntax. (Personally, I think it still should be avoided but experiences show that people use it so we give them the opportunity to live with this if they want.)
Alternative interface, which gives better performance than POSIX.
More up-to-date code, which is better maintainable.

TODO and Milestones

Community bounding period: Read TRE's author's thesis, which explains the details and the background of this particular implementation.
May 23 - Jun 1: Get it built as a standalone library in src/lib. Finally, it will go to libc but in this phase it is a more convenient way to work with it.
Jun 1 - Jun 10: Review what NetBSD did in another Summer of Code and import their improvements if they have any.
Jun 10 - Jun 21: Implement a fixed string pattern matching inside the library, which is a prerequisite for the alternative interface. The library should also detect automatically if a fixed string is passed through regcomp() and internally optimize performance by using the fixed string algorithm. This may be done already but in any case, it should be investigated, looking for optimization opportunities in this term. Different fixed string algorithms may also be benchmarked.
Jun 21 - Jul 11: Implement the alternative interface to speed up pattern matching.
Jul 11: midterm evaluation
Jul 11 - Jul 20: Integrate the code to libc and ask portmgr for a test so that we can replace the current implementation.
Jul 20 - Aug 20: Implement the GNU-style permissive regexes in an optional way. Probably, this is the most heavy task, which needs thorough elaboration. Once it is done, provide a new patch and ask portmgr to test it on the package building cluster to see if we can finally drop GNU regex and GNU grep.
Aug 20 - Aug 22: Polish manual pages, style(9) and provide a patch.
Aug 22: Final evaluation.

Test Plan

Most testing steps will be done with grep because it is the most massive regex user.

Use GNU grep's regression tests to identify GNU-specific behaviour.
Write some own regression tests with BSD grep.
Ask portmgr to run a complete package build to test compatibility.
Run performance tests with grep.