libregex

Problem

libgnuregex will be going away, and it would be great if bsdgrep could regain the ability to handle GNU extensions. As far as I've found, the GNU extensions add:

Motivation

As previously mentioned, libgnuregex is going away and bsdgrep could use GNU extensions. regex(3) in libc cannot currently handle these, and some have said that they'd rather not add these extensions directly to libc, in the interest of keeping it purely POSIX-conformant and fast.

Implementation Details

As of right now, I have floated two patches around on freebsd-hackers@. Patch [1] shows what implementing the GNU extensions on top of the current Spencer implementation would look like. It still has some bugs/kinks that need to be worked out, but it does generally work for less torturous use cases. It also includes a REG_POSIX flag so that one may turn off all extensions.
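To illustrate how a caller might opt out of the extensions, here is a minimal sketch. REG_POSIX is the flag proposed in patch [1]; it is not part of stock <regex.h>, so the sketch defines it as a no-op when missing. The helper name compile_posix_only is hypothetical and the flag's exact semantics are an assumption.

```c
#include <regex.h>
#include <stddef.h>

/*
 * REG_POSIX is proposed by patch [1] and is absent from stock <regex.h>.
 * Assumption: fall back to 0 so this builds against an unpatched libc.
 */
#ifndef REG_POSIX
#define REG_POSIX 0
#endif

/*
 * Hypothetical helper: compile an ERE while asking for strictly POSIX
 * behavior. Returns whatever regcomp(3) returns (0 on success).
 */
static int
compile_posix_only(regex_t *re, const char *pattern)
{
	return (regcomp(re, pattern, REG_EXTENDED | REG_POSIX));
}
```

With an unpatched libc this behaves exactly like a plain REG_EXTENDED compile, which is the point: callers that want POSIX-only behavior would keep working either way.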

Patch [2] is how I would like to create a libregex. It would reuse the implementation in libc/regex, and use CFLAGS to indicate that it's building as libregex (I would likely use -DLIBREGEX, not -D*_BUILD) so that it may compile in the GNU extension bits. On the regex(3) side, this doesn't look as convoluted as one might think, because the GNU extensions don't actually change parsing drastically; they allow for more permissive patterns.
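A rough sketch of what compiling in the extension bits under -DLIBREGEX could look like. The function escape_is_extension is hypothetical and does not come from either patch; only the LIBREGEX macro name and the set of escapes are taken from this page.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not taken from the patches: report whether an
 * escaped character should be treated as a GNU extension. The same
 * source file yields strict behavior for libc and permissive behavior
 * when built with -DLIBREGEX.
 */
static bool
escape_is_extension(char c)
{
#ifdef LIBREGEX
	switch (c) {
	case 'b': case 'B':
	case 'w': case 'W':
	case 's': case 'S':
		return (true);
	default:
		return (false);
	}
#else
	(void)c;	/* libc build: no extensions are recognized */
	return (false);
#endif
}
```

Keeping the switch behind a single macro means the libc build carries no extra parsing cost, while libregex gets the additional cases from the same code.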

Splitting this off as libregex would be nice, as it would let me work on the extensions separately from libc in the early stages, without requiring an exp-run before they are in a polished state.

I also consider this a good approach because it remains easy to replace libc's regex parser:

[1] libc-gnuext2.diff

[2] libregex.diff

Preparation / Intermediate Steps

There are basically two intermediate steps that need to be taken:

  1. A patch needs to be introduced to stop accepting an escaped ordinary character as that same ordinary character. POSIX leaves this behavior undefined, so we should start throwing errors for it. This is mostly so that we can use \b, \B, \w, \W, \s and \S safely, without unintended side effects because an application/port was expecting them to be taken literally.
  2. The above [2] patch, libregex.diff, needs to be applied, and the test suite from libc/tests should somehow be reused here so that we can automatically test both libregex and libc.
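The check described in step 1 can be sketched as a small scanner over an ERE. This is an illustration only, not code from the patches: first_undefined_escape is a hypothetical name, and as a simplification it does not parse bracket expressions (where a backslash is literal), so it would wrongly flag a pattern such as [\w].

```c
#include <stddef.h>
#include <string.h>

/* ERE special characters per POSIX; escaping anything else is undefined. */
static const char ere_special[] = ".[\\()*+?{|^$";

/*
 * Hypothetical helper: return the offset of the first backslash that
 * escapes an ordinary character (or that ends the pattern), or -1 if
 * every escape targets a special character.
 * Simplification: bracket expressions are not handled.
 */
static long
first_undefined_escape(const char *ere)
{
	size_t i;

	for (i = 0; ere[i] != '\0'; i++) {
		if (ere[i] != '\\')
			continue;
		if (ere[i + 1] == '\0')
			return ((long)i);	/* trailing backslash */
		if (strchr(ere_special, ere[i + 1]) == NULL)
			return ((long)i);	/* escaped ordinary character */
		i++;				/* skip the escaped special */
	}
	return (-1);
}
```

For example, the pattern foo\w+ is flagged at offset 3, while a\.b passes because the dot is a special character.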

Discussion

To represent points from previous discussions:

Backreferences mean that EREs, like BREs, may sometimes require exponential time to match.
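For reference, POSIX already defines backreferences for BREs, so the behavior in question can be tried with the standard regcomp(3)/regexec(3) calls. The sketch below uses only those calls; the helper name bre_matches is made up for this example.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile `pattern` as a BRE (cflags of 0) and
 * test it against `text`. Returns 1 on match, 0 on no match, and -1
 * if the pattern fails to compile.
 */
static int
bre_matches(const char *pattern, const char *text)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, 0) != 0)
		return (-1);
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);
	return (rc == 0 ? 1 : 0);
}
```

Calling bre_matches("^\\(ab*\\)\\1$", "abbabb") asks whether the text is some string of the form ab* repeated twice. The matcher has to try candidate lengths for the group before it can check \1, and that retrying is what can blow up on harder patterns.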

Feedback?

Please do give feedback on this idea, as it has had relatively little. If by e-mail, please do make sure to CC the listed email address for KyleEvans.


CategoryStale

LibRegex (last edited 2022-08-15T06:31:29+0000 by KubilayKocak)