Chunk processing in bsdgrep(1)

This is good to plan now, but it should wait for the currently pending BSDgrep changes and then LibRegex. LibRegex should take precedence so that we can get rid of the matchall hack. Getting rid of matchall completely would also require a compatibility shim in standard regex(3) to allow an empty expression -- perhaps a REG_ALLOWEMPTY cflag.

As mentioned by Mike Haertel, the original author of GNU grep, in this freebsd-current@ mailing list entry, GNU grep gets a considerable performance boost from *not* breaking the input into lines, though this optimization is generally disabled when the -n option is specified. Some basic profiling on -HEAD as of 317254, built WITH_BSD_GREP and WITHOUT_BSD_GREP_FASTMATCH, illustrates this. Using the following script:

#!/bin/sh
# Search the ports INDEX 100 times with whichever grep ${GREP} names.
INDEX=/usr/ports/INDEX-$(make -C /usr/ports -VOSREL:C/\\..+//)

for i in $(jot 100); do
    ${GREP} -nq "/usr/ports/x11-wm/xfce4-wm" "${INDEX}"
done

and running it like so:

# time env GREP=grep ~/profile-script.sh
# time env GREP=gnugrep ~/profile-script.sh

bsdgrep(1)'s performance is on par with GNU grep's for the -n case. Further basic profiling shows bsdgrep(1) taking roughly twice as long when -nq is replaced with -C1 in the script above.

Obviously, we would benefit from similar optimizations. For the rest of this discussion, I will be referring to the state of the code after the refactoring I've done in D10433 (not yet landed as of this writing), because it is key here: it moves context handling and other bits that are not used to determine matching behavior out of procline(). Here's a small breakdown of what we can and should do, and thus a rough plan:

procfile() in util.c consumes lines via grep_fgetln() and feeds them into procline(). procline() finds the earliest/longest match on a given line, then repeats that process after the match until we either stop matching or have matched up to the end of the buffer, recording all matches in the parsing context so they can be passed around elsewhere.
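For reference, here is a heavily condensed sketch of that per-line loop. This is not the actual bsdgrep code: the struct layout and the helper name are invented for illustration, and only the regex(3) pieces (regexec() and regmatch_t) are real.

/*
 * Condensed, illustrative model of the per-line matching loop.
 * regexec()/regmatch_t come from regex(3); everything else is invented.
 */
#include <regex.h>
#include <stddef.h>

#define MAX_MATCHES 32

struct parsec {
    const char  *line;                  /* current line, NUL-terminated */
    size_t       len;                   /* line length */
    regmatch_t   matches[MAX_MATCHES];  /* matches found on this line */
    size_t       nmatches;
};

static void
process_line(regex_t *re, struct parsec *pc)
{
    regmatch_t pmatch;
    size_t off = 0;

    pc->nmatches = 0;
    while (off <= pc->len && pc->nmatches < MAX_MATCHES) {
        /* Earliest match in the remainder of the line. */
        int flags = (off == 0) ? 0 : REG_NOTBOL;
        if (regexec(re, pc->line + off, 1, &pmatch, flags) != 0)
            break;
        /* Record it relative to the start of the line... */
        pc->matches[pc->nmatches].rm_so = pmatch.rm_so + off;
        pc->matches[pc->nmatches].rm_eo = pmatch.rm_eo + off;
        pc->nmatches++;
        /* ...and advance past it, or by one byte for an empty match. */
        off += (pmatch.rm_eo > 0) ? (size_t)pmatch.rm_eo : 1;
    }
}

The point to take away is that this loop currently runs once per line of input, even for lines that cannot possibly match.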

This is one of the key points to generalize and improve. We can use the same general algorithm to process a chunk, broken down into a few key components (a rough sketch of how they might fit together follows below):

1. A function that grabs the earliest/longest match in an entire chunk, taking -x/-w into account, and passes that location back to the caller.

2. The caller of that function. Until the end of the chunk, call #1. Locate the preceding EOL if there is one, and take -o/-v behavior into account: if -v is not set and -o is, process the line for any further matches so that we have a complete set (this could be our current procline()). If -o is not set, we can simply print out the current line.

If -v is set, just print out *everything* before the preceding EOL and handle context from the first line in the chunk and the matched line. We can start processing again from the next EOL following the last bit of -A context, which might spill into the next chunk (func #3). If the caller invokes #1 and gets back no match, locate the *final* EOL if there is one and print out to there. Do proper -o matching everywhere.

At this point, if it did not hit EOF, it can grab another chunk of the file, but it should prepend the remaining bits of the chunk it just processed (after the final EOL) to the new chunk so that it can examine that line in its entirety. #2 does not have to worry about context (-A/-B/-C) directly, because of:

3. The context function: given the parsing context and whether it's dealing with -An/-Bn context, it should traverse forward/backward in the chunk for the 'n' EOLs it is looking for. If we reach the beginning of the chunk, reload the previous chunk (or keep the previous chunk loaded as we go if we're doing context, reloading any chunks further back as needed and unloading them when finished). Ditto for forward traversal, except we can keep those loaded; they can be handled by:

4. The chunk loader: loads chunk n if it is not already loaded/preloaded, and returns it. May call grep_refill as it goes. If the -n flag is set, we'll keep a running count of newlines up to the current chunk we're dealing with, updating the number as we process the chunk in #1 or #2. If we don't end up having any matches in a chunk, we'll count newlines up to the end of it anyway as the next chunk gets loaded in.

I plan to try out 4k chunks to start with. We currently read in 32k worth of file at a time, so we'll have 8 chunks preloaded at a time.
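Tying the pieces above together, here is a rough, hypothetical sketch of how #1, #2, and #4 might interact, with #3 reduced to comments. None of these names exist in bsdgrep, and error handling, -o/-v/-x/-w, context, line counting, and the carry-over of a partial trailing line into the next chunk are all omitted. It assumes the REG_STARTEND extension to regex(3) so that a match can be run over an arbitrary window of a chunk without NUL-terminating it.

/*
 * Hypothetical sketch only -- not bsdgrep code.  #4 loads chunks, #1 finds
 * the earliest match anywhere in a chunk, and #2 maps each match back to
 * its enclosing line.
 */
#include <regex.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096     /* 4k chunks, as above */

struct chunk {
    char    buf[CHUNK_SIZE];
    size_t  len;    /* valid bytes in buf */
    bool    eof;    /* no more data after this chunk */
};

/* #4: load the next chunk of the file (in-tree this would use grep_refill). */
static bool
load_chunk(FILE *fp, struct chunk *c)
{
    c->len = fread(c->buf, 1, sizeof(c->buf), fp);
    c->eof = (c->len < sizeof(c->buf));
    return (c->len > 0);
}

/* #1: earliest match in [off, c->len); -x/-w checks omitted. */
static bool
chunk_match(regex_t *re, const struct chunk *c, size_t off, regmatch_t *m)
{
    int flags = REG_STARTEND | ((off != 0) ? REG_NOTBOL : 0);

    m->rm_so = 0;
    m->rm_eo = c->len - off;
    if (regexec(re, c->buf + off, 1, m, flags) != 0)
        return (false);
    m->rm_so += off;
    m->rm_eo += off;
    return (true);
}

/* #2: walk one chunk, mapping each match back to its enclosing line. */
static void
process_chunk(regex_t *re, const struct chunk *c)
{
    regmatch_t m;
    size_t off = 0;

    while (off < c->len && chunk_match(re, c, off, &m)) {
        /* Locate the preceding EOL (start of the matched line). */
        size_t sol = m.rm_so;
        while (sol > 0 && c->buf[sol - 1] != '\n')
            sol--;
        /* Locate the end of the matched line. */
        const char *nl = memchr(c->buf + m.rm_eo, '\n', c->len - m.rm_eo);
        size_t eol = (nl != NULL) ? (size_t)(nl - c->buf) : c->len;

        /* Without -o/-v this line can be printed as-is; with -o we would
         * re-run matching over just [sol, eol) to collect every match. */
        fwrite(c->buf + sol, 1, eol - sol, stdout);
        putchar('\n');

        /* #3 (context, -A/-B/-C) would hook in here before resuming. */
        off = eol + 1;
    }
    /* Any tail after the final EOL would be prepended to the next chunk. */
}

In the real thing, #2 would hand matched lines to the context machinery in #3 rather than printing directly, and #4 would maintain the running newline count needed for -n.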

I believe that even in the -n case, this should be more efficient on average than what we have now, possibly giving us an edge even over gnugrep. Overall we'll be *greatly* reducing the calls into regex(3) that are currently wasted on lines that probably won't match anyway, as in the primitive profiling above.

That specific case will see a *large* improvement given that it's (on my system) a 35M file, with every chunk but the last taking only one call into regex(3).
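For a rough sense of scale, assuming the 35M figure and 4k chunks above: 35 * 1024 / 4 is roughly 8960 chunks, so a full pass over that INDEX would be on the order of nine thousand regex(3) calls for the non-matching chunks, rather than one call per line of the file.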

This plan is likely missing pieces and will be updated as it evolves, before diving into implementation details.
