Fix double-shufti accelerator missing matches across stream chunk boundaries#402
Conversation
…ng mode shuftiDoubleExec finds the first position i where byte[i] is in class1 and byte[i+1] is in class2, so it must look one byte past the position it reports. shuftiDoubleExecReal excludes the final buffer byte from its main loop and defers it to check_last_byte, which zero-pads the absent byte[i+1]. check_last_byte only stopped on the last byte when it completed a single-char (wildcard second char) pattern. When the last byte is merely the first char of a double it was skipped. That is correct at true end-of-data, but at an intermediate stream chunk boundary the second char is the first byte of the next chunk, so the straddling match was silently dropped - a streaming split-invariance violation. Also stop on the last byte when it is in the first-char class (last_elem != 0xff), returning buf_end-1. That is a resume point, not a reported match: the engine re-examines the byte, carries the pending-first-char state into the next chunk (fixing the false negative), and at true EOD finds nothing (no false positive). The change is off the SIMD hot loop. Add DoubleShuftiStreaming.SplitInvariant (block match set vs every streaming split) and update the internal shufti tests to the new last-byte contract.
|
hi @yoavwizstein, thanks for your PR. As this is obviously an AI-generated PR, I will have to confirm that it is a human that sent it and that you actually understand the problem it solves and how it solves it. I will not accepted completely automated PRs and will reject them right away, unless there is a person involved that will actually take credit and responsibility. Even though this is a simple PR, this will only start to get more prevalent and I need to protect the project -and my time. |
|
Hi markos, definitely human here :) lmk if you need anything else from me! |
|
I just wanted to make sure that it's an actual human that submitted the PR and can explain it. I will do some manual testing to confirm myself -though your explanation seems very logical and thank you for the catch. |
What's wrong
Some patterns that accelerate with a two-char-class double-shufti miss matches in
streaming mode when a match's two-byte anchor lands on a chunk boundary. Works in
block mode and as a single chunk, so streaming isn't split-invariant. Silent
false negative.
Patterns I hit it with:
Why
The double accel needs byte[i] and byte[i+1], so the last byte of the buffer is
handled in check_last_byte. It only stopped there if the byte completed a
single-char pattern. If the byte is just the first char of a pair, it got
skipped. Fine at real end of data, but at a chunk boundary the second char is in
the next chunk, so the match is lost. The accel can't tell EOD from a chunk
boundary, so it has to be careful with the last byte either way.
This is the streaming sibling of #358. That was the block-mode false negative
from #325, fixed in #368 by scanning the last byte up to c_end. Correct for
block, but it reintroduced the skip at chunk boundaries, and no test caught it
because the double accel had no streaming split test.
Fix
One line in check_last_byte: also stop when the last byte is in the first-char
class, returning buf_end-1. That's a resume point, not a match, so no false
positives. The engine re-checks the byte and carries state into the next chunk.
Covers every engine that uses the accel (LimEx, McClellan, Gough, Sheng,
McSheng).
Tests
Added DoubleShuftiStreaming.SplitInvariant: block matches vs streaming at every
split offset. Fails on develop (drops offset 64 at split 63 for the first
pattern), passes with the fix. Also updated the internal shufti.cpp assertions
for the new last-byte behavior.
unit-internal passes (14196), the new test and Identical pass (43), and a fuzz of
2.18M chunked scans gave 0 FN and 0 FP.
Performance
Not in the SIMD loop, just one check at the tail. Ran hsbench in block mode,
patched vs the same tree with only this reverted, interleaved and pinned to one
core, median of 12: realistic corpus +1.4%, worst case -1.2%. The realistic
number being positive means it's all noise.
Alternative
Could thread an is_eod flag down so block stays byte-identical and only streaming
changes. Cleaner in principle, but it touches every engine's accel call site for
no measurable gain. Happy to do that instead if you'd prefer.