feat(grep): real PCRE -P via fancy-regex and GNU long-option aliases #1846
-P now routes to the backtracking fancy_regex engine (bounded by FANCY_BACKTRACK_LIMIT) instead of aliasing ERE, so lookaround and backreferences work like GNU grep -P. Recursive -P bypasses the indexed search fast-path so a backend regex that can't speak PCRE cannot drop real matches. A shared Matcher enum in search_common.rs hides the two engines' differing APIs. Adds GNU long-option aliases for every supported short flag (--ignore-case, --invert-match, --max-count=N, --regexp=PAT, etc.), accepting both --name=value and --name value forms, plus -G/--basic-regexp. Also fixes -b with -o to report the match's byte offset rather than the line start.
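As a rough illustration of how one enum can hide two engines with different APIs, the sketch below wraps an infallible matcher and a fallible one behind a single `is_match` that returns `bool`. `LinearEngine`, `BacktrackingEngine`, and `StepLimitExceeded` are hypothetical stand-ins written for this example; they are not the `regex` or `fancy_regex` types, and the real `Matcher` in search_common.rs may look different.

```rust
/// Hypothetical stand-in for a linear-time engine: `is_match` cannot fail.
struct LinearEngine {
    needle: String,
}

impl LinearEngine {
    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle.as_str())
    }
}

/// Hypothetical run-time error of the backtracking stand-in.
#[derive(Debug)]
struct StepLimitExceeded;

/// Hypothetical stand-in for a backtracking engine: `is_match` returns a
/// `Result` because the search can be aborted at run time.
struct BacktrackingEngine {
    needle: String,
    max_len: usize, // crude proxy for a step budget (assumption for this sketch)
}

impl BacktrackingEngine {
    fn is_match(&self, text: &str) -> Result<bool, StepLimitExceeded> {
        if text.len() > self.max_len {
            return Err(StepLimitExceeded);
        }
        Ok(text.contains(self.needle.as_str()))
    }
}

/// One type for callers, two engines underneath.
enum Matcher {
    Standard(LinearEngine),
    Fancy(BacktrackingEngine),
}

impl Matcher {
    /// Uniform boolean API: an engine error is reported as "no match".
    fn is_match(&self, text: &str) -> bool {
        match self {
            Matcher::Standard(engine) => engine.is_match(text),
            Matcher::Fancy(engine) => engine.is_match(text).unwrap_or(false),
        }
    }
}
```

Callers only ever see `Matcher::is_match`, so the choice of engine stays a construction-time detail.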
Add a catastrophic-backtracking regression test for grep -P and a THREAT[TM-DOS-025] marker at the fancy-regex mitigation point. Update the threat model: the default regex engine is linear-time and the fancy-regex paths (grep -P, sed) are capped by FANCY_BACKTRACK_LIMIT, so TM-DOS-025 moves from Partial to MITIGATED.
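To make the mitigation concrete, here is a toy backtracking matcher with a step budget. It is only a sketch of the general technique, not how `fancy_regex` is implemented: it hard-codes an anchored analogue of `(a+)+$` (the whole input must split into runs of `a`), and `STEP_LIMIT`, `match_runs`, `try_match`, and `BacktrackLimitExceeded` are names invented for this example.

```rust
/// Hypothetical step budget for this toy matcher (an assumption for the sketch).
const STEP_LIMIT: usize = 1_000_000;

#[derive(Debug, PartialEq)]
struct BacktrackLimitExceeded;

/// Naively checks whether `s[pos..]` splits into zero or more runs of one or
/// more `a`, trying every split point. On a non-matching input this explores
/// exponentially many splits, which is what the budget guards against.
fn match_runs(s: &[u8], pos: usize, steps: &mut usize) -> Result<bool, BacktrackLimitExceeded> {
    *steps += 1;
    if *steps > STEP_LIMIT {
        return Err(BacktrackLimitExceeded);
    }
    if pos == s.len() {
        return Ok(true);
    }
    let mut end = pos;
    while end < s.len() && s[end] == b'a' {
        end += 1;
        // Treat s[pos..end] as one run, then try to match the remainder.
        if match_runs(s, end, steps)? {
            return Ok(true);
        }
    }
    Ok(false)
}

/// Fallible entry point: `Err` means the budget ran out before an answer.
fn try_match(text: &str) -> Result<bool, BacktrackLimitExceeded> {
    if text.is_empty() {
        return Ok(false); // at least one run is required
    }
    let mut steps = 0;
    match_runs(text.as_bytes(), 0, &mut steps)
}

/// "No match" policy: a budget overrun is reported as `false`.
fn is_match(text: &str) -> bool {
    try_match(text).unwrap_or(false)
}
```

With a hostile input such as forty `a` characters followed by `b`, `try_match` gives up once the budget is spent and `is_match` reports `false` rather than running for an exponential number of steps.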
Pull request overview
This PR enhances the grep builtin by adding true PCRE-style -P support via fancy_regex, introducing GNU-compatible long-option aliases, and documenting/mitigating regex backtracking DoS risk via a bounded backtrack limit.
Changes:
- Add a shared Matcher abstraction and route grep -P to fancy_regex with a fixed FANCY_BACKTRACK_LIMIT.
- Implement GNU long-option aliases (including --name=value and --name value forms) and add -G/--basic-regexp.
- Align -b + -o byte-offset behavior with GNU grep and update specs + add unit tests.
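The two accepted value forms for long options can be handled with one small helper. The sketch below is an assumption about how such parsing could look, not the code in grep.rs: `long_option_value` and the wording of the error message are invented for illustration.

```rust
/// If `args[i]` is the value-taking long option `--<name>`, returns its value
/// and how many arguments were consumed (1 for `--name=value`, 2 for
/// `--name value`). Returns `Ok(None)` when `args[i]` is some other argument.
fn long_option_value<'a>(
    name: &str,
    args: &[&'a str],
    i: usize,
) -> Result<Option<(&'a str, usize)>, String> {
    let Some(&arg) = args.get(i) else {
        return Ok(None);
    };
    let Some(rest) = arg.strip_prefix("--") else {
        return Ok(None);
    };
    // Inline form: --name=value
    if let Some((option, value)) = rest.split_once('=') {
        return Ok(if option == name { Some((value, 1)) } else { None });
    }
    // Separate form: --name value
    if rest == name {
        return match args.get(i + 1) {
            Some(&value) => Ok(Some((value, 2))),
            // Hypothetical message wording, chosen for this sketch.
            None => Err(format!("option '--{name}' requires an argument")),
        };
    }
    Ok(None)
}
```

Returning the number of consumed arguments lets the caller advance its index by one or two without caring which form was used.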
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| specs/threat-model.md | Marks TM-DOS-025 as mitigated and documents the new backtracking cap behavior. |
| specs/implementation-status.md | Updates grep’s implemented feature list to reflect -G, true -P, long options, and -b+-o behavior. |
| crates/bashkit/src/builtins/search_common.rs | Introduces Matcher, FANCY_BACKTRACK_LIMIT, and a fancy_regex matcher builder. |
| crates/bashkit/src/builtins/grep.rs | Adds -P PCRE path, long-option parsing, indexed-search bypass for -P, byte-offset fix for -o, and new tests. |
    /// Byte ranges `(start, end)` of all non-overlapping matches, left to
    /// right. Slice `text[start..end]` to recover the matched substring.
    pub(crate) fn find_ranges(&self, text: &str) -> Vec<(usize, usize)> {
        match self {
            Matcher::Standard(re) => re.find_iter(text).map(|m| (m.start(), m.end())).collect(),
            // `find_iter` yields `Result<Match, _>`; `flatten` drops the Err
            // arm (backtrack-limit / internal errors), the same "no match"
            // policy as `is_match`.
            Matcher::Fancy(re) => re
                .find_iter(text)
                .flatten()
                .map(|m| (m.start(), m.end()))
                .collect(),
        }
    }
Keeping the eager Vec here intentionally. find_ranges is only ever called per-line (the caller loops over lines first), so the collection is bounded by matches within a single line, not the whole file — and the non--o hot paths (-q, -l/-L, plain matching) use is_match, which already early-exits and never calls find_ranges. --max-count is also enforced across lines via total_matches, so the only over-work is collecting one line's matches before the per-line break.
A lazy alternative would have to bridge two different concrete iterator types (regex::Matches vs fancy_regex::Matches, the latter yielding Result), which means either Box<dyn Iterator> (a per-line heap alloc — no better than the Vec) or a hand-rolled enum-iterator (~30 lines) for a gain that's marginal at per-line scale. Not worth the added surface here; will revisit if profiling on a real workload shows it matters.
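For reference, the hand-rolled enum-iterator mentioned above could look roughly like the following. This is a generic sketch under assumptions, not the bashkit code: `LazyRanges` is a made-up name, and the two type parameters stand in for the concrete `regex` and `fancy_regex` match iterators (already mapped to byte ranges), which are not used here so the example stays dependency-free.

```rust
/// Lazy bridge over two different concrete iterator types.
/// `A` yields ranges directly; `B` yields `Result`s that may carry an error.
enum LazyRanges<A, B> {
    Standard(A),
    Fancy(B),
}

impl<A, B, E> Iterator for LazyRanges<A, B>
where
    A: Iterator<Item = (usize, usize)>,
    B: Iterator<Item = Result<(usize, usize), E>>,
{
    type Item = (usize, usize);

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            LazyRanges::Standard(it) => it.next(),
            // Skip `Err` items and yield the next `Ok`, which mirrors what
            // `flatten` does on an iterator of `Result`s.
            LazyRanges::Fancy(it) => it.find_map(Result::ok),
        }
    }
}
```

Because nothing is collected, a caller that stops after the first range never asks the underlying engine for the remaining matches on that line; that is the marginal gain being weighed against the extra type.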
Generated by Claude Code
… in --help

Address PR review:
- Pattern-type flags now go through set_pattern_type(PatternType), so the last of -G/-E/-F/-P (and their long forms) wins, matching GNU grep. Previously -P set perl_regex without clearing extended/fixed, so a later -G/-E had no effect.
- --help now lists the GNU long-option aliases alongside each short flag.

Adds last-wins tests for both short and long forms.
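The last-wins rule falls out naturally when the pattern type is a single enum-valued field instead of several independent booleans. The sketch below assumes that shape: `set_pattern_type` and `PatternType` are named in the commit message, while the variant names, `GrepOptions`, `pattern_type_flag`, and `parse_pattern_type` are illustrative guesses.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PatternType {
    Basic,
    Extended,
    Fixed,
    Perl,
}

struct GrepOptions {
    pattern_type: PatternType,
}

impl GrepOptions {
    /// A single field means a later flag simply overwrites an earlier one.
    fn set_pattern_type(&mut self, pattern_type: PatternType) {
        self.pattern_type = pattern_type;
    }
}

/// Maps a short or long pattern-type flag to its `PatternType`.
fn pattern_type_flag(arg: &str) -> Option<PatternType> {
    match arg {
        "-G" | "--basic-regexp" => Some(PatternType::Basic),
        "-E" | "--extended-regexp" => Some(PatternType::Extended),
        "-F" | "--fixed-strings" => Some(PatternType::Fixed),
        "-P" | "--perl-regexp" => Some(PatternType::Perl),
        _ => None,
    }
}

/// Scans the arguments left to right; the last pattern-type flag wins.
fn parse_pattern_type(args: &[&str]) -> PatternType {
    // Basic regular expressions are grep's default pattern type.
    let mut options = GrepOptions { pattern_type: PatternType::Basic };
    for arg in args {
        if let Some(pattern_type) = pattern_type_flag(arg) {
            options.set_pattern_type(pattern_type);
        }
    }
    options.pattern_type
}
```

With separate booleans, every setter has to remember to clear the others; with one field, the stale-flag bug described above cannot be expressed.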
What
Two capability improvements to the grep builtin, plus a security hardening of its regex backtracking surface.

1. Real -P (PCRE) via fancy_regex
- -P previously aliased ERE on the default regex crate, so lookaround and backreferences silently didn't work. It now routes to the backtracking fancy_regex engine: foo(?=bar), lookbehind (?<=\$)\d+, and backreferences (\w+) \1 work like GNU grep -P.
- A shared Matcher enum in search_common.rs hides the two engines' differing APIs (fancy_regex returns Result from is_match/find_iter).
- Recursive -P bypasses the indexed-search fast-path, since the backend's regex engine can't speak PCRE and could otherwise drop real matches.

2. GNU long-option aliases
Scripts using grep --ignore-case etc. previously failed (unknown long options were silently ignored). Added long-form aliases for every supported short flag (--ignore-case, --invert-match, --line-number, --count, --only-matching, --word-regexp, --line-regexp, --fixed-strings, --extended-regexp, --basic-regexp, --perl-regexp, --quiet/--silent, --byte-offset, --text, --null-data, --recursive, --with-filename, --no-filename, --regexp=PAT, --file=FILE, --max-count=N, --after-context=N, --before-context=N, --context=N), accepting both --name=value and --name value. Also added -G/--basic-regexp.

3. Fix + hardening
- -b with -o now reports the match's byte offset rather than the line start (matches GNU).
- grep -P backtracking is bounded by FANCY_BACKTRACK_LIMIT (1M steps, same posture as sed); a pattern that exceeds it yields "no match" instead of hanging the sandbox.

Why
This was scoped from a review of what's adoptable from uutils/grep. Its -P uses Oniguruma (onig/onig_sys, a C lib), which is incompatible with bashkit's pure-Rust / WASM / sandbox constraints, so the crate-frugal path is reusing fancy_regex, which is already an always-on dependency (used by sed, rg, jq). No new crate is added.

How / Tests
- … --perl-regexp, invalid-pattern error path, long-option aliases (inline + space-separated value forms, missing-value error), -b + -o offset, and a catastrophic-backtracking regression test ((a+)+$) proving that the backtrack limit terminates the search.
- … -F) unchanged; differential spec tests don't use -P.
- cargo fmt --check, cargo clippy --all-targets -- -D warnings, cargo test all green. TM-INF-022 source-scan passes.

Specs
- specs/implementation-status.md: updated grep feature list.
- specs/threat-model.md: TM-DOS-025 moves from Partial → MITIGATED (linear-time default engine; fancy-regex paths capped by FANCY_BACKTRACK_LIMIT), with a // THREAT[TM-DOS-025] marker at the mitigation point.

Generated by Claude Code
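The -b with -o fix described under "Fix + hardening" comes down to adding the match's start within the line to the line's own offset in the input. A minimal sketch of that arithmetic, using a fixed-string needle instead of a regex and an invented function name (`only_matching_offsets`), might look like this:

```rust
/// For every occurrence of `needle`, returns the byte offset of the match
/// within the whole input (not the offset of the line that contains it).
fn only_matching_offsets<'a>(text: &'a str, needle: &str) -> Vec<(usize, &'a str)> {
    let mut out = Vec::new();
    let mut line_start = 0; // byte offset of the current line within `text`
    for line in text.split_inclusive('\n') {
        for (start_in_line, matched) in line.match_indices(needle) {
            out.push((line_start + start_in_line, matched));
        }
        // `split_inclusive` keeps the newline, so lengths add up exactly.
        line_start += line.len();
    }
    out
}
```

For the input `"foo bar\nbaz bar\n"` and the needle `bar`, this yields offsets 4 and 12; reporting the line start instead would give 0 and 8.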