Recognize en-dash and em-dash as pin-cite range separators #308
Open
pmarreck wants to merge 2 commits into
Pin-cite page ranges that use a Unicode en-dash (U+2013) or em-dash
(U+2014) instead of an ASCII hyphen-minus were silently dropped. For
example, in
Harris Trust v. Salomon, 530 U. S. 238, 241–242 (2000)
the "241–242" pin cite was lost: PIN_CITE_TOKEN_REGEX matched only
"-" as a range separator, so the dash range failed the pin-cite
lookahead and fell through into the "extra" field. These longer dashes
are common in typeset opinions and OCR'd briefs.
Widen the two range-separator positions in PIN_CITE_TOKEN_REGEX (the
page-range and page:line-range alternatives) from a literal "-" to the
inline class [-–—], matching the dash set eyecite already inlines in
PLACEHOLDER_CITATIONS. The captured pin cite preserves the source dash
verbatim, consistent with how hyphen ranges are stored.
ASCII-hyphen ranges are unchanged, and other Unicode dashes (figure
dash U+2012, minus sign U+2212) are intentionally not matched, so this
is purely additive with no behavior change for existing inputs.
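The failure mode and the widening described above can be sketched with a small stand-alone example. The patterns below are simplified stand-ins for the page-range branch only, and FOLLOWED_BY is a hypothetical approximation of the pin-cite lookahead; neither is eyecite's actual regex, and find_pin_cite is a helper invented for illustration.

```python
import re

# Simplified stand-ins for the page-range branch (NOT eyecite's real regexes).
OLD_TOKEN = r"[*]?\d+(?:-\d+)?"       # hyphen-minus only
NEW_TOKEN = r"[*]?\d+(?:[-–—]\d+)?"   # hyphen-minus, en-dash, em-dash

# Hypothetical lookahead: the pin cite must be followed by a paren,
# closing punctuation, or the end of the text.
FOLLOWED_BY = r"(?=\s*\(|[.,;]|$)"


def find_pin_cite(token_regex, text):
    """Return the pin cite at the start of `text`, or None if none matches."""
    pattern = rf",\s*(?P<pin_cite>{token_regex}){FOLLOWED_BY}"
    match = re.match(pattern, text)
    return match.group("pin_cite") if match else None


# Text that follows "530 U. S. 238" in the example citation.
after_reporter = ", 241–242 (2000)"

print(find_pin_cite(OLD_TOKEN, after_reporter))      # None
print(find_pin_cite(NEW_TOKEN, after_reporter))      # 241–242
print(find_pin_cite(OLD_TOKEN, ", 241-242 (2000)"))  # 241-242
```

With the old token, only "241" can match, the next character is the en-dash, and the lookahead rejects it, so no pin cite is found. With the widened class the whole range is captured and the source dash is kept verbatim.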
3133045 to
111a947
Contributor
Author
FYI: resolving conflicts and ensuring tests pass again.
…e-separator conflict)
Contributor
Author
Updated this branch to current […]. The single red check, PR comment, is the Benchmark workflow failing at its very first step (posting its "in progress" sticky comment) because a fork PR's […]. Ready for review whenever you have a chance. Thanks!
Problem
Pin-cite page ranges that use a Unicode en-dash (–, U+2013) or em-dash (—, U+2014) instead of an ASCII hyphen-minus are silently dropped. These longer dashes are extremely common in typeset opinions and in OCR'd briefs, where typesetting systems render numeric ranges with an en-dash rather than a hyphen.
For example, given the real-world citation:
Harris Trust v. Salomon, 530 U. S. 238, 241–242 (2000)
eyecite currently returns pin_cite=None, and the range 241–242 leaks into the extra field instead of being recognized as the pin cite. The same happens for the em-dash variant (44—45). The ASCII-hyphen form (241-242) works correctly today.
Root cause
PIN_CITE_TOKEN_REGEX in eyecite/regexes.py accepts only the ASCII hyphen-minus (-) as a range separator, in both the page-range branch ([*]?\d+(?:-\d+)?) and the page:line branch (\d+:\d+(?:-\d+(?::\d+)?)?). When the separator is an en-dash or em-dash, the digit-range token matches only the first page, the surrounding PIN_CITE_REGEX lookahead (which requires the pin cite to be followed by ending punctuation or a paren) fails, and the whole pin cite falls through into extra.
Fix
Introduce a RANGE_SEPARATOR_REGEX character class and use it for both range separators in PIN_CITE_TOKEN_REGEX.
This is the same dash set eyecite already treats as equivalent in PLACEHOLDER_CITATIONS ([_—–-]), so it follows existing precedent in this module. The captured pin cite preserves the source dash verbatim (e.g. 241–242), consistent with how hyphen ranges are already stored: eyecite captures the original text slice rather than normalizing it.
Scope
Only the en-dash (U+2013) and em-dash (U+2014) are added, matching the dashes that genuinely appear as range separators in legal text (and the set eyecite already recognizes elsewhere). Other Unicode dashes — figure dash (U+2012) and the minus sign (U+2212) — are intentionally not matched; they do not occur as pin-cite range separators in practice, and leaving them out keeps the change minimal and false-positive-safe.
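The scope can be illustrated with a short check of which dash characters the widened class accepts. This is a sketch under assumptions: the value given to RANGE_SEPARATOR_REGEX is inferred from the description above, the two patterns are simplified stand-ins for the page-range and page:line branches, and accepts is a hypothetical helper.

```python
import re
import unicodedata

# Assumed value for the character class this PR calls RANGE_SEPARATOR_REGEX.
RANGE_SEPARATOR_REGEX = r"[-–—]"

# Simplified stand-ins for the two widened branches (not eyecite's full regex).
PAGE_RANGE = re.compile(rf"[*]?\d+(?:{RANGE_SEPARATOR_REGEX}\d+)?")
PAGE_LINE_RANGE = re.compile(
    rf"\d+:\d+(?:{RANGE_SEPARATOR_REGEX}\d+(?::\d+)?)?"
)


def accepts(dash: str) -> bool:
    """True if `dash` works as the separator in both range forms."""
    return bool(
        PAGE_RANGE.fullmatch(f"241{dash}242")
        and PAGE_LINE_RANGE.fullmatch(f"12:3{dash}13:4")
    )


# Hyphen-minus, en-dash, em-dash, figure dash, minus sign.
for dash in ("\u002d", "\u2013", "\u2014", "\u2012", "\u2212"):
    print(f"U+{ord(dash):04X} {unicodedata.name(dash)}: {accepts(dash)}")
```

The first three characters are accepted and the figure dash and minus sign are rejected, which matches the scope stated above.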
Test coverage
Added two cases to test_find_citations in tests/test_FindTest.py, alongside the existing hyphen pin-cite tests, covering the en-dash and em-dash range separators. They run across all three tokenizers (Tokenizer, AhocorasickTokenizer, HyperscanTokenizer) via the existing run_test_pairs harness.
Without the fix, both cases fail (pin_cite is None and the range leaks into extra).
With the fix, both cases pass (the pin cites are captured as 347–348 / 347—348).
No behavior change for existing inputs
The full test suite (python -m unittest discover -s tests -p 'test_*.py') passes with no regressions.
This is purely additive: it recovers pin cites that were previously dropped, without altering any input eyecite already handled.