Recognize en-dash and em-dash as pin-cite range separators #308

Open
pmarreck wants to merge 2 commits into freelawproject:main from pmarreck:endash-pincite-ranges

@pmarreck

Contributor

Problem

Pin-cite page ranges that use a Unicode en-dash (–, U+2013) or em-dash (—, U+2014) instead of an ASCII hyphen-minus are silently dropped. These longer dashes are extremely common in typeset opinions and in OCR'd briefs, where typesetting systems render numeric ranges with an en-dash rather than a hyphen.

For example, given the real-world citation:

Harris Trust v. Salomon, 530 U. S. 238, 241–242 (2000)

eyecite currently returns pin_cite=None and the range 241–242 leaks into the extra field instead of being recognized as the pin cite. The same happens for the em-dash variant (44—45). The ASCII-hyphen form (241-242) works correctly today.

Root cause

PIN_CITE_TOKEN_REGEX in eyecite/regexes.py accepts only the ASCII hyphen-minus (-) as a range separator, in both the page-range branch ([*]?\d+(?:-\d+)?) and the page:line branch (\d+:\d+(?:-\d+(?::\d+)?)?). When the separator is an en-/em-dash, the digit-range token matches only the first page, the surrounding PIN_CITE_REGEX lookahead (which requires the pin cite to be followed by ending punctuation or a paren) fails, and the whole pin cite falls through into extra.
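The failure mode can be reproduced with the standard library alone. The sketch below is a minimal illustration, not eyecite's actual code: OLD_TOKEN copies only the page-range branch quoted above, and LOOKAHEAD, OLD_PIN_CITE, and old_pin_cite are hypothetical, simplified stand-ins for the surrounding PIN_CITE_REGEX.

```python
import re

# Page-range branch as quoted above: only the ASCII hyphen-minus joins a range.
OLD_TOKEN = r"[*]?\d+(?:-\d+)?"

# Assumed, simplified lookahead: ending punctuation, an opening paren or
# bracket (optionally after a space), or end of text.
LOOKAHEAD = r"(?=[,.;)\]]| ?[(\[]|$)"

# Hypothetical stand-in for the full pin-cite pattern.
OLD_PIN_CITE = re.compile(rf",? ?(?:at )?(?P<pin_cite>{OLD_TOKEN}){LOOKAHEAD}")


def old_pin_cite(text_after_cite):
    """Return the pin cite at the start of the text, or None if nothing matches."""
    match = OLD_PIN_CITE.match(text_after_cite)
    return match.group("pin_cite") if match else None


hyphen_result = old_pin_cite(", 241-242 (2000)")        # ASCII hyphen range
en_dash_result = old_pin_cite(", 241\u2013242 (2000)")  # en-dash range

print(hyphen_result)   # 241-242
print(en_dash_result)  # None
```

With the en-dash input, the token can consume at most "241", and the next character is the dash rather than punctuation or a paren, so the lookahead rejects every candidate and no pin cite is produced.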

Fix

Introduce a RANGE_SEPARATOR_REGEX character class and use it for both range separators in PIN_CITE_TOKEN_REGEX:

RANGE_SEPARATOR_REGEX = r"[-–—]"

This is the same dash set eyecite already treats as equivalent in PLACEHOLDER_CITATIONS ([_—–-]), so it follows existing precedent in this module. The captured pin cite preserves the source dash verbatim (e.g. 241–242), consistent with how hyphen ranges are already stored — eyecite captures the original text slice rather than normalizing it.
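As a rough sketch of the effect, the character class can be dropped into both range-separator positions of a simplified token. Only RANGE_SEPARATOR_REGEX and the two branch shapes come from the description above; SKETCH_TOKEN, SKETCH_PIN_CITE, and pin_cite are hypothetical names, and the optional label prefixes that the real pattern may handle are left out.

```python
import re

# Character class from the PR: hyphen-minus, en-dash (U+2013), em-dash (U+2014).
RANGE_SEPARATOR_REGEX = r"[-–—]"

# Hypothetical, simplified token: page:line branch first, then page-range branch.
SKETCH_TOKEN = (
    rf"(?:\d+:\d+(?:{RANGE_SEPARATOR_REGEX}\d+(?::\d+)?)?"
    rf"|[*]?\d+(?:{RANGE_SEPARATOR_REGEX}\d+)?)"
)

# Assumed lookahead: ending punctuation, an opening paren/bracket, or end of text.
SKETCH_PIN_CITE = re.compile(
    rf",? ?(?:at )?(?P<pin_cite>{SKETCH_TOKEN})(?=[,.;)\]]| ?[(\[]|$)"
)


def pin_cite(text_after_cite):
    """Return the matched slice unchanged, so the source dash is preserved."""
    match = SKETCH_PIN_CITE.match(text_after_cite)
    return match.group("pin_cite") if match else None


print(pin_cite(", 241\u2013242 (2000)"))  # en-dash page range
print(pin_cite(", 44\u201445."))          # em-dash page range
print(pin_cite(", 12:3\u201314:5."))      # en-dash page:line range
print(pin_cite(", 241-242 (2000)"))       # ASCII hyphen still matches
```

Because the match is returned as a slice of the input, the dash that appeared in the source text is the dash that appears in the result, which mirrors the verbatim-capture behavior described above.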

Scope

Only the en-dash (U+2013) and em-dash (U+2014) are added, matching the dashes that genuinely appear as range separators in legal text (and the set eyecite already recognizes elsewhere). Other Unicode dashes — figure dash (U+2012) and the minus sign (U+2212) — are intentionally not matched; they do not occur as pin-cite range separators in practice, and leaving them out keeps the change minimal and false-positive-safe.
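A quick way to see the scope is to test each candidate code point against the character class on its own. This is an illustrative check, not part of the PR; the RANGE_SEPARATOR and accepted names are hypothetical.

```python
import re
import unicodedata

# Same dash set as the PR: hyphen-minus, en-dash (U+2013), em-dash (U+2014).
RANGE_SEPARATOR = re.compile(r"[-–—]")

candidates = ["\u002d", "\u2012", "\u2013", "\u2014", "\u2212"]

# Map each character's Unicode name to whether the class accepts it.
accepted = {
    unicodedata.name(ch): bool(RANGE_SEPARATOR.fullmatch(ch))
    for ch in candidates
}

for name, ok in accepted.items():
    print(f"{name}: {'matched' if ok else 'not matched'}")
```

Only HYPHEN-MINUS, EN DASH, and EM DASH are accepted; FIGURE DASH and MINUS SIGN fall outside the class.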

Test coverage

Added two cases to test_find_citations in tests/test_FindTest.py, alongside the existing hyphen pin-cite tests, covering the en-dash and em-dash range separators. They run across all three tokenizers (Tokenizer, AhocorasickTokenizer, HyperscanTokenizer) via the existing run_test_pairs harness.

  • Before the fix, both cases fail (the pin cite is None and the range leaks into extra).
  • After the fix, both pass with the full range captured (347–348 / 347—348).

No behavior change for existing inputs

  • ASCII-hyphen pin ranges are unchanged (verified by the existing tests, which still pass).
  • Other Unicode dashes are not matched, so no new inputs become pin cites unexpectedly.
  • The full existing test suite (python -m unittest discover -s tests -p 'test_*.py') passes with no regressions.

This is purely additive: it recovers pin cites that were previously dropped, without altering any input eyecite already handled.
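The additive claim can be spot-checked by building the same simplified pattern twice, once with the hyphen-only separator and once with the widened class, and comparing results. This is a sketch under the same assumptions as the description's quoted branches: build_pin_cite, extract, and the sample inputs are hypothetical, and it does not replace the project's own test suite.

```python
import re


def build_pin_cite(separator):
    """Build a simplified pin-cite pattern around the given range separator."""
    token = (
        rf"(?:\d+:\d+(?:{separator}\d+(?::\d+)?)?"
        rf"|[*]?\d+(?:{separator}\d+)?)"
    )
    return re.compile(
        rf",? ?(?:at )?(?P<pin_cite>{token})(?=[,.;)\]]| ?[(\[]|$)"
    )


OLD = build_pin_cite(r"-")      # hyphen-only separator
NEW = build_pin_cite(r"[-–—]")  # widened separator class


def extract(pattern, text):
    match = pattern.match(text)
    return match.group("pin_cite") if match else None


# Inputs whose result should be identical before and after the change:
# hyphen ranges, a bare page, and the two excluded Unicode dashes.
unchanged_inputs = [
    ", 241-242 (2000)",
    ", at 12:3-14:5.",
    " *3-4;",
    ", 17 (1999)",
    ", 241\u2012242 (2000)",  # figure dash
    ", 241\u2212242 (2000)",  # minus sign
]
differences = [t for t in unchanged_inputs if extract(OLD, t) != extract(NEW, t)]
print(differences)  # expected to be empty

# The only new behavior: en-dash and em-dash ranges are now captured.
recovered = extract(NEW, ", 241\u2013242 (2000)")
print(recovered)
```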

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@CLAassistant

CLAassistant commented Jun 18, 2026

CLA assistant check
All committers have signed the CLA.

@mlissner mlissner requested a review from flooie June 18, 2026 19:37
@mlissner mlissner moved this to Waiting on Feedback in Sprint (Case Law) Jun 18, 2026
Pin-cite page ranges that use a Unicode en-dash (U+2013) or em-dash
(U+2014) instead of an ASCII hyphen-minus were silently dropped. For
example, in

    Harris Trust v. Salomon, 530 U. S. 238, 241–242 (2000)

the "241–242" pin cite was lost: PIN_CITE_TOKEN_REGEX matched only
"-" as a range separator, so the dash range failed the pin-cite
lookahead and fell through into the "extra" field. These longer dashes
are common in typeset opinions and OCR'd briefs.

Widen the two range-separator positions in PIN_CITE_TOKEN_REGEX (the
page-range and page:line-range alternatives) from a literal "-" to the
inline class [-–—], matching the dash set eyecite already inlines in
PLACEHOLDER_CITATIONS. The captured pin cite preserves the source dash
verbatim, consistent with how hyphen ranges are stored.

ASCII-hyphen ranges are unchanged, and other Unicode dashes (figure
dash U+2012, minus sign U+2212) are intentionally not matched, so this
is purely additive with no behavior change for existing inputs.
@pmarreck pmarreck force-pushed the endash-pincite-ranges branch from 3133045 to 111a947 June 18, 2026 20:02
@pmarreck

Contributor Author

FYI resolving conflicts and ensuring tests pass again

@pmarreck

Contributor Author

Updated this branch to current main and resolved the conflict — only CHANGES.md needed it; the en-dash/em-dash regex change and its tests merged cleanly. Since 2.7.7 shipped while this was open, the changelog entry now lives under ## Upcoming. build (3.10/3.11/3.12), lint, and the changelog check are green.

The single red check, PR comment, is the Benchmark workflow failing at its very first step — posting its "in progress" sticky comment — because a fork PR's GITHUB_TOKEN is read-only (Resource not accessible by integration); the benchmark itself never runs from a fork, so it's not a test failure.

Ready for review whenever you have a chance — thanks!
