Recognize en-dash and em-dash as pin-cite range separators by pmarreck · Pull Request #308 · freelawproject/eyecite

pmarreck · 2026-06-18T19:36:01Z

Problem

Pin-cite page ranges that use a Unicode en-dash (–, U+2013) or em-dash (—, U+2014) instead of an ASCII hyphen-minus (-, U+002D) are silently dropped. These longer dashes are common in typeset opinions and in OCR'd briefs, because typesetting systems conventionally render numeric ranges with an en-dash rather than a hyphen.

For example, given the real-world citation:

Harris Trust v. Salomon, 530 U. S. 238, 241–242 (2000)

eyecite currently returns pin_cite=None and the range 241–242 leaks into the extra field instead of being recognized as the pin cite. The same happens for the em-dash variant (44—45). The ASCII-hyphen form (241-242) works correctly today.

Root cause

PIN_CITE_TOKEN_REGEX in eyecite/regexes.py accepts only the ASCII hyphen-minus (-) as a range separator, in both the page-range branch ([*]?\d+(?:-\d+)?) and the page:line branch (\d+:\d+(?:-\d+(?::\d+)?)?). When the separator is an en-/em-dash, the digit-range token matches only the first page, the surrounding PIN_CITE_REGEX lookahead (which requires the pin cite to be followed by ending punctuation or a paren) fails, and the whole pin cite falls through into extra.
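The failure mode described above can be reproduced with a minimal sketch. It reuses the two hyphen-only branches quoted in this section, but the surrounding pattern is an assumption: `HYPHEN_ONLY_PIN_CITE` and `find_pin_cite` are hypothetical names, and the lookahead is only an approximation of what PIN_CITE_REGEX requires, not the actual eyecite source.

```python
import re

# The two branches quoted above (page:line first, then page-range).
# The real PIN_CITE_TOKEN_REGEX in eyecite/regexes.py may contain more.
HYPHEN_ONLY_TOKEN = r"(?:\d+:\d+(?:-\d+(?::\d+)?)?|[*]?\d+(?:-\d+)?)"

# Hypothetical approximation of the lookahead: the pin cite must be followed
# by ending punctuation, an opening paren, or the end of the string.
HYPHEN_ONLY_PIN_CITE = re.compile(
    rf",\s*(?P<pin_cite>{HYPHEN_ONLY_TOKEN})(?=\s*(?:[.,;(]|$))"
)


def find_pin_cite(text_after_page):
    """Return the pin cite at the start of the text, or None if none matches."""
    match = HYPHEN_ONLY_PIN_CITE.match(text_after_page)
    return match.group("pin_cite") if match else None


# Text that follows "530 U. S. 238" in the example citation.
print(repr(find_pin_cite(", 241-242 (2000)")))  # ASCII hyphen
print(repr(find_pin_cite(", 241–242 (2000)")))  # en-dash
```

With the en-dash input, the token can only consume `241`; the next character is the dash, so the lookahead fails and nothing is captured. In this sketch that surfaces as `None`, which mirrors the pin_cite=None result reported above.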

Fix

Introduce a RANGE_SEPARATOR_REGEX character class and use it for both range separators in PIN_CITE_TOKEN_REGEX:

RANGE_SEPARATOR_REGEX = r"[-–—]"

This is the same dash set eyecite already treats as equivalent in PLACEHOLDER_CITATIONS ([_—–-]), so it follows existing precedent in this module. The captured pin cite preserves the source dash verbatim (e.g. 241–242), consistent with how hyphen ranges are already stored — eyecite captures the original text slice rather than normalizing it.
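As a rough illustration of the fix, the sketch below substitutes RANGE_SEPARATOR_REGEX into both range-separator positions. Only the character class and the two branches come from this PR description; `WIDENED_PIN_CITE`, `capture_pin_cite`, and the approximated lookahead are assumptions made for the example.

```python
import re

# From the PR description: hyphen-minus, en-dash, em-dash.
RANGE_SEPARATOR_REGEX = r"[-–—]"

# Simplified stand-in for the two widened branches of PIN_CITE_TOKEN_REGEX.
WIDENED_TOKEN = (
    rf"(?:\d+:\d+(?:{RANGE_SEPARATOR_REGEX}\d+(?::\d+)?)?"
    rf"|[*]?\d+(?:{RANGE_SEPARATOR_REGEX}\d+)?)"
)

# Hypothetical approximation of the surrounding lookahead.
WIDENED_PIN_CITE = re.compile(
    rf",\s*(?P<pin_cite>{WIDENED_TOKEN})(?=\s*(?:[.,;(]|$))"
)


def capture_pin_cite(text_after_page):
    """Return the matched pin cite exactly as written in the source text."""
    match = WIDENED_PIN_CITE.match(text_after_page)
    return match.group("pin_cite") if match else None


samples = [
    ", 241-242 (2000)",  # ASCII hyphen, already handled
    ", 241–242 (2000)",  # en-dash
    ", 44—45 (2000)",    # em-dash
    ", 12:3–12:9.",      # page:line range with an en-dash
]
for sample in samples:
    print(repr(capture_pin_cite(sample)))
```

Because the function returns the matched slice, the dash in the result is whichever dash appeared in the input; no normalization to a hyphen takes place.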

Scope

Only the en-dash (U+2013) and em-dash (U+2014) are added, matching the dashes that genuinely appear as range separators in legal text (and the set eyecite already recognizes elsewhere). Other Unicode dashes — figure dash (U+2012) and the minus sign (U+2212) — are intentionally not matched; they do not occur as pin-cite range separators in practice, and leaving them out keeps the change minimal and false-positive-safe.
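The scope can be checked directly against the character class. This sketch assumes nothing about eyecite beyond the `[-–—]` class given above; `is_range_separator` and `CANDIDATES` are hypothetical names introduced for the example.

```python
import re
import unicodedata

# From the PR description: hyphen-minus, en-dash, em-dash.
RANGE_SEPARATOR_REGEX = r"[-–—]"

# Dashes discussed in this section, written as escapes to avoid ambiguity.
CANDIDATES = [
    "\u002d",  # hyphen-minus
    "\u2013",  # en-dash
    "\u2014",  # em-dash
    "\u2012",  # figure dash (intentionally excluded)
    "\u2212",  # minus sign (intentionally excluded)
]


def is_range_separator(char):
    """True if the single character belongs to the separator class."""
    return re.fullmatch(RANGE_SEPARATOR_REGEX, char) is not None


for char in CANDIDATES:
    name = unicodedata.name(char)
    print(f"U+{ord(char):04X} {name:<14} matched={is_range_separator(char)}")
```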

Test coverage

Added two cases to test_find_citations in tests/test_FindTest.py, alongside the existing hyphen pin-cite tests, covering the en-dash and em-dash range separators. They run across all three tokenizers (Tokenizer, AhocorasickTokenizer, HyperscanTokenizer) via the existing run_test_pairs harness.

Before the fix, both cases fail (the pin cite is None and the range leaks into extra).
After the fix, both pass with the full range captured (347–348 / 347—348).
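The before/after behavior can be sketched as a standalone unittest. This is not the code added to tests/test_FindTest.py and does not use the run_test_pairs harness; it exercises a simplified stand-in regex, and `build_pin_cite_regex`, `DashRangeTest`, and the input strings are hypothetical.

```python
import re
import unittest


def build_pin_cite_regex(separator):
    """Build a simplified pin-cite pattern around the given range separator."""
    token = (
        rf"(?:\d+:\d+(?:{separator}\d+(?::\d+)?)?"
        rf"|[*]?\d+(?:{separator}\d+)?)"
    )
    # Approximated lookahead: ending punctuation, open paren, or end of text.
    return re.compile(rf",\s*(?P<pin_cite>{token})(?=\s*(?:[.,;(]|$))")


BEFORE_FIX = build_pin_cite_regex(r"-")      # hyphen only
AFTER_FIX = build_pin_cite_regex(r"[-–—]")   # widened class from this PR


def pin_cite(regex, text):
    match = regex.match(text)
    return match.group("pin_cite") if match else None


class DashRangeTest(unittest.TestCase):
    # (text following the page number, expected pin cite after the fix)
    CASES = [
        (", 347–348.", "347–348"),  # en-dash
        (", 347—348.", "347—348"),  # em-dash
    ]

    def test_before_fix_drops_range(self):
        for text, _ in self.CASES:
            with self.subTest(text=text):
                self.assertIsNone(pin_cite(BEFORE_FIX, text))

    def test_after_fix_captures_range(self):
        for text, expected in self.CASES:
            with self.subTest(text=text):
                self.assertEqual(pin_cite(AFTER_FIX, text), expected)
```

Saved to a file matching `test_*.py`, the class is picked up by the same `python -m unittest discover` invocation used for the real suite.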

No behavior change for existing inputs

ASCII-hyphen pin ranges are unchanged (verified by the existing tests, which still pass).
Other Unicode dashes are not matched, so no new inputs become pin cites unexpectedly.
The full existing test suite (python -m unittest discover -s tests -p 'test_*.py') passes with no regressions.

This is purely additive: it recovers pin cites that were previously dropped, without altering any input eyecite already handled.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

CLAassistant · 2026-06-18T19:36:09Z

All committers have signed the CLA.

Pin-cite page ranges that use a Unicode en-dash (U+2013) or em-dash (U+2014) instead of an ASCII hyphen-minus were silently dropped. For example, in Harris Trust v. Salomon, 530 U. S. 238, 241–242 (2000) the "241–242" pin cite was lost: PIN_CITE_TOKEN_REGEX matched only "-" as a range separator, so the dash range failed the pin-cite lookahead and fell through into the "extra" field. These longer dashes are common in typeset opinions and OCR'd briefs.

Widen the two range-separator positions in PIN_CITE_TOKEN_REGEX (the page-range and page:line-range alternatives) from a literal "-" to the inline class [-–—], matching the dash set eyecite already inlines in PLACEHOLDER_CITATIONS. The captured pin cite preserves the source dash verbatim, consistent with how hyphen ranges are stored.

ASCII-hyphen ranges are unchanged, and other Unicode dashes (figure dash U+2012, minus sign U+2212) are intentionally not matched, so this is purely additive with no behavior change for existing inputs.

pmarreck · 2026-06-27T17:22:15Z

FYI resolving conflicts and ensuring tests pass again

pmarreck · 2026-06-27T18:17:02Z

Updated this branch to current main and resolved the conflict. Only CHANGES.md needed it; the en-dash/em-dash regex change and its tests merged cleanly. Since 2.7.7 shipped while this was open, the changelog entry now lives under ## Upcoming. The build (3.10/3.11/3.12), lint, and changelog checks are green.

The single red check, PR comment, is the Benchmark workflow failing at its very first step — posting its "in progress" sticky comment — because a fork PR's GITHUB_TOKEN is read-only (Resource not accessible by integration); the benchmark itself never runs from a fork, so it's not a test failure.

Ready for review whenever you have a chance — thanks!

claude Bot reviewed Jun 18, 2026

mlissner assigned flooie Jun 18, 2026

mlissner requested a review from flooie June 18, 2026 19:37

mlissner added this to Sprint (Case Law) Jun 18, 2026

mlissner moved this to Waiting on Feedback in Sprint (Case Law) Jun 18, 2026

pmarreck force-pushed the endash-pincite-ranges branch from 3133045 to 111a947 June 18, 2026 20:02

Merge upstream main into endash-pincite-ranges (resolve pin-cite range-separator conflict)

8518f89

pmarreck wants to merge 2 commits into freelawproject:main from pmarreck:endash-pincite-ranges