
Detect missing translation files and report progress in validate_translation.py #1266

Open
abdeltaehass wants to merge 1 commit into huggingface:main from abdeltaehass:improve-translation-validation


@abdeltaehass


What this does

Improves utils/validate_translation.py so it does more than diff section names between a translation's _toctree.yml and the English source.

  • Verifies files exist on disk. Every section referenced in a language's _toctree.yml is now checked for a matching .mdx file. A section that is listed but has no file breaks the doc-builder build (as the README warns), but the old script never caught it. The script now exits non-zero in that case, so it can be used as a CI gate.
  • Handles local_fw sections. The previous code read section["local"] directly, silently ignoring framework-specific entries such as chapter8/4_tf. Both PyTorch and TensorFlow variants are now counted.
  • Reports progress. Output now shows translated/total sections and a completion percentage instead of dumping the full list of completed sections.
  • Fixes the "sesions" typo in the output.
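The first two points can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `section_names` and `missing_mdx_files` are hypothetical, the toctree is assumed to be already parsed into a list of chapter dicts, and `local_fw` is assumed to be a mapping from framework to section path.

```python
from pathlib import Path


def section_names(toc):
    """Return every section path named in a parsed _toctree.yml."""
    names = []
    for chapter in toc:
        for section in chapter.get("sections", []):
            local_fw = section.get("local_fw")
            if local_fw:
                # Assumed shape: {"pt": "chapter8/4", "tf": "chapter8/4_tf"}
                names.extend(local_fw.values())
            else:
                names.append(section["local"])
    return names


def missing_mdx_files(toc, lang_dir):
    """List toctree sections that have no matching .mdx file under lang_dir."""
    root = Path(lang_dir)
    return [
        f"{name}.mdx"
        for name in section_names(toc)
        if not (root / f"{name}.mdx").is_file()
    ]
```

Reading `local_fw` before falling back to `local` is what lets a framework-specific entry such as chapter8/4_tf be counted instead of silently skipped.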

Why

While checking translation status, I noticed the script can report a translation as having "no missing sections" even when an entry in _toctree.yml points to a file that doesn't exist, which then fails the build. The on-disk check closes that gap, and the progress summary makes it easier to see how far a translation has come.

Before / After

Before:

Completed sesions:

chapter0/1
chapter1/1
...

Missing sections:

chapter1/2
...

After:

📊 'fr' translation progress: 79/104 sections (76.0%)

📝 Sections not yet translated:

  - chapter12/5
  - ...
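For reference, the summary line above could be produced along these lines. The function name `progress_line` is hypothetical and the zero-total guard is an assumption; only the output format is taken from the example.

```python
def progress_line(lang, translated, total):
    """Format the progress summary for one language."""
    # Guard against an empty English toctree (assumed edge case).
    pct = 100 * translated / total if total else 0.0
    return (
        f"📊 '{lang}' translation progress: "
        f"{translated}/{total} sections ({pct:.1f}%)"
    )


print(progress_line("fr", 79, 104))
```

With 79 of 104 sections translated, 79/104 ≈ 0.7596, which rounds to the 76.0% shown above.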

When a section is listed in _toctree.yml but its file is missing:

❌ Sections listed in _toctree.yml but missing their .mdx file
   (these break the course build):

  - chapter1/2.mdx
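One way the report could double as a CI gate is to derive the exit status from the list of missing files. This is a sketch under assumptions: `report_missing_files` is a hypothetical name, and the actual script may structure this differently.

```python
def report_missing_files(missing):
    """Print the missing-file report and return the process exit status."""
    if not missing:
        return 0
    print("❌ Sections listed in _toctree.yml but missing their .mdx file")
    print("   (these break the course build):\n")
    for path in missing:
        print(f"  - {path}")
    return 1
```

A caller would then end with something like `sys.exit(report_missing_files(missing))`, so a listed-but-absent section fails the job while a clean run exits with status 0.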

Testing

  • Ran against existing translations (ar, fr, en): progress percentages and missing-section lists are correct.
  • Verified the missing-file path flags a listed-but-absent section and exits with status 1.
  • python -m black --check utils/validate_translation.py passes.

cc @lewtun @stevhliu

validate_translation.py only diffed section names from _toctree.yml, so a
section that was listed in the table of contents but had no .mdx file on
disk went unnoticed even though it breaks the doc-builder build. It also
read section["local"] directly, skipping local_fw (PyTorch/TensorFlow)
sections.

- check that every section in a language's _toctree.yml has a matching
  .mdx file, and exit non-zero when one is missing so it can gate CI
- handle local_fw sections so framework-specific files are counted
- print a progress summary (translated/total sections and percentage)
- fix "sesions" typo in the output
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
