MolParser toolkit for working with E-SMILES (extended SMILES) in OCSR and Markush workflows. The notation follows the formulation introduced in the MolParser paper.

| Path | Role |
|---|---|
| `utils/` | MolParser utils (normalize E-SMILES, substitute abbreviations, convert to CXSMILES, and render structures) |
| `skills/molparser-extended-smiles/` | E-SMILES skills (concise rules and examples for LLM / OCSR agents) |

pip install -r requirements.txt

Run the examples below from the repository root so that `from utils import ...` resolves correctly.
E-SMILES combines a base SMILES with an optional extension:
SMILES<sep>EXTENSION
Common extension records:
- `<a>0:R[1]</a>`: atom-indexed substituent or Markush placeholder
- `<r>0:R[1]</r>`: ring-indexed substituent (regio-uncertain attachment)
- `<c>9:B</c>`: abstract-ring or superatom placeholder
- `<a>0:<dum></a>`: explicit dummy attachment point
- `|Sg:n|`: structural repeating unit (SRU) marker
- `?n`: group-level multiplicity suffix

Full specification: skills/molparser-extended-smiles/extended-smiles-spec.md
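
To make the `SMILES<sep>EXTENSION` layout concrete, here is a minimal sketch that splits an E-SMILES string into its base SMILES and its tagged records. The helper name `split_esmiles` is hypothetical and is not part of `utils`. It only recognizes the `<a>`, `<r>`, and `<c>` records listed above and ignores everything else (for example `|Sg:n|` markers), so treat it as an illustration of the format rather than a parser for the full specification.

```python
import re

# Matches <a>idx:label</a>, <r>idx:label</r>, and <c>idx:label</c>.
# The label is matched lazily so that payloads such as <dum> stay intact.
RECORD_RE = re.compile(r"<(a|r|c)>(\d+):(.*?)</\1>")


def split_esmiles(esmiles):
    """Return (base_smiles, [(kind, index, label), ...]) for an E-SMILES string."""
    base, _sep, extension = esmiles.partition("<sep>")
    records = [
        (kind, int(index), label)
        for kind, index, label in RECORD_RE.findall(extension)
    ]
    return base, records


print(split_esmiles("*c1ccccc1<sep><a>0:CF3</a><r>1:R[1]?1-3</r>"))
# ('*c1ccccc1', [('a', 0, 'CF3'), ('r', 1, 'R[1]?1-3')])
```

A plain SMILES without `<sep>` comes back with an empty record list, which mirrors the "optional extension" wording above.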
Convert E-SMILES to SMILES with abbreviation substitution, and convert E-SMILES to CXSMILES on a best-effort basis.
from utils import postprocess_caption
raw = "*c1ccccc1<sep><a>0:CF3</a>"
result = postprocess_caption(raw)
# caption: original input string
# smi: normalized RDKit SMILES after substituting known abbreviations
# esmi: normalized E-SMILES after substitution and index repair
# cxsmiles: CXSMILES generated from the normalized E-SMILES
# markush: True if unresolved Markush labels remain
# sru: True if a structural repeating unit marker was detected
# groups: unresolved E-SMILES extension records kept after normalization
for key in ("caption", "smi", "esmi", "cxsmiles", "markush", "sru", "groups"):
    print(f"{key}: {result[key]}")

Expected output:
caption: *c1ccccc1<sep><a>0:CF3</a>
smi: FC(F)(F)c1ccccc1
esmi: FC(F)(F)c1ccccc1<sep>
cxsmiles: FC(F)(F)c1ccccc1
markush: False
sru: False
groups:
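
The flags in the result can be used to route each structure downstream. The sketch below is one possible way to do that. It works on a plain dictionary with the keys documented above, so it runs without the toolkit. The helper `classify_result` and its three category names are hypothetical, and the empty string used for `groups` is an assumption based on the printed output.

```python
def classify_result(result):
    """Label a postprocess_caption-style dict as 'sru', 'unresolved', or 'resolved'."""
    if result.get("sru"):
        # A structural repeating unit marker was detected.
        return "sru"
    if result.get("markush") or result.get("groups"):
        # Markush labels or other extension records survived normalization.
        return "unresolved"
    return "resolved"


# Hand-written stand-in for the result shown above (groups assumed empty).
example = {
    "caption": "*c1ccccc1<sep><a>0:CF3</a>",
    "smi": "FC(F)(F)c1ccccc1",
    "esmi": "FC(F)(F)c1ccccc1<sep>",
    "cxsmiles": "FC(F)(F)c1ccccc1",
    "markush": False,
    "sru": False,
    "groups": "",
}
print(classify_result(example))  # resolved
```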
Substitute Markush labels with a definition dictionary. Definition keys can use
either R1 or R[1]; values can be known abbreviations or SMILES fragments.
When a fragment contains *, that atom is treated as the attachment point.
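
Because keys may be written as `R1` or `R[1]`, it can help to bring a definition dictionary into one spelling before passing it around. The following is a minimal sketch of that idea. The helpers `canonical_label` and `canonical_definitions` are hypothetical and only handle labels of the form `R` plus digits. How the toolkit itself matches keys is not shown here.

```python
import re

# Only plain R-labels such as R1 or R12 are rewritten (an assumption for this sketch).
_PLAIN_R_LABEL = re.compile(r"R(\d+)")


def canonical_label(label):
    """Rewrite 'R1' as 'R[1]'; leave 'R[1]' and any other string unchanged."""
    match = _PLAIN_R_LABEL.fullmatch(label)
    if match is None:
        return label
    return f"R[{match.group(1)}]"


def canonical_definitions(definitions):
    """Return a copy of the definition dictionary with bracketed keys."""
    return {canonical_label(key): value for key, value in definitions.items()}


print(canonical_definitions({"R1": "Me", "R[2]": "*CCO"}))
# {'R[1]': 'Me', 'R[2]': '*CCO'}
```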
from utils import substitute_markush
result = substitute_markush(
"*c1ccccc1<sep><a>0:R[1]</a>",
{"R1": "Me", "R2": "*CCO"},
)
print(result)
# Cc1ccccc1

Ring-indexed Markush records expand regio-uncertain attachments into a list of SMILES strings. Multiplicity suffixes such as `?3`, `?1-3`, and `?n` copy the group over the possible ring sites; `?n` reads the copy count from the definition dictionary.
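
The suffix forms can be read as a set of allowed copy counts. The sketch below shows one interpretation, with assumptions stated: `?3` is taken to mean exactly three copies, `?1-3` one to three copies, and a label without a suffix one copy. For `?n` it returns `None`, because the count has to come from the caller's definitions. The helper `parse_multiplicity` is hypothetical and not part of `utils`.

```python
import re


def parse_multiplicity(label):
    """Split a record label such as 'R[1]?1-3' into (name, allowed_copy_counts)."""
    name, mark, suffix = label.partition("?")
    if not mark:
        return name, [1]  # assumption: no suffix means a single copy
    if suffix == "n":
        return name, None  # count is supplied elsewhere, not encoded in the label
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", suffix)
    if match is None:
        raise ValueError(f"unrecognized multiplicity suffix: {suffix!r}")
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return name, list(range(low, high + 1))


print(parse_multiplicity("R[1]?1-3"))
# ('R[1]', [1, 2, 3])
```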
result = substitute_markush(
"c1ccccc1<sep><r>0:R[1]?1-3</r>",
{"R1": "Me"},
)
print(result)
# ['Cc1ccccc1', 'Cc1cccc(C)c1', ...]

Render the E-SMILES as SVG and save it locally:
from pathlib import Path
from utils import draw
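# The E-SMILES is split across lines for readability; the pieces concatenate to one string.
esmiles = (
    "*C(O)c1cc(C(=O)N(*)*)cc(-c2*ccc*2)c1<sep>"
    "<a>0:CF3</a><a>9:R[3]</a><a>10:R[2]</a><a>14:X</a><a>18:Y</a>"
    "<r>1:R[1]?1-3</r>"
)
svg_text = draw(esmiles, output_format="svg")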
svg_path = Path("molecule.svg")
svg_path.write_text(svg_text, encoding="utf-8")

molecule.svg is a local render artifact.
To obtain a PNG from that SVG (requires cairosvg from requirements.txt):
import cairosvg
png_path = Path("molecule.png")
cairosvg.svg2png(url=str(svg_path), write_to=str(png_path))

Load these files for the agent:
- skills/molparser-extended-smiles/SKILL.md
- skills/molparser-extended-smiles/extended-smiles-spec.md
- skills/molparser-extended-smiles/figure-index.md

1. Base SMILES
2. E-SMILES in SMILES<sep>EXTENSION format
3. Markush status
4. Unsupported or ambiguous chemistry
python skills/molparser-extended-smiles/validate_esmiles.py "<your_esmiles>"

Then normalize and render with postprocess_caption and draw.
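
E-SMILES strings contain characters such as `<`, `>`, `|`, and `*` that a shell may interpret, so when validating many strings it can be safer to call the script from Python with an argument list. The wrapper below is a sketch under assumptions: `run_validator` is a hypothetical helper, and it simply reports the exit code and captured output. Whether the script signals invalid input through a non-zero exit code is not stated here, so check its behavior before relying on that.

```python
import subprocess
import sys

# Path taken from the command above; run from the repository root.
DEFAULT_SCRIPT = "skills/molparser-extended-smiles/validate_esmiles.py"


def run_validator(esmiles, script=DEFAULT_SCRIPT):
    """Run the validation script on one E-SMILES and return (exit_code, stdout)."""
    completed = subprocess.run(
        [sys.executable, script, esmiles],  # argument list, so no shell quoting is needed
        capture_output=True,
        text=True,
    )
    return completed.returncode, completed.stdout.strip()
```

A call such as `run_validator("*c1ccccc1<sep><a>0:R[1]</a>")` then returns the exit code together with whatever the script printed.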
- Uni-Parser — agent-oriented scientific document parsing with the latest MolParser. Demo
- MolParser — end-to-end molecular recognition. Demo
- MolDetv2 weights — lightweight molecule detector. Demo
@inproceedings{fang2025molparser,
title={Molparser: End-to-end visual recognition of molecule structures in the wild},
author={Fang, Xi and Wang, Jiankun and Cai, Xiaochen and Chen, Shangqian and Yang, Shuwen and Tao, Haoyi and Wang, Nan and Yao, Lin and Zhang, Linfeng and Ke, Guolin},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={24528--24538},
year={2025}
}

@article{fang2025uniparser,
title={Uni-Parser Technical Report},
author={Fang, Xi and Tao, Haoyi and Yang, Shuwen and Zhong, Suyang and Lu, Haocheng and Lyu, Han and Huang, Chaozheng and Li, Xinyu and Zhang, Linfeng and Ke, Guolin},
journal={arXiv preprint arXiv:2512.15098},
year={2025}
}