A sandbox for synthetic data generation and evaluation.
It keeps forks of syntheval and synthcity as editable submodules to make it easy to test new features and bug fixes in those libraries. It also contains some early versions of apps, notebooks, and scripts for testing out different synthetic data generation and evaluation techniques.
Clone the repo, initialize submodules, and install the main environment with uv:
git clone https://github.com/childmindresearch/synthdata.git
cd synthdata
git submodule update --init --recursive
uv sync
uv sync installs the newer synthcity and syntheval workflow by default. Install optional extras when you need the older experiment tracks:
uv sync --extra ydata
uv sync --extra presidio
- apps/presidio/presidio_streamlit.py: Presidio's Streamlit app, modified for offline use. For the full version of the anonymizer, see anonymize-pii. See PRESIDIO APP GUIDE for details.
- notebooks/ydata-test.py: Testing the ydata-synthetic library for tabular data synthesis. To run using marimo: uv run --extra ydata marimo run notebooks/ydata-test.py
- notebooks/test_hepatitis_data.ipynb: Testing synthcity generators (+TabPFN) and syntheval & synthcity evaluations on the hepatitis dataset.
- notebooks/tabpfn_demo.ipynb: Testing classification and synthetic data generation with TabPFN. Add a TABPFN_TOKEN (and optionally HF_TOKEN) to an .env file at the root of the project to access the TabPFN API (and download HuggingFace models faster).
- scripts/markdown_parser.py: Early, monolithic version of the markdown parser. For the full version, see headhunter.
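The .env setup mentioned for notebooks/tabpfn_demo.ipynb can be sanity-checked before opening the notebook. The following is a minimal sketch using only the Python standard library. It assumes the .env file uses plain KEY=VALUE lines; the helper names parse_env and missing_tokens are hypothetical and not part of this repo, and the notebook itself may load the file differently.

```python
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_tokens(env, required=("TABPFN_TOKEN",)):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Assumes the script is run from the root of the project.
    env_path = Path(".env")
    env = parse_env(env_path.read_text(encoding="utf-8")) if env_path.exists() else {}
    missing = missing_tokens(env)
    if missing:
        print("Missing from .env:", ", ".join(missing))
    else:
        print("TABPFN_TOKEN is set")
        if not env.get("HF_TOKEN"):
            print("HF_TOKEN is not set (optional)")
```

Only TABPFN_TOKEN is treated as required here, since the text above describes HF_TOKEN as optional.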