childmindresearch/synthdata

SynthData

A sandbox for synthetic data generation and evaluation.

It keeps forks of syntheval and synthcity as editable submodules to make it easy to test new features and bug fixes in those libraries. It also contains some early versions of apps, notebooks, and scripts for testing out different synthetic data generation and evaluation techniques.

Quick Start

Clone the repo, initialize submodules, and install the main environment with uv:

git clone https://github.com/childmindresearch/synthdata.git
cd synthdata
git submodule update --init --recursive
uv sync
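
Before running anything that depends on the forks, it can help to confirm that the submodules were actually initialized. The following is a minimal sketch and not part of the repo: it parses the output of `git submodule status`, where a leading `-` marks an uninitialized submodule and a leading `+` marks a checkout that differs from the recorded commit. The function names are hypothetical, and paths containing spaces are not handled.

```python
import subprocess


def parse_submodule_status(output: str) -> dict[str, str]:
    """Map each submodule path to a state derived from `git submodule status`."""
    states = {"-": "uninitialized", "+": "modified", "U": "conflict"}
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # The first character is a status flag (a space when up to date),
        # followed by the commit hash and then the submodule path.
        flag, fields = line[0], line[1:].split()
        if len(fields) < 2:
            continue
        result[fields[1]] = states.get(flag, "ok")
    return result


def check_submodules(repo_dir: str = ".") -> dict[str, str]:
    """Run `git submodule status` in repo_dir and parse its output."""
    completed = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_submodule_status(completed.stdout)
```

Calling `check_submodules()` from the repo root returns a dictionary keyed by submodule path. Any entry reported as `uninitialized` suggests that `git submodule update --init --recursive` still needs to be run.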

uv sync installs the newer synthcity and syntheval workflow by default. Install optional extras when you need the older experiment tracks:

uv sync --extra ydata
uv sync --extra presidio
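
To see which optional extras are present in the current environment, a small check along these lines may be useful. This is a sketch under assumptions: the import names `ydata_synthetic` and `presidio_analyzer` are guesses at what the extras provide, and the authoritative list lives in the project's pyproject.toml. The helper names are hypothetical.

```python
from importlib.util import find_spec

# Assumed top-level import names for each extra; the packages actually
# installed by an extra are defined in pyproject.toml, not here.
EXTRA_MODULES = {
    "ydata": "ydata_synthetic",
    "presidio": "presidio_analyzer",
}


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether its assumed module is importable."""
    return {extra: is_installed(name) for extra, name in EXTRA_MODULES.items()}


if __name__ == "__main__":
    for extra, present in available_extras().items():
        status = "installed" if present else f"missing, try: uv sync --extra {extra}"
        print(f"{extra}: {status}")
```

Saved to a file, this could be run inside the project environment with `uv run python`, for example `uv run python check_extras.py` (a hypothetical filename).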

Apps

Notebooks

  • notebooks/ydata-test.py: Testing ydata-synthetic library for tabular data synthesis. To run using marimo:

    uv run --extra ydata marimo run notebooks/ydata-test.py

  • notebooks/test_hepatitis_data.ipynb: Testing synthcity generators (plus TabPFN) with syntheval and synthcity evaluations on the hepatitis dataset.

  • notebooks/tabpfn_demo.ipynb: Testing classification and synthetic data generation with TabPFN. To access the TabPFN API, add a TABPFN_TOKEN to a .env file at the root of the project. Optionally add an HF_TOKEN as well to download Hugging Face models faster.
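
The .env setup for the TabPFN notebook can be checked with a short script before launching Jupyter. This is a minimal sketch using only the standard library. It assumes the file holds plain KEY=VALUE lines, and the function names are hypothetical. How the notebook itself loads the file is not stated here, and a package such as python-dotenv (`load_dotenv`) would be a common alternative.

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_tokens(path: str = ".env") -> dict[str, bool]:
    """Copy values into os.environ without overriding existing variables,
    then report which of the two tokens are set to a non-empty value."""
    for key, value in read_env_file(path).items():
        os.environ.setdefault(key, value)
    return {name: bool(os.environ.get(name)) for name in ("TABPFN_TOKEN", "HF_TOKEN")}
```

Running `load_tokens()` from the project root reports whether each token is available. A missing HF_TOKEN is fine, since it is only described as optional.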

Scripts
