Synthetic Data SDK ✨

SDK Documentation | Platform Documentation | Usage Examples

The official SDK of MOSTLY AI, a Python toolkit for high-fidelity, privacy-safe Synthetic Data.

Client mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
Local mode trains and generates synthetic data locally on your own compute resources.
Generators that were trained locally can be easily imported to a platform for further sharing.
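As a rough illustration of switching between the two modes, the sketch below picks the constructor arguments from environment variables. The helper `sdk_init_kwargs` and the variable names `MOSTLY_API_KEY` and `MOSTLY_BASE_URL` are assumptions made for this example, not documented SDK behavior; only the `local`, `base_url` and `api_key` arguments come from the Quick Start further down.

```python
import os


def sdk_init_kwargs(env=None):
    """Return keyword arguments for MostlyAI(...) based on the environment.

    Hypothetical helper: the environment variable names are assumptions
    chosen for this sketch.
    """
    env = os.environ if env is None else env
    api_key = env.get("MOSTLY_API_KEY")
    if api_key:
        # client mode: train and generate on a remote platform
        return {
            "base_url": env.get("MOSTLY_BASE_URL", "https://app.mostly.ai"),
            "api_key": api_key,
        }
    # local mode: train and generate on your own compute resources
    return {"local": True}


# Usage (requires the SDK to be installed):
# from mostlyai.sdk import MostlyAI
# mostly = MostlyAI(**sdk_init_kwargs())
```

Keeping the decision in one small function means the rest of a script can stay identical in both modes.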

Overview

The SDK allows you to programmatically create, browse and manage 3 key resources:

Generators - Train a synthetic data generator on your existing tabular or language data assets
Synthetic Datasets - Use a generator to create any number of synthetic samples to suit your needs
Connectors - Connect to any data source within your organization, for reading and writing data

Intent	Primitive	Documentation
Train a Generator on tabular or language data	`g = mostly.train(config)`	see mostly.train
Generate any number of synthetic data records	`sd = mostly.generate(g, config)`	see mostly.generate
Live probe the generator on demand	`df = mostly.probe(g, config)`	see mostly.probe
Connect to any data source within your org	`c = mostly.connect(config)`	see mostly.connect
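The primitives in the table are typically chained. The following is a minimal sketch of that chain, assuming `mostly` is an already initialized client and using only the calls that also appear in the Quick Start (`train`, `probe`, `generate`, `data`). The wrapper function `train_probe_generate` and its default sizes are hypothetical and exist only for illustration.

```python
def train_probe_generate(mostly, config, probe_size=100, dataset_size=100_000):
    """Train a generator, preview a few samples, then create a full dataset.

    `mostly` is expected to be an initialized MostlyAI client; this wrapper
    is an illustrative sketch, not part of the SDK.
    """
    # train a generator from the given configuration
    g = mostly.train(config=config)
    # probe for a small preview of synthetic samples
    df_preview = mostly.probe(g, size=probe_size)
    # create a synthetic dataset entity and fetch its data
    sd = mostly.generate(g, size=dataset_size)
    return g, df_preview, sd.data()
```

Previewing with a small probe before generating a large dataset can help catch configuration mistakes early.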

Installation

Client mode only

pip install -U mostlyai

Client + Local mode

pip install -U 'mostlyai[local]'       # for CPU
#pip install -U 'mostlyai[local-gpu]'  # for GPU

NOTE: installing mostlyai[local] on Linux requires --extra-index-url https://download.pytorch.org/whl/cpu to be specified.

Optional Connectors

Add any of the following extras for additional data connector support: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.

E.g.

pip install -U 'mostlyai[local, databricks, snowflake]'

Quick Start

For client mode, initialize with base_url and api_key obtained from your account settings page. For local mode, initialize the client simply with local=True.

import pandas as pd
from mostlyai.sdk import MostlyAI

# load original data
repo_url = 'https://github.com/mostly-ai/public-demo-data'
df_original = pd.read_csv(f'{repo_url}/raw/dev/census/census.csv.gz')

# initialize the SDK in local or client mode
mostly = MostlyAI(local=True)
# mostly = MostlyAI(base_url='https://app.mostly.ai', api_key='YOUR_API_KEY')

# train a synthetic data generator
g = mostly.train(config={
        'name': 'US Census Income',          # name of the generator
        'tables': [{                         # provide list of table(s)
            'name': 'census',                # name of the table
            'data': df_original,             # the original data as pd.DataFrame
            'tabular_model_configuration': { # tabular model configuration (optional)
                'max_training_time': 1,      # - limit training time (in minutes)
                # model, max_epochs,..       # further model configurations (optional)
                'differential_privacy': {    # differential privacy configuration (optional)
                    'max_epsilon': 5.0,      # - max epsilon value, used as stopping criterion
                    'delta': 1e-5,           # - delta value
                }
            },
            # columns, keys, compute,..      # further table configurations (optional)
        }]
    },
    start=True,                              # start training immediately (default: True)
    wait=True,                               # wait for completion (default: True)
)
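When the same kind of configuration is needed for several tables, it can help to assemble the dictionary in a small function. The sketch below reuses only the keys shown in the example above; the function `build_train_config` and its parameter defaults are assumptions for illustration, and differential privacy is left out unless `max_epsilon` is given.

```python
def build_train_config(name, table_name, data,
                       max_training_time=1, max_epsilon=None, delta=1e-5):
    """Assemble a single-table training config (hypothetical helper)."""
    # tabular model configuration, with training time limited in minutes
    model_cfg = {"max_training_time": max_training_time}
    if max_epsilon is not None:
        # add differential privacy only when an epsilon budget is requested
        model_cfg["differential_privacy"] = {
            "max_epsilon": max_epsilon,
            "delta": delta,
        }
    return {
        "name": name,
        "tables": [{
            "name": table_name,
            "data": data,
            "tabular_model_configuration": model_cfg,
        }],
    }


# Usage (with df_original and mostly as defined above):
# config = build_train_config("US Census Income", "census", df_original, max_epsilon=5.0)
# g = mostly.train(config=config)
```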

Once the generator has been trained, you can use it to generate synthetic data samples, either via probing:

# probe for some representative synthetic samples
df_samples = mostly.probe(g, size=100)
df_samples

or by creating a synthetic dataset entity for larger data volumes:

# generate a large representative synthetic dataset
sd = mostly.generate(g, size=100_000)
df_synthetic = sd.data()
df_synthetic

or by conditionally probing / generating synthetic data:

# create 100 seed records of 24-year-old Mexicans
df_seed = pd.DataFrame({
    'age': [24] * 100,
    'native_country': ['Mexico'] * 100,
})
# conditionally probe, based on provided seed
df_samples = mostly.probe(g, seed=df_seed)
df_samples
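The seed in the example above repeats fixed values across all rows. A small helper can build such columns for any set of fixed attributes. This is a sketch only: `make_seed_columns` is a hypothetical name, and it returns a plain dict of lists that can be handed to `pd.DataFrame` as in the example above.

```python
def make_seed_columns(n, **fixed_values):
    """Build n seed rows in which every given column holds one fixed value.

    Hypothetical helper; returns a dict of equally long lists.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of seed records")
    # repeat each fixed value n times, one list per column
    return {column: [value] * n for column, value in fixed_values.items()}


# Usage (with pd, mostly and g as defined above):
# df_seed = pd.DataFrame(make_seed_columns(100, age=24, native_country="Mexico"))
# df_samples = mostly.probe(g, seed=df_seed)
```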

Key Features

Broad Data Support
- Mixed-type data (categorical, numerical, geospatial, text, etc.)
- Single-table, multi-table, and time-series
Multiple Model Types
- TabularARGN for SOTA tabular performance
- Fine-tune HuggingFace-based language models
- Efficient LSTM for text synthesis from scratch
Advanced Training Options
- GPU/CPU support
- Differential Privacy
- Progress Monitoring
Automated Quality Assurance
- Quality metrics for fidelity and privacy
- In-depth HTML reports for visual analysis
Flexible Sampling
- Up-sample to any data volume
- Conditional generation by any columns
- Re-balance underrepresented segments
- Context-aware data imputation
- Statistical fairness controls
- Rule-adherence via temperature
Seamless Integration
- Connect to external data sources (DBs, cloud storages)
- Fully permissive open-source license

Citation

Please consider citing our project if you find it useful:

@software{mostlyai,
    author = {{MOSTLY AI}},
    title = {{MOSTLY AI SDK}},
    url = {https://github.com/mostly-ai/mostlyai},
    year = {2025}
}
