🚀 Synthetic Data Generator

Synthetic Data Generator (SDG) is a framework focused on quickly generating high-quality structured tabular data. It supports many single-table and multi-table data synthesis algorithms, achieves up to a 120x performance improvement, and supports differential privacy and other methods that enhance the security of synthesized data.

Synthetic data is generated by machines from real data and algorithms. It does not contain sensitive information but can retain the characteristics of the real data. There is no correspondence between synthetic data and real data, and it is not subject to privacy regulations such as GDPR and ADPPA, so in practical applications there is no need to worry about the risk of privacy leakage. High-quality synthetic data can also be used in fields such as open data release, model training and debugging, and system development and testing.

🎉 Features

High performance:
- Supports a wide range of statistical data synthesis algorithms, achieving up to a 120x performance improvement without the need for GPU devices;
- Optimised for big data scenarios, effectively reducing memory consumption;
- Continuously tracks the latest advances in academia and industry, and introduces support for excellent algorithms and models in a timely manner;
- Provides distributed training support for deep learning models with frameworks such as torch.
Privacy enhancements:
- SDG supports differential privacy, anonymization, and other methods to enhance the security of synthetic data.
Easy to extend:
- Supports extension of models, data processing, data connectors, etc. in the form of plug-in packages.
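
To make the differential privacy feature more concrete, here is a minimal sketch of the classic Laplace mechanism applied to a single count. It is a generic textbook illustration and not SDG's implementation; the function names `laplace_noise` and `private_count` and all parameter values are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with epsilon-differential privacy via the Laplace mechanism.

    Hypothetical helper for illustration only; it is not part of sdgx.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Smaller epsilon means a larger noise scale and stronger privacy.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
noisy = private_count(true_count=1000, epsilon=0.5, rng=rng)
print(noisy)
```

The released value is the true count plus noise whose scale is `sensitivity / epsilon`, so lowering `epsilon` trades accuracy for privacy.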

Read the latest API docs for more details.

🔛 Quick Start

Pre-built image

You can use pre-built images to quickly experience the latest features.

docker pull idsteam/sdgx:latest

Local Install (Recommended)

The code of this project is currently changing quickly, so we recommend installing SDG from source.

git clone git@github.com:hitsz-ids/synthetic-data-generator.git
cd synthetic-data-generator
pip install .
# Or install from git
pip install git+https://github.com/hitsz-ids/synthetic-data-generator.git

Install from PyPI

pip install sdgx
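
Whichever installation route you use, a quick check that the package is visible to the current interpreter can save some debugging. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, and `sdgx` is the import name used in the demo in this README.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("sdgx"):
    print("sdgx is importable")
else:
    print("sdgx was not found; check that pip installed into this interpreter")
```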

Quick Demo of Single Table Data Generation

# Import modules
from sdgx.models.single_table.ctgan import CTGAN
from sdgx.utils.io.csv_utils import get_demo_single_table

# Read the demo data and its discrete columns
demo_data, discrete_cols = get_demo_single_table()

The real data looks like this:

       age  workclass  fnlwgt  ... hours-per-week  native-country  class
0       27    Private  177119  ...             44   United-States  <=50K
1       27    Private  216481  ...             40   United-States  <=50K
2       25    Private  256263  ...             40   United-States  <=50K
3       46    Private  147640  ...             40   United-States  <=50K
4       45    Private  172822  ...             76   United-States   >50K
...    ...        ...     ...  ...            ...             ...    ...
32556   43  Local-gov   33331  ...             40   United-States   >50K
32557   44    Private   98466  ...             35   United-States  <=50K
32558   23    Private   45317  ...             40   United-States  <=50K
32559   45  Local-gov  215862  ...             45   United-States   >50K
32560   25    Private  186925  ...             48   United-States  <=50K

[32561 rows x 15 columns]

# Define model
model = CTGAN(epochs=10)
# Model training
model.fit(demo_data, discrete_cols)

# Generate synthetic data
sampled_data = model.generate(1000)

The synthetic data looks like this:

   age         workclass  fnlwgt  ... hours-per-week  native-country  class
0   33           Private  276389  ...             41   United-States   >50K
1   33  Self-emp-not-inc  296948  ...             54   United-States  <=50K
2   67       Without-pay  266913  ...             51        Columbia  <=50K
3   49           Private  423018  ...             41   United-States   >50K
4   22           Private  295325  ...             39   United-States   >50K
5   63           Private  234140  ...             65   United-States  <=50K
6   42           Private  243623  ...             52   United-States  <=50K
7   75           Private  247679  ...             41   United-States  <=50K
8   79           Private  332237  ...             41   United-States   >50K
9   28         State-gov  837932  ...             99   United-States  <=50K
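
One simple way to sanity-check output like this is to compare how often each category appears in a real column and in the matching synthetic column. The sketch below is an illustration only, not an SDG API: it computes the total variation distance between two categorical columns using the standard library, and the toy lists are hypothetical stand-ins for a column such as `workclass` taken from `demo_data` and `sampled_data`.

```python
from collections import Counter


def category_shares(values):
    """Return the relative frequency of each category in a column."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column must not be empty")
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """Half the L1 distance between two category distributions.

    0.0 means identical category shares, 1.0 means no shared categories.
    """
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories)


# Hypothetical toy columns, not taken from the demo dataset
real_workclass = ["Private", "Private", "Private", "Local-gov"]
synthetic_workclass = ["Private", "Private", "State-gov", "Local-gov"]

print(total_variation_distance(real_workclass, synthetic_workclass))  # prints 0.25
```

A value close to 0 suggests the synthetic column reproduces the real category mix; it says nothing about relationships between columns, which need separate checks.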

🤝 Join Community

The SDG project was initiated by the Institute of Data Security, Harbin Institute of Technology. If you are interested in our project, you are welcome to join our community. We welcome organizations, teams, and individuals who share our commitment to data protection and security through open source:

Read CONTRIBUTING before drafting a pull request.
Submit an issue, browse the Good First Issue list, or submit a pull request.

Contributors

MoooCat 💻
Zhongsheng Ji 💻
YUAN KAIWEN 💻

👩‍🎓 Related Work

Research Paper

CTGAN: Modeling Tabular Data using Conditional GAN
TVAE: Modeling Tabular Data using Conditional GAN
table-GAN: Data Synthesis based on Generative Adversarial Networks
CTAB-GAN: CTAB-GAN: Effective Table Data Synthesizing
OCT-GAN: OCT-GAN: Neural ODE-based Conditional Tabular GANs

Dataset

📄 License

The SDG open source project is released under the Apache-2.0 license. Please refer to the LICENSE file.
