GitHub - nguyenkhacbaoanh/dataprep: DataPrep: Data Preparation in Python

Dataprep lets you prepare your data using a single library with a few lines of code.

Currently, you can use dataprep to:

Collect data from common data sources (through dataprep.connector)
Do your exploratory data analysis (through dataprep.eda)
...more modules are coming

Releases

Dataprep is released on PyPI and conda-forge.

Installation

pip install -U dataprep

Examples & Usages

The following examples can give you an impression of what dataprep can do:

EDA

Exploratory data analysis involves a few recurring tasks, such as taking a quick look at the distribution of each column or understanding the correlations between columns.

The EDA module groups these tasks into functions, so you can finish each one with a single function call.

Want to understand the distributions for each DataFrame column? Use plot.

Want to understand the correlation between columns? Use plot_correlation.

Or, if you want to understand the impact of the missing values for each column, use plot_missing.

You can drill down to get more information by giving plot, plot_correlation, or plot_missing a column name. With plot, the result depends on whether the column is numerical or categorical.
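As a rough sketch of how these calls fit together, the helper below calls the three functions either on a whole DataFrame or, when a column name is given, on that column. `run_eda` is a hypothetical name used for illustration and is not part of dataprep; the sketch assumes dataprep is installed and that `df` is a pandas DataFrame.

```python
def run_eda(df, column=None):
    """Run plot, plot_correlation and plot_missing on df.

    run_eda is a hypothetical helper for illustration, not a dataprep function.
    """
    # Imported inside the function so dataprep is only needed when it is called.
    from dataprep.eda import plot, plot_correlation, plot_missing

    if column is None:
        # Overview of every column in the DataFrame.
        return plot(df), plot_correlation(df), plot_missing(df)
    # Drill down into a single column by passing its name.
    return plot(df, column), plot_correlation(df, column), plot_missing(df, column)
```

For example, `run_eda(df)` gives the overview and `run_eda(df, "age")` drills into a column named `age` (a made-up column name).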

Don't forget to check out the examples folder for detailed demonstrations!

Connector

Connector provides a simple way to collect data from different websites, offering several benefits:

A unified API: you can fetch data using one or two lines of code to get data from many websites.
Auto Pagination: it automatically does the pagination for you so that you can specify the desired count of the returned results without even considering the count-per-request restriction from the API.
Smart API request strategy: it can issue API requests in parallel while respecting the rate limit policy.
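To make the auto pagination idea concrete, here is a minimal, library-independent sketch. It is not dataprep's implementation: `fetch_with_auto_pagination`, `fetch_page`, and the in-memory `fake_api` are hypothetical names used only for illustration, and parallel requests and rate limiting are left out. The caller asks for a total count, and the helper splits it into requests that each stay within the per-request limit.

```python
from typing import Callable, List


def fetch_with_auto_pagination(
    fetch_page: Callable[[int, int], List[int]],
    count: int,
    per_request_limit: int,
) -> List[int]:
    """Collect up to `count` results by issuing as many requests as needed."""
    results: List[int] = []
    while len(results) < count:
        # Never ask for more than the API allows in a single request.
        limit = min(per_request_limit, count - len(results))
        # The number of results collected so far doubles as the next offset.
        page = fetch_page(len(results), limit)
        if not page:
            # The source has no more data; stop early.
            break
        results.extend(page)
    return results


# A stand-in for a remote API that holds 120 records.
RECORDS = list(range(120))
request_log = []


def fake_api(offset: int, limit: int) -> List[int]:
    """Return one page of RECORDS and remember which request was made."""
    request_log.append((offset, limit))
    return RECORDS[offset:offset + limit]


rows = fetch_with_auto_pagination(fake_api, count=110, per_request_limit=50)
print(len(rows), request_log)
```

The caller only states the total it wants (110 here); the per-request limit of 50 is handled inside the helper.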

In the following example, you can download Yelp business search results into a pandas DataFrame using only two lines of code, without digging deep into the Yelp documentation! More examples can be found here: Examples
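The two calls can be sketched as follows. This assumes a dataprep release that exposes `connect` in `dataprep.connector`, an awaitable `query` method, and a `_count` argument; other releases may differ (for example, by using a `Connector` class), so check the documentation for your installed version. `search_yelp` is a hypothetical wrapper, and you need your own Yelp access token.

```python
async def search_yelp(access_token, term, location, count=None):
    """Query the Yelp "businesses" table through dataprep's connector.

    Hypothetical wrapper; the connect/query usage is an assumption about
    the installed dataprep release.
    """
    # Imported inside the function so dataprep is only needed when it is called.
    from dataprep.connector import connect

    dc = connect("yelp", _auth={"access_token": access_token})
    params = {"term": term, "location": location}
    if count is not None:
        # Ask for a total number of results; pagination is left to the connector.
        params["_count"] = count
    return await dc.query("businesses", **params)
```

In a notebook you can `await search_yelp(...)` directly; in a script, wrap the call in `asyncio.run(...)`.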

Clean

DataPrep.Clean contains simple functions designed for cleaning and standardizing a column in a DataFrame. It provides:

A unified API: each function follows the syntax clean_{type}(df, "column name") (see an example below)
Python Data Science Support: its design for cleaning pandas and Dask DataFrames enables seamless integration into the Python data science workflow
Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning

The following example shows how to clean a column containing messy emails:
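A minimal sketch of such a call, following the `clean_{type}(df, "column name")` syntax with `clean_email`. The helper `clean_messy_emails` and the sample values are made up for illustration; the sketch assumes pandas and dataprep are installed.

```python
def clean_messy_emails(emails):
    """Put the strings in a DataFrame and clean its "email" column.

    Hypothetical helper for illustration; only clean_email comes from dataprep.
    """
    # Imported inside the function so pandas and dataprep are only needed
    # when it is called.
    import pandas as pd
    from dataprep.clean import clean_email

    df = pd.DataFrame({"email": emails})
    # Follows the clean_{type}(df, "column name") syntax.
    return clean_email(df, "email")


# Made-up sample values with typical problems: stray spaces, mixed case,
# a string that is not an email, and a missing value.
MESSY_EMAILS = ["yi@sfu.ca", "y i@sfu.ca", "Yi@Gmail.com", "hello", None]
```

Calling `clean_messy_emails(MESSY_EMAILS)` returns whatever `clean_email` produces for that column.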

Type validation is also supported:
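A minimal sketch of validation, assuming `validate_email` from `dataprep.clean` accepts a column (a pandas Series) and reports which values are valid emails. `valid_email_mask` is a hypothetical helper name, not part of dataprep.

```python
def valid_email_mask(df, column="email"):
    """Return dataprep's validity check for one column of df.

    Hypothetical helper; validate_email is assumed to accept a pandas Series.
    """
    # Imported inside the function so dataprep is only needed when it is called.
    from dataprep.clean import validate_email

    return validate_email(df[column])
```

The result can then be used to filter or inspect the rows that fail validation.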

Below are the supported semantic types (more are currently being developed).

Semantic Types
longitude/latitude
country
email
url
phone

For more information, refer to the User Guide.

Contribute

There are many ways to contribute to Dataprep.

Submit bugs and help us verify fixes as they are checked in.
Review the source code changes.
Engage with other Dataprep users and developers on StackOverflow.
Help each other in the Dataprep Community Discord and Mail list & Forum.
Contribute bug fixes.
Provide use cases and write down your user experience.

Please take a look at our wiki for development documentation!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

Pandas Profiling

Inspired the report functionality and insights provided in DataPrep.eda.

missingno

Inspired the missing value analysis in DataPrep.eda.
