
nguyenkhacbaoanh/dataprep

 
 



Documentation | Forum | Mail List

Dataprep lets you prepare your data using a single library with a few lines of code.

Currently, you can use dataprep to:

  • Collect data from common data sources (through dataprep.connector)
  • Do your exploratory data analysis (through dataprep.eda)
  • ...more modules are coming

Releases

Dataprep is released on PyPI and conda-forge.

Installation

pip install -U dataprep

Examples & Usages

The following examples can give you an impression of what dataprep can do:

EDA

Exploratory data analysis involves a number of common tasks, such as taking a quick look at the distribution of each column or understanding the correlations between columns.

The EDA module groups these tasks into functions, so that each task can be finished with a single function call.

  • Want to understand the distributions for each DataFrame column? Use plot.

  • Want to understand the correlation between columns? Use plot_correlation.

  • Or, if you want to understand the impact of the missing values for each column, use plot_missing.
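
The three calls listed above can be sketched as a small helper. This is a minimal sketch rather than an official recipe: `run_eda` is a hypothetical wrapper written for this example, and it assumes `df` is a pandas DataFrame and that dataprep is installed. The returned objects are meant to be displayed in a notebook.

```python
def run_eda(df, column=None):
    """Call the three EDA entry points on a DataFrame.

    With column=None each function gives an overview of the whole frame;
    with a column name, each call drills down into that column.
    """
    # Imported inside the function so that this sketch can be loaded
    # even where dataprep is not installed.
    from dataprep.eda import plot, plot_correlation, plot_missing

    args = (df,) if column is None else (df, column)
    return {
        "distribution": plot(*args),
        # Assumption: depending on the dataprep version, plot_correlation
        # may only accept numerical columns here.
        "correlation": plot_correlation(*args),
        "missing": plot_missing(*args),
    }
```

Calling `run_eda(df)` gives the overview plots, while `run_eda(df, "some_column")` (a hypothetical column name) requests the per-column views.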

You can drill down to get more information by giving plot, plot_correlation, or plot_missing a column name. For example, for plot_missing:

    for a numerical column using plot:

    for a categorical column using plot:

Don't forget to check out the examples folder for a detailed demonstration!

Connector

Connector provides a simple way to collect data from different websites, offering several benefits:

  • A unified API: you can fetch data from many websites using one or two lines of code.
  • Auto Pagination: it automatically does the pagination for you so that you can specify the desired count of the returned results without even considering the count-per-request restriction from the API.
  • Smart API request strategy: it can issue API requests in parallel while respecting the rate limit policy.
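
To make the auto-pagination idea concrete, here is a minimal standard-library sketch of the general technique. It is not Connector's actual implementation: `fetch_all`, `fake_api`, and the `offset`/`limit` request shape are assumptions made for this example. The caller asks for a total count, and the loop splits that into requests no larger than the per-request limit.

```python
from typing import Callable, List


def fetch_all(fetch_page: Callable[[int, int], List[dict]],
              count: int, page_limit: int) -> List[dict]:
    """Collect up to `count` records from an API that returns at most
    `page_limit` records per request."""
    results: List[dict] = []
    offset = 0
    while len(results) < count:
        # Never ask for more than the API allows or than is still needed.
        size = min(page_limit, count - len(results))
        page = fetch_page(offset, size)
        if not page:  # the source has no more records
            break
        results.extend(page)
        offset += len(page)
    return results


# Hypothetical stand-in for a remote API that holds 130 records.
_RECORDS = [{"id": i} for i in range(130)]


def fake_api(offset: int, limit: int) -> List[dict]:
    return _RECORDS[offset:offset + limit]


rows = fetch_all(fake_api, count=120, page_limit=50)
print(len(rows))  # 120, gathered in requests of 50, 50 and 20 records
```

The sketch issues requests one after another; sending them in parallel under a rate limit, as the last bullet describes, would need additional scheduling that is left out here.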

For example, you can download Yelp business search results into a pandas DataFrame using only two lines of code, without digging deep into the Yelp documentation! More examples can be found here: Examples

Clean

DataPrep.Clean contains simple functions designed for cleaning and standardizing a column in a DataFrame. It provides

  • A unified API: each function follows the syntax clean_{type}(df, "column name") (see an example below)
  • Python Data Science Support: its design for cleaning pandas and Dask DataFrames enables seamless integration into the Python data science workflow
  • Transparency: a report is generated that summarizes the alterations to the data that occurred during cleaning

The following example shows how to clean a column containing messy emails:
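
As a library-free illustration of what cleaning such a column involves, the sketch below standardizes a list of email strings and counts what changed. It is not DataPrep's implementation: `clean_email_values`, the regular expression, and the report keys are assumptions made for this example. With DataPrep itself, the syntax described above gives a call of the form clean_email(df, "email").

```python
import re
from typing import Dict, List, Optional, Tuple

# Deliberately simple pattern; real-world email validation is stricter.
_EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def clean_email_values(
    values: List[Optional[str]],
) -> Tuple[List[Optional[str]], Dict[str, int]]:
    """Trim and lowercase each value, replace invalid ones with None,
    and return the cleaned list together with a small change report."""
    cleaned: List[Optional[str]] = []
    report = {"unchanged": 0, "standardized": 0, "invalid": 0}
    for raw in values:
        candidate = raw.strip().lower() if isinstance(raw, str) else None
        if candidate is None or not _EMAIL_RE.match(candidate):
            cleaned.append(None)
            report["invalid"] += 1
        else:
            cleaned.append(candidate)
            report["standardized" if candidate != raw else "unchanged"] += 1
    return cleaned, report


messy = ["  Abc.Example@Mail.com ", "user@example.org", "hello", None]
cleaned, report = clean_email_values(messy)
print(cleaned)  # ['abc.example@mail.com', 'user@example.org', None, None]
print(report)   # {'unchanged': 1, 'standardized': 1, 'invalid': 2}
```

Returning the report next to the cleaned values mirrors the transparency goal listed above: the caller can see how many entries were altered or rejected.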

Type validation is also supported:
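
A rough sketch of what validating a semantic type means in practice: each value in a column is mapped to True or False depending on whether it looks like a well-formed email. The names `is_valid_email` and `validate_column` and the pattern are hypothetical and written for this example; they do not reproduce DataPrep's own validation functions.

```python
import re
from typing import Iterable, List

# Deliberately simple pattern; real-world email validation is stricter.
_EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def is_valid_email(value: object) -> bool:
    """Return True only for strings that match the email pattern."""
    return isinstance(value, str) and _EMAIL_RE.match(value) is not None


def validate_column(values: Iterable[object]) -> List[bool]:
    """Validate every entry of a column, one boolean per value."""
    return [is_valid_email(v) for v in values]


flags = validate_column(["user@example.org", "hello", None, "a@b"])
print(flags)  # [True, False, False, False]
```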

Below are the supported semantic types (more are currently being developed).

  • longitude/latitude
  • country
  • email
  • url
  • phone

For more information, refer to the User Guide.

Contribute

There are many ways to contribute to Dataprep.

  • Submit bugs and help us verify fixes as they are checked in.
  • Review the source code changes.
  • Engage with other Dataprep users and developers on StackOverflow.
  • Help each other in the Dataprep Community Discord and Mail list & Forum.
  • Twitter
  • Contribute bug fixes.
  • Provide use cases and write down your user experience.

Please take a look at our wiki for development documentation!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

  • Pandas Profiling

    Inspired the report functionality and insights provided in DataPrep.eda.

  • missingno

    Inspired the missing value analysis in DataPrep.eda.

