Generates profile reports from a pandas DataFrame.
The pandas df.describe() function is great but a little basic for serious exploratory data analysis.
pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histogram
- Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
- File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information
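To make a few of the statistics listed above concrete, here is a minimal sketch that computes some of them for a single numeric column with plain pandas. The helper name `basic_column_stats` is hypothetical, and this is not how pandas-profiling builds its report internally; it only illustrates what the listed quantities mean.

```python
import pandas as pd


def basic_column_stats(series: pd.Series) -> dict:
    """Compute a handful of the listed statistics for one numeric column.

    Illustrative only: pandas-profiling may define or compute these differently.
    """
    # Quantiles skip missing values by default.
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    mean = series.mean()
    std = series.std()
    return {
        # Essentials
        "n_unique": series.nunique(),
        "n_missing": int(series.isna().sum()),
        # Quantile statistics
        "min": series.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": series.max(),
        "range": series.max() - series.min(),
        "iqr": q3 - q1,
        # Descriptive statistics
        "mean": mean,
        "std": std,
        "cv": std / mean if mean != 0 else float("nan"),
        "mad": (series - median).abs().median(),  # median absolute deviation
        "skewness": series.skew(),
        "kurtosis": series.kurt(),
    }


stats = basic_column_stats(pd.Series([1.0, 2.0, 3.0, 4.0, None]))
print(stats["iqr"], stats["mad"], stats["n_missing"])
```

The report adds type inference, plots and correlations on top of numbers like these, which is why a single call is more convenient than assembling them by hand.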
Spark backend in progress: We can happily announce that we're nearing v1 for the Spark backend for generating profile reports. Beta testers wanted! The Spark backend will be released as a pre-release for this package.
Monitoring time series?: I'd like to draw your attention to popmon. Whereas pandas-profiling allows you to explore patterns in a single dataset, popmon allows you to uncover temporal patterns. It's worth checking out!
Contents: Examples | Installation | Documentation | Large datasets | Command line usage | Advanced usage | Support | Go beyond | Support the project | Types | How to contribute | Editor Integration | Dependencies
The following example reports showcase the capabilities of the package across a wide range of datasets and data types:
- Census Income (US Adult Census data relating income with other demographic properties)
- NASA Meteorites (comprehensive set of meteorite landings: object properties and locations)
- Titanic (the "Wonderwall" of datasets)
- NZA (open data from the Dutch Healthcare Authority)
- Stata Auto (1978 Automobile data)
- Colors (a simple colors dataset)
- Vektis (Vektis Dutch Healthcare data)
- UCI Bank Dataset (marketing dataset from a bank)
- Russian Vocabulary (100 most common Russian words, showcasing unicode text analysis)
- Website Inaccessibility (website accessibility analysis, showcasing support for URL data)
- Orange prices and Coal prices (simple pricing evolution datasets, showcasing the theming options)
You can install using the pip package manager by running
pip install pandas-profiling[notebook]
Alternatively, you can install the latest version directly from GitHub:
pip install https://github.com/ydataai/pandas-profiling/archive/master.zip
You can install using the conda package manager by running
conda install -c conda-forge pandas-profiling
Download the source code by cloning the repository or by pressing 'Download ZIP' on this page.
Install by navigating to the proper directory and running:
python setup.py install
The documentation for pandas_profiling can be found here.
Start by loading in your pandas DataFrame, e.g. by using:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
To generate the report, run:
profile = ProfileReport(df, title="Pandas Profiling Report")
You can configure the profile report in any way you like. The example code below loads the explorative configuration, which includes many features for text (length distribution, unicode information), files (file size, creation time) and images (dimensions, EXIF information). If you are interested in exactly which settings are used, you can compare it with the default configuration file.
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
Learn more about configuring pandas-profiling on the Advanced usage page.
We recommend generating reports interactively in a Jupyter notebook. There are two interfaces: widgets and an HTML report.
The widget interface is obtained by simply displaying the report. In the Jupyter Notebook, run:
profile.to_widgets()
The HTML report can be included in a Jupyter notebook by running the following code:
profile.to_notebook_iframe()
If you want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile.to_file("your_report.html")
Alternatively, you can obtain the data as JSON:
# As a string
json_data = profile.to_json()
# As a file
profile.to_file("your_report.json")
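Once you have the JSON string, it can be inspected with the standard library. The sketch below assumes the string holds a JSON object at the top level; the helper `top_level_sections` and the stand-in string are hypothetical and do not reflect the real report schema.

```python
import json


def top_level_sections(json_data: str) -> list:
    """Return the sorted top-level keys of a JSON report string."""
    report = json.loads(json_data)
    if not isinstance(report, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(report)


# Stand-in string for illustration only; in practice pass profile.to_json().
# The keys below are hypothetical and are NOT the real report schema.
example_json = '{"section_b": {}, "section_a": {"value": 1}}'
print(top_level_sections(example_json))
```

Listing the top-level keys first is a cheap way to find out which parts of the report are worth loading into further tooling.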
Version 2.4 introduces minimal mode.
This is a default configuration that disables expensive computations (such as correlations and duplicate row detection).
Use the following syntax:
profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
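If you profile datasets of varying size, you may want to switch to minimal mode automatically. The following is a rough sketch, not a recommendation from the package: the helpers `should_use_minimal` and `profile_dataset` and the one-million-cell threshold are assumptions chosen for illustration.

```python
def should_use_minimal(n_rows: int, n_cols: int, max_cells: int = 1_000_000) -> bool:
    """Decide whether a dataset is large enough to warrant minimal mode.

    The default threshold is an arbitrary, hypothetical value; tune it
    for your own hardware and patience.
    """
    return n_rows * n_cols > max_cells


def profile_dataset(df, title="Pandas Profiling Report"):
    """Build a report, enabling minimal mode for large DataFrames."""
    # Imported lazily so the size helper can be used without the package.
    from pandas_profiling import ProfileReport

    n_rows, n_cols = df.shape
    return ProfileReport(df, title=title, minimal=should_use_minimal(n_rows, n_cols))


print(should_use_minimal(100, 5))        # small dataset
print(should_use_minimal(5_000_000, 20)) # large dataset
```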
Benchmarks are available here.
For standard-formatted CSV files that can be read immediately by pandas, you can use the pandas_profiling executable.
Run the following for information about options and arguments.
pandas_profiling -h
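As a small illustration, the sketch below writes a tiny CSV file and assembles a command line for the executable. It assumes the executable takes the input CSV followed by the output HTML path as positional arguments; check the `-h` output for the exact usage in your installed version. The helper `build_command` and the file names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def build_command(csv_path, html_path, executable="pandas_profiling"):
    """Assemble the argument list: executable, input CSV, output HTML.

    The positional order is an assumption; verify it with `pandas_profiling -h`.
    """
    return [executable, str(csv_path), str(html_path)]


# Write a small, standard-formatted CSV file to a temporary directory.
workdir = Path(tempfile.mkdtemp())
csv_path = workdir / "data.csv"
with csv_path.open("w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["a", "b"])
    writer.writerows([[1, 2.5], [2, 3.5], [3, 4.5]])

command = build_command(csv_path, workdir / "report.html")
print(" ".join(command))

# To actually generate the report, run the command, for example:
# import subprocess; subprocess.run(command, check=True)
```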
A set of options is available to adapt the generated report:
- title (str): Title for the report ('Pandas Profiling Report' by default).
- pool_size (int): Number of workers in the thread pool. When set to zero, it is set to the number of CPUs available (0 by default).
- progress_bar (bool): If True, pandas-profiling will display a progress bar.
- infer_dtypes (bool): When True (default), the dtype of each variable is inferred by visions using the typeset logic (for instance, a column that has integers stored as strings will be analyzed as if it were numeric).
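The options above can be passed as keyword arguments when creating the report. Here is a minimal sketch that keeps a set of defaults in one place and lets individual calls override them; the names `report_options`, `merge_options` and `build_report` are hypothetical helpers, not part of the package.

```python
# Project-wide defaults for the four options described above.
report_options = {
    "title": "Pandas Profiling Report",
    "pool_size": 0,          # 0 means: use the number of available CPUs
    "progress_bar": False,   # keep logs quiet in batch jobs
    "infer_dtypes": True,    # let the typeset logic infer column types
}


def merge_options(**overrides) -> dict:
    """Return the defaults with any per-call overrides applied."""
    return {**report_options, **overrides}


def build_report(df, **overrides):
    """Create a ProfileReport using the merged options."""
    # Imported lazily so the option handling works without the package.
    from pandas_profiling import ProfileReport

    return ProfileReport(df, **merge_options(**overrides))


print(merge_options(title="Quarterly data", progress_bar=True))
```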
More settings can be found in the default configuration file and minimal configuration file.
You can find the configuration docs on the Advanced usage page.
Example:
profile = df.profile_report(
title="Pandas Profiling Report", plot={"histogram": {"bins": 8}}
)
profile.to_file("output.html")
Need help? Want to share a perspective? Want to report a bug? Ideas for collaboration? You can reach out via the following channels:
- Stack Overflow: ideal for asking questions on how to use the package
- GitHub Issues: bugs, proposals for change, feature requests
- Slack: general chat, questions, collaboration
- Email: project collaboration or sponsoring
For many real-world problems we are interested in how the data changes over time.
The excellent package popmon allows you to uncover such temporal patterns. To learn more about popmon, have a look at these resources here.
Profiling your data is closely related to data validation: often validation rules are defined in terms of well-known statistics.
For that purpose, pandas-profiling integrates with Great Expectations. You can find more details on the Great Expectations integration here.
Types are a powerful abstraction for effective data analysis that goes beyond the logical data types (integer, float, etc.).
pandas-profiling currently recognizes the following types: Boolean, Numerical, Date, Categorical, URL, Path, File and Image.
We have developed a type system for Python, tailored for data analysis: visions.
Choosing an appropriate typeset can both improve the overall expressiveness and reduce the complexity of your analysis/code.
To learn more about pandas-profiling's type system, check out the default implementation here.
In the meantime, user-customized summarizations and type definitions are now fully supported. If you have a specific use case, please reach out with ideas or a PR!
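To give a feel for what type inference means in practice, here is a deliberately simplified sketch that assigns one of four of the types named above to a pandas Series. It is not the visions algorithm and not the package's typeset; the function `infer_simple_type` and its rules are assumptions made for illustration.

```python
import pandas as pd


def infer_simple_type(series: pd.Series) -> str:
    """Guess a coarse type for a column (hand-rolled rules, not visions)."""
    values = series.dropna()
    if values.empty:
        return "Categorical"
    if pd.api.types.is_bool_dtype(values):
        return "Boolean"
    if pd.api.types.is_numeric_dtype(values):
        return "Numerical"
    if pd.api.types.is_datetime64_any_dtype(values):
        return "Date"
    # Integers or floats stored as strings are treated as numeric,
    # mirroring the behaviour described for infer_dtypes.
    if pd.to_numeric(values, errors="coerce").notna().all():
        return "Numerical"
    return "Categorical"


print(infer_simple_type(pd.Series(["1", "2", "3"])))
print(infer_simple_type(pd.Series(["red", "green", "blue"])))
```

A real typeset also has to handle URLs, paths, files and images, which is where a dedicated type system pays off over ad hoc dtype checks.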
Read about getting involved in the Contribution Guide.
A low-threshold place to ask questions or start contributing is the pandas-profiling Slack. Join the Slack community.
Some Integrated Development Environments integrate with pandas-profiling. See the Integrations documentation page for details.
Other editor integrations may be contributed via pull requests.
The profile report is written in HTML and CSS, which means pandas-profiling requires a modern browser.
You need Python 3 to run this package. Other dependencies can be found in the requirements files:
Filename | Requirements |
---|---|
requirements.txt | Package requirements |
requirements-dev.txt | Requirements for development |
requirements-test.txt | Requirements for testing |
setup.py | Requirements for Widgets etc. |