
Automatic Broccoli

Because Broccoli is good for you, and so is this repo!

Why?

You need to write SQL to analyze a database. Often, the findings of an ad-hoc analysis lead to hard-coded relationships that get put into dashboards, excluding relationships that either weren't found or didn't exist at the time of the analysis. We wanted to create a data product that supplements hard-coded dashboard logic by testing the things we didn't hard code in the background, running asynchronously alongside other jobs and queries to ask questions and surface connections and useful insights.

Prior Art

Pandas Profiling has done a lot of the heavy lifting for initial exploratory data analysis (EDA). It does a great job of generating HTML profile reports for a dataset, saving an analyst a lot of time going through that process on their own. This project stands on the shoulders of Pandas Profiling by going a few steps further.

How is Automatic Broccoli different?

Current tools do a great job of identifying the data type of a column, such as numeric, date, or string. With that knowledge, descriptive stats can be produced... and that's about where it stops.

Automatic Broccoli goes further, attempting to identify not only the type of a column but also its analytical possibilities, especially in relation to the other columns present. Currently, it can detect whether a column is binary, categorical, or continuous, and it then prepares a dictionary of the unique combinations of analyses that can be tested.
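The exact classification rules live in the package source, but a minimal sketch of such a type classifier, based on pandas dtypes and distinct-value counts, could look like the following (classify_series and its thresholds are illustrative names, not the package's actual API):

import pandas as pd

def classify_series(s, cat_threshold=10, id_ratio=0.95):
    # Illustrative heuristic only; the real rules live in
    # AutoBroccoli.classify_column_types, which works from the
    # pandas-profiling description and may differ.
    if pd.api.types.is_datetime64_any_dtype(s):
        return 'date'
    n_unique = s.nunique(dropna=True)
    if n_unique == 2:
        return 'binary'
    if n_unique / max(len(s), 1) > id_ratio:
        return 'high_cardinality'  # e.g. a user_id column
    if pd.api.types.is_numeric_dtype(s):
        return 'continuous'
    if n_unique <= cat_threshold:
        return 'categorical'
    return 'high_cardinality'

Inverting {col: classify_series(df[col]) for col in df} into one list per type would yield a dictionary like the type_dict shown in the example below.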

Example

>>> import auto_broccoli as broc
>>> import pandas_profiling as pp

>>> ab = broc.AutoBroccoli()  # if df=None, it auto-generates its own data

>>> # Step 1: run the awesome pandas profiling
>>> pobject = pp.ProfileReport(ab.df)
>>> des = pobject.get_description()

>>> # Step 2: identify analytical types of variables
>>> analytics_df = des['variables']
>>> type_dict = ab.classify_column_types(analytics_df)
>>> type_dict
{'binary': ['active', 'nice_person'],
 'categorical': ['buyer_type', 'content_type'],
 'continuous': ['duration_percent', 'impressions', 'visits'],
 'date': ['date'],
 'high_cardinality': ['user_id']}

Notice how it classified the columns by their analytical potential. Then you can create the analytical buckets like this:

>>> # Step 3: set them into analytical buckets
>>> analytics_dict = ab.create_analytical_buckets(type_dict)
>>> analytics_dict
{'bin X bin': [('active', 'nice_person')],
 'bin X cat': [['active', 'buyer_type'],
  ['active', 'content_type'],
  ['nice_person', 'buyer_type'],
  ['nice_person', 'content_type']],
 'bin X cont': [['active', 'duration_percent'],
  ['active', 'impressions'],
  ['active', 'visits'],
  ['nice_person', 'duration_percent'],
  ['nice_person', 'impressions'],
  ['nice_person', 'visits']],
 'cat X cont': [['buyer_type', 'duration_percent'],
  ['buyer_type', 'impressions'],
  ['buyer_type', 'visits'],
  ['content_type', 'duration_percent'],
  ['content_type', 'impressions'],
  ['content_type', 'visits']],
 'cont X cont': [('duration_percent', 'impressions'),
  ('duration_percent', 'visits'),
  ('impressions', 'visits')]}
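Conceptually, the buckets are just the unordered pairs of columns, keyed by the pair of analytical types. A rough sketch of how they could be generated with itertools.combinations (pair_buckets and TESTABLE are illustrative, not the package's API; note the output above has no 'cat X cat' bucket, so the real logic evidently keeps only pairings it has a test for):

from itertools import combinations

TESTABLE = {'bin X bin', 'bin X cat', 'bin X cont', 'cat X cont', 'cont X cont'}

def pair_buckets(type_dict):
    # Tag each column with a short type code, pair the columns,
    # then group the pairs by their combined type signature.
    short = {'binary': 'bin', 'categorical': 'cat', 'continuous': 'cont'}
    order = ['bin', 'cat', 'cont']
    cols = [(short[t], c) for t in short for c in type_dict.get(t, [])]
    buckets = {}
    for pair in combinations(cols, 2):
        (t1, c1), (t2, c2) = sorted(pair, key=lambda tc: order.index(tc[0]))
        key = t1 + ' X ' + t2
        if key in TESTABLE:  # skip pairings with no associated test
            buckets.setdefault(key, []).append((c1, c2))
    return buckets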

An analytics DataFrame can then be generated, yielding the columns that were tested and the insight(s) found, if any. (Other output columns exist but are omitted from this example for space.)

>>> # Step 4: run the analytics!
>>> resultsdf = ab.auto_analysis(d_=analytics_dict)
>>> resultsdf

Or you can run all of the above in a single call:

>>> ab = broc.AutoBroccoli()
>>> resultsdf = ab.main()
>>> resultsdf
| analysis | analysis_type | col_1 | col_2 | dataset | date | insight_text | p_val |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chi-square | bin X cat | active | buyer_type | random | 04-04-18 | Both non-active and active have a value of 127 for Other. Spouse is the top category for both non-active at 130 and active at 137 for "buyer_type". Active and Non-Active seem to have similar frequencies. | 0.931 |
| Chi-square | bin X cat | active | content_type | random | 04-04-18 | Active and Non-Active are the farthest apart on "Doubleclick" in "content_type". Active maximum on Doubleclick is 105 and non-active minimum is on Doubleclick at 90. | 0.6876 |
| Chi-square | bin X cat | nice_person | buyer_type | random | 04-04-18 | Non-Nice_Person and Nice_Person are the farthest apart on "Me" in "buyer_type". Non-Nice_Person maximum on Me is 126 and nice_person minimum is on Me at 105. | 0.1658 |
| Chi-square | bin X cat | nice_person | content_type | random | 04-04-18 | Significant difference between 'non-nice_person' and 'nice_person' across 'content_type' groups. Non-Nice_Person and Nice_Person are the farthest apart on "Newspaper" in "content_type". Non-Nice_Person maximum on Newspaper is 111 and nice_person minimum is on Newspaper at 92. Nice_Person and Non-Nice_Person are the farthest apart on "Social Media" in "content_type". Nice_Person maximum on Social Media is 126 and non-nice_person minimum is on Social Media at 88. | 0.0421 |
| T-test | bin X cont | active | duration_percent | random | 04-04-18 | Both Active and Non-Active have a medium level of variability on "duration_percent". Active average is 0.3 and Non-Active average is 0.3. | 0.8108 |
| T-test | bin X cont | active | impressions | random | 04-04-18 | Both Active and Non-Active have a medium level of variability on "impressions". Active average is 3.0 and Non-Active average is 3.0. | 0.9334 |
| T-test | bin X cont | active | visits | random | 04-04-18 | Non-Active and Active have basically the same spread around their means of 289.37 and 284.37 on "visits". | 0.2707 |
| T-test | bin X cont | nice_person | duration_percent | random | 04-04-18 | Both Nice_Person and Non-Nice_Person have a medium level of variability on "duration_percent". Nice_Person average is 0.3 and Non-Nice_Person average is 0.3. | 0.9254 |
| T-test | bin X cont | nice_person | impressions | random | 04-04-18 | Both Nice_Person and Non-Nice_Person have a medium level of variability on "impressions". Nice_Person average is 3.0 and Non-Nice_Person average is 2.9. | 0.3874 |
| T-test | bin X cont | nice_person | visits | random | 04-04-18 | Non-Nice_Person and Nice_Person have basically the same spread around their means of 283.23 and 290.87 on "visits". | 0.3585 |
| Anova | cat X cont | buyer_type | duration_percent | random | 04-04-18 | Coming soon | 0.9556 |
| Anova | cat X cont | buyer_type | impressions | random | 04-04-18 | Coming soon | 0.8558 |
| Anova | cat X cont | buyer_type | visits | random | 04-04-18 | Coming soon | 0.144 |
| Anova | cat X cont | content_type | duration_percent | random | 04-04-18 | Coming soon | 0.4053 |
| Anova | cat X cont | content_type | impressions | random | 04-04-18 | Coming soon | 0.4027 |
| Anova | cat X cont | content_type | visits | random | 04-04-18 | Coming soon | 0.9779 |
| Pearson corr | cont X cont | duration_percent | impressions | random | 04-04-18 | Not likely a linear relationship between duration_percent and impressions, with coef of -0.02 | 0.4723 |
| Pearson corr | cont X cont | duration_percent | visits | random | 04-04-18 | Not likely a linear relationship between duration_percent and visits, with coef of -0.01 | 0.8102 |
| Pearson corr | cont X cont | impressions | visits | random | 04-04-18 | Not likely a linear relationship between impressions and visits, with coef of 0.05 | 0.0853 |
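As the analysis column shows, each bucket maps onto a standard statistical test: chi-square for bin X bin and bin X cat, a two-sample t-test for bin X cont, one-way ANOVA for cat X cont, and Pearson correlation for cont X cont. A minimal sketch of that dispatch using scipy.stats (run_pair is an illustrative name, not the package's API; the real auto_analysis also generates the insight_text shown above):

import pandas as pd
from scipy import stats

def run_pair(df, bucket, col_1, col_2):
    # Illustrative dispatch from bucket type to test; returns the p-value.
    if bucket in ('bin X bin', 'bin X cat'):
        _, p, _, _ = stats.chi2_contingency(pd.crosstab(df[col_1], df[col_2]))
    elif bucket == 'bin X cont':
        g0, g1 = (grp[col_2].dropna() for _, grp in df.groupby(col_1))
        _, p = stats.ttest_ind(g0, g1)
    elif bucket == 'cat X cont':
        _, p = stats.f_oneway(*(grp[col_2].dropna() for _, grp in df.groupby(col_1)))
    else:  # 'cont X cont'
        _, p = stats.pearsonr(df[col_1], df[col_2])
    return p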

TODO:

  • Testing!
  • Better date handling
  • Add connections across two or more tables instead of just one table
  • Set up a meta database of
  • Set up input tags for dataset types such as "marketing", "social media", etc.

License

Copyright (c) 2017 Front Analytics Inc. Licensed under the MIT License.
