Because Broccoli is good for you, and so is this repo!
You need to write SQL to analyze a database. Often, the findings of an ad-hoc analysis become hard-coded relationships baked into dashboards, excluding relationships that weren't found, or didn't exist, at the time of the analysis. We wanted to create a data product that supports hard-coded dashboard logic by testing, in the background, the things we didn't hard-code: something that runs asynchronously alongside other jobs and queries, asking questions and finding connections and useful insights.
Pandas Profiling has done a lot of the heavy lifting for initial exploratory data analysis (EDA). It does a great job of generating profile reports in HTML format for a dataset, saving an analyst a lot of time they would otherwise spend on that process themselves. This project stands on the shoulders of Pandas Profiling and goes a few steps further.
Current tools do a great job of identifying the datatype of a column (numeric, date, string, etc.). With that knowledge, descriptive stats can be produced... and that's about where it stops.
Automatic Broccoli goes further by attempting to identify not only the type of a column, but its analytical possibilities, especially in relation to the other columns present. Currently, it can detect whether a column is binary, categorical, or continuous, and then prepares a dictionary of the possible unique combinations of analyses that can be tested.
>>> import auto_broccoli as broc
>>> import pandas_profiling as pp
>>> ab = broc.AutoBroccoli()  # if df=None, it autogenerates its own data
>>> # Step 1: run the awesome pandas profiling
>>> pobject = pp.ProfileReport(ab.df)
>>> des = pobject.get_description()
>>> # Step 2: identify analytical types of variables
>>> analytics_df = des['variables']
>>> type_dict = ab.classify_column_types(analytics_df)
>>> type_dict
{'binary': ['active', 'nice_person'],
'categorical': ['buyer_type', 'content_type'],
'continuous': ['duration_percent', 'impressions', 'visits'],
'date': ['date'],
'high_cardinality': ['user_id']}
Notice how it classified the columns by their analytical potential.
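Under the hood, this kind of classification comes down to heuristics over each column's dtype and distinct-value count. Here is a minimal sketch of the idea; the function name, thresholds, and rules are illustrative assumptions, not AutoBroccoli's actual implementation:

```python
import pandas as pd

def classify_columns(df: pd.DataFrame, cat_threshold: int = 20) -> dict:
    """Illustrative heuristic: bucket columns by analytical type."""
    types = {"binary": [], "categorical": [], "continuous": [],
             "date": [], "high_cardinality": []}
    for col in df.columns:
        s = df[col].dropna()
        n_unique = s.nunique()
        if pd.api.types.is_datetime64_any_dtype(s):
            types["date"].append(col)
        elif n_unique == 2:
            types["binary"].append(col)
        elif pd.api.types.is_numeric_dtype(s) and n_unique > cat_threshold:
            types["continuous"].append(col)
        elif n_unique <= cat_threshold:
            types["categorical"].append(col)
        else:
            # Many distinct non-numeric values, e.g. user IDs
            types["high_cardinality"].append(col)
    return types
```

Once the columns are classified, you can create the analytical buckets like this: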
>>> # Step 3: set them into analytical buckets
>>> analytics_dict = ab.create_analytical_buckets(type_dict)
>>> analytics_dict
{'bin X bin': [('active', 'nice_person')],
'bin X cat': [['active', 'buyer_type'],
['active', 'content_type'],
['nice_person', 'buyer_type'],
['nice_person', 'content_type']],
'bin X cont': [['active', 'duration_percent'],
['active', 'impressions'],
['active', 'visits'],
['nice_person', 'duration_percent'],
['nice_person', 'impressions'],
['nice_person', 'visits']],
'cat X cont': [['buyer_type', 'duration_percent'],
['buyer_type', 'impressions'],
['buyer_type', 'visits'],
['content_type', 'duration_percent'],
['content_type', 'impressions'],
['content_type', 'visits']],
'cont X cont': [('duration_percent', 'impressions'),
('duration_percent', 'visits'),
('impressions', 'visits')]}
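The bucketing step itself is plain combinatorics: same-type buckets come from unique pairs within a group, and mixed buckets from the cross product of two groups. A hedged sketch of that logic (the function name is hypothetical; the bucket keys match the output above):

```python
from itertools import combinations, product

def make_buckets(type_dict: dict) -> dict:
    """Illustrative sketch: enumerate the unique column pairings to test."""
    binary = type_dict.get("binary", [])
    cat = type_dict.get("categorical", [])
    cont = type_dict.get("continuous", [])
    return {
        "bin X bin": list(combinations(binary, 2)),            # within-group pairs
        "bin X cat": [list(p) for p in product(binary, cat)],  # cross-group pairs
        "bin X cont": [list(p) for p in product(binary, cont)],
        "cat X cont": [list(p) for p in product(cat, cont)],
        "cont X cont": list(combinations(cont, 2)),
    }
```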
An analytics dataframe can then be generated, showing which columns were used and the insight(s) found, if any. The dataframe has other columns that are omitted from this example due to space constraints.
>>> # Step 4: run the analytics!
>>> resultsdf = ab.auto_analysis(d_=analytics_dict)
>>> resultsdf
Or you can run all of the above in one call like this:
>>> ab = broc.AutoBroccoli()
>>> resultsdf = ab.main()
>>> resultsdf
analysis | analysis_type | col_1 | col_2 | dataset | date | insight_text | p_val
--- | --- | --- | --- | --- | --- | --- | ---
Chi-square | bin X cat | active | buyer_type | random | 04-04-18 | Both non-active and active have a value of 127 for Other. Spouse is the top category for both non-active at 130 and active at 137 for "buyer_type". Active and Non-Active seem to have similar frequencies. | 0.931 |
Chi-square | bin X cat | active | content_type | random | 04-04-18 | Active and Non-Active are the farthest apart on "Doubleclick" in "content_type". Active maximum on Doubleclick is 105 and non-active minimum is on Doubleclick at 90. | 0.6876 |
Chi-square | bin X cat | nice_person | buyer_type | random | 04-04-18 | Non-Nice_Person and Nice_Person are the farthest apart on "Me" in "buyer_type". Non-Nice_Person maximum on Me is 126 and nice_person minimum is on Me at 105. | 0.1658 |
Chi-square | bin X cat | nice_person | content_type | random | 04-04-18 | Significant difference in 'non-nice_person' and 'nice_person' between 'content_type' groups. Non-Nice_Person and Nice_Person are the farthest apart on "Newspaper" in "content_type". Non-Nice_Person maximum on Newspaper is 111 and nice_person minimum is on Newspaper at 92. Nice_Person and Non-Nice_Person are the farthest apart on "Social Media" in "content_type". Nice_Person maximum on Social Media is 126 and non-nice_person minimum is on Social Media at 88. | 0.0421 |
T-test | bin X cont | active | duration_percent | random | 04-04-18 | Both Active and Non-Active have a medium level of variability on "duration_percent". Active average is 0.3 and Non-Active average is 0.3. | 0.8108 |
T-test | bin X cont | active | impressions | random | 04-04-18 | Both Active and Non-Active have a medium level of variability on "impressions". Active average is 3.0 and Non-Active average is 3.0. | 0.9334 |
T-test | bin X cont | active | visits | random | 04-04-18 | Non-Active and Active have basically the same spread around their means of 289.37 and 284.37 on "visits". | 0.2707 |
T-test | bin X cont | nice_person | duration_percent | random | 04-04-18 | Both Nice_Person and Non-Nice_Person have a medium level of variability on "duration_percent". Nice_Person average is 0.3 and Non-Nice_Person average is 0.3. | 0.9254 |
T-test | bin X cont | nice_person | impressions | random | 04-04-18 | Both Nice_Person and Non-Nice_Person have a medium level of variability on "impressions". Nice_Person average is 3.0 and Non-Nice_Person average is 2.9. | 0.3874 |
T-test | bin X cont | nice_person | visits | random | 04-04-18 | Non-Nice_Person and Nice_Person have basically the same spread around their means of 283.23 and 290.87 on "visits". | 0.3585 |
Anova | cat X cont | buyer_type | duration_percent | random | 04-04-18 | Coming soon | 0.9556 |
Anova | cat X cont | buyer_type | impressions | random | 04-04-18 | Coming soon | 0.8558 |
Anova | cat X cont | buyer_type | visits | random | 04-04-18 | Coming soon | 0.144 |
Anova | cat X cont | content_type | duration_percent | random | 04-04-18 | Coming soon | 0.4053 |
Anova | cat X cont | content_type | impressions | random | 04-04-18 | Coming soon | 0.4027 |
Anova | cat X cont | content_type | visits | random | 04-04-18 | Coming soon | 0.9779 |
Pearson corr | cont X cont | duration_percent | impressions | random | 04-04-18 | Not likely a linear relationship between duration_percent and impressions, with a coef of -0.02. | 0.4723
Pearson corr | cont X cont | duration_percent | visits | random | 04-04-18 | Not likely a linear relationship between duration_percent and visits, with a coef of -0.01. | 0.8102
Pearson corr | cont X cont | impressions | visits | random | 04-04-18 | Not likely a linear relationship between impressions and visits, with a coef of 0.05. | 0.0853
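Each bucket maps onto a classical test, which is what the `analysis` column above reflects: chi-square for bin X cat, t-test for bin X cont, one-way ANOVA for cat X cont, and Pearson correlation for cont X cont. A minimal sketch of that dispatch using scipy.stats (the helper and its signature are assumptions for illustration; `df` is the underlying pandas DataFrame):

```python
import pandas as pd
from scipy import stats

def pair_p_value(df: pd.DataFrame, kind: str, col1: str, col2: str) -> float:
    """Illustrative dispatch: choose a test by bucket type, return its p-value."""
    if kind in ("bin X bin", "bin X cat"):
        # Chi-square test of independence on the contingency table
        _, p, _, _ = stats.chi2_contingency(pd.crosstab(df[col1], df[col2]))
    elif kind == "bin X cont":
        # Two-sample t-test: continuous column split by the binary column
        a, b = (g[col2].dropna() for _, g in df.groupby(col1))
        _, p = stats.ttest_ind(a, b)
    elif kind == "cat X cont":
        # One-way ANOVA across the category levels
        _, p = stats.f_oneway(*(g[col2].dropna() for _, g in df.groupby(col1)))
    else:  # "cont X cont"
        # Pearson correlation on rows where both columns are present
        sub = df[[col1, col2]].dropna()
        _, p = stats.pearsonr(sub[col1], sub[col2])
    return p
```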
Planned improvements:

- Testing!
- Better date handling
- Add connections across (2+) tables instead of just one table
- Set up meta database of
- Set up input tags for dataset types such as "marketing", "social media", etc.
Copyright (c) 2017 Front Analytics Inc. Licensed under the MIT License.