## Data
The data collected in these experiments follows a consistent format: an anonymized participant ID, a baseline constant, and one column for each of the tested variables (aka stimuli).

<img src="Images/data_snippet.png" width="600" alt="Market research data snippet">
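As a rough illustration of this layout (the column names below are placeholders, not the project's actual headers), a toy dataset might look like the following:

```python
# Hypothetical illustration of the expected data layout.
import pandas as pd

survey_data = pd.DataFrame({
    "participant_id": ["P001", "P002", "P003"],  # anonymized participant ID
    "baseline": [1, 1, 1],                       # baseline constant
    "stimulus_1": [4, 2, 5],                     # ratings for each tested variable
    "stimulus_2": [3, 5, 1],
    "stimulus_3": [5, 4, 2],
})
print(survey_data.head())
```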

## Analysis
The analysis for this tool comprises three major components: Factor Analysis, Clustering, and Classification.

### Factor Analysis
The analysis script begins with dimension reduction through factor analysis. The purpose of the factor analysis, as opposed to clustering on the raw customer data directly, is to reduce noise in the initial dataset and to obtain factors that generalize better to future, unseen data.

The variables are all on a similar scale, so the script uses the covariance matrix for identifying principal components.
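A minimal sketch of this step, assuming scikit-learn, with synthetic ratings standing in for the real survey data (the 80% variance threshold is illustrative, not the project's setting):

```python
# Sketch of the dimension-reduction step. Because the variables share a similar
# scale, the data is only centered (covariance matrix), not standardized to a
# correlation matrix.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
ratings = rng.integers(1, 6, size=(200, 20)).astype(float)  # 200 respondents, 20 stimuli

pca = PCA(n_components=0.80)                 # keep enough components for ~80% of the variance
factor_scores = pca.fit_transform(ratings)   # PCA centers the data internally (covariance)
print(pca.explained_variance_ratio_.round(3))
```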

### Clustering
The clustering portion of the latest script creates and compares clustering solutions for 2-7 clusters using k-Means, k-Medoids, and hierarchical algorithms. Solutions are compared with the silhouette score, which is calculated from the mean intra-cluster distance (how close points are to the rest of their own cluster) and the mean nearest-cluster distance (how far they are from the closest other cluster) ([sklearn silhouette score documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)).

The optimal clustering solution is selected based on the highest silhouette score, and its cluster assignments serve as the labels for the training, validation, and test sets.
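A simplified sketch of the comparison loop, assuming scikit-learn; k-Medoids (available separately in `scikit-learn-extra`) is omitted for brevity, and the random data stands in for the factor scores from the previous step:

```python
# Compare clustering solutions for k = 2..7 and keep the highest silhouette score.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
factor_scores = rng.normal(size=(200, 5))    # stand-in for the factor scores above

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 8):                        # 2-7 clusters
    for model in (KMeans(n_clusters=k, n_init=10, random_state=42),
                  AgglomerativeClustering(n_clusters=k)):
        labels = model.fit_predict(factor_scores)
        score = silhouette_score(factor_scores, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels

print(f"Best solution: k={best_k}, silhouette={best_score:.3f}")
```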

#### Business Note
In a business setting, the optimal clustering solution would need to be validated as useful before being used to classify new customer data. For the sake of expediency, this project assumes that the clusters are representative of customer groups.


### Classification
After splitting the clustered data into training, validation, and test sets, classifiers are trained using Random Forest, gradient boosted trees, support vector classification, and k-Nearest Neighbors. Accuracy is used to select the optimal classification model. We chose accuracy as our primary metric instead of precision and recall because this is a multi-class problem with no single "positive" class, so there are no "false positive" or "false negative" outcomes in the usual sense, and misclassifying one cluster is no more costly than misclassifying any other. Another benefit of accuracy is that it is easily understood by end users.
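A condensed sketch of this step, assuming scikit-learn with default hyperparameters and a single hold-out split in place of the full train/validation/test procedure; `factor_scores` and `best_labels` refer to the clustering sketch above:

```python
# Train several candidate classifiers and pick the one with the highest accuracy.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_val, y_train, y_val = train_test_split(
    factor_scores, best_labels, test_size=0.3, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
    "svc": SVC(random_state=42),
    "knn": KNeighborsClassifier(),
}

validation_accuracy = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    validation_accuracy[name] = accuracy_score(y_val, clf.predict(X_val))

best_model = max(validation_accuracy, key=validation_accuracy.get)
print(validation_accuracy, "->", best_model)
```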

Individual variable importance is obtained by averaging the feature importance outputs of the Random Forest and gradient boosted trees algorithms. In the initial analysis, 7 out of the top 10 variables appeared in both lists.
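A hedged sketch of this averaging, reusing the fitted tree-based models from the previous snippet (variable names here are illustrative, not the project's):

```python
# Average the feature importances of the two tree-based models and rank features.
import numpy as np

rf_importance = candidates["random_forest"].feature_importances_
gbt_importance = candidates["gradient_boosting"].feature_importances_
average_importance = (rf_importance + gbt_importance) / 2

top_features = np.argsort(average_importance)[::-1][:10]   # indices of the top 10 features
print(top_features, average_importance[top_features].round(3))
```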

#### Model Quality

## Application
