
Short-Answer Portion
--------------------
Your models have been implemented and customers are now using them in production.

Q1. Imagine the credit risk use case above. You have two models: a logistic regression model
with an F1-score of 0.60 and a neural network with an F1-score of 0.63. Which model would
you recommend for the bank and why?

A1.
In general, I would recommend logistic regression for the bank, because its results are easier to
interpret: the learned weight for each feature, combined with the feature values of a given input
object, shows how that feature pushes the prediction towards Good or Bad.
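
As an illustration of that interpretability, a minimal sketch with scikit-learn (the feature names and
toy data below are made up for illustration, not taken from the use case):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Toy credit data: [income, debt_ratio, num_late_payments]; label 1 = Bad loan.
    X = np.array([[50000, 0.20, 0],
                  [22000, 0.55, 3],
                  [70000, 0.10, 0],
                  [18000, 0.65, 5],
                  [40000, 0.35, 1],
                  [15000, 0.70, 4]], dtype=float)
    y = np.array([0, 1, 0, 1, 0, 1])

    # Standardize so weight magnitudes are comparable across features.
    X_std = StandardScaler().fit_transform(X)
    model = LogisticRegression().fit(X_std, y)

    for name, w in zip(["income", "debt_ratio", "num_late_payments"], model.coef_[0]):
        # exp(w) = multiplicative change in the odds of "Bad" per one std-dev increase.
        print(f"{name:20s} weight={w:+.2f}  odds ratio={np.exp(w):.2f}")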

In the case above the logistic regression F1-score does not differ much from the NN F1-score, and for a
bank this kind of metric is unlikely to be the primary success criterion. In a Bad/Good loan setting,
banks are typically more interested in a high Recall for the Bad class and can tolerate a lower Precision:
the objects classified as Bad can then be double-checked manually, which is acceptable for them.

Starting from F1 = (2 * Precision * Recall) / (Precision + Recall) and solving for each variable:

Recall    = (Precision * F1) / (2 * Precision - F1)
Precision = (Recall * F1) / (2 * Recall - F1)

where (2 * Precision - F1) > 0 and 0 <= Precision <= 1,
      (2 * Recall - F1) > 0 and 0 <= Recall <= 1.

So, for F1 = 0.60, 0.30  < Precision <= 1;
    for F1 = 0.63, 0.315 < Precision <= 1. The same bounds hold for Recall.

Then, if we want Recall to be, say, 0.99 (so that only 1% of Bad objects are misclassified as Good), then
for F1 = 0.60, Precision ~ 0.43;
for F1 = 0.63, Precision ~ 0.46.

This way, Type II errors (False Negatives, FN) constitute only 1% of the Bad class, while bank workers
need to check roughly 54% of incoming loan requests, which is about half of what they would review
without any model assistance.
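
These numbers are easy to verify in Python; a minimal sketch using only the relation derived above:

    def precision_given(f1, recall):
        # Solve F1 = 2*P*R / (P + R) for P, given F1 and R.
        assert 2 * recall - f1 > 0, "no feasible precision for this (F1, Recall) pair"
        return (recall * f1) / (2 * recall - f1)

    for f1 in (0.60, 0.63):
        print(f"F1={f1:.2f}, Recall=0.99 -> Precision ~ {precision_given(f1, 0.99):.2f}")
    # F1=0.60, Recall=0.99 -> Precision ~ 0.43
    # F1=0.63, Recall=0.99 -> Precision ~ 0.46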

--------

Q2. A customer wants to know which features matter for their dataset.
They have several models created in the DataRobot platform such as
random forest, linear regression, and neural network regressor.
They also are pretty good at coding in Python so wouldn't mind using another library or function.
How do you suggest this customer gets feature importance for their data?

A2.
The DataRobot platform can extract feature importance from the model itself if the model supports such
functionality. For example, feature importance is available for random forest and linear regression models,
since they support it internally (impurity-based importances and coefficient magnitudes, respectively).
This information can be shown to the customer.
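
Outside the platform, the same built-in importances can be read off equivalent scikit-learn models; a
minimal sketch on synthetic data (illustrative only, not the customer's dataset):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

    # Random forest exposes impurity-based importances directly.
    rf = RandomForestRegressor(random_state=0).fit(X, y)
    print("RF importances:", rf.feature_importances_)

    # For linear regression the coefficient magnitudes play a similar role
    # (features should be on comparable scales for this to be meaningful).
    lr = LinearRegression().fit(X, y)
    print("Linear coefficients:", lr.coef_)
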
If the model doesn't provide such information, the customer can use the following approaches in Python to
estimate feature importance:
1) Permute a particular feature across all test objects, score the DataRobot model's predictions on this
   modified test set, and compare the performance score with the original one to find the delta. Repeat for
   every feature and sort the features by their deltas (first sketch after this list).
2) Drop a feature, refit the models, and look at the difference in performance scores. Repeat for the other
   features (second sketch below).
3) Compute the correlation matrix and remove collinear features, then try to refit the models.
4) Use PCA or SVD to find a new, smaller basis for the initial dataset, fit new models on it, and compare the
   performance scores with those the customer got initially (approaches 3 and 4 are combined in the last
   sketch below).
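
Approach 1 is essentially permutation importance. A minimal sketch with scikit-learn on synthetic data; a
DataRobot model exposed through a prediction API could be wrapped and scored the same way, but a local model
stands in for it here:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    # Shuffle each feature in the test set, re-score, and record the drop in performance.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    for idx, delta in sorted(enumerate(result.importances_mean), key=lambda t: -t[1]):
        print(f"feature {idx}: mean score drop = {delta:.3f}")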
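
Approach 2, drop-one-feature retraining, can be sketched as a simple loop (any estimator could be
substituted for the random forest used here):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
    baseline = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5).mean()

    for j in range(X.shape[1]):
        X_reduced = np.delete(X, j, axis=1)  # drop feature j
        score = cross_val_score(RandomForestRegressor(random_state=0), X_reduced, y, cv=5).mean()
        print(f"without feature {j}: score drop = {baseline - score:.3f}")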
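
Approaches 3 and 4 in a minimal form with pandas and scikit-learn (the 0.9 threshold and the number of
components are illustrative choices, not recommendations):

    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

    # 3) Correlation matrix: flag highly collinear feature pairs before refitting.
    corr = df.corr().abs()
    pairs = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.9]
    print("collinear pairs:", pairs)

    # 4) PCA: project onto a smaller basis, then refit models on the components
    #    and compare scores with the original fit.
    pca = PCA(n_components=3)
    X_reduced = pca.fit_transform(df)
    print("explained variance ratio:", pca.explained_variance_ratio_)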
