
Short-Answer Portion
--------------------
Your models have been implemented and customers are now using them in production.

Q1. Imagine the credit risk use case above. You have two models: a logistic regression model
with an F1-score of 0.60 and a neural network with an F1-score of 0.63. Which model would
you recommend for the bank and why?

A1.
In general, I would recommend logistic regression for the bank, because its results are easier to
interpret: the learned weight for each feature, combined with the feature values of a given input
object, shows how that feature pushes the prediction towards Good or Bad.
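
As an illustration of that interpretability, a minimal sketch with scikit-learn (the feature names and
toy data below are made up for illustration, not taken from the use case):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Toy credit data: [income, debt_ratio, num_late_payments]; label 1 = Bad loan.
    X = np.array([[50000, 0.20, 0],
                  [22000, 0.55, 3],
                  [70000, 0.10, 0],
                  [18000, 0.65, 5],
                  [40000, 0.35, 1],
                  [15000, 0.70, 4]], dtype=float)
    y = np.array([0, 1, 0, 1, 0, 1])

    # Standardize so weight magnitudes are comparable across features.
    X_std = StandardScaler().fit_transform(X)
    model = LogisticRegression().fit(X_std, y)

    for name, w in zip(["income", "debt_ratio", "num_late_payments"], model.coef_[0]):
        # exp(w) = multiplicative change in the odds of "Bad" per one std-dev increase.
        print(f"{name:20s} weight={w:+.2f}  odds ratio={np.exp(w):.2f}")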

In the case above the logistic regression F1-score does not differ much from the NN F1-score, and for a
bank this kind of metric is unlikely to be the primary success criterion. In a Bad/Good loan setting,
banks are typically more interested in a high Recall for the Bad class and can tolerate a lower Precision:
the objects classified as Bad can then be double-checked manually, which is acceptable for them.

Starting from F1 = (2 * Precision * Recall) / (Precision + Recall) and solving for each variable:

Recall    = (Precision * F1) / (2 * Precision - F1)
Precision = (Recall * F1) / (2 * Recall - F1)

where (2 * Precision - F1) > 0 and 0 <= Precision <= 1,
      (2 * Recall - F1) > 0 and 0 <= Recall <= 1.

So, for F1 = 0.60, 0.30  < Precision <= 1;
    for F1 = 0.63, 0.315 < Precision <= 1. The same bounds hold for Recall.

Then, if we want Recall to be, say, 0.99 (so that only 1% of Bad objects are misclassified as Good), then
for F1 = 0.60, Precision ~ 0.43;
for F1 = 0.63, Precision ~ 0.46.

This way, Type II errors (False Negatives, FN) constitute only 1% of the Bad class, while bank workers
need to check roughly 54% of incoming loan requests, which is about half of what they would review
without any model assistance.
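
These numbers are easy to verify in Python; a minimal sketch using only the relation derived above:

    def precision_given(f1, recall):
        # Solve F1 = 2*P*R / (P + R) for P, given F1 and R.
        assert 2 * recall - f1 > 0, "no feasible precision for this (F1, Recall) pair"
        return (recall * f1) / (2 * recall - f1)

    for f1 in (0.60, 0.63):
        print(f"F1={f1:.2f}, Recall=0.99 -> Precision ~ {precision_given(f1, 0.99):.2f}")
    # F1=0.60, Recall=0.99 -> Precision ~ 0.43
    # F1=0.63, Recall=0.99 -> Precision ~ 0.46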

--------

Q2. A customer wants to know which features matter for their dataset.
They have several models created in the DataRobot platform such as
random forest, linear regression, and neural network regressor.
They also are pretty good at coding in Python so wouldn't mind using another library or function.
How do you suggest this customer gets feature importance for their data?

A2.
The DataRobot platform can extract feature importance from the model itself if the model supports such
functionality. For example, feature importance is available for random forest and linear regression models,
since they support it internally (impurity-based importances and coefficient magnitudes, respectively).
This information can be shown to the customer.
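
Outside the platform, the same built-in importances can be read off equivalent scikit-learn models; a
minimal sketch on synthetic data (illustrative only, not the customer's dataset):

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)

    # Random forest exposes impurity-based importances directly.
    rf = RandomForestRegressor(random_state=0).fit(X, y)
    print("RF importances:", rf.feature_importances_)

    # For linear regression the coefficient magnitudes play a similar role
    # (features should be on comparable scales for this to be meaningful).
    lr = LinearRegression().fit(X, y)
    print("Linear coefficients:", lr.coef_)
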
If the model doesn't provide such information, the customer can use the following approaches in Python to
estimate feature importance:
1) Permute a particular feature across all test objects, score the DataRobot model's predictions on this
   modified test set, and compare the performance score with the original one to find the delta. Repeat for
   every feature and sort the features by their deltas (first sketch after this list).
2) Drop a feature, refit the models, and look at the difference in performance scores. Repeat for the other
   features (second sketch below).
3) Compute the correlation matrix and remove collinear features, then try to refit the models.
4) Use PCA or SVD to find a new, smaller basis for the initial dataset, fit new models on it, and compare the
   performance scores with those the customer got initially (approaches 3 and 4 are combined in the last
   sketch below).
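
Approach 1 is essentially permutation importance. A minimal sketch with scikit-learn on synthetic data; a
DataRobot model exposed through a prediction API could be wrapped and scored the same way, but a local model
stands in for it here:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    # Shuffle each feature in the test set, re-score, and record the drop in performance.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    for idx, delta in sorted(enumerate(result.importances_mean), key=lambda t: -t[1]):
        print(f"feature {idx}: mean score drop = {delta:.3f}")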
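
Approach 2, drop-one-feature retraining, can be sketched as a simple loop (any estimator could be
substituted for the random forest used here):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
    baseline = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5).mean()

    for j in range(X.shape[1]):
        X_reduced = np.delete(X, j, axis=1)  # drop feature j
        score = cross_val_score(RandomForestRegressor(random_state=0), X_reduced, y, cv=5).mean()
        print(f"without feature {j}: score drop = {baseline - score:.3f}")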
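
Approaches 3 and 4 in a minimal form with pandas and scikit-learn (the 0.9 threshold and the number of
components are illustrative choices, not recommendations):

    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA

    X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=0)
    df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

    # 3) Correlation matrix: flag highly collinear feature pairs before refitting.
    corr = df.corr().abs()
    pairs = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.9]
    print("collinear pairs:", pairs)

    # 4) PCA: project onto a smaller basis, then refit models on the components
    #    and compare scores with the original fit.
    pca = PCA(n_components=3)
    X_reduced = pca.fit_transform(df)
    print("explained variance ratio:", pca.explained_variance_ratio_)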
