% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/get_thresholds.R
\name{get_thresholds}
\alias{get_thresholds}
\title{Get class-separating thresholds for classification predictions}
\usage{
get_thresholds(x, optimize = NULL, measures = "all", cost_fp = 1, cost_fn = 1)
}
\arguments{
\item{x}{Either a predictions data frame (from \code{predict}) or a
model_list (e.g. from \code{machine_learn}).}
\item{optimize}{Optional. If provided, one of the entries in \code{measures}. A logical
column named "optimal" will be added with one TRUE entry corresponding to
the threshold that optimizes this measure.}
\item{measures}{Character vector of performance metrics to calculate, or "all",
which is equivalent to using all of the following measures. The
returned data frame will have one column for each metric. \itemize{
\item{cost: Captures how bad all the errors are. You can adjust the relative costs
of false alarms and missed detections by setting \code{cost_fp} or
\code{cost_fn}. At the default of equal costs, minimizing cost is
equivalent to maximizing accuracy.}
\item{acc: Accuracy}
\item{tpr: True positive rate, aka sensitivity, aka recall}
\item{tnr: True negative rate, aka specificity}
\item{fpr: False positive rate, aka fallout}
\item{fnr: False negative rate}
\item{ppv: Positive predictive value, aka precision}
\item{npv: Negative predictive value}
}}
\item{cost_fp}{Cost of a false positive. Default = 1. Only affects cost.}
\item{cost_fn}{Cost of a false negative. Default = 1. Only affects cost.}
}
\value{
A tibble with one row for each possible threshold and columns for the
threshold and each of the \code{measures}.
}
\description{
healthcareai gives you predicted probabilities for classification
problems, but sometimes you need to convert probabilities into predicted
classes. That requires choosing a threshold: probabilities above the
threshold are predicted as the positive class, and probabilities below it
are predicted as the negative class. This function helps you choose a
threshold by calculating a suite of model-performance metrics at every
possible threshold.

"cost" is an especially useful measure because it lets you weight how bad a
false alarm is relative to a missed detection. E.g. if for your use case
a missed detection is five times as bad as a false alarm (put another way,
you would accept up to five false alarms to avoid one missed detection),
set \code{cost_fn = 5} and use the threshold that minimizes cost (see
\code{examples}).

We recommend plotting the thresholds with their performance measures to
see how optimizing for one measure affects performance on the others.
See \code{\link{plot.thresholds_df}} for how to do this.
}
\examples{
library(dplyr)
models <- machine_learn(pima_diabetes[1:15, ], patient_id, outcome = diabetes,
                        models = "xgb", tune = FALSE)
get_thresholds(models)
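
# measures defaults to "all"; per the argument documentation you can also
# request just a subset of the available metrics, e.g. accuracy and the
# true positive and negative rates:
get_thresholds(models, measures = c("acc", "tpr", "tnr"))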
# Identify the threshold that maximizes accuracy:
get_thresholds(models, optimize = "acc")
# Assert that one missed detection is as bad as five false alarms and
# filter to the threshold that minimizes "cost" based on that assertion:
get_thresholds(models, optimize = "cost", cost_fn = 5) \%>\%
filter(optimal)
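
# To see what "cost" captures, here is a rough hand computation of the
# weighted error count at a single threshold. This is a sketch: it assumes
# cost behaves like cost_fp * FP + cost_fn * FN; the package's exact
# scaling may differ.
preds <- predict(models)
fp <- sum(preds$predicted_diabetes > 0.5 & preds$diabetes == "N")
fn <- sum(preds$predicted_diabetes <= 0.5 & preds$diabetes == "Y")
1 * fp + 5 * fn  # weighted errors at threshold 0.5, cost_fp = 1, cost_fn = 5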
# Apply that same 5:1 cost ratio to make class predictions
(class_preds <- predict(models, outcome_groups = 5))
attr(class_preds$predicted_group, "cutpoints")
# Plot performance on all measures across threshold values
get_thresholds(models) \%>\%
  plot()
# If a measure is provided to optimize, the best threshold will be highlighted in plots
get_thresholds(models, optimize = "acc") \%>\%
plot()
## Transform probability predictions into classes based on an optimal threshold ##
# Pull the threshold that minimizes cost
optimal_threshold <-
  get_thresholds(models, optimize = "cost") \%>\%
  filter(optimal) \%>\%
  pull(threshold)
# Add a Y/N column to predictions based on whether the predicted probability
# is greater than the threshold
class_predictions <-
  predict(models) \%>\%
  mutate(predicted_class_diabetes = case_when(
    predicted_diabetes > optimal_threshold ~ "Y",
    predicted_diabetes <= optimal_threshold ~ "N"
  ))
class_predictions \%>\%
  select_at(vars(ends_with("diabetes"))) \%>\%
  arrange(predicted_diabetes)
# Examine the expected volumes of true and false positives and negatives
table(Actual = class_predictions$diabetes,
      Predicted = class_predictions$predicted_class_diabetes)
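
# A quick sanity check on those class predictions: overall accuracy is the
# share of rows where the predicted class matches the actual outcome
mean(class_predictions$predicted_class_diabetes == class_predictions$diabetes)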
}