Module 4 Technique Practice

Justin Ehringhaus Last edited August 06, 2022 at 12:53

Importing Packages

library(pacman)      # pacman: convenient package installation and loading
p_load(tidyverse)    # data wrangling and plotting
p_load(e1071)        # svm() and tune.svm()
p_load(ROSE)         # roc.curve() for ROC/AUC

Importing and Exploring the Data

# Reading the data set as a dataframe
heart_df <- read_csv("/Users/justin/Desktop/ALY 6040/Homework/M4/svm-practice/heart_tidy.csv")

# Glimpse of the data
glimpse(heart_df)
## Rows: 300
## Columns: 14
## $ Age            <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44,…
## $ V1             <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0…
## $ V2             <dbl> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3…
## $ V3             <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, …
## $ V4             <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, …
## $ V5             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0…
## $ V6             <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0…
## $ V7             <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, …
## $ V8             <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ V9             <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, …
## $ V10            <dbl> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1…
## $ V11            <dbl> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ V12            <dbl> 6, 3, 7, 3, 3, 3, 3, 3, 7, 7, 6, 3, 6, 7, 7, 3, 7, 3, 3…
## $ PredictDisease <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0…
# Converting PredictDisease to a factor so svm() treats it as a classification target
heart_df[["PredictDisease"]] <- factor(heart_df[["PredictDisease"]])

# Checking for NAs
anyNA(heart_df)
## [1] FALSE
# Printing the summary
summary(heart_df)
##       Age              V1             V2              V3              V4       
##  Min.   :29.00   Min.   :0.00   Min.   :1.000   Min.   : 94.0   Min.   :126.0  
##  1st Qu.:48.00   1st Qu.:0.00   1st Qu.:3.000   1st Qu.:120.0   1st Qu.:211.0  
##  Median :56.00   Median :1.00   Median :3.000   Median :130.0   Median :241.5  
##  Mean   :54.48   Mean   :0.68   Mean   :3.153   Mean   :131.6   Mean   :246.9  
##  3rd Qu.:61.00   3rd Qu.:1.00   3rd Qu.:4.000   3rd Qu.:140.0   3rd Qu.:275.2  
##  Max.   :77.00   Max.   :1.00   Max.   :4.000   Max.   :200.0   Max.   :564.0  
##        V5               V6               V7              V8        
##  Min.   :0.0000   Min.   :0.0000   Min.   : 71.0   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.8   1st Qu.:0.0000  
##  Median :0.0000   Median :0.5000   Median :153.0   Median :0.0000  
##  Mean   :0.1467   Mean   :0.9867   Mean   :149.7   Mean   :0.3267  
##  3rd Qu.:0.0000   3rd Qu.:2.0000   3rd Qu.:166.0   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :2.0000   Max.   :202.0   Max.   :1.0000  
##        V9            V10             V11            V12        PredictDisease
##  Min.   :0.00   Min.   :1.000   Min.   :0.00   Min.   :3.000   0:162         
##  1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.00   1st Qu.:3.000   1:138         
##  Median :0.80   Median :2.000   Median :0.00   Median :3.000                 
##  Mean   :1.05   Mean   :1.603   Mean   :0.67   Mean   :4.727                 
##  3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.00   3rd Qu.:7.000                 
##  Max.   :6.20   Max.   :3.000   Max.   :3.00   Max.   :7.000

The heart dataset consists entirely of numeric values, and there are no missing entries. Some features are categorical (e.g., 0, 1, 2) and others are continuous (e.g., 0.3, 1.2, 4.3). The values in the PredictDisease column are either 0 or 1, indicating a binary classification problem in which individuals are predicted either to have or not to have heart disease. Since there are no missing values and the minimum and maximum values of each feature appear reasonable, we will assume the dataset contains no entry errors or outliers. No further cleaning of the data is necessary.
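
The distinction between categorical and continuous features can also be checked programmatically. A minimal sketch (assuming heart_df has been loaded as above): columns with only a handful of distinct values are most likely categorical codes, while the rest are continuous measurements.

# Counting distinct values per column; small counts suggest categorical codes
sort(sapply(heart_df, n_distinct))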

Preparing the Data

# Splitting the data into train/test sets
index <- 1:nrow(heart_df)
set.seed(1)
testindex <- sample(index, trunc(length(index) / 3))
testset <- heart_df[testindex,]
trainset <- heart_df[-testindex,]

# Removing no longer needed variables from the environment
rm(index, testindex)

# Checking the dimensions of train/test sets
dim(trainset)
## [1] 200  14
dim(testset)
## [1] 100  14

By generating an index for each row, we can randomly sample one third of the dataset to form a test set and use the remaining two thirds as the training set. Here the training set holds 200 rows and the test set holds 100 rows; each has 14 columns (13 predictors plus the PredictDisease target).
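
Because the split is random rather than stratified, it is worth confirming that both classes are reasonably represented in each partition. A quick check using only the objects created above:

# Class proportions in the full data, training set, and test set
prop.table(table(heart_df$PredictDisease))
prop.table(table(trainset$PredictDisease))
prop.table(table(testset$PredictDisease))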

SVM Model #1: Radial Kernel, cost = 100, gamma = 1

set.seed(1)
svm.model <- svm(PredictDisease ~ ., 
                 data = trainset, 
                 cost = 100, 
                 gamma = 1)

# Note: 199 support vectors, which is almost all of them!
# Note: kernel not specified, defaulted to radial kernel
svm.model
## 
## Call:
## svm(formula = PredictDisease ~ ., data = trainset, cost = 100, gamma = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
## 
## Number of Support Vectors:  199
pred_train <- predict(svm.model, trainset)
mean(pred_train == trainset$PredictDisease)
## [1] 1
table(original = trainset$PredictDisease, predicted = pred_train)
##         predicted
## original   0   1
##        0 103   0
##        1   0  97
pred_test <- predict(svm.model, testset)
mean(pred_test == testset$PredictDisease)
## [1] 0.64
table(original = testset$PredictDisease, predicted = pred_test)
##         predicted
## original  0  1
##        0 28 31
##        1  5 36
# ROC curve and area under the curve
roc.curve(testset$PredictDisease, pred_test,
          main = "ROC Curve #1: Radial Kernel, cost = 100, gamma = 1")

## Area under the curve (AUC): 0.676

svm.model uses a radial kernel and C-classification by default for training and predicting. The choice of kernel should take into account the shape of the decision boundary; for example, if the data is linearly separable, a linear kernel is likely the best fit. Radial kernels (the default used here) are highly flexible and commonly used in practice (Awati, n.d.).

The cost parameter controls how heavily training points are penalized when they fall inside the margin or on the wrong side of the decision boundary (such points become support vectors). A high cost (as cost approaches infinity) means a heavy penalty and a hard margin, where the support vectors lie exactly on the margin; the risk is overfitting, because the boundary fits the training data too precisely and may perform poorly on new data. A low cost (as cost approaches zero) means a light penalty and a soft margin, where points may fall between the margin and the decision boundary; the risk is underfitting, because the model may not capture enough of the structure in the training data to perform well on new data.
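
To see this trade-off directly, the same radial model can be refit over a range of cost values while watching the number of support vectors and the train/test accuracies. A rough sketch using the objects above (the exact numbers will depend on the split):

# Sweeping cost for the radial kernel, gamma fixed at 1 as in svm.model
for (c in c(0.01, 0.1, 1, 10, 100)) {
  m <- svm(PredictDisease ~ ., data = trainset, kernel = "radial", cost = c, gamma = 1)
  cat("cost =", c,
      "| support vectors:", m$tot.nSV,
      "| train acc:", mean(predict(m, trainset) == trainset$PredictDisease),
      "| test acc:", mean(predict(m, testset) == testset$PredictDisease), "\n")
}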

The gamma parameter affects the shape of the decision boundary by controlling how far each support vector's influence reaches. A high gamma (as gamma approaches infinity) means each support vector influences only the points close to it, so the boundary wraps tightly around individual training points; a low gamma (as gamma approaches zero) means the influence extends much farther, producing a smoother boundary. In other words, a high gamma may result in a tighter decision boundary (risking overfitting), whereas a low gamma may result in a looser decision boundary (risking underfitting).
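
For reference, when gamma is not supplied, e1071's svm() defaults to 1 divided by the number of predictors, so for this dataset the default is roughly 0.077 rather than the 1 used in svm.model:

# Default gamma used by svm() when none is specified: 1 / number of predictors
1 / (ncol(trainset) - 1)   # 13 predictors, so approximately 0.077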

Predicting on the training data with svm.model yields an accuracy of 100%. Overfitting is occurring: the number of support vectors (199) almost equals the number of rows in the training set (200), so nearly every training point lies on or inside the margin. The accuracy on the test data is only 64%, so the model performs much worse on unseen data. The ROC curve graphic reveals the extent to which the model has succeeded at the classification task. When the area under the curve (AUC) is high (approaching 1), the model separates and classifies the two classes almost perfectly: the ideal situation. When AUC is 0.5, a straight diagonal line, the model cannot distinguish between the classes at all: the worst situation. When AUC is 0, the model makes perfectly incorrect guesses, which is funnily not too bad so long as you flip each classification!
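
One caveat: roc.curve() here receives the hard 0/1 class predictions, so the "curve" effectively reflects a single operating point. If a fuller ROC curve is wanted, svm() can be refit with probability = TRUE and the predicted probabilities passed to roc.curve() instead; a sketch of that variant, assuming the same trainset/testset objects:

# Refitting with probability estimates enabled, then scoring the test set
svm.model.prob <- svm(PredictDisease ~ ., data = trainset,
                      cost = 100, gamma = 1, probability = TRUE)
prob_test <- attr(predict(svm.model.prob, testset, probability = TRUE), "probabilities")
roc.curve(testset$PredictDisease, prob_test[, "1"],
          main = "ROC Curve #1 (probability scores)")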

SVM Model #2: Linear Kernel, cost = 1, gamma = NA

set.seed(1)
svm.model.2 <- svm(formula = PredictDisease ~.,
                   data = trainset,
                   kernel = 'linear',
                   type = 'C-classification')
svm.model.2
## 
## Call:
## svm(formula = PredictDisease ~ ., data = trainset, kernel = "linear", 
##     type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  78
pred_train.2 <- predict(svm.model.2, trainset)
mean(pred_train.2 == trainset$PredictDisease)
## [1] 0.855
table(original = trainset$PredictDisease, predicted = pred_train.2)
##         predicted
## original  0  1
##        0 92 11
##        1 18 79
pred_test.2 <- predict(svm.model.2, testset)
mean(pred_test.2 == testset$PredictDisease)
## [1] 0.84
table(original = testset$PredictDisease, predicted = pred_test.2)
##         predicted
## original  0  1
##        0 52  7
##        1  9 32
# ROC curve and area under the curve
roc.curve(testset$PredictDisease, pred_test.2,
          main = "ROC Curve #2: Linear Kernel, cost = 1, gamma = NA")

## Area under the curve (AUC): 0.831

svm.model.2 uses a linear kernel with cost left at its default of 1. For a linear kernel there is no need to specify gamma, since the kernel does not use it. Unlike svm.model, which had 199 support vectors (almost the entire training set), svm.model.2 has just 78, meaning only 39% of the training data influences the position of the decision boundary.

Predicting on the test data gives an accuracy of 84%, well above svm.model's 64%. However, cost has also changed, so it is not yet clear which kernel suits this particular dataset better: radial or linear. The AUC in the ROC graph is also higher than for svm.model, suggesting this model does a better job of classification.
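
A quick way to separate the effect of the kernel from the effect of cost is to refit the radial kernel at the same cost of 1 (with its default gamma) and compare test accuracy directly; a small sketch, with results that will naturally vary by split:

# Radial kernel at cost = 1 for a like-for-like comparison with svm.model.2
set.seed(1)
svm.model.radial.c1 <- svm(PredictDisease ~ ., data = trainset,
                           kernel = "radial", type = "C-classification", cost = 1)
mean(predict(svm.model.radial.c1, testset) == testset$PredictDisease)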

SVM Model #3: Radial Kernel, cost = 0.2, default gamma

set.seed(1)
svm.model.3 <- svm(formula = PredictDisease~.,
                   data = trainset,
                   kernel = 'radial',
                   type = 'C-classification',
                   cost = 0.2)
svm.model.3
## 
## Call:
## svm(formula = PredictDisease ~ ., data = trainset, kernel = "radial", 
##     type = "C-classification", cost = 0.2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.2 
## 
## Number of Support Vectors:  147
pred_train.3 <- predict(svm.model.3, trainset)
mean(pred_train.3 == trainset$PredictDisease)
## [1] 0.885
table(original = trainset$PredictDisease, predicted = pred_train.3)
##         predicted
## original  0  1
##        0 93 10
##        1 13 84
pred_test.3 <- predict(svm.model.3, testset)
mean(pred_test.3 == testset$PredictDisease)
## [1] 0.81
table(original = testset$PredictDisease, predicted = pred_test.3)
##         predicted
## original  0  1
##        0 49 10
##        1  9 32
# ROC curve and area under the curve
roc.curve(testset$PredictDisease, pred_test.3,
          main = "ROC Curve #3: Radial Kernel, cost = 0.2, gamma = 1")

## Area under the curve (AUC): 0.805

svm.model.3 uses a radial kernel just as svm.model does, but the cost has been reduced from 100 to 0.2 and gamma is left at its default (1 divided by the number of predictors) rather than set to 1. As discussed previously, a lower cost results in a softer margin, where training points are allowed to fall between the decision boundary and the margin. These changes produced a far better model on unseen data, with a training accuracy of 88.5% and a test accuracy of 81%.

The test accuracies of svm.model.3 and svm.model.2 are similar, but the models differ in two major ways: 1) the kernel (radial vs. linear), and 2) the number of support vectors (147 vs. 78). At this point it is difficult to say which model is better, but fewer support vectors is desirable from the standpoint of computational cost, since prediction time for a kernel SVM scales with the number of support vectors.
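
One way to keep these trade-offs in view is to tabulate kernel, support-vector count, and test accuracy side by side. A small sketch built only from objects already in the environment:

# Side-by-side comparison of the three models fit so far
tibble(model        = c("svm.model", "svm.model.2", "svm.model.3"),
       kernel       = c("radial", "linear", "radial"),
       support_vecs = c(svm.model$tot.nSV, svm.model.2$tot.nSV, svm.model.3$tot.nSV),
       test_acc     = c(mean(pred_test   == testset$PredictDisease),
                        mean(pred_test.2 == testset$PredictDisease),
                        mean(pred_test.3 == testset$PredictDisease)))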

SVM Model #4: Sigmoid Kernel, cost = 0.2, default gamma

set.seed(1)
svm.model.4 <- svm(formula = PredictDisease~.,
                   data = trainset,
                   kernel = 'sigmoid',
                   type = 'C-classification',
                   cost = 0.2)
svm.model.4
## 
## Call:
## svm(formula = PredictDisease ~ ., data = trainset, kernel = "sigmoid", 
##     type = "C-classification", cost = 0.2)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  0.2 
##      coef.0:  0 
## 
## Number of Support Vectors:  112
pred_train.4 <- predict(svm.model.4, trainset)
mean(pred_train.4 == trainset$PredictDisease)
## [1] 0.845
table(original = trainset$PredictDisease, predicted = pred_train.4)
##         predicted
## original  0  1
##        0 92 11
##        1 20 77
pred_test.4 <- predict(svm.model.4, testset)
mean(pred_test.4 == testset$PredictDisease)
## [1] 0.84
table(original = testset$PredictDisease, predicted = pred_test.4)
##         predicted
## original  0  1
##        0 52  7
##        1  9 32
# ROC curve and area under the curve
roc.curve(testset$PredictDisease, pred_test.4,
          main = "ROC Curve #4: Sigmoid Kernel, cost = 0.2, gamma = 1")

## Area under the curve (AUC): 0.831

svm.model.4 uses a sigmoid kernel with the default gamma and a low cost of 0.2. It ties svm.model.2 for the best test accuracy so far, with a training accuracy of 84.5% and a test accuracy of 84%.

Given that the sigmoid and linear kernels produced the highest test accuracies in the comparison so far, I will choose one (sigmoid, because it uses gamma) and tune it over a range of cost and gamma values to find an optimal pair.

SVM Tuning

set.seed(1)
tune.svm.model.4 <- tune.svm(x = trainset[, -14], y = trainset$PredictDisease, 
                             gamma = c(10^(-2:2), 2*10^(-2:2), 3*10^(-2:2)),
                             cost = c(10^(-2:2), 2*10^(-2:2), 3*10^(-2:2)),
                             type = "C-classification", kernel = "sigmoid")
tune.svm.model.4$best.parameters$cost
## [1] 0.02
tune.svm.model.4$best.parameters$gamma
## [1] 3

Tuning an SVM model is computationally expensive: every combination of gamma and cost yields a separate model, and by default tune.svm evaluates each one with 10-fold cross-validation on the training set before selecting the best-performing pair. Here I supplied 15 distinct values, ranging from 0.01 up to 300, for both cost and gamma, giving 225 combinations. Tuning found that a cost of 0.02 and a gamma of 3 are optimal for the training set.
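
The tuning object also records the cross-validated error of every combination, which is useful for judging how sensitive performance is to each parameter rather than trusting a single best pair. A short sketch using components returned by tune.svm():

# Cross-validated error of the best pair, and the best few of the 225 combinations
tune.svm.model.4$best.performance
head(tune.svm.model.4$performances[order(tune.svm.model.4$performances$error), ])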

Can tuning improve the model’s accuracy?

set.seed(1)
tuned.svm.model <- svm(formula = PredictDisease~.,
                       data = trainset,
                       kernel = "sigmoid",
                       type = "C-classification",
                       cost = tune.svm.model.4$best.parameters$cost,
                       gamma = tune.svm.model.4$best.parameters$gamma)
tuned.svm.model
## 
## Call:
## svm(formula = PredictDisease ~ ., data = trainset, kernel = "sigmoid", 
##     type = "C-classification", cost = tune.svm.model.4$best.parameters$cost, 
##     gamma = tune.svm.model.4$best.parameters$gamma)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  sigmoid 
##        cost:  0.02 
##      coef.0:  0 
## 
## Number of Support Vectors:  144
pred_train.tuned <- predict(tuned.svm.model, trainset)
mean(pred_train.tuned == trainset$PredictDisease)
## [1] 0.82
table(original = trainset$PredictDisease, predicted = pred_train.tuned)
##         predicted
## original  0  1
##        0 90 13
##        1 23 74
pred_test.tuned <- predict(tuned.svm.model, testset)
mean(pred_test.tuned == testset$PredictDisease)
## [1] 0.81
table(original = testset$PredictDisease, predicted = pred_test.tuned)
##         predicted
## original  0  1
##        0 49 10
##        1  9 32
# ROC curve and area under the curve
roc.curve(testset$PredictDisease, pred_test.tuned,
          main = "ROC Curve, Tuned Model: Sigmoid Kernel, cost = 0.1, gamma = 1")

## Area under the curve (AUC): 0.805

Training accuracy is now 82% and test accuracy is now 81%. These results are surprising at first, since svm.model.4 had a training accuracy of 84.5% and a test accuracy of 84%, better on both fronts. The discrepancy is probably less about randomness in fitting the SVM (which is deterministic given the data) and more about what tuning optimizes: the best parameters minimize cross-validated error on the training set, which does not guarantee better accuracy on this one particular test split, and the default gamma used by svm.model.4 (roughly 0.077) was not among the candidate values in the grid. To account for the variability of a single split, I would propose repeating the train/test split and model fit in a loop to estimate the mean and standard deviation of accuracy. If continuing to tinker with this model, I would also propose tuning across different kernels next, to explore whether accuracy gains on the training and test sets can be achieved simultaneously.
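
A minimal sketch of the proposed loop, repeating the random split and refitting both the untuned and tuned sigmoid models to estimate the mean and standard deviation of test accuracy (the number of repetitions here is arbitrary):

# Repeating the train/test split to gauge how much accuracy varies across splits
set.seed(1)
n_reps <- 25
acc_default <- numeric(n_reps)
acc_tuned   <- numeric(n_reps)
for (i in seq_len(n_reps)) {
  idx   <- sample(seq_len(nrow(heart_df)), trunc(nrow(heart_df) / 3))
  test  <- heart_df[idx, ]
  train <- heart_df[-idx, ]
  m_default <- svm(PredictDisease ~ ., data = train, kernel = "sigmoid",
                   type = "C-classification", cost = 0.2)
  m_tuned   <- svm(PredictDisease ~ ., data = train, kernel = "sigmoid",
                   type = "C-classification", cost = 0.02, gamma = 3)
  acc_default[i] <- mean(predict(m_default, test) == test$PredictDisease)
  acc_tuned[i]   <- mean(predict(m_tuned, test) == test$PredictDisease)
}
c(mean_default = mean(acc_default), sd_default = sd(acc_default),
  mean_tuned   = mean(acc_tuned),   sd_tuned   = sd(acc_tuned))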

Works Cited

Awati, Kailash. n.d. “Support Vector Machines in R.” DataCamp. https://app.datacamp.com/learn/courses/support-vector-machines-in-r.

Kumar, Ajitesh. n.d. “SVM RBF Kernel Parameters with Code Examples.” https://dzone.com/articles/using-jsonb-in-postgresql-how-to-effectively-store-1.
