---
title: "Practical Machine Learning Course Project: Prediction Assignment"
author: "Yaniela Fernandez M."
date: "26/09/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Executive Summary
This report builds a model to predict the manner in which 6 participants performed an exercise, using accelerometer data as predictors. The outcome to be predicted is the **classe** variable in the training data set. The machine learning algorithm described here is applied to the 20 test cases available in the test data, and the predictions are submitted in the format required by the Course Project Prediction Quiz for automated grading.
## Dataset description
The dataset has 19622 observations of 160 variables. The data were collected from wearable devices that quantify self-movement, worn by 6 participants who were asked to perform one set of 10 repetitions of barbell lifts correctly and incorrectly in 5 different ways.
The outcome variable is **classe**, a factor variable with 5 levels: A, B, C, D, and E.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
## Initial configuration
The initial configuration consists of loading the required packages.
```{r , results='hide' , message=FALSE}
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(lattice)
library(ggplot2)
library(dplyr)
```
## Downloading and cleaning data
In this section the data is downloaded and processed. Some basic transformations and cleanup are performed: columns containing NA values are dropped, and constant or almost-constant predictors across samples (zero- and near-zero-variance predictors), which are not only uninformative but can break some models, are removed as well. Irrelevant bookkeeping columns (the row index, user_name, the timestamp fields, and the window counter) are also removed by dropping the first six remaining columns of the training set.

The **pml-training.csv** data is used to devise the training and testing sets. The **pml-testing.csv** data is used to predict and answer the 20 questions based on the trained model.
```{r }
# Download the data
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain, destfile = "train.csv")
download.file(urlTest, destfile = "test.csv")
train <- read.csv("train.csv", na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv("test.csv", na.strings = c("NA", "#DIV/0!", ""))

# Drop columns that contain NA values
train <- train[, colSums(is.na(train)) == 0]

# Remove near-zero-variance predictors
nzv <- nearZeroVar(train)
train <- train[, -nzv]

# Remove the irrelevant bookkeeping columns
train <- train[, -c(1:6)]
```
The models will be fit using only the following 53 variables:
```{r echo=FALSE }
names(train)
```
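As a quick sanity check (the exact row count depends on the downloaded file), the dimensions of the cleaned data can be confirmed directly:

```{r }
# 52 predictors plus the classe outcome = 53 columns
dim(train)
```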
### Data partitioning
In this section the cleaned **train** data set is split into training (75%) and testing (25%) subsets.
```{r }
set.seed(1234) # arbitrary seed so the partition is reproducible
inTrain <- createDataPartition(y = train$classe, p = 0.75, list = FALSE)
training <- train[inTrain, ]
testing <- train[-inTrain, ]
```
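The row counts of the two partitions can be checked quickly (an optional sanity check):

```{r }
# roughly 75% / 25% of the cleaned rows
dim(training)
dim(testing)
```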
### Exploratory analysis
The variable **classe** contains 5 levels. The plot of the outcome variable shows the frequency of each level in the training data. Level A is the most frequent **classe**, while level D appears to be the least frequent one.
```{r echo=FALSE, fig.align='center', fig.height = 3, fig.width = 6, fig.cap="**Frequency of classe levels**"}
table(training$classe) %>% barplot(col = "wheat")
```
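For a numeric view of the same class balance (a small supplementary check), the relative frequencies can be printed directly:

```{r }
# proportion of each classe level in the training partition
round(prop.table(table(training$classe)), 3)
```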
## Model building
Three different models are built, each with 3-fold cross-validation. The three model types are:

* Decision trees with CART (rpart)
* Stochastic gradient boosting trees (gbm)
* Random forest decision trees (rf)

The code to fit these models is:
```{r results='hide' }
# 3-fold cross-validation
fitControl <- trainControl(method = 'cv', number = 3)

# CART (rpart) model
cart <- train(classe ~ ., data = training, trControl = fitControl, method = 'rpart')
predCART <- predict(cart, newdata = testing)
cmCART <- confusionMatrix(table(predCART, testing$classe))

# GBM model
gbm <- train(classe ~ ., data = training, trControl = fitControl, method = 'gbm')
predGBM <- predict(gbm, newdata = testing)
cmGBM <- confusionMatrix(table(predGBM, testing$classe))

# Random forest model
rf <- train(classe ~ ., data = training, trControl = fitControl, method = 'rf', ntree = 100)
predRF <- predict(rf, newdata = testing)
cmRF <- confusionMatrix(table(predRF, testing$classe))
```
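caret can also compare the three fits on their cross-validation resamples. This is an optional check of in-sample CV accuracy before turning to the held-out testing set:

```{r }
# collect the CV folds from each fit and summarize their accuracy
cvResults <- resamples(list(CART = cart, GBM = gbm, RF = rf))
summary(cvResults)$statistics$Accuracy
```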
### Out-of-sample results
The expected out-of-sample error is estimated on the held-out testing set as one minus the accuracy, that is, the fraction of misclassified samples. The accuracy of each model is taken from its confusion matrix.
```{r echo=FALSE }
AccuracyResults <- data.frame(
Model = c('CART', 'GBM', 'RF'),
Accuracy = rbind(cmCART$overall[1], cmGBM$overall[1], cmRF$overall[1])
)
print(AccuracyResults)
```
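Because the expected out-of-sample error is one minus the held-out accuracy, it can be derived directly from the table above (a short sketch using the objects already computed):

```{r }
# out-of-sample error = 1 - accuracy on the testing partition
data.frame(Model = AccuracyResults$Model,
           OutOfSampleError = 1 - as.numeric(AccuracyResults$Accuracy))
```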
Based on an assessment of the model fits and out-of-sample results, it looks like both gradient boosting and random forests outperform the CART model, with random forests being slightly more accurate.
## Prediction with validation test
The best model (random forest) will be applied to predict the 20 quiz results (test dataset) as shown below.
```{r }
prediction <- predict(rf, newdata=test)
prediction
```
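If the submission requires one answer file per test case (a hypothetical helper; adjust to the format the quiz actually expects), something like the following would write the files:

```{r eval=FALSE}
# write one text file per predicted test case (hypothetical helper)
for (i in seq_along(prediction)) {
  writeLines(as.character(prediction[i]),
             con = paste0("problem_id_", i, ".txt"))
}
```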