---
title: "Machine_Learning"
author: "emilliman4"
date: "02/22/2015"
output: html_document
---
```{r, echo=FALSE, message=FALSE}
library(ggplot2)
library(gplots)
library(caret)
library(rattle)
library(gridExtra)
library(randomForest)
library(foreach)
```
```{r, echo=FALSE, cache=TRUE, message=FALSE}
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",destfile = "pml-training.csv",method="curl")
pml.training<-read.csv("pml-training.csv", stringsAsFactors=FALSE)
pml.training$user_name<-as.factor(pml.training$user_name)
pml.training$classe<-as.factor(pml.training$classe)
pml.training$new_window<-as.factor(pml.training$new_window)
pml.training<-pml.training[,c(-1,-3,-4,-5,-6,-7,-14,-17,-26,-89,-92,-101,-127,-130,-139)]
pml.training<-subset(pml.training,select = apply(pml.training,2, function(x) sum(is.na(x))) < 100)
pml.training<-subset(pml.training,select=apply(pml.training,2, function(x) !is.element("#DIV/0!",x))==TRUE)
##Project specific functions
circle <- function(center = c(0, 0), npoints = 100) {
r = 1
tt = seq(0, 2 * pi, length = npoints)
xx = center[1] + r * cos(tt)
yy = center[2] + r * sin(tt)  # y coordinate uses the second element of center
return(data.frame(x = xx, y = yy))
}
pml_write_files = function(x){
n = length(x)
for(i in 1:n){
filename = paste0("problem_id_",i,".txt")
write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
}
}
```
### Getting and Cleaning Data

A look at the basic structure of the data shows a number of columns with little or no data; these are the summary statistics computed for each window. A quick inspection of the test data shows that each entry is a single time point, so all summary columns were removed. The index, time, date, and window columns were also removed because they should have no bearing on the predictions.
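The column filtering described above can be sketched on a toy table. This is only an illustration in base R: the column names, the values, and the `max_na` threshold are made up, and the report itself applies the two filters one after the other with `subset()`.

```r
# Toy stand-in for the raw sensor table (column names are made up)
raw <- data.frame(
  roll_belt     = c(1.4, 1.5, 1.6, 1.7),
  avg_roll_belt = c(NA, NA, NA, 1.5),         # window summary: mostly NA
  kurtosis_belt = c("", "", "", "#DIV/0!"),   # window summary: spreadsheet error
  classe        = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

max_na <- 2  # hypothetical threshold; the report uses 100 on the full data

# Keep columns with fewer than max_na missing values
few_na <- colSums(is.na(raw)) < max_na

# Keep columns that never contain the "#DIV/0!" marker
no_div0 <- !vapply(raw, function(x) "#DIV/0!" %in% x, logical(1))

clean <- raw[, few_na & no_div0, drop = FALSE]
names(clean)  # "roll_belt" "classe"
```

Only the raw measurement and the outcome survive, which mirrors the intent of dropping the per-window summary columns.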
Before further analysis, 40% of the data was set aside for validation/testing.
```{r, echo=FALSE}
#head(pml.training)
#summary(pml.training)
inTrain<-createDataPartition(pml.training$classe,p=0.6,list=FALSE)
training<-pml.training[inTrain,]
testing<-pml.training[-inTrain,]
```
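`createDataPartition` samples within each level of the outcome, so the class proportions are preserved in both parts. A rough base-R sketch of that idea follows; the toy class counts and the seed are assumptions, and caret's exact rounding may differ.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Toy outcome with unbalanced classes
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(classe), classe)
picked <- lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
})
in_train <- sort(unlist(picked, use.names = FALSE))

table(classe[in_train])   # A: 30, B: 18, C: 12
table(classe[-in_train])  # A: 20, B: 12, C: 8
```

Both parts keep the 5:3:2 class ratio of the toy data, which a plain random split would only match approximately.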
### Exploratory Data Analysis

Because of the large number of variables, I made a heatmap of the correlation coefficients between all variables. I also used caret's `nearZeroVar` function to look for uninformative variables. Across all participants there were no zero-variance variables. However, there were some on a per-user basis (e.g. jeremy: roll, pitch, and yaw of the arm; adelmo: roll, pitch, and yaw of the forearm). Models with and without these variables should be tested and compared.

A histogram of each variable was made to assess the normality and skew of each measurement (graphs not shown). A number of variables showed non-normal distributions. These skews are due to shifts in individual participants' sensor measurements. This is confirmed by PCA of the dataset: the plot of PC1 against PC2 shows clear separation of the users but not of the classe. Furthermore, the variables with high correlations are the most skewed. Normalizing this data would be a major undertaking, so I first tried to classify the data with decision trees and random forests to determine whether the data would need to be transformed at all.
```{r}
heatmap.2(cor(training[,c(-1,-54)]), col=redblue(75),trace="none", main="Figure 2: Correlation of sensor data", margins=c(7,7), cexRow=0.75, cexCol=0.8)
```
```{r, eval=FALSE}
foreach(i=2:(length(colnames(training))-1)) %do% hist(as.numeric(training[,i]),
main=c("Index:",i,colnames(training)[i]))
nsv<-by(training[,c(-1,-54)],training$user_name,function(x) nearZeroVar(x, saveMetrics=T))
zeroVar<-nearZeroVar(training[,c(-1,-54)], saveMetrics=T)
```
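The difference between the overall and the per-user check can be seen on a toy table. This sketch uses base R only and tests for exactly zero variance; `nearZeroVar` also flags nearly constant columns through its frequency-ratio and unique-value cutoffs. The users and values below are invented.

```r
# Toy data: two hypothetical users; user "b" has a constant (stuck) sensor
toy <- data.frame(
  user_name = rep(c("a", "b"), each = 4),
  roll_arm  = c(1, 2, 3, 4, 0, 0, 0, 0),
  pitch_arm = c(5, 6, 7, 8, 2, 3, 4, 5)
)

sensor_cols <- setdiff(names(toy), "user_name")

# Zero-variance check over all rows
overall_zero <- vapply(toy[sensor_cols], function(x) var(x) == 0, logical(1))

# Same check within each user; the result is a sensors x users logical matrix
per_user_zero <- sapply(split(toy[sensor_cols], toy$user_name), function(d) {
  vapply(d, function(x) var(x) == 0, logical(1))
})

overall_zero    # both FALSE
per_user_zero   # roll_arm is TRUE for user "b" only
```

A sensor that is constant for one participant still varies across the pooled data, which is why the pooled check reports nothing.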
```{r, fig.height=10,fig.width=8}
PCA<-prcomp(training[,c(-1,-54)],center = TRUE, scale=TRUE)
scores<-as.data.frame(PCA$x)
scores$group<-training$classe
p1<-ggplot(data = scores, aes(x = PC1, y = PC2,colour=training$user_name, label = rownames(scores))) +
geom_hline(yintercept = 0, colour = "gray65") +
geom_vline(xintercept = 0, colour = "gray65") +
geom_text(alpha = 0.8, size = 4) +
ggtitle("PCA plot of TimePoints - Sensor Data")
corcir<-circle(c(0,0), npoints=100)
correlations<-as.data.frame(cor(training[,c(-1,-54)],PCA$x))
arrows = data.frame(x1 = rep(0,dim(correlations)[1]), y1 = rep(0,dim(correlations)[1]), x2 = correlations$PC1, y2 = correlations$PC2)
p2<-ggplot() + geom_path(data = corcir, aes(x = x, y = y), colour = "gray65") +
geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "gray65") +
geom_text(data = correlations, size=4, aes(x = PC1, y = PC2, label = rownames(correlations))) +
geom_hline(yintercept = 0, colour = "gray65") + geom_vline(xintercept = 0,
colour = "gray65") + xlim(-1.1, 1.1) + ylim(-1.1, 1.1) + labs(x = "pc1 axis",
y = "pc2 axis") + ggtitle("Circle of correlations")
grid.arrange(p1,p2, ncol=1)
```
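Why PC1 can end up separating users rather than exercise classes is easy to reproduce on simulated data. The sketch below assumes two made-up users whose sensors differ by a constant offset; none of the names or numbers come from the real dataset.

```r
set.seed(42)  # arbitrary seed for a repeatable toy example

n <- 100
user <- rep(c("u1", "u2"), each = n)
offset <- ifelse(user == "u1", -3, 3)  # per-user sensor shift (made up)

# Three sensors share the user offset; the fourth is pure noise
X <- cbind(
  s1 = offset + rnorm(2 * n),
  s2 = offset + rnorm(2 * n),
  s3 = offset + rnorm(2 * n),
  s4 = rnorm(2 * n)
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Mean PC1 score per user: the two means have opposite signs
pc1_means <- tapply(pca$x[, "PC1"], user, mean)
pc1_means

# Correlation of each variable with PC1 and PC2 (the arrows in the circle plot)
circ <- cor(X, pca$x[, 1:2])
rowSums(circ^2)  # never above 1, so every arrow stays inside the unit circle
```

Because the principal components are uncorrelated, the squared correlations of a variable with all components sum to one. That is why the arrows in a circle-of-correlations plot cannot leave the unit circle.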
### Model fitting
```{r, cache=TRUE}
names<-training$user_name
training<-training[,-1]
treeFit<-train(classe~., method="rpart", data=training)
fancyRpartPlot(treeFit$finalModel)
table(training$classe,predict(treeFit, newdata=training))
treePredict<-predict(treeFit, newdata=testing)
table(testing$classe,treePredict)
OoSE<-(1-sum(treePredict==testing$classe)/length(treePredict))*100
```
The first model was trained with the rpart package using default parameters and all variables except the participant's name. The decision tree and confusion matrix show a very high error rate (~50%) on the training data, and the error rate on the validation dataset was `r OoSE`%. Furthermore, the model was unable to classify the "classe D" exercise. I suspect that removing the variables highly correlated with the participant may resolve this problem.
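The confusion table and the error percentage can be illustrated with a few invented labels. In this sketch the predictor never outputs class "D", which shows up as an all-zero column as long as both factors keep the full set of levels.

```r
# Toy labels: the predictions below never contain class "D"
lv <- c("A", "B", "C", "D")
actual    <- factor(c("A", "A", "B", "C", "D", "D"), levels = lv)
predicted <- factor(c("A", "B", "B", "C", "C", "A"), levels = lv)

# Rows are true classes, columns are predicted classes
conf_mat <- table(actual, predicted)
conf_mat

# Percentage of misclassified cases, equivalent to the OoSE formula
error_pct <- mean(predicted != actual) * 100
error_pct  # 3 of 6 wrong, so 50
```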
```{r, cache=TRUE}
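# user_name was already removed from training in the previous chunk,
# so no further column is dropped here
rfFit<-randomForest(classe~., data=training)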
rfFit
rfPredict<-predict(rfFit, newdata=testing)
table(testing$classe,rfPredict)
OoSE<-(1-sum(rfPredict==testing$classe)/length(rfPredict))*100
```
The second model was a random forest fit with the randomForest package. Once again, all variables were used except the participant's name. The resulting model was very accurate, with an OOB error rate of 0.6% on the training data. The error rate on the validation dataset was `r OoSE`%.
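The OOB (out-of-bag) estimate exists because each tree is grown on a bootstrap sample, which leaves some rows unseen by that tree. The simulation below is a sketch with arbitrary sizes; it only counts how many rows fall out of the bag and does not fit any trees.

```r
set.seed(7)  # arbitrary seed for a repeatable toy example

n_obs   <- 1000
n_trees <- 200

# For each "tree", draw a bootstrap sample of row indices and record
# the share of rows that were never drawn (the out-of-bag rows)
oob_share <- replicate(n_trees, {
  in_bag <- sample.int(n_obs, size = n_obs, replace = TRUE)
  mean(!(seq_len(n_obs) %in% in_bag))
})

mean(oob_share)  # close to exp(-1), roughly 0.37
```

Roughly a third of the rows are held out from every tree, so each training row can be scored by the trees that never saw it. This is why the OOB error behaves like a built-in validation estimate.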
### Final Predictions

Finally, the random forest model was used to predict the exercise classe for 20 observations. These were uploaded to Coursera and all were correct.
```{r, echo=FALSE}
#Predictions for submission
download.file(destfile = "testing_pml.csv", url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", method="curl")
testset<-read.csv("testing_pml.csv", stringsAsFactors=F)
testset$user_name<-as.factor(testset$user_name)
testset<-testset[,c(-1,-3,-4,-5,-6,-7,-14,-17,-26,-89,-92,-101,-127,-130,-139)]
testPredict<-predict(rfFit, newdata=testset)
testPredict<-as.character(testPredict)
testPredict
```
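The `pml_write_files` helper defined in the setup chunk is never called in this report; presumably it produced the one-file-per-case answers for upload. The variant below is a sketch of that step, not the original helper: the `dir` argument and the use of `writeLines` instead of `write.table` are additions, and the predictions are made up.

```r
# Variant of pml_write_files with an explicit output directory
# (the dir argument is an addition for this sketch)
write_prediction_files <- function(x, dir = tempdir()) {
  paths <- file.path(dir, paste0("problem_id_", seq_along(x), ".txt"))
  for (i in seq_along(x)) {
    # one file per prediction, containing only the class letter
    writeLines(as.character(x[i]), paths[i])
  }
  invisible(paths)
}

# Made-up predictions, not the actual submission
out_dir <- file.path(tempdir(), "pml_answers")
dir.create(out_dir, showWarnings = FALSE)
written <- write_prediction_files(c("B", "A", "C"), dir = out_dir)
readLines(written[2])  # "A"
```

Using `seq_along(x)` rather than `1:n` also keeps the loop from running when the prediction vector is empty.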
### Data Citations:

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.