---
title: "Machine_Learning"
author: "emilliman4"
date: "02/22/2015"
output: html_document
---

```{r, echo=FALSE, message=FALSE}
library(ggplot2)
library(gplots)
library(caret)
library(rattle)
library(gridExtra)
library(randomForest)
library(foreach)
```
```{r, echo=FALSE, cache=TRUE, message=FALSE}
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",destfile = "pml-training.csv",method="curl")

pml.training<-read.csv("pml-training.csv", stringsAsFactors=FALSE)

pml.training$user_name<-as.factor(pml.training$user_name)
pml.training$classe<-as.factor(pml.training$classe)
pml.training$new_window<-as.factor(pml.training$new_window)
pml.training<-pml.training[,c(-1,-3,-4,-5,-6,-7,-14,-17,-26,-89,-92,-101,-127,-130,-139)]
pml.training<-subset(pml.training,select = apply(pml.training,2, function(x) sum(is.na(x))) < 100)
pml.training<-subset(pml.training,select=apply(pml.training,2, function(x) !is.element("#DIV/0!",x))==TRUE)

##Project specific functions
circle <- function(center = c(0, 0), npoints = 100) {
    r = 1
    tt = seq(0, 2 * pi, length = npoints)
    xx = center[1] + r * cos(tt)
    yy = center[2] + r * sin(tt)
    return(data.frame(x = xx, y = yy))
}

pml_write_files = function(x){
  n = length(x)
  for(i in seq_len(n)){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
```
###Getting and Cleaning Data

A look at the basic structure of the data shows a number of columns with little or no data; these hold the summary statistics computed for each window. A quick inspection of the test data shows that each entry is a single time point, so all summary columns were removed. The index, time, date, and window columns were also removed because they should have no bearing on the predictions.
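
As a minimal sketch of the column filtering described above, the toy data frame below has one sparse summary column and one column containing the spreadsheet error string `#DIV/0!`. The column names, values, and the cutoff `max_na` are assumptions made for illustration; the analysis itself uses a cutoff of 100 missing values.

```r
# Toy data: all names and values here are made up for illustration.
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.45, 1.48, 1.45, 1.42),
  avg_roll_belt = c(NA, NA, NA, NA, NA, 1.44),       # per-window summary, mostly NA
  kurtosis_yaw  = c("", "", "", "", "", "#DIV/0!"),  # spreadsheet error string
  classe        = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

max_na <- 2  # assumed cutoff for the toy data

# Count missing values per column and flag columns holding the error string.
na_count <- vapply(toy, function(x) sum(is.na(x)), integer(1))
has_div0 <- vapply(toy, function(x) "#DIV/0!" %in% x, logical(1))

# Keep only columns that pass both checks.
clean <- toy[, na_count < max_na & !has_div0, drop = FALSE]
print(names(clean))
```

Using `vapply` over the columns avoids the character coercion that `apply` performs on a mixed-type data frame, although both approaches give the same answer for these two checks.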

Before further analysis, 40% of the data was set aside for validation/testing.

```{r, echo=FALSE}
#head(pml.training)
#summary(pml.training)
inTrain<-createDataPartition(pml.training$classe,p=0.6,list=FALSE)

training<-pml.training[inTrain,]
testing<-pml.training[-inTrain,]
```
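
The split above relies on caret's `createDataPartition`, which samples within each level of the outcome so that both partitions keep similar class proportions. The base R sketch below illustrates that idea on made-up labels. It is a stand-in for illustration only, not caret's implementation, and `labels`, `in_train`, and the seed are assumptions.

```r
set.seed(1)  # assumed seed so the toy split is repeatable

# Made-up outcome labels with unequal class sizes.
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within each class.
in_train <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
  }),
  use.names = FALSE
)

train_labels <- labels[in_train]
test_labels  <- labels[-in_train]

# Class proportions should match in both partitions.
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```

Setting a seed before the real `createDataPartition` call would likewise make the reported error rates repeatable between knits.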

###Exploratory Data Analysis

Because of the large number of variables, I opted to make a heatmap of the correlation coefficients between all variables. I also used caret's `nearZeroVar` function to look for uninformative variables. Across all participants there were no zero-variance variables. However, there were some on a per-user basis (e.g. jeremy: roll, pitch, and yaw of the arm; adelmo: roll, pitch, and yaw of the forearm). Models with and without these variables should be tested and compared.
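
To illustrate how a variable can be informative overall yet constant for one participant, here is a small base R sketch. It uses a simpler check than `nearZeroVar` (it only flags columns with a single distinct value), and the user names, sensor names, and readings are made up.

```r
# Made-up readings: `roll_arm` is constant for user "u1" but varies overall.
toy <- data.frame(
  user_name  = rep(c("u1", "u2"), each = 4),
  roll_arm   = c(0, 0, 0, 0, 10, 12, 9, 11),
  pitch_belt = c(1, 2, 3, 4, 2, 3, 4, 5)
)
sensors <- c("roll_arm", "pitch_belt")

# TRUE when a column holds a single distinct value.
is_constant <- function(x) length(unique(x)) == 1

# Across all users: nothing is constant.
overall <- vapply(toy[sensors], is_constant, logical(1))

# Within each user: one row per user, one column per sensor.
per_user <- t(sapply(split(toy[sensors], toy$user_name),
                     function(d) vapply(d, is_constant, logical(1))))

print(overall)
print(per_user)
```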

A histogram of each variable was made to assess the normality and skew of each measurement (graphs not shown). A number of variables showed non-normal distributions. These skews are due to shifts in individual participants' sensor measurements, which is confirmed by PCA of the dataset: the plot of PC1 against PC2 shows very clear separation of the users but not of the classe. Furthermore, the variables with high correlations are the most skewed. Normalizing this data would be a major undertaking, so I first tried to classify the data with decision trees and random forests to determine whether the data would need to be transformed at all.
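
The effect described above, where per-participant offsets dominate the first principal component, can be reproduced on synthetic data. The sketch below is only an illustration under assumed values: the two users, the three sensors, the offsets, and the seed are all invented.

```r
set.seed(42)  # assumed seed for the toy data

n_per_user <- 50
user <- rep(c("u1", "u2"), each = n_per_user)

# Two made-up sensors that share a large per-user offset, plus small noise.
offset <- ifelse(user == "u1", -5, 5)
toy <- data.frame(
  sensor_a = offset + rnorm(2 * n_per_user),
  sensor_b = offset + rnorm(2 * n_per_user),
  sensor_c = rnorm(2 * n_per_user)  # unrelated to the user
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Mean PC1 score per user: the two users land on opposite sides of zero.
pc1_by_user <- tapply(pca$x[, "PC1"], user, mean)
print(pc1_by_user)

# Share of variance explained by each component.
print(summary(pca)$importance["Proportion of Variance", ])
```

The sign of a principal component is arbitrary, so which user ends up on the positive side may differ between platforms; only the separation matters.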

```{r}
heatmap.2(cor(training[,c(-1,-54)]), col=redblue(75),trace="none", main="Figure 2: Correlation of sensor data", margins=c(7,7), cexRow=0.75, cexCol=0.8)
```

```{r, eval=FALSE}
foreach(i=2:(length(colnames(training))-1)) %do%  hist(as.numeric(training[,i]), 
                                                     main=c("Index:",i,colnames(training)[i]))
nsv<-by(training[,c(-1,-54)],training$user_name,function(x) nearZeroVar(x, saveMetrics=T))
zeroVar<-nearZeroVar(training[,c(-1,-54)], saveMetrics=T)
```

```{r, fig.height=10,fig.width=8}
PCA<-prcomp(training[,c(-1,-54)],center = TRUE, scale. = TRUE)

scores<-as.data.frame(PCA$x)
scores$group<-training$classe
p1<-ggplot(data = scores, aes(x = PC1, y = PC2,colour=training$user_name, label = rownames(scores))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(alpha = 0.8, size = 4) +
  ggtitle("PCA plot of TimePoints - Sensor Data")

corcir<-circle(c(0,0), npoints=100)

correlations<-as.data.frame(cor(training[,c(-1,-54)],PCA$x))
arrows = data.frame(x1 = rep(0,dim(correlations)[1]), y1 = rep(0,dim(correlations)[1]), x2 = correlations$PC1, y2 = correlations$PC2)

p2<-ggplot() + geom_path(data = corcir, aes(x = x, y = y), colour = "gray65") + 
    geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "gray65") + 
    geom_text(data = correlations, size=4, aes(x = PC1, y = PC2, label = rownames(correlations))) + 
    geom_hline(yintercept = 0, colour = "gray65") + geom_vline(xintercept = 0, 
    colour = "gray65") + xlim(-1.1, 1.1) + ylim(-1.1, 1.1) + labs(x = "pc1 axis",
    y = "pc2 axis") + ggtitle("Circle of correlations")


grid.arrange(p1,p2, ncol=1)
```

###Model fitting

```{r, cache=TRUE}
names<-training$user_name
training<-training[,-1]
treeFit<-train(classe~., method="rpart", data=training)
fancyRpartPlot(treeFit$finalModel)
table(training$classe,predict(treeFit, newdata=training))
treePredict<-predict(treeFit, newdata=testing)
table(testing$classe,treePredict)
OoSE<-(1-sum(treePredict==testing$classe)/length(treePredict))*100
```
The first model was trained with the rpart package using default parameters and all variables except the participant's name. The decision tree and confusion matrix show a very high error rate (~50%) on the training data, and the error rate on the validation dataset was `r round(OoSE, 2)`%. Furthermore, the model was unable to classify the classe "D" exercise. I suspect that removing the variables highly correlated with the participant may resolve this problem.
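
The error-rate arithmetic and the "class never predicted" check can be sketched on a small made-up example. The labels below are invented and are not results from the model; the point is that giving both factors the same levels keeps an all-zero column in the confusion matrix for a class the model never predicts.

```r
# Made-up labels: the toy "model" never predicts class "D".
lv <- c("A", "B", "C", "D")
truth     <- factor(c("A", "A", "B", "B", "C", "C", "D", "D"), levels = lv)
predicted <- factor(c("A", "A", "B", "C", "C", "C", "C", "B"), levels = lv)

# Rows are the true class, columns the predicted class.
cm <- table(truth, predicted)
print(cm)

# Percentage of misclassified observations.
error_pct <- (1 - sum(diag(cm)) / sum(cm)) * 100

# Per-class recall; a class that is never predicted has recall 0.
recall <- diag(cm) / rowSums(cm)

# Classes with an all-zero column are never predicted.
never_predicted <- colnames(cm)[colSums(cm) == 0]

print(error_pct)
print(recall)
print(never_predicted)
```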

```{r, cache=TRUE}
# user_name was already dropped from `training` above, so no further column is removed here
rfFit<-randomForest(classe~., data=training)
rfFit
rfPredict<-predict(rfFit, newdata=testing)
table(testing$classe,rfPredict)
OoSE<-(1-sum(rfPredict==testing$classe)/length(rfPredict))*100
```

The second algorithm used to fit a model was a random forest (the randomForest package). Once again, all variables were used except the participant's name. The resulting model was very accurate, with an OOB error rate of 0.6% on the training data. The error rate on the validation dataset was `r round(OoSE, 2)`%.
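
The OOB (out-of-bag) error is estimated from observations that a given tree did not see: each tree is grown on a bootstrap sample, and the rows left out of that sample act as a test set for that tree. The sketch below does not fit a forest; it only illustrates how many rows a single bootstrap sample leaves out (about 1/e, or roughly 37%). The value of `n` and the seed are assumptions.

```r
set.seed(123)  # assumed seed

n <- 10000  # assumed number of training rows
in_bag <- sample.int(n, size = n, replace = TRUE)

# Rows never drawn are "out of bag" for this tree.
oob_fraction <- 1 - length(unique(in_bag)) / n

print(oob_fraction)
print(exp(-1))  # theoretical limit for large n
```

Because every row is out of bag for roughly a third of the trees, the OOB estimate behaves much like a built-in validation error, which is consistent with it being close to the error on the held-out 40%.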

###Final Predictions

Finally, the random forest model was used to predict the exercise classe for 20 test observations. The predictions were uploaded to Coursera and all were correct.

```{r, echo=FALSE}
#Predictions for submission
download.file(destfile = "testing_pml.csv", url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", method="curl")

testset<-read.csv("testing_pml.csv", stringsAsFactors=F)
testset$user_name<-as.factor(testset$user_name)
testset<-testset[,c(-1,-3,-4,-5,-6,-7,-14,-17,-26,-89,-92,-101,-127,-130,-139)]
testPredict<-predict(rfFit, newdata=testset)
testPredict<-as.character(testPredict)
testPredict
```
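
The `pml_write_files` helper is defined at the top of this report, but the call that produces the submission files is not shown; presumably it was something like `pml_write_files(testPredict)`. The sketch below shows how such a call behaves, using placeholder predictions and a temporary directory (both assumptions) so that nothing is written to the project folder.

```r
# Same helper as defined at the top of this report.
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}

# Placeholder predictions, not the real results.
example_predictions <- c("A", "B", "C")

# Write into a temporary directory so the working directory stays clean.
out_dir <- tempfile("pml_answers_")
dir.create(out_dir)
old_wd <- setwd(out_dir)
pml_write_files(example_predictions)
setwd(old_wd)

# One file per prediction, each holding a single label.
written <- sort(list.files(out_dir))
print(written)
print(readLines(file.path(out_dir, "problem_id_2.txt")))
```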

###Data Citations:
  
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.