This document describes the steps involved in cleaning the data as described in readme.md
All the files contain data in space-separated-values format.
None of the files contain column names or row names.
- features.txt: (dimensions: 561, 2): Contains the IDs and names of the variables
- activity_labels.txt: (dimensions: 6, 2): Contains the IDs and names of the activities
- train/X_train.txt: (dimensions: 7352, 561): Contains the values of all the variables for each observation in the training set
- train/y_train.txt: (dimensions: 7352, 1): Contains the activity ID of each observation/row in the observations file (train/X_train.txt)
- train/subject_train.txt: (dimensions: 7352, 1): Contains the participant ID of each observation/row in the observations file (train/X_train.txt)
- test/X_test.txt: (dimensions: 2947, 561): Contains the values of all the variables for each observation in the testing set
- test/y_test.txt: (dimensions: 2947, 1): Contains the activity ID of each observation/row in the observations file (test/X_test.txt)
- test/subject_test.txt: (dimensions: 2947, 1): Contains the participant ID of each observation/row in the observations file (test/X_test.txt)
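Because the files are space-separated and carry no header row, they can be read with read.table() and header = FALSE. The following is a minimal sketch on a small temporary file that stands in for features.txt; the three rows and the column names "id" and "name" are illustrative assumptions, not part of the data set.

```r
# Write a tiny stand-in for features.txt: two space-separated columns, no header.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1 tBodyAcc-mean()-X",
             "2 tBodyAcc-std()-X",
             "3 tBodyAcc-mad()-X"), tmp)

# header = FALSE because the files contain no column names;
# the default sep = "" splits on any run of whitespace.
# The column names "id" and "name" are hypothetical labels for this sketch.
features <- read.table(tmp, header = FALSE, stringsAsFactors = FALSE,
                       col.names = c("id", "name"))

dim(features)
```

The same call, pointed at the real paths, should yield the dimensions listed above for each file.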
- Check whether the UCI HAR Dataset folder exists in the current directory. If not, download and unzip it.
- Load the files train/X_train.txt, train/y_train.txt, train/subject_train.txt, test/X_test.txt, test/y_test.txt, test/subject_test.txt and activity_labels.txt into respective variables.
- Bind the columns of subject_test, y_test and x_test to form a testing dataframe
- Bind the columns of subject_train, y_train and x_train to form a training dataframe
- Bind the rows of the two data frames obtained in the two previous steps to obtain big_DF
- Set the column names of big_DF to "subject", then "activity", then the values read from features.txt
- Use grep() to find the column names that match "subject" or "activity", or that contain "mean()" or "std()"
- Assign to big_DF a dataframe that contains only the columns of big_DF found in the previous step
- Match the keys from the activities dataframe to the activity values in big_DF and replace them with the corresponding names from the activities dataframe
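The merging, naming, column selection and activity labelling steps above can be sketched in base R as follows. This is an illustration on toy data, not the project script: every value, the three feature names, and the placeholder column names (s, y, v1 to v3, id, name) are assumptions made so the example runs without the data set.

```r
# Toy stand-ins for the loaded files (hypothetical values, 2 rows per set).
features   <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-mad()-X")
activities <- data.frame(id = 1:2, name = c("WALKING", "SITTING"),
                         stringsAsFactors = FALSE)

subject_train <- data.frame(s = c(1L, 1L))
y_train       <- data.frame(y = c(1L, 2L))
x_train       <- data.frame(v1 = c(0.1, 0.2), v2 = c(0.3, 0.4), v3 = c(0.5, 0.6))
subject_test  <- data.frame(s = c(2L, 2L))
y_test        <- data.frame(y = c(2L, 1L))
x_test        <- data.frame(v1 = c(0.7, 0.8), v2 = c(0.9, 1.0), v3 = c(1.1, 1.2))

# Bind columns within each set, then stack the two sets row-wise.
test_DF  <- cbind(subject_test, y_test, x_test)
train_DF <- cbind(subject_train, y_train, x_train)
big_DF   <- rbind(train_DF, test_DF)

# Name the columns: subject, activity, then the feature names.
names(big_DF) <- c("subject", "activity", features)

# Keep "subject", "activity" and any column containing "mean()" or "std()".
# The parentheses are escaped because grep() treats the pattern as a regex.
keep   <- grep("^subject$|^activity$|mean\\(\\)|std\\(\\)", names(big_DF))
big_DF <- big_DF[, keep]

# Replace activity IDs with their descriptive names.
big_DF$activity <- activities$name[match(big_DF$activity, activities$id)]
```

Escaping the parentheses matters here: an unescaped "mean()" would be read as a regular expression with an empty group and would also match names such as "meanFreq()".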
STEP 5: Creating a new, tidy data set with the average of each variable for each activity and each subject
- Use the melt() function to reshape big_DF into a 4-column dataframe that contains the columns "activity", "subject", "variable" and "value" (the id.vars used are "activity" and "subject")
- Use the dcast() function to cast the molten dataframe with mean as the aggregate function, collapsing the rows that share a subject and activity into the mean of their values
- Assign the resulting data.frame to a variable called clean_DF in the calling environment
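The reshaping steps above might look like the sketch below. It assumes that melt() and dcast() come from the reshape2 package, which this document does not name explicitly, and it uses a small made-up big_DF with a single measurement column.

```r
library(reshape2)  # assumption: melt() and dcast() are the reshape2 versions

# Toy big_DF (hypothetical values): subject 1 has two WALKING rows to average.
big_DF <- data.frame(
  subject  = c(1L, 1L, 1L, 2L),
  activity = c("WALKING", "WALKING", "SITTING", "WALKING"),
  `tBodyAcc-mean()-X` = c(0.2, 0.4, 0.1, 0.5),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Long format: one row per (activity, subject, variable) measurement.
molten <- melt(big_DF, id.vars = c("activity", "subject"))

# Wide format again, averaging the values that share an activity and subject.
clean_DF <- dcast(molten, activity + subject ~ variable,
                  fun.aggregate = mean, value.var = "value")
```

In this toy case the two WALKING rows of subject 1 collapse into a single row holding their mean, so clean_DF has one row per activity and subject pair.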