-
Notifications
You must be signed in to change notification settings - Fork 1
/
Rcourse_intermediate_spring_2024.Rmd
325 lines (214 loc) · 9.92 KB
/
Rcourse_intermediate_spring_2024.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
---
title: "R Course"
author: "Heidi Karlsen"
date: "2024-04-11"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Topics
- RMarkdown
- R operators
- Correlations
- Regression models
- Statistical tests
- Visualisations in ggplot
```{r, install and load libraries}
#Packages below not previously installed by you, need to be installed before loading them.
#install.packages("<package name>")
library(gapminder)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(MASS)
```
```{r}
head(gapminder, 20)
```
The gapminder dataset provides data about the population, life expectancy and GDP in multiple countries in the world from 1952 to 2007.
## R operator %\>% and statistics
This operator forwards a value, or the result of an expression, into the next function call/expression. See documentation here: <https://cran.r-project.org/web/packages/magrittr/index.html>
It passes the left-hand side data as the first argument to the function on the right-hand side.
### 1. Calculate trends in life expectancy over time
```{r}
gapminder %>%
group_by(year)%>%
summarize(average_lifeExp <-mean(lifeExp))
```
### 2. Calculating average life expectancy and gdp per capita in continents in 2007
(gross domestic product divided by the country's total population
```{r}
gapminder %>%
summarise(average_lifeExp <- mean(lifeExp), average_gdpPercap <- mean(gdpPercap))
```
```{r}
gapminder %>%
filter(year == 2007)%>%
summarise(average_lifeExp <- mean(lifeExp), average_gdpPercap <- mean(gdpPercap))
```
### Exercise:
Calculate average life expectancy and average gdp per capita in continents
- In the entire dataset
- For one year only
```{r}
gapminder %>%
group_by(continent)%>%
summarise(average_lifeExp = mean(lifeExp), average_gdpPercap = mean(gdpPercap))
```
```{r}
gapminder %>%
filter(year == 2007)%>%
group_by(continent)%>%
summarise(average_lifeExp = mean(lifeExp), average_gdpPercap = mean(gdpPercap))
```
### 3. Sorting your data with arrange()
```{r}
gapminder%>%
arrange(lifeExp)
```
Life expectancy in Africa in 2007
```{r}
gapminder%>%
filter(continent =="Africa", year == 2007)%>%
arrange(lifeExp)
```
```{r}
gapminder%>%
filter(continent== "Africa", year == 2007)%>%
arrange(desc(lifeExp))
```
## Regression models
### 4a. Plotting relation between life expectancy and gdp per capita
```{r}
gapminder%>%
ggplot(aes(gdpPercap, lifeExp))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+ #adding a trend line.
coord_cartesian(ylim = c(40, 90))
```
Trend line = fitting a line which follows the data points. We set the method to lm, for linear model, which gives a trend line calculated with a linear regression.
By default, geom_smooth() shows a standard error ribbon, which can be turned off by setting 'se' to FALSE.
### 4.b Improved plot
- Given the wide range of GDP per capita values, we can use a log scale. It transforms the scale of the x axis to a logarithmic scale with base 10. Thus every point on the x axis represents a tenfold increase from the previous point.
- Increase point transparency
```{r}
ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent))+
geom_point(alpha = 0.5)+
geom_smooth(method = "lm", se = FALSE, color = "black")+
scale_x_log10()
```
The straight lines of linear regressions are defined by two properties:
- Intercept = The y value at the point when x is 0. Thus the value of the y axis where the regression line touches 0 on the x axis
- Slope = The steepness of the line: The amount the y value increases if you increase x by one.
- The equation of the line is the intercept plus the slope times the x value: y = intercept + slope \* x
- This trend line is in general pretty close to the data points (although some exceptions) thus, the linear regression seems to be a reasonable fit.
### Correlations
```{r}
cor(gapminder$lifeExp, gapminder$gdpPercap)
```
### Simple linear regression
Regression models are a class of statistical models that let you explore the relationship between a response variable/dependent variable and some explanatory/independent variables
Regression makes it possible to make thought experiments, hypotheses, predict the dependent variables, given some independent variables
The dependent variable = variable we want to predict
The independent variable = the variables that explain how the dependent variable will change
There are different types of regression analysis: linear and logistic
- Linear regression: when the dependent variable is numeric, which is the case here
- Logistic regression: when the variable is logistic (true or false)
- Both can be simple or multiple - simple when there is only one explanatory/independent variable, as is the case here.
```{r}
regression_gapminder <- lm(lifeExp ~ gdpPercap, data = gapminder)
regression_gapminder
```
```{r}
summary(regression_gapminder)
```
Using these values to predict life expectancy:
```{r}
LifeExp = 53.96 +0.0007649 *10000
LifeExp
```
#### Interpreting the results:
- Std.Error = Standard error: how different the population mean is likely to be from a sample mean
- T value = test statistics: how close the relations is between the distribution of our data and the distribution predicted under the null hypothesis
- Null hypothesis = the hypothesis that there is no correlation between the entities we are testing. it is the pattern in our data \#(i.e., the correlation between variables or difference between groups) divided by the variance in the data (i.e., the standard deviation). See: <https://www.scribbr.com/statistics/test-statistic/>
- p(\>\|t\|) = p(robability) value: the probability of finding the given t statistic if the null hypothesis were true.
- We often say that we can assume from a p value of 0.05 or lower that the pattern we have found is statistically significant. A p value of 0.05 means that there are 5 percent chance, if the null hypothesis were true, that we would find a t statistic as strong as the one we have found Here the p value is less than 2.2e-16, thus very small. It is represented in scientific notation. It can also be written as 0.000000000000000022.
- Residual standard error: an estimate of the standard deviation of the residuals (prediction errors) in a linear regression model: how much the observed values differ from the predicted values. It measures the overall lack of fit of the model to the data. The lower the better.
- RSE here: 10.49, given the range of life expectancy, this suggests that there is still a substantial amount of variability in life expectancy that isn't captured by gdp per capita.
- The adjusted R-squared value (0.3403) shows that about 34 % of the variance in life expectancy can be explained by gdp per capita in the model, thus about 65 % is explained by other factors.
- F-statistic: Measures how well the linear regression model fits the data. It measures the variation between group means. The F-statistic is a ratio of two variances - the variance explained by the regression model (explained variance) and the variance not explained by the model (residual variance).
- DF = Degrees of freedom, meaning the number of values in a statistic that are free to vary.
### Exercise
- Visualise the relation between life expectancy and gdp per capita in 2007
- Run a simple linear regression on these data
```{r}
gapminder%>%
filter(year == 2007)%>%
ggplot(aes(gdpPercap, lifeExp, color = continent))+
geom_point(alpha = 0.5)+
geom_smooth(method = "lm", se = FALSE, color = "black")+
scale_x_log10()
```
```{r}
gapminder_2007 <- gapminder%>%
filter(year == 2007)
```
```{r}
regression_gapminder_2007 <- lm(lifeExp ~ gdpPercap, data = gapminder_2007)
summary(regression_gapminder_2007)
```
### More plotting
We prepare for running a multiple linear regression. Let's first analyse the relation between life expectancy and gdp per capita, life expectancy and population - and life expectancy and continent by creating scatter plots with trend lines calculated with linear regression.
```{r}
plot1 <- ggplot(gapminder, aes(gdpPercap, lifeExp))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
labs(x = "gdpPercap", y = "lifeExp")+
scale_x_log10()
plot2 <- ggplot(gapminder, aes(pop,lifeExp))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
labs(x = "pop", y = "lifeExp")+
scale_x_log10()
plot3 <- ggplot(gapminder, aes(continent, lifeExp))+
geom_point()+
geom_smooth(method = "lm", se = FALSE)+
labs(x = "continent", y = "lifeExp")
grid.arrange(plot1, plot2, plot3)
```
### Multiple linear regression
```{r}
regression_gapminder2 <- lm(lifeExp~pop + gdpPercap + continent, data = gapminder)
regression_gapminder2
```
```{r}
summary(regression_gapminder2)
```
LifeExp= 47.81+(6.570e−9×pop)+(0.0004495×gdpPercap)+Continent Coefficients
### Interpretation
We see from the R squared value that this model predicts about 58 % of the variance can be explained by the model.
## Statistical tests
Let's use the survey dataset from the MASS package as an examle
```{r}
head(survey)
```
```{r}
summary(survey)
```
### T-tests
Statistical method to analyse whether there are siginificant differences between the means of two groups. We see if the null hypothesis - that there are no significant difference between the mean - can be rejected or not.
```{r}
t.test(Pulse~Sex, data = survey, var.equal = TRUE, paired = FALSE)
```
## Chi-square test
Statistical method to determine whether there is a statistically significant association between categorical variables.
```{r}
Smoke_Excercise = table(survey$Smoke, survey$Exer)
```
```{r}
Smoke_Excercise
```
```{r}
chisq.test(Smoke_Excercise)
```