-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
246 lines (182 loc) · 10.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
output: github_document
bibliography: references.bib
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
library(knitr)
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%",
cache = TRUE,
cache.extra = rand_seed,
warning = FALSE,
message = FALSE
)
```
# polychoric <img src="man/figures/logo.png" title="Logo created with hexSticker package" align="right" height="139"/>
<!-- badges: start -->
[![R-CMD-check](https://github.com/Marwolaeth/polychoric/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/Marwolaeth/polychoric/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
Instant Polychoric and Polyserial Correlation
## About
Polychoric is a package that provides a wrapper via [@RcppEigen] for C++ routines used to calculate polychoric and polyserial correlation coefficients, which are often used in social science or marketing research. The `cor_polychoric()` function can take in ordinal factor (possibly integer) vectors, a contingency table or a data frame. It returns corresponding polychoric correlation estimates in a form of single numeric value or correlation matrix.
The `cor_polyserial()` takes exactly one continuous and one ordinal vector and returns a polyserial coefficient estimate. Both functions optionally return the p-values associated with the coefficients and estimated discretization thresholds for ordinal variables.
## Installation
You can install the development version of polychoric from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
# devtools::install_github("Marwolaeth/polychoric")
```
## The Purpose
Polychoric correlation coefficients are a type of correlation coefficient used to measure the relationship between two ordinal variables. They are computed by estimating the correlation between two underlying continuous variables that are assumed to give rise to the observed ordinal data. The `cor_polychoric()` function estimates latent Pearson correlation coefficients under the assumption that the latent traits of interest are standard normal random variables.
Polyserial correlation is used to measure the relationship between a continuous variable and an ordinal variable. They are computed by estimating the correlation between the observed continuous variable and a latent continuous variable that is derived from the observed ordinal variable.
The computation of both polychoric and polyserial correlation coefficients involves estimating the thresholds (here called `gamma` and `tau`, like in [@drasgow1986] and [@olsson1979], or just `tau` as in `psych` package [@psych]) that separate the ordinal categories for each variable. These thresholds are used to transform the ordinal data into a set of continuous variables, which can then be used to estimate the correlation coefficient using standard methods. The `cor_polychoric()` and `cor_polyserial()` functions currently estimate the coefficients using a two-step maximum likelihood estimation, where first the thresholds are deduced from univariate distributions of ordinal variable(s) and then the `L-BFGS-B` optimization algorithm (implemented in [LBFGS++](https://github.com/yixuan/LBFGSpp/), [@LBFGSpp] using `Eigen` library [@eigenweb]) is used to find the value of the correlation coefficient $\rho$ that maximizes the likelihood of the observed data. The `toms462` [@donnelly1973], [@owen1956] [algorithm](https://people.sc.fsu.edu/~jburkardt/cpp_src/toms462/toms462.html) is used to approximate the bivariate normal distribution (quadrant probabilities) of threshold values in `cor_polychoric()`.
Polychoric and polyserial correlation coefficients are useful for analyzing data that involve ordinal variables, such as Likert scales or survey responses. They provide a measure of the strength and direction of the relationship between two ordinal variables, which can be useful for understanding patterns in the data.
## Disclaimer
Please note that the `polychoric` package was developed as a personal project and is not intended for professional or commercial use. While every effort has been made to ensure the accuracy and reliability of the package, it is provided 'as is' without any warranty or guarantee of suitability for any particular purpose. The package is intended primarily as a means of exploring the direct implementation of (not so) complex mathematical concepts, such as likelihood functions and derivatives, and as such may not be suitable for all use cases. However, it is hoped that others may find the package useful and informative. If you do choose to use the `polychoric` package, please do so with an understanding of its intended use as a personal project and with full awareness of any limitations or potential issues.
## Example
### Data preview
```{r example-data}
library(polychoric)
library(psych)
if (!require(likert)) {
install.packages('likert')
library(likert)
}
data("gss12_values", package = 'polychoric')
head(gss12_values, 13)
```
```{r example-data-plot}
# likert() doesn't work with tibbles
gss12_values |> as.data.frame() |> likert() |> plot()
```
### For a pair of discrete vectors
Coefficient only:
```{r example-xy-coef}
cor_polychoric(gss12_values$valorig, gss12_values$valeql)
```
Full output:
```{r example-xy-full}
cor_polychoric(gss12_values$valorig, gss12_values$valeql, coef.only = FALSE)
```
### For a contingency table
```{r example-contingency}
(G <- table(gss12_values$valorig, gss12_values$valeql))
cor_polychoric(G)
# side with psych:polychoric()
psych::polychoric(G)$rho
```
Notice that threshold values are exactly the same for both functions:
```{r example-contingency-full}
cor_polychoric(G, coef.only = FALSE)
# side with psych:polychoric()
psych::polychoric(G, correct = 1e-08)
```
### For a data frame
```{r example-corrmatrix}
cor_polychoric(gss12_values)
```
Let's visualise and compare our matrices. We suggest `pheatmap` package for quick correlation matrix visualisation. `corrplot` is also a good option.
```{r example-corrmatrix-vis}
#| layout-ncol: 2
#| fig-cap:
#| - "psych correlation matrix"
#| - "polychoric correlation matrix"
if (!require(pheatmap)) {
install.packages('pheatmap')
library(pheatmap)
}
rho1 <- cor_polychoric(gss12_values)
# psych::polychoric() doesn't work with factor data directly
gss_num <- gss12_values |> lapply(as.integer) |> as.data.frame()
rho2 <- polychoric(gss_num)
pheatmap(
rho2$rho,
cluster_rows = FALSE,
cluster_cols = FALSE,
display_numbers = TRUE,
number_format = '%.2f',
fontsize_number = 9
)
pheatmap(
rho1,
cluster_rows = FALSE,
cluster_cols = FALSE,
display_numbers = TRUE,
number_format = '%.2f',
fontsize_number = 9
)
```
### Handling mixed variable types
The `cor_polychoric()` function is currently limited in its flexibility as it only provides polychoric estimation for ordinal variables and does not support biserial or polyserial estimation for mixed ordinal and continuous variables. The function does, however, attempt to recognise potentially non-discrete variables, allowing for up to 10 levels, like in [World Values Survey](https://www.worldvaluessurvey.org/wvs.jsp) [@ingelhart2014]questionnaire items. In comparison, the `polychoric()` function from the `psych` package allows up to 8 levels by default.
It's worth noting that variables with a high number of distinct values may cause estimation issues, so the `cor_polychoric()` function returns Spearman's $\rho$ instead (with a warning).
```{r example-mixed, set.seed(111)}
x <- rnorm(nrow(gss12_values))
cor_polychoric(gss12_values$valspl, x)
```
### Polyserial correlation
Alternatively, one can correlate a continuous and an ordinal variable explicitly using polyserial correlation. The `polychoric` package contains `cor_polyserial()` function that estimates polyserial correlation coefficients between a continuous and an ordinal variable.
```{r example-polyserial, set.seed(111)}
# Let them be actually correlated
x <- as.integer(gss12_values$valspl) * 20.2 + rnorm(nrow(gss12_values), sd = 13)
cor(x, as.integer(gss12_values$valspl))
cor_polyserial(x, gss12_values$valspl)
```
Due to its strong bivariate normality assumptions, `cor_polyserial()` is currently not a default choice for a mixed continuous-ordinal variable correlation.
### Handling missing values
The `cor_polychoric()` function always uses pairwise complete observations. Therefore, the user need not worry about missing data. However, depending on the analysis design and the ratio of missing data, it may be essential to check for patterns of missingness and consider imputation.
The General Social Survey Schwartz Values Module dataset [@smith2014] is cleared of missing values (non-response or non-applicable). Here we introduce some NAs into random places across the dataset. The summary will show the dataset now actually contains missings.
```{r example-missing-01, set.seed(111)}
gss_miss <- gss12_values
mask <- matrix(
sample(
c(TRUE, FALSE),
nrow(gss_miss)*ncol(gss_miss),
replace = TRUE,
prob = c(.9, .1)
),
nrow = nrow(gss_miss)
)
gss_miss[!mask] <- NA
summary(gss_miss[,1:4]) # Now NAs are present
```
Let's rerun the estimation: the function works, though coefficients may be different.
```{r example-missing-02}
pheatmap(
cor_polychoric(gss_miss),
cluster_rows = FALSE,
cluster_cols = FALSE,
display_numbers = TRUE,
number_format = '%.2f',
fontsize_number = 9
)
```
## Performance
Probably the only reason the `polychoric` package may be useful is that it is fast. During alpha-testing of the source code, it was nearly 50x faster in creating a correlation matrix than the gold standard `psych::polychoric()`. The package version is 5 times slower than the raw source code. However, `polychoric` still introduces a significant improvement in performance and can save a market researcher a couple of minutes a day.
```{r benchmark}
if (!require(microbenchmark)) {
install.packages('microbenchmark')
library(microbenchmark)
}
bm <- microbenchmark(
polychoric = polychoric(gss_num),
cor_polychoric = cor_polychoric(gss12_values),
times = 13L,
control = list(warmup = 2)
)
bm
```
Another minor advantage of the `polychoric` package is that its functions accept ordinal factor variables without need to convert them explicitly.
## Development
This is an alpha version of the package. It is under development.
Please report any issues you came up with on the [issues](https://github.com/Marwolaeth/polychoric/issues) page.
<details>
<summary>The upcoming steps</summary>
1. Implement (optional) more robust distributional assumptions, e.g. a skew normal distribution [@jin2016].
</details>
## References
::: {#refs}
:::