# Correlation

Two or more columns are required. The output is a matrix of the correlations between all pairs of columns. In the ‘Statistic \ p(uncorr)’ table format, correlation values are given in the lower triangle of the matrix, and the two-tailed probabilities that the columns are uncorrelated are given in the upper triangle. Both parametric and non-parametric coefficients and tests are available.

Missing data: Handled by pairwise deletion, except for partial correlation, which uses mean value imputation.

#### Linear r (Pearson)

Pearson’s r is the most commonly used parametric correlation coefficient. The significance is computed using a two-tailed t test with n-2 degrees of freedom.
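For reference, the textbook form of the coefficient and its test statistic is given below. The symbols x and y for the two columns are chosen here for illustration and are not taken from Past itself.

$$
r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}},
\qquad
t = r\sqrt{\frac{n-2}{1-r^2}}
$$

Under the null hypothesis of no correlation, t is referred to Student's t distribution with n-2 degrees of freedom.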

#### Spearman’s D and rs

Spearman’s (non-parametric) rank-order correlation coefficient is the linear correlation coefficient (Pearson’s r) of the ranks.

For n>9, the two-tailed significance of rs (the probability of an rs at least this extreme if the columns were uncorrelated) is computed using a t test with n-2 degrees of freedom.

For small n this approximation is inaccurate, and for n<=9 the program therefore switches automatically to an exact test. This test compares the observed rs to the values obtained from all possible permutations of the first column.
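The exact test can be sketched as follows. This is a minimal illustration, not Past's own code: the function names, the average-rank treatment of ties, and the small floating-point tolerance are all assumptions made here. It assumes neither column is constant.

```python
from itertools import permutations
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their average rank (assumed)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson's r of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_exact(x, y):
    """Return (rs, two-tailed p) by enumerating every permutation of x's ranks."""
    rx, ry = ranks(x), ranks(y)
    observed = pearson(rx, ry)
    hits = total = 0
    for perm in permutations(rx):
        total += 1
        # Tolerance guards against round-off when comparing equal values.
        if abs(pearson(perm, ry)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, hits / total


rs, p = spearman_exact([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

For n = 9 the loop visits 9! = 362,880 permutations, which is why full enumeration is only practical for small n.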

The asymptotic test on D is closely related to the test on rs (see Press et al. 1992). It is computed for all n (no exact test for small n).
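For reference, D is the sum of squared rank differences. With R_i and S_i denoting the ranks of the i-th values in the two columns (notation chosen here for illustration), and assuming there are no ties, D and rs are related by:

$$
D = \sum_{i=1}^{n} (R_i - S_i)^2,
\qquad
r_s = 1 - \frac{6D}{n^3 - n}
$$

When ties are present, correction terms are needed; see Press et al. (1992).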

#### Kendall’s tau

This non-parametric correlation coefficient is not very commonly used. It is computed according to Press et al. (1992).

The asymptotic test is based on Kendall’s tau being approximately normal.
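A minimal sketch of this kind of computation is shown below, following the general approach described in Press et al. (1992): tau is the number of concordant minus discordant pairs, normalized by the number of pairs that are untied in each variable, and under the null hypothesis it is treated as normal with zero mean and variance (4n+10)/(9n(n-1)). The function name is hypothetical, and whether Past matches this in every detail is an assumption.

```python
from math import erfc, sqrt


def kendall_tau(x, y):
    """Return (tau, z, two-tailed p) using the normal approximation."""
    n = len(x)
    n1 = n2 = 0  # pairs untied in x / untied in y
    score = 0    # concordant pairs minus discordant pairs
    for j in range(n - 1):
        for k in range(j + 1, n):
            dx = x[j] - x[k]
            dy = y[j] - y[k]
            prod = dx * dy
            if prod != 0:
                # Neither variable is tied for this pair.
                n1 += 1
                n2 += 1
                score += 1 if prod > 0 else -1
            else:
                # At least one variable is tied; count the untied one, if any.
                if dx != 0:
                    n1 += 1
                if dy != 0:
                    n2 += 1
    tau = score / sqrt(n1 * n2)
    var = (4 * n + 10) / (9 * n * (n - 1))
    z = tau / sqrt(var)
    p = erfc(abs(z) / sqrt(2))  # two-tailed normal probability
    return tau, z, p


tau, z, p = kendall_tau([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
```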

#### Polyserial correlation

This correlation is only carried out if the second column consists of integers with a range less than 100. It is designed for correlating a normally distributed continuous/interval variable (first column) with an ordinal variable (second column) that bins a normally distributed variable. For example, the second column could contain the numbers 1-3 coding for “small”, “medium” and “large”. There would typically be more “medium” than “small” or “large” values because of the underlying normal distribution of sizes.

Past uses the two-step algorithm of Olsson et al. (1982).

#### Partial linear correlation

Using this option, for each pair of columns, the linear correlation is computed while controlling for all the remaining columns. For example, with three columns A, B, C the correlation AB is controlled for C; AC is controlled for B; BC is controlled for A. The partial linear correlation can be defined as the correlation of the residuals after regression on the controlling variable(s). The significance is estimated with a t test with n-2-k degrees of freedom, where k is the number of controlling variables.
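For the three-column case, the residual-based definition reduces to a well-known closed form in the pairwise Pearson correlations. The sketch below illustrates that special case only; the function names are hypothetical, and Past's general algorithm for more than one controlling variable is not shown here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's r of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(a, b, c):
    """Correlation of a and b, controlling for the single variable c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def partial_t(r, n, k=1):
    """Return (t statistic, degrees of freedom) with df = n - 2 - k."""
    df = n - 2 - k
    return r * sqrt(df / (1 - r * r)), df


# Illustrative data (made up): a and b both track c closely.
a = [2.0, 4.1, 5.9, 8.3, 9.8, 12.4]
b = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]
c = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
r_ab_c = partial_corr(a, b, c)
t_stat, df = partial_t(r_ab_c, len(a))
```

The t statistic would then be compared against Student's t distribution with the returned degrees of freedom.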

#### Phi coefficient

The phi coefficient (Lovell et al. 2015) was designed for compositional (relative) data such as percentages. The usual correlation coefficients can be misleading for such data. The coefficient measures the degree of proportionality; the closer the value is to zero, the more the variables exhibit a proportional relationship. Pairs of variables can show strong correlations but low proportionality when they are linearly related with a non-zero intercept term.
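The contrast between correlation and proportionality can be illustrated with a small sketch. It assumes the definition phi = var(log(x/y)) / var(log x) given by Lovell et al. (2015), applied here to plain logarithms of strictly positive values. Lovell et al. work with centred log-ratio transformed data, and Past's preprocessing is not stated, so this may not reproduce Past's numbers. The function names are hypothetical.

```python
from math import log, sqrt


def variance(v):
    """Sample variance (n - 1 denominator)."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


def pearson(x, y):
    """Pearson's r of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def phi(x, y):
    """Assumed form: variance of the log-ratio over variance of log x."""
    log_x = [log(a) for a in x]
    log_ratio = [log(a / b) for a, b in zip(x, y)]
    return variance(log_ratio) / variance(log_x)


x = [1, 2, 3, 4, 5]
proportional = [2 * a for a in x]   # y = 2x: proportional to x
shifted = [a + 10 for a in x]       # y = x + 10: linear, non-zero intercept

phi_prop = phi(x, proportional)     # zero: the ratio x/y is constant
phi_shift = phi(x, shifted)         # clearly above zero
r_shift = pearson(x, shifted)       # r is 1 despite the lack of proportionality
```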

#### Tetrachoric correlation

Tetrachoric correlation is appropriate when both variables are binary (0/1) but reflect underlying quantities on a continuous scale. Past uses an accurate approximation due to Bonett & Price (2005). The standard error of this estimate is calculated by eq. (9) in Bonett & Price (2005), and a p value is then estimated by a simple two-sided Z test. For small sample sizes, the permutation test calculated by Past is probably better.

#### Permutation tests

Monte Carlo permutation tests (N=9999) are available for all the correlation coefficients except polyserial and partial correlation.
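A Monte Carlo permutation test of this kind might look like the sketch below, shown for Pearson's r. The function names, the two-tailed comparison on |r|, and the (hits + 1) / (N + 1) convention for the p value are assumptions made here; Past's exact procedure is not documented in this section.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson's r of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p(x, y, n_perm=9999, seed=None):
    """Two-tailed Monte Carlo p value from random shuffles of x."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Tolerance guards against round-off when comparing equal values.
        if abs(pearson(shuffled, y)) >= observed - 1e-12:
            hits += 1
    # Counting the observed arrangement as one permutation (assumed convention).
    return (hits + 1) / (n_perm + 1)


x = list(range(1, 11))
y = [2 * a + 1 for a in x]
p = permutation_p(x, y, seed=1)
```

With N = 9999 and this convention, the smallest attainable p value is 1/10000.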

#### Correlation table plots

The correlation table can be plotted with a number of options. The “Ellipses” function shows the correlation coefficients as ellipses with a major axis of unit length and a minor axis according to Schilling (1984).

#### References

Bonett, D.G. & Price, R.M. 2005. Inferential methods for the tetrachoric correlation coefficient. Journal of Educational and Behavioral Statistics 30:213-225.

Lovell, D., Pawlowsky-Glahn, V., Egozcue, J.J., Marguerat, S. & Bähler, J. 2015. Proportionality: A valid alternative to correlation for relative data. PLoS Computational Biology 11(3): e1004075.

Olsson, U., Drasgow, F. & Dorans, N.J. 1982. The polyserial correlation coefficient. Psychometrika 47:337-347.

Press, W.H., Teukolsky, S.A., Vetterling, W.T. & Flannery, B.P. 1992. Numerical Recipes in C. Cambridge University Press.

Schilling, M.F. 1984. Some remarks on quick estimation of the correlation coefficient. The American Statistician 38:330.

Published Aug. 31, 2020 7:53 PM - Last modified Oct. 31, 2021 10:57 PM