One of the general goals of statistics is to understand whether there is a relationship between two variables.
Covariance measures the degree to which the deviation of one variable (X) from its mean is related to the deviation of another variable (Y) from its mean. In other words, covariance measures the joint variability of two random variables or how these increase or decrease in relation with each other.
For instance, if greater values of one variable tend to correspond with greater values of another variable, this suggests positive covariance.
Covariance can have both positive and negative values.
A Covariance Matrix shows the covariance between different variables of a data set.
Correlation tells us both the strength and the direction of this relationship by providing a correlation coefficient (Pearson, Spearman, or Kendall) for the pair as a measure of association.
A correlation matrix allows for the exploration of the correlations among multiple variables in a data set. It provides this information through a table listing the correlation coefficient (r) for each pair of variables in the data set.
Correlation is best used for multiple variables that express a linear relationship with one another.
The correlation between two variables highlights how a change in one variable relates to a change in the other.
In sum, covariance and correlation matrices sound similar because the two measures are closely related: correlation is a scaled version of covariance. In other words, correlation is the special case of covariance obtained when the data are in standardized form.
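As a quick numerical illustration of that last point, here is a minimal sketch using the built-in mtcars data (introduced later in this chapter): the covariance of z-scored variables equals their correlation.

```r
# Correlation is the covariance of standardized (z-scored) data
x <- mtcars$mpg
y <- mtcars$wt
cov(scale(x), scale(y))  # covariance after standardizing: about -0.87
cor(x, y)                # Pearson correlation: the same value
```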
18.2 About Covariance and Correlation
18.2.1 Equations, properties, and uses
Recall the equation for the variance of one variable:
\[
var(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
\]
Covariance extends this to the relation between two variables:
\[
Cov(x,y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
\]
where \(x_i\) are the values of the X-variable, \(y_i\) are the values of the Y-variable, \(\bar{x}\) is the mean of the X-variable, \(\bar{y}\) is the mean of the Y-variable, and \(n\) is the number of data points.
This is important because it lets us see the relationship between covariance and correlation in this simplified equation for the Pearson correlation coefficient, r [video explaining standardization and mathematical properties of r]:
\[
r = \frac {Cov(x,y)}{(\sqrt{var(x)})(\sqrt{var(y)})}
\]
18.2.1.2 Correlation Matrix
Correlation Properties: direction
r = 1: A perfect positive correlation.
r = 0: Zero or no correlation.
r = -1: A perfect negative correlation.
The sign of the correlation coefficient indicates the direction of the association.
Correlation Properties: strength
.1 < | r | < .3 (small / weak correlation).
.3 < | r | < .5 (medium / moderate correlation).
.5 < | r | (large / strong correlation).
The magnitude of the correlation coefficient indicates the strength of the association.
Correlation uses
To summarize the correlation among various variables of a data set.
To perform regression diagnostics (e.g., checking for multicollinearity).
Basis of other analyses (like regression models)
Correlation, r and Covariance
We will concentrate on the correlation matrix, which provides the Pearson correlation coefficient (or its variants for non-parametric data, such as Spearman and Kendall).
18.3 Data set used
We will use the R data set “mtcars” to provide examples of developing a covariance matrix, and of checking the assumptions for and developing a correlation matrix using the Pearson correlation coefficient.
# Load data
data("mtcars")
# Print sample
head(mtcars) %>%
  kable(format = "markdown", digits = 2, caption = "The mtcars data set")
The mtcars data set

|                   |  mpg | cyl | disp |  hp | drat |   wt |  qsec | vs | am | gear | carb |
|:------------------|-----:|----:|-----:|----:|-----:|-----:|------:|---:|---:|-----:|-----:|
| Mazda RX4         | 21.0 |   6 |  160 | 110 | 3.90 | 2.62 | 16.46 |  0 |  1 |    4 |    4 |
| Mazda RX4 Wag     | 21.0 |   6 |  160 | 110 | 3.90 | 2.88 | 17.02 |  0 |  1 |    4 |    4 |
| Datsun 710        | 22.8 |   4 |  108 |  93 | 3.85 | 2.32 | 18.61 |  1 |  1 |    4 |    1 |
| Hornet 4 Drive    | 21.4 |   6 |  258 | 110 | 3.08 | 3.21 | 19.44 |  1 |  0 |    3 |    1 |
| Hornet Sportabout | 18.7 |   8 |  360 | 175 | 3.15 | 3.44 | 17.02 |  0 |  0 |    3 |    2 |
| Valiant           | 18.1 |   6 |  225 | 105 | 2.76 | 3.46 | 20.22 |  1 |  0 |    3 |    1 |
18.4 Covariance Matrix Procedure
The procedure to develop a covariance matrix uses the cov() function, which takes the data frame as an argument and returns the covariance matrix as the result. It is as follows:
mtcars2 <- mtcars %>% select(1, 3, 4, 5, 6, 7)
mtcars2_cov <- cov(mtcars2)
kable(mtcars2_cov, caption = "Covariance of mtcars using the Pearson type")
Covariance of mtcars using the Pearson type

|      |         mpg |        disp |         hp |        drat |          wt |        qsec |
|:-----|------------:|------------:|-----------:|------------:|------------:|------------:|
| mpg  |   36.324103 |  -633.09721 | -320.73206 |   2.1950635 |  -5.1166847 |   4.5091492 |
| disp | -633.097208 | 15360.79983 | 6721.15867 | -47.0640192 | 107.6842040 | -96.0516815 |
| hp   | -320.732056 |  6721.15867 | 4700.86694 | -16.4511089 |  44.1926613 | -86.7700806 |
| drat |    2.195063 |   -47.06402 |  -16.45111 |   0.2858814 |  -0.3727207 |   0.0871407 |
| wt   |   -5.116685 |   107.68420 |   44.19266 |  -0.3727207 |   0.9573790 |  -0.3054816 |
| qsec |    4.509149 |   -96.05168 |  -86.77008 |   0.0871407 |  -0.3054816 |   3.1931661 |
Only the “continuous” variables of mtcars were included in this covariance matrix; cov() uses the Pearson method by default, which is meant for numeric (interval or ratio) variables.
The printout of the covariance matrix is not as easy to read, or as useful, as the correlation matrix, which has several functions that allow for visual and organizational displays of information.
Some of the results are the following:
The values along the diagonal of the matrix are simply the variances of each variable.
The variance of mpg is 36.32
The variance of disp is 15360.8
The other values in the matrix represent the covariances between the various pairs of variables. For example:
The covariance between mpg and disp is -633.10. This negative covariance indicates that fuel-efficient vehicles (high miles per gallon) tend to have lower displacement (smaller, less powerful engines).
The covariance between hp and wt is 44.19. This positive covariance indicates that as horsepower increases, so does the weight of a vehicle.
18.5 The Correlation Matrix Procedure
The correlation matrix procedure is emphasized because it provides information on the linear relationships among variables (strength and direction). This analysis is important for building regression and multivariate models, as it assists in:
1. Observing patterns and associations in the data.
2. Conducting, or serving as the basis of, other analyses (e.g., linear regression models).
3. Conducting diagnostic checks (e.g., multicollinearity).
18.5.1 Hypothesis testing in correlations
The bivariate Pearson correlation produces a sample correlation coefficient, r, which measures the strength and direction of linear relationships between pairs of continuous variables in a sample. By extension, the Pearson correlation evaluates whether there is statistical evidence for a linear relationship among the same pairs of variables in the population, represented by a population correlation coefficient, ρ (“rho”), which is unknown. The sample Pearson correlation coefficient, r, is the estimate of the unknown population coefficient ρ. Therefore:
The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is “close to zero” or “significantly different from zero”. We decide this based on the sample correlation coefficient r and the sample size n.
Two-tailed significance test:
The null hypothesis for a two-tailed significance test of a correlation is: \[
H_0: \rho = 0
\] rho is zero or approximately zero. This means that the relationship in the population is equal to 0 (“there is no association”).
The alternative hypothesis for a two-tailed significance test of a correlation is: \[
H_1: \rho \neq 0
\] rho is not zero, which means that the relationship in the population is not equal to 0 (“there is an association in the population”).
One-tailed significance test:
The null hypothesis for a one-tailed significance test of a correlation is: \[
H_0: \rho = 0
\] which means that the relationship in the population is equal to 0 (“there is no association”).
The alternative hypothesis for a one-tailed significance test of a correlation where ρ > 0 is: \[
H_1: \rho > 0
\] which means that there is a positive relationship in the population.
OR
The alternative hypothesis for a one-tailed significance test of a correlation where ρ < 0 is: \[
H_1: \rho < 0
\] which means that there is a negative relationship in the population.
These hypotheses are in terms of ρ.
18.5.2 Assumptions of a Correlation Matrix:
1. The two variables should be measured at the interval or ratio level (i.e., continuous).
a. If interested in the relationship between two ordinal (rank-based) variables, you can use "Spearman" as the type of correlation. If you have two binary variables (e.g., living/dead, black/white, success/failure), use the phi correlation coefficient.
b. If interested in a mix of variable types, one can use functions like mixedCor() in the `psych` package.
c. If interested in how the levels of a variable such as male or female affect the correlation of two variables, we conduct a partial correlation, which allows us to control for one or more additional variables using the function pcor.test() in the `ppcor` package.
2. There should exist a linear relationship between the two variables.
3. Variables should be roughly normally distributed.
4. Each observation in the dataset should have a pair of values.
5. There should be no extreme outliers in the dataset.
18.5.3 Checking Assumptions
18.5.3.0.1 Level of Measurement
Using the skimr package, we will get data set information such as the distribution (histogram) and the type/level of measurement of each variable in the mtcars data set.
The skim() function displays the data nicely. It indicates that all the variables are currently in numeric format; however, based on the descriptive information and on ?mtcars, only 6 variables meet the level-of-measurement assumption (interval or ratio). We will therefore create a data frame from mtcars, named cars_df, where we leave these 6 variables as continuous (double/numeric) and categorize the other 5 variables. We present the summary using the gtsummary package.
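The chunk that builds cars_df is not shown above; a minimal sketch, assuming the five discrete variables (cyl, vs, am, gear, carb) are the ones converted to factors:

```r
library(skimr)
library(gtsummary)
library(dplyr)

skim(mtcars)  # histograms plus the storage type of every variable

# Keep the six continuous variables as numeric; categorize the other five
# (the exact recoding used for cars_df is an assumption here)
cars_df <- mtcars %>%
  mutate(across(c(cyl, vs, am, gear, carb), as.factor))

tbl_summary(cars_df)  # descriptive summary via gtsummary
```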
Now we will examine only the 6 variables meeting the measurement assumption.
18.5.3.1 Linear Relationship
pairs(~ mpg + disp + hp + drat + wt + qsec,
      data = cars_df,
      main = "Simple Scatterplot Matrix")
Based on this simple scatterplot matrix, all variables seem to follow a linear relationship; therefore, we will examine the normality and outlier assumptions for these variables.
18.5.3.2 Normality and extreme outliers
par(mfrow = c(2, 3))
qqnorm(cars_df$mpg, main = "MPG Q-Q plot", ylab = "Miles per Gallon")
qqline(cars_df$mpg)
qqnorm(cars_df$disp, main = "Disp Q-Q plot", ylab = "Displacement")
qqline(cars_df$disp)
qqnorm(cars_df$hp, main = "hp Q-Q plot", ylab = "Gross horsepower")
qqline(cars_df$hp)
qqnorm(cars_df$drat, main = "drat Q-Q plot", ylab = "Rear axle ratio")
qqline(cars_df$drat)
qqnorm(cars_df$wt, main = "wt Q-Q plot", ylab = "Weight")
qqline(cars_df$wt)
qqnorm(cars_df$qsec, main = "qsec Q-Q plot", ylab = "1/4 Mile Time")
qqline(cars_df$qsec)
# A combination of ggqqplot and plot_annotation can also be used for the qq plot chart creation
par(mfrow = c(2, 3))
MPG_b  <- boxplot(cars_df$mpg,  main = "MPG boxplot",  ylab = "Miles per Gallon")
DISP_b <- boxplot(cars_df$disp, main = "Disp boxplot", ylab = "Displacement")
HP_b   <- boxplot(cars_df$hp,   main = "hp boxplot",   ylab = "Gross HP")
DRAT_b <- boxplot(cars_df$drat, main = "drat boxplot", ylab = "drat")
WT_b   <- boxplot(cars_df$wt,   main = "wt boxplot",   ylab = "Weight")
qsec_b <- boxplot(cars_df$qsec, main = "qsec boxplot", ylab = "1/4 Mile Time")
par(mfrow = c(2, 3))
MPG_h  <- hist(cars_df$mpg,  main = "MPG Histogram",  xlab = "Miles per Gallon")
DISP_h <- hist(cars_df$disp, main = "disp Histogram", xlab = "Displacement")
HP_h   <- hist(cars_df$hp,   main = "hp Histogram",   xlab = "Gross HP")
DRAT_h <- hist(cars_df$drat, main = "drat Histogram", xlab = "drat")
WT_h   <- hist(cars_df$wt,   main = "wt Histogram",   xlab = "Weight")
QSEC_h <- hist(cars_df$qsec, main = "qsec Histogram", xlab = "1/4 Mile Time")
Important procedures
We will run the correlation matrix using a Pearson correlation for these data without first resolving the issues found with non-normality and extreme outliers, in order to demonstrate the procedural and visual methods used to develop correlation matrices. However:
Normality of variables: the Q-Q plots, boxplots, and histograms visually highlight that most of the variables have non-normality issues. For some variables, the departure from normality may not be too extreme for use in a Pearson correlation. However, for three of the variables (displacement (disp), horsepower (hp), and weight (wt)), the violation may require one of the following strategies, because non-normally distributed data inflate the significance of r, and outliers can have great influence on Pearson correlations, potentially producing misleading results (like no association or a very low association):
Strategies for non-normal distributions: data transformations (such as log), Spearman’s rho correlation, or Kendall’s tau correlation.
Presence of outliers: the Q-Q plots and boxplots visually highlight outliers that should be further analyzed, as these can have great influence on Pearson correlations and might produce misleading results (like no association or a very low association):
Strategies for outliers: removal (if a measurement failure or a collection error is confirmed), the use of a Spearman correlation, or adjustment/revision. A brief sketch of two of these strategies follows.
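Here is a minimal sketch of the two strategies, applied to two of the flagged variables (hp and wt):

```r
# Rank-based Spearman correlation: robust to outliers and skew
cor(mtcars2$hp, mtcars2$wt, method = "spearman")

# Pearson correlation after a log transformation to reduce right skew
cor(log(mtcars2$hp), log(mtcars2$wt))
```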
18.5.4 Computation of a Correlation Matrix
The function in R to create a basic correlation matrix is cor(). This function uses “pearson” as the default coefficient/method; the other two methods (for ranked correlation matrices) are “kendall” and “spearman”.
If the data set has missing values, one can set the use argument to “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”. The easiest and most commonly used approach is use = "complete.obs", in which missing values are handled by casewise deletion (and if there are no complete cases, an error results).
Our data set (mtcars2) is composed of only the 6 “continuous” variables of the mtcars data set and has no missing values; therefore we can use the default “pearson” method and do not need to specify a use argument for missing values.
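The chunk that computed cars_cor is not shown above; presumably it was created along these lines (rounding to two decimals matches the table below):

```r
# Pearson correlation matrix of the six continuous variables
cars_cor <- round(cor(mtcars2), 2)
cars_cor
```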
We can also display the lower triangle of the Pearson coefficient matrix by using the upper.tri() function.
upper <- cars_cor
upper[upper.tri(cars_cor)] <- ""
upper <- as.data.frame(upper)
kable(upper, caption = "Lower triangle of matrix correlation")
Lower triangle of matrix correlation

|      |   mpg |  disp |    hp |  drat |    wt | qsec |
|:-----|------:|------:|------:|------:|------:|-----:|
| mpg  |     1 |       |       |       |       |      |
| disp | -0.85 |     1 |       |       |       |      |
| hp   | -0.78 |  0.79 |     1 |       |       |      |
| drat |  0.68 | -0.71 | -0.45 |     1 |       |      |
| wt   | -0.87 |  0.89 |  0.66 | -0.71 |     1 |      |
| qsec |  0.42 | -0.43 | -0.71 |  0.09 | -0.17 |    1 |
Another, easier way to compute and visualize the lower triangle of a correlation matrix is the datasummary_correlation() function in the modelsummary package.
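A minimal sketch of that call:

```r
library(modelsummary)
# Formatted Pearson correlations; the lower triangle is shown by default
datasummary_correlation(mtcars2)
```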
However, this table is difficult to interpret because p-values for the Pearson correlation coefficients are not included. The Hmisc package has functions available to provide p-values and to format the results of a correlation matrix. The function rcorr() provides the p-values of a correlation matrix; it is necessary to convert the data frame to a matrix to run this procedure:
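```r
library(Hmisc)
mtcars2_rcorr <- rcorr(as.matrix(mtcars2))  # the data frame must become a matrix
mtcars2_rcorr$r  # correlation coefficients
mtcars2_rcorr$P  # p-values for each pair
```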
Another way in which a correlation matrix displaying p-values can be constructed is with the function corr.test() in the psych package. The defaults are alpha = 0.05 and method = “pearson”; both can be changed, for instance to alpha = 0.01 and method = “spearman”. However, the R documentation for this function indicates that the non-parametric calculations are slow.
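A sketch of the corr.test() call with its defaults spelled out:

```r
library(psych)
# method and alpha are shown explicitly; both match the defaults
corr.test(mtcars2, method = "pearson", alpha = .05)
```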
We can use the `rstatix` package to:
1. Create a Pearson correlation coefficient matrix with the function cor_mat().
2. Create a matrix of p-values with the function cor_pmat() (the p-values are not adjusted or rounded, so it is not the best display).
3. Create a visual representation of the matrix.
# Correlation matrix between all variables
mtcars2_corr <- mtcars2 %>% cor_mat()
mtcars2_corr
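A sketch of items 2 and 3 from the list above, assuming cor_plot() (rstatix's base-plot visualization of a correlation matrix) is available in your version of the package:

```r
# Matrix of (unadjusted) p-values
mtcars2 %>% cor_pmat()

# Visual representation of the correlation matrix
mtcars2_corr %>% cor_plot()
```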
The rquery.cormat() function (an STHDA helper script built on the corrplot package) also provides the results of a correlation matrix, showing r, p-values, and a visual representation. It can also provide formatted results.
The corrplot package allows us to visually format the correlation matrix as a full matrix or as the bottom or top triangle. Additionally, it can display the correlation coefficients, p-values, or a visual representation of significant values.
testmtcars2 <- cor.mtest(mtcars2, conf.level = 0.95)
corrplot(cars_cor, type = 'full', method = 'circle', addCoef.col = 'black')
Using add = TRUE inside the upper-triangle corrplot() call, one could stack matrices, but this method is complicated and requires adjusting sizes and labeling. To create combined full matrices it is best to use the function corrplot.mixed(), also in the corrplot package. Sadly, some of the features of corrplot are not available there, but you can still produce very informative visual output.
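A minimal corrplot.mixed() sketch using the matrix computed earlier:

```r
# Coefficients in the lower triangle, circles in the upper
corrplot.mixed(cars_cor, lower = 'number', upper = 'circle')
```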
The following matrix visualization, using the function chart.Correlation() in the PerformanceAnalytics package, provides many of the assumption tests for a correlation. The top triangle shows the (absolute) value of the correlation coefficient (the default method is “pearson”, but it can be changed to “kendall” or “spearman”) plus the result of cor.test() as stars (these can be changed to other symbols). The bottom triangle of the matrix shows the bivariate scatterplots, with a fitted line. This is a quick way to visualize correlation coefficients and significant relationships while examining assumptions. To run this function, use the data set developed with the six continuous variables of interest (mtcars2).
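A sketch of the call that likely produced the figure (histogram = TRUE draws each variable's distribution on the diagonal):

```r
library(PerformanceAnalytics)
chart.Correlation(mtcars2, histogram = TRUE, method = "pearson")
```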
* The same visualization could be built using the **pairs()** function in base R, but **pairs()** is not very user friendly for constructing a correlation matrix.
* In addition, the function **pairs.panels()** in the `psych` package returns something similar.
* Of interest, this function also allows for regression visualization, as regression lines and confidence intervals can be part of the output.
A simple correlation matrix could be run on the full data set; one could then select the variables that are significantly correlated and check whether any of them meet the assumptions.
The cpairs() function in the gclus package produces a plot matrix while ordering and coloring the subplots by correlation.
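A sketch of such a call, ordering and coloring panels by absolute correlation (dmat.color() and order.single() are the gclus helpers typically used for this):

```r
library(gclus)
cars_abs_cor <- abs(cor(mtcars2))
panel_colors <- dmat.color(cars_abs_cor)  # panel colors from correlation strength
var_order <- order.single(cars_abs_cor)   # place highly correlated pairs adjacently
cpairs(mtcars2, var_order, panel.colors = panel_colors, gap = 0.5,
       main = "mtcars2 variables ordered and colored by correlation")
```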
18.6 Other Correlation Types (Spearman and Partial)
18.6.1 Spearman (ranking of variables, non-parametric)
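The code chunk is not shown above, but the warning message records the call; it would be:

```r
# cyl and carb are discrete, so their ranks contain ties
cor.test(mtcars$cyl, mtcars$carb, method = "spearman")
```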
Warning in cor.test.default(mtcars$cyl, mtcars$carb, method = "spearman"):
Cannot compute exact p-value with ties
# This is simply an example of how a Spearman procedure is done. Although we can
# use continuous variables, the exact p-value cannot be computed with ties.
# Therefore, it is better to use ranked data in order to compute the exact p-value.
18.6.2 Partial Correlation
Partial correlation is the correlation of two variables while controlling for one or more additional variables. The partial correlation coefficient is a measure of the strength of the linear relationship between two variables after controlling for the effects of the other variables.
To conduct the partial correlation test:
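The chunk that creates cars_partial is not shown above; presumably it is a subset of mtcars holding the three variables involved. A minimal sketch:

```r
library(ppcor)
# Hypothetical subset: the three variables used in the test below
cars_partial <- mtcars %>% dplyr::select(mpg, wt, am)
```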
# partial correlation between "mpg" and "wt" given "am"
pcor.test(cars_partial$mpg, cars_partial$wt, cars_partial$am)
estimate p.value statistic n gp Method
1 -0.7835341 1.867415e-07 -6.790807 32 1 pearson
This is the calculation of the correlation between miles per gallon and weight while controlling for the variable am (transmission: 0 = automatic, 1 = manual).
18.7 Example: examining the relationship between a pair of variables using the Pearson correlation coefficient
Using the ggscatter() function in the ggpubr package, we can examine the Pearson coefficient and the related p-value for a pair of variables we found of interest in the scatterplot matrix above. We will explore the relationship between the number of miles per gallon and the weight of a vehicle.
18.7.1 Hypothesis test between mpg and wt
We will test if a vehicle’s weight is related to its fuel efficiency.
\[
H_0: \rho = 0
\]
\[
H_1: \rho \neq 0
\]
ggscatter(cars_df, x = "mpg", y = "wt",
          add = "reg.line", conf.int = TRUE,
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Miles/(US) gallon", ylab = "Weight (1000 lbs)")
We can then conduct a Pearson correlation test using the cor.test function.
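The call producing the output below is not shown; based on the `data:` line of the output, it would be:

```r
cor.test(cars_df$wt, cars_df$mpg, method = "pearson")
```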
Pearson's product-moment correlation
data: cars_df$wt and cars_df$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.9338264 -0.7440872
sample estimates:
cor
-0.8676594
To test the Hypothesis:
The test statistic t is approximately -9.56 and the corresponding p-value is about 1.3e-10.
Since the p-value is less than .05, we have sufficient evidence to say that the correlation between a vehicle’s weight and miles per gallon is statistically significant.
Computing the test statistic by hand, with r = -0.868 and n - 2 = 30 degrees of freedom:
\[
t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.868}{\sqrt{\dfrac{0.2466}{30}}} = -9.574
\]
The critical values \(-t_0 = -2.0423\) and \(t_0 = 2.0423\) bound the rejection regions for the standardized test statistic. Since t = -9.574 is less than \(-t_0 = -2.0423\), we reject the null hypothesis and conclude that there is enough evidence, at the 5% significance level, of a significant linear correlation between fuel efficiency and a vehicle’s weight.
Alternatively, we can use a table of critical values for Pearson’s r. In this two-tailed example, with df = 30 and alpha = .05, the critical value is .349. We can then make a decision: since |r| = 0.868 is greater than 0.349, we reject the null hypothesis and conclude that there is a relationship between vehicle weight and miles per gallon, r(30) = -0.868, p < .05.
Furthermore,
The direction of the relationship is negative (i.e., heavier vehicles have lower miles per gallon, or are less fuel efficient), meaning that as one variable increases, the other decreases.
The magnitude, or strength, of the association is strong (.5 < | r |).