20  Multiple Linear Regression

Author

Shelly Sinclair and Alvonee Penn

Published

May 28, 2024

20.1 Multiple Linear Regression

Multiple regression describes the relationship between two or more independent (predictor) variables and one dependent (criterion) variable. The dependent variable is modeled as a function of the independent variables, each with its own coefficient, plus a constant (intercept) term. Multiple regression therefore requires two or more predictor variables.

Equation

\[ \hat{Y} = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n + \epsilon \]

where

\(\hat{Y}\) is the predicted or expected value of the dependent variable, which is always numeric and continuous,

\(b_0\) is the value of \(\hat{Y}\) when all of the independent variables \(x_1 - x_n\) are equal to zero,

\(x_1 - x_n\) are the distinct independent or predictor variables, and

\(b_1 - b_n\) are the estimated regression coefficients. Each regression coefficient represents the change in \(\hat{Y}\) associated with a one-unit change in the respective independent variable.

In the multiple regression situation, \(b_1\), for example, is the change in \(\hat{Y}\) relative to a one unit change in \(x_1\), holding all other independent variables constant (i.e., when the remaining independent variables are held at the same value or are fixed).

Assumptions

  1. Independence: The Y-values are statistically independent of each other as well as the errors. As with simple linear regression, this assumption is violated if several Y observations are made on the same subject.

  2. Linearity: The mean value of Y for each specific combination of values of the independent variables (\(X_1, X_2...X_n\)) is a linear function of the intercept and parameters (\(\beta_0, \beta_1,...\))

  3. Homoscedasticity: The variance of Y is the same for any fixed combination of independent variables.

  4. Normal Distribution: The residual values are normally distributed with mean zero.

  5. No multicollinearity: The independent variables are not highly correlated with one another.
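
These assumptions are checked for our example model in the sections below. As a quick first pass (an optional sketch, not part of the original analysis), base R's plot() method for a fitted lm object draws the four standard diagnostic plots: residuals vs. fitted (linearity), normal Q-Q (normality), scale-location (homoscedasticity), and residuals vs. leverage (influential points).

# fit the example model and display the four base-R diagnostic plots
mod <- lm(mpg ~ hp + wt, data = mtcars)
par(mfrow = c(2, 2))  # 2 x 2 grid so all four plots show at once
plot(mod)
par(mfrow = c(1, 1))  # reset the plotting layout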

20.1.1 Packages

library(car)        # vif() and avPlots()
library(tidymodels) # broom::tidy() and related modeling helpers
library(lmtest)     # lrtest()
library(tidyverse)  # ggplot2, dplyr, tidyr
library(gt)         # gt() display tables

20.1.2 Data

The miles per gallon (mpg) a car gets (the dependent variable) is modeled as a function of its gross horsepower and weight.

data("mtcars")
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

20.1.3 Testing the Assumptions

20.1.3.1 Linearity

Is there a linear relationship between the dependent variable and each of the independent variables?

# is there a linear relationship between MPG and HP?
ggplot(data = mtcars) +
  aes(
    x = hp,
    y = mpg
  ) +
  geom_point()

# is there a linear relationship between MPG and WT?
ggplot(data = mtcars) +
  aes(
    x = wt,
    y = mpg
  ) +
  geom_point()
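
Each plot should show an approximately linear trend. As a more compact alternative (an optional sketch assuming the tidyverse loaded above), both predictors can be reshaped to long format and plotted against mpg in a single faceted figure:

# one panel per predictor (hp and wt), each plotted against mpg
mtcars %>%
  pivot_longer(cols = c(hp, wt), names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value, y = mpg)) +
  geom_point() +
  facet_wrap(~ predictor, scales = "free_x")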

20.1.3.2 Multicollinearity

Are the predictor variables correlated with one another? A common rule of thumb is that a VIF greater than 10 indicates problematic multicollinearity.

Code
mod <- lm(mpg ~ hp + wt, data = mtcars)

# checking the VIF of both predictors
vif(mod)
      hp       wt 
1.766625 1.766625 
Code
# checking the correlation between the predictors
cor(mtcars$hp, mtcars$wt)
[1] 0.6587479
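
Both VIFs are well below 10, so multicollinearity is not a concern here. As an aside (a quick check, not from the original analysis), with exactly two predictors the two VIFs are identical and equal \(1/(1 - r^2)\), where \(r\) is the correlation between the predictors:

# with two predictors, VIF = 1 / (1 - r^2); this reproduces the vif() output above
r <- cor(mtcars$hp, mtcars$wt)
1 / (1 - r^2)  # approximately 1.77 for both hp and wt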

20.1.3.3 Homoscedasticity

Checking the residual plot for homoscedasticity: the smoothed line below is relatively horizontal, suggesting roughly constant variance.

mtcars_df <- mtcars %>% 
  mutate(res_sqrt = sqrt(abs(rstandard(mod))))

ggplot(mtcars_df, aes(fitted(mod), res_sqrt)) +
  geom_point() +
  geom_smooth()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
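
The visual check can be supplemented with a formal test. Since lmtest is already loaded, a Breusch-Pagan test (an optional addition, not part of the original analysis) can be run on the fitted model; a non-significant p-value is consistent with homoscedasticity:

# Breusch-Pagan test: H0 is that the error variance is constant
bptest(mod)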

20.1.3.4 Normality

Checking normality via histogram and QQ plot of the residuals.

# histogram of the model residuals
hist(residuals(mod))

# qq plot
plot(mod, 2)
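
# optional formal check (not in the original analysis): Shapiro-Wilk test on the
# raw residuals; a non-significant p-value is consistent with normality
shapiro.test(residuals(mod))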

# cook's d -> are the outliers in the qq plot driving the relationship in the model? 
cooks.distance(mod)
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
       1.589652e-02        5.464779e-03        2.070651e-02        4.724822e-05 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
       2.736184e-04        2.155064e-02        1.255218e-02        1.677650e-02 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
       2.188702e-03        1.554996e-03        1.215737e-02        1.423008e-03 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
       1.458960e-04        6.266049e-03        2.786686e-05        1.780910e-02 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
       4.236109e-01        1.574263e-01        9.371446e-03        2.083933e-01 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
       2.791982e-02        2.087419e-02        2.751510e-02        9.943527e-03 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
       1.443199e-02        5.920440e-04        5.674986e-06        7.353985e-02 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
       8.919701e-03        5.732672e-03        2.720397e-01        5.600804e-03 
# Plot Cook's distance with a horizontal line at 4/n to see which observations
# exceed this threshold
n <- nrow(mtcars)
plot(cooks.distance(mod), main = "Cooks Distance for Influential Obs")
abline(h = 4/n, lty = 2, col = "steelblue")

Cook’s distance refers to how far, on average, predicted y-values will move if the observation in question is dropped from the data set.

We can see that four observations in the dataset exceed the 4/n threshold. We would therefore flag these observations as influential data points that may be exerting an outsized influence on the regression model.

Since there are very few extreme values according to Cook's distance, we can assume they are not the driving force behind the relationship between the predictors and the outcome.

Tentative model:

mpg = \(b_0\) + \(b_1\times\)horsepower + \(b_2\times\)weight + \(\epsilon\)

20.1.4 Hypothesis Testing

Do the horsepower and weight of a car contribute significantly to the miles per gallon of the car?

  • What is the relationship between horsepower and weight and the miles per gallon of the car?

\[ H_0: \beta_1 = \beta_2 = 0 \]

\[ H_1: \text{at least one } \beta_j \neq 0 \]

Code
car_model <- lm(mpg ~ hp + wt, data = mtcars_df)

# get the F statistic and p-value
summary(car_model) 

Call:
lm(formula = mpg ~ hp + wt, data = mtcars_df)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
Code
# creating a table of the model
car_model %>%
  tidy() %>% 
  gt()
term         estimate      std.error    statistic   p.value
(Intercept)  37.22727012   1.59878754   23.284689   2.565459e-20
hp           -0.03177295   0.00902971   -3.518712   1.451229e-03
wt           -3.87783074   0.63273349   -6.128695   1.119647e-06

The first step in interpreting the multiple regression analysis is to examine the F-statistic and its associated p-value, shown at the bottom of the model summary.

In our example, the p-value of the F-statistic is 9.109e-12, which is far less than 0.05. This means we can reject the null hypothesis and conclude that at least one of horsepower and weight predicts the miles per gallon.

To see which predictor variables are significant, we can examine the coefficients table, which shows the estimate and the associated t-statistic p-values:

Code
summary(car_model)$coefficient 
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 37.22727012 1.59878754 23.284689 2.565459e-20
hp          -0.03177295 0.00902971 -3.518712 1.451229e-03
wt          -3.87783074 0.63273349 -6.128695 1.119647e-06
Code
summary(car_model)$coefficient %>% 
  as_tibble() %>% 
  add_column(Variable = c("Intercept", "Horsepower", "Weight"), .before = "Estimate") %>% 
  gt()
Variable     Estimate      Std. Error   t value     Pr(>|t|)
Intercept    37.22727012   1.59878754   23.284689   2.565459e-20
Horsepower   -0.03177295   0.00902971   -3.518712   1.451229e-03
Weight       -3.87783074   0.63273349   -6.128695   1.119647e-06

Full model: mpg = 37.23 + (-0.03)\(\times\)horsepower + (-3.88)\(\times\)weight
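
To illustrate how the fitted equation is used, we can compute a prediction for a hypothetical car (the 150 hp and 3,000 lb values below are arbitrary illustration values, not from the original analysis):

# predicted mpg for a hypothetical car with hp = 150 and wt = 3.0 (3,000 lbs)
new_car <- data.frame(hp = 150, wt = 3.0)
predict(car_model, newdata = new_car)
# by hand: 37.23 + (-0.0318 * 150) + (-3.88 * 3.0), roughly 20.8 mpg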

For a given predictor, the t-statistic evaluates whether there is a significant association between the predictor and the outcome variable, that is, whether the beta coefficient of the predictor is significantly different from zero.

It can be seen that both horsepower and weight are significantly associated with changes in mpg.

Code
# The confidence interval of the model
confint(car_model)
                  2.5 %      97.5 %
(Intercept) 33.95738245 40.49715778
hp          -0.05024078 -0.01330512
wt          -5.17191604 -2.58374544
Code
confint(car_model) %>% 
  as_tibble() %>% 
  add_column(Variable = c("Intercept", "Horsepower", "Weight"), .before = "2.5 %") %>% 
  gt()
Variable     2.5 %         97.5 %
Intercept    33.95738245   40.49715778
Horsepower   -0.05024078   -0.01330512
Weight       -5.17191604   -2.58374544
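
Each interval is simply the estimate plus or minus the appropriate t quantile times the standard error. As a quick sanity check (not part of the original output), the 95% interval for weight can be reproduced by hand:

# reproduce the 95% CI for wt: estimate +/- t(0.975, residual df) * SE
est <- coef(car_model)["wt"]
se  <- summary(car_model)$coefficients["wt", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(car_model)) * se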

20.1.5 Likelihood Ratio Test

\(H_0\): The full model and the nested model fit the data equally well. Thus, you should use the nested model.

\(H_1\): The full model fits the data significantly better than the nested model. Thus, you should use the full model.

A likelihood ratio test compares the goodness of fit of two nested regression models.

A nested model is simply one that contains a subset of the predictor variables in the overall regression model.

We can carry out likelihood ratio tests to determine whether a model with only one predictor variable fits the data significantly worse than the model with both predictors:

#fit full model
model_full <- lm(mpg ~ hp + wt, data = mtcars)

#fit reduced model
model_reduced <- lm(mpg ~ hp, data = mtcars)

#perform likelihood ratio test for differences in models
lrtest(model_full, model_reduced)
Likelihood ratio test

Model 1: mpg ~ hp + wt
Model 2: mpg ~ hp
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   4 -74.326                         
2   3 -87.619 -1 26.586   2.52e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#fit reduced model
model_reduced <- lm(mpg ~ wt, data = mtcars)

#perform likelihood ratio test for differences in models
lrtest(model_full, model_reduced)
Likelihood ratio test

Model 1: mpg ~ hp + wt
Model 2: mpg ~ wt
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   4 -74.326                         
2   3 -80.015 -1 11.377  0.0007436 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both p-values are below 0.05, suggesting that the full model offers a significant improvement in fit over either single-predictor model.

20.1.6 Model Accuracy

In multiple linear regression, R is the correlation between the observed values of the outcome variable (y) and the fitted (i.e., predicted) values of y, and \(R^2\) is its square. For this reason, R will always be positive and will range from zero to one.

An \(R^2\) value close to 1 indicates that the model explains a large portion of the variance in the outcome variable.
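
As a quick illustration (not part of the original output), the unadjusted \(R^2\) reported by the model can be reproduced as the squared correlation between the observed and fitted mpg values:

# R-squared equals the squared correlation between observed and fitted values
cor(mtcars$mpg, fitted(car_model))^2
summary(car_model)$r.squared  # should match, approximately 0.827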

Code
summary(car_model)$adj.r.squared %>% 
  as_tibble() %>% 
  rename("Adjusted R Squared" = value) %>% 
  gt()
Adjusted R Squared
0.8148396

Our model explains roughly 81% of the variation in mpg (adjusted \(R^2\) = 0.815) using horsepower and weight.

20.2 Conclusion

# added-variable (partial regression) plots: each predictor's relationship with mpg,
# adjusting for the other predictor
avPlots(car_model)

For multiple linear regression, there are multiple independent variables affecting the outcome. The outcome must be continuous; the independent variables, however, can be numeric or categorical.

There is an inverse (negative) relationship between our predictors and the outcome variable: mpg decreases as weight increases, and higher horsepower also reduces miles per gallon (higher fuel consumption).