Introduction

In this set of exercises, we will explore linear regression, variable selection, model diagnostics, and prediction.

Part A

A.1 Open a new R script and save it in the directory you created in Part A.1 of Exercise 1. Then, load the Auto MPG data set with the additional variable diesel using the load() function and file path from the end of Exercise 1.

load(here::here("data", "auto_mpg_v2.Rda"))

A.2 Regress MPG on horsepower and look at the model results with the summary() function. Interpret the meaning of the coefficient of horsepower.

linFit <- lm(mpg ~ hp, data=auto)
summary(linFit)
## 
## Call:
## lm(formula = mpg ~ hp, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## hp          -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

We estimate that, on average, fuel economy decreases by about 0.16 mpg for each one-horsepower increase in engine power.
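The precision of this estimate can be summarized with a confidence interval; confint() computes one directly from the fitted object. A minimal sketch, refitting the model from A.2 (it assumes the auto data frame from A.1 is loaded):

```r
# 95% confidence interval for the horsepower slope
# (assumes `auto` is loaded as in A.1)
linFit <- lm(mpg ~ hp, data = auto)
confint(linFit, parm = "hp", level = 0.95)
```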

A.3 Plot the model diagnostics. Do you think this model fits the data adequately? Why or why not?

par(mfrow=c(2,2))
plot(linFit)

Part B

B.1 Add a quadratic term for horsepower and look at the model fit results. HINT: Use the "as-is" function I() along with update().

quadFit <- update(linFit, ~ . + I(hp^2), data=auto)
summary(quadFit)
## 
## Call:
## lm(formula = mpg ~ hp + I(hp^2), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 56.9000997  1.8004268   31.60   <2e-16 ***
## hp          -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(hp^2)      0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

B.2 Plot the model diagnostics. Do you think this model fits the data adequately? Why or why not?

par(mfrow=c(2,2))
plot(quadFit)

B.3 Compare the model from Part A to the model you just fit using an F-test. Which model do you conclude fits the data better?

anova(linFit, quadFit)
## Analysis of Variance Table
## 
## Model 1: mpg ~ hp
## Model 2: mpg ~ hp + I(hp^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    390 9385.9                                  
## 2    389 7442.0  1    1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null hypothesis (F = 101.61 on 1 and 389 df, p < 2.2e-16) and conclude that the full model with the squared horsepower term fits the data better than the reduced linear model.
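Information criteria offer a complementary, non-test-based comparison: lower AIC favors a model. A sketch, refitting both models for completeness (the comparison is valid here because both fits drop the same 14 rows with missing values):

```r
# Compare the linear and quadratic fits by AIC
# (assumes `auto` is loaded as in A.1)
linFit  <- lm(mpg ~ hp, data = auto)
quadFit <- update(linFit, ~ . + I(hp^2))
AIC(linFit, quadFit)  # lower AIC indicates the preferred model
```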

B.4 Make a scatterplot of mpg versus horsepower. Add the estimated regression line from Part A using the abline() function and color it red. Then add the fitted curve from the quadratic model in B.1 and color it blue. HINT: You will need to use the predict() and curve() functions.

plot(auto$hp, auto$mpg, pch=20, xlab="Horsepower", ylab="MPG")
abline(reg = linFit, col="red", lwd=2)
curve(predict(quadFit, data.frame(hp=x)), add=TRUE, col="blue", lwd=2)
legend("topright", legend=c("Linear Model", "Quadratic Model"), col=c("red","blue"), lty=1, lwd=2, bty="n")

Part C

C.1 Using what you’ve learned so far, fit the best possible linear model you can to predict MPG. Answers will vary. HINT: You can use automatic variable selection methods, or do so manually and compare models via adjusted \(R^2\) and F-tests.
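One automated option is stepwise selection with step(), which adds and drops terms to minimize AIC. A sketch under two assumptions: rows with missing values are removed first (every candidate model must be fit to the same observations), and the name column, a unique label per vehicle, is excluded:

```r
# Stepwise selection by AIC (assumes `auto` is loaded as in A.1)
autoCC  <- na.omit(auto[, setdiff(names(auto), "name")])  # same rows for all fits
fullFit <- lm(mpg ~ ., data = autoCC)
stepFit <- step(fullFit, direction = "both", trace = 0)
formula(stepFit)
```

As always with automatic selection, the result should be checked against diagnostics and subject-matter sense rather than accepted blindly.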

modelC1 <- lm(mpg ~ weight + model.yr, data=auto)
summary(modelC1)
## 
## Call:
## lm(formula = mpg ~ weight + model.yr, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8777 -2.3140 -0.1211  2.0591 14.3330 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.420e+01  3.968e+00  -3.578 0.000389 ***
## weight      -6.664e-03  2.139e-04 -31.161  < 2e-16 ***
## model.yr     7.566e-01  4.898e-02  15.447  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.435 on 395 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8069 
## F-statistic: 830.4 on 2 and 395 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(modelC1)

auto[c(330, 334, 333), ]  # look at potential outliers... top 3 mpg values
##      mpg cyl disp hp weight  acc model.yr origin                 name
## 330 46.6   4   86 65   2110 17.9       80      3            mazda glc
## 334 43.4   4   90 48   2335 23.7       80      2   vw dasher (diesel)
## 333 44.3   4   90 48   2085 21.7       80      2 vw rabbit c (diesel)
##     diesel
## 330      0
## 334      1
## 333      1
auto[52, c("mpg", "weight", "model.yr")]  # look at leverage point... heaviest vehicle
##    mpg weight model.yr
## 52  13   5140       71
modelC2 <- lm(mpg ~ weight + I(weight^2) + model.yr, data=auto)
summary(modelC2)
## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + model.yr, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4771 -1.7156 -0.1581  1.5327 13.1433 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.397e+00  3.823e+00   0.627    0.531    
## weight      -2.182e-02  1.426e-03 -15.296   <2e-16 ***
## I(weight^2)  2.388e-06  2.228e-07  10.718   <2e-16 ***
## model.yr     8.308e-01  4.370e-02  19.010   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.026 on 394 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8512, Adjusted R-squared:  0.8501 
## F-statistic: 751.5 on 3 and 394 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(modelC2)

auto[c(330, 334, 396), c("mpg", "weight", "model.yr")]  # look at potential outliers... top 3 mpg values
##      mpg weight model.yr
## 330 46.6   2110       80
## 334 43.4   2335       80
## 396 38.0   3015       82
anova(modelC1, modelC2)
## Analysis of Variance Table
## 
## Model 1: mpg ~ weight + model.yr
## Model 2: mpg ~ weight + I(weight^2) + model.yr
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    395 4659.8                                  
## 2    394 3607.8  1      1052 114.88 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
finalModel <- lm(mpg ~ weight + I(weight^2) + model.yr + diesel, data=auto)
summary(finalModel)
## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + model.yr + diesel, 
##     data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4833 -1.6161 -0.0464  1.6154 13.5027 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.708e+00  3.503e+00   1.915   0.0562 .  
## weight      -2.215e-02  1.296e-03 -17.098   <2e-16 ***
## I(weight^2)  2.440e-06  2.024e-07  12.058   <2e-16 ***
## model.yr     7.784e-01  4.009e-02  19.416   <2e-16 ***
## diesel1      9.776e+00  1.061e+00   9.214   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.748 on 393 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8777, Adjusted R-squared:  0.8764 
## F-statistic: 704.9 on 4 and 393 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(finalModel)

auto[c(330, 403, 119), ]  # look at potential outliers...
##      mpg cyl disp hp weight  acc model.yr origin      name diesel
## 330 46.6   4   86 65   2110 17.9       80      3 mazda glc      0
## 403 44.0   4   97 52   2130 24.6       82      2 vw pickup      0
## 119 18.0   3   70 90   2124 13.5       73      3 maxda rx3      0
par(mfrow=c(1,1)); plot(auto$weight, auto$mpg, pch=20, xlab="Weight (lbs)", ylab="MPG", col=auto$diesel)

anova(modelC2, finalModel)
## Analysis of Variance Table
## 
## Model 1: mpg ~ weight + I(weight^2) + model.yr
## Model 2: mpg ~ weight + I(weight^2) + model.yr + diesel
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    394 3607.8                                  
## 2    393 2966.9  1    640.95 84.902 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

C.2 Using the model you just fit, predict the fuel economy of the 8 vehicles with missing mpg values.

missing.mpg <- auto[is.na(auto$mpg), ]
cbind(missing.mpg, fit=predict(finalModel, newdata = missing.mpg))
##     mpg cyl disp  hp weight  acc model.yr origin
## 11   NA   4  133 115   3090 17.5       70      2
## 12   NA   8  350 165   4142 11.5       70      1
## 13   NA   8  351 153   4034 11.0       70      1
## 14   NA   8  383 175   4166 10.5       70      1
## 15   NA   8  360 175   3850 11.0       70      1
## 18   NA   8  302 140   3353  8.0       70      1
## 40   NA   4   97  48   1978 20.0       71      2
## 368  NA   4  121 110   2800 15.4       81      2
##                                 name diesel      fit
## 11              citroen ds-21 pallas      0 16.03857
## 12  chevrolet chevelle concours (sw)      0 11.29794
## 13                  ford torino (sw)      0 11.53585
## 14           plymouth satellite (sw)      0 11.25280
## 15                amc rebel sst (sw)      0 12.07229
## 18             ford mustang boss 302      0 14.34709
## 40       volkswagen super beetle 117      0 27.69954
## 368                        saab 900s      0 26.85690
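predict() can also attach interval estimates: interval = "prediction" gives limits for an individual new vehicle, which are wider than a confidence interval for the mean response. A sketch, assuming finalModel and missing.mpg from the chunks above:

```r
# 95% prediction intervals for the vehicles with missing mpg
# (assumes `finalModel` and `missing.mpg` from above)
pred <- predict(finalModel, newdata = missing.mpg,
                interval = "prediction", level = 0.95)
round(pred, 2)
```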

Part D

Write a function that performs linear contrasts; that is, it takes a fitted linear regression object and a vector containing the difference between two predictor vectors, and outputs a point estimate and 95% confidence interval. Specifically, do this comparing (1) two similar vehicles that differ by 10 model years, and (2) two similar vehicles that differ by 500 pounds. NOTE: You may not have included these variables in your final model. If you didn't, choose two other variables. HINT: This problem is difficult! You will need to get the covariance matrix of the coefficients with vcov() and perform some matrix multiplications using the %*% operator. Ask for help if you make it this far!

linContr <- function(model, contr){
  beta.hat <- model$coef    # estimated coefficients
  cov.beta <- vcov(model)   # covariance matrix of the coefficients
  est <- contr %*% beta.hat                        # contrast point estimate
  se.est <- sqrt(contr %*% cov.beta %*% t(contr))  # standard error of the contrast
  ci95.lo <- est - qnorm(0.975)*se.est  # normal-based 95% CI
  ci95.hi <- est + qnorm(0.975)*se.est
  out <- data.frame(est, se.est, ci95.lo, ci95.hi)
  round(out, 3)
}

linContr(finalModel, 
         matrix(c(0, 0, 0, 10, 0), nrow=1, ncol=5))  # 10 yr diff
##     est se.est ci95.lo ci95.hi
## 1 7.784  0.401   6.998   8.569
linContr(finalModel, 
         matrix(c(0, 2500-3000, 2500^2-3000^2, 0, 0), nrow=1, ncol=5))  # 500 lb diff
##     est se.est ci95.lo ci95.hi
## 1 4.366  0.121   4.129   4.603
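One refinement worth noting: linContr() above uses normal quantiles, while inference from lm() is based on the t distribution. With 393 residual degrees of freedom the difference is negligible, but a t-based variant is exact in small samples. A sketch (the function name linContrT is illustrative):

```r
# t-based variant of linContr: same logic, t quantile with residual df
linContrT <- function(model, contr){
  beta.hat <- coef(model)             # estimated coefficients
  cov.beta <- vcov(model)             # their covariance matrix
  est    <- drop(contr %*% beta.hat)  # contrast point estimate
  se.est <- sqrt(drop(contr %*% cov.beta %*% t(contr)))
  tq <- qt(0.975, df.residual(model)) # exact t quantile
  round(data.frame(est, se.est,
                   ci95.lo = est - tq*se.est,
                   ci95.hi = est + tq*se.est), 3)
}
```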