In this set of exercises, we will explore linear regression, variable selection, model diagnostics, and prediction.
A.1 Open a new R script and save it in the directory you created in Part A.1 of Exercise 1. Then, load the Auto MPG data set, with the additional variable diesel, using the load() function and the file path from the end of Exercise 1.
load(here::here("data", "auto_mpg_v2.Rda"))
A.2 Regress MPG on horsepower and look at the model results with the summary() function. Interpret the meaning of the coefficient of horsepower.
linFit <- lm(mpg ~ hp, data=auto)
summary(linFit)
##
## Call:
## lm(formula = mpg ~ hp, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## hp -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
We estimate that, on average, fuel economy decreases by about 0.16 mpg for each one-horsepower increase.
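The slope can also be pulled directly from the fitted object and rescaled; a small sketch, assuming the linFit object from above is still in the workspace:

```r
# coef() returns the named coefficient vector; indexing by "hp" gives
# the slope. Scaling by 50 gives the estimated change in mean mpg
# associated with a 50-horsepower increase.
slope <- coef(linFit)["hp"]
unname(slope * 50)  # about -7.9 mpg
```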
A.3 Plot the model diagnostics. Do you think this model fits the data adequately? Why or why not?
par(mfrow=c(2,2))
plot(linFit)
B.1 Add a quadratic term for horsepower and look at the model fit results. HINT: Use the I() function (which tells R to evaluate hp^2 "as is" inside the formula) along with update().
quadFit <- update(linFit, ~ . + I(hp^2), data=auto)
summary(quadFit)
##
## Call:
## lm(formula = mpg ~ hp + I(hp^2), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## hp -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(hp^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## (14 observations deleted due to missingness)
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
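The summary above can be read geometrically: a fitted quadratic flattens out where its derivative is zero. A self-contained back-of-envelope calculation using the coefficients printed above:

```r
# Setting the derivative b1 + 2*b2*hp to zero gives hp = -b1 / (2*b2).
# Coefficients copied from the summary() output:
b1 <- -0.4661896   # hp
b2 <-  0.0012305   # I(hp^2)
-b1 / (2 * b2)     # about 189 horsepower
```

So the fitted curve predicts mpg decreasing with horsepower up to roughly 189 hp and turning upward beyond that, which is worth keeping in mind when judging the fit in the next part.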
B.2 Plot the model diagnostics. Do you think this model fits the data adequately? Why or why not?
par(mfrow=c(2,2))
plot(quadFit)
B.3 Compare the model from Part A to the model you just fit using an F-test. What model do you conclude fits the data better?
anova(linFit, quadFit)
## Analysis of Variance Table
##
## Model 1: mpg ~ hp
## Model 2: mpg ~ hp + I(hp^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9385.9
## 2 389 7442.0 1 1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null hypothesis and conclude that the full model with horsepower squared fits better than the reduced model.
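The F-statistic reported by anova() can be reproduced by hand from the two residual sums of squares in the table, which makes clear what the test is comparing:

```r
# F = (drop in RSS per added parameter) / (error variance of the full
# model). RSS and df values copied from the anova() table above:
rss.lin  <- 9385.9   # mpg ~ hp,            390 residual df
rss.quad <- 7442.0   # mpg ~ hp + I(hp^2),  389 residual df
f.stat <- ((rss.lin - rss.quad) / 1) / (rss.quad / 389)
f.stat  # about 101.6, matching the anova() output
```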
B.4 Make a scatterplot of mpg versus horsepower. Add the estimated regression line from Part A using the abline() function and color it red. Add the estimated regression curve from Part B and color it blue. HINT: You will need to use the predict() and curve() functions.
plot(auto$hp, auto$mpg, pch=20, xlab="Horsepower", ylab="MPG")
abline(reg = linFit, col="red", lwd=2)
curve(predict(quadFit, data.frame(hp=x)), add=TRUE, col="blue", lwd=2)
legend("topright", legend=c("Linear Model", "Quadratic Model"), col=c("red","blue"), lty=1, lwd=2, bty="n")
C.1 Using what you’ve learned so far, fit the best possible linear model you can to predict MPG. Answers will vary. HINT: You can use automatic variable selection methods, or do so manually and compare models via adjusted \(R^2\) and F-tests.
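One automatic route the hint alludes to is backward selection by AIC with step(). A sketch, under the assumption that we first restrict to complete cases so every candidate model is fit to the same rows; the candidate column set here is one choice among many, not the required answer:

```r
# Backward AIC selection from a model with several candidate predictors.
# na.omit() keeps only complete cases so all submodels are comparable;
# trace = 0 suppresses the step-by-step printout.
autoCC <- na.omit(auto[, c("mpg", "cyl", "disp", "hp", "weight",
                           "acc", "model.yr", "diesel")])
fullFit <- lm(mpg ~ ., data = autoCC)
step(fullFit, direction = "backward", trace = 0)
```

Below is one manual model-building path instead, comparing candidate models via diagnostics and F-tests.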
modelC1 <- lm(mpg ~ weight + model.yr, data=auto)
summary(modelC1)
##
## Call:
## lm(formula = mpg ~ weight + model.yr, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8777 -2.3140 -0.1211 2.0591 14.3330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.420e+01 3.968e+00 -3.578 0.000389 ***
## weight -6.664e-03 2.139e-04 -31.161 < 2e-16 ***
## model.yr 7.566e-01 4.898e-02 15.447 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.435 on 395 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8079, Adjusted R-squared: 0.8069
## F-statistic: 830.4 on 2 and 395 DF, p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(modelC1)
auto[c(330, 334, 333), ] # look at potential outliers... top 3 mpg values
## mpg cyl disp hp weight acc model.yr origin name
## 330 46.6 4 86 65 2110 17.9 80 3 mazda glc
## 334 43.4 4 90 48 2335 23.7 80 2 vw dasher (diesel)
## 333 44.3 4 90 48 2085 21.7 80 2 vw rabbit c (diesel)
## diesel
## 330 0
## 334 1
## 333 1
auto[52, c("mpg", "weight", "model.yr")] # look at leverage point... heaviest vehicle
## mpg weight model.yr
## 52 13 5140 71
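High-leverage points like row 52 can also be found programmatically rather than by eye; hatvalues() returns each observation's leverage for a fitted model (assuming modelC1 from above):

```r
# Leverage is the diagonal of the hat matrix; large values flag unusual
# predictor combinations, e.g. the heaviest vehicle in the data.
h <- hatvalues(modelC1)
head(sort(h, decreasing = TRUE))  # largest leverages, named by row label
```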
modelC2 <- lm(mpg ~ weight + I(weight^2) + model.yr, data=auto)
summary(modelC2)
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + model.yr, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4771 -1.7156 -0.1581 1.5327 13.1433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.397e+00 3.823e+00 0.627 0.531
## weight -2.182e-02 1.426e-03 -15.296 <2e-16 ***
## I(weight^2) 2.388e-06 2.228e-07 10.718 <2e-16 ***
## model.yr 8.308e-01 4.370e-02 19.010 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.026 on 394 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8512, Adjusted R-squared: 0.8501
## F-statistic: 751.5 on 3 and 394 DF, p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(modelC2)
auto[c(330, 334, 396), c("mpg", "weight", "model.yr")] # look at potential outliers... top 3 mpg values
## mpg weight model.yr
## 330 46.6 2110 80
## 334 43.4 2335 80
## 396 38.0 3015 82
anova(modelC1, modelC2)
## Analysis of Variance Table
##
## Model 1: mpg ~ weight + model.yr
## Model 2: mpg ~ weight + I(weight^2) + model.yr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 395 4659.8
## 2 394 3607.8 1 1052 114.88 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
finalModel <- lm(mpg ~ weight + I(weight^2) + model.yr + diesel, data=auto)
summary(finalModel)
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + model.yr + diesel,
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4833 -1.6161 -0.0464 1.6154 13.5027
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.708e+00 3.503e+00 1.915 0.0562 .
## weight -2.215e-02 1.296e-03 -17.098 <2e-16 ***
## I(weight^2) 2.440e-06 2.024e-07 12.058 <2e-16 ***
## model.yr 7.784e-01 4.009e-02 19.416 <2e-16 ***
## diesel1 9.776e+00 1.061e+00 9.214 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.748 on 393 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8777, Adjusted R-squared: 0.8764
## F-statistic: 704.9 on 4 and 393 DF, p-value: < 2.2e-16
par(mfrow=c(2,2)); plot(finalModel)
auto[c(330, 403, 119), ] # look at potential outliers...
## mpg cyl disp hp weight acc model.yr origin name diesel
## 330 46.6 4 86 65 2110 17.9 80 3 mazda glc 0
## 403 44.0 4 97 52 2130 24.6 82 2 vw pickup 0
## 119 18.0 3 70 90 2124 13.5 73 3 maxda rx3 0
par(mfrow=c(1,1)); plot(auto$weight, auto$mpg, pch=20, xlab="Weight (lbs)", ylab="MPG", col=auto$diesel)
anova(modelC2, finalModel)
## Analysis of Variance Table
##
## Model 1: mpg ~ weight + I(weight^2) + model.yr
## Model 2: mpg ~ weight + I(weight^2) + model.yr + diesel
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 394 3607.8
## 2 393 2966.9 1 640.95 84.902 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
C.2 Using the model you just fit, predict the fuel economy of the 8 vehicles with missing mpg values.
missing.mpg <- auto[is.na(auto$mpg), ]
cbind(missing.mpg, fit=predict(finalModel, newdata = missing.mpg))
## mpg cyl disp hp weight acc model.yr origin
## 11 NA 4 133 115 3090 17.5 70 2
## 12 NA 8 350 165 4142 11.5 70 1
## 13 NA 8 351 153 4034 11.0 70 1
## 14 NA 8 383 175 4166 10.5 70 1
## 15 NA 8 360 175 3850 11.0 70 1
## 18 NA 8 302 140 3353 8.0 70 1
## 40 NA 4 97 48 1978 20.0 71 2
## 368 NA 4 121 110 2800 15.4 81 2
## name diesel fit
## 11 citroen ds-21 pallas 0 16.03857
## 12 chevrolet chevelle concours (sw) 0 11.29794
## 13 ford torino (sw) 0 11.53585
## 14 plymouth satellite (sw) 0 11.25280
## 15 amc rebel sst (sw) 0 12.07229
## 18 ford mustang boss 302 0 14.34709
## 40 volkswagen super beetle 117 0 27.69954
## 368 saab 900s 0 26.85690
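Point predictions alone understate uncertainty; predict() can also return 95% prediction intervals for the same eight vehicles (assuming finalModel and missing.mpg from above):

```r
# interval = "prediction" adds lwr/upr columns for new observations.
# These are wider than confidence intervals for the mean because they
# also include residual error.
predict(finalModel, newdata = missing.mpg, interval = "prediction")
```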
C.3 Write a function that performs linear contrasts: it should take the fitted linear regression object and a vector containing the difference between two predictor vectors, and output a point estimate and 95% confidence interval. Specifically, use it to compare (1) two similar vehicles that differ by 10 model years, and (2) two similar vehicles that differ by 500 pounds. NOTE: You may not have included these variables in your final model. If you didn't, choose two other variables. HINT: This problem is difficult! You will need to get the covariance matrix of the coefficients with vcov() and perform some matrix multiplications using the %*% operator. Ask for help if you make it this far!
linContr <- function(model, contr){
  contr    <- matrix(contr, nrow = 1)    # accept a plain vector or a 1-row matrix
  beta.hat <- coef(model)                # estimated coefficients
  cov.beta <- vcov(model)                # covariance matrix of the coefficients
  est      <- contr %*% beta.hat         # point estimate of the contrast
  se.est   <- sqrt(contr %*% cov.beta %*% t(contr))  # its standard error
  ci95.lo  <- est - qnorm(0.975) * se.est  # Wald-style 95% CI
  ci95.hi  <- est + qnorm(0.975) * se.est
  round(data.frame(est, se.est, ci95.lo, ci95.hi), 3)
}
linContr(finalModel,
matrix(c(0, 0, 0, 10, 0), nrow=1, ncol=5)) # 10 yr diff
## est se.est ci95.lo ci95.hi
## 1 7.784 0.401 6.998 8.569
linContr(finalModel,
matrix(c(0, 2500-3000, 2500^2-3000^2, 0, 0), nrow=1, ncol=5)) # 500 lb diff
## est se.est ci95.lo ci95.hi
## 1 4.366 0.121 4.129 4.603
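As a sanity check on the 500-pound contrast, the same point estimate can be recovered as a difference of two fitted values that differ only in weight; the other covariates are held at arbitrary fixed values (model year 76, non-diesel), since they cancel in the difference:

```r
# Two hypothetical vehicles identical except for weight. The factor
# levels for diesel must match those used when fitting the model.
newd <- data.frame(weight = c(2500, 3000), model.yr = 76,
                   diesel = factor(c("0", "0"), levels = c("0", "1")))
p <- predict(finalModel, newdata = newd)
p[1] - p[2]  # should match the contrast estimate of about 4.37
```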