This set of exercises picks up where we left off with the Auto MPG data in Exercise 1. We now focus on exploratory data analysis.
A.1 Open a new R script and save it in the directory you created in Part A.1 of Exercise 1. Then load the Auto MPG data set, which now includes the additional variable diesel, using the load() function and the file path from the end of Exercise 1.
load(here::here("data", "auto_mpg_v2.Rda"))
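As a quick sanity check after loading, it can help to confirm which objects load() actually restored. The sketch below is self-contained and does not touch auto_mpg_v2.Rda: it saves a small made-up data frame (toy_auto, a hypothetical name) to a temporary file, then shows that load() returns the names of the restored objects.

```r
# Hypothetical stand-in data; not the real Auto MPG file.
toy_auto <- data.frame(mpg = c(18, 15, NA), diesel = factor(c(0, 0, 1)))

# Save to a temporary .Rda file, then remove the object from the workspace.
tmp <- tempfile(fileext = ".Rda")
save(toy_auto, file = tmp)
rm(toy_auto)

# load() restores the saved objects and (invisibly) returns their names.
restored <- load(tmp)
print(restored)

# Inspect the structure of the restored data frame.
str(toy_auto)
```

With the real file, the same pattern would be restored <- load(here::here("data", "auto_mpg_v2.Rda")), followed by str(auto).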
A.2 Using the summary() function, look at descriptive statistics for the Auto MPG data. What do you notice?
summary(auto)
##       mpg         cyl        disp              hp            weight
##  Min.   : 9.00   3:  4   Min.   : 68.0   Min.   : 46.00   Min.   :1613
##  1st Qu.:17.50   4:207   1st Qu.:105.0   1st Qu.: 75.75   1st Qu.:2226
##  Median :23.00   5:  3   Median :151.0   Median : 95.00   Median :2822
##  Mean   :23.51   6: 84   Mean   :194.8   Mean   :105.08   Mean   :2979
##  3rd Qu.:29.00   8:108   3rd Qu.:302.0   3rd Qu.:130.00   3rd Qu.:3618
##  Max.   :46.60           Max.   :455.0   Max.   :230.00   Max.   :5140
##  NA's   :8                               NA's   :6
##       acc          model.yr         origin          name
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:406
##  1st Qu.:13.70   1st Qu.:73.00   1st Qu.:1.000   Class :character
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character
##  Mean   :15.52   Mean   :75.92   Mean   :1.569
##  3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000
##  Max.   :24.80   Max.   :82.00   Max.   :3.000
##  diesel
##  0:399
##  1:  7
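One thing worth noticing in the output above is the NA's entries. A minimal sketch of how missing values could be counted directly is shown below. It uses a small made-up data frame (toy_auto, hypothetical) rather than the real auto object, so the counts are illustrative only.

```r
# Hypothetical stand-in for the Auto MPG data, with a few missing values.
toy_auto <- data.frame(
  mpg = c(18, NA, 24, 31, NA),
  hp  = c(130, 165, NA, 65, 90),
  cyl = factor(c(8, 8, 4, 4, 6))
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy_auto))
print(na_counts)

# Number of rows with no missing values in any column.
n_complete <- sum(complete.cases(toy_auto))
print(n_complete)
```

Applied to the exercise data, colSums(is.na(auto)) should agree with the NA's rows that summary() reports.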
B.1 Plot a relative frequency histogram of the response variable, MPG. Color the bars grey. Make sure to name the plot and axes.
hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
B.2 Add a density curve to the histogram you plotted in B.1 using the lines() and density() functions. Color it red using the col argument. HINT: You need to set na.rm = TRUE in density() to avoid an error. You also need to make sure that the histogram is on the density scale (freq = FALSE) so that it matches the curve!
hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col="red")
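To see why the scale matters, the sketch below computes a histogram without drawing it and compares the two scales. The data are simulated (not the real MPG values), so only the pattern is meaningful: on the density scale the bar areas sum to 1, which is the same scale a density() curve lives on, while the counts sum to the number of non-missing observations.

```r
set.seed(1)
# Simulated values with two missing entries; not real MPG data.
x <- c(rnorm(50, mean = 23, sd = 8), NA, NA)

# Compute the histogram without drawing it; hist() drops NA values.
h <- hist(x, plot = FALSE)

# On the density scale, the bar areas (height * width) sum to 1.
area_density <- sum(h$density * diff(h$breaks))

# On the count scale, the bar heights sum to the number of non-missing values.
total_count <- sum(h$counts)

print(area_density)
print(total_count)
```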
B.3 Add a vertical line to the plot from B.2 at the median of MPG. Make it dashed using the lty argument, and make it red. HINT: You may need to use na.rm again.
hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col="red")
abline(v = median(auto$mpg, na.rm=TRUE), col = "red", lwd = 2, lty=2)
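The two na.rm hints behave differently, which the following sketch illustrates on a short made-up vector: median() quietly returns NA when a value is missing (so abline() would draw nothing), whereas density() stops with an error.

```r
x <- c(21, NA, 18, 30)  # made-up values, one of them missing

# median() returns NA when any value is missing...
med_raw <- median(x)

# ...unless missing values are removed first.
med <- median(x, na.rm = TRUE)

# density() raises an error instead of returning NA; try() captures it here.
dens_error <- inherits(try(density(x), silent = TRUE), "try-error")

print(med_raw)
print(med)
print(dens_error)
```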
Create a boxplot of MPG grouped by the number of cylinders. Color the boxes grey. Make sure to name the plot and axes.
boxplot(mpg ~ cyl, data = auto, col = "grey",
main = "Distribution of MPG by Number of Cylinders",
xlab = "Cylinders", ylab = "MPG")
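The numbers behind a grouped boxplot can also be inspected without drawing it. The sketch below uses a small made-up data frame (toy, hypothetical) and is only one possible way to do this: tapply() gives the group medians, and boxplot(..., plot = FALSE) returns the five-number summaries and group sizes used for the boxes.

```r
# Hypothetical stand-in data with one missing MPG value.
toy <- data.frame(
  mpg = c(30, 32, 28, 20, 22, 18, 14, 15, NA),
  cyl = factor(c(4, 4, 4, 6, 6, 6, 8, 8, 8))
)

# Median MPG within each cylinder group, ignoring missing values.
group_medians <- tapply(toy$mpg, toy$cyl, median, na.rm = TRUE)
print(group_medians)

# Same grouping via boxplot(), without plotting.
bp <- boxplot(mpg ~ cyl, data = toy, plot = FALSE)
print(bp$stats)  # rows: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$n)      # non-missing observations per group
```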
Create a scatterplot matrix of the Auto MPG data. Include the following variables: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”. What relationships do you see? Is there any confounding? Which variables should be included in a regression?
library(car)
scatterplotMatrix( auto[, c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr")] )
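A correlation matrix can complement the visual impression from the scatterplot matrix. The sketch below is an illustration on simulated data (the variable names mirror the exercise, but the values and coefficients are invented), using cor() with use = "complete.obs" to drop rows that contain missing values. With the real data, note that summary() shows cyl as a factor, so it would presumably need converting to numeric before being passed to cor().

```r
set.seed(42)
n <- 100

# Simulated data: disp rises with weight, mpg falls with weight.
weight <- rnorm(n, mean = 3000, sd = 800)
disp   <- 0.06 * weight + rnorm(n, mean = 0, sd = 20)
mpg    <- 45 - 0.007 * weight + rnorm(n, mean = 0, sd = 2)
toy    <- data.frame(mpg, disp, weight)

# Introduce a few missing values, as in the exercise data.
toy$mpg[1:3] <- NA

# Pairwise Pearson correlations using only complete rows.
cors <- cor(toy, use = "complete.obs")
print(round(cors, 2))
```

Strongly correlated predictors (here disp and weight, by construction) are the kind of pattern to look for when deciding which variables to include in a regression.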