Intro to Data Analysis with R: Exercise 2 Solutions

Introduction

This set of exercises picks up where we left off with the Auto MPG data in Exercise 1. We now focus on exploratory data analysis.

Part A

A.1 Open a new R script and save it in the directory you created in Part A.1 of Exercise 1. Then, load the Auto MPG data set with the additional variable diesel using the load() function and file path from the end of Exercise 1.

load(here::here("data", "auto_mpg_v2.Rda"))

A.2 Using the summary() function, look at descriptive statistics for the Auto MPG data. What do you notice?

summary(auto)

##       mpg        cyl          disp             hp             weight    
##  Min.   : 9.00   3:  4   Min.   : 68.0   Min.   : 46.00   Min.   :1613  
##  1st Qu.:17.50   4:207   1st Qu.:105.0   1st Qu.: 75.75   1st Qu.:2226  
##  Median :23.00   5:  3   Median :151.0   Median : 95.00   Median :2822  
##  Mean   :23.51   6: 84   Mean   :194.8   Mean   :105.08   Mean   :2979  
##  3rd Qu.:29.00   8:108   3rd Qu.:302.0   3rd Qu.:130.00   3rd Qu.:3618  
##  Max.   :46.60           Max.   :455.0   Max.   :230.00   Max.   :5140  
##  NA's   :8                               NA's   :6                      
##       acc           model.yr         origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:406        
##  1st Qu.:13.70   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.52   Mean   :75.92   Mean   :1.569                     
##  3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000                     
##                                                                    
##  diesel 
##  0:399  
##  1:  7  
##         
##         
##         
##         
##
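
A few things stand out in this output. mpg has 8 missing values and hp has 6. cyl and diesel are factors, so they are tabulated rather than averaged, and only 7 of the 406 cars are diesel. origin takes only values from 1 to 3 but is summarized as a number, which suggests it is a coded category. The missing values matter in Part B, because density() and median() do not drop NAs by default. The sketch below shows one way to count missing values per column. The data frame demo and its values are made up so the snippet runs on its own; it is a stand-in for auto, not the real data.

```r
# Hypothetical stand-in for `auto`; the values are invented for illustration.
demo <- data.frame(
  mpg = c(18, NA, 16, 24, NA),
  hp  = c(130, 165, NA, 95, 88),
  cyl = factor(c(8, 8, 8, 4, 4))
)

# is.na() returns a logical matrix; colSums() counts the TRUEs in each column.
na_counts <- colSums(is.na(demo))
na_counts

# Number of rows with no missing value in any column.
n_complete <- sum(complete.cases(demo))
n_complete
```

With the real data, colSums(is.na(auto)) should agree with the NA's rows in the summary output.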

Part B

B.1 Plot a relative frequency histogram of the response variable, MPG. Color the bars grey. Make sure to name the plot and axes.

hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")

B.2 Add a density curve to the histogram you plotted in B.1 using the lines() and density() functions. Color it red using the col argument. HINT: You need to change the value of na.rm in the density function to avoid an error. You also need to make sure that the histogram is on the proper scale!

hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col="red")
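
Both hints in B.2 can be checked directly. The sketch below uses a made-up vector, mpg_demo, as a stand-in for auto$mpg. It shows that density() fails on missing values by default, and that the density-scale bar heights of a histogram enclose a total area of 1, which is the same scale as the density curve.

```r
# Hypothetical values standing in for auto$mpg, including missing entries.
mpg_demo <- c(18, 15, NA, 16, 17, 26, 25, 24, NA, 31)

# density() stops with an error on NAs unless na.rm = TRUE;
# tryCatch() captures the error message instead of halting the script.
err <- tryCatch(density(mpg_demo), error = function(e) conditionMessage(e))
err

# hist() drops NAs on its own. Its $density component holds the bar heights
# drawn when freq = FALSE; height times bin width sums to 1.
h <- hist(mpg_demo, plot = FALSE)
total_area <- sum(h$density * diff(h$breaks))
total_area
```

If the histogram were drawn with counts (freq = TRUE), the bars would be far taller than the density curve, and the curve would look flat along the bottom of the plot.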

B.3 Add a vertical line to the plot from B.2 at the median of MPG. Make it dashed using the lty argument, and make it red. HINT: You may need to use na.rm again.

hist(auto$mpg, freq=FALSE, col = "grey", main = "Histogram of MPG", xlab = "MPG")
lines(density(auto$mpg, na.rm=TRUE), col="red")
abline(v = median(auto$mpg, na.rm=TRUE), col = "red", lwd = 2, lty=2)
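
The na.rm hint applies to median() as well. As a small illustration, using a made-up vector rather than the real data, the default call returns NA when any value is missing, so abline() would have no position at which to draw the line.

```r
# Hypothetical values standing in for auto$mpg.
mpg_demo <- c(18, 15, NA, 16, 17)

# With the default na.rm = FALSE, a single NA makes the result NA.
med_default <- median(mpg_demo)
med_default

# Dropping NAs gives the median of 15, 16, 17, 18.
med_clean <- median(mpg_demo, na.rm = TRUE)
med_clean
```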

Part C

Create a boxplot of MPG grouped by the number of cylinders. Color the boxes grey. Make sure to name the plot and axes.

boxplot(mpg ~ cyl, data = auto, col = "grey",
        main = "Distribution of MPG by Number of Cylinders",
        xlab = "Cylinders", ylab = "MPG")
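
When reading the boxplot, keep the group sizes in mind: the summary in A.2 shows only 4 three-cylinder and 3 five-cylinder cars, so those two boxes rest on very few observations. The sketch below is one way to put numbers next to the plot. It assumes auto has been loaded as in A.1; if it has not, the snippet falls back to R's built-in mtcars data (which also has mpg and cyl columns) purely so that it runs on its own.

```r
# Use `auto` if it is loaded; otherwise fall back to the built-in mtcars
# data set as a stand-in (an assumption made only to keep this runnable).
if (exists("auto") && is.data.frame(auto)) {
  dat <- auto
} else {
  dat <- mtcars
}

# Median MPG within each cylinder group, ignoring missing MPG values.
group_medians <- tapply(dat$mpg, dat$cyl, median, na.rm = TRUE)
group_medians

# Number of cars in each cylinder group.
group_sizes <- table(dat$cyl)
group_sizes
```

The group medians correspond to the thick center lines in the boxplot.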

Part D

Create a scatterplot matrix of the Auto MPG data. Include the following variables: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”. What relationships do you see? Is there any confounding? Which variables should be included in a regression?
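
One possible approach, offered as a sketch rather than the official solution, is to pass the selected columns to pairs() and then look at the pairwise correlations. Predictors that are strongly correlated with each other (engine displacement, horsepower, and weight are natural candidates to check) carry overlapping information, which is relevant when deciding what to include in a regression. The snippet assumes auto is loaded as in A.1; otherwise it falls back to the built-in mtcars data with its own column names, only so that it runs on its own.

```r
# Use the requested columns of `auto` if it is loaded; otherwise fall back to
# comparable columns of the built-in mtcars data (a stand-in, not the real data).
if (exists("auto") && is.data.frame(auto)) {
  dat <- auto[, c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr")]
} else {
  dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "qsec")]
}

# data.matrix() converts factors to their integer level codes, so a factor
# such as cyl is plotted by level order (1, 2, ...) rather than by its labels.
dat <- data.matrix(dat)

# Scatterplot matrix of every pair of variables.
pairs(dat, main = "Scatterplot Matrix of Auto MPG Variables")

# Pairwise correlations, using all rows available for each pair of columns.
cor_mat <- cor(dat, use = "pairwise.complete.obs")
round(cor_mat, 2)
```

Correlations only measure linear association, so curved patterns in the scatterplots should be judged from the plot itself.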