The first set of exercises will deal with reading a dataset into R, exploring various structural and content-related features of the data, and manipulating the dataset so that it is in a form we can use later for analyses.
We will be using the Auto MPG Data Set, available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Auto+MPG
The data consists of a collection of automobile records from 1970 to 1982 containing the following variables:
Miles per Gallon (mpg)
Number of Cylinders
Engine Displacement (in cubic inches)
Horsepower
Weight (in pounds)
Acceleration
Model Year
Origin: the region the car originated from, coded as 1, 2, or 3 (ignore this)
Car Name
We will be focusing on the relationships between miles per gallon (mpg) and various other features of the car (such as model year, weight, number of cylinders, etc.).
A.1 Open a new R script to write and save your code for the exercises. Save this file in your local copy of the IDA-with-R-master directory, e.g., IDA-with-R-master/my_exercises/exercise_1.R.
A.2 Read the Auto MPG data into a data frame named auto from the following url using read.table(): https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original
HINT: Run ?read.table and read about how to use a url as a file path.
auto <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original")
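To see what read.table() does with whitespace-separated records like these without needing a network connection, here is a minimal sketch that parses two rows supplied inline through the text argument. The rows mimic the layout suggested by the head() output in A.5; the NA in the second row is inserted purely for illustration, and sample.lines and sample.df are hypothetical names.

```r
# Two records laid out like the raw file: values separated by whitespace,
# car name in double quotes. The NA is inserted here for illustration only.
sample.lines <- c(
  '18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"',
  'NA   8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"'
)

# text = makes read.table() parse character strings instead of a file or url.
# With header = FALSE (the default) the columns are named V1, V2, ..., V9.
sample.df <- read.table(text = sample.lines, header = FALSE,
                        na.strings = "NA", stringsAsFactors = FALSE)

dim(sample.df)        # 2 rows, 9 columns
is.na(sample.df$V1)   # FALSE TRUE
```

Setting stringsAsFactors = FALSE explicitly keeps the car names as character strings regardless of the R version's default.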
A.3 Rename the variables (columns) using the following names: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”, “origin”, “name”. HINT: You will need to assign to the names attribute of the data frame (i.e., names(auto)).
names(auto) <- c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr", "origin", "name")
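Because names() assignment is positional, it can help to confirm that the number of new names matches the number of columns first. The sketch below does this on a small stand-in data frame; toy and new.names are hypothetical names used only for illustration.

```r
# Toy data frame with default-style column names, standing in for auto.
toy <- data.frame(V1 = c(18, 15), V2 = c(8, 8), V3 = c("a", "b"))

new.names <- c("mpg", "cyl", "name")

# A names vector shorter than the number of columns would leave the
# remaining column names as NA, so check the lengths before assigning.
stopifnot(length(new.names) == ncol(toy))
names(toy) <- new.names

names(toy)   # "mpg" "cyl" "name"
```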
A.4 Convert cyl into a factor variable using factor(). Convert name into a character vector using as().
auto$cyl <- factor(auto$cyl)
auto$name <- as(auto$name, "character")
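A factor stores its values as integer codes that point into a set of level labels, which matters if we ever need the original numbers back. The following is a small sketch on made-up cylinder counts; cyl.raw and cyl.f are hypothetical names.

```r
cyl.raw <- c(8, 4, 6, 4, 8)
cyl.f <- factor(cyl.raw)

levels(cyl.f)       # "4" "6" "8": sorted unique values, stored as strings
as.integer(cyl.f)   # 3 1 2 1 3: internal level codes, not cylinder counts

# To recover the original numbers, convert through the level labels.
as.numeric(as.character(cyl.f))   # 8 4 6 4 8
```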
A.5 Use the head() function to look at the first few rows of the data and confirm that it loaded correctly. You can compare the output here to the raw data by opening the url in A.2.
head(auto)
## mpg cyl disp hp weight acc model.yr origin name
## 1 18 8 307 130 3504 12.0 70 1 chevrolet chevelle malibu
## 2 15 8 350 165 3693 11.5 70 1 buick skylark 320
## 3 18 8 318 150 3436 11.0 70 1 plymouth satellite
## 4 16 8 304 150 3433 12.0 70 1 amc rebel sst
## 5 17 8 302 140 3449 10.5 70 1 ford torino
## 6 15 8 429 198 4341 10.0 70 1 ford galaxie 500
B.1 Locate the observations (rows) with missing data using is.na(). HINT: You may want to use which() with arr.ind=TRUE to return the (row, column) locations of the missing values.
missing <- which(is.na(auto), arr.ind = TRUE)
missing
## row col
## [1,] 11 1
## [2,] 12 1
## [3,] 13 1
## [4,] 14 1
## [5,] 15 1
## [6,] 18 1
## [7,] 40 1
## [8,] 368 1
## [9,] 39 4
## [10,] 134 4
## [11,] 338 4
## [12,] 344 4
## [13,] 362 4
## [14,] 383 4
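To make the role of arr.ind concrete, here is a minimal sketch on a three-row stand-in data frame (toy and na.mat are hypothetical names). Without arr.ind, which() returns single positions counted down the columns; with it, each missing value is reported as a (row, column) pair.

```r
toy <- data.frame(mpg = c(18, NA, 15), hp = c(130, 165, NA))

na.mat <- is.na(toy)              # logical matrix with the same shape as toy
which(na.mat)                     # 2 6: positions counted down the columns
which(na.mat, arr.ind = TRUE)     # (row 2, col 1) and (row 3, col 2)
colSums(na.mat)                   # number of missing values per column
```

colSums() on the logical matrix gives a quick per-variable count, which is useful when answering B.3.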
B.2 Look at the observations with missing values by subsetting the auto data frame.
auto[missing[,1], ]
## mpg cyl disp hp weight acc model.yr origin
## 11 NA 4 133 115 3090 17.5 70 2
## 12 NA 8 350 165 4142 11.5 70 1
## 13 NA 8 351 153 4034 11.0 70 1
## 14 NA 8 383 175 4166 10.5 70 1
## 15 NA 8 360 175 3850 11.0 70 1
## 18 NA 8 302 140 3353 8.0 70 1
## 40 NA 4 97 48 1978 20.0 71 2
## 368 NA 4 121 110 2800 15.4 81 2
## 39 25.0 4 98 NA 2046 19.0 71 1
## 134 21.0 6 200 NA 2875 17.0 74 1
## 338 40.9 4 85 NA 1835 17.3 80 2
## 344 23.6 4 140 NA 2905 14.3 80 1
## 362 34.5 4 100 NA 2320 15.8 81 2
## 383 23.0 4 151 NA 3035 20.5 82 1
## name
## 11 citroen ds-21 pallas
## 12 chevrolet chevelle concours (sw)
## 13 ford torino (sw)
## 14 plymouth satellite (sw)
## 15 amc rebel sst (sw)
## 18 ford mustang boss 302
## 40 volkswagen super beetle 117
## 368 saab 900s
## 39 ford pinto
## 134 ford maverick
## 338 renault lecar deluxe
## 344 ford mustang cobra
## 362 renault 18i
## 383 amc concord dl
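In the output above every row index appears once, because no car is missing both mpg and hp. If a row had more than one missing value, its index would appear more than once in the which() result and the row would be printed repeatedly. The sketch below, on a hypothetical toy data frame, shows two ways to avoid that.

```r
toy <- data.frame(mpg = c(18, NA, NA), hp = c(130, NA, 150))

locs <- which(is.na(toy), arr.ind = TRUE)
locs[, "row"]                        # 2 3 2: row 2 has two missing values

# Option 1: drop repeated row indices before subsetting.
by.index <- toy[unique(locs[, "row"]), ]

# Option 2: complete.cases() flags rows with no missing values at all.
by.cases <- toy[!complete.cases(toy), ]

by.index
by.cases
```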
B.3 Which variables are missing? What are the implications of this missingness?
MPG is the response variable, so we will need to predict these missing values after fitting models.
Horsepower is a predictor variable, so we need to investigate its relationship with MPG before deciding how to handle its missing values.
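One possible way to act on this, sketched here on a hypothetical toy data frame rather than on auto itself, is to separate the rows with a known response from the rows whose response will be predicted later. The names has.mpg, to.fit, and to.predict are assumptions made for illustration, not part of the exercises.

```r
toy <- data.frame(mpg = c(18, NA, 25, 21),
                  hp  = c(130, 165, NA, 110))

has.mpg <- !is.na(toy$mpg)

to.fit     <- toy[has.mpg, ]    # rows usable for fitting a model of mpg
to.predict <- toy[!has.mpg, ]   # rows whose mpg would be predicted later

nrow(to.fit)             # 3
nrow(to.predict)         # 1
sum(is.na(to.fit$hp))    # 1: a predictor is still missing in the fitting rows
```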
C.1 Sort the Auto MPG data in descending order by mpg and store the result in a data frame named auto.sorted. HINT: You will need to use order() with na.last=NA so that the rows with missing mpg are not in the sorted data frame.
sort.index <- order(auto$mpg, decreasing = TRUE, na.last = NA)
auto.sorted <- auto[sort.index, ]
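The effect of na.last is easiest to see on a short vector. In this sketch (x is a hypothetical vector), the default keeps the missing value and places it last, while na.last = NA drops it from the index altogether.

```r
x <- c(15, NA, 46.6, 9)

order(x, decreasing = TRUE)                     # 3 1 4 2: NA placed last
order(x, decreasing = TRUE, na.last = NA)       # 3 1 4: NA dropped

x[order(x, decreasing = TRUE, na.last = NA)]    # 46.6 15.0 9.0
```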
C.2 Look at the observations with the top five values for mpg using head().
head(auto.sorted, 5)
## mpg cyl disp hp weight acc model.yr origin name
## 330 46.6 4 86 65 2110 17.9 80 3 mazda glc
## 337 44.6 4 91 67 1850 13.8 80 3 honda civic 1500 gl
## 333 44.3 4 90 48 2085 21.7 80 2 vw rabbit c (diesel)
## 403 44.0 4 97 52 2130 24.6 82 2 vw pickup
## 334 43.4 4 90 48 2335 23.7 80 2 vw dasher (diesel)
C.3 Look at the observations with the bottom five values for mpg using tail().
tail(auto.sorted, 5)
## mpg cyl disp hp weight acc model.yr origin name
## 111 11 8 400 150 4997 14.0 73 1 chevrolet impala
## 132 11 8 350 180 3664 11.0 73 1 oldsmobile omega
## 32 10 8 360 215 4615 14.0 70 1 ford f250
## 33 10 8 307 200 4376 15.0 70 1 chevy c20
## 35 9 8 304 193 4732 18.5 70 1 hi 1200d
C.4 Do you notice any patterns with these two groups? HINT: You may need to do some Googling about these vehicles.
Answers will vary.
D.1 Locate the observations with diesel engines. HINT: If a vehicle has a diesel engine, it will mention “diesel” in the name of the car. Use the grep() function to accomplish this.
diesel.index <- grep("diesel", auto$name)
diesel.index
## [1] 252 333 334 335 367 369 396
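grep() has a few variants worth knowing when searching names. The sketch below uses three made-up entries; the capitalized third name is hypothetical and is there only to show ignore.case, since the names in this data set appear in lower case.

```r
car.names <- c("vw rabbit c (diesel)", "ford pinto", "Peugeot 504 Diesel")

grep("diesel", car.names)                       # 1: matching positions
grep("diesel", car.names, ignore.case = TRUE)   # 1 3
grep("diesel", car.names, value = TRUE)         # the matching strings
grepl("diesel", car.names)                      # TRUE FALSE FALSE
```

grepl() returns one logical value per element, which is convenient when building an indicator column.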
D.2 Create a new variable (column) in the auto data frame called diesel such that auto$diesel = 1 if the car has a diesel engine and 0 otherwise.
auto$diesel <- 0
auto$diesel[diesel.index] <- 1
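The same indicator can also be built in a single step from grepl(). This is a sketch of an alternative on a hypothetical toy data frame, not a replacement for the solution; diesel2 is a name introduced only for comparison.

```r
toy <- data.frame(name = c("vw rabbit c (diesel)", "ford pinto",
                           "vw dasher (diesel)"),
                  stringsAsFactors = FALSE)

# Two-step version: fill with 0, then overwrite the matching rows with 1.
toy$diesel <- 0
toy$diesel[grep("diesel", toy$name)] <- 1

# One-step version: grepl() gives TRUE/FALSE per row and
# as.numeric() maps TRUE to 1 and FALSE to 0.
toy$diesel2 <- as.numeric(grepl("diesel", toy$name))

identical(toy$diesel, toy$diesel2)   # TRUE
```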
D.3 Coerce auto$diesel into a factor variable using factor().
auto$diesel <- factor(auto$diesel)
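If more readable levels than "0" and "1" are wanted, factor() accepts levels and labels arguments. The sketch below is optional and uses made-up values; the labels "gas" and "diesel" are an assumption chosen for illustration, and the later exercises may expect the 0/1 levels.

```r
diesel.raw <- c(0, 1, 0, 0)

levels(factor(diesel.raw))   # "0" "1"

# levels = gives the values to look for; labels = gives their display names.
diesel.f <- factor(diesel.raw, levels = c(0, 1), labels = c("gas", "diesel"))

levels(diesel.f)   # "gas" "diesel"
table(diesel.f)    # gas: 3, diesel: 1
```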
D.4 Look at the structure of the auto data frame using str() to make sure that this was done correctly.
str(auto)
## 'data.frame': 406 obs. of 10 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cyl : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ disp : num 307 350 318 304 302 429 454 440 455 390 ...
## $ hp : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acc : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model.yr: num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## $ diesel : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
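Besides reading the str() output by eye, the column types can be checked programmatically. Here is a small sketch on a hypothetical toy data frame with one column of each type used in these exercises.

```r
toy <- data.frame(mpg = c(18, 15), name = c("a", "b"),
                  stringsAsFactors = FALSE)
toy$cyl <- factor(c(8, 4))

# One class per column, returned as a named character vector.
sapply(toy, class)

# stopifnot() raises an error if any of the conditions is FALSE.
stopifnot(is.numeric(toy$mpg), is.character(toy$name), is.factor(toy$cyl))
```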
D.5 Save your data set as an R data (.Rda) file in the data directory (i.e., "IDA-with-R-master/data/auto_mpg_v2.Rda") using the save() function.
save(auto, file=here::here("data", "auto_mpg_v2.Rda"))
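To confirm that a saved file restores the object with its types intact, we can do a round trip with load(). The sketch below writes to a temporary file rather than the project's data directory; toy, rda.path, and loaded are hypothetical names.

```r
toy <- data.frame(mpg = c(18, 15), diesel = factor(c(0, 1)))

rda.path <- tempfile(fileext = ".Rda")
save(toy, file = rda.path)

rm(toy)                      # remove the object from the workspace
loaded <- load(rda.path)     # restores it under its original name

loaded                       # "toy": names of the restored objects
str(toy)                     # diesel is still a factor
```

Note that load() restores objects under the names they were saved with, so loading auto_mpg_v2.Rda later will recreate a data frame called auto.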