Introduction

The first set of exercises will deal with reading a dataset into R, exploring various structural and content-related features of the data, and manipulating the dataset so that it is in a form we can use later for analyses.

We will be using the Auto MPG Data Set, available on the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Auto+MPG

The data consists of a collection of automobile records from 1970 to 1982 containing the following variables:

  1. Miles per Gallon (mpg)

  2. Number of Cylinders

  3. Engine Displacement (in cubic inches)

  4. Horsepower

  5. Weight (in pounds)

  6. Acceleration

  7. Model Year

  8. Origin: where the car originated (ignore this)

  9. Car Name

We will be focusing on the relationships between miles per gallon (mpg) and various other features of the car (such as model year, weight, number of cylinders, etc.).

Part A - Data Input

A.1 Open a new R script to write and save your code for the exercises. Save this file in your local copy of the IDA-with-R-master directory, e.g., IDA-with-R-master/my_exercises/exercise_1.R.

A.2 Read in the Auto MPG data to a data frame named auto from the following URL using read.table(): https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original HINT: Run ?read.table and read about how to use a URL as a file path.

auto <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original")
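
To see how read.table() parses this kind of file without a network connection, the sketch below feeds it two records inline through the text argument. The layout (whitespace-separated fields, a quoted car name, and NA for a missing value) is an assumption about the raw file; the values are taken from rows 1 and 11 of the data shown later in this exercise. The names sample.lines and auto.sample are made up for illustration.

```r
# Two records laid out the way the raw file is assumed to be:
# whitespace-separated numbers followed by a quoted car name.
sample.lines <- c(
  '18.0 8. 307.0 130.0 3504. 12.0 70. 1. "chevrolet chevelle malibu"',
  'NA 4. 133.0 115.0 3090. 17.5 70. 2. "citroen ds-21 pallas"'
)

# text = lets read.table() parse a character vector instead of a file or URL.
auto.sample <- read.table(text = sample.lines)

dim(auto.sample)        # 2 rows, 9 columns
names(auto.sample)      # default names V1 ... V9 until renamed
is.na(auto.sample$V1)   # FALSE TRUE: the literal NA is read as missing
```

The same call with the URL from A.2 in place of text = should behave the same way, since read.table() accepts a URL wherever it accepts a file path.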

A.3 Rename the variables (columns) using the following conventions: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”, “origin”, “name”. HINT: You will need to use the names attribute of the data frame (i.e., names(auto)).

names(auto) <- c("mpg", "cyl", "disp", "hp", "weight", "acc", "model.yr", "origin", "name")

A.4 Convert cyl into a factor variable using factor(). Convert name into a character vector using as().

auto$cyl <- factor(auto$cyl)
auto$name <- as(auto$name, "character")

A.5 Use the head() function to look at the first few rows of the data and make sure it looks like it was correctly loaded. You can compare the output here to the raw data by opening the URL in A.2.

head(auto)
##   mpg cyl disp  hp weight  acc model.yr origin                      name
## 1  18   8  307 130   3504 12.0       70      1 chevrolet chevelle malibu
## 2  15   8  350 165   3693 11.5       70      1         buick skylark 320
## 3  18   8  318 150   3436 11.0       70      1        plymouth satellite
## 4  16   8  304 150   3433 12.0       70      1             amc rebel sst
## 5  17   8  302 140   3449 10.5       70      1               ford torino
## 6  15   8  429 198   4341 10.0       70      1          ford galaxie 500

Part B - Missing Values

B.1 Locate the observations (rows) with missing data using is.na(). HINT: You may want to use which() with arr.ind=TRUE to return the (row, column) locations of the missing values.

missing <- which(is.na(auto), arr.ind = TRUE)
missing
##       row col
##  [1,]  11   1
##  [2,]  12   1
##  [3,]  13   1
##  [4,]  14   1
##  [5,]  15   1
##  [6,]  18   1
##  [7,]  40   1
##  [8,] 368   1
##  [9,]  39   4
## [10,] 134   4
## [11,] 338   4
## [12,] 344   4
## [13,] 362   4
## [14,] 383   4
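
The combination of is.na() and which() can be easier to follow on a small example. The sketch below uses a made-up four-row data frame (toy, with invented values) and also counts the missing values per column with colSums(), which is not part of the exercise but gives a quick summary.

```r
# Small made-up data frame with missing values in two columns.
toy <- data.frame(
  mpg = c(18, NA, 25, NA),
  hp  = c(130, 165, NA, 110)
)

# is.na() on a data frame returns a logical matrix of the same shape.
na.matrix <- is.na(toy)

# arr.ind = TRUE reports each TRUE cell as a (row, col) pair,
# scanning the matrix column by column.
toy.missing <- which(na.matrix, arr.ind = TRUE)
toy.missing

# Number of missing values in each column.
colSums(na.matrix)
```

Here the missing cells are rows 2 and 4 of column 1 and row 3 of column 2, in that order, which matches the column-by-column ordering seen in the output for auto.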

B.2 Look at the missing observations by subsetting the auto data frame.

auto[missing[,1], ]
##      mpg cyl disp  hp weight  acc model.yr origin
## 11    NA   4  133 115   3090 17.5       70      2
## 12    NA   8  350 165   4142 11.5       70      1
## 13    NA   8  351 153   4034 11.0       70      1
## 14    NA   8  383 175   4166 10.5       70      1
## 15    NA   8  360 175   3850 11.0       70      1
## 18    NA   8  302 140   3353  8.0       70      1
## 40    NA   4   97  48   1978 20.0       71      2
## 368   NA   4  121 110   2800 15.4       81      2
## 39  25.0   4   98  NA   2046 19.0       71      1
## 134 21.0   6  200  NA   2875 17.0       74      1
## 338 40.9   4   85  NA   1835 17.3       80      2
## 344 23.6   4  140  NA   2905 14.3       80      1
## 362 34.5   4  100  NA   2320 15.8       81      2
## 383 23.0   4  151  NA   3035 20.5       82      1
##                                 name
## 11              citroen ds-21 pallas
## 12  chevrolet chevelle concours (sw)
## 13                  ford torino (sw)
## 14           plymouth satellite (sw)
## 15                amc rebel sst (sw)
## 18             ford mustang boss 302
## 40       volkswagen super beetle 117
## 368                        saab 900s
## 39                        ford pinto
## 134                    ford maverick
## 338             renault lecar deluxe
## 344               ford mustang cobra
## 362                      renault 18i
## 383                   amc concord dl

B.3 Which variables are missing? What are the implications of this missingness?

The missing values are in mpg (8 observations) and hp (6 observations).

MPG is the response variable, so we will need to predict these missing values after fitting models.

Horsepower is a predictor variable, and we need to investigate its relationship with MPG.

Part C - Sorting

C.1 Sort the Auto MPG data in descending order by mpg and store the result into a data frame named auto.sorted. HINT: You will need to use order() with na.last=NA so that the values with missing mpg are not in the sorted data frame.

sort.index <- order(auto$mpg, decreasing = TRUE, na.last = NA)
auto.sorted <- auto[sort.index, ]
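
The effect of na.last = NA is easiest to see on a short vector. The following sketch uses made-up values (toy.mpg) and compares it with the default na.last = TRUE.

```r
# Made-up mpg values, one of them missing.
toy.mpg <- c(18, NA, 25, 15)

# order() returns the positions that would sort the vector.
# With na.last = NA the position of the missing value is dropped.
idx <- order(toy.mpg, decreasing = TRUE, na.last = NA)
idx            # 3 1 4
toy.mpg[idx]   # 25 18 15

# With the default na.last = TRUE the missing value is kept at the end.
order(toy.mpg, decreasing = TRUE)   # 3 1 4 2
```

Because the index is shorter than the original vector, subsetting a data frame with it removes the rows with a missing sort key as well as reordering the rest.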

C.2 Look at the observations with the top five values for mpg using head().

head(auto.sorted, 5)
##      mpg cyl disp hp weight  acc model.yr origin                 name
## 330 46.6   4   86 65   2110 17.9       80      3            mazda glc
## 337 44.6   4   91 67   1850 13.8       80      3  honda civic 1500 gl
## 333 44.3   4   90 48   2085 21.7       80      2 vw rabbit c (diesel)
## 403 44.0   4   97 52   2130 24.6       82      2            vw pickup
## 334 43.4   4   90 48   2335 23.7       80      2   vw dasher (diesel)

C.3 Look at the observations with the bottom five values for mpg using tail().

tail(auto.sorted, 5)
##     mpg cyl disp  hp weight  acc model.yr origin             name
## 111  11   8  400 150   4997 14.0       73      1 chevrolet impala
## 132  11   8  350 180   3664 11.0       73      1 oldsmobile omega
## 32   10   8  360 215   4615 14.0       70      1        ford f250
## 33   10   8  307 200   4376 15.0       70      1        chevy c20
## 35    9   8  304 193   4732 18.5       70      1         hi 1200d

C.4 Do you notice any patterns with these two groups? HINT: You may need to do some Googling about these vehicles.

Answers will vary.

Part D - String Manipulation

D.1 Locate the observations with diesel engines. HINT: If a vehicle has a diesel engine, it will mention “diesel” in the name of the car. Use the grep() function to accomplish this.

diesel.index <- grep("diesel", auto$name)
diesel.index
## [1] 252 333 334 335 367 369 396

D.2 Create a new variable (column) in the auto data frame called diesel such that auto$diesel = 1 if the car has a diesel engine and 0 otherwise.

auto$diesel <- 0
auto$diesel[diesel.index] <- 1
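
As an alternative to indexing with the positions from grep(), the indicator can be built in one step with grepl(), which returns one TRUE/FALSE per element. This is only a sketch of a different route to the same result, shown on a few car names from the sorted output in Part C; toy.names, is.diesel and toy.diesel are illustrative names.

```r
# A few car names taken from the sorted output in Part C.
toy.names <- c("mazda glc", "vw rabbit c (diesel)",
               "vw pickup", "vw dasher (diesel)")

# grep() returns the positions of the matches ...
grep("diesel", toy.names)   # 2 4

# ... while grepl() returns one TRUE/FALSE per element.
is.diesel <- grepl("diesel", toy.names)

# as.numeric() turns TRUE/FALSE into 1/0, giving the indicator directly.
toy.diesel <- as.numeric(is.diesel)
toy.diesel                  # 0 1 0 1
```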

D.3 Coerce auto$diesel into a factor variable using factor().

auto$diesel <- factor(auto$diesel)

D.4 Look at the structure of the auto data frame using str() to make sure that this was done correctly.

str(auto)
## 'data.frame':    406 obs. of  10 variables:
##  $ mpg     : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cyl     : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ disp    : num  307 350 318 304 302 429 454 440 455 390 ...
##  $ hp      : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight  : num  3504 3693 3436 3433 3449 ...
##  $ acc     : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model.yr: num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  $ diesel  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

D.5 Save your data set as an R data (.Rda) file in the data directory (i.e., "IDA-with-R-master/data/auto_mpg_v2.Rda") using the save() function.

# here::here() builds the path relative to the project root (requires the here package)
save(auto, file = here::here("data", "auto_mpg_v2.Rda"))
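
To check that a saved file can be read back, a round trip with save() and load() can be sketched as follows. It uses a made-up data frame (toy) and a temporary file rather than the project's data directory, so none of the names or paths here are the ones from the exercise.

```r
# Made-up data frame standing in for auto.
toy <- data.frame(mpg = c(18, 15), name = c("car a", "car b"))

# Write to a temporary file so nothing in the project directory is touched.
rda.path <- tempfile(fileext = ".Rda")
save(toy, file = rda.path)

# Keep a copy for comparison, remove the object, then restore it from disk.
toy.copy <- toy
rm(toy)
restored <- load(rda.path)   # load() returns the names of the restored objects

restored                     # "toy"
identical(toy, toy.copy)     # TRUE
```

Note that load() restores objects under the names they were saved with, so loading auto_mpg_v2.Rda in a later session would recreate a data frame called auto.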