The first set of exercises will deal with reading a dataset into R, exploring various structural and content-related features of the data, and manipulating the dataset so that it is in a form we can use later for analyses.
We will be using the Auto MPG Data Set, available on the UCI Machine Learning Repository; https://archive.ics.uci.edu/ml/datasets/Auto+MPG
The data consists of a collection of automobile records from 1970 to 1982 containing the following variables:
Miles per Gallon (mpg)
Number of Cylinders
Engine Displacement (in cubic inches)
Horsepower
Weight (in pounds)
Acceleration
Model Year
Origin: where the data originated from (ignore this)
Car Name
We will be focusing on the relationships between miles per gallon (mpg) and various other features of the car (such as model year, weight, number of cylinders, etc.).
A.1 Open a new R script to write and save your code for the exercises. Save this file in your local copy of the IDA-with-R-master directory, e.g., IDA-with-R-master/my_exercises/exercise_1.R
.
A.2 Read in the Auto MPG data to a data frame named auto
from the following url using read.table()
: https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original HINT: Run ?read.table()
and read about how to use a url as a file path.
A.3 Rename the variables (columns) using the following conventions: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”, “origin”, “name”. HINT: You will need to use the names
attribute of the data frame (i.e., names(auto)
).
A.4 Convert cyl
into a factor variable using factor()
. Convert name
into a character vector using as()
.
A.5 Use the head()
function to look at the first few rows of the data and make sure it looks like it was correctly loaded. You can compare the output here to the raw data by opening the url in A.2.
B.1 Locate the observations (rows) with missing data using is.na()
. HINT: You may want use which()
with arr.ind=TRUE
to return the (row, column) locations of the missing values.
B.2 Look at the missing observations by subsetting the auto
data frame.
B.3 Which variables are missing? What are the implications of this missingness?
C.1 Sort the Auto MPG data in descending order by mpg and store the result into a data frame named auto.sorted
. HINT: You will need to use order()
with na.last=NA
so that the values with missing mpg are not in the sorted data frame.
C.2 Look at the observations with the top five values for mpg using head()
.
C.3 Look at the observations with the bottom five values for mpg using tail()
.
C.4 Do you notice any patterns with these two groups? HINT: You may need to do some Googling about these vehicles.
D.1 Locate the observations with diesel engines. HINT: If a vehicle has a diesel engine, it will mention “diesel” in the name of the car. Use the grep()
function do accomplish this.
D.2 Create a new variable (column) in the auto
data frame called diesel
such that auto$diesel = 1
if the car has a diesel engine and 0
, otherwise.
D.3 Coerce auto$diesel
into a factor variable using as()
.
D.4 Look at the structure of the auto
data frame using str()
to make sure that this was done correctly.
D.5 Save your data set as an R data (.Rda
) file in the data directory (i.e., "IDA-with-R-master/data/auto_mpg_v2.Rda"
) using the save()
function.