Intro to Data Analysis with R: Exercise 1

Introduction

The first set of exercises will deal with reading a dataset into R, exploring various structural and content-related features of the data, and manipulating the dataset so that it is in a form we can use later for analyses.

We will be using the Auto MPG Data Set, available on the UCI Machine Learning Repository; https://archive.ics.uci.edu/ml/datasets/Auto+MPG

The data consists of a collection of automobile records from 1970 to 1982 containing the following variables:

Miles per Gallon (mpg)
Number of Cylinders
Engine Displacement (in cubic inches)
Horsepower
Weight (in pounds)
Acceleration
Model Year
Origin: where the data originated from (ignore this)
Car Name

We will be focusing on the relationships between miles per gallon (mpg) and various other features of the car (such as model year, weight, number of cylinders, etc.).

Part A - Data Input

A.1 Open a new R script to write and save your code for the exercises. Save this file in your local copy of the IDA-with-R-master directory, e.g., IDA-with-R-master/my_exercises/exercise_1.R.

A.2 Read in the Auto MPG data to a data frame named auto from the following url using read.table(): https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original HINT: Run ?read.table() and read about how to use a url as a file path.

A.3 Rename the variables (columns) using the following conventions: “mpg”, “cyl”, “disp”, “hp”, “weight”, “acc”, “model.yr”, “origin”, “name”. HINT: You will need to use the names attribute of the data frame (i.e., names(auto)).

A.4 Convert cyl into a factor variable using factor(). Convert name into a character vector using as().

A.5 Use the head() function to look at the first few rows of the data and make sure it looks like it was correctly loaded. You can compare the output here to the raw data by opening the url in A.2.

Part B - Missing Values

B.1 Locate the observations (rows) with missing data using is.na(). HINT: You may want use which() with arr.ind=TRUE to return the (row, column) locations of the missing values.

B.2 Look at the missing observations by subsetting the auto data frame.

B.3 Which variables are missing? What are the implications of this missingness?

Part C - Sorting

C.1 Sort the Auto MPG data in descending order by mpg and store the result into a data frame named auto.sorted. HINT: You will need to use order() with na.last=NA so that the values with missing mpg are not in the sorted data frame.

C.2 Look at the observations with the top five values for mpg using head().

C.3 Look at the observations with the bottom five values for mpg using tail().

C.4 Do you notice any patterns with these two groups? HINT: You may need to do some Googling about these vehicles.

Part D - String Manipulation

D.1 Locate the observations with diesel engines. HINT: If a vehicle has a diesel engine, it will mention “diesel” in the name of the car. Use the grep() function do accomplish this.

D.2 Create a new variable (column) in the auto data frame called diesel such that auto$diesel = 1 if the car has a diesel engine and 0, otherwise.

D.3 Coerce auto$diesel into a factor variable using as().

D.4 Look at the structure of the auto data frame using str() to make sure that this was done correctly.