Intro to Data Analysis with R: Session 2

UCI Data Science Initiative

April 20, 2018

Session 2 - Agenda

  1. R Packages, Functions & Help

  2. Exploratory Data Analysis

    • Numeric Summary Statistics
    • Histograms
    • Boxplots
    • Scatterplot Matrices

Packages

Installing a Package

There are two main ways to install a package in R:

  1. Installing from CRAN: install a package directly from the repository
    • Using RStudio: Tools > Install Packages...
    • From R console: install.packages()
  2. Installing from source: first download the add-on R package and then type the following in your console:
    • install.packages("path_to_file", repos = NULL, type = "source")

Installing a package does not load it: in each new R session, load the package with the function library().

Installing a Package, contd.

Let’s install the package car that we will need later in this lecture.

install.packages("car")  # install the package
library(car)  # load it into your current R session
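
Reinstalling a package every time a script runs is unnecessary. The sketch below checks whether a package is already installed before loading it. The helper name is_installed is hypothetical (it is not part of R); requireNamespace() is the base R function doing the work.

```r
# Hypothetical helper (not part of base R): TRUE if a package is installed
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")  # TRUE: stats ships with every R installation

if (is_installed("car")) {
  library(car)  # load the package for this session
} else {
  message('Package "car" is missing; run install.packages("car") first.')
}
```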

Functions in R

?sample
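
The ? operator opens the help page for a function. As a small sketch, a few other base R tools show a function's interface directly in the console:

```r
?sample                  # open the help page for sample()
help("sample")           # equivalent to ?sample
args(sample)             # print the argument list with default values
names(formals(sample))   # argument names as a character vector
```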

Calling a Function in R

Function arguments can be matched either by position within the parentheses or by name.

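sampSpace <- 1:6
sample(sampSpace, 1)  # matched by position; arguments with default values can be omitted
## [1] 4
sample(size = 1, x = sampSpace)  # matched by name, so the order does not matter
## [1] 4
sample(size = 1, sampSpace)  # mixed: size by name, sampSpace fills the first free position (x)
## [1] 3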
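
The same matching rules apply to any function. A minimal sketch with a made-up function, power (hypothetical, defined here only for illustration), makes the results easy to check by hand:

```r
# Hypothetical function for illustration; exponent has a default value
power <- function(base, exponent = 2) {
  base^exponent
}

power(3)                        # exponent omitted, default used: 9
power(3, 3)                     # matched by position: 27
power(exponent = 3, base = 2)   # matched by name, order irrelevant: 8
power(exponent = 3, 2)          # named argument first, then 2 fills base: 8
```

Note that the sample() results shown earlier vary from run to run because the draws are random; calling set.seed() with any integer beforehand makes them repeatable.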

Some Useful Functions

str()

str() compactly displays what an object is and what it contains.

str(str)
## function (object, ...)
str(sample)
## function (x, size, replace = FALSE, prob = NULL)

str(), contd.

After loading a data frame, it is often useful to use str() in order to understand the structure of your data.

prestige <- read.table(file = here::here("data", "prestige_v2.csv"),
                       sep = ",",
                       header = TRUE,
                       row.names = 1,
                       stringsAsFactors = TRUE)  # needed in R >= 4.0 so that type is read as a factor
str(prestige)
## 'data.frame':    101 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
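
A few other base R functions complement str() when first inspecting a data frame. The sketch below uses the built-in iris data as a stand-in, since the prestige file may not be at hand; the same calls should work on prestige.

```r
# iris ships with R and stands in for prestige here
dim(iris)              # number of rows and columns: 150 5
names(iris)            # column names
head(iris, n = 3)      # first three rows
sapply(iris, class)    # class of each column
```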

summary()

Another useful function for understanding your data: summary() provides a numeric summary of each attribute (column).

summary(prestige)
##    education         income          women          prestige    
##  Min.   : 6.38   Min.   :  611   Min.   : 0.00   Min.   :14.80  
##  1st Qu.: 8.43   1st Qu.: 4075   1st Qu.: 3.59   1st Qu.:35.20  
##  Median :10.51   Median : 5902   Median :13.62   Median :43.50  
##  Mean   :10.73   Mean   : 6784   Mean   :29.19   Mean   :46.76  
##  3rd Qu.:12.71   3rd Qu.: 8131   3rd Qu.:52.27   3rd Qu.:59.60  
##  Max.   :15.97   Max.   :25879   Max.   :97.51   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :47  
##  1st Qu.:3117   prof:31  
##  Median :5137   wc  :23  
##  Mean   :5422            
##  3rd Qu.:8313            
##  Max.   :9517
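
summary() can also be applied to a single column, and a statistic can be computed separately within each level of a factor. A sketch using the built-in iris data as a stand-in for prestige (for prestige, the analogous grouping column would be type):

```r
# iris ships with R and stands in for prestige here
summary(iris$Sepal.Length)   # numeric summary of a single column

# Mean of one column within each level of a factor
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means

# The same idea with the formula interface
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
```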

Exploratory Data Analysis

Plotting in R

Histograms

hist(prestige$prestige, freq = FALSE, 
     col = "grey",
     main = "Histogram of Prestige Score", 
     xlab = "Prestige Score")
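
With freq = FALSE the y-axis shows densities rather than counts, so the bar areas add up to 1. The sketch below checks this on simulated data (an assumption for illustration, not the prestige scores), using plot = FALSE to compute the bins without drawing:

```r
set.seed(1)                            # make the simulated data repeatable
x <- rnorm(200, mean = 50, sd = 15)    # simulated scores, not the prestige data

h <- hist(x, plot = FALSE)             # compute the bins without drawing
bin_widths <- diff(h$breaks)
total_area <- sum(h$density * bin_widths)
total_area                             # bar areas add up to 1

sum(h$counts)                          # counts (the freq = TRUE scale) add up to length(x)
```

The density scale is what allows a density estimate to be overlaid on the same axes.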

Histograms, contd.

hist(prestige$prestige, freq = FALSE, col = "grey", 
     main = "Histogram of Prestige Score", xlab = "Prestige Score")
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)

Histograms, contd.

hist(prestige$prestige, freq = FALSE, col = "grey", main = "",
     xlab = "Prestige Score", ylim = c(0, 0.022))
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
lines(density(prestige$prestige), col = "red", lwd = 2)
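
How smooth the density curve looks depends on its bandwidth. As a sketch on simulated data (not the prestige scores), the adjust argument of density() scales the default bandwidth:

```r
set.seed(1)
x <- rnorm(200, mean = 50, sd = 15)   # simulated scores, not the prestige data

d_default <- density(x)               # default bandwidth
d_smooth  <- density(x, adjust = 2)   # twice the default bandwidth, smoother curve

d_default$bw
d_smooth$bw

# With a histogram already drawn, either curve can be overlaid, e.g.
# lines(d_smooth, col = "red", lwd = 2)
```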

Adding a Legend

hist(prestige$prestige, freq = FALSE, col = "grey", main = "",
     xlab = "Prestige Score", ylim = c(0, 0.022))
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
lines(density(prestige$prestige), col = "red", lwd = 2)
legend("topright", legend = c("Median", "Density Est."),
       col = c("blue", "red"), lty = c(2, 1), lwd = 2, bty = "n")

Boxplots

summary(prestige$prestige)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.80   35.20   43.50   46.76   59.60   87.20
boxplot(prestige$prestige, horizontal = TRUE, xlab = "Prestige Score")
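
By default, the whiskers of a boxplot extend to the most extreme observations within 1.5 times the box length from the box, and anything beyond is drawn as an individual point. The sketch below shows the numbers behind the plot with boxplot.stats(), using made-up values:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up values with one extreme point

bs <- boxplot.stats(x)   # coef = 1.5 by default, as in boxplot()
bs$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out                   # points drawn individually: 50
```

Here the hinges are 3 and 8, so the upper fence is 8 + 1.5 * 5 = 15.5; the value 50 lies beyond it and the upper whisker stops at 9.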

Grouped Boxplots

boxplot(prestige ~ type, data = prestige, col = "grey",
        main = "Distribution of Prestige Score by Type of Occupation",
        xlab = "Occupation Type", ylab = "Prestige Score")
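
The formula y ~ group asks for one box of y per level of group. As a sketch, again with iris standing in for prestige, plot = FALSE returns the statistics behind each box instead of drawing them:

```r
# iris stands in for prestige: numeric column ~ grouping factor
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)

b$names   # one box per factor level
b$n       # number of observations in each group
b$stats   # one column of five box statistics per group
```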

Scatterplots

plot(x = prestige$education, y = prestige$prestige, type = "p", pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg Years of Education", ylab = "Prestige Score")

Scatterplots, contd.

plot(prestige$education, prestige$prestige, type = "p", pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg Years of Education", ylab = "Prestige Score")
abline(reg = lm(prestige ~ education, data = prestige), col = "green", lwd = 2)  # least-squares regression line
lines(lowess(x = prestige$education, y = prestige$prestige), col = "red", lwd = 2)  # lowess smoother
legend("topleft", legend = c("Regression Line", "Smoother"), col = c("green", "red"),
       lwd = 2, lty = 1, bty = "n")
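
The line drawn by abline() comes from the fitted model's intercept and slope. A sketch with the built-in cars data (a stand-in for prestige) shows how to read them off and how the slope relates to the covariance:

```r
# cars ships with R (speed and stopping distance) and stands in for prestige
fit <- lm(dist ~ speed, data = cars)

coef(fit)                    # intercept and slope of the line abline() would draw
cor(cars$speed, cars$dist)   # strength of the linear association

# The least-squares slope equals cov(x, y) / var(x)
slope_by_hand <- cov(cars$speed, cars$dist) / var(cars$speed)
all.equal(unname(coef(fit)["speed"]), slope_by_hand)
```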

Scatterplot Matrices

library(car)
scatterplotMatrix(prestige[, c("prestige", "education", "income", "women")])
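
A correlation matrix is a natural numeric companion to a scatterplot matrix, and base R's pairs() draws a similar grid without any extra package. A sketch with iris standing in for the prestige columns:

```r
num_cols <- iris[, 1:4]        # iris stands in for the prestige columns

cor_mat <- cor(num_cols)       # pairwise Pearson correlations
round(cor_mat, 2)

pairs(num_cols, pch = 20)      # base R scatterplot matrix, no extra package needed
```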

End of Session 2

Next up:

  1. Exercise 2
  2. Lunch

Return at 1:00 to discuss solutions to Exercise 2!