Intro to Data Analysis with R: Session 2

Packages

Terminology:
- Package: an extension of the R base system with code, data and documentation in a standardized format
- Library: a directory containing installed packages
- Repository: a website providing packages for installation
- Source: the original version of a package with human-readable text and code
- Base packages: part of the R source tree, maintained by R Core
For more info on how R packages are developed, please read Creating R Packages: A Tutorial by Friedrich Leisch.
Go to https://cran.r-project.org/web/packages/ for a list of all available packages.
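
The terminology above can be checked from the console. This is a minimal sketch using only functions that ship with base R; the variable names (lib_dirs, base_pkgs) are illustrative and do not come from the slides.

```r
# Library: the directories R searches for installed packages
lib_dirs <- .libPaths()
print(lib_dirs)

# Base packages: installed with R itself and maintained by R Core
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Repository: the CRAN mirror(s) currently configured for install.packages()
print(getOption("repos"))
```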

Installing a Package

There are two main ways to install a package in R:

Installing from CRAN: install a package directly from the repository
- Using RStudio: Tools > Install Packages...
- From R console: install.packages()
Installing from source: first download the add-on R package and then type the following in your console:
- install.packages("path_to_file", repos = NULL, type = "source")

Once a package is installed, you need to load it in each new R session using the function library().

Installing a Package, contd.

Let’s install the package car that we will need later in this lecture.

install.packages("car")  # install the package
library(car)  # load it into your R workspace
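
Because install.packages() downloads the package again every time it runs, scripts often install only when the package is missing. The sketch below shows one possible pattern; ensure_package() is a hypothetical helper written for this example, not a function from R or from the slides.

```r
# Hypothetical helper (not part of R or the slides): install only when missing
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # report whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call downloads nothing
ok <- ensure_package("stats")
print(ok)
```

For this lecture, the equivalent call would be ensure_package("car") followed by library(car).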

Popular Packages

To visualize data:
- ggplot2: to create beautiful graphics
- googleVis: to use Google Chart tools
To report results:
- shiny: to create interactive web-based apps
- knitr: to combine R code with LaTeX/Markdown
- slidify: to build HTML 5 slide shows
To write high-performance R code:
- Rcpp: to write R functions that call C++ code
- data.table: to organize datasets for fast operations
- parallel: to use parallel processing in R

Functions in R

Consider the function sample().
Run ?sample to read the help file.

?sample

sample() has four arguments:
- x: vector of elements from which to choose
- size: desired sample size
- replace: sampling with/without replacement (logical)
- prob: vector of probability weights
The help file will specify which arguments have default values (and what those values are)
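
Besides the help file, the argument list and its defaults can be inspected directly from the console. A small sketch using the base functions formals() and args(); the variable name arg_names is illustrative.

```r
# Argument names, in the order used for positional matching
arg_names <- names(formals(sample))
print(arg_names)

# Default values: replace defaults to FALSE and prob to NULL;
# x and size have no default stored in the signature
print(formals(sample)$replace)
print(is.null(formals(sample)$prob))

# args() prints the full signature in one line
args(sample)
```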

Calling a Function in R

Function arguments can either be matched by position within the parentheses or by name

sampSpace <- 1:6 
sample(sampSpace, 1)  # arguments with default values can be omitted

## [1] 4

sample(size = 1, x = sampSpace)  # no need to remember the order

## [1] 4

sample(size = 1, sampSpace)  # named arguments are matched first, the rest by position

## [1] 3
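
The outputs above differ from run to run because sample() draws at random. The sketch below shows how the remaining arguments (replace and prob) might be used, and how set.seed() makes the draws reproducible. The seed value and the loaded-die probabilities are arbitrary choices made for this example.

```r
set.seed(2024)  # arbitrary seed, chosen only for illustration

sampSpace <- 1:6

# Ten rolls of a fair die: sampling with replacement
rolls <- sample(x = sampSpace, size = 10, replace = TRUE)

# Ten rolls of a loaded die: side 1 is five times as likely as each other side
loaded <- sample(sampSpace, size = 10, replace = TRUE,
                 prob = c(0.5, 0.1, 0.1, 0.1, 0.1, 0.1))

print(rolls)
print(loaded)

# Resetting the seed reproduces the same draws
set.seed(2024)
rolls_again <- sample(x = sampSpace, size = 10, replace = TRUE)
identical(rolls, rolls_again)
```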

Some Useful Functions

str(): a function that displays the internal structure of an object
summary(): a function that summarizes the variables in a data frame
- Note: this function is also used to summarize results of model fitting functions, which we will go over in the afternoon.

`str()`

Compact way of understanding what an object is and what it contains

str(str)

## function (object, ...)

str(sample)

## function (x, size, replace = FALSE, prob = NULL)

`str()`, contd.

After loading a data frame, it is often useful to use str() in order to understand the structure of your data.

prestige <- read.table(file = here::here("data", "prestige_v2.csv"),
                       sep = ",",
                       header = TRUE,
                       row.names = 1,
                       stringsAsFactors = TRUE)  # keep type as a factor (the default changed in R 4.0)
str(prestige)

## 'data.frame':    101 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

`summary()`

Another useful function for understanding your data: it provides a numeric summary of each column (variable).

summary(prestige)

##    education         income          women          prestige    
##  Min.   : 6.38   Min.   :  611   Min.   : 0.00   Min.   :14.80  
##  1st Qu.: 8.43   1st Qu.: 4075   1st Qu.: 3.59   1st Qu.:35.20  
##  Median :10.51   Median : 5902   Median :13.62   Median :43.50  
##  Mean   :10.73   Mean   : 6784   Mean   :29.19   Mean   :46.76  
##  3rd Qu.:12.71   3rd Qu.: 8131   3rd Qu.:52.27   3rd Qu.:59.60  
##  Max.   :15.97   Max.   :25879   Max.   :97.51   Max.   :87.20  
##      census       type   
##  Min.   :1113   bc  :47  
##  1st Qu.:3117   prof:31  
##  Median :5137   wc  :23  
##  Mean   :5422            
##  3rd Qu.:8313            
##  Max.   :9517
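
To try str() and summary() without the course data file, a small hand-made data frame is enough. The sketch below uses made-up values; the data frame name toy and its contents are assumptions for illustration and are not the prestige data.

```r
# Made-up data for illustration only; not the prestige data set
toy <- data.frame(
  education = c(13.1, 12.3, 8.5, 9.7, 11.4, 7.9),
  income    = c(12351L, 25879L, 4075L, 5902L, 8865L, 3116L),
  type      = factor(c("prof", "prof", "bc", "wc", "wc", "bc"))
)

str(toy)      # one line per column: name, type, first few values
summary(toy)  # quantiles for numeric columns, counts per level for factors

# summary() also works on a single column
summary(toy$education)
```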

Exploratory Data Analysis

Before performing statistical analyses on your data, it is important to do exploratory data analysis (EDA) in order to better understand the variables and the relationships between them.
This can be done in many ways:
- Numeric summaries (e.g., using str() and summary())
- Plots, plots, and more plots
- More advanced methods (variograms, empirical covariance matrices, etc.)
We will cover some of the basic plotting functions.

Plotting in R

To explore the distribution of one variable:
- Histograms
- Boxplots
To explore relationships between variables:
- Grouped Boxplots
- Scatterplots
- Scatterplot Matrices
Useful resources:
- Quick-R Graphical Parameters
- ggplot2
- Search for (or post) specific questions on Stack Overflow, a community site where users answer questions and choose the best solutions by voting
NOTE: This course provides a basic introduction to R’s plotting capabilities. You can do much, much more elegant plots in R!

Histograms

One of the most basic plots in R is a histogram.
Let’s plot a histogram of the response variable prestige.
- freq: logical variable that controls the type of histogram (TRUE gives counts, FALSE gives densities, scaled so the total bar area is 1)
- breaks: controls the number of bins & bin locations; multiple ways to set this
- We can add arguments to the function to change the bar colors, title, and axis labels.

hist(prestige$prestige, freq = FALSE, 
     col = "grey",
     main = "Histogram of Prestige Score", 
     xlab = "Prestige Score")
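
The breaks argument can be set in more than one way. The sketch below shows two of them on simulated data (a stand-in for the prestige scores, since the course file may not be at hand) and checks that the bar areas sum to 1 when densities are used. The variable names and the simulated values are assumptions for illustration.

```r
set.seed(1)
scores <- runif(100, min = 15, max = 90)  # simulated stand-in for prestige scores

# breaks as a single number: a suggested bin count (R may adjust it to get "pretty" edges)
h_count <- hist(scores, breaks = 5, plot = FALSE)

# breaks as a vector: exact bin edges, which must span the range of the data
h_edges <- hist(scores, breaks = seq(10, 90, by = 10), plot = FALSE)
print(h_edges$counts)

# With freq = FALSE the bar heights are densities, so the bar areas sum to 1
total_area <- sum(h_edges$density * diff(h_edges$breaks))
print(total_area)

hist(scores, breaks = seq(10, 90, by = 10), freq = FALSE, col = "grey",
     main = "Simulated Scores", xlab = "Score")
```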

Histograms, contd.

Let’s add a dashed vertical line at the median of prestige using abline()
- Adds a straight line to the current figure
- Can be specified multiple ways–intercept/slope, horizontal/vertical location, regression coefficients/object
- Not unique to histograms–you can use abline on (almost) any figure

hist(prestige$prestige, freq = FALSE, col = "grey", 
     main = "Histogram of Prestige Score", xlab = "Prestige Score")
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
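
The different ways of specifying a line with abline() can be compared side by side. A short sketch on made-up points (x and y here are illustrative, not course data):

```r
x <- 1:10
y <- 2 + 0.5 * x  # made-up points lying exactly on a line

plot(x, y, pch = 20, main = "Ways to call abline()")
abline(a = 2, b = 0.5, col = "green")         # intercept a and slope b
abline(h = mean(y), col = "red", lty = 2)     # horizontal line at y = mean(y)
abline(v = median(x), col = "blue", lty = 3)  # vertical line at x = median(x)
```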

Histograms, contd.

Sometimes we would rather have a smoothed version of a histogram (i.e., a density function).
- Does not depend on the number/location of bins as a histogram does (it depends on the choice of bandwidth instead).
We can include this using the density() and lines() functions.
- lines(): takes coordinate pairs (in multiple input formats) and adds them to current figure connected by line segments
- density(): computes kernel density estimates (a smoothed histogram); see ?density for more details

hist(prestige$prestige, freq=FALSE, col = "grey", main = "", xlab = "Prestige Score", ylim=c(0, 0.022))
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
lines(density(prestige$prestige), col = "red", lwd = 2)
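
How smooth the density estimate looks depends on its bandwidth. The sketch below uses the adjust argument of density() to scale the default bandwidth; the data are simulated (not the prestige data) and the variable names are illustrative.

```r
set.seed(3)
scores <- rnorm(100, mean = 47, sd = 17)  # simulated stand-in for prestige scores

d_default <- density(scores)                # default bandwidth
d_narrow  <- density(scores, adjust = 0.5)  # half the default bandwidth: wigglier
d_wide    <- density(scores, adjust = 2)    # twice the default bandwidth: smoother

hist(scores, freq = FALSE, col = "grey", main = "", xlab = "Simulated Score")
lines(d_default, col = "red", lwd = 2)
lines(d_narrow, col = "blue", lty = 2)
lines(d_wide, col = "darkgreen", lty = 3)

# The bandwidth actually used is stored in the result
print(c(d_narrow$bw, d_default$bw, d_wide$bw))
```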

Adding a Legend

With multiple features on one figure, a legend can help clearly convey what is plotted.
Use legend() to do this in R.
- Very versatile function–look at its documentation
- Best to play around with this on your own, iterating through multiple plots until you get the legend to appear how you want it

hist(prestige$prestige, freq=FALSE, col = "grey", main = "", xlab = "Prestige Score", ylim=c(0, 0.022))
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
lines(density(prestige$prestige), col = "red", lwd = 2)
legend("topright", legend = c("Median", "Density Est."),
       col = c("blue", "red"), lty = c(2, 1), lwd = 2, bty = "n")

Boxplots

Boxplots provide a graphical representation of numeric summary statistics (median, quartiles, and range).
Use boxplot() to produce them in R.
First let’s look at the numeric summary of the response variable prestige, then draw its boxplot:

summary(prestige$prestige)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.80   35.20   43.50   46.76   59.60   87.20

boxplot(prestige$prestige, horizontal = TRUE, xlab = "Prestige Score")
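
The numbers a boxplot draws can be inspected with boxplot.stats() from base R. A small sketch on made-up values (the vector x is an assumption for illustration, chosen to contain one obvious outlier):

```r
x <- c(1:9, 50)  # made-up values with one obvious outlier

bs <- boxplot.stats(x)
print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points beyond 1.5 box lengths from the box, drawn individually

boxplot(x, horizontal = TRUE, xlab = "Made-up values")
```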

Grouped Boxplots

Used to investigate relationships between variables:
- Continuous variable & factor (most common)
- Two continuous variables (group one of them)
Let’s look at a boxplot of prestige grouped by type

boxplot(prestige ~ type, data = prestige, col = "grey",
        main = "Distribution of Prestige Score by Type of Occupation",
        xlab = "Occupation Type", ylab = "Prestige Score")
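
For the second case listed above (two continuous variables), one of them has to be binned first. The sketch below uses cut() for this on simulated data; the variable names, the simulated relationship, and the cut points are all arbitrary assumptions for illustration.

```r
set.seed(2)
# Simulated stand-ins for two continuous variables; not the prestige data
education <- runif(100, min = 6, max = 16)
score     <- 10 + 4 * education + rnorm(100, sd = 8)

# Bin education into three groups; the cut points are arbitrary
edu_group <- cut(education, breaks = c(6, 9, 12, 16), include.lowest = TRUE)
table(edu_group)

boxplot(score ~ edu_group, col = "grey",
        xlab = "Years of Education (grouped)", ylab = "Simulated Score")
```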

Scatterplots

We can also use scatterplots to visualize the relationship between two (continuous) variables.
We use the plot() function to do this
- Extremely flexible–there are plot methods for a variety of objects! (e.g., plotting a regression object returns diagnostic plots… more on this later)
- See the help documentation (?plot) and Graphical Parameters for more details
For now, focus on the simplest plot, where we specify the x and y coordinates.
- x and y must be the same length

plot(x = prestige$education, y = prestige$prestige, type = "p", pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg Years of Education", ylab = "Prestige Score")

Scatterplots, contd.

Let’s overlay both a linear fit and smoother to our scatterplot.
We can include this using the lowess(), lines(), and abline() functions.
- We’ve already used lines() and abline()
- Need to use lm() to fit a linear regression, more on this later
- lowess(): computes a smoothed fit using locally-weighted polynomial regression (LOWESS); don’t worry about the details here, just need to know it is a flexible, nonparametric way (no linear or other fixed functional form is assumed) to estimate the relationship between two variables

plot(prestige$education, prestige$prestige, type = "p", pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg Years of Education", ylab = "Prestige Score")
abline(reg = lm(prestige ~ education, data = prestige), col = "green", lwd = 2)  # linear regression
lines(lowess(x = prestige$education, y = prestige$prestige), col = "red", lwd = 2)  # smoother
legend("topleft", legend = c("Regression Line", "Smoother"), col = c("green", "red"),
       lwd = c(2,2), lty = 1, bty = "n")
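
The difference between the two overlays is easiest to see when the true relationship is curved. The sketch below fits both to simulated data and also varies the smoother span f of lowess(); the data, variable names, and the span value 0.3 are assumptions chosen for illustration.

```r
set.seed(4)
# Simulated curved relationship; not the prestige data
x <- runif(100, min = 6, max = 16)
y <- 20 + 0.3 * (x - 6)^2 + rnorm(100, sd = 4)

fit <- lm(y ~ x)  # straight-line fit
coef(fit)         # intercept and slope used by abline()

sm_default <- lowess(x, y)           # smoother span f = 2/3 (the default)
sm_local   <- lowess(x, y, f = 0.3)  # smaller span: follows the data more closely

plot(x, y, pch = 20, main = "Linear Fit vs. Smoother (simulated data)")
abline(reg = fit, col = "green", lwd = 2)
lines(sm_default, col = "red", lwd = 2)
lines(sm_local, col = "blue", lwd = 2, lty = 2)
```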

Scatterplot Matrices

The function scatterplotMatrix() (found in the car package we installed earlier) produces scatterplots for all pairs of variables in a data frame.
The order in which we list the columns controls the order in which they are plotted.

library(car)
scatterplotMatrix( prestige[ ,c("prestige","education","income","women")] )
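
If the car package is not available, base R's pairs() draws a plainer scatterplot matrix. Note that this swaps in a different function from the one used above. The sketch runs on a simulated data frame whose name (sim) and columns are assumptions for illustration.

```r
set.seed(5)
# Simulated data frame; the column names are illustrative only
n <- 50
sim <- data.frame(education = runif(n, 6, 16))
sim$income <- 1000 * sim$education + rnorm(n, sd = 2000)
sim$score  <- 10 + 4 * sim$education + rnorm(n, sd = 8)

# Columns are plotted in the order listed
pairs(sim[, c("score", "education", "income")], pch = 20)

# Numeric companion to the plot: pairwise correlations
round(cor(sim[, c("score", "education", "income")]), 2)
```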

UCI Data Science Initiative

End of Session 2