April 20, 2018
R Packages, Functions & Help
Exploratory Data Analysis
Terminology:
For more info on how R packages are developed, please read Creating R Packages: A Tutorial by Friedrich Leisch.
Go to https://cran.r-project.org/web/packages/ for a list of all available packages.
There are two main ways to install a package in R:
Through the menu: Tools > Install Packages...
From the console: install.packages()
To install from a local source file: install.packages("path_to_file", repos = NULL, type = "source")
Once you install a package, you need to load it into R using the function library().
Let’s install the package car, which we will need later in this lecture.
install.packages("car") # install the package
library(car) # load it into your R workspace
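The install-then-load steps above are often wrapped so that a package is only downloaded when it is missing. The sketch below is one way to do this, not part of the lecture itself: the helper name load_or_install is hypothetical, while requireNamespace(), install.packages() and library() are standard R functions. It is demonstrated on stats, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper (not from the lecture): install a package only if it is
# missing, then attach it to the search path.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # downloads from CRAN; needs a network connection
  }
  library(pkg, character.only = TRUE)  # character.only: pkg holds the name as a string
  invisible(pkg %in% .packages())      # TRUE if the package is now attached
}

attached <- load_or_install("stats")   # "stats" ships with R, so no download happens
attached
```

For the package used in this lecture the call would be load_or_install("car").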
Some useful packages:
ggplot2: to create beautiful graphics
googleVis: to use Google Chart tools
shiny: to create interactive web-based apps
knitr: to combine R code with LaTeX/Markdown
slidify: to build HTML5 slide shows
Rcpp: to write R functions that call C++ code
data.table: to organize datasets for fast operations
parallel: to use parallel processing in R
To see the help page for a function such as sample(), type ? followed by its name:
?sample
sample() has four arguments:
x: vector of elements from which to choose
size: desired sample size
replace: sampling with/without replacement (logical)
prob: vector of probability weights
Function arguments can be matched either by position within the parentheses or by name.
sampSpace <- 1:6
sample(sampSpace, 1) # arguments with default values can be omitted
## [1] 4
sample(size = 1, x = sampSpace) # no need to remember the order
## [1] 4
sample(size = 1, sampSpace)
## [1] 3
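Because sample() is random, the calls above return different values each time, which makes it hard to see that positional and named matching are equivalent. A minimal sketch, assuming an arbitrary seed chosen only for illustration, fixes the random number generator with set.seed() so that both calls can be compared directly:

```r
sampSpace <- 1:6

set.seed(2018)                               # arbitrary seed, chosen for illustration
by_position <- sample(sampSpace, 3, FALSE)   # x, size, replace matched by position

set.seed(2018)                               # same seed, so the same random draw
by_name <- sample(replace = FALSE, size = 3, x = sampSpace)  # matched by name, in any order

identical(by_position, by_name)              # TRUE: both calls mean the same thing
```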
str(): a function to explain the internal structure of an object
summary(): a function that summarizes the variables in a data frame
str()
Compact way of understanding what an object is and what it contains
str(str)
## function (object, ...)
str(sample)
## function (x, size, replace = FALSE, prob = NULL)
str()
After loading a data frame, it is often useful to use str() in order to understand the structure of your data.
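prestige <- read.table(file = here::here("data", "prestige_v2.csv"),
                       sep = ",",
                       header = TRUE,
                       row.names = 1,
                       stringsAsFactors = TRUE)  # R >= 4.0 reads strings as character by default; this keeps type a factor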
str(prestige)
## 'data.frame': 101 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
summary()
Another useful function for understanding your data: it provides a numeric summary of each attribute (column).
summary(prestige)
##    education          income          women          prestige
##  Min.   : 6.38   Min.   :  611   Min.   : 0.00   Min.   :14.80
##  1st Qu.: 8.43   1st Qu.: 4075   1st Qu.: 3.59   1st Qu.:35.20
##  Median :10.51   Median : 5902   Median :13.62   Median :43.50
##  Mean   :10.73   Mean   : 6784   Mean   :29.19   Mean   :46.76
##  3rd Qu.:12.71   3rd Qu.: 8131   3rd Qu.:52.27   3rd Qu.:59.60
##  Max.   :15.97   Max.   :25879   Max.   :97.51   Max.   :87.20
##      census       type
##  Min.   :1113   bc  :47
##  1st Qu.:3117   prof:31
##  Median :5137   wc  :23
##  Mean   :5422
##  3rd Qu.:8313
##  Max.   :9517
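If the prestige file is not at hand, the same two functions can be tried on any data frame. The sketch below uses iris, a data set that ships with R, purely as a stand-in (it is not part of the lecture); like prestige, it mixes numeric columns with one factor.

```r
# iris ships with R and is used here only as a stand-in for prestige
str(iris)        # 150 obs. of 5 variables; Species is a factor with 3 levels
summary(iris)    # quartiles and mean for numeric columns, counts for the factor

# summary() on a single numeric column returns a named vector of six statistics
s <- summary(iris$Sepal.Length)
names(s)
all.equal(s[["Median"]], median(iris$Sepal.Length))  # same value as median()
```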
Before performing statistical analyses on your data, it is important to do exploratory data analysis (EDA) in order to better understand the variables and the relationships between them.
This can be done in many ways (for example, with str() and summary()). We will cover some of the basic plotting functions.
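To explore the distribution of one variable: histograms (hist()) and boxplots (boxplot()).
To explore relationships between variables: grouped boxplots (boxplot()), scatterplots (plot()), and scatterplot matrices (scatterplotMatrix()).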
Useful resources:
ggplot2
NOTE: This course provides a basic introduction to R’s plotting capabilities. You can make much, much more elegant plots in R!
hist() draws a histogram, here of the variable prestige.
freq: logical value that controls the type of histogram (TRUE gives counts, FALSE gives densities, so that the bars have a total area of 1)
breaks: controls the number of bins and the bin locations; there are multiple ways to set this
hist(prestige$prestige, freq = FALSE,
     col = "grey",
     main = "Histogram of Prestige Score",
     xlab = "Prestige Score")
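To see what the freq argument changes, it can help to look at the numbers hist() computes rather than the picture. The sketch below is illustrative only: it uses faithful$eruptions, a vector that ships with R, as a stand-in for prestige$prestige, and calls hist() with plot = FALSE so that the bins are returned without drawing anything.

```r
x <- faithful$eruptions                  # built-in data, a stand-in for prestige$prestige

h <- hist(x, breaks = 10, plot = FALSE)  # compute the bins without drawing
bin_width <- diff(h$breaks)              # width of each bin

h$counts                                 # bar heights when freq = TRUE
h$density                                # bar heights when freq = FALSE

# densities are the counts rescaled so that the bars have a total area of 1
all.equal(h$density, h$counts / (length(x) * bin_width))
sum(h$density * bin_width)
```

Note that breaks = 10 is only a suggestion: R picks "pretty" break points near that number.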
abline() adds straight lines to the current plot; it can be used on (almost) any figure.
hist(prestige$prestige, freq = FALSE, col = "grey",
     main = "Histogram of Prestige Score", xlab = "Prestige Score")
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
To overlay a density estimate on the histogram, use the density() and lines() functions.
lines(): takes coordinate pairs (in multiple input formats) and adds them to the current figure, connected by line segments
density(): computes kernel density estimates (a smoothed histogram); see ?density for more details
hist(prestige$prestige, freq = FALSE, col = "grey", main = "",
     xlab = "Prestige Score", ylim = c(0, 0.022))
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
lines(density(prestige$prestige), col = "red", lwd = 2)
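The object returned by density() is simply a grid of x values with an estimated density y at each one, which is why lines() can draw it directly. A minimal sketch, again using the built-in faithful$eruptions as a stand-in for prestige$prestige, inspects that object and checks numerically that the curve encloses an area close to 1, the same scale as a histogram drawn with freq = FALSE:

```r
x <- faithful$eruptions        # built-in data, a stand-in for prestige$prestige

d <- density(x)                # Gaussian kernel and automatic bandwidth by default
length(d$x)                    # number of grid points (512 by default)
d$bw                           # the bandwidth that was chosen

# approximate the area under the curve with a Riemann sum over the grid
grid_step <- diff(d$x)[1]      # the grid is equally spaced
area <- sum(d$y) * grid_step
area                           # close to 1
```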
Use legend() to add a legend to a figure in R.
hist(prestige$prestige, freq = FALSE, col = "grey", main = "",
     xlab = "Prestige Score", ylim = c(0, 0.022))
abline(v = median(prestige$prestige), col = "blue", lty = 2, lwd = 2)
lines(density(prestige$prestige), col = "red", lwd = 2)
legend("topright", legend = c("Median", "Density Est."),
       col = c("blue", "red"), lty = c(2, 1), lwd = 2, bty = "n")
Use boxplot() to produce boxplots in R.
For the variable prestige:
summary(prestige$prestige)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   14.80   35.20   43.50   46.76   59.60   87.20
boxplot(prestige$prestige, horizontal = TRUE, xlab = "Prestige Score")
Boxplots of prestige grouped by type:
boxplot(prestige ~ type, data = prestige, col = "grey",
        main = "Distribution of Prestige Score by Type of Occupation",
        xlab = "Occupation Type", ylab = "Prestige Score")
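The statistics behind a grouped boxplot can be inspected by calling boxplot() with plot = FALSE, which returns them instead of drawing. The sketch below is only illustrative and uses ToothGrowth, a data set that ships with R, as a stand-in for prestige: len plays the role of the numeric variable and supp the role of the grouping factor.

```r
# ToothGrowth ships with R: len is numeric, supp is a factor with levels OJ and VC
b <- boxplot(len ~ supp, data = ToothGrowth, plot = FALSE)

b$names   # group labels, one per box
b$n       # number of observations in each group
b$stats   # one column per group: lower whisker, lower hinge, median,
          # upper hinge, upper whisker

# the middle row holds the group medians
group_medians <- tapply(ToothGrowth$len, ToothGrowth$supp, median)
all.equal(as.numeric(b$stats[3, ]), as.numeric(group_medians))
```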
Use the plot() function to draw a scatterplot.
See the help page (?plot) and Graphical Parameters for more details.
plot() takes x and y coordinates.
x and y must have the same length.
plot(x = prestige$education, y = prestige$prestige, type = "p", pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg Years of Education", ylab = "Prestige Score")
Fitted lines can be added to a scatterplot using the lowess(), lines(), and abline() functions.
lines() and abline(): add lines to the current figure, as in the histogram examples
lm(): fits a linear regression; more on this later
lowess(): computes a smoothed fit using locally weighted polynomial regression (LOWESS); don’t worry about the details here, just know that it is a flexible, nonparametric way (no functional form is assumed) to estimate the relationship between two variables
plot(prestige$education, prestige$prestige, type = "p", pch = 20,
     main = "Prestige Score by Education",
     xlab = "Avg Years of Education", ylab = "Prestige Score")
abline(reg = lm(prestige ~ education, data = prestige), col = "green", lwd = 2) # linear regression
lines(lowess(x = prestige$education, y = prestige$prestige), col = "red", lwd = 2) # smoother
legend("topleft", legend = c("Regression Line", "Smoother"), col = c("green", "red"),
       lwd = c(2, 2), lty = 1, bty = "n")
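It can be useful to look at what lm() and lowess() return before handing them to abline() and lines(). The sketch below is a stand-in example, not the lecture's data: it uses cars, a data set that ships with R, with speed in place of education and dist in place of prestige.

```r
# cars ships with R: speed and stopping distance for 50 cars
fit <- lm(dist ~ speed, data = cars)         # straight-line fit
coef(fit)                                    # intercept and slope that abline(reg = fit) draws

sm <- lowess(x = cars$speed, y = cars$dist)  # locally weighted smoother
str(sm)                                      # a list: x (sorted) and the smoothed y values

# On an open plot, the two fits would be overlaid exactly as in the lecture code:
# plot(cars$speed, cars$dist, pch = 20)
# abline(reg = fit, col = "green", lwd = 2)
# lines(sm, col = "red", lwd = 2)
```

Because lowess() returns plain x and y coordinates, lines() can draw it without any further conversion.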
scatterplotMatrix() (found in the car package we installed earlier) produces scatterplots between all pairs of variables in a data frame.
library(car)
scatterplotMatrix(prestige[, c("prestige", "education", "income", "women")])