Vectorized Operations

R supports vectorized operations, so many computations need no explicit loop
This means that a vectorized function or operator applied to a vector acts on each element individually

x <- 1:5
y <- c(1, 2, 6, 7, 10)
x + y # R does an element by element summation

## [1]  2  4  9 11 15

x < y

## [1] FALSE FALSE  TRUE  TRUE  TRUE
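The same element-wise behaviour holds for built-in functions such as sqrt(). As a minimal sketch (the names xs, viaLoop and viaVector are illustrative and not from the slides), the explicit loop and the vectorized call below should give the same result:

```r
xs <- c(1, 4, 9, 16, 25)

# Explicit loop: compute the square root one element at a time
viaLoop <- numeric(length(xs))
for (i in seq_along(xs)) {
  viaLoop[i] <- sqrt(xs[i])
}

# Vectorized call: sqrt() is applied to every element in one step
viaVector <- sqrt(xs)

identical(viaLoop, viaVector)  # TRUE when both approaches agree
```

The vectorized form is shorter and usually faster, which is why it is preferred whenever a function supports it.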

Vectorized Operations

As with vectors, arithmetic and comparison operators work element by element on matrices

x <- matrix(1:9, ncol = 3)
y <- matrix(rep(c(5,6,7), 3), ncol = 3)
x + y # R does an element by element summation

##      [,1] [,2] [,3]
## [1,]    6    9   12
## [2,]    8   11   14
## [3,]   10   13   16

x < y

##      [,1] [,2]  [,3]
## [1,] TRUE TRUE FALSE
## [2,] TRUE TRUE FALSE
## [3,] TRUE TRUE FALSE

Reading and Writing Data

The slides for the “Reading and Writing Data” section are mainly from Dr. Roger D. Peng, Associate Professor at Johns Hopkins

Main functions for reading data into R:

read.table(), read.csv(): to read tabular data
readLines(): to read lines of a text file
source(), dget(): to read R code
load(): to read saved workspaces

Only read.table() and read.csv() are covered in this lecture.

Reading and Writing Data

Main functions for writing data from R:

write.table(), write.csv(): to write tabular data to file
writeLines(): to write lines to a text file
dump(), dput(): to write R code to a file
save(): to save a workspace

Only write.table() is covered in this lecture

read.table():

read.table() is the most commonly used function to read data in R
Type ?read.table in your R console to see the important arguments in the function
read.csv() is intended for reading comma separated value files
- It is read.table() with different defaults, most notably sep = "," and header = TRUE
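To see the relationship between the two functions, here is a small sketch that writes a throwaway file and reads it both ways. The file contents and the names tmp, viaCsv and viaTable are made up for illustration:

```r
# Create a small comma-separated file in a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,9.5", "2,7.25", "3,8"), tmp)

# Read the same file with both functions
viaCsv   <- read.csv(tmp)
viaTable <- read.table(tmp, sep = ",", header = TRUE)

all.equal(viaCsv, viaTable)  # compares the two data frames
unlink(tmp)                  # remove the temporary file
```

For a simple file like this one, both calls should return the same data frame.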

read.table():

irisFile <- read.table(file = "iris.csv", sep=",", header = TRUE)
head(irisFile)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
## 1          5.1         3.5          1.4         0.2 Iris-setosa
## 2          4.9         3.0          1.4         0.2 Iris-setosa
## 3          4.7         3.2          1.3         0.2 Iris-setosa
## 4          4.6         3.1          1.5         0.2 Iris-setosa
## 5          5.0         3.6          1.4         0.2 Iris-setosa
## 6          5.4         3.9          1.7         0.4 Iris-setosa

write.table():

write.table(irisFile, file = "new_iris.csv", sep = ",", col.names = TRUE)
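By default write.table() also writes the row names. The sketch below is a variation on the call above, not the slide's own code: it adds row.names = FALSE, writes to a temporary file (the names df, out and back are illustrative) and reads the file back to check that the columns survive the round trip:

```r
df  <- head(iris, 3)               # iris ships with R
out <- tempfile(fileext = ".csv")  # temporary file instead of "new_iris.csv"

# row.names = FALSE leaves the row-name column out of the file
write.table(df, file = out, sep = ",", row.names = FALSE, col.names = TRUE)

# Read the file back and inspect its shape
back <- read.table(out, sep = ",", header = TRUE)
names(back)
dim(back)
unlink(out)                        # remove the temporary file
```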

Control Structures:

A control structure is a block of code that evaluates a condition and decides which statements run next, and how many times
Control structures in R include:
- for loops
- if/else statements
- while loops
- repeat
- break
- next
- return
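The slides that follow cover for loops and if/else in detail. For the remaining keywords, a minimal sketch may help; all variable names below are illustrative:

```r
# while: keep looping as long as the condition holds
countdown <- c()
i <- 3
while (i > 0) {
  countdown <- c(countdown, i)
  i <- i - 1
}
countdown  # 3 2 1

# repeat + break: loop indefinitely until break is reached
total <- 0
repeat {
  total <- total + 5
  if (total >= 20) break
}
total      # 20

# next: skip the rest of the current iteration
odds <- c()
for (k in 1:6) {
  if (k %% 2 == 0) next  # skip even numbers
  odds <- c(odds, k)
}
odds       # 1 3 5
```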

for loops:

Suppose we want to write “The year is [year]” where year is equal to 2014, 2015, and 2016.
One way to do so is like this:

print(paste("The year is", 2014))

## [1] "The year is 2014"

print(paste("The year is", 2015))

## [1] "The year is 2015"

print(paste("The year is", 2016))

## [1] "The year is 2016"

for loops:

Or, we can use a for loop:

for(i in 2014:2016){
  print(paste("The year is", i))
}

## [1] "The year is 2014"
## [1] "The year is 2015"
## [1] "The year is 2016"

for loops:

Suppose we have a numeric vector and we want to square each element

vec <- seq(2, 20, by = 2)

First, create a new vector of the same length as vec

newvec <- vector("numeric", length = length(vec))

Then, write the for loop

for(i in seq_along(vec)){ # seq_along() is safe even when vec is empty
  newvec[i] <- vec[i]^2
}
newvec

##  [1]   4  16  36  64 100 144 196 256 324 400
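Because ^ is a vectorized operator, the loop above is mainly a teaching device. A short sketch (squaredLoop and squaredVec are illustrative names) comparing the two approaches:

```r
vec <- seq(2, 20, by = 2)

# Loop version, as on the slide
squaredLoop <- numeric(length(vec))
for (i in seq_along(vec)) {
  squaredLoop[i] <- vec[i]^2
}

# Vectorized version: one expression, no loop
squaredVec <- vec^2

all.equal(squaredLoop, squaredVec)  # TRUE when both agree
```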

if/else statements:

if/else statements are used to write conditional statements

x <- 7
if (x < 10){
  print("x is less than 10")
}else{
  print("x is greater than or equal to 10")
}

## [1] "x is less than 10"

Combining for loops and if/else statements:

Suppose we have the ages of 10 individuals, and we want to categorize each age as young, middle-aged, or old
Using a for loop, we can iterate through each age and then use if/else statements to classify each age

for loops and if/else statements:

age <- sample(1:100, 10) # random ages, so the output varies between runs
ageCat <- rep(NA, length(age))
for (i in seq_along(age)) {
  if (age[i] <= 35) {
    ageCat[i] <- "Young"
  } else if (age[i] <= 55) {
    ageCat[i] <- "Middle-Aged"
  } else {
    ageCat[i] <- "Old"
  }
}
age.df <- data.frame(age = age, ageCat = ageCat)
age.df[1:3, ]

##   age      ageCat
## 1  30       Young
## 2  26       Young
## 3  41 Middle-Aged
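The same classification can also be done without a loop. The sketch below uses cut(), which is not covered in the slides, on a fixed set of hypothetical ages, and checks that it matches the loop-based result:

```r
ages <- c(30, 26, 41, 35, 55, 56, 80)  # hypothetical ages, fixed for reproducibility

# Loop version, following the slide
ageLoop <- character(length(ages))
for (i in seq_along(ages)) {
  if (ages[i] <= 35) {
    ageLoop[i] <- "Young"
  } else if (ages[i] <= 55) {
    ageLoop[i] <- "Middle-Aged"
  } else {
    ageLoop[i] <- "Old"
  }
}

# Vectorized version: intervals are closed on the right by default,
# i.e. (-Inf, 35], (35, 55], (55, Inf)
ageCut <- cut(ages, breaks = c(-Inf, 35, 55, Inf),
              labels = c("Young", "Middle-Aged", "Old"))

identical(ageLoop, as.character(ageCut))  # TRUE when both agree
```

Note that cut() returns a factor, which is often the more convenient type for a categorical variable.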

Functions and Packages:

R has many built-in functions
Each function has a name followed by (), e.g., mean()
Arguments of a function are put within the parentheses
R packages are a way to maintain collections of R functions and data sets
Packages allow for easy, transparent and cross-platform extension of the R base system

Functions and Packages:

Terminology:

Package: an extension of the R base system with code, data and documentation in a standardized format
Library: a directory containing installed packages
Repository: a website providing packages for installation
Source: the original version of a package with human-readable text and code
Base packages: part of the R source tree, maintained by R Core

For more information on how R packages are developed, see “Creating R Packages: A Tutorial” (Friedrich Leisch):
http://cran.r-project.org/doc/contrib/Leisch-CreatingPackages.pdf

How to install a package in R:

There are two main ways to install a package in R:

Installing from CRAN: install a package directly from the repository
- Using RStudio: Tools > Install Packages
- From R console: install.packages()
Installing from Source: first download the add-on R package and then type the following in your console:
- install.packages("path_to_file", repos = NULL, type = "source")

Once you install a package, you need to load it into R using the function library()
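A common pattern is to install a package only when it is missing and then load it. The sketch below uses the parallel package purely as a stand-in, because it ships with R and so nothing is downloaded; substitute the name of the package you actually need:

```r
pkg <- "parallel"  # stand-in package name; ships with R

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package; character.only = TRUE lets library() accept a string variable
library(pkg, character.only = TRUE)
```

In interactive use, library(parallel) with the bare package name does the same as the last line.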

Popular Packages in R:

To visualize data:

ggplot2: to create beautiful graphics
googleVis: to use Google Chart tools

To report results:

shiny: to create interactive web-based apps
knitr: to combine R code with LaTeX or Markdown
slidify: to build HTML5 slide shows

To write high-performance R code:

Rcpp: to write R functions that call C++ code
data.table: to organize datasets for fast operations
parallel: to use parallel processing in R

Functions in R

Consider the function sample(). Run ?sample to read the help file or str(sample) to see its arguments.

str(sample)

## function (x, size, replace = FALSE, prob = NULL)

sample() has four arguments:
- x: vector of elements from which to choose
- size: desired sample size
- replace: sampling with/without replacement (logical)
- prob: vector of probability weights
The help file will specify which arguments have default values (and what those values are)

Calling a function in R

Function arguments can either be matched by position within the parentheses or by name

sampSpace <- 1:6 
sample(sampSpace, 1) # arguments with default values can be omitted

## [1] 1

sample(size = 1, x = sampSpace) # no need to remember the order

## [1] 5

sample(size = 1, sampSpace)

## [1] 5
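Since sample() is random, the outputs above differ from run to run. To confirm that the three calling styles are equivalent, the sketch below resets the random seed before each call (the seed value 42 and the variable names are arbitrary choices):

```r
sampSpace <- 1:6

set.seed(42)
byPosition <- sample(sampSpace, 2)          # both arguments matched by position

set.seed(42)
byName <- sample(size = 2, x = sampSpace)   # both arguments matched by name

set.seed(42)
mixed <- sample(size = 2, sampSpace)        # size by name, x by position

identical(byPosition, byName)  # TRUE
identical(byPosition, mixed)   # TRUE
```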

Writing Your Own Functions

One strength of R is that users can define their own functions
The structure of a function is as follows:

yourFnName <- function(arg1, arg2, ...){
  statements # body of your code
  
  return(object) # what is to be returned
}

To use your function, you can simply call the function name as:

yourFnName(arg1, arg2, ...)

Writing Your Own Functions

Let’s write a function that takes three values (arguments) a, b, and c and returns the min of the three numbers

myMin <- function(a, b, c){
  myMinVal <- min(a, b, c)
  return(myMinVal)
}

myMin(10, 20, 30)

## [1] 10

myMin(10, NA, 20) # how to fix this so it returns 10?

## [1] NA
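One possible answer to the question above is to pass na.rm = TRUE through to min(). The function name myMinNA and its na.rm default are choices made for this sketch, not part of the slides:

```r
myMinNA <- function(a, b, c, na.rm = TRUE) {
  # min() drops missing values when na.rm = TRUE
  min(a, b, c, na.rm = na.rm)
}

resultIgnoringNA <- myMinNA(10, NA, 20)                 # 10
resultKeepingNA  <- myMinNA(10, NA, 20, na.rm = FALSE)  # NA
```

Exposing na.rm as an argument lets the caller decide whether a missing value should propagate.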

Some Useful Functions:

Here we are going to talk about:
- str(): to display the internal structure of an object
- apply(): to apply a function over the rows or columns of a matrix or data frame

str():

str() is a compact way of understanding what an object is and what is in that object

str(str)

## function (object, ...)

str(sample)

## function (x, size, replace = FALSE, prob = NULL)

genderF <- factor(sample(c("Male", "Female"), 20, replace = TRUE))
str(genderF)

##  Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 2 2 2 1 ...

str():

myMat <- matrix(1:10, ncol = 5)
str(myMat)

##  int [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10

myList <- list(numVec = 1:3, logVec = FALSE, charVec = LETTERS[1:4])
str(myList)

## List of 3
##  $ numVec : int [1:3] 1 2 3
##  $ logVec : logi FALSE
##  $ charVec: chr [1:4] "A" "B" "C" "D"

apply():

str(apply) # try ?apply for more info

## function (X, MARGIN, FUN, ...)

apply() applies a function (FUN) over a MARGIN of a matrix or array (X); a data frame is coerced to a matrix first
MARGIN: a vector giving the subscripts over which the function will be applied
- 1: indicates rows
- 2: indicates columns
- c(1, 2): indicates rows and columns, i.e., each individual element
FUN: the function to apply
...: additional arguments passed on to FUN

apply():

Suppose we have a matrix and we want to sum each column. We can use apply to do this.

myMat <- matrix(1:10, ncol = 5)
myMat

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

apply(myMat, 2, sum)

## [1]  3  7 11 15 19
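The other MARGIN values work the same way. A brief sketch using the same matrix (rowTotals and doubled are illustrative names):

```r
myMat <- matrix(1:10, ncol = 5)

# MARGIN = 1: apply sum() to each row
rowTotals <- apply(myMat, 1, sum)
rowTotals  # 25 30

# MARGIN = c(1, 2): apply the function to every element;
# the result has the same shape as myMat
doubled <- apply(myMat, c(1, 2), function(v) v * 2)
dim(doubled)  # 2 5
```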

apply():

Now, suppose our matrix has some NAs that we want to ignore in our summation.
The sum() function has an argument na.rm that ignores NAs in a sum. We can include this in the apply() function.

myMat <- matrix(1:10, ncol = 5)
myMat[2,c(2, 5)] <- NA
myMat

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2   NA    6    8   NA

apply(myMat, 2, sum, na.rm = TRUE)

## [1]  3  3 11 15  9

apply():

Consider the iris dataset:

head(iris) # more info ?iris

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Suppose we are interested in getting the 25% and 75% quantiles of each numeric column
Check the help page of quantile() to see what arguments should be included.

apply():

apply(iris[,-5], 2, quantile, probs = c(0.25, 0.75))

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 25%          5.1         2.8          1.6         0.3
## 75%          6.4         3.3          5.1         1.8

Other functions in the apply() family:

As mentioned above, apply() applies a function over the margins of a matrix or an array and returns a vector, an array, or a list
Other apply functions can be applied to and/or return other data structures (we will briefly mention them here; check the help files for each function for more information)

Other functions in the apply() family:

lapply(): can be used on a list, data frame, or vector; returns a list
sapply(): works like lapply(), but simplifies the result
mapply(): stands for “multivariate” apply; applies a function element by element to several list or vector arguments in parallel
tapply(): applies a function to subsets of a vector, where the subsets are defined by a factor
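A compact sketch of all four functions follows. The list scores and the result names are made up for illustration; iris is the built-in dataset used earlier:

```r
scores <- list(a = 1:3, b = 4:6)  # hypothetical list

# lapply(): always returns a list
asList <- lapply(scores, sum)     # $a is 6, $b is 15

# sapply(): simplifies the result to a named vector here
asVector <- sapply(scores, sum)   # a = 6, b = 15

# mapply(): walks over several vectors in parallel
pairSums <- mapply(function(x, y) x + y, 1:3, 4:6)  # 5 7 9

# tapply(): mean sepal length within each species
bySpecies <- tapply(iris$Sepal.Length, iris$Species, mean)
bySpecies
```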

Intro to R Workshop: Session 2

UCI Data Science Initiative

Session 2 - Agenda

- Vectorized Operations
- Reading and Writing Data
- read.table()
- write.table()
- Control Structures
- for loops
- if/else statements
- Combining for loops and if/else statements
- Functions and Packages
- How to install a package in R
- Popular Packages in R
- Functions in R
- Calling a function in R
- Writing Your Own Functions
- Some Useful Functions
- str()
- apply()
- Other functions in the apply() family
- Break Time