Intro to Data Analysis with R: Session 1

UCI Data Science Initiative

April 20, 2018

Introduction

Session 1 - Agenda

  1. What is R?
  2. Intro to RStudio
  3. Data Structures
  4. Subsetting & Indexing

What is R?

RStudio

Fundamentals of R

Functions in R

  1. R has many built-in functions
  2. Each function has a name followed by (), e.g., mean()
  3. Arguments of a function are put within the parentheses
  4. Help files are available for each function by typing ? followed by the function name, e.g., ?mean
  5. We will be using functions throughout the workshop
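
As a small illustration of the points above, the sketch below calls mean() with and without a named argument. The vector x is made-up example data, not part of the workshop files.

```r
# Made-up example data (not from the workshop files)
x <- c(2, 4, 6, NA)

mean(x)                # NA: a missing value propagates by default
mean(x, na.rm = TRUE)  # 4: the named argument na.rm drops the NA first
args(mean)             # prints the arguments a function accepts
```

Arguments can be matched by position or by name; naming them, as with na.rm above, usually makes a call easier to read.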

Data Types in R

Data Structures in R

  1. One-dimensional:
    • Vectors
  2. Multi-dimensional:
    • Matrices
    • Data frames

 

Vectors in R

numVec <- c(2,3,4)  # <- is the assignment operator
numVec
## [1] 2 3 4
length(numVec)  # gives the number of elements in the vector
## [1] 3

Examples of Character and Logical Vectors

charVec <- c("red", "green", "blue")
charVec
## [1] "red"   "green" "blue"
logVec <- c(TRUE, FALSE, FALSE, T, F)
logVec
## [1]  TRUE FALSE FALSE  TRUE FALSE
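
To check which type a vector holds, class() and typeof() can be used. The following is a minimal sketch that redefines the three example vectors so it runs on its own.

```r
numVec  <- c(2, 3, 4)
charVec <- c("red", "green", "blue")
logVec  <- c(TRUE, FALSE, FALSE, T, F)

class(numVec)   # "numeric"
class(charVec)  # "character"
class(logVec)   # "logical"
typeof(1:3)     # "integer": the colon operator creates integers
sum(logVec)     # 2: TRUE counts as 1 and FALSE as 0
```

T and F are ordinary variables that default to TRUE and FALSE and can be overwritten, so spelling out TRUE and FALSE is the safer habit.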

Data Type Coercion

numCharVec <- c(3.14, "a")
numCharVec                 
## [1] "3.14" "a"
numVec <- 1:10
numToChar <- as.character(numVec)  # same result as as(numVec, "character")
numToChar
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
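
Coercion also happens between logical, integer, and numeric values, and explicit coercion can fail for individual elements. A short sketch with made-up vectors (mixed and chars are illustrative names):

```r
# Implicit coercion: logical and integer values are promoted to numeric
mixed <- c(TRUE, 2L, 3.5)
class(mixed)  # "numeric"
mixed         # 1.0 2.0 3.5

# Explicit coercion: strings that are not numbers become NA (with a warning)
chars <- c("1", "2", "three")
nums <- suppressWarnings(as.numeric(chars))
nums          # 1 2 NA
```

The warning is suppressed here only to keep the example quiet; in real analyses it is usually worth reading.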

Factors

sex <- rep(c("Female", "Male"), times = 3)
sex
## [1] "Female" "Male"   "Female" "Male"   "Female" "Male"
sexFac1 <- factor(sex)
sexFac1
## [1] Female Male   Female Male   Female Male  
## Levels: Female Male
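
By default the levels are sorted alphabetically. As a sketch of how the order can be set explicitly and how a factor is stored, consider the following (sexFac2 is an illustrative name, not from the slides):

```r
sex <- rep(c("Female", "Male"), times = 3)

# Set the level order explicitly instead of relying on alphabetical order
sexFac2 <- factor(sex, levels = c("Male", "Female"))
levels(sexFac2)      # "Male" "Female"
as.integer(sexFac2)  # 2 1 2 1 2 1: the integer codes behind the labels
table(sexFac2)       # 3 observations in each level
```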

Matrices

myMat <- matrix(nrow = 2, ncol = 4)
myMat
##      [,1] [,2] [,3] [,4]
## [1,]   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA
length(myMat)  # gives the total number of elements in the matrix, nrow*ncol
## [1] 8
dim(myMat)  # gives a vector containing the dimensions of the matrix, (nrow, ncol)
## [1] 2 4
myMat <- matrix(nrow = 2, ncol = 4, data = 1:8)
myMat
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
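
As the output shows, matrix() fills column by column. A minimal sketch of filling row by row instead, using the byrow argument (byRowMat is an illustrative name):

```r
byRowMat <- matrix(data = 1:8, nrow = 2, ncol = 4, byrow = TRUE)
byRowMat
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
dim(t(byRowMat))  # t() transposes, so the dimensions become 4 2
```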

Other Ways to Create a Matrix

vec1 <- 1:4
vec2 <- 5:8
vec3 <- 9:12
colMat <- cbind(vec1, vec2, vec3)
colMat
##      vec1 vec2 vec3
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

Other Ways to Create a Matrix, contd.

rowMat <- rbind(vec1, vec2, vec3)
rowMat
##      [,1] [,2] [,3] [,4]
## vec1    1    2    3    4
## vec2    5    6    7    8
## vec3    9   10   11   12

Special Values

There are some special values in R:

a <- Inf
b <- 0
rslt <- c(b/a, a/a, 1/b)
rslt
## [1]   0 NaN Inf
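
To test for these values, is.finite() and is.infinite() can be used. A short sketch with a made-up vector:

```r
vals <- c(1, Inf, -Inf, NaN, NA)
is.finite(vals)    # TRUE FALSE FALSE FALSE FALSE
is.infinite(vals)  # FALSE TRUE TRUE FALSE FALSE
```

Note that NaN and NA are neither finite nor infinite.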

Missing Values

a <- c(1,2)
a[3]
## [1] NA
b <- 0/0
b
## [1] NaN

Missing Values, contd.

vec <- c(1, NA, 3, NaN, NA, 5, NaN)
is.na(vec)
## [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
is.nan(vec)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
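
Since is.na() is TRUE for both NA and NaN, it can be used to count or drop missing values. A minimal sketch reusing the same vector:

```r
vec <- c(1, NA, 3, NaN, NA, 5, NaN)

sum(is.na(vec))         # 4: number of NA or NaN entries
vec[!is.na(vec)]        # 1 3 5: keep only the non-missing entries
is.na(mean(vec))        # TRUE: missing values propagate through mean()
mean(vec, na.rm = TRUE) # 3: na.rm = TRUE drops NA and NaN before averaging
```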

Data Frames

Reading Data into R

  1. read.table(), read.csv(): to read tabular data
  2. readLines(): to read lines of a text file
  3. source(), dget(): to read R code
  4. load(): to read saved workspaces and R data files (.Rda)
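
As a self-contained sketch of the first two functions in the list, the code below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names tmp and dat are illustrative only.

```r
# Create a small CSV file in a temporary location (made-up contents)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1.5", "b,2.5"), tmp)

dat <- read.csv(tmp)   # read.csv() defaults to header = TRUE and sep = ","
str(dat)               # compact overview of the columns and their types
readLines(tmp, n = 1)  # "name,score": the raw first line of the file

unlink(tmp)            # remove the temporary file
```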

Reading Data into R, contd.

Prestige Data

We will be using the data frame you just read in throughout the day. It contains the following attributes: education, income, women, prestige, census, and type.

Reading in the Prestige Data

path <- here::here("data", "prestige.csv")
path  # example path from Chris' machine
## [1] "/Users/cmg/dev/UCIDataScienceInitiative/IDA-with-R/data/prestige.csv"
prestige <- read.table(file = path, 
                       sep = ",", 
                       header = TRUE, 
                       row.names = 1)
class(prestige)  # gives object type
## [1] "data.frame"
head(prestige)  # look at the first 6 rows, equivalent to prestige[1:6, ]
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
dim(prestige)  # (nrow, ncol)
## [1] 102   6

Prestige Data Frame

head(row.names(prestige))  # equivalent to row.names(prestige)[1:6]
## [1] "gov.administrators"  "general.managers"    "accountants"        
## [4] "purchasing.officers" "chemists"            "physicists"

Subsetting

Subsetting Vectors

vec <- 1:10
vec[3]
## [1] 3

Single brackets allow us to select more than one element of an object:

vec[1:3]
## [1] 1 2 3
vec[c(2,4,6)]
## [1] 2 4 6
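
Single brackets also accept negative indices, which exclude elements, and logical conditions. A short sketch:

```r
vec <- 1:10

vec[-1]             # 2 3 4 5 6 7 8 9 10: everything except the first element
vec[-(1:3)]         # 4 5 6 7 8 9 10: drop the first three elements
vec[vec > 7]        # 8 9 10: keep elements where the condition is TRUE
vec[vec %% 2 == 0]  # 2 4 6 8 10: keep the even values
```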

Subsetting Data Frames

prestige[1:2, ]  # get the first 2 rows
##                    education income women prestige census type
## gov.administrators     13.11  12351 11.16     68.8   1113 prof
## general.managers       12.26  25879  4.02     69.1   1130 prof
prestige[1, 2]  # get the first row, second column
## [1] 12351

Subsetting Data Frames, contd.

head(prestige$income)  # look at the first 6 values of the income column
## [1] 12351 25879  9271  8865  8403 11030
prestige$income[1]  # get the first element of the income vector
## [1] 12351
prestige[1, 2]  # get the first row, second column
## [1] 12351
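
Selecting a single column with single brackets returns a vector unless drop = FALSE is given. The sketch below shows the difference on a small made-up data frame (df and its values are illustrative, not the Prestige data):

```r
df <- data.frame(income = c(100, 200, 300),
                 type = c("bc", "wc", "prof"),
                 row.names = c("a", "b", "c"))

df[, "income"]                # numeric vector: 100 200 300
df[, "income", drop = FALSE]  # still a data frame with one column
df[["income"]]                # double brackets: same vector as df$income
df["b", ]                     # rows can be selected by name as well
```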

Index Vectors

1. Logical Index Vector

logIndVec <- is.na(prestige$type)
head(logIndVec)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
length(logIndVec)  # same as the number of rows in prestige
## [1] 102
prestige[logIndVec,]
##             education income women prestige census type
## athletes        11.44   8206  8.13     54.1   3373 <NA>
## newsboys         9.62    918  7.00     14.8   5143 <NA>
## babysitters      9.46    611 96.53     25.9   6147 <NA>
## farmers          6.84   3643  3.60     44.1   7112 <NA>

2. Index Vector of Positive Integers

posIndVec <- which(is.na(prestige$type))
posIndVec  # indices of missing elements of type vector
## [1] 34 53 63 67
prestige[posIndVec,]
##             education income women prestige census type
## athletes        11.44   8206  8.13     54.1   3373 <NA>
## newsboys         9.62    918  7.00     14.8   5143 <NA>
## babysitters      9.46    611 96.53     25.9   6147 <NA>
## farmers          6.84   3643  3.60     44.1   7112 <NA>
which(logIndVec == TRUE)
## [1] 34 53 63 67
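
Logical index vectors are most often built from a comparison. A minimal sketch on a made-up data frame (the row names and values are hypothetical, not taken from the Prestige data):

```r
df <- data.frame(education = c(12, 7, 9, 15),
                 income = c(9000, 3000, 1000, 11000),
                 row.names = c("occA", "occB", "occC", "occD"))

highEd <- df$education > 10        # logical index vector, one entry per row
df[highEd, ]                       # rows occA and occD
df[highEd & df$income > 10000, ]   # combine conditions with &: row occD only
which(highEd)                      # 1 4: the equivalent positive integer indices
```

A logical vector can be passed to which() directly; comparing it with TRUE first gives the same result.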

3. Index Vector of Character Strings

prestige[1:5, c("education", "income")]
##                     education income
## gov.administrators      13.11  12351
## general.managers        12.26  25879
## accountants             12.77   9271
## purchasing.officers     11.42   8865
## chemists                14.62   8403
prestige[1:5, 1:2]
##                     education income
## gov.administrators      13.11  12351
## general.managers        12.26  25879
## accountants             12.77   9271
## purchasing.officers     11.42   8865
## chemists                14.62   8403

Using Index Vectors to Sort a Data Frame

sort.index <- order(prestige$education, decreasing = FALSE)
sort.index[1:5]
## [1] 84 92 87 74 75
prestige.sorted <- prestige[sort.index, ]
prestige.sorted[1:5, ]
##                       education income women prestige census type
## sewing.mach.operators      6.38   2847 90.67     28.2   8563   bc
## masons                     6.60   5959  0.52     36.2   8782   bc
## railway.sectionmen         6.67   4696  0.00     27.3   8715   bc
## textile.weavers            6.69   4443 31.36     33.3   8267   bc
## textile.labourers          6.74   3485 39.48     28.8   8278   bc
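
order() can also sort in decreasing order or by several keys. A short sketch on a made-up data frame (df, group, and score are illustrative names):

```r
df <- data.frame(group = c("b", "a", "b", "a"),
                 score = c(3, 1, 2, 4))

# Largest score first: rows 4, 1, 3, 2
df[order(df$score, decreasing = TRUE), ]

# Sort by group, then by decreasing score within each group: rows 4, 2, 1, 3
df[order(df$group, -df$score), ]
```

Negating a numeric key is a common way to reverse its direction while leaving the other keys ascending.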

Subsetting Example

ind <- which(is.na(prestige$type)) 
prestige[ind,]
##             education income women prestige census type
## athletes        11.44   8206  8.13     54.1   3373 <NA>
## newsboys         9.62    918  7.00     14.8   5143 <NA>
## babysitters      9.46    611 96.53     25.9   6147 <NA>
## farmers          6.84   3643  3.60     44.1   7112 <NA>
rbind(index=ind, name=rownames(prestige)[ind])
##       [,1]       [,2]       [,3]          [,4]     
## index "34"       "53"       "63"          "67"     
## name  "athletes" "newsboys" "babysitters" "farmers"

Subsetting Example, contd.

ind.ch <- ind[2:4]
prestige[ind.ch, "type"] <- rep("bc", 3)
summary(prestige$type)
##   bc prof   wc NA's 
##   47   31   23    1
prestige <- na.omit(prestige)
summary(prestige$type)
##   bc prof   wc 
##   47   31   23
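
na.omit() drops every row that has a missing value in any column. As a sketch of what that means, and of a more selective alternative, consider this made-up data frame:

```r
df <- data.frame(x = c(1, NA, 3),
                 y = c("a", "b", NA))

complete.cases(df)    # TRUE FALSE FALSE: rows with no missing values
nrow(na.omit(df))     # 1: only the first row survives
df[!is.na(df$x), ]    # rows 1 and 3: drop rows only where x is missing
```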

Writing Data to File

write.table(prestige, 
            file = here::here("data", "prestige_v2.csv"), 
            sep = ",", 
            col.names = TRUE, 
            row.names = TRUE)
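
To check that a file was written as intended, it can be read back and compared with the original. The sketch below does this with a made-up data frame and a temporary file, using write.csv(), a wrapper around write.table() (df, out, and back are illustrative names):

```r
df <- data.frame(score = c(1.5, 2.5), row.names = c("a", "b"))

out <- tempfile(fileext = ".csv")
write.csv(df, file = out)             # writes row names plus a header line

back <- read.csv(out, row.names = 1)  # first column holds the row names
identical(rownames(back), rownames(df))  # TRUE
all.equal(back$score, df$score)          # TRUE

unlink(out)                           # remove the temporary file
```

With row names included, write.csv() adds a blank header entry for the row-name column, which is the layout spreadsheet programs expect; with write.table(), col.names = NA gives the same layout.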

End of Session 1

Next up:

  1. Exercise 1
  2. Break

Return at 11:00 to discuss solutions to Exercise 1!