Intro to Data Analysis with R: Session 1

UCI Data Science Initiative

April 20, 2018

Introduction

Please ask questions during lectures & throughout the day!
To access course materials please visit the IDA-with-R website.
Please download & unzip the GitHub repository!

Session 1 - Agenda

What is R?
Intro to RStudio
Data Structures
Subsetting & Indexing

What is R?

R is a free software environment for statistical computing and graphics
- See http://www.r-project.org/ for more info
R compiles and runs on a wide variety of UNIX platforms, Windows and Mac OS
R is open-source and free
R is fundamentally a command-driven system
R is an object-oriented programming language
- Everything in R is an object (data, functions, etc.)
R is highly extendable
- You can write your own custom functions
- There are over 11,000 free add-on packages on CRAN (as of 2018)

RStudio

RStudio is a free and open source integrated development environment (IDE) for R.
Visit http://rstudio.org/ for more info
Please note that you must have R already installed before installing RStudio!
In the repository that you downloaded, there is a file called IDA-with-R.Rproj – please open it in RStudio.

Fundamentals of R

When you type commands at the prompt ‘>’ and hit ENTER
- R tries to interpret what you’ve asked it to do (evaluation)
- If it understands what you’ve written, it does it (execution)
- If it doesn’t, it will likely give you an error or a warning
Some commands trigger R to print to the screen, others don’t
If you type an incomplete command, R will usually respond by changing the command prompt to the ‘+’ character
- Press Esc to cancel in RStudio (on any platform) and in the R GUI on Mac or Windows
- Press Ctrl + C to cancel when R is running in a terminal
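
As a small illustration of which commands print and which do not (the object names x and y are made up for this sketch):

```r
# Assignment is executed silently: nothing is printed
x <- 5

# Typing the name of an object prints its value
x

# An expression that is not assigned is printed as well
x * 2

# Wrapping an assignment in parentheses both assigns and prints
(y <- x + 1)
```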

Functions in R

R has many built-in functions
Each function has a name followed by (), e.g., mean()
Arguments of a function are put within the parentheses
Help files are available for each function by typing ? followed by the function name, e.g., ?mean
We will be using functions throughout the workshop
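
A minimal sketch of calling functions with arguments, assuming a made-up vector called scores:

```r
# A small numeric vector with one missing value (made-up data)
scores <- c(10, 20, NA, 40)

# Arguments go inside the parentheses; x is the data to average
mean(x = scores)   # the missing value makes the result missing too

# na.rm is an optional argument described in ?mean
avg <- mean(x = scores, na.rm = TRUE)
avg

# args() lists the arguments that a function accepts
args(round)
rounded <- round(avg, digits = 1)
rounded
```

Naming the arguments (x =, na.rm =, digits =) is optional, but it makes the call easier to read.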

Data Types in R

R has 5 main atomic data types:
- Numeric
- Character
- Logical
- Integer
- Complex
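
One way to see these types in practice is to create a value of each and ask R for its class (the object names are only for illustration):

```r
num  <- 3.14      # numeric
chr  <- "hello"   # character
lgl  <- TRUE      # logical
int  <- 2L        # integer: the L suffix requests an integer
cplx <- 1 + 2i    # complex

# class() reports the type of each object
types <- c(class(num), class(chr), class(lgl), class(int), class(cplx))
types
```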

Data Structures in R

One-dimensional:
- Vectors
Multi-dimensional:
- Matrices
- Data frames

Everything in R is an object
Objects can have attributes
- e.g., names, dimension, length
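
A short sketch of inspecting these properties, using a made-up named vector and a small matrix:

```r
heights <- c(ann = 160, bob = 175, cy = 182)   # made-up named vector
names(heights)    # the names attribute
length(heights)   # number of elements

m <- matrix(1:6, nrow = 2)
dim(m)            # the dim attribute: rows, then columns
attributes(m)     # lists all attributes stored on the object
```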

Vectors in R

A vector is the most basic object in R
It is one-dimensional; its single dimension is its length
A vector of length n has n cells
Each cell can hold a single value, like a numeric or string value
- In general, vectors can only hold ONE type of data

numVec <- c(2,3,4)  # <- is the assignment operator
numVec

## [1] 2 3 4

length(numVec)  # gives the number of elements in the vector

## [1] 3

Examples of Character and Logical Vectors

charVec <- c("red", "green", "blue")
charVec

## [1] "red"   "green" "blue"

logVec <- c(TRUE, FALSE, FALSE, T, F)
logVec

## [1]  TRUE FALSE FALSE  TRUE FALSE

Data Type Coercion

In general, vectors CANNOT have mixed types of data
If you create a vector with more than one type of data, R will automatically coerce it to a single type

numCharVec <- c(3.14, "a")
numCharVec

## [1] "3.14" "a"

Explicitly coerce objects from one type to another using the function as(), or shortcuts such as as.character() and as.numeric()
- Be careful about warnings; always check to make sure the coercion is correct!

numVec <- 1:10
numToChar <- as(numVec, "character")
numToChar

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
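
The warning mentioned above typically appears when a value cannot be converted. A minimal sketch with a made-up character vector:

```r
mixed <- c("1", "2", "three")   # made-up character vector

# "three" cannot be read as a number, so R warns and returns NA for it
asNum <- as.numeric(mixed)
asNum

# Check the result: which elements failed to convert?
failed <- which(is.na(asNum))
failed
```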

Factors

A factor is a vector object used to specify a discrete classification (categorical values)
Factors can be ordered or un-ordered
Labeled levels are self-descriptive, which makes a factor easier to read than numeric codes
- Compare sex coded as (0, 1) with sex labeled (“F”, “M”)

sex <- rep(c("Female", "Male"), times = 3)
sex

## [1] "Female" "Male"   "Female" "Male"   "Female" "Male"

sexFac1 <- factor(sex)
sexFac1

## [1] Female Male   Female Male   Female Male  
## Levels: Female Male
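
The example above is an un-ordered factor. As a sketch of the other two points, here is a 0/1 coding turned into labeled levels, followed by an ordered factor (sexCode and size are made-up data):

```r
# Hypothetical 0/1 coding of the same variable
sexCode <- c(0, 1, 0, 1, 0, 1)
sexFac2 <- factor(sexCode, levels = c(0, 1), labels = c("Female", "Male"))
sexFac2

# An ordered factor: the levels have a meaningful order
size <- c("small", "large", "medium", "small")
sizeFac <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
sizeFac

# Comparisons are allowed for ordered factors
smaller <- sizeFac < "large"
smaller
```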

Matrices

A matrix is a vector with a dimension attribute (number of rows and columns)
- Like vectors, all elements of a matrix should be of the same data type

myMat <- matrix(nrow = 2, ncol = 4)
myMat

##      [,1] [,2] [,3] [,4]
## [1,]   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA

length(myMat)  # gives the total number of elements in the matrix, nrow*ncol

## [1] 8

dim(myMat)  # gives a vector containing the dimensions of the matrix, (nrow, ncol)

## [1] 2 4

Matrices are filled column-wise (unless otherwise specified)

myMat <- matrix(nrow = 2, ncol = 4, data = 1:8)
myMat

##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
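
To fill by row instead, matrix() takes a byrow argument. A brief sketch (rowFilled is just an illustrative name):

```r
# byrow = TRUE fills the matrix one row at a time
rowFilled <- matrix(data = 1:8, nrow = 2, ncol = 4, byrow = TRUE)
rowFilled
# Row 1 now holds 1 2 3 4 and row 2 holds 5 6 7 8
```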

Other Ways to Create a Matrix

Intuitively, matrices seem to be a combination of vectors that are put next to each other (either column-wise or row-wise)
rbind() and cbind() (row bind and column bind) achieve this

vec1 <- 1:4
vec2 <- 5:8
vec3 <- 9:12
colMat <- cbind(vec1, vec2, vec3)
colMat

##      vec1 vec2 vec3
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

Other Ways to Create a Matrix, contd.

rowMat <- rbind(vec1, vec2, vec3)
rowMat

##      [,1] [,2] [,3] [,4]
## vec1    1    2    3    4
## vec2    5    6    7    8
## vec3    9   10   11   12

Special Values

There are some special values in R:

Inf: infinity
NaN: “Not a number”

a <- Inf
b <- 0
rslt <- c(b/a, a/a, 1/b)
rslt

## [1]   0 NaN Inf

Missing Values

There are two kinds of missing values in R:
- NaN: stands for “Not a Number” and is produced by an undefined numerical computation, such as 0/0
- NA: stands for “Not Available” and is used when a value is missing

a <- c(1,2)
a[3]

## [1] NA

b <- 0/0
b

## [1] NaN

Missing Values, contd.

is.na() and is.nan() are functions that indicate which value(s) of an object are missing
NaN is also considered NA, so is.na() returns TRUE for NaN (the reverse is NOT true: is.nan() returns FALSE for NA)

vec <- c(1, NA, 3, NaN, NA, 5, NaN)
is.na(vec)

## [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE

is.nan(vec)

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
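
Because TRUE counts as 1, these functions can also be used to count missing values. A small sketch using the same vector, plus the na.rm argument of mean():

```r
vec <- c(1, NA, 3, NaN, NA, 5, NaN)

nMissing <- sum(is.na(vec))    # counts NA and NaN together
nNaN     <- sum(is.nan(vec))   # counts NaN only
nMissing
nNaN

# Missing values propagate through most computations
mean(vec)

# na.rm = TRUE drops NA and NaN before averaging
cleanMean <- mean(vec, na.rm = TRUE)
cleanMean
```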

Data Frames

A data frame looks very similar to a matrix
Key difference: unlike a matrix, different columns in a data frame can be different data types
Often, data used for analysis will be in the form of a data frame
Let’s learn how to read a data frame into R
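
Before reading data from a file, a small data frame built by hand shows the mixed column types. This is only a sketch with made-up data:

```r
# Three columns of three different types
df <- data.frame(
  name   = c("a", "b", "c"),
  score  = c(90.5, 72.0, 88.25),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)
df

# Each column keeps its own type
colClasses <- sapply(df, class)
colClasses

# A matrix, by contrast, coerces everything to a single type
asMat <- as.matrix(df)
class(asMat[1, 2])
```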

Reading Data into R

Main functions for reading data into R:

read.table(), read.csv(): to read tabular data
readLines(): to read lines of a text file
source(), dget(): to read R code
load(): to read saved workspaces and R data files (.Rda)

Only read.table() and read.csv() will be covered today.

Reading Data into R, contd.

read.table() is the most commonly used function to read data in R
- Type ?read.table in your R console to see the important arguments in the function
read.csv() is intended for reading comma separated value files
- Essentially read.table() with sep = "," and header = TRUE as the defaults
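
A self-contained sketch comparing the two functions on a tiny made-up CSV written to a temporary file (the file contents and object names are assumptions for illustration only):

```r
# Write a tiny CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,height,group",
             "1,160.5,a",
             "2,175.0,b",
             "3,182.3,a"), tmp)

# Read it both ways
viaTable <- read.table(file = tmp, sep = ",", header = TRUE)
viaCsv   <- read.csv(file = tmp)

# For a simple file like this one, the two results should match
all.equal(viaTable, viaCsv)
str(viaCsv)

unlink(tmp)   # remove the temporary file
```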

Prestige Data

We will be using the Prestige data frame throughout the day. It contains the following variables (columns):

education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation, a factor with levels
- bc: Blue Collar
- prof: Professional, Managerial, and Technical
- wc: White Collar

Reading in the Prestige Data

Before executing the following commands, make sure that you are working in your local IDA-with-R-master directory!
- Type getwd() into your R console
- This should return a string with a path to the directory, e.g. "~/IDA-with-R-master"
- The function here::here() builds a file path relative to the project's root directory, so R knows where to find your data

path <- here::here("data", "prestige.csv")
path  # example path from Chris' machine

## [1] "/Users/cmg/dev/UCIDataScienceInitiative/IDA-with-R/data/prestige.csv"

prestige <- read.table(file = path, 
                       sep = ",", 
                       header = TRUE, 
                       row.names = 1,
                       stringsAsFactors = TRUE)  # keeps type as a factor (no longer the default in R >= 4.0)
class(prestige)  # gives the class of the object

## [1] "data.frame"

head(prestige)  # look at the first 6 rows, equivalent to prestige[1:6, ]

##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof

dim(prestige)  # (nrow, ncol)

## [1] 102   6

Prestige Data Frame

The prestige data frame has named rows.
We can use the function row.names() to see the names of each row.

head(row.names(prestige))  # equivalent to row.names(prestige)[1:6]

## [1] "gov.administrators"  "general.managers"    "accountants"        
## [4] "purchasing.officers" "chemists"            "physicists"

Subsetting

Now that we have a data frame loaded, let’s learn about subsetting data frames and vectors
Consider two main operators to take a subset of an object
- [ ] single brackets generally return an object of the same class as the original object
- $ used primarily for selecting columns from data frames
  - We use $ to select a single element, such as a column, by name
[ ] allows us to select more than one element
$ allows us to select only one
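
A minimal sketch of the difference between the two operators, using a made-up data frame:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))   # made-up data

oneCol  <- df["x"]           # single brackets keep the data frame class
twoCols <- df[c("x", "y")]   # and can select several columns at once
asVec   <- df$x              # $ extracts one column as a plain vector

class(oneCol)
class(asVec)
```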

Subsetting Vectors

vec <- 1:10
vec[3]

## [1] 3

Single brackets allow us to select more than one element of an object

vec[1:3]

## [1] 1 2 3

vec[c(2,4,6)]

## [1] 2 4 6

Subsetting Data Frames

We also use the single square brackets to subset data frames
- In the square brackets, the first position refers to the row(s) and the second position refers to the column(s)

prestige[1:2, ]  # get the first 2 rows

##                    education income women prestige census type
## gov.administrators     13.11  12351 11.16     68.8   1113 prof
## general.managers       12.26  25879  4.02     69.1   1130 prof

prestige[1, 2]  # get the first row, second column

## [1] 12351

Subsetting Data Frames, contd.

We use $ to select a single element by name
- This is commonly used to subset a column of a data frame

head(prestige$income)  # select the first 6 values of the second column

## [1] 12351 25879  9271  8865  8403 11030

Can combine $ with [ ]

prestige$income[1]  # get the first element of the income vector

## [1] 12351

prestige[1, 2]  # get the first row, second column

## [1] 12351

Index Vectors

Another way to select more than one element from an object is by using index vectors
- An index vector specifies which elements of another vector (or matrix, or data frame) to select
- It can identify elements by TRUE/FALSE values, by position (integers from 1 to n, the length of the object), or by name
We will cover three types of index vectors:
1. Logical index vector
2. Vector of positive integers
3. Vector of character strings
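
As a quick preview, all three types can select the same elements from a plain vector. The named vector temps is made up for this sketch:

```r
temps <- c(mon = 18, tue = 21, wed = 25, thu = 19)   # made-up named vector

byLogical  <- temps[temps > 20]        # 1. logical index vector
byPosition <- temps[c(2, 3)]           # 2. vector of positive integers
byName     <- temps[c("tue", "wed")]   # 3. vector of character strings

byLogical
identical(byLogical, byPosition)
identical(byPosition, byName)
```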

1. Logical Index Vector

A vector of TRUE/FALSE values that should be the same length as the vector from which we are subsetting.
- Values corresponding to TRUE in the index vector are selected
We can use the result of is.na() as a logical index vector to select the rows containing NAs

logIndVec <- is.na(prestige$type)
head(logIndVec)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

length(logIndVec)  # same as the number of rows in prestige

## [1] 102

prestige[logIndVec,]

##             education income women prestige census type
## athletes        11.44   8206  8.13     54.1   3373 <NA>
## newsboys         9.62    918  7.00     14.8   5143 <NA>
## babysitters      9.46    611 96.53     25.9   6147 <NA>
## farmers          6.84   3643  3.60     44.1   7112 <NA>

2. Index Vector of Positive Integers

A vector of positive integers corresponding to the elements you want to subset
We can use the function which() along with is.na() to create this index vector

posIndVec <- which(is.na(prestige$type))
posIndVec  # indices of missing elements of type vector

## [1] 34 53 63 67

prestige[posIndVec,]

##             education income women prestige census type
## athletes        11.44   8206  8.13     54.1   3373 <NA>
## newsboys         9.62    918  7.00     14.8   5143 <NA>
## babysitters      9.46    611 96.53     25.9   6147 <NA>
## farmers          6.84   3643  3.60     44.1   7112 <NA>

This gives the same result as before, reached in a different way.

which(logIndVec == TRUE)

## [1] 34 53 63 67

3. Index Vector of Character Strings

If an object has a names attribute, we can take a subset of it by referring to its elements by name
For example, we can use this to take a subset of the columns of the prestige data

prestige[1:5, c("education", "income")]

##                     education income
## gov.administrators      13.11  12351
## general.managers        12.26  25879
## accountants             12.77   9271
## purchasing.officers     11.42   8865
## chemists                14.62   8403

Same as subsetting by column number, but R looks up the column number for you

prestige[1:5, 1:2]

##                     education income
## gov.administrators      13.11  12351
## general.managers        12.26  25879
## accountants             12.77   9271
## purchasing.officers     11.42   8865
## chemists                14.62   8403

Using Index Vectors to Sort a Data Frame

We can use the order() function along with the subsetting operators to sort a data frame by a specific column.
Let’s sort the prestige data in ascending order by education.

sort.index <- order(prestige$education, decreasing = FALSE)
sort.index[1:5]

## [1] 84 92 87 74 75

prestige.sorted <- prestige[sort.index, ]
prestige.sorted[1:5, ]

##                       education income women prestige census type
## sewing.mach.operators      6.38   2847 90.67     28.2   8563   bc
## masons                     6.60   5959  0.52     36.2   8782   bc
## railway.sectionmen         6.67   4696  0.00     27.3   8715   bc
## textile.weavers            6.69   4443 31.36     33.3   8267   bc
## textile.labourers          6.74   3485 39.48     28.8   8278   bc
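
The same approach sorts in descending order. A small sketch on a made-up data frame, showing that order() returns row positions rather than sorted values:

```r
df <- data.frame(item  = c("a", "b", "c", "d"),
                 price = c(3.5, 1.0, 4.2, 2.8))   # made-up data

# Positions of the rows, largest price first
idx <- order(df$price, decreasing = TRUE)
idx

# Use the positions as an index vector for the rows
sortedDesc <- df[idx, ]
sortedDesc
```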

Subsetting Example

Let’s see an example of how subsetting might be used to manipulate data.
We will replace some of the missing values in the type column with "bc"
Recall that we used an index vector to select the rows of prestige that contain NAs

ind <- which(is.na(prestige$type)) 
prestige[ind,]

##             education income women prestige census type
## athletes        11.44   8206  8.13     54.1   3373 <NA>
## newsboys         9.62    918  7.00     14.8   5143 <NA>
## babysitters      9.46    611 96.53     25.9   6147 <NA>
## farmers          6.84   3643  3.60     44.1   7112 <NA>

rbind(index=ind, name=rownames(prestige)[ind])

##       [,1]       [,2]       [,3]          [,4]     
## index "34"       "53"       "63"          "67"     
## name  "athletes" "newsboys" "babysitters" "farmers"

Subsetting Example, contd.

We will replace the last three NA’s with "bc" (blue collar)

ind.ch <- ind[2:4]
prestige[ind.ch, "type"] <- rep("bc", 3)
summary(prestige$type)

##   bc prof   wc NA's 
##   47   31   23    1

Exclude any rows that still contain NA (one row for athletes)

prestige <- na.omit(prestige)
summary(prestige$type)

##   bc prof   wc 
##   47   31   23
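
The replacement above works because "bc" is already a level of the type factor. A sketch of what to do when the new value is not yet a level, using a made-up factor:

```r
type <- factor(c("bc", NA, "wc", NA))   # made-up factor with two NAs
ind <- which(is.na(type))

# Works directly: "bc" is an existing level
type[ind[1]] <- "bc"
type

# Assigning a value that is not a level would give NA with a warning,
# so add the new level first
levels(type) <- c(levels(type), "prof")
type[ind[2]] <- "prof"
type
```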

Writing Data to File

We can use write.table() to write a data frame to a file
Similar to read.table(), but we now also specify the name of the data frame in addition to the path
- write.csv() is analogous to read.csv()
Let’s write our updated prestige data to a new csv file

write.table(prestige, 
            file = here::here("data", "prestige_v2.csv"), 
            sep = ",", 
            col.names = TRUE, 
            row.names = TRUE)
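
To check that a file was written as intended, it can be read back in. This is a sketch using a temporary file and a two-row data frame built from values shown earlier, not the workshop's actual output file:

```r
df <- data.frame(income = c(12351, 25879),
                 women  = c(11.16, 4.02),
                 row.names = c("gov.administrators", "general.managers"))

out <- tempfile(fileext = ".csv")
write.table(df, file = out, sep = ",", col.names = TRUE, row.names = TRUE)

# The header line has one field fewer than the data rows,
# so read.table() uses the first column as row names
readLines(out)[1]
back <- read.table(file = out, sep = ",", header = TRUE)
back

unlink(out)   # remove the temporary file
```

Note that write.csv() instead writes an empty first header field for the row names, so a file written that way is read back with row.names = 1.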

End of Session 1

Next up:

Exercise 1
Break

Return at 11:00 to discuss solutions to Exercise 1!