April 20, 2018
IDA-with-R.Rproj is the RStudio project file; please open it in RStudio.
To open the help page for a function such as mean(), type:
?mean
numVec <- c(2,3,4) # <- is the assignment operator
numVec
## [1] 2 3 4
length(numVec) # gives the number of elements in the vector
## [1] 3
charVec <- c("red", "green", "blue")
charVec
## [1] "red" "green" "blue"
logVec <- c(TRUE, FALSE, FALSE, T, F)
logVec
## [1] TRUE FALSE FALSE TRUE FALSE
numCharVec <- c(3.14, "a")
numCharVec
## [1] "3.14" "a"
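The output above is an instance of implicit coercion: when c() receives mixed types, R converts everything to the most general one. A minimal sketch of that ordering (logical, then integer, then double, then character); the names mixed1 to mixed3 are made up for illustration:

```r
# Each call mixes two types; the result takes the more general type.
mixed1 <- c(TRUE, 2L)    # logical + integer   -> integer
mixed2 <- c(1L, 2.5)     # integer + double    -> double ("numeric")
mixed3 <- c(2.5, "a")    # double + character  -> character

class(mixed1)
class(mixed2)
class(mixed3)
```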
as() explicitly coerces an object to another type:
numVec <- 1:10
numToChar <- as(numVec, "character")
numToChar
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
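Explicit coercion can fail for individual elements. The sketch below, using a made-up vector chars, shows that a string which cannot be parsed as a number becomes NA (R also emits a warning, suppressed here to keep the output clean). It uses as.numeric(), a shortcut in the same spirit as as(x, "numeric").

```r
chars <- c("1", "2", "three")

# "three" is not a number, so it is coerced to NA (with a warning).
nums <- suppressWarnings(as.numeric(chars))
nums          # 1 2 NA
is.na(nums)   # FALSE FALSE TRUE

# as.character() gives the same result as as(x, "character") here.
identical(as(1:10, "character"), as.character(1:10))
```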
sex <- rep(c("Female", "Male"), times = 3)
sex
## [1] "Female" "Male" "Female" "Male" "Female" "Male"
sexFac1 <- factor(sex)
sexFac1
## [1] Female Male Female Male Female Male
## Levels: Female Male
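By default factor() sorts the levels alphabetically, which is why Female comes first above. A short sketch of choosing the level order yourself and inspecting the integer codes that a factor stores internally; sexFac2 is a name introduced here for illustration:

```r
sex <- rep(c("Female", "Male"), times = 3)

# Set the level order explicitly instead of relying on alphabetical order.
sexFac2 <- factor(sex, levels = c("Male", "Female"))

levels(sexFac2)       # "Male" "Female"
as.integer(sexFac2)   # 2 1 2 1 2 1  (position of each value in levels)
table(sexFac2)        # counts per level
```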
myMat <- matrix(nrow = 2, ncol = 4)
myMat
## [,1] [,2] [,3] [,4]
## [1,] NA NA NA NA
## [2,] NA NA NA NA
length(myMat) # gives the total number of elements in the matrix, nrow*ncol
## [1] 8
dim(myMat) # gives a vector containing the dimensions of the matrix, (nrow, ncol)
## [1] 2 4
myMat <- matrix(nrow = 2, ncol = 4, data = 1:8)
myMat
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
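Note that matrix() fills column by column, as the output above shows. If the data are laid out row by row, the byrow argument changes the fill order. A minimal sketch (byRowMat is an illustrative name):

```r
# Same data as myMat, but filled across rows instead of down columns.
byRowMat <- matrix(data = 1:8, nrow = 2, ncol = 4, byrow = TRUE)
byRowMat
# First row is 1 2 3 4, second row is 5 6 7 8.
```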
Intuitively, matrices seem to be a combination of vectors that are put next to each other (either column-wise or row-wise).
rbind() and cbind() (row bind and column bind) achieve this.
vec1 <- 1:4
vec2 <- 5:8
vec3 <- 9:12
colMat <- cbind(vec1, vec2, vec3)
colMat
## vec1 vec2 vec3
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
rowMat <- rbind(vec1, vec2, vec3)
rowMat
## [,1] [,2] [,3] [,4]
## vec1 1 2 3 4
## vec2 5 6 7 8
## vec3 9 10 11 12
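The two results hold the same values in transposed layouts. A quick sketch that checks the dimensions and confirms the relationship with t():

```r
vec1 <- 1:4
vec2 <- 5:8
vec3 <- 9:12
colMat <- cbind(vec1, vec2, vec3)
rowMat <- rbind(vec1, vec2, vec3)

dim(colMat)                # 4 3: each vector became a column
dim(rowMat)                # 3 4: each vector became a row
all(t(colMat) == rowMat)   # TRUE: one is the transpose of the other
```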
There are some special values in R:
Inf: infinity
NaN: “Not a number”
a <- Inf
b <- 0
rslt <- c(b/a, a/a, 1/b)
rslt
## [1] 0 NaN Inf
NaN: stands for “Not a Number” and is a missing value produced by numerical computation
NA: stands for “Not Available” and is used when a value is missing
a <- c(1,2)
a[3]
## [1] NA
b <- 0/0
b
## [1] NaN
is.na() and is.nan() are functions that indicate which value(s) of an object are missing
NaN is also considered as NA (the reverse is NOT true)
vec <- c(1, NA, 3, NaN, NA, 5, NaN)
is.na(vec)
## [1] FALSE TRUE FALSE TRUE TRUE FALSE TRUE
is.nan(vec)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
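Missing values propagate through most summaries, so it helps to count them and to know about the na.rm argument. A small sketch using the same vec:

```r
vec <- c(1, NA, 3, NaN, NA, 5, NaN)

sum(is.na(vec))           # 4: NA and NaN are both counted
sum(is.nan(vec))          # 2: only the NaN entries
is.na(mean(vec))          # TRUE: the mean is missing if any value is missing
mean(vec, na.rm = TRUE)   # 3: NA and NaN are dropped before averaging
```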
read.table(), read.csv(): to read tabular data
readLines(): to read lines of a text file
source(), dget(): to read R code
load(): to read saved workspaces and R data files (.Rda)
read.table() and read.csv() will be covered today.
read.table() is the most commonly used function to read data in R
Type ?read.table in your R console to see the important arguments in the function
read.csv() is intended for reading comma separated value files
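To see how the two functions relate, the sketch below writes a tiny comma separated file to a temporary location and reads it both ways. The file contents and the names tmp, viaTable and viaCsv are invented for illustration; read.csv() is essentially read.table() with header = TRUE and sep = "," already set.

```r
# Create a small throwaway CSV file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1.5", "b,2.5"), tmp)

# read.table() needs the separator and header spelled out...
viaTable <- read.table(file = tmp, sep = ",", header = TRUE)
# ...while read.csv() uses them as defaults.
viaCsv <- read.csv(file = tmp)

all.equal(viaTable, viaCsv)
unlink(tmp)   # clean up the temporary file
```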
We will be using the data frame you just read in throughout the day. It contains the following attributes:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation, a factor with levels
  bc: Blue Collar
  prof: Professional, Managerial, and Technical
  wc: White Collar
Make sure your working directory is the IDA-with-R-master directory!
Type getwd() into your R console; the result should look like "~/IDA-with-R-master"
here::here() builds a file path relative to the project root, which tells R where to look to find your data
path <- here::here("data", "prestige.csv")
path # example path from Chris' machine
## [1] "/Users/cmg/dev/UCIDataScienceInitiative/IDA-with-R/data/prestige.csv"
prestige <- read.table(file = path,
sep = ",",
header = TRUE,
row.names = 1)
class(prestige) # gives object type
## [1] "data.frame"
head(prestige) # look at the first 6 rows, equivalent to prestige[1:6, ]
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
dim(prestige) # (nrow, ncol)
## [1] 102 6
The prestige data frame has named rows.
Use row.names() to see the names of each row.
head(row.names(prestige)) # equivalent to row.names(prestige)[1:6]
## [1] "gov.administrators" "general.managers" "accountants"
## [4] "purchasing.officers" "chemists" "physicists"
vec <- 1:10
vec[3]
## [1] 3
Single brackets allow us to select more than one element of an object
vec[1:3]
## [1] 1 2 3
vec[c(2,4,6)]
## [1] 2 4 6
prestige[1:2, ] # get the first 2 rows
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
prestige[1, 2] # get the first row, second column
## [1] 12351
head(prestige$income) # select the first 6 values of the income column (the second column)
## [1] 12351 25879 9271 8865 8403 11030
prestige$income[1] # get the first element of the income vector
## [1] 12351
prestige[1, 2] # get the first row, second column
## [1] 12351
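Several subsetting forms reach the same cell. The sketch below uses a small made-up data frame, toy, so it runs without the prestige file, and also shows that selecting a single column returns a plain vector unless drop = FALSE is given.

```r
# Hypothetical data frame with named rows, mimicking prestige's layout.
toy <- data.frame(income = c(100, 200, 300),
                  women  = c(10, 20, 30),
                  row.names = c("a", "b", "c"))

toy[2, "income"]       # 200: row position, column name
toy$income[2]          # 200: extract the column, then index it
toy[["income"]][2]     # 200: double brackets also extract the column
toy["b", "income"]     # 200: row name, column name

class(toy[, "income"])                 # "numeric": dropped to a vector
class(toy[, "income", drop = FALSE])   # "data.frame": kept as a data frame
```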
A logical index vector should have length n, the length of the vector you wish to subset
Use is.na() as an index vector to subset rows containing NAs
logIndVec <- is.na(prestige$type)
head(logIndVec)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
length(logIndVec) # same as the number of rows in prestige
## [1] 102
prestige[logIndVec,]
## education income women prestige census type
## athletes 11.44 8206 8.13 54.1 3373 <NA>
## newsboys 9.62 918 7.00 14.8 5143 <NA>
## babysitters 9.46 611 96.53 25.9 6147 <NA>
## farmers 6.84 3643 3.60 44.1 7112 <NA>
Use which() along with is.na() to create this index vector
posIndVec <- which(is.na(prestige$type))
posIndVec # indices of missing elements of type vector
## [1] 34 53 63 67
prestige[posIndVec,]
## education income women prestige census type
## athletes 11.44 8206 8.13 54.1 3373 <NA>
## newsboys 9.62 918 7.00 14.8 5143 <NA>
## babysitters 9.46 611 96.53 25.9 6147 <NA>
## farmers 6.84 3643 3.60 44.1 7112 <NA>
which(logIndVec == TRUE)
## [1] 34 53 63 67
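Logical and positional index vectors select the same elements; they differ only in how the selection is encoded. A minimal sketch on a made-up vector x:

```r
x <- c(5, NA, 7, NA)

isMissing <- is.na(x)   # logical index: one TRUE/FALSE per element
isMissing               # FALSE TRUE FALSE TRUE
which(isMissing)        # 2 4: positions of the TRUE entries

# Both forms pick out the same elements.
identical(x[isMissing], x[which(isMissing)])   # TRUE

# Negating the logical index keeps the non-missing values.
x[!isMissing]           # 5 7
```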
prestige[1:5, c("education", "income")]
## education income
## gov.administrators 13.11 12351
## general.managers 12.26 25879
## accountants 12.77 9271
## purchasing.officers 11.42 8865
## chemists 14.62 8403
prestige[1:5, 1:2]
## education income
## gov.administrators 13.11 12351
## general.managers 12.26 25879
## accountants 12.77 9271
## purchasing.officers 11.42 8865
## chemists 14.62 8403
Use the order() function along with the subsetting operators to sort a data frame by a specific column.
Example: sort by education.
sort.index <- order(prestige$education, decreasing = FALSE)
sort.index[1:5]
## [1] 84 92 87 74 75
prestige.sorted <- prestige[sort.index, ]
prestige.sorted[1:5, ]
## education income women prestige census type
## sewing.mach.operators 6.38 2847 90.67 28.2 8563 bc
## masons 6.60 5959 0.52 36.2 8782 bc
## railway.sectionmen 6.67 4696 0.00 27.3 8715 bc
## textile.weavers 6.69 4443 31.36 33.3 8267 bc
## textile.labourers 6.74 3485 39.48 28.8 8278 bc
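The same pattern extends to descending order and to tie-breaking on a second column. The sketch below uses a small invented data frame, toy, rather than prestige; negating a numeric column inside order() is one common way to sort that column in descending order.

```r
# Hypothetical data with a tie in education (rows a and c).
toy <- data.frame(education = c(12, 8, 12, 10),
                  income    = c(300, 100, 500, 200),
                  row.names = c("a", "b", "c", "d"))

# Highest education first.
desc.index <- order(toy$education, decreasing = TRUE)
toy$education[desc.index]         # 12 12 10 8

# Education descending, ties broken by income ascending.
multi.index <- order(-toy$education, toy$income)
rownames(toy[multi.index, ])      # "a" "c" "d" "b"
```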
Replace missing values in the type column with "bc"
First, find the rows of prestige that contain NA’s
ind <- which(is.na(prestige$type))
prestige[ind,]
## education income women prestige census type
## athletes 11.44 8206 8.13 54.1 3373 <NA>
## newsboys 9.62 918 7.00 14.8 5143 <NA>
## babysitters 9.46 611 96.53 25.9 6147 <NA>
## farmers 6.84 3643 3.60 44.1 7112 <NA>
rbind(index=ind, name=rownames(prestige)[ind])
## [,1] [,2] [,3] [,4]
## index "34" "53" "63" "67"
## name "athletes" "newsboys" "babysitters" "farmers"
Replace the last three NA’s with "bc" (blue collar)
ind.ch <- ind[2:4]
prestige[ind.ch, "type"] <- rep("bc", 3)
summary(prestige$type)
## bc prof wc NA's
## 47 31 23 1
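The replacement above works smoothly because "bc" is already a level of type. The summary output suggests type was read in as a factor; that is an assumption here, since from R 4.0.0 onward read.table() returns character columns unless stringsAsFactors = TRUE is passed. The sketch below, on a made-up factor occ, shows what happens when the new value is not an existing level.

```r
occ <- factor(c("bc", NA, "wc", NA))

occ[2] <- "bc"     # fine: "bc" is an existing level

# Assigning a value that is not a level yields NA (plus a warning).
occ2 <- occ
suppressWarnings(occ2[4] <- "prof")
is.na(occ2[4])     # TRUE

# Add the level first, then the assignment works.
levels(occ) <- c(levels(occ), "prof")
occ[4] <- "prof"
as.character(occ)  # "bc" "bc" "wc" "prof"
```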
Remove the remaining NA (one row for athletes)
prestige <- na.omit(prestige)
summary(prestige$type)
## bc prof wc
## 47 31 23
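Keep in mind that na.omit() drops every row with a missing value in any column, not just in type. A small sketch on an invented data frame, toy, contrasting it with filtering on a single column:

```r
toy <- data.frame(score = c(1, 2, NA, 4),
                  type  = factor(c("bc", NA, "wc", "bc")))

complete.cases(toy)          # TRUE FALSE FALSE TRUE

cleaned <- na.omit(toy)      # drops rows 2 and 3
nrow(cleaned)                # 2

# Drop only the rows where type is missing; row 3 survives.
typeOnly <- toy[!is.na(toy$type), ]
nrow(typeOnly)               # 3
```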
Use write.table() to write a data frame to file
The arguments are similar to those of read.table(), but we now also specify the name of the data frame in addition to the path
write.csv is analogous to read.csv
write.table(prestige,
file = here::here("data", "prestige_v2.csv"),
sep = ",",
col.names = TRUE,
row.names = TRUE)
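A quick way to check a written file is to read it back. The sketch below does a round trip through a temporary file using a small made-up data frame, toy, with the same arguments as above on the write side and row.names = 1 on the read side.

```r
toy <- data.frame(x = c(1.5, 2.5),
                  y = c("u", "v"),
                  row.names = c("r1", "r2"))

out <- tempfile(fileext = ".csv")
write.table(toy,
            file = out,
            sep = ",",
            col.names = TRUE,
            row.names = TRUE)

# Read it back, using the first column as row names.
back <- read.table(file = out, sep = ",", header = TRUE, row.names = 1)
back
unlink(out)   # remove the temporary file
```

With these settings the header line has one field fewer than the data lines, because the row-name column gets no header. Passing col.names = NA to write.table() writes a blank header for that column, which is the layout spreadsheet programs expect.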