April 20, 2018
Statistical Distributions (very brief)
T-Tests
Linear Regression
d<dist>(): evaluates the probability density/mass function at a given value
r<dist>(): generates random numbers
p<dist>(): returns the cumulative distribution function (CDF) for a given quantile
q<dist>(): returns the quantile for a given probability
str(dnorm) # normal pdf
## function (x, mean = 0, sd = 1, log = FALSE)
dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423
x <- seq(from = -3, to = 3, by = 0.05)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l", ylab = "Density")
str(rnorm) # generate random number from normal dist
## function (n, mean = 0, sd = 1)
rnorm(10, mean = 0, sd = 1)
## [1] -0.06119123 1.35703065 -0.42420890 -0.48579143 -1.34478334
## [6] -0.01492377 1.01789886 1.66025678 1.31331140 1.55496600
str(pnorm) # normal CDF
## function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
pnorm(0, mean = 0, sd = 1) # Pr[X <= 0] = ?
## [1] 0.5
str(qnorm) # normal quantile func
## function (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(0.975, mean = 0, sd = 1) # PR[X <= ?] = 0.975
## [1] 1.959964
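The same d/p/q/r naming convention holds for the other distributions R provides; a minimal sketch with the binomial distribution (parameters chosen purely for illustration):
dbinom(5, size = 10, prob = 0.5)   # Pr[X = 5] for Binomial(10, 0.5)
pbinom(5, size = 10, prob = 0.5)   # Pr[X <= 5]
qbinom(0.5, size = 10, prob = 0.5) # median
rbinom(3, size = 10, prob = 0.5)   # three random draws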
T-tests can be used to draw statistical conclusions about parameters of interest in the data, for example:
Is the mean of this data different from zero (or another number)?
Are the means of two data sets different from one another?
Is the regression slope coefficient different from zero?
T-tests can be categorized into two groups:
One-sample t-test
Two-sample t-test
set.seed(123)
oneSampData <- rnorm(100, mean = 0, sd = 1)
mean(oneSampData)
## [1] 0.09040591
sd(oneSampData)
## [1] 0.9128159
oneSampTest.0 <- t.test(oneSampData)
oneSampTest.0
##
## One Sample t-test
##
## data: oneSampData
## t = 0.99041, df = 99, p-value = 0.3244
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.09071657 0.27152838
## sample estimates:
## mean of x
## 0.09040591
names(oneSampTest.0)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
oneSampTest.0$statistic
## t
## 0.9904068
oneSampTest.0$estimate
## mean of x
## 0.09040591
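The other listed components can be extracted the same way, e.g., the p-value and confidence interval:
oneSampTest.0$p.value  # the p-value shown in the printout above
oneSampTest.0$conf.int # 95 percent confidence interval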
a <- 0.3
oneSampTest.mu <- t.test(oneSampData, mu = a)
oneSampTest.mu
##
## One Sample t-test
##
## data: oneSampData
## t = -2.2961, df = 99, p-value = 0.02378
## alternative hypothesis: true mean is not equal to 0.3
## 95 percent confidence interval:
## -0.09071657 0.27152838
## sample estimates:
## mean of x
## 0.09040591
Two-sample t-tests are categorized into three groups:
T-Test with equal variances
T-Test with unequal variances
Paired T-Test (a one-sample t-test on differences)
Samp1 <- rnorm(300, mean = 2.5, sd = 1)
Samp2 <- rnorm(500, mean = 3.0, sd = 1) # notice: not the same sample size
plot(density(Samp1), col="red", main="Densities of Samp1 and Samp2", xlab="")
abline(v = mean(Samp1), col = "red", lwd = 2, lty=2)
lines(density(Samp2), col="blue")
abline(v = mean(Samp2), col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("Samp1", "Samp2"),
fill = c("red","blue"), bty = "n", cex = 1.3)
Null hypothesis: \(\mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0\)
t.test(Samp1, Samp2) # default assump: unequal variances
##
## Welch Two Sample t-test
##
## data: Samp1 and Samp2
## t = -7.3058, df = 638.78, p-value = 8.249e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6730166 -0.3878678
## sample estimates:
## mean of x mean of y
## 2.492232 3.022674
Null hypothesis: \(\mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0\)
t.test(Samp1, Samp2, var.equal = TRUE) # now assume equal variances
##
## Two Sample t-test
##
## data: Samp1 and Samp2
## t = -7.2723, df = 798, p-value = 8.447e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6736199 -0.3872646
## sample estimates:
## mean of x mean of y
## 2.492232 3.022674
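As a sanity check, the pooled (equal-variance) t statistic can be computed by hand; a minimal sketch:
n1 <- length(Samp1); n2 <- length(Samp2)
sp2 <- ((n1 - 1) * var(Samp1) + (n2 - 1) * var(Samp2)) / (n1 + n2 - 2) # pooled variance
(mean(Samp1) - mean(Samp2)) / sqrt(sp2 * (1/n1 + 1/n2)) # matches the t statistic above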
Let \(D \equiv \{x_i - y_i : i=1, \ldots, n \}\).
Null hypothesis: \(\mu_D = 0\)
t.test(Samp1, Samp2[1:300], paired = TRUE) # must be of the same sample size
##
## Paired t-test
##
## data: Samp1 and Samp2[1:300]
## t = -5.7581, df = 299, p-value = 2.109e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6428004 -0.3153406
## sample estimates:
## mean of the differences
## -0.4790705
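Since the paired t-test is just a one-sample t-test on the differences, the same result can be obtained with:
t.test(Samp1 - Samp2[1:300]) # identical t, df, and p-value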
Approach to model the relationship between a response variable \(Y\) and one or more predictor variables \(X\).
\(Y\) is a noisy linear combination of \(X\): \[Y = X \beta + \epsilon\]
Linear regression is equivalent to modeling the mean of \(Y\); namely, \(\mu = E(Y) = X\beta\).
There are multiple ways to estimate \(\beta\), but we will focus on the most basic method: ordinary least squares (OLS) estimation.
We use lm() to fit linear regression models, which requires us to pass in a regression formula.
We don't need to create the matrix \(X\) ourselves (R does this for us); we just tell lm() which predictors to include.
Some notes on formula:
The response goes on the left of the ~, and the predictors go on the right, separated by + symbols (e.g., y ~ x + z).
A . on the right-hand side stands for all remaining variables in the data (e.g., y ~ .).
To fit a model without an intercept, use y ~ 0 + . or y ~ . - 1
To regress y on x and x^2, we simply use y ~ x + I(x^2)
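Formulas are ordinary R objects that can be built and inspected before fitting; a small sketch (using variable names from the prestige data loaded below):
f <- prestige ~ education + I(education^2) # store a formula for later use
class(f)    # "formula"
all.vars(f) # variables the formula references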
We will fit a linear model using the prestige data. Recall that we have the following variables:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation, a factor with levels
bc: Blue Collar
prof: Professional, Managerial, and Technical
wc: White Collar
prestige <- read.csv(file = here::here("data", "prestige_v2.csv"),
                     row.names=1)
str(prestige)
## 'data.frame': 101 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
head(prestige)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
library(car)
scatterplotMatrix(prestige[,c("prestige","education","income","women")])
myReg <- lm(prestige ~ education + income + women, data = prestige)
myReg
##
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
##
## Coefficients:
## (Intercept) education income women
## -6.801684 4.182482 0.001315 -0.008304
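These coefficients are the ordinary least squares estimates \(\hat{\beta} = (X^\top X)^{-1} X^\top Y\); a quick sketch verifying them by hand:
X <- model.matrix(~ education + income + women, data = prestige) # R builds this internally
solve(t(X) %*% X, t(X) %*% prestige$prestige) # matches the coefficients above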
names(myReg)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(myReg)
##
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.7831 -5.4102 -0.1204 5.1821 17.5318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.8016842 3.2543983 -2.090 0.0392 *
## education 4.1824817 0.3907842 10.703 < 2e-16 ***
## income 0.0013153 0.0002791 4.712 8.2e-06 ***
## women -0.0083038 0.0306187 -0.271 0.7868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.883 on 97 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7917
## F-statistic: 127.7 on 3 and 97 DF, p-value: < 2.2e-16
myReg.summary <- summary(myReg)
names(myReg.summary) # show different contents
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
names(myReg) # this is what we had previously
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
We can compute a confidence interval for a single coefficient, such as income:
confint(myReg, 'income', level=0.95)
## 2.5 % 97.5 %
## income 0.000761253 0.001869315
Or for all coefficients at once:
confint(myReg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -13.260764035 -0.342604455
## education 3.406883236 4.958080173
## income 0.000761253 0.001869315
## women -0.069073423 0.052465913
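The same interval can be built by hand from the coefficient table, as estimate \(\pm\) \(t_{0.975, df}\) times the standard error:
est <- coef(myReg)["income"]
se  <- coef(summary(myReg))["income", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = myReg$df.residual) * se # matches confint() above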
Recall that type of occupation has a relationship with prestige score. So let's add the predictor type into our regression model:
mod <- update(myReg, ~ . + type)
summary(mod)
##
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1126 -4.9844 0.0942 5.1318 19.4630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7378124 5.5112850 0.497 0.6205
## education 3.1246575 0.6644750 4.702 8.70e-06 ***
## income 0.0011764 0.0002702 4.355 3.36e-05 ***
## women 0.0050863 0.0308647 0.165 0.8695
## typeprof 8.5502774 4.0724806 2.100 0.0384 *
## typewc -1.1457779 2.7280276 -0.420 0.6754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared: 0.8192, Adjusted R-squared: 0.8097
## F-statistic: 86.1 on 5 and 95 DF, p-value: < 2.2e-16
Suppose we want to test which of two models is better when one model is nested within the other (i.e., one model contains all the terms of the other plus at least one additional term).
formula(myReg)
## prestige ~ education + income + women
## <environment: 0x107231590>
formula(mod)
## prestige ~ education + income + women + type
## <environment: 0x107231590>
We say that myReg is nested within mod, or equivalently that myReg is the reduced model and mod is the full model.
Hypothesis Test:
Null: Reduced model fits as well as the full model (formally, coefficients for the omitted terms are all zero).
Alternative: Full model fits better than the reduced model (formally, at least one omitted coefficient is non-zero).
anova(myReg, mod)
## Analysis of Variance Table
##
## Model 1: prestige ~ education + income + women
## Model 2: prestige ~ education + income + women + type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 97 6028.2
## 2 95 5394.6 2 633.6 5.5789 0.005119 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So we reject the null hypothesis and conclude that at least one of the levels of type of occupation contributes information about the prestige score.
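The partial F statistic can also be recovered by hand from the printed residual sums of squares:
((6028.2 - 5394.6) / 2) / (5394.6 / 95) # F = 5.58, as in the table above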
Say we want to change the reference level for the covariate type so that the model's intercept corresponds to professionals. We simply relevel the variable type and re-fit the model.
levels(prestige$type)
## [1] "bc" "prof" "wc"
prestige$type <- relevel(prestige$type, "prof")
levels(prestige$type)
## [1] "prof" "bc" "wc"
Note: This does not change the results of the model (e.g., predictions, F-statistics) but does change the interpretation and p-values of the type coefficients.
mod <- update(myReg, ~ . + type)
summary(mod)
##
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1126 -4.9844 0.0942 5.1318 19.4630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2880898 9.1659054 1.232 0.22116
## education 3.1246575 0.6644750 4.702 8.70e-06 ***
## income 0.0011764 0.0002702 4.355 3.36e-05 ***
## women 0.0050863 0.0308647 0.165 0.86946
## typebc -8.5502774 4.0724806 -2.100 0.03842 *
## typewc -9.6960553 2.9391152 -3.299 0.00137 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared: 0.8192, Adjusted R-squared: 0.8097
## F-statistic: 86.1 on 5 and 95 DF, p-value: < 2.2e-16
Automated algorithms for variable selection exist, but they should be used with caution.
It is better to start with an a priori set of variables you need to include and then compare models, especially when the goal is association rather than prediction.
Stepwise regression: start with a model and add (or remove) predictors until some criterion is met
null <- lm(prestige ~ 1, data=prestige) # most basic model, intercept only
full <- lm(prestige ~ education + income + women + type, data=prestige) # saturated model, all predictors
step(null, scope=list(lower=null, upper=full), direction='forward')
## Start: AIC=576.54
## prestige ~ 1
##
## Df Sum of Sq RSS AIC
## + education 1 21567.5 8274.6 448.99
## + type 2 20685.1 9157.0 461.22
## + income 1 15236.2 14605.9 506.38
## <none> 29842.1 576.54
## + women 1 400.9 29441.1 577.18
##
## Step: AIC=448.99
## prestige ~ education
##
## Df Sum of Sq RSS AIC
## + income 1 2241.78 6032.8 419.07
## + type 2 1428.45 6846.1 433.85
## + women 1 866.64 7407.9 439.81
## <none> 8274.6 448.99
##
## Step: AIC=419.07
## prestige ~ education + income
##
## Df Sum of Sq RSS AIC
## + type 2 636.63 5396.2 411.81
## <none> 6032.8 419.07
## + women 1 4.57 6028.2 421.00
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## + women 1 1.5421 5394.6 413.78
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
step(full, scope=list(lower=null, upper=full), direction='backward')
## Start: AIC=413.78
## prestige ~ education + income + women + type
##
## Df Sum of Sq RSS AIC
## - women 1 1.54 5396.2 411.81
## <none> 5394.6 413.78
## - type 2 633.60 6028.2 421.00
## - income 1 1076.80 6471.4 430.16
## - education 1 1255.70 6650.3 432.92
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## - type 2 636.63 6032.8 419.07
## - education 1 1276.30 6672.5 431.25
## - income 1 1449.96 6846.1 433.85
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
step(null, scope=list(lower=null, upper=full), direction='both')
## Start: AIC=576.54
## prestige ~ 1
##
## Df Sum of Sq RSS AIC
## + education 1 21567.5 8274.6 448.99
## + type 2 20685.1 9157.0 461.22
## + income 1 15236.2 14605.9 506.38
## <none> 29842.1 576.54
## + women 1 400.9 29441.1 577.18
##
## Step: AIC=448.99
## prestige ~ education
##
## Df Sum of Sq RSS AIC
## + income 1 2241.8 6032.8 419.07
## + type 2 1428.5 6846.1 433.85
## + women 1 866.6 7407.9 439.81
## <none> 8274.6 448.99
## - education 1 21567.5 29842.1 576.54
##
## Step: AIC=419.07
## prestige ~ education + income
##
## Df Sum of Sq RSS AIC
## + type 2 636.6 5396.2 411.81
## <none> 6032.8 419.07
## + women 1 4.6 6028.2 421.00
## - income 1 2241.8 8274.6 448.99
## - education 1 8573.1 14605.9 506.38
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## + women 1 1.54 5394.6 413.78
## - type 2 636.63 6032.8 419.07
## - education 1 1276.30 6672.5 431.25
## - income 1 1449.96 6846.1 433.85
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
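If we just want the selected model as an object, without the step-by-step printout, step() accepts trace = 0; a small convenience sketch:
selected <- step(null, scope=list(lower=null, upper=full), direction='both', trace=0)
formula(selected) # prestige ~ education + income + type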
All subsets regression: test all possible subsets of the predictors
Best subsets regression: similar to all subsets, but only look at the best model of each size
Multiple packages can do this, but we will use leaps
if (!require("leaps")) install.packages("leaps") # install leaps if it isn't already
library(leaps)
bestSubsets <- regsubsets(prestige ~ education + income + women + type, data=prestige, nbest=1)
summary(bestSubsets)
## Subset selection object
## Call: .f(.x[[i]], ...)
## 5 Variables (and intercept)
## Forced in Forced out
## education FALSE FALSE
## income FALSE FALSE
## women FALSE FALSE
## typebc FALSE FALSE
## typewc FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## education income women typebc typewc
## 1 ( 1 ) "*" " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " "
## 3 ( 1 ) "*" "*" " " " " "*"
## 4 ( 1 ) "*" "*" " " "*" "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
par(mfrow=c(1,2))
plot(bestSubsets, scale="adjr2"); plot(bestSubsets, scale="Cp")
Note: The plot on the right is Mallows' \(C_p\), a statistic used to compare models that accounts for the number of predictors in the model (we want it to be roughly equal to the number of predictors included in the model).
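The numbers behind these plots can also be extracted from the summary object, which stores (among other components) adjr2 and cp:
ss <- summary(bestSubsets)
which.max(ss$adjr2) # size of the model with the highest adjusted R-squared
ss$cp               # Mallows' Cp for each model size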
We can check the model assumptions with R's built-in diagnostic plots:
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(myReg) # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
Finally, we can predict the response for a new input, with a prediction interval for a single new observation:
newData <- data.frame(education=13.2, income=12000, women=12)
predict(myReg, newData, interval="prediction")
## fit lwr upr
## 1 64.09084 48.2459 79.93578
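For comparison, interval="confidence" gives the narrower interval for the mean response at the same inputs, rather than for a single new observation:
predict(myReg, newData, interval="confidence") # interval for E[Y | newData]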