Intro to Data Analysis with R: Session 3

UCI Data Science Initiative

April 20, 2018

Session 3 Agenda

  1. Statistical Distributions (very brief)

  2. T-Tests

  3. Linear Regression

    • Fitting Models
    • Interpretation
    • Comparing Models & Variable Selection
    • Diagnostics
    • Prediction

Statistical Distributions in R:

Standard Normal Distribution

str(dnorm) # normal pdf
## function (x, mean = 0, sd = 1, log = FALSE)
dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423

Standard Normal Distribution

x <- seq(from = -3, to = 3, by = 0.05)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l", ylab = "Density")

Standard Normal Distribution

str(rnorm) # generate random numbers from the normal dist
## function (n, mean = 0, sd = 1)
rnorm(10, mean = 0, sd = 1)
##  [1] -0.06119123  1.35703065 -0.42420890 -0.48579143 -1.34478334
##  [6] -0.01492377  1.01789886  1.66025678  1.31331140  1.55496600

Standard Normal Distribution

str(pnorm) # normal CDF
## function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
pnorm(0, mean = 0, sd = 1) # Pr[X <= 0] = ?
## [1] 0.5

Standard Normal Distribution

str(qnorm) # normal quantile func
## function (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(0.975, mean = 0, sd = 1) # Pr[X <= ?] = 0.975
## [1] 1.959964
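The same four prefixes (d, p, q, r) work for every distribution in R. As a quick sanity check, pnorm and qnorm are inverses of each other, and we can recover the familiar "68% within one sd" rule:

```r
# pnorm and qnorm are inverse functions
round(pnorm(qnorm(0.975)), 3)   # recover the original probability
## [1] 0.975

# ~68% of a standard normal lies within one sd of the mean
pnorm(1) - pnorm(-1)
## [1] 0.6826895
```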

T-Tests

T-tests let us draw statistical conclusions about a parameter of interest — typically a population mean, or the difference between two means

T-tests can be categorized into two groups:

  1. One-sample t-test

  2. Two-sample t-test

One-Sample T-Test (Create Data)

set.seed(123)
oneSampData <- rnorm(100, mean = 0, sd = 1)
mean(oneSampData)
## [1] 0.09040591
sd(oneSampData)
## [1] 0.9128159

One-Sample T-Test (\(H_0: \mu = 0\))

oneSampTest.0 <- t.test(oneSampData) 
oneSampTest.0
## 
##  One Sample t-test
## 
## data:  oneSampData
## t = 0.99041, df = 99, p-value = 0.3244
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.09071657  0.27152838
## sample estimates:
##  mean of x 
## 0.09040591
names(oneSampTest.0) 
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"
oneSampTest.0$statistic
##         t 
## 0.9904068
oneSampTest.0$estimate
##  mean of x 
## 0.09040591
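The pieces stored in the t.test object can be reproduced by hand. As a sketch, the t statistic is just the sample mean divided by its standard error (using the same seed as above):

```r
# Reproduce the t statistic by hand: t = xbar / (s / sqrt(n))
set.seed(123)                       # same seed as the slide above
x <- rnorm(100, mean = 0, sd = 1)
tstat <- mean(x) / (sd(x) / sqrt(length(x)))
tstat
## [1] 0.9904068
```

This matches oneSampTest.0$statistic.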

One-Sample T-Test (\(H_0: \mu = a\))

a <- 0.3
oneSampTest.mu <- t.test(oneSampData, mu = a)
oneSampTest.mu
## 
##  One Sample t-test
## 
## data:  oneSampData
## t = -2.2961, df = 99, p-value = 0.02378
## alternative hypothesis: true mean is not equal to 0.3
## 95 percent confidence interval:
##  -0.09071657  0.27152838
## sample estimates:
##  mean of x 
## 0.09040591

Two-Sample T-Test

Two-sample t-tests are categorized into 3 groups:

  1. Two-sample t-test with unequal variances (Welch's t-test)

  2. Two-sample t-test with equal variances (pooled)

  3. Paired t-test

Two-Sample T-Test (Create & Plot Data)

Samp1 <- rnorm(300, mean = 2.5, sd = 1)
Samp2 <- rnorm(500, mean = 3.0, sd = 1) # notice: not the same sample size
plot(density(Samp1), col="red", main="Densities of Samp1 and Samp2", xlab="")
abline(v = mean(Samp1), col = "red", lwd = 2, lty=2)
lines(density(Samp2), col="blue")
abline(v = mean(Samp2), col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("Samp1", "Samp2"),
       fill = c("red","blue"), bty = "n", cex = 1.3)

Two-Sample T-Test (Unequal Variances)

Null hypothesis: \(\mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0\)

t.test(Samp1, Samp2)  # default assump: unequal variances
## 
##  Welch Two Sample t-test
## 
## data:  Samp1 and Samp2
## t = -7.3058, df = 638.78, p-value = 8.249e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6730166 -0.3878678
## sample estimates:
## mean of x mean of y 
##  2.492232  3.022674

Two-Sample T-Test (Equal Variances)

Null hypothesis: \(\mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0\)

t.test(Samp1, Samp2, var.equal = TRUE)  # now assume equal variances
## 
##  Two Sample t-test
## 
## data:  Samp1 and Samp2
## t = -7.2723, df = 798, p-value = 8.447e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6736199 -0.3872646
## sample estimates:
## mean of x mean of y 
##  2.492232  3.022674

Two-Sample T-Test (Paired T-Test)

Let \(D \equiv \{x_i - y_i : i=1, \ldots, n \}\).

Null hypothesis: \(\mu_D = 0\)

t.test(Samp1, Samp2[1:300], paired = TRUE) # must be of the same sample size
## 
##  Paired t-test
## 
## data:  Samp1 and Samp2[1:300]
## t = -5.7581, df = 299, p-value = 2.109e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6428004 -0.3153406
## sample estimates:
## mean of the differences 
##              -0.4790705
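A paired t-test is equivalent to a one-sample t-test on the within-pair differences — a useful check. (Illustrated here with freshly simulated, seeded data, since Samp1 and Samp2 above were drawn without a seed.)

```r
set.seed(42)
a <- rnorm(30, mean = 2.5, sd = 1)
b <- rnorm(30, mean = 3.0, sd = 1)

paired  <- t.test(a, b, paired = TRUE)
oneSamp <- t.test(a - b)           # one-sample test on the differences

all.equal(paired$p.value, oneSamp$p.value)
## [1] TRUE
```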

Linear Regression

Linear Regression in R

Linear Regression - Prestige Data

We will fit a linear model using the prestige data. Recall that we have the following variables: education, income, women, prestige, census, and type (a factor with levels bc, prof, and wc).

Load Data

prestige <- read.csv(file = here::here("data", "prestige_v2.csv"), 
                     row.names=1)
str(prestige)
## 'data.frame':    101 obs. of  6 variables:
##  $ education: num  13.1 12.3 12.8 11.4 14.6 ...
##  $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
##  $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
##  $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
##  $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
##  $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
head(prestige)
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof

Another look at the scatterplot matrix

library(car)
scatterplotMatrix(prestige[,c("prestige","education","income","women")])

Linear Regression - Fit the Model

myReg <- lm(prestige ~ education + income + women, data = prestige)
myReg
## 
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income        women  
##   -6.801684     4.182482     0.001315    -0.008304
names(myReg)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Linear Regression - Summary of Fit

summary(myReg)
## 
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7831  -5.4102  -0.1204   5.1821  17.5318 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.8016842  3.2543983  -2.090   0.0392 *  
## education    4.1824817  0.3907842  10.703  < 2e-16 ***
## income       0.0013153  0.0002791   4.712  8.2e-06 ***
## women       -0.0083038  0.0306187  -0.271   0.7868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.883 on 97 degrees of freedom
## Multiple R-squared:  0.798,  Adjusted R-squared:  0.7917 
## F-statistic: 127.7 on 3 and 97 DF,  p-value: < 2.2e-16

Linear Regression - Summary Contents

myReg.summary <- summary(myReg)
names(myReg.summary) # show different contents
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"
names(myReg) # this is what we had previously
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Linear Regression - Confidence Intervals

confint(myReg, 'income', level=0.95)
##              2.5 %      97.5 %
## income 0.000761253 0.001869315
confint(myReg, level=0.95)
##                     2.5 %       97.5 %
## (Intercept) -13.260764035 -0.342604455
## education     3.406883236  4.958080173
## income        0.000761253  0.001869315
## women        -0.069073423  0.052465913
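confint simply combines the coefficient estimates with t critical values. A hand-rolled version is a good way to see this — sketched here on the built-in mtcars data so it runs without the prestige CSV:

```r
fit  <- lm(mpg ~ wt + hp, data = mtcars)
est  <- coef(summary(fit))[, "Estimate"]
se   <- coef(summary(fit))[, "Std. Error"]
crit <- qt(0.975, df = df.residual(fit))   # t critical value for 95% CI

manual <- cbind(est - crit * se, est + crit * se)
all.equal(unname(manual), unname(confint(fit, level = 0.95)))
## [1] TRUE
```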

Linear Regression - Adding Variables

Recall that type of occupation has a relationship with prestige score.

So let’s add the predictor type into our regression model:

mod <- update(myReg, ~ . + type)
summary(mod)
## 
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1126  -4.9844   0.0942   5.1318  19.4630 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.7378124  5.5112850   0.497   0.6205    
## education    3.1246575  0.6644750   4.702 8.70e-06 ***
## income       0.0011764  0.0002702   4.355 3.36e-05 ***
## women        0.0050863  0.0308647   0.165   0.8695    
## typeprof     8.5502774  4.0724806   2.100   0.0384 *  
## typewc      -1.1457779  2.7280276  -0.420   0.6754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared:  0.8192, Adjusted R-squared:  0.8097 
## F-statistic:  86.1 on 5 and 95 DF,  p-value: < 2.2e-16

Linear Regression - Comparing Models

Suppose we want to test which of our two models is better when one model is nested within the other (i.e., the terms of one model are a subset of the terms of the other).

formula(myReg)
## prestige ~ education + income + women
## <environment: 0x107231590>
formula(mod)
## prestige ~ education + income + women + type
## <environment: 0x107231590>

We say that myReg is nested within mod, or equivalently that myReg is the reduced model and mod is the full model.

Hypothesis Test: \(H_0\): the reduced model fits as well as the full model (the coefficients of the added terms are all 0) vs. \(H_a\): at least one added coefficient is nonzero.

Linear Regression - Comparing Models, contd.

anova(myReg, mod)
## Analysis of Variance Table
## 
## Model 1: prestige ~ education + income + women
## Model 2: prestige ~ education + income + women + type
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1     97 6028.2                                
## 2     95 5394.6  2     633.6 5.5789 0.005119 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So we reject the null hypothesis and conclude that at least one of the levels of type of occupation contributes information about the prestige score.
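The F statistic in the ANOVA table can be computed directly from the two residual sums of squares. A sketch using mtcars (any nested pair of models works the same way):

```r
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp, data = mtcars)

rss_r <- sum(resid(reduced)^2)
rss_f <- sum(resid(full)^2)
df1   <- df.residual(reduced) - df.residual(full)  # extra parameters
df2   <- df.residual(full)

Fstat <- ((rss_r - rss_f) / df1) / (rss_f / df2)   # partial F-test
all.equal(Fstat, anova(reduced, full)$F[2])
## [1] TRUE
```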

Linear Regression - Relevel a Factor

Say we want to change the reference level of the factor type so that the model's intercept corresponds to professionals. We simply relevel the variable type and re-fit the model.

levels(prestige$type)
## [1] "bc"   "prof" "wc"
prestige$type <- relevel(prestige$type, "prof")
levels(prestige$type)
## [1] "prof" "bc"   "wc"

Note: This does not change the results of the model (e.g., predictions, F-statistics) but does change the interpretation and p-values of the type coefficients.
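relevel only changes which level serves as the baseline; the data themselves are untouched. A minimal toy example:

```r
f <- factor(c("bc", "prof", "wc", "bc"))
levels(f)              # "bc" is the reference (first) level
## [1] "bc"   "prof" "wc"

f2 <- relevel(f, ref = "prof")
levels(f2)             # "prof" is now the reference level
## [1] "prof" "bc"   "wc"

identical(as.character(f), as.character(f2))  # values unchanged
## [1] TRUE
```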

Linear Regression - Relevel a Factor, contd.

mod <- update(myReg, ~ . + type)
summary(mod)
## 
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1126  -4.9844   0.0942   5.1318  19.4630 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.2880898  9.1659054   1.232  0.22116    
## education    3.1246575  0.6644750   4.702 8.70e-06 ***
## income       0.0011764  0.0002702   4.355 3.36e-05 ***
## women        0.0050863  0.0308647   0.165  0.86946    
## typebc      -8.5502774  4.0724806  -2.100  0.03842 *  
## typewc      -9.6960553  2.9391152  -3.299  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared:  0.8192, Adjusted R-squared:  0.8097 
## F-statistic:  86.1 on 5 and 95 DF,  p-value: < 2.2e-16

Linear Regression - More on Variable Selection

null <- lm(prestige ~ 1, data=prestige)  # most basic model, intercept only
full <- lm(prestige ~ education + income + women + type, data=prestige)  # saturated model, all predictors
step(null, scope=list(lower=null, upper=full), direction='forward')
## Start:  AIC=576.54
## prestige ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + education  1   21567.5  8274.6 448.99
## + type       2   20685.1  9157.0 461.22
## + income     1   15236.2 14605.9 506.38
## <none>                   29842.1 576.54
## + women      1     400.9 29441.1 577.18
## 
## Step:  AIC=448.99
## prestige ~ education
## 
##          Df Sum of Sq    RSS    AIC
## + income  1   2241.78 6032.8 419.07
## + type    2   1428.45 6846.1 433.85
## + women   1    866.64 7407.9 439.81
## <none>                8274.6 448.99
## 
## Step:  AIC=419.07
## prestige ~ education + income
## 
##         Df Sum of Sq    RSS    AIC
## + type   2    636.63 5396.2 411.81
## <none>               6032.8 419.07
## + women  1      4.57 6028.2 421.00
## 
## Step:  AIC=411.81
## prestige ~ education + income + type
## 
##         Df Sum of Sq    RSS    AIC
## <none>               5396.2 411.81
## + women  1    1.5421 5394.6 413.78
## 
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income       typebc       typewc  
##   11.520005     3.135291     0.001153    -8.646693    -9.655940
step(full, scope=list(lower=null, upper=full), direction='backward')
## Start:  AIC=413.78
## prestige ~ education + income + women + type
## 
##             Df Sum of Sq    RSS    AIC
## - women      1      1.54 5396.2 411.81
## <none>                   5394.6 413.78
## - type       2    633.60 6028.2 421.00
## - income     1   1076.80 6471.4 430.16
## - education  1   1255.70 6650.3 432.92
## 
## Step:  AIC=411.81
## prestige ~ education + income + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   5396.2 411.81
## - type       2    636.63 6032.8 419.07
## - education  1   1276.30 6672.5 431.25
## - income     1   1449.96 6846.1 433.85
## 
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income       typebc       typewc  
##   11.520005     3.135291     0.001153    -8.646693    -9.655940
step(null, scope=list(lower=null, upper=full), direction='both')
## Start:  AIC=576.54
## prestige ~ 1
## 
##             Df Sum of Sq     RSS    AIC
## + education  1   21567.5  8274.6 448.99
## + type       2   20685.1  9157.0 461.22
## + income     1   15236.2 14605.9 506.38
## <none>                   29842.1 576.54
## + women      1     400.9 29441.1 577.18
## 
## Step:  AIC=448.99
## prestige ~ education
## 
##             Df Sum of Sq     RSS    AIC
## + income     1    2241.8  6032.8 419.07
## + type       2    1428.5  6846.1 433.85
## + women      1     866.6  7407.9 439.81
## <none>                    8274.6 448.99
## - education  1   21567.5 29842.1 576.54
## 
## Step:  AIC=419.07
## prestige ~ education + income
## 
##             Df Sum of Sq     RSS    AIC
## + type       2     636.6  5396.2 411.81
## <none>                    6032.8 419.07
## + women      1       4.6  6028.2 421.00
## - income     1    2241.8  8274.6 448.99
## - education  1    8573.1 14605.9 506.38
## 
## Step:  AIC=411.81
## prestige ~ education + income + type
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   5396.2 411.81
## + women      1      1.54 5394.6 413.78
## - type       2    636.63 6032.8 419.07
## - education  1   1276.30 6672.5 431.25
## - income     1   1449.96 6846.1 433.85
## 
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
## 
## Coefficients:
## (Intercept)    education       income       typebc       typewc  
##   11.520005     3.135291     0.001153    -8.646693    -9.655940
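When you only need the final model, trace = 0 suppresses the step-by-step log. A compact sketch on mtcars (predictor names here are just illustrative):

```r
null <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
sel  <- step(null, scope = list(lower = null, upper = full),
             direction = "forward", trace = 0)
formula(sel)   # the selected model; wt is the strongest single predictor
```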

Linear Regression - More on Variable Selection

if (!require("leaps")) install.packages("leaps")  # install leaps if it isn't already
library(leaps)
bestSubsets <- regsubsets(prestige ~ education + income + women + type, data=prestige, nbest=1)
summary(bestSubsets)
## Subset selection object
## Call: .f(.x[[i]], ...)
## 5 Variables  (and intercept)
##           Forced in Forced out
## education     FALSE      FALSE
## income        FALSE      FALSE
## women         FALSE      FALSE
## typebc        FALSE      FALSE
## typewc        FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          education income women typebc typewc
## 1  ( 1 ) "*"       " "    " "   " "    " "   
## 2  ( 1 ) "*"       "*"    " "   " "    " "   
## 3  ( 1 ) "*"       "*"    " "   " "    "*"   
## 4  ( 1 ) "*"       "*"    " "   "*"    "*"   
## 5  ( 1 ) "*"       "*"    "*"   "*"    "*"
par(mfrow=c(1,2))
plot(bestSubsets, scale="adjr2"); plot(bestSubsets, scale="Cp")

Note: The plot on the right uses Mallows' \(C_p\), a statistic for comparing models that penalizes the number of parameters: \(C_p = SSE_p/\hat{\sigma}^2_{full} - n + 2p\), where \(p\) counts the coefficients including the intercept. For a well-fitting submodel we want \(C_p\) roughly equal to \(p\).

Linear Regression - Diagnostics

par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(myReg)
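plot on an lm object produces four diagnostics: residuals vs. fitted, a normal Q-Q plot of standardized residuals, scale-location, and residuals vs. leverage. The underlying quantities are also available directly if you want to inspect them numerically (sketched on mtcars):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

r  <- resid(fit)            # raw residuals (residuals vs. fitted)
rs <- rstandard(fit)        # standardized residuals (Q-Q plot)
h  <- hatvalues(fit)        # leverage
d  <- cooks.distance(fit)   # influence

# flag unusually influential points (a common rule of thumb: D > 4/n)
which(d > 4 / nrow(mtcars))
```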

Linear Regression - Prediction

Predict the output for a new input

newData <- data.frame(education=13.2, income=12000, women=12)
predict(myReg, newData, interval="prediction")
##        fit     lwr      upr
## 1 64.09084 48.2459 79.93578
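interval = "prediction" gives an interval for a single new observation; interval = "confidence" gives an interval for the mean response at that input, which is always narrower because it ignores the residual variation of individual observations. A sketch with mtcars:

```r
fit <- lm(mpg ~ wt, data = mtcars)
new <- data.frame(wt = 3)

ci <- predict(fit, new, interval = "confidence")
pi <- predict(fit, new, interval = "prediction")

ci[, "upr"] - ci[, "lwr"]   # width of the confidence interval
pi[, "upr"] - pi[, "lwr"]   # wider: adds residual variation
```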

End of Session 3

Next up:

  1. Exercise 3
  2. Break

Return at 3:30 to discuss solutions to Exercise 3!