April 20, 2018
Statistical Distributions (very brief)
T-Tests
Linear Regression
d<dist>(): evaluates the probability density/mass function at a given value
r<dist>(): generates random numbers
p<dist>(): returns the cumulative distribution function (CDF) for a given quantile
q<dist>(): returns the quantile for a given probability
str(dnorm) # normal pdf
## function (x, mean = 0, sd = 1, log = FALSE)
dnorm(x = 0, mean = 0, sd = 1)
## [1] 0.3989423
x <- seq(from = -3, to = 3, by = 0.05)
y <- dnorm(x, mean = 0, sd = 1)
plot(x, y, type = "l", ylab = "Density")
str(rnorm) # generate random number from normal dist
## function (n, mean = 0, sd = 1)
rnorm(10, mean = 0, sd = 1)
## [1] -0.06119123 1.35703065 -0.42420890 -0.48579143 -1.34478334
## [6] -0.01492377 1.01789886 1.66025678 1.31331140 1.55496600
str(pnorm) # normal CDF
## function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
pnorm(0, mean = 0, sd = 1) # Pr[X <= 0] = ?
## [1] 0.5
str(qnorm) # normal quantile func
## function (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(0.975, mean = 0, sd = 1) # PR[X <= ?] = 0.975
## [1] 1.959964
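The same d/p/q/r naming convention holds for the other distributions R provides; a minimal sketch with the binomial distribution (parameters chosen purely for illustration):
dbinom(5, size = 10, prob = 0.5)   # Pr[X = 5] for Binomial(10, 0.5)
pbinom(5, size = 10, prob = 0.5)   # Pr[X <= 5]
qbinom(0.5, size = 10, prob = 0.5) # median
rbinom(3, size = 10, prob = 0.5)   # three random draws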
T-tests can be used to draw statistical conclusions about parameters of interest in the data, for example:
Is the mean of this data different from zero (or another number)?
Are the means of two data sets different from one another?
Is the regression slope coefficient different from zero?
T-tests can be categorized into two groups:
One-sample t-test
Two-sample t-test
set.seed(123)
oneSampData <- rnorm(100, mean = 0, sd = 1)
mean(oneSampData)
## [1] 0.09040591
sd(oneSampData)
## [1] 0.9128159
oneSampTest.0 <- t.test(oneSampData)
oneSampTest.0
##
## One Sample t-test
##
## data: oneSampData
## t = 0.99041, df = 99, p-value = 0.3244
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.09071657 0.27152838
## sample estimates:
## mean of x
## 0.09040591
names(oneSampTest.0)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
oneSampTest.0$statistic
## t
## 0.9904068
oneSampTest.0$estimate
## mean of x
## 0.09040591
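The other listed components can be extracted the same way, e.g., the p-value and confidence interval:
oneSampTest.0$p.value  # the p-value shown in the printout above
oneSampTest.0$conf.int # 95 percent confidence interval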
a <- 0.3
oneSampTest.mu <- t.test(oneSampData, mu = a)
oneSampTest.mu
##
## One Sample t-test
##
## data: oneSampData
## t = -2.2961, df = 99, p-value = 0.02378
## alternative hypothesis: true mean is not equal to 0.3
## 95 percent confidence interval:
## -0.09071657 0.27152838
## sample estimates:
## mean of x
## 0.09040591
Two-sample t-tests are categorized into three groups:
T-Test with equal variances
T-Test with unequal variances
Paired T-Test (a one-sample t-test on differences)
Samp1 <- rnorm(300, mean = 2.5, sd = 1)
Samp2 <- rnorm(500, mean = 3.0, sd = 1) # notice: not the same sample size
plot(density(Samp1), col="red", main="Densities of Samp1 and Samp2", xlab="")
abline(v = mean(Samp1), col = "red", lwd = 2, lty=2)
lines(density(Samp2), col="blue")
abline(v = mean(Samp2), col = "blue", lwd = 2, lty = 2)
legend("topright", legend = c("Samp1", "Samp2"),
fill = c("red","blue"), bty = "n", cex = 1.3)
Null hypothesis: \(\mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0\)
t.test(Samp1, Samp2) # default assump: unequal variances
##
## Welch Two Sample t-test
##
## data: Samp1 and Samp2
## t = -7.3058, df = 638.78, p-value = 8.249e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6730166 -0.3878678
## sample estimates:
## mean of x mean of y
## 2.492232 3.022674
Null hypothesis: \(\mu_1 = \mu_2 \Leftrightarrow \mu_1 - \mu_2 = 0\)
t.test(Samp1, Samp2, var.equal = TRUE) # now assume equal variances
##
## Two Sample t-test
##
## data: Samp1 and Samp2
## t = -7.2723, df = 798, p-value = 8.447e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6736199 -0.3872646
## sample estimates:
## mean of x mean of y
## 2.492232 3.022674
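As a sanity check, the pooled (equal-variance) t statistic can be computed by hand; a minimal sketch:
n1 <- length(Samp1); n2 <- length(Samp2)
sp2 <- ((n1 - 1) * var(Samp1) + (n2 - 1) * var(Samp2)) / (n1 + n2 - 2) # pooled variance
(mean(Samp1) - mean(Samp2)) / sqrt(sp2 * (1/n1 + 1/n2)) # matches the t statistic above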
Let \(D \equiv \{x_i - y_i : i=1, \ldots, n \}\).
Null hypothesis: \(\mu_D = 0\)
t.test(Samp1, Samp2[1:300], paired = TRUE) # must be of the same sample size
##
## Paired t-test
##
## data: Samp1 and Samp2[1:300]
## t = -5.7581, df = 299, p-value = 2.109e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6428004 -0.3153406
## sample estimates:
## mean of the differences
## -0.4790705
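Since the paired t-test is just a one-sample t-test on the differences, the same result can be obtained with:
t.test(Samp1 - Samp2[1:300]) # identical t, df, and p-value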
Approach to model the relationship between a response variable \(Y\) and one or more predictor variables \(X\).
\(Y\) is a noisy linear combination of \(X\): \[Y = X \beta + \epsilon\]
Linear regression is equivalent to modeling the mean of \(Y\); namely, \(\mu = E(Y) = X\beta\).
There are multiple ways to estimate \(\beta\), but we will focus on the most basic method: ordinary least squares (OLS) estimation.
We use lm() to fit linear regression models, which requires us to pass in a regression formula.
We don't need to create the matrix \(X\) ourselves (R does this for us); we just tell lm() which predictors to include.
Some notes on formula:
The response goes on the left of the ~, and the predictors go on the right, separated by + symbols (e.g., y ~ x + z).
A . on the right-hand side stands for all remaining variables in the data (e.g., y ~ .).
To fit a model without an intercept, use y ~ 0 + . or y ~ . - 1
To regress y on x and x^2, we simply use y ~ x + I(x^2)
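Formulas are ordinary R objects that can be built and inspected before fitting; a small sketch (using variable names from the prestige data loaded below):
f <- prestige ~ education + I(education^2) # store a formula for later use
class(f)    # "formula"
all.vars(f) # variables the formula references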
We will fit a linear model using the prestige data. Recall that we have the following variables:
education: Average education of occupational incumbents, years, in 1971.
income: Average income of incumbents, dollars, in 1971.
women: Percentage of incumbents who are women.
prestige: Pineo-Porter prestige score for occupation, from a social survey conducted in the mid-1960s.
census: Canadian Census occupational code.
type: Type of occupation, a factor with levels
bc: Blue Collar
prof: Professional, Managerial, and Technical
wc: White Collar
prestige <- read.csv(file = here::here("data", "prestige_v2.csv"),
                     row.names=1)
str(prestige)
## 'data.frame': 101 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
head(prestige)
## education income women prestige census type
## gov.administrators 13.11 12351 11.16 68.8 1113 prof
## general.managers 12.26 25879 4.02 69.1 1130 prof
## accountants 12.77 9271 15.70 63.4 1171 prof
## purchasing.officers 11.42 8865 9.11 56.8 1175 prof
## chemists 14.62 8403 11.68 73.5 2111 prof
## physicists 15.64 11030 5.13 77.6 2113 prof
library(car)
scatterplotMatrix(prestige[,c("prestige","education","income","women")])
myReg <- lm(prestige ~ education + income + women, data = prestige)
myReg
##
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
##
## Coefficients:
## (Intercept) education income women
## -6.801684 4.182482 0.001315 -0.008304
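These coefficients are the ordinary least squares estimates \(\hat{\beta} = (X^\top X)^{-1} X^\top Y\); a quick sketch verifying them by hand:
X <- model.matrix(~ education + income + women, data = prestige) # R builds this internally
solve(t(X) %*% X, t(X) %*% prestige$prestige) # matches the coefficients above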
names(myReg)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(myReg)
##
## Call:
## lm(formula = prestige ~ education + income + women, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.7831 -5.4102 -0.1204 5.1821 17.5318
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.8016842 3.2543983 -2.090 0.0392 *
## education 4.1824817 0.3907842 10.703 < 2e-16 ***
## income 0.0013153 0.0002791 4.712 8.2e-06 ***
## women -0.0083038 0.0306187 -0.271 0.7868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.883 on 97 degrees of freedom
## Multiple R-squared: 0.798, Adjusted R-squared: 0.7917
## F-statistic: 127.7 on 3 and 97 DF, p-value: < 2.2e-16
myReg.summary <- summary(myReg)
names(myReg.summary) # show different contents
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
names(myReg) # this is what we had previously
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
We can compute a confidence interval for a single coefficient, such as income:
confint(myReg, 'income', level=0.95)
## 2.5 % 97.5 %
## income 0.000761253 0.001869315
Or for all coefficients at once:
confint(myReg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) -13.260764035 -0.342604455
## education 3.406883236 4.958080173
## income 0.000761253 0.001869315
## women -0.069073423 0.052465913
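The same interval can be built by hand from the coefficient table, as estimate \(\pm\) \(t_{0.975, df}\) times the standard error:
est <- coef(myReg)["income"]
se  <- coef(summary(myReg))["income", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = myReg$df.residual) * se # matches confint() above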
Recall that type of occupation has a relationship with prestige score. So let's add the predictor type into our regression model:
mod <- update(myReg, ~ . + type)
summary(mod)
##
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1126 -4.9844 0.0942 5.1318 19.4630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7378124 5.5112850 0.497 0.6205
## education 3.1246575 0.6644750 4.702 8.70e-06 ***
## income 0.0011764 0.0002702 4.355 3.36e-05 ***
## women 0.0050863 0.0308647 0.165 0.8695
## typeprof 8.5502774 4.0724806 2.100 0.0384 *
## typewc -1.1457779 2.7280276 -0.420 0.6754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared: 0.8192, Adjusted R-squared: 0.8097
## F-statistic: 86.1 on 5 and 95 DF, p-value: < 2.2e-16
Suppose we want to test which of two models is better when one model is nested within the other (i.e., one model contains all the terms of the other plus at least one additional term).
formula(myReg)
## prestige ~ education + income + women
## <environment: 0x107231590>
formula(mod)
## prestige ~ education + income + women + type
## <environment: 0x107231590>
We say that myReg is nested within mod, or equivalently that myReg is the reduced model and mod is the full model.
Hypothesis Test:
Null: Reduced model fits as well as the full model (formally, coefficients for the omitted terms are all zero).
Alternative: Full model fits better than the reduced model (formally, at least one omitted coefficient is non-zero).
anova(myReg, mod)
## Analysis of Variance Table
##
## Model 1: prestige ~ education + income + women
## Model 2: prestige ~ education + income + women + type
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 97 6028.2
## 2 95 5394.6 2 633.6 5.5789 0.005119 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So we reject the null hypothesis and conclude that at least one of the levels of type of occupation contributes information about the prestige score.
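The partial F statistic can also be recovered by hand from the printed residual sums of squares:
((6028.2 - 5394.6) / 2) / (5394.6 / 95) # F = 5.58, as in the table above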
Say we want to change the reference level for the covariate type so that the model's intercept corresponds to professionals. We simply relevel the variable type and re-fit the model.
levels(prestige$type)
## [1] "bc" "prof" "wc"
prestige$type <- relevel(prestige$type, "prof")
levels(prestige$type)
## [1] "prof" "bc" "wc"
Note: This does not change the results of the model (e.g., predictions, F-statistics) but does change the interpretation and p-values of the type coefficients.
mod <- update(myReg, ~ . + type)
summary(mod)
##
## Call:
## lm(formula = prestige ~ education + income + women + type, data = prestige)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1126 -4.9844 0.0942 5.1318 19.4630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.2880898 9.1659054 1.232 0.22116
## education 3.1246575 0.6644750 4.702 8.70e-06 ***
## income 0.0011764 0.0002702 4.355 3.36e-05 ***
## women 0.0050863 0.0308647 0.165 0.86946
## typebc -8.5502774 4.0724806 -2.100 0.03842 *
## typewc -9.6960553 2.9391152 -3.299 0.00137 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.536 on 95 degrees of freedom
## Multiple R-squared: 0.8192, Adjusted R-squared: 0.8097
## F-statistic: 86.1 on 5 and 95 DF, p-value: < 2.2e-16
Automated algorithms for variable selection exist, but they should be used with caution.
It is better to start with an a priori set of variables you need to include and then compare models, especially when the goal is association rather than prediction.
Stepwise regression: start with a model and add (or remove) predictors until some criterion is met
null <- lm(prestige ~ 1, data=prestige) # most basic model, intercept only
full <- lm(prestige ~ education + income + women + type, data=prestige) # saturated model, all predictors
step(null, scope=list(lower=null, upper=full), direction='forward')
## Start: AIC=576.54
## prestige ~ 1
##
## Df Sum of Sq RSS AIC
## + education 1 21567.5 8274.6 448.99
## + type 2 20685.1 9157.0 461.22
## + income 1 15236.2 14605.9 506.38
## <none> 29842.1 576.54
## + women 1 400.9 29441.1 577.18
##
## Step: AIC=448.99
## prestige ~ education
##
## Df Sum of Sq RSS AIC
## + income 1 2241.78 6032.8 419.07
## + type 2 1428.45 6846.1 433.85
## + women 1 866.64 7407.9 439.81
## <none> 8274.6 448.99
##
## Step: AIC=419.07
## prestige ~ education + income
##
## Df Sum of Sq RSS AIC
## + type 2 636.63 5396.2 411.81
## <none> 6032.8 419.07
## + women 1 4.57 6028.2 421.00
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## + women 1 1.5421 5394.6 413.78
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
step(full, scope=list(lower=null, upper=full), direction='backward')
## Start: AIC=413.78
## prestige ~ education + income + women + type
##
## Df Sum of Sq RSS AIC
## - women 1 1.54 5396.2 411.81
## <none> 5394.6 413.78
## - type 2 633.60 6028.2 421.00
## - income 1 1076.80 6471.4 430.16
## - education 1 1255.70 6650.3 432.92
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## - type 2 636.63 6032.8 419.07
## - education 1 1276.30 6672.5 431.25
## - income 1 1449.96 6846.1 433.85
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
step(null, scope=list(lower=null, upper=full), direction='both')
## Start: AIC=576.54
## prestige ~ 1
##
## Df Sum of Sq RSS AIC
## + education 1 21567.5 8274.6 448.99
## + type 2 20685.1 9157.0 461.22
## + income 1 15236.2 14605.9 506.38
## <none> 29842.1 576.54
## + women 1 400.9 29441.1 577.18
##
## Step: AIC=448.99
## prestige ~ education
##
## Df Sum of Sq RSS AIC
## + income 1 2241.8 6032.8 419.07
## + type 2 1428.5 6846.1 433.85
## + women 1 866.6 7407.9 439.81
## <none> 8274.6 448.99
## - education 1 21567.5 29842.1 576.54
##
## Step: AIC=419.07
## prestige ~ education + income
##
## Df Sum of Sq RSS AIC
## + type 2 636.6 5396.2 411.81
## <none> 6032.8 419.07
## + women 1 4.6 6028.2 421.00
## - income 1 2241.8 8274.6 448.99
## - education 1 8573.1 14605.9 506.38
##
## Step: AIC=411.81
## prestige ~ education + income + type
##
## Df Sum of Sq RSS AIC
## <none> 5396.2 411.81
## + women 1 1.54 5394.6 413.78
## - type 2 636.63 6032.8 419.07
## - education 1 1276.30 6672.5 431.25
## - income 1 1449.96 6846.1 433.85
##
## Call:
## lm(formula = prestige ~ education + income + type, data = prestige)
##
## Coefficients:
## (Intercept) education income typebc typewc
## 11.520005 3.135291 0.001153 -8.646693 -9.655940
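If we just want the selected model as an object, without the step-by-step printout, step() accepts trace = 0; a small convenience sketch:
selected <- step(null, scope=list(lower=null, upper=full), direction='both', trace=0)
formula(selected) # prestige ~ education + income + type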
All subsets regression: test all possible subsets of the predictors
Best subsets regression: similar to all subsets, but only look at the best model of each size
Multiple packages can do this, but we will use leaps
if (!require("leaps")) install.packages("leaps") # install leaps if it isn't already
library(leaps)
bestSubsets <- regsubsets(prestige ~ education + income + women + type, data=prestige, nbest=1)
summary(bestSubsets)
## Subset selection object
## Call: .f(.x[[i]], ...)
## 5 Variables (and intercept)
## Forced in Forced out
## education FALSE FALSE
## income FALSE FALSE
## women FALSE FALSE
## typebc FALSE FALSE
## typewc FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## education income women typebc typewc
## 1 ( 1 ) "*" " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " "
## 3 ( 1 ) "*" "*" " " " " "*"
## 4 ( 1 ) "*" "*" " " "*" "*"
## 5 ( 1 ) "*" "*" "*" "*" "*"
par(mfrow=c(1,2))
plot(bestSubsets, scale="adjr2"); plot(bestSubsets, scale="Cp")
Note: The plot on the right is Mallows' \(C_p\), a statistic used to compare models that accounts for the number of predictors in the model (we want it to be roughly equal to the number of predictors included in the model).
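The numbers behind these plots can also be extracted from the summary object, which stores (among other components) adjr2 and cp:
ss <- summary(bestSubsets)
which.max(ss$adjr2) # size of the model with the highest adjusted R-squared
ss$cp               # Mallows' Cp for each model size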
We can check the model assumptions with R's built-in diagnostic plots:
par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
plot(myReg) # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
Finally, we can predict the response for a new input, with a prediction interval for a single new observation:
newData <- data.frame(education=13.2, income=12000, women=12)
predict(myReg, newData, interval="prediction")
## fit lwr upr
## 1 64.09084 48.2459 79.93578
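For comparison, interval="confidence" gives the narrower interval for the mean response at the same inputs, rather than for a single new observation:
predict(myReg, newData, interval="confidence") # interval for E[Y | newData]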