April 20, 2018
On January 28, 1986, the USA Space Shuttle Challenger exploded 73 seconds into flight, killing all 7 crew members. The explosion was traced to failure of O-ring seals in the solid rocket booster at liftoff.
It was hypothesized that there is a inverse relationship between O-ring failure and temperature.
The night before the flight, engineers had to make a decision whether or not to lauch Challenger the next day with a forecasted temperature of 31 degrees Fahrenheit.
The engineers had data available on 23 previous launches to aid in their decision with the following varaibles
More information about the O-ring data is available here.
oring <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/space-shuttle/o-ring-erosion-only.data")
names(oring) <- c("n_risk", "n_fail", "temp", "psi", "order")
head(oring)##   n_risk n_fail temp psi order
## 1      6      0   66  50     1
## 2      6      1   70  50     2
## 3      6      0   69  50     3
## 4      6      0   68  50     4
## 5      6      0   67  50     5
## 6      6      0   72  50     6summary(oring)##      n_risk      n_fail            temp            psi       
##  Min.   :6   Min.   :0.0000   Min.   :53.00   Min.   : 50.0  
##  1st Qu.:6   1st Qu.:0.0000   1st Qu.:67.00   1st Qu.: 75.0  
##  Median :6   Median :0.0000   Median :70.00   Median :200.0  
##  Mean   :6   Mean   :0.3043   Mean   :69.57   Mean   :152.2  
##  3rd Qu.:6   3rd Qu.:0.5000   3rd Qu.:75.00   3rd Qu.:200.0  
##  Max.   :6   Max.   :2.0000   Max.   :81.00   Max.   :200.0  
##      order     
##  Min.   : 1.0  
##  1st Qu.: 6.5  
##  Median :12.0  
##  Mean   :12.0  
##  3rd Qu.:17.5  
##  Max.   :23.0oring <- oring[order(oring$temp), ]  # sort by ascending temp
# some flights have the same number of failures & launch temp, so make point size vary 
pt.size <- rep(1, nrow(oring))
dups <- oring[duplicated(cbind(oring$temp, oring$n_fail)), c("n_fail", "temp")]
pt.size[as.numeric(rownames(dups))] <- 2
trips <- dups[duplicated(dups), ]
pt.size[as.numeric(rownames(trips))] <- 3
plot(oring$temp, oring$n_fail, pch = 20, cex = pt.size, 
     xlab = "Temperature (deg F)", ylab = "Number of Failures")  ## Fit a linear model
linearReg <- lm(n_fail ~ temp, data = oring)
summary(linearReg)## 
## Call:
## lm(formula = n_fail ~ temp, data = oring)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50921 -0.30810  0.00794  0.20905  0.74381 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.30159    0.83110   5.176 3.96e-05 ***
## temp        -0.05746    0.01189  -4.833 8.90e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3935 on 21 degrees of freedom
## Multiple R-squared:  0.5266, Adjusted R-squared:  0.5041 
## F-statistic: 23.36 on 1 and 21 DF,  p-value: 8.895e-05## Predict the output for the Challenger launch day
launchDay <- data.frame(temp = 31)
predict(linearReg, launchDay, interval="predict")  # point estimate & 95% CI##        fit      lwr     upr
## 1 2.520317 1.252255 3.78838## Plot the fit extrapolating to launch day
plot(oring$temp, oring$n_fail, pch = 20, cex = pt.size/2, 
     xlab = "Temperature (deg F)", ylab = "Num O-Ring Failures",
     xlim = c(30, 85), ylim = c(0, 5))
abline(reg = linearReg, col = "red", lwd = 2)
points(x=31, y=2.52, pch=4, col="black", lwd=2)
x <- seq(25,90,.1)
pred <- predict(linearReg, data.frame(temp=x), interval="predict")
lines(x, pred[,2], lty=2, col="red")
lines(x, pred[,3], lty=2, col="red")Linear regression clearly isn’t the proper technique for this type of data, but GLMs can overcome these issues.
Flexible extension of linear regression that allows response variable to have non-Gaussian error distributions.
Instead of modeling the mean \(\mu\) of the response variable directly as a linear combination of the predictors, we model a link function \(g(\mu)\). \[g[E(Y)] = X \beta \Rightarrow E(Y) = g^{-1}(X \beta) \]
\(Y\) must have a distribution from the exponential family, which includes the Normal, binomial, Poisson, gamma, and many others.
Choice of link function \(g\) depends on the distributional form of \(Y\). There can be multiple link functions for any singe distribution.
Specific GLM when dealing with binary response data \(Y_i \in \{0,1\}\) \[Y_i \sim Bernoulli(p_i)\] \[\Rightarrow f(Y|p) = \prod_{i=1}^{n} \left[ p_i^{y_i} (1-p_i)^{1-y_i} \right]\] where \(p_i \in [0,1]\) is the probability that \(Y_i = 1\).
Link function must map values in \((-\infty, \infty)\) to the range \([0,1]\)
Logit link function: \(g(p) = \log\left( \frac{p}{1-p} \right) = X\beta\)
Inverse of logit function is the logistic function: \(p = g^{-1}(X\beta) = \frac{e^{X\beta}}{e^{X\beta} + 1} = \frac{1}{1+e^{-X\beta}}\)
glm() is used to fit generalized linear models.
The formula argument is equivalent to that of lm() used for linear regression
New argument: family
family = binomial(link="logit")Summary(), update(), predict(), variable selection methods, etc. all work the same
## Reformat the response variable
oring$fail <- as.numeric(oring$n_fail > 0)
head(oring)##    n_risk n_fail temp psi order fail
## 14      6      2   53 200    14    1
## 9       6      1   57 200     9    1
## 23      6      1   58 200    23    1
## 10      6      1   63 200    10    1
## 1       6      0   66  50     1    0
## 5       6      0   67  50     5    0plot(oring$temp, oring$fail, pch = 20, cex = pt.size,
     xlab = "Temperature (deg F)", ylab = "O-Ring Failure")## Fit a GLM with logistic link
logisticReg <- glm(fail ~ temp, data = oring, family=binomial(link="logit"))
summary(logisticReg)## 
## Call:
## glm(formula = fail ~ temp, family = binomial(link = "logit"), 
##     data = oring)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.00214  -0.57815  -0.21802   0.04401   2.01706  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  23.7750    11.8204   2.011   0.0443 *
## temp         -0.3667     0.1752  -2.093   0.0363 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.402  on 22  degrees of freedom
## Residual deviance: 14.426  on 21  degrees of freedom
## AIC: 18.426
## 
## Number of Fisher Scoring iterations: 6names(logisticReg)##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"plot(oring$temp, oring$fail, pch = 20, cex = pt.size,
     xlab = "Temperature (deg F)", ylab = "O-Ring Failure")
curve(predict(logisticReg, data.frame(temp=x), type="response"), add=TRUE, col="red", lwd=2)To get probabilities, be sure to use type = "response".
launchDay##   temp
## 1   31predict(logisticReg, launchDay, type="response")##         1 
## 0.9999959predict() simply computes the linear combination & evaulates the inverse link
XB <- as.numeric(coef(logisticReg)[1] + coef(logisticReg)[2]*31)
XB  # estimated log-odds of o-ring failure on launch day## [1] 12.40723exp(XB)/(1 + exp(XB))  # estimated probability of o-ring failure on launch day## [1] 0.9999959\(\exp(\beta)\) is the multiplicative change in the odds of O-ring failure for a one degree increase in temperature.
95% CI for \(\exp(\beta)\) (profile likelihood method)
exp(confint(logisticReg, "temp", level=0.95))##     2.5 %    97.5 % 
## 0.4159216 0.8853986exp(confint.default(logisticReg, "temp", level=0.95))##          2.5 %    97.5 %
## temp 0.4916331 0.9768919logisticReg2 = update(logisticReg, . ~ . + psi, data=oring)
summary(logisticReg2)## 
## Call:
## glm(formula = fail ~ temp + psi, family = binomial(link = "logit"), 
##     data = oring)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.00837  -0.54507  -0.28098   0.02512   2.21488  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) 21.843631  11.936459   1.830   0.0673 .
## temp        -0.350098   0.172977  -2.024   0.0430 *
## psi          0.006007   0.009749   0.616   0.5378  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.402  on 22  degrees of freedom
## Residual deviance: 14.033  on 20  degrees of freedom
## AIC: 20.033
## 
## Number of Fisher Scoring iterations: 6anova(logisticReg, logisticReg2, test='LRT')## Analysis of Deviance Table
## 
## Model 1: fail ~ temp
## Model 2: fail ~ temp + psi
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        21     14.426                     
## 2        20     14.033  1   0.3929   0.5308Make sure you submit a course evaluation survey!