Introduce basic functions to perform logistic regression using R.
Logistic regression is a kind of linear regression for binary response data.
The interpretation of a linear regression is: for each one-unit increase in x, y changes by the value of the coefficient.
The effect of each explanatory variable is adjusted for the other variables in the model, and the effect is linear.
However, a linear model is not appropriate when the response is a binary or factor variable; this is where logistic regression comes in.
First of all, let's just take a quick look at probabilities:
An event with probability p has odds p / (1 - p). Another way to describe odds is by taking the logarithm of the odds (the logit).
Logit models require a substantial number of cases because they use maximum likelihood estimation. It is also important to keep in mind that when the outcome is rare, even if the overall dataset is large, it can be difficult to estimate a logit model.
The odds ratio is the ratio of the odds of the outcome between two levels of a risk factor/explanatory variable.
Our H0 (null hypothesis) states that there is no association between a risk factor and the outcome variable.
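As a quick illustration of probabilities, odds, and the logit, here is a small sketch with made-up numbers (the probabilities below are hypothetical, not from our data):

```r
# Probability, odds, and log-odds (logit) for an event with p = 0.25.
p <- 0.25
odds <- p / (1 - p)   # 0.333...: the event is 1:3 against
logit <- log(odds)    # the log-odds, about -1.10

# An odds ratio compares the odds between two groups, e.g. exposed
# (hypothetical p = 0.40) versus unexposed (p = 0.25):
p_exposed <- 0.40
OR <- (p_exposed / (1 - p_exposed)) / odds   # 2: the exposed odds are twice as high
```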
library(readr) # Reading data.
library(tibble) # The "tibble" datastructure.
library(dplyr) # Working with the tibble datastructure.
library(ggplot2) # Plotting.
library(lubridate) # Working with datetime objects.
library(magrittr) # Need this for the %>% operator.
library(devtools) # For printing the session info at the end of the notebook.
library(knitr) # Dynamic report generation.
library(aod) # Wald tests for model coefficients (wald.test).
We will go back to the data loaded previously (“weight_height_data.xlsx”). The data was loaded as a dataset called “data” and manipulated previously. We do not have a binomial variable in our data set, so this case just serves as an example. To create a binomial response variable, we will create a variable that describes whether Height_2 is equal to or above 1.38 m (h2 < 1.38 m = 0, h2 >= 1.38 m = 1).
The outcome variable is currently a continuous variable, but our hypothesis is stated as if it were binary, so we have to convert it to a binary variable. The predictor variables are continuous and dichotomous variables (h1, w0, gender).
data <- read_csv('weight_height_cleanup.csv')
Parsed with column specification:
cols(
gender = col_character(),
w0 = col_double(),
h0 = col_double(),
h1 = col_double(),
w1 = col_double(),
h2 = col_double(),
w2 = col_double(),
age1 = col_double(),
age2 = col_double(),
bmi0 = col_double(),
bmi1 = col_double(),
bmi2 = col_double()
)
head(data)
We will create the dichotomous variable ‘h2d’ based on the height data from measurement 2 (‘h2’):
data <- mutate(data, h2d = ifelse(h2 >= 1.38, 1, 0))
xtabs(~h2d + gender, data=data)
gender
h2d Female Male
0 143 154
1 82 65
First, let's change the dichotomous variable to a factor variable:
data$h2d <- factor(data$h2d)
Then, let's use the ‘glm’ function (generalized linear model).
First we specify the model formula: we want to see if height at birth (h0) can predict whether a person grows taller than 1.38 m.
logit <- glm(h2d ~ h0, data=data, family = "binomial")
summary(logit)
Call:
glm(formula = h2d ~ h0, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8625 -0.9084 -0.6788 1.1744 2.0751
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -19.680 3.099 -6.351 2.14e-10 ***
h0 33.943 5.525 6.144 8.05e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 563.83 on 443 degrees of freedom
Residual deviance: 516.12 on 442 degrees of freedom
AIC: 520.12
Number of Fisher Scoring iterations: 4
- The first part of the output shows what was run.
- Next we see the deviance residuals.
- Then we are presented with the actual model: the coefficients, their standard errors, the z-statistic (Wald z-statistic), and the associated p-values.
- Each one-unit (1 m) increase in h0 increases the log odds of being taller than 1.38 m by 33.94, and the p-value indicates that this effect is highly significant.
- The difference between the null deviance and the residual deviance tells us how well the model fits compared to an empty model: the greater the difference, the better the model.
- The null deviance is the deviance of a model containing only the intercept, while the residual deviance is the deviance when all the variables are taken into account. It makes sense to consider the model good if that difference is big enough.
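To go from the log-odds scale back to probabilities we can apply the inverse logit, which R provides as ‘plogis’. A small sketch using the estimates from the summary above (the birth heights below are hypothetical):

```r
# Fitted probability of being taller than 1.38 m at measurement 2,
# using the coefficients from the summary above.
intercept <- -19.680
slope <- 33.943
h0 <- c(0.50, 0.58, 0.66)        # hypothetical birth heights in metres
plogis(intercept + slope * h0)   # approx. 0.06, 0.50, 0.94
```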
If we want to plot the data and the logistic regression model we can do this using ggplot (the factor ‘h2d’ has to be converted back to 0/1 for the smoother):
ggplot(data, aes(x=h0, y=as.numeric(as.character(h2d)))) + geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
Now we will create a more complicated model with more predictors.
logit <- glm(h2d ~ w0 + h0 + w1 + h1 + age2 + gender, data=data, family = "binomial")
summary(logit)
Call:
glm(formula = h2d ~ w0 + h0 + w1 + h1 + age2 + gender, family = "binomial",
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3849 -0.5653 -0.2210 0.4461 2.8707
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -80.1542 9.3865 -8.539 < 2e-16 ***
w0 0.2290 0.3513 0.652 0.5144
h0 15.3870 7.0152 2.193 0.0283 *
w1 0.2247 0.1412 1.591 0.1115
h1 78.7225 10.0100 7.864 3.71e-15 ***
age2 0.3659 0.4591 0.797 0.4254
genderMale -1.2152 0.2941 -4.131 3.61e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 563.83 on 443 degrees of freedom
Residual deviance: 327.85 on 437 degrees of freedom
AIC: 341.85
Number of Fisher Scoring iterations: 6
We can also add confidence intervals, either based on the profiled log-likelihood (the default for ‘confint’ on a glm) or based on the standard errors:
confint(logit)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -99.60605340 -62.7121704
w0 -0.45761694 0.9227437
h0 1.84689532 29.5711240
w1 -0.02959402 0.5336110
h1 59.80659664 99.1679029
age2 -0.52698287 1.2778403
genderMale -1.80708767 -0.6510048
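The standard-error-based (Wald) intervals can be obtained with ‘confint.default’, or computed by hand as estimate ± qnorm(0.975) × standard error; for this model they should be close to the profiled intervals above. A sketch, assuming the fitted model object ‘logit’ from above:

```r
# Wald confidence intervals computed from the standard errors.
est <- coef(logit)
se <- sqrt(diag(vcov(logit)))
cbind(lower = est - qnorm(0.975) * se, upper = est + qnorm(0.975) * se)

# confint.default(logit) produces the same intervals.
```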
You can also exponentiate the coefficients and interpret them as odds ratios. R will do this computation for you: you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is the coefficient vector of the model, coef(logit). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
exp(cbind(OR = coef(logit), confint(logit)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 1.546921e-35 5.516208e-44 5.813671e-28
w0 1.257337e+00 6.327898e-01 2.516185e+00
h0 4.813976e+06 6.340105e+00 6.959467e+12
w1 1.251962e+00 9.708396e-01 1.705078e+00
h1 1.544309e+34 9.411848e+25 1.169696e+43
age2 1.441866e+00 5.903835e-01 3.588881e+00
genderMale 2.966617e-01 1.641314e-01 5.215215e-01
It is possible to test for interactions when there are multiple predictors. The interactions can be specified individually, as with a + b + c + a:b + b:c + a:b:c, or they can be expanded automatically, with a * b * c.
In this case we chose to specify a few pairwise interactions individually.
logit_int <- glm(h2d ~ w0 + h0 + w1 + h1 + age2 + gender + h0:w0 + w1:h1 + h0:h1, data=data, family = "binomial")
summary(logit_int)
It was probably not a very good idea in this case, since none of the interaction terms are significant and they do not make much sense here; it was mostly to show the syntax.
We will go back to the previous model.
Returning to the overall fit of the model, we can compute the difference between the null deviance and the residual deviance, the difference in degrees of freedom, and the associated p-value:
with(logit, null.deviance - deviance)
with(logit, df.null - df.residual)
with(logit, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
The chi-square of 235.98 (563.83 - 327.85) with 6 degrees of freedom (443 - 437) and an associated p-value of <0.001 tells us that our model as a whole fits significantly better than an empty model. This is sometimes called a likelihood ratio test (the residual deviance is -2 times the log-likelihood). To see the model’s log likelihood, we type:
logLik(logit)
'log Lik.' -163.9263 (df=7)
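The same likelihood-ratio logic is available through ‘anova’, which can also compare two nested models directly, for example the main-effects model against the interaction model fitted earlier. A sketch, assuming both ‘logit’ and ‘logit_int’ exist:

```r
# Sequential chi-square tests for each term added to the model.
anova(logit, test = "Chisq")

# Likelihood-ratio test of the interaction model against the
# main-effects model (the models must be nested).
anova(logit, logit_int, test = "Chisq")
```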
We will now create a factor. None of the parameters in the dataset really work as a factor, so we will just create a random factor variable. Imagine that it is a variable describing the children's parents' educational status.
f <- floor(runif(444, min=1, max=5))
data <- data %>% mutate(f = as.factor(f))
Furthermore, we will remove data that we are not using:
data <- select(data, -bmi1)
data <- select(data, -age1)
head(data, n=5L)
logit <- glm(h2d ~ w0 + h0 + w1 + h1 + age2 + gender + f, data=data, family = "binomial")
summary(logit)
Call:
glm(formula = h2d ~ w0 + h0 + w1 + h1 + age2 + gender + f, family = "binomial",
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5390 -0.5405 -0.2203 0.4654 2.8475
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -80.5370 9.4596 -8.514 < 2e-16 ***
w0 0.2015 0.3555 0.567 0.5710
h0 15.0635 7.0942 2.123 0.0337 *
w1 0.2204 0.1401 1.574 0.1155
h1 79.5339 10.0984 7.876 3.38e-15 ***
age2 0.3244 0.4612 0.703 0.4818
genderMale -1.2322 0.2980 -4.135 3.55e-05 ***
f2 0.5361 0.4097 1.308 0.1907
f3 0.7404 0.3913 1.892 0.0585 .
f4 0.3995 0.4042 0.988 0.3230
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 563.83 on 443 degrees of freedom
Residual deviance: 324.03 on 434 degrees of freedom
AIC: 344.03
Number of Fisher Scoring iterations: 6
confint(logit)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -100.13933509 -62.9604668
w0 -0.49359449 0.9035203
h0 1.39984930 29.4563125
w1 -0.03295962 0.5283004
h1 60.45053624 100.1669201
age2 -0.57218338 1.2413096
genderMale -1.83242651 -0.6610458
f2 -0.26389978 1.3478216
f3 -0.01961977 1.5193902
f4 -0.39056833 1.1993458
The indicator variables for the factor levels have a slightly different interpretation. For example, the children whose parents have an educational level of rank 2, versus rank 1, have their log odds of being taller than 1.38 m changed by the f2 coefficient (different every time we run the script, as we have generated the ranks randomly).
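Because the three ‘f’ dummies belong to a single factor, we can also test their joint effect with ‘wald.test’ from the aod package loaded at the top. In the summary above, f2, f3 and f4 are the 8th to 10th coefficients; a sketch, assuming the fitted model object ‘logit’ from above:

```r
library(aod)

# Joint Wald test that the coefficients of f2, f3 and f4 are all zero,
# i.e. that parental educational status has no overall effect.
wald.test(b = coef(logit), Sigma = vcov(logit), Terms = 8:10)
```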
You can also exponentiate the coefficients again:
exp(cbind(OR = coef(logit), confint(logit)))
Waiting for profiling to be done...
OR 2.5 % 97.5 %
(Intercept) 1.054957e-35 3.236230e-44 4.535411e-28
w0 1.223196e+00 6.104283e-01 2.468277e+00
h0 3.483353e+06 4.054589e+00 6.204603e+12
w1 1.246616e+00 9.675776e-01 1.696047e+00
h1 3.476313e+34 1.791985e+26 3.176440e+43
age2 1.383249e+00 5.642920e-01 3.460142e+00
genderMale 2.916495e-01 1.600248e-01 5.163111e-01
f2 1.709406e+00 7.680505e-01 3.849032e+00
f3 2.096782e+00 9.805714e-01 4.569438e+00
f4 1.491053e+00 6.766722e-01 3.317946e+00
devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.4.4 (2018-03-15)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Atlantic/Faroe
date 2019-08-26
─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
aod * 1.3.1 2019-01-26 [1] CRAN (R 3.4.4)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.4.4)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.4.4)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.4.4)
bitops 1.0-6 2013-08-17 [1] CRAN (R 3.4.4)
broom 0.5.2 2019-04-07 [1] CRAN (R 3.4.4)
callr 3.3.1 2019-07-18 [1] CRAN (R 3.4.4)
cellranger 1.1.0 2016-07-27 [1] CRAN (R 3.4.4)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.4.4)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.4.4)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.4.4)
curl 4.0 2019-07-22 [1] CRAN (R 3.4.4)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.4.4)
devtools * 2.1.0 2019-07-06 [1] CRAN (R 3.4.4)
digest 0.6.20 2019-07-04 [1] CRAN (R 3.4.4)
dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.4.4)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.4.4)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.4.4)
forcats * 0.4.0 2019-02-17 [1] CRAN (R 3.4.4)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.4.4)
generics 0.0.2 2018-11-29 [1] CRAN (R 3.4.4)
ggmap * 3.0.0 2019-02-05 [1] CRAN (R 3.4.4)
ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.4.4)
ggrepel * 0.8.1 2019-05-07 [1] CRAN (R 3.4.4)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.4.4)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 3.4.4)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.4.4)
haven 2.1.1 2019-07-04 [1] CRAN (R 3.4.4)
hms 0.5.1 2019-08-23 [1] CRAN (R 3.4.4)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.4.4)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.4.4)
jpeg 0.1-8 2014-01-23 [1] CRAN (R 3.4.4)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.4.4)
knitr * 1.24 2019-08-08 [1] CRAN (R 3.4.4)
labeling 0.3 2014-08-23 [1] CRAN (R 3.4.4)
lattice 0.20-38 2018-11-04 [4] CRAN (R 3.4.4)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.4.4)
lubridate * 1.7.4 2018-04-11 [1] CRAN (R 3.4.4)
magrittr * 1.5 2014-11-22 [1] CRAN (R 3.4.4)
mapproj * 1.2.6 2018-03-29 [1] CRAN (R 3.4.4)
maps * 3.3.0 2018-04-03 [1] CRAN (R 3.4.4)
MASS 7.3-50 2018-04-30 [4] CRAN (R 3.4.4)
Matrix 1.2-14 2018-04-09 [4] CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.4.4)
modelr 0.1.5 2019-08-08 [1] CRAN (R 3.4.4)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.4.4)
nlme 3.1-137 2018-04-07 [4] CRAN (R 3.4.4)
pillar 1.4.2 2019-06-29 [1] CRAN (R 3.4.4)
pkgbuild 1.0.4 2019-08-05 [1] CRAN (R 3.4.4)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.4.4)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.4.4)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.4.4)
png 0.1-7 2013-12-03 [1] CRAN (R 3.4.4)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.4.4)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.4.4)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.4.4)
purrr * 0.3.2 2019-03-15 [1] CRAN (R 3.4.4)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.4.4)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.4.4)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.4.4)
readxl * 1.3.1 2019-03-13 [1] CRAN (R 3.4.4)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.4.4)
RgoogleMaps 1.4.4 2019-08-20 [1] CRAN (R 3.4.4)
rjson 0.2.20 2018-06-08 [1] CRAN (R 3.4.4)
rlang 0.4.0 2019-06-25 [1] CRAN (R 3.4.4)
rmarkdown 1.15 2019-08-21 [1] CRAN (R 3.4.4)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.4.4)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.4.4)
rvest 0.3.4 2019-05-15 [1] CRAN (R 3.4.4)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.4.4)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.4.4)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.4.4)
stringr * 1.4.0 2019-02-10 [1] CRAN (R 3.4.4)
survival * 2.44-1.1 2019-04-01 [1] CRAN (R 3.4.4)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.4.4)
tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.4.4)
tidyr * 0.8.3 2019-03-01 [1] CRAN (R 3.4.4)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.4.4)
tidyverse * 1.2.1 2017-11-14 [1] CRAN (R 3.4.4)
usethis * 1.5.1 2019-07-04 [1] CRAN (R 3.4.4)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.4.4)
vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.4.4)
viridis * 0.5.1 2018-03-29 [1] CRAN (R 3.4.4)
viridisLite * 0.3.0 2018-02-01 [1] CRAN (R 3.4.4)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.4.4)
xfun 0.9 2019-08-21 [1] CRAN (R 3.4.4)
xml2 1.2.2 2019-08-09 [1] CRAN (R 3.4.4)
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.4.4)
zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.4.4)
[1] /home/olavur/R/x86_64-pc-linux-gnu-library/3.4
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library