In this case study, we will calculate each individual’s BMI from their weight and height measurements at three timepoints, at birth and at two exam days where the children are about 2 and 9 years of age. We will deal with a dataset that is more problematic than the previous, as it is less straight forward to load and contains missing data. We are going to clean the data up before computing BMI, by removing outliers and imputing missing data.
We need to install a package for the specific imputation method we will be using.
install.packages("missForest")
library(readr) # For reading the data.
library(tibble) # The "tibble" datastructure.
library(dplyr) # Working with the tibble datastructure.
library(ggplot2) # Plotting.
library(lubridate) # Working with datetime objects.
library(missForest) # For imputing missing values.
library(magrittr) # Need this for the %>% operator.
library(devtools) # For printing the session info at the end of the notebook.
We will also need the following script to make some of the plots in this notebook.
Multiplot:
http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
We have saved this script somewhere on the computer, and run the script to load the functions in it. After running the script, we have the multiplot()
function available.
source("../multiplot.R")
The data we want to load is CSV. However, it doesn’t use the standard comma delimitation as CSV files do; if we open the file in a text editor, we find that it uses semicolon. So we use read_delim()
and specify delim=';'
. Second, the first line in the file is not a proper entry in the CSV format, it is just a line of free text; we ignore this line using the skip=1
argument. Finally, the file contains some whitespace, which will give our columns messy names with a bunch of spaces in them; we fix this using trim_ws=TRUE
.
data <- read_delim('weight_height_data.csv', delim=';', skip=1, trim_ws=TRUE)
You can download the data we use in this case here:
As in previous examples, we will give the variables convenient names, remove unnecessary variables, and convert character variables to factors.
data = data %>% rename("ID" = "Identification number", "birth" = "Date of birth", "gender" = "Gender",
"w0" = "Weight_birth", "h0" = "Height_birth", "exam1" = "Exam_day1", "h1" = "Height_1",
"w1" = "Weight_1", "exam2" = "Exam_day2", "h2" = "Height_2", "w2" = "Weight_2")
data <- data %>% select(-'ID')
data <- data %>% mutate(gender=as.factor(gender))
head(data)
Note that the birth weight is in grams, while the other weight measurements are in kilograms, so we will convert the birth weight. We do this using the mutate()
function, calculating the new vaue w0 / 1000
and storing it in the w0
variable, thereby replacing the original values in the w0
variable.
We also convert the height measures to meters, because we want to compute BMI which unit is \(kg/m^2\).
# Convert birth weight from grams to kilograms, to match the other weight units.
data = data %>% mutate(w0=w0/1000)
# Convert height measures to meters.
data = data %>% mutate(h0=h0/100)
data = data %>% mutate(h1=h1/100)
data = data %>% mutate(h2=h2/100)
We want to know the ages of the children at exam day 1 and 2. Unfortunately, working with dates can be a hassle, because of irregular formatted dates, and not completely straight-forward math in computing the time differences. Luckily, the lubridate
package makes things (relatively) easy for us.
All the dates are are formatted as days, months and years, but examples include “30-Dec-06” and “14.08.08”. So let’s convert all the dates to the same format using the dmy()
(“day-month-year”) function from lubridate
. Just to check the the data type for these reformatted dates, we pull one of the columns and use the class()
function.
# Convert all dates to lubridate's "Date" objects.
data = data %>% mutate(birth=dmy(birth))
data = data %>% mutate(exam1=dmy(exam1))
data = data %>% mutate(exam2=dmy(exam2))
# We show what the data type is for the dates.
data %>% pull(birth) %>% class()
[1] "Date"
Now we want to compute new variables in the data representing the ages. To compute the age, we use the interval()
function, convert the interval to a “duration” (in the lingo of lubridate
), and then convert that into a numeric value representing years as a floating point number.
# Convert to an "interval" object, then to a "duration" object, and finally to a numeric value representing the number of years as a floating point.
data = data %>% mutate(age1=interval(birth, exam1) %>%
as.duration(.) %>%
as.numeric(., "years"))
# Do the same for exam day 2.
data = data %>% mutate(age2=interval(birth, exam2) %>%
as.duration(.) %>%
as.numeric(., "years"))
We won’t be using the dates anymore, now that we have calculated the ages of the children, so we will remove those columns. We use the select()
function, as usual, but this time we supply a list of columns we want to remove, rather than the ones we want to keep, and use the minus sign (-
) in front of the list to invert this selection operation.
# Remove some columns from the data that we no longer need (note the minus sign
# in front of the list).
data = data %>% select(-c('birth', 'exam1', 'exam2'))
Let’s plot the distribution of some of our variables. Below we make a box plot and a histogram of our age variables. There is one very obvious outlier in the second exam day, where one child is three years old. We will definitely remove this outlier, as it doesn’t make sense for this experiment. Besides this outlier, the children are quite close in age. Note that we don’t have an assumption that these variables are normally distributed.
p1 = ggplot(data, aes(x='', age1)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='Age (years)', title='Box plot of age at exam day 1')
p2 = ggplot(data, aes(age1)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', title='Histogram of age at exam day 1', x='Age (years)', y='Count')
p3 = ggplot(data, aes(x='', age2)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='C', x='', y='Age (years)', title='Box plot of age at exam day 2')
p4 = ggplot(data, aes(age2)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='D', title='Histogram of age at exam day 2', x='Age (years)', y='Count')
multiplot(p1, p3, p2, p4, cols=2)
Similarly, we plot the distribution of the weight and the height variables below. We see an outlier with the birth height of about 1 meter, which we will remove. Although we see some other outliers we don’t have any rationale for removing them.
p1 = ggplot(data, aes(x='', w0)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='Weight (kg)', title='Box plot of weight at birth')
p2 = ggplot(data, aes(w0)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', x='Weight (kg)', y='Count', title='Histogram of weight at birth')
p3 = ggplot(data, aes(x='', w1)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='Weight (kg)', title='Box plot of weight at exam day 1')
p4 = ggplot(data, aes(w1)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', x='Weight (kg)', y='Count', title='Histogram of weight at exam day 1')
p5 = ggplot(data, aes(x='', w2)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='Weight (kg)', title='Box plot of weight at exam day 2')
p6 = ggplot(data, aes(w2)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', x='Weight (kg)', y='Count', title='Histogram of weight at exam day 2')
multiplot(p1, p3, p5, p2, p4, p6, cols=2)
p1 = ggplot(data, aes(x='', h0)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='Height (m)', title='Box plot of height at birth')
p2 = ggplot(data, aes(h0)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', x='Height (m)', y='Count', title='Histogram of height at birth')
p3 = ggplot(data, aes(x='', h1)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='Height (m)', title='Box plot of height at exam day 1')
p4 = ggplot(data, aes(h1)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', x='Height (m)', y='Count', title='Histogram of height at exam day 1')
p5 = ggplot(data, aes(x='', h2)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='Height (m)', title='Box plot of height at exam day 2')
p6 = ggplot(data, aes(h2)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', x='Height (m)', y='Count', title='Histogram of height at exam day 2')
multiplot(p1, p3, p5, p2, p4, p6, cols=2)
As we see below, when we filter the data we only remove two outliers, based on criteria discussed above.
# Number of observation before filter.
n_before = dim(data)[1]
# Remove outliers.
data = data %>% filter(age2 > 4 | is.na(age2)) %>%
filter(h0 < 0.9 | is.na(h0))
# Number of observations after filter.
n_after = dim(data)[1]
# Print difference.
print(n_before - n_after)
[1] 2
Before we impute missing data, let’s find out how many values are missing in each observation. Below, we see that most are missing none, many are missing three, and only a few are missing 6 values.
# Count the number of NA values in each observation and calculate a table.
is.na(data) %>% rowSums() %>% table()
.
0 1 3 6
353 2 84 5
Let’s also do this for each variable. Not surprisingly, exam day 2 has the most missing values.
# Count number of NA values in each variable.
is.na(data) %>% colSums()
gender w0 h0 h1 w1 h2 w2 age1 age2
0 0 1 24 25 70 70 24 70
We use the missForest()
function to impute missing data. There are many packages that perform imputation, but we chose this one because it makes few assumptions about the data. We could probably have used the MICE R package, which among other things assumes that the data follows a multivariate normal distribution.
Note that we pass the verbose=TRUE
argument to missForest()
, which makes missForest()
give us more information for each iteration (round) of imputation. Note by the error and the difference that the error drops quite quickly, such that we probably only needed a single iteration.
# Convert the tibble to dataframe befoew running imputation.
imp <- data %>%
as.data.frame() %>%
missForest(verbose=TRUE)
missForest iteration 1 in progress...done!
estimated error(s): 0.1542297 0
difference(s): 0.001429285 0
time: 1.851 seconds
missForest iteration 2 in progress...done!
estimated error(s): 0.1552789 0
difference(s): 0.0001688379 0
time: 1.362 seconds
missForest iteration 3 in progress...done!
estimated error(s): 0.1524495 0
difference(s): 0.0001482733 0
time: 1.327 seconds
missForest iteration 4 in progress...done!
estimated error(s): 0.1549887 0
difference(s): 5.877677e-05 0
time: 1.408 seconds
missForest iteration 5 in progress...done!
estimated error(s): 0.1524853 0
difference(s): 3.942262e-05 0
time: 1.491 seconds
missForest iteration 6 in progress...done!
estimated error(s): 0.1543504 0
difference(s): 7.13651e-05 0
time: 1.504 seconds
We print the out-of-bag error, the normalized root mean squared error (NRMSE) and the proportion of falsly classified (PFC). PFC is zero because we only have continuous variables.
imp$OOBerror
NRMSE PFC
0.1524853 0.0000000
We extract the data from the missForest()
object, imp
, which is a matrix, and convert it to a tibble.
# Get the imputed data and convert it to a tibble.
data_imp <- imp$ximp %>% as_tibble()
Now that we have cleaned up our data by removing outliers and imputing missing values, let’s calculate the BMI of each individual at three time points, birth and exam day 1 and 2.
# Compute BMI at birth and at exam days 1 and 2, and add columns to data.
data_imp = data_imp %>% mutate(bmi0=w0/h0^2,
bmi1=w1/h1^2,
bmi2=w2/h2^2)
Let’s make histograms, box plots and QQ-plots of BMI. Some of the potential outliers that we didn’t remove still muddy the picture somewhat unfortunately. The data looks reasonably normally distributed, except that BMI at exam day 2 seems to skew to the left.
p1 = ggplot(data_imp, aes(x='', bmi0)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='A', x='', y='BMI', title='Box plot of BMI at birth')
p2 = ggplot(data_imp, aes(bmi0)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='B', title='Histogram of BMI at birth', x='BMI', y='Count')
p3 = ggplot(data_imp, aes(sample=bmi0)) +
stat_qq(na.rm=TRUE) + stat_qq_line(na.rm=TRUE) +
labs(tag='C', title='QQ-plot of BMI at birth')
p4 = ggplot(data_imp, aes(x='', bmi1)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='D', x='', y='BMI', title='Box plot of BMI at exam day 1')
p5 = ggplot(data_imp, aes(bmi1)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='E', title='Histogram of BMI at exam day 1', x='BMI', y='Count')
p6 = ggplot(data_imp, aes(sample=bmi1)) +
stat_qq(na.rm=TRUE) + stat_qq_line(na.rm=TRUE) +
labs(tag='F', title='QQ-plot of BMI at exam day 1')
p7 = ggplot(data_imp, aes(x='', bmi2)) +
geom_boxplot(na.rm=TRUE) +
labs(tag='G', x='', y='BMI', title='Box plot of BMI at exam day 2')
p8 = ggplot(data_imp, aes(bmi2)) +
geom_histogram(na.rm=TRUE, bins=10) +
labs(tag='H', title='Histogram of BMI at exam day 2', x='BMI', y='Count')
p9 = ggplot(data_imp, aes(sample=bmi2)) +
stat_qq(na.rm=TRUE) + stat_qq_line(na.rm=TRUE) +
labs(tag='I', title='QQ-plot of BMI at exam day 2')
multiplot(p1, p4, p7, p2, p5, p8, p3, p6, p9, cols=3)
Say we want to share this data with colleagues, then we need to save it somehow. We can do this with the write_csv()
function, supplying the tibble and a path where want to write the data. This writes the data in a standard CSV file with comma delimitation. Storing the cleaned-up data we also avoid having to deal with the messy nature of the original data, so that we can import it directly with read_csv('weight_height_cleanup.csv')
and start analysing.
write_csv(data_imp, 'weight_height_cleanup.csv')
devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.4.4 (2018-03-15)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Atlantic/Faroe
date 2019-08-25
─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.4.4)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.4.4)
callr 3.3.1 2019-07-18 [1] CRAN (R 3.4.4)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.4.4)
codetools 0.2-15 2016-10-05 [4] CRAN (R 3.3.1)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.4.4)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.4.4)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.4.4)
devtools * 2.1.0 2019-07-06 [1] CRAN (R 3.4.4)
digest 0.6.20 2019-07-04 [1] CRAN (R 3.4.4)
dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.4.4)
foreach * 1.4.7 2019-07-27 [1] CRAN (R 3.4.4)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.4.4)
ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.4.4)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.4.4)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.4.4)
hms 0.5.1 2019-08-23 [1] CRAN (R 3.4.4)
iterators * 1.0.12 2019-07-26 [1] CRAN (R 3.4.4)
itertools * 0.1-3 2014-03-12 [1] CRAN (R 3.4.4)
knitr 1.24 2019-08-08 [1] CRAN (R 3.4.4)
labeling 0.3 2014-08-23 [1] CRAN (R 3.4.4)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.4.4)
lubridate * 1.7.4 2018-04-11 [1] CRAN (R 3.4.4)
magrittr * 1.5 2014-11-22 [1] CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.4.4)
missForest * 1.4 2013-12-31 [1] CRAN (R 3.4.4)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.4.4)
pillar 1.4.2 2019-06-29 [1] CRAN (R 3.4.4)
pkgbuild 1.0.4 2019-08-05 [1] CRAN (R 3.4.4)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.4.4)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.4.4)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.4.4)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.4.4)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.4.4)
purrr 0.3.2 2019-03-15 [1] CRAN (R 3.4.4)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.4.4)
randomForest * 4.6-14 2018-03-25 [1] CRAN (R 3.4.4)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.4.4)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.4.4)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.4.4)
rlang 0.4.0 2019-06-25 [1] CRAN (R 3.4.4)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.4.4)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.4.4)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.4.4)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.4.4)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.4.4)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.4.4)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.4.4)
tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.4.4)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.4.4)
usethis * 1.5.1 2019-07-04 [1] CRAN (R 3.4.4)
vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.4.4)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.4.4)
xfun 0.9 2019-08-21 [1] CRAN (R 3.4.4)
zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.4.4)
[1] /home/olavur/R/x86_64-pc-linux-gnu-library/3.4
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library