In this case study, we are going to analyse a dataset containing bodyweight measurements of mice receiving two different diets, the standard “chow” diet and a high-fat diet. We also have the gender of the mice, but we will save that variable for next case study. Our hypothesis is that the high-fat diet has an effect on the weight of the mice.
These case studies will all follow more or less the same structure:
tibble
and dplyr
ggplot2
Although we don’t need to tidy this dataset.
First, we install some packages we will need, if we don’t have them installed already. The “tidyverse” package installs several packages in the “tidyverse” ecosystem, including “tibble”, “readr” and “ggplot2”.
install.packages('tidyverse') # Installs several packages from the tidyverse ecosystem.
install.packages('magrittr')
install.packages('devtools')
Next, we load the packages we need.
library(readr) # For reading data.
library(tibble) # The "tibble" datastructure.
library(dplyr) # Working with the tibble datastructure.
library(ggplot2) # Plotting.
library(magrittr) # Need this for the %>% (pipe) operator.
library(devtools) # For printing the session info at the end of the notebook.
The data we need is available on the web, so we are just going to download it directly to our working directory with an R command. The download.file()
function takes an URL, downloads the file that URL points to, and saves it in the current directory with the file name we specify.
# Download CSV file into current directory.
url <- 'https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv'
download.file(url, 'mice_pheno.csv')
trying URL 'https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv'
Content type 'text/plain; charset=utf-8' length 10108 bytes
==================================================
downloaded 10108 bytes
We read the data from the CSV file we just downloaded. Below, we pass the relative path to the read_csv()
function; in this case, the file is in our current working directory.
# Read CSV file into a tibble.
mice <- read_csv('mice_pheno.csv')
The read_csv()
function reads the CSV file and stores the data in a tibble object which we name mice
. The data is stored in the memory of the computer, and any changes made to data
are only stored in memory. If you want to save the changes you make to the data, write the data to disk using, for example, the write_csv()
function.
Below, we use the head()
function to inspect the first few rows of the data. We also see the column names (such as “Sex” and “Diet”) and the data type (such as <chr>
and <dbl>
).
head(mice)
Because we are lazy, we are going to simplify the column names a bit, as we are going to be writing these a lot. We do this using the rename()
function. Note that we could use the syntax rename(mice, "new variable name" = "old variable name")
, but instead we are using the syntax mice %>% rename("new variable name" = "old variable name")
, where %>%
is the pipe operator from the magrittr
package.
# Change column names.
mice <- mice %>% rename('sex' = 'Sex', 'diet' = 'Diet', 'weight' = 'Bodyweight')
We are not going to use all of the columns in the data, so we drop all the unused columns. We do this using the select()
function, supplying the names of the columns we want to retain.
mice <- mice %>% select("diet", "weight")
Error: Unknown column `diet`
Call `rlang::last_error()` to see a backtrace
Next, we will use the mutate()
function to convert the diet
variable from characters to factors. Below, we use as.factor()
to convert the diet
variable, and use mutate()
to overwrite the existing diet
variable.
# Convert the diet variable from character to factor.
mice <- mice %>% mutate(diet=as.factor(diet))
We print the head of the dataset again, just to show that changes have been made.
head(mice)
In this section, we will visualize the data, in order to get an overview of the data, check assumptions, and possibly pre-process the data further if necessary. For this first case study, we will make a histogram, a box plot and a QQ-plot of each of our variables.
First, however, let’s quickly look at some summary stats for the data, using the summary()
function. Note that the weight
variable has five NA
values, meaning that there is missing data. Because the diet
variable is a factor, summary()
counts the occurrence of each level, which is quite useful.
summary(mice)
diet weight
chow:449 Min. :15.51
hf :397 1st Qu.:24.04
Median :28.19
Mean :28.85
3rd Qu.:32.95
Max. :54.08
NA's :5
A different way to count occurrences in any vector is using the table()
function, and below we first pull the diet
variable from the tibble, and then pass that to table()
.
mice %>% pull(diet) %>% table()
.
chow hf
449 397
We use ggplot2
to plot our data. The ggplot(mice, aes(weight))
statement below tells R that we want to plot the weight
variable in the mice
dataset. Then we further say that we want to make a histogram out of this data with 15 bins. We make a title for the plot as well. Note that we simply “add” more functionality on top of the plot with the +
operator. ggplot2
will not know how to deal with the NA values, and will give us a warning if we pass it NA values. To ignore this warning we use the na.rm=TRUE
argument.
ggplot(mice, aes(weight)) +
geom_histogram(na.rm=TRUE, bins=15) +
labs(title='Histogram of mice weights')
Actually, we aren’t interested in the distribution of mice weights in general. What we are interested in is the two populations, or groups, receiving the chow and high-fat diets. There are many ways we could split this into two plots. Below, we’ve decided to do it using the facet_grid()
function, receiving a variable name with which to split the data. We will discuss the ~
operator when we get to the modelling part, later in this case study.
ggplot(mice, aes(weight)) +
geom_histogram(na.rm=TRUE, bins=15) +
labs(title='Histogram of mice weights') +
facet_grid(~ diet)
geom_boxplot()
will split the data according to the x
variable, so there is no need to use facet_grid()
below.
ggplot(mice, aes(diet, weight)) +
geom_boxplot(na.rm=TRUE) +
labs(title='Box plot of mice weights')
Next, we are going to make QQ-plots of our variables. The QQ-plot is a good way to check whether data is normally distributed. It computes theoretical quantiles based on summary statistics of the data and assuming it is normally distributed, and then compares these with the actual quantiles. The closer the points match the QQ-line, the more accurately the normal distribution represents the data.
Below we use stat_qq()
to add the points, representing the theoretical and actual (“sample”) quantiles, and stat_qq_line()
to add the QQ-line, ignoring NA values with na.rm=TRUE
. Note that we pass the variable to ggplot()
as aes(sample=weight)
this time, as the stat_qq()
and stat_qq_line()
functions will look in this keyword for the data.
Note that in both groups, the lower end of the distributions veer upwards. This indicates that the lower end of the distribution is heavy tailed. Other than that, the data seems reasonably well approximated by a normal distribution.
ggplot(mice, aes(sample=weight)) +
stat_qq(na.rm=TRUE) + stat_qq_line(na.rm=TRUE) +
labs(title='QQ-plot of mice weights') +
facet_grid(~ diet)
Below, we use a t test to test the difference in the mean weight
in two groups, the chow
and the hf
groups, specifying the mice
dataset. The weight ~ diet
syntax, using the ~
operator, is common in modelling in R. This is a formula relating the variable of interest (the reponse) on the left-hand-side (in this case weight
) to the explanatory variables on the right-hand-side. This way of thinking about it will make more sense when we get to the linear models, in the next case study.
model <- t.test(weight ~ diet, data=mice)
model
Welch Two Sample t-test
data: weight by diet
t = -7.1932, df = 735.02, p-value = 1.563e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.906857 -2.231533
sample estimates:
mean in group chow mean in group hf
27.41281 30.48201
We see that the test is highly significant. The 95% confidence interval for the difference in mean is somewhat broad if what we wanted was a precise estimate of this difference. However, if we just want to know whether there is a difference, what we can note is that the confidence interval does not include zero, not by far. So we can be quite confident that this difference is real, provided that the original experiment sampled the populations correctly.
To conclude this R notebook, we use the devtools::sessions_info()
function to print information about this R session, including R version and packages loaded. This improves the reproducibility of this analysis, as without this information, package versions in particular, the code may not perform the way it is supposed to, and either produce incorrect results or not work at all. If you are asking for help when something isn’t working, whether it is from colleagues or online, this information is also very useful.
devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.4.4 (2018-03-15)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Atlantic/Faroe
date 2019-08-25
─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.4.4)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.4.4)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.4.4)
callr 3.3.1 2019-07-18 [1] CRAN (R 3.4.4)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.4.4)
codetools 0.2-15 2016-10-05 [4] CRAN (R 3.3.1)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.4.4)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.4.4)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.4.4)
devtools * 2.1.0 2019-07-06 [1] CRAN (R 3.4.4)
digest 0.6.20 2019-07-04 [1] CRAN (R 3.4.4)
dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.4.4)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.4.4)
foreach * 1.4.7 2019-07-27 [1] CRAN (R 3.4.4)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.4.4)
ggplot2 * 3.2.1 2019-08-10 [1] CRAN (R 3.4.4)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.4.4)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.4.4)
hms 0.5.1 2019-08-23 [1] CRAN (R 3.4.4)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.4.4)
iterators * 1.0.12 2019-07-26 [1] CRAN (R 3.4.4)
itertools * 0.1-3 2014-03-12 [1] CRAN (R 3.4.4)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.4.4)
knitr 1.24 2019-08-08 [1] CRAN (R 3.4.4)
labeling 0.3 2014-08-23 [1] CRAN (R 3.4.4)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.4.4)
lubridate * 1.7.4 2018-04-11 [1] CRAN (R 3.4.4)
magrittr * 1.5 2014-11-22 [1] CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.4.4)
missForest * 1.4 2013-12-31 [1] CRAN (R 3.4.4)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.4.4)
pillar 1.4.2 2019-06-29 [1] CRAN (R 3.4.4)
pkgbuild 1.0.4 2019-08-05 [1] CRAN (R 3.4.4)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.4.4)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.4.4)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.4.4)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.4.4)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.4.4)
ps 1.3.0 2018-12-21 [1] CRAN (R 3.4.4)
purrr 0.3.2 2019-03-15 [1] CRAN (R 3.4.4)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.4.4)
randomForest * 4.6-14 2018-03-25 [1] CRAN (R 3.4.4)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.4.4)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.4.4)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.4.4)
reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.4.4)
rlang 0.4.0 2019-06-25 [1] CRAN (R 3.4.4)
rmarkdown 1.15 2019-08-21 [1] CRAN (R 3.4.4)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.4.4)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.4.4)
scales 1.0.0 2018-08-09 [1] CRAN (R 3.4.4)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.4.4)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.4.4)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.4.4)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.4.4)
tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.4.4)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.4.4)
usethis * 1.5.1 2019-07-04 [1] CRAN (R 3.4.4)
vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.4.4)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.4.4)
xfun 0.9 2019-08-21 [1] CRAN (R 3.4.4)
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.4.4)
zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.4.4)
[1] /home/olavur/R/x86_64-pc-linux-gnu-library/3.4
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library