In this case study, we are going to analyse a dataset containing bodyweight measurements of mice receiving two different diets, the standard “chow” diet and a high-fat diet. We also have the gender of the mice, but we will save that variable for next case study. Our hypothesis is that the high-fat diet has an effect on the weight of the mice.

These case studies will all follow more or less the same structure:

  • Read the data into R
  • Tidy the data with tibble and dplyr
  • Plot the data with ggplot2
  • Fit models to the data

Although we don’t need to tidy this dataset.

First, we install some packages we will need, if we don’t have them installed already. The “tidyverse” package installs several packages in the “tidyverse” ecosystem, including “tibble”, “readr” and “ggplot2”.

install.packages('tidyverse')  # Installs several packages from the tidyverse ecosystem.
install.packages('magrittr')
install.packages('devtools')

Next, we load the packages we need.

library(readr)  # For reading data.
library(tibble)  # The "tibble" datastructure.
library(dplyr)  # Working with the tibble datastructure.
library(ggplot2)  # Plotting.
library(magrittr)  # Need this for the %>% (pipe) operator.
library(devtools)  # For printing the session info at the end of the notebook.

Read and pre-process data

The data we need is available on the web, so we are just going to download it directly to our working directory with an R command. The download.file() function takes an URL, downloads the file that URL points to, and saves it in the current directory with the file name we specify.

# Download CSV file into current directory.
url <- 'https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv'
download.file(url, 'mice_pheno.csv')
trying URL 'https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv'
Content type 'text/plain; charset=utf-8' length 10108 bytes
==================================================
downloaded 10108 bytes

We read the data from the CSV file we just downloaded. Below, we pass the relative path to the read_csv() function; in this case, the file is in our current working directory.

# Read CSV file into a tibble.
mice <- read_csv('mice_pheno.csv')

The read_csv() function reads the CSV file and stores the data in a tibble object which we name mice. The data is stored in the memory of the computer, and any changes made to data are only stored in memory. If you want to save the changes you make to the data, write the data to disk using, for example, the write_csv() function.

Below, we use the head() function to inspect the first few rows of the data. We also see the column names (such as “Sex” and “Diet”) and the data type (such as <chr> and <dbl>).

head(mice)

Because we are lazy, we are going to simplify the column names a bit, as we are going to be writing these a lot. We do this using the rename() function. Note that we could use the syntax rename(mice, "new variable name" = "old variable name"), but instead we are using the syntax mice %>% rename("new variable name" = "old variable name"), where %>% is the pipe operator from the magrittr package.

# Change column names.
mice <- mice %>% rename('sex' = 'Sex', 'diet' = 'Diet', 'weight' = 'Bodyweight')

We are not going to use all of the columns in the data, so we drop all the unused columns. We do this using the select() function, supplying the names of the columns we want to retain.

mice <- mice %>% select("diet", "weight")
Error: Unknown column `diet` 
Call `rlang::last_error()` to see a backtrace

Next, we will use the mutate() function to convert the diet variable from characters to factors. Below, we use as.factor() to convert the diet variable, and use mutate() to overwrite the existing diet variable.

# Convert the diet variable from character to factor.
mice <- mice %>% mutate(diet=as.factor(diet))

We print the head of the dataset again, just to show that changes have been made.

head(mice)

Exploratory data analysis

In this section, we will visualize the data, in order to get an overview of the data, check assumptions, and possibly pre-process the data further if necessary. For this first case study, we will make a histogram, a box plot and a QQ-plot of each of our variables.

First, however, let’s quickly look at some summary stats for the data, using the summary() function. Note that the weight variable has five NA values, meaning that there is missing data. Because the diet variable is a factor, summary() counts the occurrence of each level, which is quite useful.

summary(mice)
   diet         weight     
 chow:449   Min.   :15.51  
 hf  :397   1st Qu.:24.04  
            Median :28.19  
            Mean   :28.85  
            3rd Qu.:32.95  
            Max.   :54.08  
            NA's   :5      

A different way to count occurrences in any vector is using the table() function, and below we first pull the diet variable from the tibble, and then pass that to table().

mice %>% pull(diet) %>% table()
.
chow   hf 
 449  397 

We use ggplot2 to plot our data. The ggplot(mice, aes(weight)) statement below tells R that we want to plot the weight variable in the mice dataset. Then we further say that we want to make a histogram out of this data with 15 bins. We make a title for the plot as well. Note that we simply “add” more functionality on top of the plot with the + operator. ggplot2 will not know how to deal with the NA values, and will give us a warning if we pass it NA values. To ignore this warning we use the na.rm=TRUE argument.

ggplot(mice, aes(weight)) +
  geom_histogram(na.rm=TRUE, bins=15) +
  labs(title='Histogram of mice weights')

Actually, we aren’t interested in the distribution of mice weights in general. What we are interested in is the two populations, or groups, receiving the chow and high-fat diets. There are many ways we could split this into two plots. Below, we’ve decided to do it using the facet_grid() function, receiving a variable name with which to split the data. We will discuss the ~ operator when we get to the modelling part, later in this case study.

ggplot(mice, aes(weight)) +
  geom_histogram(na.rm=TRUE, bins=15) +
  labs(title='Histogram of mice weights') +
  facet_grid(~ diet)

geom_boxplot() will split the data according to the x variable, so there is no need to use facet_grid() below.

ggplot(mice, aes(diet, weight)) +
  geom_boxplot(na.rm=TRUE) +
  labs(title='Box plot of mice weights')

Next, we are going to make QQ-plots of our variables. The QQ-plot is a good way to check whether data is normally distributed. It computes theoretical quantiles based on summary statistics of the data and assuming it is normally distributed, and then compares these with the actual quantiles. The closer the points match the QQ-line, the more accurately the normal distribution represents the data.

Below we use stat_qq() to add the points, representing the theoretical and actual (“sample”) quantiles, and stat_qq_line() to add the QQ-line, ignoring NA values with na.rm=TRUE. Note that we pass the variable to ggplot() as aes(sample=weight) this time, as the stat_qq() and stat_qq_line() functions will look in this keyword for the data.

Note that in both groups, the lower end of the distributions veer upwards. This indicates that the lower end of the distribution is heavy tailed. Other than that, the data seems reasonably well approximated by a normal distribution.

ggplot(mice, aes(sample=weight)) +
  stat_qq(na.rm=TRUE) + stat_qq_line(na.rm=TRUE) +
  labs(title='QQ-plot of mice weights') +
  facet_grid(~ diet)

Modelling

Below, we use a t test to test the difference in the mean weight in two groups, the chow and the hf groups, specifying the mice dataset. The weight ~ diet syntax, using the ~ operator, is common in modelling in R. This is a formula relating the variable of interest (the reponse) on the left-hand-side (in this case weight) to the explanatory variables on the right-hand-side. This way of thinking about it will make more sense when we get to the linear models, in the next case study.

model <- t.test(weight ~ diet, data=mice)
model

    Welch Two Sample t-test

data:  weight by diet
t = -7.1932, df = 735.02, p-value = 1.563e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.906857 -2.231533
sample estimates:
mean in group chow   mean in group hf 
          27.41281           30.48201 

We see that the test is highly significant. The 95% confidence interval for the difference in mean is somewhat broad if what we wanted was a precise estimate of this difference. However, if we just want to know whether there is a difference, what we can note is that the confidence interval does not include zero, not by far. So we can be quite confident that this difference is real, provided that the original experiment sampled the populations correctly.

To conclude this R notebook, we use the devtools::sessions_info() function to print information about this R session, including R version and packages loaded. This improves the reproducibility of this analysis, as without this information, package versions in particular, the code may not perform the way it is supposed to, and either produce incorrect results or not work at all. If you are asking for help when something isn’t working, whether it is from colleagues or online, this information is also very useful.

devtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 os       Ubuntu 16.04.6 LTS          
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language en_US                       
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Atlantic/Faroe              
 date     2019-08-25                  

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package      * version date       lib source        
 assertthat     0.2.1   2019-03-21 [1] CRAN (R 3.4.4)
 backports      1.1.4   2019-04-10 [1] CRAN (R 3.4.4)
 base64enc      0.1-3   2015-07-28 [1] CRAN (R 3.4.4)
 callr          3.3.1   2019-07-18 [1] CRAN (R 3.4.4)
 cli            1.1.0   2019-03-19 [1] CRAN (R 3.4.4)
 codetools      0.2-15  2016-10-05 [4] CRAN (R 3.3.1)
 colorspace     1.4-1   2019-03-18 [1] CRAN (R 3.4.4)
 crayon         1.3.4   2017-09-16 [1] CRAN (R 3.4.4)
 desc           1.2.0   2018-05-01 [1] CRAN (R 3.4.4)
 devtools     * 2.1.0   2019-07-06 [1] CRAN (R 3.4.4)
 digest         0.6.20  2019-07-04 [1] CRAN (R 3.4.4)
 dplyr        * 0.8.3   2019-07-04 [1] CRAN (R 3.4.4)
 evaluate       0.14    2019-05-28 [1] CRAN (R 3.4.4)
 foreach      * 1.4.7   2019-07-27 [1] CRAN (R 3.4.4)
 fs             1.3.1   2019-05-06 [1] CRAN (R 3.4.4)
 ggplot2      * 3.2.1   2019-08-10 [1] CRAN (R 3.4.4)
 glue           1.3.1   2019-03-12 [1] CRAN (R 3.4.4)
 gtable         0.3.0   2019-03-25 [1] CRAN (R 3.4.4)
 hms            0.5.1   2019-08-23 [1] CRAN (R 3.4.4)
 htmltools      0.3.6   2017-04-28 [1] CRAN (R 3.4.4)
 iterators    * 1.0.12  2019-07-26 [1] CRAN (R 3.4.4)
 itertools    * 0.1-3   2014-03-12 [1] CRAN (R 3.4.4)
 jsonlite       1.6     2018-12-07 [1] CRAN (R 3.4.4)
 knitr          1.24    2019-08-08 [1] CRAN (R 3.4.4)
 labeling       0.3     2014-08-23 [1] CRAN (R 3.4.4)
 lazyeval       0.2.2   2019-03-15 [1] CRAN (R 3.4.4)
 lubridate    * 1.7.4   2018-04-11 [1] CRAN (R 3.4.4)
 magrittr     * 1.5     2014-11-22 [1] CRAN (R 3.4.4)
 memoise        1.1.0   2017-04-21 [1] CRAN (R 3.4.4)
 missForest   * 1.4     2013-12-31 [1] CRAN (R 3.4.4)
 munsell        0.5.0   2018-06-12 [1] CRAN (R 3.4.4)
 pillar         1.4.2   2019-06-29 [1] CRAN (R 3.4.4)
 pkgbuild       1.0.4   2019-08-05 [1] CRAN (R 3.4.4)
 pkgconfig      2.0.2   2018-08-16 [1] CRAN (R 3.4.4)
 pkgload        1.0.2   2018-10-29 [1] CRAN (R 3.4.4)
 plyr           1.8.4   2016-06-08 [1] CRAN (R 3.4.4)
 prettyunits    1.0.2   2015-07-13 [1] CRAN (R 3.4.4)
 processx       3.4.1   2019-07-18 [1] CRAN (R 3.4.4)
 ps             1.3.0   2018-12-21 [1] CRAN (R 3.4.4)
 purrr          0.3.2   2019-03-15 [1] CRAN (R 3.4.4)
 R6             2.4.0   2019-02-14 [1] CRAN (R 3.4.4)
 randomForest * 4.6-14  2018-03-25 [1] CRAN (R 3.4.4)
 Rcpp           1.0.2   2019-07-25 [1] CRAN (R 3.4.4)
 readr        * 1.3.1   2018-12-21 [1] CRAN (R 3.4.4)
 remotes        2.1.0   2019-06-24 [1] CRAN (R 3.4.4)
 reshape2       1.4.3   2017-12-11 [1] CRAN (R 3.4.4)
 rlang          0.4.0   2019-06-25 [1] CRAN (R 3.4.4)
 rmarkdown      1.15    2019-08-21 [1] CRAN (R 3.4.4)
 rprojroot      1.3-2   2018-01-03 [1] CRAN (R 3.4.4)
 rstudioapi     0.10    2019-03-19 [1] CRAN (R 3.4.4)
 scales         1.0.0   2018-08-09 [1] CRAN (R 3.4.4)
 sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 3.4.4)
 stringi        1.4.3   2019-03-12 [1] CRAN (R 3.4.4)
 stringr        1.4.0   2019-02-10 [1] CRAN (R 3.4.4)
 testthat       2.2.1   2019-07-25 [1] CRAN (R 3.4.4)
 tibble       * 2.1.3   2019-06-06 [1] CRAN (R 3.4.4)
 tidyselect     0.2.5   2018-10-11 [1] CRAN (R 3.4.4)
 usethis      * 1.5.1   2019-07-04 [1] CRAN (R 3.4.4)
 vctrs          0.2.0   2019-07-05 [1] CRAN (R 3.4.4)
 withr          2.1.2   2018-03-15 [1] CRAN (R 3.4.4)
 xfun           0.9     2019-08-21 [1] CRAN (R 3.4.4)
 yaml           2.2.0   2018-07-25 [1] CRAN (R 3.4.4)
 zeallot        0.1.0   2018-01-28 [1] CRAN (R 3.4.4)

[1] /home/olavur/R/x86_64-pc-linux-gnu-library/3.4
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
---
title: "Case 1.1 - Simple statistics workflow in R"
output:
  html_notebook:
    toc: true
    toc_float: true
---


In this case study, we are going to analyse a dataset containing bodyweight measurements of mice receiving two different diets, the standard "chow" diet and a high-fat diet. We also have the gender of the mice, but we will save that variable for next case study. Our hypothesis is that the high-fat diet has an effect on the weight of the mice.

These case studies will all follow more or less the same structure:

* Read the data into R
* Tidy the data with `tibble` and `dplyr`
* Plot the data with `ggplot2`
* Fit models to the data

Although we don't need to tidy this dataset.

First, we install some packages we will need, if we don't have them installed already. The "tidyverse" package installs several packages in the "tidyverse" ecosystem, including "tibble", "readr" and "ggplot2".

```{r eval=FALSE}
install.packages('tidyverse')  # Installs several packages from the tidyverse ecosystem.
install.packages('magrittr')
install.packages('devtools')
```

Next, we load the packages we need.

```{r message=FALSE}
library(readr)  # For reading data.
library(tibble)  # The "tibble" datastructure.
library(dplyr)  # Working with the tibble datastructure.
library(ggplot2)  # Plotting.
library(magrittr)  # Need this for the %>% (pipe) operator.
library(devtools)  # For printing the session info at the end of the notebook.
```

# Read and pre-process data

The data we need is available on the web, so we are just going to download it directly to our working directory with an R command. The `download.file()` function takes an URL, downloads the file that URL points to, and saves it in the current directory with the file name we specify.

```{r}
# Download CSV file into current directory.
url <- 'https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/mice_pheno.csv'
download.file(url, 'mice_pheno.csv')
```

We read the data from the CSV file we just downloaded. Below, we pass the relative path to the `read_csv()` function; in this case, the file is in our current working directory.

```{r message=FALSE}
# Read CSV file into a tibble.
mice <- read_csv('mice_pheno.csv')
```

The `read_csv()` function reads the CSV file and stores the data in a tibble object which we name `mice`. The data is stored in the memory of the computer, and any changes made to `data` are only stored in memory. If you want to save the changes you make to the data, write the data to disk using, for example, the `write_csv()` function.

Below, we use the `head()` function to inspect the first few rows of the data. We also see the column names (such as "Sex" and "Diet") and the data type (such as `<chr>` and `<dbl>`).

```{r}
head(mice)
```

Because we are lazy, we are going to simplify the column names a bit, as we are going to be writing these a lot. We do this using the `rename()` function. Note that we could use the syntax `rename(mice, "new variable name" = "old variable name")`, but instead we are using the syntax `mice %>% rename("new variable name" = "old variable name")`, where `%>%` is the *pipe* operator from the `magrittr` package.

```{r}
# Change column names.
mice <- mice %>% rename('sex' = 'Sex', 'diet' = 'Diet', 'weight' = 'Bodyweight')
```

We are not going to use all of the columns in the data, so we drop all the unused columns. We do this using the `select()` function, supplying the names of the columns we want to retain. 

```{r}
# Subset the columns in the data.
mice <- mice %>% select("diet", "weight")
```

Next, we will use the `mutate()` function to convert the `diet` variable from characters to factors. Below, we use `as.factor()` to convert the `diet` variable, and use `mutate()` to overwrite the existing `diet` variable.

```{r}
# Convert the diet variable from character to factor.
mice <- mice %>% mutate(diet=as.factor(diet))
```

We print the head of the dataset again, just to show that changes have been made.

```{r}
head(mice)
```


# Exploratory data analysis

In this section, we will visualize the data, in order to get an overview of the data, check assumptions, and possibly pre-process the data further if necessary. For this first case study, we will make a histogram, a box plot and a QQ-plot of each of our variables.

First, however, let's quickly look at some summary stats for the data, using the `summary()` function. Note that the `weight` variable has five `NA` values, meaning that there is missing data. Because the `diet` variable is a factor, `summary()` counts the occurrence of each level, which is quite useful.

```{r}
summary(mice)
```

A different way to count occurrences in any vector is using the `table()` function, and below we first *pull* the `diet` variable from the tibble, and then pass that to `table()`.

```{r}
mice %>% pull(diet) %>% table()
```


We use `ggplot2` to plot our data. The `ggplot(mice, aes(weight))` statement below tells R that we want to plot the `weight` variable in the `mice` dataset. Then we further say that we want to make a histogram out of this data with 15 bins. We make a title for the plot as well. Note that we simply "add" more functionality on top of the plot with the `+` operator. `ggplot2` will not know how to deal with the NA values, and will give us a warning if we pass it NA values. To ignore this warning we use the `na.rm=TRUE` argument.

```{r fig.width=10, fig.height=5}
ggplot(mice, aes(weight)) +
  geom_histogram(na.rm=TRUE, bins=15) +
  labs(title='Histogram of mice weights')
```

Actually, we aren't interested in the distribution of mice weights in general. What we are interested in is the two populations, or groups, receiving the chow and high-fat diets. There are many ways we could split this into two plots. Below, we've decided to do it using the `facet_grid()` function, receiving a variable name with which to split the data. We will discuss the `~` operator when we get to the modelling part, later in this case study.

```{r fig.width=10, fig.height=5}
ggplot(mice, aes(weight)) +
  geom_histogram(na.rm=TRUE, bins=15) +
  labs(title='Histogram of mice weights') +
  facet_grid(~ diet)
```

`geom_boxplot()` will split the data according to the `x` variable, so there is no need to use `facet_grid()` below.

```{r fig.width=5, fig.height=5}
ggplot(mice, aes(diet, weight)) +
  geom_boxplot(na.rm=TRUE) +
  labs(title='Box plot of mice weights')
```

Next, we are going to make QQ-plots of our variables. The QQ-plot is a good way to check whether data is normally distributed. It computes theoretical quantiles based on summary statistics of the data and assuming it is normally distributed, and then compares these with the actual quantiles. The closer the points match the QQ-line, the more accurately the normal distribution represents the data.

Below we use `stat_qq()` to add the points, representing the theoretical and actual ("sample") quantiles, and `stat_qq_line()` to add the QQ-line, ignoring NA values with `na.rm=TRUE`. Note that we pass the variable to `ggplot()` as `aes(sample=weight)` this time, as the `stat_qq()` and `stat_qq_line()` functions will look in this keyword for the data.

Note that in both groups, the lower end of the distributions veer upwards. This indicates that the lower end of the distribution is heavy tailed. Other than that, the data seems reasonably well approximated by a normal distribution.

```{r fig.width=10, fig.height=5}
ggplot(mice, aes(sample=weight)) +
  stat_qq(na.rm=TRUE) + stat_qq_line(na.rm=TRUE) +
  labs(title='QQ-plot of mice weights') +
  facet_grid(~ diet)
```


# Modelling

Below, we use a t test to test the difference in the mean `weight` in two groups, the `chow` and the `hf` groups, specifying the `mice` dataset. The `weight ~ diet` syntax, using the `~` operator, is common in modelling in R. This is a formula relating the variable of interest (the reponse) on the left-hand-side (in this case `weight`) to the explanatory variables on the right-hand-side. This way of thinking about it will make more sense when we get to the linear models, in the next case study.

```{r}
model <- t.test(weight ~ diet, data=mice)
model
```

We see that the test is highly significant. The 95% confidence interval for the difference in mean is somewhat broad if what we wanted was a precise estimate of this difference. However, if we just want to know whether there *is* a difference, what we can note is that the confidence interval does not include zero, not by far. So we can be quite confident that this difference is real, provided that the original experiment sampled the populations correctly.


To conclude this R notebook, we use the `devtools::sessions_info()` function to print information about this R session, including R version and packages loaded. This improves the reproducibility of this analysis, as without this information, package versions in particular, the code may not perform the way it is supposed to, and either produce incorrect results or not work at all. If you are asking for help when something isn't working, whether it is from colleagues or online, this information is also very useful.

```{r}
devtools::session_info()
```

