We are going to use some nice packages to query, manipulate, and plot genomic data. The point of this case is to showcase the versatility of R and the breadth of application specific packages available. It is not the point that you should fully understand how these packages work, nor should you understand the genetics aspect.
Below, we install some packages, and note that we do not use the normal install.packages()
command, but rather a command from BiocInstaller
, which is itself a package that needs to be installed. All these packages are from the Bioconductor (https://www.bioconductor.org/) ecosystem of packages, which is an organization and a community that curates high quality R packages for reproducible research in genetics.
We install GenomicRanges
, which is a data structure with powerful and convenient manipulation tools. All data in GenomicRanges
objects is based on intervals of the genome, and we can very easily query about a specific region or find overlap between objects and much more.
The TxDb.Hsapiens.UCSC.hg38.knownGene
package contains data, in the GenomicRanges
format, with information about genes in the human genome.
Gviz
gives us tools to easily plot genomic information.
BiocInstaller::biocLite('GenomicRanges')
BiocInstaller::biocLite('TxDb.Hsapiens.UCSC.hg38.knownGene')
BiocInstaller::biocLite('Gviz')
Load some libraries.
library(magrittr) # The pipe operator %>%.
library(readr) # Reading data.
library(dplyr) # Tibble manipulation.
library(TxDb.Hsapiens.UCSC.hg38.knownGene) # Info about genes in humans.
library(GenomicRanges) # Genomic data structure.
library(Gviz) # Vizualize genomic data.
First, we want to get some data relating to a specific gene, called SCN5A, which is implicated in irritable bowel disease. The database contains, among other things, information about transcripts and exons, and what we want is all the exons corresponding to a specific transcript of SCN5A.
So below we basically query the database for transcripts of gene SCN5A, then we pick a transcript, query the database for all exons, and extract the exons corresponding to the transcript of interest.
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
# UCSC/Entrez ID of SCN5A.
gene_id <- '6331'
# Get transcripts corresponding to the gene.
tx <- transcriptsBy(txdb, by='gene') # All transcripts by gene.
tx <- tx[[gene_id]]
# Get the exons corresponding to one of the transcripts in the gene.
exons <- exonsBy(txdb, by='tx') # All exons by transcript.
tx_name <- tx$tx_name[1] # Pick the first transcript.
tx_id <- tx$tx_id[1]
exons <- exons[[tx_id]]
Let’s see what this GenomicRanges
data object looks like. Note that it looks a lot like a dataframe or a tibble, with columns, rows and data classes. Note also that is has some extra information about the data in the table itself.
head(exons)
GRanges object with 6 ranges and 3 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] chr3 [38633035, 38633349] - | 109088 <NA> 1
[2] chr3 [38630311, 38630429] - | 109084 <NA> 2
[3] chr3 [38622400, 38622489] - | 109081 <NA> 3
[4] chr3 [38620843, 38620971] - | 109079 <NA> 4
[5] chr3 [38613975, 38614066] - | 109076 <NA> 5
[6] chr3 [38609734, 38609964] - | 109074 <NA> 6
-------
seqinfo: 455 sequences (1 circular) from hg38 genome
Let’s make a simple plot with Gviz
. Below, we make an “annotation track”, supplying the exons and a track name, and tell Gviz
to plot this track. Quite boring plot, let’s make it a little bit more interesting.
# Make a genomic track.
atrack <- AnnotationTrack(exons, name="Exons of SCN5A")
# Plot track.
plotTracks(atrack)
There are many different types of tracks in Gviz
, and below we examples of an ideogram, which shows our location in the genome, a genome axis track, showing more detailed positions, and a gene region track, again showing the exons but a little bit differently than before. We can pass a lot of different arguments to these tracks and the plotTracks()
function to modify our plots.
# Get chromosome and genome (species and version) from exons object.
chr <- as.character(unique(seqnames(exons)))
gen <- as.character(unique(genome(exons)))
# Ideogram track shows where in the chromosome we are located.
itrack <- IdeogramTrack(genome=gen, chromosome=chr)
# Axis track shows position on chromosome.
gtrack <- GenomeAxisTrack()
# Gene region track shows general inforation about intervals on the genome. Use it to plot exons.
grtrack <- GeneRegionTrack(exons, genome=gen, chromosome=chr, name='Exons of SCN5A')
# Plot all tracks, and specify relative sizes (proportions) of tracks.
plotTracks(list(itrack, gtrack, grtrack), sizes=c(0.2,0.2,0.6))
We are going to download some data to plot in this gene. The GWAS Catalog (https://www.ebi.ac.uk/gwas/) contains summary results from research in the genetic causes of disease, aggregated from many independent experiments. This data contains the location of mutations in the genome and their reported effect size from regression analysis, indicating increase in disease risk.
Below we read in a table with these data, using read_delim()
which will store the data in a tibble, as we have seen in the previous cases. This dataset is quite large, so it will take a while to read and use a lot of memory.
# Read experimental data from ulcerative colitis.
uc = read_delim('~/Documents/gwas_sumstats/uc/28067908-GCST004133-EFO_0000729.h.tsv.gz', delim='\t')
We happen to know that the gene we are interested in lies on chromosome 3, so we can simply discard everything else, so we can free up a lot of memory on the computer. We do this using the filter()
function, which chooses rows in the tibble that match the expression we pass to it. In this expression, we use the hm_chrom
variable, which tells us which chromosome number the mutations is located on.
# Keep records that match expression. Choose records in chromosome 3.
uc <- uc %>% filter(hm_chrom == 3)
When working with large datasets, rm()
and gc()
are your friend. rm()
removes the variable you pass to it, and gc()
is “garbage collection” and frees up unused memory. Since we just removed a bunch of rows from uc
, we are going to use gc()
to clear up that memory.
# Free up unused memory.
gc()
We do the same for the Crohn’s disease data.
# Read experimental data from Crohn's disease.
cd = read_delim('~/Documents/gwas_sumstats/cd/28067908-GCST004132-EFO_0000384.h.tsv.gz', delim='\t')
# Choose records on chromosome 3.
cd <- cd %>% filter(hm_chrom == 3)
# Free up unused memory.
gc()
These datasets contain many columns that we are not interested in. Let’s delete all these, and give the variables more convenient names.
# Remove irrelevant columns.
# Give columns more convenient names.
uc = uc %>% dplyr::select(c('hm_rsid', 'hm_chrom', 'hm_pos', 'hm_beta')) %>% rename('hm_rsid'='rsid', 'hm_chrom'='chrom', 'hm_pos'='pos', 'hm_beta'='beta')
cd = cd %>% dplyr::select(c('hm_rsid', 'hm_chrom', 'hm_pos', 'hm_beta')) %>% rename('hm_rsid'='rsid', 'hm_chrom'='chrom', 'hm_pos'='pos', 'hm_beta'='beta')
The remaining variables are the IDs of the mutations, the “rsid”, the chromosome number where the mutation is located, the position along this chromosome, and the effect size beta.
We want to have all this data in the same tibble, and we do this using a join operation. full_join()
takes two datasets and matches by the variables you supply; the rest of the variables that weren’t mentioned in the by
argument are added to the resulting dataset with unique names. Anything that doesn’t match anything else will make NA values in your data. We also give the betas intuitive names below.
# Joint two tibbles by: rsid, chrom and pos.
# full_join: "by" variables with no matches get NA values in unmatched dataset.
betas = full_join(cd, uc, by=c('rsid'='rsid', 'chrom'='chrom', 'pos'='pos'))
# Give more convenient names to betas.
betas <- betas %>% rename('beta.x'='beta_cd', 'beta.y'='beta_uc')
Now we’re done with uc
and cd
, since we have all the information in betas
, we delete the variables and free up the memory.
# Delete variables and free up memory.
rm(uc, cd)
gc()
The chromosomes in the data are just integers, but want them to be in the format “chr3”, so we use the paste0()
function to do this job for us, and mutate the variable in the dataset. paste0()
takes basically any values or vectors and concatenates them into a character string, this function is very often useful.
# Use "chr3" naming convention, same as used in the exons object.
betas <- betas %>% mutate(chrom=paste0('chr', chrom))
We want to turn this tibble into the GenomicRanges
data structure, so that we can work with it together with the exon data. We construct this using the GRanges()
function, supplying data from the tibble about the locations of the mutations. Then we add the extra information about the mutations, the metadata, namely the effect sizes in Crohn’s disease and ulcerative colitis.
# Make a GRanges object out of the betas.
betas_gr = GRanges(seqnames=betas$chrom, IRanges(start=betas$pos, end=betas$pos))
# Add the metadata.
mcols(betas_gr, use.names=TRUE) = dplyr::select(betas, 'beta_cd', 'beta_uc')
Next we want to remove all betas that are not inside the gene region we are interested in. This means we have to construct a new GenomicRanges
object, that spans the entire gene region. When we’ve done that we use subsetByOverlaps()
to retrieve the betas that are in the gene region.
# Extract betas in the SCN5A gene.
# Get the first and last positions in the exons.
start <- min(start(exons))
end <- max(end(exons))
# Get the chromosome.
chr <- as.character(unique(seqnames(exons)))
gene_region <- GRanges(seqnames=chr, ranges=IRanges(start=start, end=end))
# Get the subset of the betas that are in the gene region.
betas_gr <- subsetByOverlaps(betas_gr, gene_region)
Now we again delete some data.
# Delete variables and free up memory.
rm(betas)
gc()
All the code dealing with the CD and UC data takes a long time to run and uses a lot of memory, so we want to save our results so far. We convert the GenomicRanges
object to a tibble and write a CSV with the data.
# Convert GRanges object to tibble and write to CSV.
write_csv(as_tibble(betas_gr), 'betas.csv')
To load the data, we need to read it into a tibble, and construct the GenomicRanges
object in a similar way as we did before.
# Read CSV into tibble.
betas_temp <- read_csv('betas.csv')
Parsed with column specification:
cols(
seqnames = col_character(),
start = col_double(),
end = col_double(),
width = col_double(),
strand = col_character(),
beta_cd = col_double(),
beta_uc = col_double()
)
# Construct GRanges object and add betas.
betas_gr = GRanges(seqnames=betas_temp$seqnames, IRanges(start=betas_temp$start, end=betas_temp$end))
mcols(betas_gr, use.names=TRUE) = dplyr::select(betas_temp, 'beta_cd', 'beta_uc')
Finally, we plot everything. As before, we plot the ideogram, the axis, and the exons. Now we add a data track, pass it the GenomicRanges
with the betas, and it will plot the beta values. Using the groups=c('cd', 'uc')
argument, Gviz
will match the columns with these strings and color the data by group.
# Get chromosome and genome build from exons object.
chr <- as.character(unique(seqnames(exons)))
gen <- genome(exons)
# Ideogram track shows where in the chromosome we are located.
itrack <- IdeogramTrack(genome=gen[1], chromosome=chr)
# Axis track shows position on chromosome.
gtrack <- GenomeAxisTrack()
# Plot exons as gene regions.
grtrack <- GeneRegionTrack(exons, genome=gen, chromosome=chr, name='Exons of SCN5A')
dtrack <- DataTrack(betas_gr, groups=c('cd', 'uc'))
plotTracks(list(itrack, gtrack, grtrack, dtrack))
We did it! We plotted some experimental data along with genomic annotations! So what did we learn in this case?
rm()
and gc()
as much as you canfilter()
full_join()
paste0()
functiondevtools::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.4.4 (2018-03-15)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language en_US
collate en_US.UTF-8
ctype en_US.UTF-8
tz Atlantic/Faroe
date 2019-08-25
─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
acepack 1.4.1 2016-10-29 [1] CRAN (R 3.4.4)
AnnotationDbi * 1.40.0 2018-08-03 [1] Bioconductor
AnnotationFilter 1.2.0 2019-08-24 [1] Bioconductor
AnnotationHub 2.10.1 2018-08-03 [1] Bioconductor
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.4.4)
backports 1.1.4 2019-04-10 [1] CRAN (R 3.4.4)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.4.4)
Biobase * 2.38.0 2018-08-02 [1] Bioconductor
BiocGenerics * 0.24.0 2018-08-02 [1] Bioconductor
BiocInstaller 1.28.0 2018-08-02 [1] Bioconductor
BiocParallel 1.12.0 2018-08-02 [1] Bioconductor
biomaRt 2.34.2 2018-08-03 [1] Bioconductor
Biostrings 2.46.0 2018-08-02 [1] Bioconductor
biovizBase 1.26.0 2019-08-24 [1] Bioconductor
bit 1.1-14 2018-05-29 [1] CRAN (R 3.4.4)
bit64 0.9-7 2017-05-08 [1] CRAN (R 3.4.4)
bitops 1.0-6 2013-08-17 [1] CRAN (R 3.4.4)
blob 1.2.0 2019-07-09 [1] CRAN (R 3.4.4)
BSgenome 1.46.0 2018-08-03 [1] Bioconductor
callr 3.3.1 2019-07-18 [1] CRAN (R 3.4.4)
checkmate 1.9.4 2019-07-04 [1] CRAN (R 3.4.4)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.4.4)
cluster 2.0.7-1 2018-04-09 [4] CRAN (R 3.4.4)
colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.4.4)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.4.4)
curl 4.0 2019-07-22 [1] CRAN (R 3.4.4)
data.table 1.12.2 2019-04-07 [1] CRAN (R 3.4.4)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.4.4)
DelayedArray 0.4.1 2018-08-02 [1] Bioconductor
desc 1.2.0 2018-05-01 [1] CRAN (R 3.4.4)
devtools 2.1.0 2019-07-06 [1] CRAN (R 3.4.4)
dichromat 2.0-0 2013-01-24 [1] CRAN (R 3.4.4)
digest 0.6.20 2019-07-04 [1] CRAN (R 3.4.4)
dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.4.4)
ensembldb 2.2.2 2019-08-24 [1] Bioconductor
evaluate 0.14 2019-05-28 [1] CRAN (R 3.4.4)
fansi 0.4.0 2018-10-05 [1] CRAN (R 3.4.4)
foreign 0.8-70 2018-04-23 [4] CRAN (R 3.4.4)
Formula 1.2-3 2018-05-03 [1] CRAN (R 3.4.4)
fs 1.3.1 2019-05-06 [1] CRAN (R 3.4.4)
GenomeInfoDb * 1.14.0 2018-08-02 [1] Bioconductor
GenomeInfoDbData 1.0.0 2018-08-02 [1] Bioconductor
GenomicAlignments 1.14.2 2018-08-02 [1] Bioconductor
GenomicFeatures * 1.30.3 2018-08-03 [1] Bioconductor
GenomicRanges * 1.30.3 2019-06-05 [1] Bioconductor
ggplot2 3.2.1 2019-08-10 [1] CRAN (R 3.4.4)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.4.4)
gridExtra 2.3 2017-09-09 [1] CRAN (R 3.4.4)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.4.4)
Gviz * 1.22.3 2019-08-24 [1] Bioconductor
Hmisc 4.2-0 2019-01-26 [1] CRAN (R 3.4.4)
hms 0.5.1 2019-08-23 [1] CRAN (R 3.4.4)
htmlTable 1.13.1 2019-01-07 [1] CRAN (R 3.4.4)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.4.4)
htmlwidgets 1.3 2018-09-30 [1] CRAN (R 3.4.4)
httpuv 1.5.1 2019-04-05 [1] CRAN (R 3.4.4)
httr 1.4.1 2019-08-05 [1] CRAN (R 3.4.4)
interactiveDisplayBase 1.16.0 2018-08-03 [1] Bioconductor
IRanges * 2.12.0 2018-08-02 [1] Bioconductor
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.4.4)
knitr 1.24 2019-08-08 [1] CRAN (R 3.4.4)
later 0.8.0 2019-02-11 [1] CRAN (R 3.4.4)
lattice 0.20-38 2018-11-04 [4] CRAN (R 3.4.4)
latticeExtra 0.6-28 2016-02-09 [1] CRAN (R 3.4.4)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 3.4.4)
magrittr * 1.5 2014-11-22 [1] CRAN (R 3.4.4)
Matrix 1.2-14 2018-04-09 [4] CRAN (R 3.4.4)
matrixStats 0.54.0 2018-07-23 [1] CRAN (R 3.4.4)
memoise 1.1.0 2017-04-21 [1] CRAN (R 3.4.4)
mime 0.7 2019-06-11 [1] CRAN (R 3.4.4)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.4.4)
nnet 7.3-12 2016-02-02 [4] CRAN (R 3.4.0)
pillar 1.4.2 2019-06-29 [1] CRAN (R 3.4.4)
pkgbuild 1.0.4 2019-08-05 [1] CRAN (R 3.4.4)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.4.4)
pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.4.4)
prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.4.4)
processx 3.4.1 2019-07-18 [1] CRAN (R 3.4.4)
progress 1.2.2 2019-05-16 [1] CRAN (R 3.4.4)
promises 1.0.1 2018-04-13 [1] CRAN (R 3.4.4)
ProtGenerics 1.10.0 2019-08-24 [1] Bioconductor
ps 1.3.0 2018-12-21 [1] CRAN (R 3.4.4)
purrr 0.3.2 2019-03-15 [1] CRAN (R 3.4.4)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.4.4)
RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 3.4.4)
Rcpp 1.0.2 2019-07-25 [1] CRAN (R 3.4.4)
RCurl 1.95-4.12 2019-03-04 [1] CRAN (R 3.4.4)
readr * 1.3.1 2018-12-21 [1] CRAN (R 3.4.4)
remotes 2.1.0 2019-06-24 [1] CRAN (R 3.4.4)
rlang 0.4.0 2019-06-25 [1] CRAN (R 3.4.4)
rmarkdown 1.15 2019-08-21 [1] CRAN (R 3.4.4)
RMySQL 0.10.17 2019-03-04 [1] CRAN (R 3.4.4)
rpart 4.1-15 2019-04-12 [4] CRAN (R 3.4.4)
rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.4.4)
Rsamtools 1.30.0 2018-08-08 [1] Bioconductor
RSQLite 2.1.2 2019-07-24 [1] CRAN (R 3.4.4)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.4.4)
rtracklayer 1.38.3 2018-08-03 [1] Bioconductor
S4Vectors * 0.16.0 2018-08-02 [1] Bioconductor
scales 1.0.0 2018-08-09 [1] CRAN (R 3.4.4)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.4.4)
shiny 1.3.2 2019-04-22 [1] CRAN (R 3.4.4)
stringi 1.4.3 2019-03-12 [1] CRAN (R 3.4.4)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.4.4)
SummarizedExperiment 1.8.1 2018-08-02 [1] Bioconductor
survival 2.44-1.1 2019-04-01 [1] CRAN (R 3.4.4)
testthat 2.2.1 2019-07-25 [1] CRAN (R 3.4.4)
tibble 2.1.3 2019-06-06 [1] CRAN (R 3.4.4)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.4.4)
TxDb.Hsapiens.UCSC.hg38.knownGene * 3.4.0 2019-08-23 [1] Bioconductor
usethis 1.5.1 2019-07-04 [1] CRAN (R 3.4.4)
utf8 1.1.4 2018-05-24 [1] CRAN (R 3.4.4)
VariantAnnotation 1.24.5 2019-06-05 [1] Bioconductor
vctrs 0.2.0 2019-07-05 [1] CRAN (R 3.4.4)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.4.4)
xfun 0.9 2019-08-21 [1] CRAN (R 3.4.4)
XML 3.98-1.20 2019-06-06 [1] CRAN (R 3.4.4)
xtable 1.8-4 2019-04-21 [1] CRAN (R 3.4.4)
XVector 0.18.0 2018-08-02 [1] Bioconductor
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.4.4)
zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.4.4)
zlibbioc 1.24.0 2018-08-02 [1] Bioconductor
[1] /home/olavur/R/x86_64-pc-linux-gnu-library/3.4
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library