King's College London - MRC SGDP Centre Summer School - Introduction to R

Bioconductor - a use case - Tools to handle genomic data

Dr Matthew Davis, Enright laboratory, EBI


In this practical we will be using packages from Bioconductor. Bioconductor provides a repository for packages designed to handle large quantities of genomic data. The packages are used within R. The Bioconductor website can be found here: Bioconductor currently holds a huge number of packages along with documentation and reference manuals and is definitely worth exploring.

Accessing genomic annotation data using biomaRt

When performing genomewide analysis there are many ways to access annotation data. The biomaRt package strikes a balance between ease of use and flexibility. It also allows you to access databases and gather genomic information even if R is your only programming language.

First load the biomaRt library.


Next display all the web services accessible via biomaRt.


We will use the Ensembl database today.

database <- useMart("ensembl")

Make a list of the datasets available, we will use the mouse and append this information to our 'Mart' object.

database <- useDataset("mmusculus_gene_ensembl",mart=database)

The getBM function allows the collection of data from the database. We supply this function with the Mart object we are interested in and 3 other options.

Attributes are the pieces of information we are interested in collecting from the database.


Filters specify any criteria by which we wish to restrict our results.


Values are the specific values applied to the filters, eg. gene names, specific chromosomes etc. The examples below will make this clearer...

First we will retrieve the genomic coordinates of a set of miRNA hairpins that we will use later.

miRsequences <- c("mmu-mir-291b","mmu-mir-292","mmu-mir-293","mmu-mir-294","mmu-mir-295")
miRNAs <- getBM(
      attributes = c("chromosome_name", "start_position", "end_position", "strand","mirbase_id","ensembl_gene_id"), 
      filters = "mirbase_id", values = miRsequences, mart = database)
  chromosome_name start_position end_position strand   mirbase_id    ensembl_gene_id
1               7        3220642      3220725      1  mmu-mir-294 ENSMUSG00000077903
2               7        3219190      3219271      1  mmu-mir-292 ENSMUSG00000078041
3               7        3220774      3220842      1  mmu-mir-295 ENSMUSG00000077886
4               7        3220344      3220423      1  mmu-mir-293 ENSMUSG00000078035
5               7        3219483      3219561      1 mmu-mir-291b ENSMUSG00000078032

We'll save this data for use in the second session.

save(miRNAs, file="miRNA_cluster.RData")

It is also possible to find features associated with a specified genomic position. For example, if you have a SNP of interest and know its genomic coordintes you can look for overlapping features.

thefeatures <- getBM(
      attributes = c("chromosome_name", "start_position", "end_position", "strand","external_gene_id","ensembl_gene_id"), 
      filters = c("chromosomal_region"), values = c("16:18255882:18255882"), mart = database)
  chromosome_name start_position end_position strand external_gene_id    ensembl_gene_id
1              16       18254041     18289342     -1            Dgcr8 ENSMUSG00000022718

Finally, it is also possible to derive sequences from the database.

                                                                             gene_exon   mirbase_id

These are just a few examples of how you can use biomaRt to retrieve data for use in your work. However, biomaRt is a very versitile tool and can be applied to many situations. The manual/documentation for this package is very descriptive and contains loads of examples both to access Ensembl and other databases.


If, for some reason, you are unable to access these vignettes on your system, do not dispair! All of the documentation is also maintained online:


1. You are interested in the effect of various conditions on the G2/M mitotic cell cycle transition in the mouse (mmusculus). A colleague returns from a conference and has notes from a lecture they feel will interest you. The notes contain the MGI symbols of 5 genes (Tpd52l1, Birc5, Rab6, Tmem204, Wnt10b). Identify which of these genes is directly associated with the cell cycle. (Hint: The GO ID for the G2/M cell cycle transition is GO:0000086). Check the vignette/manual to find out how to use multiple filters. There are loads of examples here - take a look at 4.3 Task 3)

2. Your boss has published hundreds of papers on Birc5 and is intrigued. After chatting to a collaborator suggests searching the 3utr of this gene for protein binding motifs. Retrieve the sequence for all of th 3' UTR sequences of Birc5 so that you can use it in the searches. Hint: Check the getSequence manual for all of the wonderful things it can do.

3. The search for motifs was not fruitful but you are not disheartened. A friend has recently generated a high quality genomic map of epigenetic markers from across the genome. She offers to see if any reqions of interest are overlapping the G2/M associated genes. Find the exonic coordinates for these genes. Hint: If you try to do this as demonstrated in the work session it may well break. Try googling the error message and see if you can fix the problem.

4. Your friend has also produced this map for a series of G2/M phase stalled, mutagenised rat cell lines (rnorvegicus_gene_ensembl). She identifies a region that appears to be consistently differentially effected on chromosome 5 (300000 to 600000) for which the epigenetic markers appear to be significantly altered. Find any genes that overlap this region.

Manipulating genomic coordinates with GenomicRanges

As we saw above it is possible to use biomaRt to identify genomic features that are associated with a particular set of genomic coordinates. However, when handling data derived by next-generation technologies these genomewide comparisons can involve millions of coordinate pairs. The GenomicRanges package helps to simplify what may at first seem like a daunting task.

The data sets

For this section of the course we require a large set of genomic coordinates. To this end, we will use some small RNA sequencing data downloaded from ArrayExpress (ID: E-GEOD-11724, ENA Accession: SRR023847). This data is derived from a small RNA sequencing experiment performed on mouse embryonic stem cells ("Connecting miRNA Genes to the Core Transcriptional Regulatory Circuitry of Embryonic Stem Cells" Marson et al. Cell 134, 521-533). In principle the sequences within this file will correspond to small RNAs. In order to prepare the data for the course I have trimmed and cleaned the sequences to remove adapter sequences added during sample preparation and from those trimmed sequences I have selected those that are expected to correspond to mature miRNAs, based on length. I have then mapped the reads to mouse genome sequence and selected those reads that map to a small region on chromosome 7. This provides a large set of genomic coordinates for us to use but the principles below can be applied to any other set of genomic coordinate data.

We will be comparing these genomically mapped sequences to the miRNA coordinates we generated earlier in the biomaRt section above. These are miRNAs that are known to be highly expressed in mouse ES cells.

The format of the aligned reads

The read alignments are represented in SAM and BAM formats. SAM format is a tab delineated, human readable format containing the coodinates of the alignments etc. BAM is the equivalent file but optimised for computers. I have converted the read alignments to BAM format for reading into our R session.

Extract of SAM format:

Please be aware that it is beyond the scope of this session to provide a comprehensive guide to interpreting small RNA sequencing data and so a few considerations will be skipped (eg. adjusting counts for reads that map to multiple genomic loci). However, if this interests you, hopefully by the end you will feel more happy working with these Bioconductor packages and be comfortable exploring further.

Loading the data into R

First load the GenomicRanges library


Next load the miRNA coordintaes we generated earlier (If you have the same R session open still there is no need to do this). Note that the genomic strand information provided by Ensembl has to be converted to match our alignment file for this analysis.


miRNAs[miRNAs$strand == 1,"strand"] <- "+"
miRNAs[miRNAs$strand == -1,"strand"] <- "-"
miRNAs[miRNAs$strand == 0,"strand"] <- "*"

Convert the miRNA coordinates into a GRanges object.

miRNAsGR <- GRanges(
   seqnames   = Rle(factor(miRNAs$chromosome_name)),
   ranges     = IRanges(start= miRNAs$start_position, end= miRNAs$end_position),
   strand     = Rle(factor(miRNAs$strand)),
   seqlengths = c("7"=152524553),
   name       = miRNAs$mirbase_id

GRanges with 5 ranges and 1 elementMetadata value
    seqnames             ranges strand |         name
[1]        7 [3220642, 3220725]      + |  mmu-mir-294
[2]        7 [3219190, 3219271]      + |  mmu-mir-292
[3]        7 [3220774, 3220842]      + |  mmu-mir-295
[4]        7 [3220344, 3220423]      + |  mmu-mir-293
[5]        7 [3219483, 3219561]      + | mmu-mir-291b


To load the alignment data we will use a further package; Rsamtools. Functions from this package allow us to interpret and load BAM files, translating them into a format compatible with R.


my_interesting_variables <- c("qname","rname","strand","pos","qwidth")
param <- ScanBamParam(what=my_interesting_variables)
aligned_data <- scanBam("chromosome_7_aligned.bam",param=param)[[1]]

alignments  <- data.frame(aligned_data,stringsAsFactors=FALSE)
colnames(alignments) <- c("Read","Seq","Strand","Start","Width")

               Read Seq Strand   Start Width
1   SRR023847.43288   7      + 3218991    21
2  SRR023847.123657   7      + 3218991    20
3  SRR023847.976020   7      + 3218991    21
4 SRR023847.1172564   7      + 3218991    19
5 SRR023847.2152401   7      + 3218991    19
6 SRR023847.2236490   7      + 3218991    20

Again, the data needs to be converted into a GRanges object.

chr7GR  <-   GRanges(
   seqnames   = Rle(factor(alignments$Seq)),
   ranges     = IRanges(start=alignments$Start, width=alignments$Width),
   strand     = Rle(factor(alignments$Strand)),
   seqlengths = c("7"=152524553),
   name       = alignments$Read

GRanges with 234054 ranges and 1 elementMetadata value
         seqnames             ranges strand   |              name
     [1]        7 [3218991, 3219011]      +   |   SRR023847.43288
     [2]        7 [3218991, 3219010]      +   |  SRR023847.123657
     [3]        7 [3218991, 3219011]      +   |  SRR023847.976020
     [4]        7 [3218991, 3219009]      +   | SRR023847.1172564
     [5]        7 [3218991, 3219009]      +   | SRR023847.2152401
     [6]        7 [3218991, 3219010]      +   | SRR023847.2236490
     [7]        7 [3218991, 3219009]      +   | SRR023847.2839325
     [8]        7 [3218991, 3219009]      +   | SRR023847.2889185
     [9]        7 [3218991, 3219011]      +   | SRR023847.3369098
     ...      ...                ...    ... ...               ...

Finding overlaps between sequence reads and known genomic features

Once we have our data represented as GRanges objects they become easy to manipulate using the specialised functions. First we will find which of our sequence reads overlap our miRNA coordinates.

overlaps <- findOverlaps(query= miRNAsGR,subject=chr7GR, minoverlap = 20)
     query subject
[1,]     1   99021
[2,]     1   99022
[3,]     1   99023
[4,]     1   99024
[5,]     1   99025
[6,]     1   99026

We can split the RangesMatching object to reveal which reads are associated with each miRNA

miRNAreads <- split(subjectHits(overlaps),queryHits(overlaps))
names(miRNAreads) <- values(miRNAsGR)[as.numeric(names(miRNAreads)),"name"]

[1] "list"
[1] 99021 99022 99023 99024 99025 99026

[1] 22 23 25 26 27 28

[1] 152354 152355 152356 152357 152358 152359

[1] 75318 75319 75320 75322 75325 75327

[1] 67748 67749 67750 67751 67752 67753

Finally, we can replace the read indexes with the read names. Although during small RNA-seqeunceing analysis it is rarely necessary to associate specific reads with a feature, in other circumstances, such as searching for features associated with specific SNPs etc. displaying the results in this form may be useful.

associatedReads <- lapply(miRNAreads, function(x){values(chr7GR)[as.numeric(x),"name"]})

[1] "SRR023847.479662"  "SRR023847.846685"  "SRR023847.867067"  "SRR023847.1087858" "SRR023847.1190599"
[6] "SRR023847.1340549"

[1] "SRR023847.3434161" "SRR023847.52340"   "SRR023847.331166"  "SRR023847.617030"  "SRR023847.646330" 
[6] "SRR023847.984093" 

[1] "SRR023847.1766634" "SRR023847.2070305" "SRR023847.16483"   "SRR023847.48427"   "SRR023847.65172"  
[6] "SRR023847.80442"  

[1] "SRR023847.6592"  "SRR023847.8644"  "SRR023847.9256"  "SRR023847.10743" "SRR023847.20458"
[6] "SRR023847.25731"

[1] "SRR023847.33928"   "SRR023847.69231"   "SRR023847.496089"  "SRR023847.1073815" "SRR023847.1339362"
[6] "SRR023847.1445575"

Measuring expression

When analysing small RNA sequencing data often the goal is to produce an expression matrix for differential expression analysis. In these situations the number of reads mapping to a miRNA locus is often taken as a surrogate for miRNA expression. To this end the countOverlaps function can be used in a fashion similar to findOverlaps to count the number of reads representing each miRNA.

miRNACounts <- countOverlaps(query= miRNAsGR,subject=chr7GR, minoverlap = 20)
names(miRNACounts) <- values(miRNAsGR)$name
mmu-mir-294  mmu-mir-292  mmu-mir-295  mmu-mir-293 mmu-mir-291b 
       52859        66585        81004        23149         7415 

Visualise sequence coverage of a known feature

Often when analysing next generation sequencing data it is useful to be able to visualise a locus to see exactly which region of a gene is providing the source for the small RNAs. The coverage function reports the number of overlapping ranges that cover each base of our underlying sequence.

favourite <- c("mmu-mir-292")
miRStart <- start(miRNAsGR[values(miRNAsGR)$name==favourite])
miREnd <- end(miRNAsGR[values(miRNAsGR)$name==favourite])
my_cov <- coverage(chr7GR)[["7"]]
'integer' Rle of length 152524553 with 181 runs
  Lengths:   3218990        19         1         1       168 ...         1         1        17 149303695
  Values :         0        11         7         5         0 ...       789         6         1         0

We can plot this as a pretty picture.

plot(as.vector(my_cov)[miRStart:miREnd], type='l', col='red', 
   xaxt="n", ylab = "Coverage",xlab="Chromosomal coordinates", main=favourite)
axis(side = 1, at=, to=(miREnd-miRStart+1), by=10), 
   labels = miRStart, to= miREnd, by=10), cex.axis = 0.5, las=2)

This coverage profile is typical for an expressed miRNA loci. The profile is formed as a consequence of the process by which mature miRNAs are released from a precursor hairpin.