King's College London - MRC SGDP Centre Summer School - Introduction to R



Bioconductor - a use case - Illumina Genome Analyzer sRNA-seq

Dr Matthew Davis, Enright laboratory, EBI


Introduction

In this practical we will be using packages from Bioconductor. Bioconductor provides a repository for packages designed to handle large quantities of genomic data. The majority of packages are used within R. The Bioconductor website can be found here: http://www.bioconductor.org. Bioconductor currently holds a huge number of packages along with documentation and reference manuals and is definitely worth exploring.



Finding Help

There are several useful places to start when faced with a new problem you want to address with R and Bioconductor.

First of all browse for appropriate packages on the Bioconductor website. Chances are you will have a colleague who will recommend packages to use.

Each package is accompanied by a raft of examples and documentation. This documentation can also be accessed in R.

library(ShortRead)
browseVignettes(package="ShortRead")

As always in R you can also access the help for the functions in the normal way.

?trimLRPatterns()

There are many courses and events recorded on the Bioconductor website. Some place their course material online. Take a look at this website that is full of great material: http://manuals.bioinformatics.ucr.edu/home/

Finally there are plenty of online mailing lists to sign up to. I would recommend the Bioconductor mailing lists found here: http://www.bioconductor.org/help/mailing-list/.

If all else fails use google.

It is worth remembering at all times the class() function is your friend. Many of the methods supplied in packages will work on specific classes of objects. The trick is to supply the correct object classes at the correct times.

First we will give a brief introduction on reading short read sequencing data into R and some basic manipulations. We will be using a Bioconductor package to work with the sequencing data called ShortRead. Towards the end of the practical we may have a look at retrieving genomic annotation data using biomaRt.



Sequence Strings

R is only limited by the packages you have access to or can write.

You'll have noticed that loading the ShortRead package will cause several additional required packages to load automatically. These are additional packages upon which ShortRead relies.

The Biostrings package contains classes and functions used throughout this practical and is designed to allow quick and easy the manipulation of biological sequences.

Create a random RNA sequence:

rna <- RNAString(paste(sample(c("U","A","C","G"), size=37, replace=TRUE), collapse=""))
rna
  36-letter "RNAString" instance
  seq: UAGGCUAGUCUACGAGACCAACUACCCUUUCUUCAU

You can convert this RNA sequence into the comparable DNA sequence.

dna <- DNAString(rna)

Shorter sequence regions can be selected using the normal [ brackets.

dna[2:8]
  7-letter "DNAString" instance
  seq: AGGCTAG
dna[8:2]
  7-letter "DNAString" instance
  seq: GATCGGA

A more efficient way to achieve the same thing

subseq(dna, start=2 , end=8 )

Two additional functions that may be of interest

reverseComplement(dna)
translate(dna)

The DNAString class of object is one of several classes that behave in a similar way (eg. BString, RNAString objects). These are all types of XString, a term that will appear in the manual pages. XStringSet objects (eg. BStringSet and DNAstringSet class objects) are groups of these Xstring objects handled together. These objects make the manipulation of large numbers of sequences much easier. We will come across these again as we progress through the practical.



Downloading the data

Although Illumina sequencers can produce millions of reads today we will be working with only a fraction of a sequencing run. The file containing the reads can be downloaded here: solexa_100k.fastq.zip. Please download it to your My Documents folder. Go to this directory and double click on the .zip file to extract it. Now select Change dir from the File menu in your R Console and select your solexa_100k.fastq folder in the My Directory folder.



Reading fastq files in R with ShortRead

Packages make complex data structures more simple.

To start using ShortRead, we need to define a couple of variables. We need the name of the file that we want to read and the directory in which its found. If our fastq file is in the current working directory, we can specify this by ".".

 
fastqFile <- "./solexa_100k.fastq" 

We can now use the readFastq function to read in our Solexa file.

solexa <- readFastq(fastqFile)
solexa
class: ShortReadQ
length: 100000 reads; width: 37 cycles
Lets now see some functions to access the basic properties of ShortReadQ objects.
length(solexa)
[1] 100000
The identifiers of the sequences can be obtained:
id(solexa)

  A BStringSet instance of length 1000000
          width seq
      [1]    21 IL23_3198:2:1:0:199/1
      [2]    21 IL23_3198:2:1:0:110/1
      ...   ... ...
  [99999]    25 IL23_3198:2:1:1544:1685/1
 [100000]    25 IL23_3198:2:1:1544:1304/1
the sequences:
sread(solexa)

  A DNAStringSet instance of length 1000000
          width seq
      [1]    37 NCGGAACCGGGCGGAAACACCATCGTATGCCGTCTTC
      [2]    37 NCTCAAACTATGGGGGCACTTTAATCGTATGCCGTCT
      ...   ... ...
  [99999]    37 ATAGTTGTGTGTGGATGTGTGTATTGAAATCGTATGC
 [100000]    37 GGGGGTGCAGCTCAGTGGTAGAGAGCGTGCTTGTCGT
the quality scores:
quality(solexa)

class: SFastqQuality
quality:
  A BStringSet instance of length 1000000
          width seq
      [1]    37 &0;;;;:9:;:7;;;;;;;;;;;;;;;;;;;;;;;:8
      [2]    37 &0<>><><><><>>>=;<<<>>=>>9>>>>>>:==5=
      ...   ... ...
  [99999]    37 EEDDEEEEEEEEDADDEEEEEEEEEEEEEDA>DEECD
 [100000]    37 >>?;=8;?=>0;;;=;>=:;?==;;5>5;3:==;6==
or the three things in one go:
detail(solexa)
and as always in R, we can use [ ] to extract only a portion of the object:
sread(solexa[101:103])

  A DNAStringSet instance of length 3
    width seq
[1]    37 TTTCTACAGTCCGACGATCTCGTATGCCGTCTTCTGC
[2]    37 CCACCAGGCGTTCGGCTCGTATGCCGTCTTCTGCTTG
[3]    37 AAGTGCTACTACTTTTGAGTCTTCGTATGCCGTCTTC

Sequencing QC

Packages will contain methods that make achieving tedious tasks simple.

As the sequencing reaction progresses, errors will be incorporated, and the general quality is expected to decrease. This is one of the reasons why read length is limited. Its a good idea to produce plots to observe this. We can look at the quality scores and how they change across each position in the reads, and we can look at the frequency of different bases.

First we will calculate the frequency of bases per cycle:

abc <- alphabetByCycle(sread(solexa), alphabet=c("A","T","G","C","N")) 
abc

        cycle
alphabet  [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
       A 28354 23421 20614 21918 32793 22894 25814 15999 20475 28767 14986
       T 32476 13759 24461 22009 20199 26988 20097 29968 25983 15357 21546
       G 18869 36382 24695 29872 29386 20904 29502 21628 25095 35364 41589
       C 20113 26394 30169 26151 17578 29163 24519 32363 28395 20458 21831
       N   188    44    61    50    44    51    68    42    52    54    48

If we want to convert these numbers into frequencies, we can divide by the total number of sequences:

abcFreq <- abc / length(solexa)
abcFreq

        cycle
alphabet    [,1]    [,2]    [,3]    [,4]    [,5]    [,6]    [,7]    [,8]
       A 0.28354 0.23421 0.20614 0.21918 0.32793 0.22894 0.25814 0.15999
       T 0.32476 0.13759 0.24461 0.22009 0.20199 0.26988 0.20097 0.29968
       G 0.18869 0.36382 0.24695 0.29872 0.29386 0.20904 0.29502 0.21628
       C 0.20113 0.26394 0.30169 0.26151 0.17578 0.29163 0.24519 0.32363
       N 0.00188 0.00044 0.00061 0.00050 0.00044 0.00051 0.00068 0.00042

These frequencies are easier to interpret in an awesome plot:

cols <- c("green","blue","yellow","red","grey")

plot(NULL, xlim=c(1,37), ylim=c(0,1), xlab="Cycle", ylab="Frequency", main="Base frequency by cycle")

matlines(t(abcFreq), col=cols, lty=1)

legend("topleft",legend=rownames(abcFreq), lty=1, col=cols, bty="n", horiz=TRUE)

Next look at the Phred quality scores as the sequencing run progresses through its cycles.

There's a quick way to convert the ascii encoded numbers back into numbers:

numQuals <- as(quality(solexa), "matrix")

head(numQuals)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    5   15   26   26   26   26   25   24   25    26    25    22    26    26
[2,]    5   15   27   29   29   27   29   27   29    27    29    27    29    29
[3,]    5   15   28   20   28   27   28   30   30    23    22    26    26    21
...

But once again, its much nicer to see a figure:

meanQ <- colMeans(numQuals)
minQ  <- apply(numQuals,2,min)
maxQ  <- apply(numQuals,2,max)

plot(meanQ, xlim=c(1,37), ylim=c(1,40), xlab="Cycle", 
      ylab="Phred Quality", main="Max/Mean/Min quality by cycle", col="blue")
points(minQ, col="red")
points(maxQ, col="green")


Sequence depth and matching sequences

A demonstration of where to look and how to use functions.

A very useful function for dealing with short read sequences is:

tables(solexa)

$top
TCTACAGTCCGACGATCTCGTATGCCGTCTTCTGCTT TACAGTCCGACGATCTCGTATGCCGTCTTCTGCTTGA 
                                 3844                                  2589 
TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAAA AGTGCCGCAGAGTTTGTAGTGTTCGTATGCCGTCTTC 
                                 2351                                  1757 
CGCGACCTCAGATCAGACTCGTATGCCGTCTTCTGCT TTCTACAGTCCGACGATCTCGTATGCCGTCTTCTGCT 
                                 1743                                  1561 
AGTTCTACAGTCCGACGATCTCGTATGCCGTCTTCTG CGCGACCTCAGATCAGACGTGTCGTATGCCGTCTTCT 
                                 1539                                  1392 

...

$distribution
    nOccurrences nReads
1              1  19357
2              2   2036
3              3    791
4              4    452
5              5    273
6              6    191

We may want to find out how many copies we have of a particular kind of sequence. We also might want to eliminate reads that contain runs of polyA. One way to search for individual sequences is as follows:

?srdistance

polyA   <- "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"

polyADist   <- srdistance(sread(solexa), polyA)[[1]]

These are edit distances, defined as the number of changes that each read will require to turn it into the sequences we're asking for (a deletion, insertion or substitution all count as one change). We can see the results displayed as a histogram:

hist(polyADist, xlim=c(1,37), main="Histogram of distances to polyA", col="skyblue")

After looking at this plot, we may decide to only keep reads that have a distance of at least 25 to the polyA sequence:

solexaClean <- solexa[polyADist >= 25]
length(solexaClean)
[1] 94511

This means that we have kept 94.5% of our reads after this filtering.

In addition, the ShortRead package has a number of built in filters that can be applied to datasets in order to remove unwanted sequences. During the sequencing process N's can be inserted into the reads. These bases are uninformative for mapping to the genome so we may want to remove reads that contain multiple N's. We can use a built in filter to do this.

filterN <- nFilter(threshold=3)
sread(solexaClean[!filterN(sread(solexaClean))])
  A DNAStringSet instance of length 18
     width seq
 [1]    37 AAGCAGGAGGTGTCAGAATCGTNTGNNGTCTNCTGCT
 [2]    37 AGTGCCGCAGAGTTTGTAGTGTNCGNNTGCCNTCTTC
 [3]    37 CAGTGCAATGATGAAAGGGCATNCGNNTGCCNTCTTC
 [4]    37 CGCGACCTCAGATCAGACGTTCNTANNCCGTNTTCTG
solexaClean <- solexaClean[filterN(sread(solexaClean))]


Trimming reads for genomic alignment

A more complicated example of using package methods to manipulate data.

Annotate your code with useful comments and reminders! (Prefix these comments with a #)

In the case of small RNA sequencing it is general practice to trim the sequence reads to remove adapter sequences (TCGTATGCCGTCTTCTGCTTGTT) before the reads are aligned to the genome. In this way reads are more likely to align correctly. This is possible using the trimLRPatterns function.

Take a look at a sequence read to see the problem:
sread(solexaClean[52])
  A DNAStringSet instance of length 1
      width seq
      [1]    37 AGTGCCGCAGAGTTTGTAGTGTTCGTATGCCGTCTTC

First we have to break up the ShortReadQ object into its constituent parts:

adapter <- "TCGTATGCCGTCTTCTGCTTGTT"
sequences <- sread(solexaClean)
quality <- quality(quality(solexaClean))

Next we must create a padded adapter sequence the length of the reads in the FASTQ file along with a mismatch vector that defines an acceptable number of mismatches between the adapter and the reads being trimmed at each base position.

adapterPad <- paste(substr(adapter,1,12), paste(rep("N", 25),collapse=""), sep="")
adapterPad
[1] "TCGTATGCCGTCNNNNNNNNNNNNNNNNNNNNNNNNN"
maxMm <- round(seq_len(12)*0.2)
maxMmPad <-  c(maxMm,rep(maxMm[length(maxMm)], 25))
maxMmPad
 [1] 0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

Finally we use these two objects and the trimLRPatterns function to determine relative coordinates of sequences retained post trimming. These coordinates can then be used to modify the quality and sequence objects.

coordinates <- trimLRPatterns(Rpattern=adapterPad, subject=sequences, max.Rmismatch=maxMmPad, 
                                    ranges=TRUE, with.Rindels=FALSE, Rfixed=FALSE)
seqTrim <- DNAStringSet(sequences, start=start(coordinates), end=end(coordinates))
qualTrim <- BStringSet(quality, start=start(coordinates), end=end(coordinates))

Once we have removed the adapter sequence we rebuild the ShortReadQ object.

qualTrim <- FastqQuality(qualTrim)
solexaTrim <- ShortReadQ(sread=seqTrim, quality=qualTrim, id=id(solexaClean))

Let's take another look at the sequence we saw earlier:

sread(solexaTrim[52])
DNAStringSet instance of length 1
    width seq
    [1]    22 AGTGCCGCAGAGTTTGTAGTGT

If we subsequently display the trimmed sequences we can see that a large proportion of the reads are now shorter once adapter matching sequences have been removed.

tables(solexaTrim)$top
                 TCTACAGTCCGACGATC                    TACAGTCCGACGATC                 CGCGACCTCAGATCAGAC 
                              3930                               2683                               1807 
            AGTGCCGCAGAGTTTGTAGTGT                 TTCTACAGTCCGACGATC               AGTTCTACAGTCCGACGATC 
                              1790                               1597                               1578 
             CGCGACCTCAGATCAGACGTG                   CTACAGTCCGACGATC            ACTCAAACTATGGGGGCACTTTT 
                              1425                               1353                               1287 

Normally the trimming would be repeated a number of times, varying the parametres to ensure the complete removal of contaminating stretches of sequence. The fourth most common sequence does, however, resemble a known mouse miRNA: mmu-miR-293 (AGTGCCGCAGAGTTTGTAGTGT).



Comparing reads to miRNAs

Combine specialised tools and generic visualisations to understand data.

Let's compare the remaining sequencing reads to known miRNAs.

The first two are microRNAs expressed in ES cells.

miR290 <- "ACTCAAACTATGGGGGCACTTT"
miR25  <- "CATTGCACTTGTCTCGGTCTGA"

This one is a liver specific miRNA.

miR122 <- "TGGAGTGTGACAATGGTGTTTG"

Now lets get the distances.

miR290Dist <- srdistance(sread(solexaTrim), miR290)[[1]]
miR25Dist  <- srdistance(sread(solexaTrim), miR25)[[1]]
miR122Dist <- srdistance(sread(solexaTrim), miR122)[[1]]

We can do several things that we can do with these results. We could plot histograms as before, but we can also quickly see if we have perfect hits.

sread(solexaTrim)[miR290Dist == 0]
  A DNAStringSet instance of length 963
      width seq
  [1]    22 ACTCAAACTATGGGGGCACTTT
  [2]    22 ACTCAAACTATGGGGGCACTTT
  [3]    22 ACTCAAACTATGGGGGCACTTT
  ...   ... ...
[961]    22 ACTCAAACTATGGGGGCACTTT
[962]    22 ACTCAAACTATGGGGGCACTTT
[963]    22 ACTCAAACTATGGGGGCACTTT
sread(solexaTrim)[miR25Dist == 0]
  A DNAStringSet instance of length 1052
       width seq
   [1]    22 CATTGCACTTGTCTCGGTCTGA
   [2]    22 CATTGCACTTGTCTCGGTCTGA
   [3]    22 CATTGCACTTGTCTCGGTCTGA
  ...   ... ...
[1050]    22 CATTGCACTTGTCTCGGTCTGA
[1051]    22 CATTGCACTTGTCTCGGTCTGA
[1052]    22 CATTGCACTTGTCTCGGTCTGA
sread(solexaTrim)[miR122Dist == 0]
  A DNAStringSet instance of length 0

We can see that we have 963 hits for miR-290, 1052 for miR-25 and 0 for miR-122. We could be a bit less strict and allow for some differences to the miRNA sequence. Lets allow three changes, then we can plot the results:

miR290Count <- sum(miR290Dist <= 3)
miR25Count <- sum(miR25Dist <= 3)
miR122Count <- sum(miR122Dist <= 3)
barplot(c("miR-25"=miR25Count, "miR-290"=miR290Count, "miR-122"=miR122Count), 
         col=rainbow(3), main="Read counts",ylab="Reads",xlab="miRNAs")


Challenges:

1. miRNAs are ~22nt in length. Produce a length distribution plot for the sample after the adapter has been trimmed. Make it as pretty and informative as you can (Take a look at the manuals for the par(), legend(), axis(), table(), barplot() and abline() functions). Does the experiment seem to have sequenced miRNAs? After reads have been trimmed to remove unwanted sequences we would normally then size select the remaining reads, retaining only those that are approximately the expected length. Select the sequences whose size may corresponds to the miRNAs. What proportion of the original sample are you left with?

2. Are there any ShortRead filters you could use to remove low complexity reads. If you happen to find one, use it to remove any read that contains more than 25 copies of any single base. What proportion of reads would remain? HINT: Check the srFilter() manual for some clues.

3. miR-92a (UAUUGCACUUGUCCCGGCCUG) is a miRNA that is very similar to miR-25. Use the srdistance() function to compare the two miRNAs and find the edit distance. What problem does this pose to the simple use of edit distance to look at which miRNAs may be represented in the sample HINT: The pattern for the srdistance() function must be a DNAStringSet class object. You will need to use the DNAStringSet() function to create this.




Accessing genomic annotation data using biomaRt

Although there is not time to take you through the entire small RNA analysis today there is one more tool that you may find universally useful that I will introduce.

Once we have cleaned our reads we could write them back to a fastq file and align them to the genomic sequence. This will generate genomic coordinates for each sequence. There are several formats for recording alignment information. Here is an example of SAM format. The example has been produced by using the alignment Bowtie tool: http://bowtie-bio.sourceforge.net/index.shtml. Of particular interest are the 3rd and 4th columns; chromosome name and position.

These alignments are easily read back into R using the Rsamtools package, at which point we would look for overlaps with genomic features of interest. Again there are R tools available for doing this; IRanges and GenomicRanges. To find these overlaps we need a set of genomic coordinates for these features. biomaRt provides a suite of tools for accessing data repositories with a user friendly R interface in order to retrieve data such as this.

First load the biomaRt library.

library("biomaRt")

Next display all the web services accessible via biomaRt.

listMarts()

We will use the Ensembl database today.

database <- useMart("ensembl")

Make a list of the datasets available, we will use the mouse and append this information to our 'Mart' object.

listDatasets(database)
database <- useDataset("mmusculus_gene_ensembl",mart=database)

The getBM function allows the collection of data from the database. We supply this function with the Mart object we are interested in and 3 other options. Attributes are the pieces of information we are interested in retrieving. Filters specify any criteria by which we wish to restrict our results. Values are the specific values applied to the filters, eg. gene names, specific chromosomes etc.

Take a closer look at the attributes and filters available in this database.

head(listFilters(database)) 
head(listAttributes(database))

Retrieve the genomic coordinates of the miRNAs used above.

miRNAs <- getBM(
      attributes = c("chromosome_name", "start_position", "end_position", "strand","mirbase_id","ensembl_gene_id"), 
      filters = "mirbase_id", values = c("mmu-miR-25","mmu-miR-290","mmu-miR-122"), mart = database)
miRNAs
  chromosome_name start_position end_position strand  mirbase_id    ensembl_gene_id
  1               5      138606549    138606632     -1  mmu-mir-25 ENSMUSG00000065394
  2              18       65408515     65408580      1 mmu-mir-122 ENSMUSG00000065402
  3               7        3218627      3218709      1 mmu-mir-290 ENSMUSG00000077815

In the SAM file above it is possible to see that the highlighted read aligns to chromosome 7 at position 3218969 and that the match is 22nt long. We can use biomaRt to find out if this read aligns to any genomic features.

overlap <- getBM(
      attributes = c("chromosome_name", "start_position", "end_position", "strand","mirbase_id","ensembl_gene_id"), 
      filters = c("chromosomal_region"), values = c("7:3218969:3218990"), mart = database)
overlap
  chromosome_name start_position end_position strand   mirbase_id    ensembl_gene_id
  1               7        3218920      3219001      1 mmu-mir-291a ENSMUSG00000078008

Finally, it is also possible to derive sequences from the database.

miRsequences <- c("mmu-miR-25","mmu-miR-290","mmu-miR-122")
getSequence(id=miRsequences,type="mirbase_id",seqType="gene_exon",mart=database)
                                                                             gene_exon  mirbase_id
1                   AGCTGTGGAGTGTGACAATGGTGTTTGTGTCCAAACCATCAAACGCCATTATCACACTAAATAGCT mmu-mir-122
2 GGCCAGTGTTGAGAGGCGGAGACTTGGGCAATTGCTGGACGCTGCCCTGGGCATTGCACTTGTCTCGGTCTGACAGTGCCGGCC  mmu-mir-25
3  CTCATCTTGCGGTACTCAAACTATGGGGGCACTTTTTTTTTTCTTTAAAAAGTGCCGCCTAGTTTTAAGCCCCGCCGGTTGAG mmu-mir-290

Although these examples are pertinent to this course, biomaRt is a very versitile tool and can be applied to many situations. For more examples take a look at the manual.

browseVignettes(package="biomaRt")


Challenges

1. You are interested in the effect of various conditions on the G2/M mitotic cell cycle transition in the mouse. A colleague returns from a conference and has notes from a lecture they feel will interest you. The notes contain the MGI symbols of 5 genes (Tpd52l1, Birc5, Rab6, Tmem204, Wnt10b). Identify which of these genes is directly associated with the cell cycle. (Hint: The GO Biological Process ID for the G2/M cell cell cycle transition is GO:0000086). Check the manual to find out how to use multiple filters.

2. Your boss has published hundreds of papers on Birc5 and wants to search the promoter region of this gene for transcription factor binding motifs. Retrieve the sequence for 1000 bases upstream of Birc5 so that they can use it in there searches. Hint: Check the getSequence manual for all of the wonderful things it can do.

3. The search for motifs was not fruitful but you are not disheartened. A colleague has recently generated a high quality genomic map of epigenetic markers from across the genome. She offers to see if any reqions of interest are overlapping the G2/M associated genes. Find the exonic coordinates for these genes. Hint: If you try to do this as demonstrated in the work session it may well break. Try googling the error message and see if you can fix the problem.

4. Your colleague has also produced this map for a series of G2/M phase stalled knockout fugu fish cell lines. She identifies a region on chromosome 22 (20000000 to 20100000) for which the epigenetic markers appear to be significantly altered. Find any genes of interest that overlap this region.