nanoCAGE

Summary

Biology is being revolutionised by technologies for reading the sequence of DNA and RNA, which are becoming routinely accessible for biological and clinical research. With CAGE (Cap Analysis of Gene Expression), we are sequencing the start of RNA molecules, to understand how genes regulate each other, and to take a snapshot of the genetic program at work in the samples that we are studying. The CAGE technology is hypothesis-free and data-driven by design, as it is open to the detection of novel genes.

For samples yielding nanograms of total RNA (in the range of 1–10,000 cells), we have developed nanoCAGE (Plessy et al., 2010). nanoCAGE libraries are enriched for the start of RNA molecules (capped 5′ ends) through the use of the template switching method, which in practice is implemented by adding one oligonucleotide to the reverse transcription. This oligonucleotide tends to interact with the random reverse-transcription primers (needed to detect the non-polyadenylated RNAs) and produce short artifacts. We therefore developed a method to reduce them, which we termed semi-suppressive PCR. This method differs from the usual suppressive PCR in that it uses a different linker on each side of the cDNA. nanoCAGE is therefore 5′-enriched, comprehensive, and directional.

In one sentence, nanoCAGE can be summarised as a CAGE method for small samples, combining template switching with semi-suppressive PCR.

Technological timeline

  • 2010: Original publication of the nanoCAGE protocol, where tags were cleaved with the EcoP15I enzyme (Plessy et al., 2010).

  • 2011: Detailed protocol published in Cold Spring Harb Protoc. (Salimullah et al., 2011). The tag cleavage with the EcoP15I enzyme, which was mostly useful when tags were concatenated for sequencing in long reads, was replaced by direct sequencing of the cDNA's 5′ end.

  • 2013: Introduction of spacers in the template-switching oligonucleotides to reduce the bias caused by the barcodes (Tang et al., 2013).

  • 2013: Combination of nanoCAGE and CAP Trapper to maximise the promoter rate (Batut et al., 2013).

  • 2013: Use of locked nucleic acids for a more even coverage of the gene bodies (Harbers et al., 2013).

Questions and answers

What is the latest version of the protocol?

The latest published protocol is in Cold Spring Harb Protoc. (Salimullah et al., 2011). In 2013, we added a brief update in the comments section, summarising the changes made since then. Do not hesitate to contact Charles Plessy for more information.

Do I have to use PrimeScript?

No. We have also made libraries with SuperScript II and III. Benchmarks may show that one enzyme performs better than another, but we were able to make libraries with every enzyme we tested.

What is the difference between barcodes and indexes?

Barcodes and indexes are artificial DNA sequences encoding the identity of a sample. Both words have been used interchangeably in different laboratories or products. With nanoCAGE, we use the following definitions for historical reasons.

  • Barcodes are identifiers that are part of the forward (CAGE) read, and that have to be extracted before alignment. They are introduced during reverse transcription through the template-switching oligonucleotides.

  • Indexes are sequenced as a separate “index” read on the Illumina platform, as in Illumina's TruSeq product line. On current sequencers, the demultiplexing is automatic. They are introduced when adding the sequencing linkers (the library PCR), through the reverse PCR primer.
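
As an illustration of what "extracted before alignment" means for barcodes, here is a minimal sketch in Python. It is not the tool we use in production. The barcode sequences, their length and the sample names are hypothetical placeholders, and the sketch assumes that the barcode is the very first part of the forward read.

```python
# Hypothetical barcode table: replace with the barcodes of your own
# template-switching oligonucleotides.
BARCODES = {"ACAGAT": "sample_A", "CACGAT": "sample_B"}
BARCODE_LENGTH = 6  # assumed length, adjust to your design


def split_barcode(read, barcodes=BARCODES, barcode_length=BARCODE_LENGTH):
    """Return (sample, remainder) for a forward read.

    The read is assumed to start with the barcode. If the first bases do
    not match a known barcode exactly, return (None, read) unchanged.
    """
    barcode = read[:barcode_length].upper()
    sample = barcodes.get(barcode)
    if sample is None:
        return None, read
    # Only the barcode is removed here; any other fixed sequence that
    # follows it stays in the remainder and must be handled separately.
    return sample, read[barcode_length:]
```

For example, split_barcode("ACAGATTATAGGGACGT") assigns the read to sample_A and leaves the rest of the read for further processing. Exact matching is used for simplicity; tolerating sequencing errors in the barcode would need an extra step.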

What is the difference between the spacer and the linker?

nanoCAGE design of the template-switching oligonucleotide.

The linker (GGG) has a fixed sequence that cannot be changed: it is an essential part of the template-switching oligonucleotide that contacts the first-strand cDNA and interacts with the extra Cs added by the reverse transcriptase.

The spacer is there to shift the other functional regions of the template-switching oligonucleotide, namely the barcode and the fingerprint, towards the 5′ end, in order to reduce their capacity to interact with the first-strand cDNA. It needs to be AT-rich and to have the same sequence in all experiments, to avoid introducing sample-specific biases (Tang et al., 2013).
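
The two constraints on the spacer (AT-rich, identical in all experiments) can be checked mechanically when designing a set of oligonucleotides. The sketch below is only an illustration: the function names, the input format (one barcode and one spacer per sample) and the 0.7 threshold for "AT-rich" are assumptions made for this example, not values from Tang et al. (2013).

```python
def at_fraction(seq):
    """Fraction of A and T bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "AT" for base in seq) / len(seq)


def check_tso_designs(designs, min_at=0.7):
    """Return a list of problems found in a set of oligonucleotide designs.

    designs maps a sample name to a (barcode, spacer) tuple.
    min_at is a hypothetical threshold for calling a spacer AT-rich.
    """
    problems = []
    spacers = sorted({spacer for _, spacer in designs.values()})
    if len(spacers) > 1:
        problems.append("spacer differs between samples: " + ", ".join(spacers))
    for spacer in spacers:
        if at_fraction(spacer) < min_at:
            problems.append("spacer " + spacer + " is not AT-rich")
    barcodes = [barcode for barcode, _ in designs.values()]
    if len(set(barcodes)) != len(barcodes):
        problems.append("the same barcode is used for several samples")
    return problems
```

An empty list means that no problem was detected. The GGG linker is not checked, since it is the same in every oligonucleotide.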

Does nanoCAGE also add extra Gs?

After polymerising the first-strand cDNA, the reverse transcriptase may add extra nucleotides. In nanoCAGE we observe a prevalence of extra Cs, reflected by extra Gs at the 5′ end of the first-mate sequence reads. Some publications suggest that these nucleotides are templated by the RNA cap itself, which is a methylguanosine. Thus, by the use of the template-switching mechanism, nanoCAGE libraries are enriched in 5′-full-length sequences. Here are key articles related to this topic:

  • Hirzmann, J., Luo, D., Hahnen, J. et al. (1993). Determination of messenger RNA 5'-ends by reverse transcription of the cap structure. Nucleic Acids Res. 21, 3597-3598, suggesting that the cap is reverse-transcribed.
  • Ohtake, H., Ohtoko, K., Ishimaru, Y. et al. (2004). Determination of the capped site sequence of mRNA based on the detection of cap-dependent nucleotide addition using an anchor ligation method. DNA Res. 11, 305-309, showing that A-caps are reverse-transcribed as Ts.
  • Lavie L, Maldener E, Brouha B, Meese EU, Mayer J. (2004). The human L1 promoter: variable transcription initiation sites and a major impact of upstream flanking sequence on promoter activity. Genome Res. 2004 Nov;14(11):2253-60, showing that endogenous reverse-transcriptases also reverse-transcribe the cap.
  • Kulpa D, Topping R, Telesnitsky A. (1997) Determination of the site of first strand transfer during Moloney murine leukemia virus reverse transcription and identification of strand transfer-associated reverse transcriptase errors. EMBO J. 1997 Feb 17;16(4):856-65, with another discussion on the reverse-transcription of the cap.
  • Oz-Gleenberg I, Herzig E, Hizi A (2011). Template-independent DNA synthesis activity associated with the reverse transcriptase of the long terminal repeat retrotransposon Tf1. FEBS J. 2012 Jan;279(1):142-53, showing that reverse-transcriptases, like other DNA polymerases, add non-templated As to blunt DNA duplexes.

Since the extra Gs do not originate from the genome, make sure that your alignment pipeline can tolerate the mismatches they cause. Also, depending on how the mismatches are represented, make sure that they do not cause a shift in the TSS in your downstream analyses. To process the alignments produced by the command bwa sampe, we added a new flag (-extraG) to the pairedBamToBed12 tool, to remove mismatched Gs at 5′ ends. (Of course, this will miss the case where the extra G matches a G on the genome by chance, but visual inspection of the results, comparing ENCODE HeLa libraries with nanoCAGE HeLa libraries, convinced us that this simple approach was already a considerable improvement over the non-corrected data.)
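
The idea behind this correction can be sketched in a few lines of Python. This is not the code of pairedBamToBed12: the function names are hypothetical, and the sketch assumes a read aligned without gaps on the plus strand, given together with the reference bases it was aligned to.

```python
def count_extra_g(read, reference):
    """Count the leading Gs of the read that mismatch the reference.

    read and reference are the read sequence and the reference bases
    at the aligned positions, both written 5' to 3'.
    """
    extra = 0
    for read_base, ref_base in zip(read.upper(), reference.upper()):
        if read_base == "G" and ref_base != "G":
            extra += 1
        else:
            # Stop at the first base that is not a mismatched G.
            break
    return extra


def corrected_tss(start, read, reference):
    """Shift the alignment start (plus strand) past the extra Gs."""
    return start + count_extra_g(read, reference)
```

For a read GGACT aligned to TCACT at position 100, the two leading Gs are counted as extra and the TSS is moved to position 102. A leading G that matches the genome is kept, which is the case that this simple approach cannot detect.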

What is the difference between nanoCAGE and CAGEscan?

nanoCAGE is a CAGE protocol using template switching to enrich for 5′ ends. CAGEscan, published in the same article (Plessy et al., 2010), is a paired-end approach where the forward read is a CAGE tag and the reverse read indicates the site of random priming. It was originally published using nanoCAGE for library preparation, but it is applicable to any CAGE protocol where the cDNAs are not cleaved into small tags, for instance the nAnTi-CAGE protocol (Murata et al., 2014).

On what RNAs does it work?

nanoCAGE works on total RNAs from a broad range of species. We used it mostly on vertebrate RNAs, but there is also a published report using insect total RNA (honey bee, Khamis et al., 2015). We also produced libraries from plant (Arabidopsis), yeast (S. pombe) and bacterial (E. coli) total RNA (unpublished results). nanoCAGE also works on other RNA preparations, such as RNA from ribosome pulldown (Kratz et al., 2014).

Spikes are not capped. Why can we use spikes?

nanoCAGE uses template switching to add linkers at 5′ ends. We (Plessy et al., 2010) and others have shown that template switching is facilitated by the presence of a cap, but it also works on non-capped molecules. For that reason, the fraction of reads aligning to the ribosomal RNAs is higher in nanoCAGE than in the more stringent CAGE protocols based on the chemical CAP Trapper method, used for instance in the FANTOM and ENCODE projects. For the same reason, we can detect the non-capped ERCC spikes in nanoCAGE (after corrections to the reference spike sequences explained in the Biostars forum).

Why are you still using BWA aln?

We sequence either on MiSeq with the v2 50 cycles kit, or on HiSeq with 50-base reads. In both cases, Read 1 is short (around 30 bases) and in our hands, the aln algorithm of BWA is a good enough tool to align it. The length of Read 2 varies more according to the sequencer type. On MiSeq it is around 20 bases, so bwa aln is still a good option. On HiSeq, it is 50 bases, and arguably this causes more alignment failures because the probability of overlapping a splice junction is higher. Please let us know if you have a paired-end alignment strategy that would handle both the short Read 1 and the longer Read 2 on HiSeq equally well, while still guaranteeing that no spurious gaps are introduced at the 5′ end of Read 1.
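
For readers unfamiliar with the aln and sampe workflow, the sketch below builds the three BWA commands needed to align one pair of FASTQ files. It only illustrates the order of the steps: the function name and the output file names are hypothetical, no alignment parameters are set, and the reference is assumed to have been indexed beforehand with bwa index.

```python
def bwa_aln_sampe_commands(reference, read1_fastq, read2_fastq, prefix="out"):
    """Return the bwa commands (as argument lists) for one read pair.

    reference is the prefix of an existing BWA index. The output file
    names derived from prefix are placeholders for this example.
    """
    sai1 = prefix + ".r1.sai"
    sai2 = prefix + ".r2.sai"
    sam = prefix + ".sam"
    return [
        # Align each mate separately with the aln algorithm.
        ["bwa", "aln", "-f", sai1, reference, read1_fastq],
        ["bwa", "aln", "-f", sai2, reference, read2_fastq],
        # Pair the two alignments and write the result in SAM format.
        ["bwa", "sampe", "-f", sam, reference, sai1, sai2,
         read1_fastq, read2_fastq],
    ]
```

Each list can be passed in order to subprocess.run(command, check=True) from the Python standard library, which stops at the first command that fails.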

Bibliography

Khamis AM, Hamilton AR, Medvedeva YA, Alam T, Alam I, Essack M, Umylny B, Jankovic BR, Naeger NL, Suzuki M, Harbers M, Robinson GE, Bajic VB. Insights into the Transcriptional Architecture of Behavioral Plasticity in the Honey Bee Apis mellifera. Sci Rep. 2015 Jun 15;5:11136 PubMed: 26073445

Kratz A, Beguin P, Kaneko M, Chimura T, Suzuki AM, Matsunaga A, Kato S, Bertin N, Lassmann T, Vigot R, Carninci P, Plessy C, Launey T. Digital expression profiling of the compartmentalized translatome of Purkinje neurons. Genome Research. 2014 Aug;24(8):1396-410 PubMed: 24904046

Pascarella G, Lazarevic D, Plessy C, Bertin N, Akalin A, Vlachouli C, Simone R, Faulkner GJ, Zucchelli S, Kawai J, Daub CO, Hayashizaki Y, Lenhard B, Carninci P, Gustincich S. NanoCAGE analysis of the mouse olfactory epithelium identifies the expression of vomeronasal receptors and of proximal LINE elements. Front Cell Neurosci. 2014 Feb 18;8:41 PubMed: 24600346

Plessy C, Pascarella G, Bertin N, Akalin A, Carrieri C, Vassalli A, Lazarevic D, Severin J, Vlachouli C, Simone R, Faulkner GJ, Kawai J, Daub CO, Zucchelli S, Hayashizaki Y, Mombaerts P, Lenhard B, Gustincich S, Carninci P. Promoter architecture of mouse olfactory receptor genes. Genome Res. 2012 Mar;22(3):486-97. PubMed: 22194471

Plessy C, Bertin N, Takahashi H, Simone R, Salimullah M, Lassmann T, Vitezic M, Severin J, Olivarius S, Lazarevic D, Hornig N, Orlando V, Bell I, Gao H, Dumais J, Kapranov P, Wang H, Davis CA, Gingeras TR, Kawai J, Daub CO, Hayashizaki Y, Gustincich S, Carninci P. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nat Methods. 2010 Jul;7(7):528-34 PubMed: 20543846

Atanur SS, Birol I, Guryev V, Hirst M, Hummel O, Morrissey C, Behmoaras J, Fernandez-Suarez XM, Johnson MD, McLaren WM, Patone G, Petretto E, Plessy C, Rockland KS, Rockland C, Saar K, Zhao Y, Carninci P, Flicek P, Kurtz T, Cuppen E, Pravenec M, Hubner N, Jones SJ, Birney E, Aitman TJ. The genome sequence of the spontaneously hypertensive rat: Analysis and functional significance. Genome Res. 2010 Jun;20(6):791-803 PubMed: 20430781

Biagioli M, Pinto M, Cesselli D, Zaninello M, Lazarevic D, Roncaglia P, Simone R, Vlachouli C, Plessy C, Bertin N, Beltrami A, Kobayashi K, Gallo V, Santoro C, Ferrer I, Rivella S, Beltrami CA, Carninci P, Raviola E, Gustincich S. Unexpected expression of alpha- and beta-globin in mesencephalic dopaminergic neurons and glial cells. Proc Natl Acad Sci U S A. 2009 Sep 8;106(36):15454-9 PubMed: 19717439