Biology is being revolutionised by technologies for reading the sequence of DNA and RNA, which are becoming routinely accessible for biological and clinical research. With CAGE (Cap Analysis of Gene Expression), we are sequencing the start of RNA molecules, to understand how genes regulate each other, and to take a snapshot of the genetic program at work in the samples that we are studying. The CAGE technology is hypothesis-free and data-driven by design, as it is open to the detection of novel genes.
For samples yielding nanograms of total RNA (in the range of 1–10,000 cells), we have developed nanoCAGE (Plessy et al., 2010). nanoCAGE libraries are enriched for the start of RNA molecules (capped 5′ ends) through the use of the template switching method, which in practice is implemented by adding one oligonucleotide to the reverse transcription. This oligonucleotide tends to interact with the random reverse-transcription primers (needed to detect the non-polyadenylated RNAs) and produce short artifacts. To reduce these artifacts, we developed a method that we termed semi-suppressive PCR. It differs from the usual suppressive PCR in that it uses a different linker on each side of the cDNA. nanoCAGE is therefore 5′-enriched, comprehensive, and directional.
In one sentence, nanoCAGE can be summarised as a CAGE method for small samples, combining template switching with semi-suppressive PCR.
2010: Original publication of the nanoCAGE protocol, where tags were cleaved with the EcoP15I enzyme (Plessy et al., 2010).
2011: Detailed protocol published in Cold Spring Harb Protoc. (Salimullah et al., 2011). The tag cleavage with the EcoP15I enzyme, which was mostly useful in the context of concatenation of the tags in long reads, was replaced by direct sequencing of the cDNA's 5′ end.
2013: Introduction of spacers in the template-switching oligonucleotides to reduce the bias caused by the barcodes (Tang et al., 2013).
2013: Use of locked nucleic acids for a more even coverage of the gene bodies (Harbers et al., 2013).
Questions and answers
The latest published protocol was in Cold Spring Harb Protoc. (Salimullah et al., 2011). In 2013, we added a brief update in the comments section, summarising the changes made since then. Do not hesitate to contact Charles Plessy for more information.
We have also made libraries with SuperScript II and III. Benchmarks may indicate that one enzyme performs better, but libraries could be made with every enzyme we tested.
Barcodes and indexes are artificial DNA sequences encoding the identity of a sample. Both words have been used interchangeably in different laboratories or products. With nanoCAGE, we use the following definitions for historical reasons.
Barcodes are identifiers that are part of the forward (CAGE) read, and that have to be extracted before alignment. They are introduced during reverse transcription through the template-switching oligonucleotides.
Indexes are sequenced as a separate “index” read on the Illumina platform, as in Illumina's TruSeq product line. On current sequencers, the demultiplexing is automatic. They are introduced when adding the sequencing linkers (the library PCR), through the reverse PCR primer.
The linker (GGG) has a fixed sequence that cannot be changed: it is an essential part of the template-switching oligonucleotide that contacts the first-strand cDNA and interacts with the extra Cs added by the reverse transcriptase.
The spacer is there to shift the other functional regions of the template-switching oligonucleotide, namely the barcode and the fingerprint, towards the 5′ end, to reduce their capacity to interact with the first-strand cDNA. It needs to be AT-rich and to have the same sequence in all experiments, to avoid introducing sample-specific biases (Tang et al., 2013).
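As an illustration of how a barcode could be extracted from the forward read before alignment, here is a minimal Python sketch. Only the GGG linker comes from the description above. The barcode length (6 nt), the spacer sequence ("TATA"), the absence of a fingerprint, and the function name split_read are hypothetical placeholders chosen for the example, not the values of any published nanoCAGE design.

```python
LINKER = "GGG"  # fixed linker sequence, as described in the text


def split_read(read, barcode_len=6, spacer="TATA"):
    """Split a forward read into (barcode, cDNA insert).

    Assumed read layout (hypothetical): barcode + spacer + GGG + insert.
    barcode_len and spacer are placeholders for illustration only.
    Returns None when the read is too short or the spacer/linker is not
    found at the expected position.
    """
    prefix_len = barcode_len + len(spacer) + len(LINKER)
    if len(read) <= prefix_len:
        return None
    if read[barcode_len:prefix_len] != spacer + LINKER:
        return None
    return read[:barcode_len], read[prefix_len:]


# Example with a made-up read: barcode ACGTAC, spacer TATA, linker GGG.
print(split_read("ACGTAC" + "TATA" + "GGG" + "CATTGACC"))
```

A real pipeline would also need to tolerate sequencing errors in the barcode and spacer; the exact-match test here is deliberately strict to keep the sketch short.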
After polymerising the first-strand cDNA, the reverse transcriptase may add extra nucleotides. In nanoCAGE we observe a prevalence of extra Cs, reflected by extra Gs at the 5′ end of the first-mate sequence reads. Some publications suggest that these nucleotides are templated by the RNA cap itself, which is a methylguanosine. Thus, by the use of the template-switching mechanism, nanoCAGE libraries are enriched in 5′-full-length sequences. Here are key articles related to this topic:
- Hirzmann, J., Luo, D., Hahnen, J. et al. (1993). Determination of messenger RNA 5'-ends by reverse transcription of the cap structure. Nucleic Acids Res. 21, 3597-3598, suggesting that the cap is reverse-transcribed.
- Ohtake, H., Ohtoko, K., Ishimaru, Y. et al. (2004). Determination of the capped site sequence of mRNA based on the detection of cap-dependent nucleotide addition using an anchor ligation method. DNA Res. 11, 305-309, showing that A-caps are reverse-transcribed as Ts.
- Lavie L, Maldener E, Brouha B, Meese EU, Mayer J. (2004). The human L1 promoter: variable transcription initiation sites and a major impact of upstream flanking sequence on promoter activity. Genome Res. 2004 Nov;14(11):2253-60, showing that endogenous reverse-transcriptases also reverse-transcribe the cap.
- Kulpa D, Topping R, Telesnitsky A. (1997) Determination of the site of first strand transfer during Moloney murine leukemia virus reverse transcription and identification of strand transfer-associated reverse transcriptase errors. EMBO J. 1997 Feb 17;16(4):856-65, with another discussion on the reverse-transcription of the cap.
- Oz-Gleenberg I, Herzig E, Hizi A (2011). Template-independent DNA synthesis activity associated with the reverse transcriptase of the long terminal repeat retrotransposon Tf1. FEBS J. 2012 Jan;279(1):142-53, showing that reverse-transcriptases, like other DNA polymerases, add non-templated As to blunt DNA duplexes.
Since the extra Gs do not originate from the genome, make sure that your alignment pipeline can tolerate their mismatches. Also, depending on how the mismatches are represented, make sure that they do not cause a shift of the TSS in your downstream analyses. To process the alignments produced by bwa sampe, we added a new flag (-extraG) to the tool, to remove mismatched Gs at 5′ ends. (Of course, this will miss the case where the extra G matches a G on the genome by chance, but visual inspection of the results, comparing ENCODE HeLa libraries with nanoCAGE HeLa libraries, convinced us that this simple approach was already a considerable improvement over the non-corrected data.)
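The idea of removing mismatched Gs at 5′ ends can be sketched in a few lines of Python. This is not the implementation behind the -extraG flag, only an illustration of the principle under simplifying assumptions: the read aligns to the plus strand without gaps, the reference is available as a plain string, and coordinates are 0-based. The function name trim_extra_g is hypothetical.

```python
def trim_extra_g(read, ref, start):
    """Remove leading Gs of the read that mismatch the reference.

    read  : read sequence, aligned ungapped on the plus strand (assumption)
    ref   : reference sequence as a string (assumption)
    start : 0-based reference position of read[0]
    Returns (trimmed_read, corrected_start), so that the TSS is not
    shifted upstream by the non-genomic Gs.
    """
    n = 0
    while (n < len(read)
           and start + n < len(ref)
           and read[n] == "G"
           and ref[start + n] != "G"):
        n += 1
    return read[n:], start + n


ref = "AACGTTACG"
# The first G of the read faces an A on the reference: it is trimmed
# and the start moves from position 1 to position 2.
print(trim_extra_g("GCGTT", ref, 1))
# When the leading G matches the genome, nothing is trimmed. This is the
# case that the simple approach cannot resolve.
print(trim_extra_g("GCGTT", "AAGCGTT", 2))
```

For reads on the minus strand, the 5′ end is at the rightmost aligned position and the extra Gs appear as Cs in the reference orientation, so the same logic would have to be mirrored.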
nanoCAGE is a CAGE protocol using template switching to enrich for 5′ ends. CAGEscan, published in the same article (Plessy et al., 2010), is a paired-end approach where the forward read is a CAGE tag and the reverse read indicates the site of random priming. It was originally published using nanoCAGE for library preparation, but it is applicable to any CAGE protocol where the cDNAs are not cleaved into small tags, for instance the nAnT-iCAGE protocol (Murata et al., 2014).
nanoCAGE works on total RNAs from a broad range of species. We used it mostly on vertebrate RNAs, but there is also a published report using insect total RNA (honey bee, Khamis et al., 2015). We also produced libraries from plant (Arabidopsis), yeast (S. pombe) and bacterial (E. coli) total RNA (unpublished results). nanoCAGE also works on other RNA preparations, such as from ribosome pulldown (Kratz et al., 2014).
nanoCAGE uses template switching to add linkers at 5′ ends. We (Plessy et al., 2010) and others have shown that template switching is facilitated by the presence of a cap, but on the other hand it definitely works on non-capped molecules as well. For that reason, the fraction of reads aligning to the ribosomal RNAs is higher in nanoCAGE than in the more stringent CAGE protocols based on the chemical CAP Trapper method, used for instance in the FANTOM and ENCODE projects. Therefore, we can detect the non-capped ERCC spikes in nanoCAGE (after corrections to the reference spike sequences explained in the Biostars forum).
We sequence either on MiSeq with the v2 50 cycles kit, or on HiSeq with 50-base reads. In both cases, Read 1 is short (around 30 bases) and in our experience the aln algorithm of BWA is a good enough tool to align it. The length of Read 2 varies more according to the sequencer type. On MiSeq it is around 20 bases, so bwa aln is still a good option. On HiSeq, it is 50 bases, and arguably this causes more alignment failures because the probability of overlapping a splice junction is higher. Please let us know if you have a paired-end alignment strategy that would be equally efficient for the short Read 1 and the relatively long Read 2 on HiSeq, while still guaranteeing that no spurious gaps are introduced at the 5′ end of Read 1.
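For readers unfamiliar with the bwa aln and bwa sampe workflow mentioned above, the following Python sketch builds the three command lines involved: one bwa aln call per read, then bwa sampe to pair them. It only shows the order of the steps with default parameters. The file names, the output naming scheme and the function names are hypothetical, and the options that we actually use are not reproduced here.

```python
import subprocess


def bwa_paired_commands(ref, fq1, fq2, prefix):
    """Return the bwa command lines for a paired-end aln/sampe alignment.

    ref      : path to a reference already indexed with `bwa index`
    fq1, fq2 : FASTQ files for Read 1 and Read 2
    prefix   : prefix for output files (naming scheme is an assumption)
    """
    sai1 = f"{prefix}.r1.sai"
    sai2 = f"{prefix}.r2.sai"
    return [
        # Align each read independently with the aln algorithm.
        ["bwa", "aln", "-f", sai1, ref, fq1],
        ["bwa", "aln", "-f", sai2, ref, fq2],
        # Combine both alignments into paired-end SAM records.
        ["bwa", "sampe", "-f", f"{prefix}.sam", ref, sai1, sai2, fq1, fq2],
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


for cmd in bwa_paired_commands("genome.fa", "r1.fq", "r2.fq", "sample"):
    print(" ".join(cmd))
```

Calling run_all on the returned list would execute the alignment, provided that bwa is installed and the reference has been indexed.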
Poulain S, Kato S, Arnaud O, Morlighem JÉ, Suzuki M, Plessy C✉, Harbers M✉. NanoCAGE: A Method for the Analysis of Coding and Noncoding 5'-Capped Transcriptomes. Methods Mol Biol. 2017 1543:57-109 PubMed: 28349422
Khamis AM☮, Hamilton AR☮, Medvedeva YA, Alam T, Alam I, Essack M, Umylny B, Jankovic BR, Naeger NL, Suzuki M, Harbers M✉, Robinson GE✉, Bajic VB✉. Insights into the Transcriptional Architecture of Behavioral Plasticity in the Honey Bee Apis mellifera. Sci Rep. 2015 Jun 15;5:11136 PubMed: 26073445
Kratz A☮, Beguin P☮, Kaneko M, Chimura T, Suzuki AM, Matsunaga A, Kato S, Bertin N, Lassmann T, Vigot R, Carninci P, Plessy C, Launey T. Digital expression profiling of the compartmentalized translatome of Purkinje neurons. Genome Research. 2014 Aug;24(8):1396-410 PubMed: 24904046
Pascarella G, Lazarevic D, Plessy C, Bertin N, Akalin A, Vlachouli C, Simone R, Faulkner GJ, Zucchelli S, Kawai J, Daub CO, Hayashizaki Y, Lenhard B, Carninci P, Gustincich S. NanoCAGE analysis of the mouse olfactory epithelium identifies the expression of vomeronasal receptors and of proximal LINE elements. Front Cell Neurosci. 2014 Feb 18;8:41 PubMed: 24600346
Harbers M, Kato S, de Hoon M, Hayashizaki Y, Carninci P, Plessy C. Comparison of RNA- or LNA-hybrid oligonucleotides in template-switching reactions for high-speed sequencing library preparation. BMC Genomics. 2013 Sep 30 14(1):665. PubMed: 24079827
Fadloun A, Le Gras S, Jost B, Ziegler-Birling C, Takahashi H, Gorab E, Carninci P, Torres-Padilla ME. Chromatin signatures and retrotransposon profiling in mouse embryos reveal regulation of LINE-1 by RNA. Nat Struct Mol Biol. 2013 Mar;20(3):332-8. PubMed: 23353788
Tang DT, Plessy C, Salimullah M, Suzuki AM, Calligaris R, Gustincich S, Carninci P. Suppression of artifacts and barcode bias in high-throughput transcriptome analyses utilizing template switching. Nucleic Acids Res. 2013 Feb 1;41(3):e44. PubMed: 23180801
Batut P, Dobin A, Plessy C, Carninci P, Gingeras TR. High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression. Genome Res. 2013 Jan;23(1):169-80. PubMed: 22936248
Saxena A☮, Wagatsuma A☮, Noro Y, Kuji T, Asaka-Oba A, Watahiki A, Gurnot C, Fagiolini M, Hensch TK, Carninci P. Trehalose-enhanced isolation of neuronal sub-types from adult mouse brain. Biotechniques. 2012 Jun;52(6):381-5 PubMed: 22668417
Plessy C☮, Pascarella G☮, Bertin N☮, Akalin A☮, Carrieri C, Vassalli A, Lazarevic D, Severin J, Vlachouli C, Simone R, Faulkner GJ, Kawai J, Daub CO, Zucchelli S, Hayashizaki Y, Mombaerts P, Lenhard B, Gustincich S, Carninci P. Promoter architecture of mouse olfactory receptor genes. Genome Res. 2012 Mar;22(3):486-97. PubMed: 22194471
Bertin N, Plessy C, Carninci P, Harbers M. Definition of Promotome–Transcriptome Architecture Using CAGEscan. 2012. Chapter 3 in Tag-Based Next Generation Sequencing (eds M. Harbers and G. Kahl), Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, Germany.
Salimullah M, Sakai M, Plessy C, Carninci P. NanoCAGE: a high-resolution technique to discover and interrogate cell transcriptomes. Cold Spring Harb Protoc. 2011 Jan 1;2011(1):pdb.prot5559. PubMed: 21205859
Plessy C☮, Bertin N☮, Takahashi H☮, Simone R☮, Salimullah M, Lassmann T, Vitezic M, Severin J, Olivarius S, Lazarevic D, Hornig N, Orlando V, Bell I, Gao H, Dumais J, Kapranov P, Wang H, Davis CA, Gingeras TR, Kawai J, Daub CO, Hayashizaki Y, Gustincich S, Carninci P. Linking promoters to functional transcripts in small samples with nanoCAGE and CAGEscan. Nat Methods. 2010 Jul;7(7):528-34 PubMed: 20543846
Atanur SS, Birol I, Guryev V, Hirst M, Hummel O, Morrissey C, Behmoaras J, Fernandez-Suarez XM, Johnson MD, McLaren WM, Patone G, Petretto E, Plessy C, Rockland KS, Rockland C, Saar K, Zhao Y, Carninci P, Flicek P, Kurtz T, Cuppen E, Pravenec M, Hubner N, Jones SJ☮, Birney E☮, Aitman TJ☮. The genome sequence of the spontaneously hypertensive rat: Analysis and functional significance. Genome Res. 2010 Jun;20(6):791-803 PubMed: 20430781
Biagioli M☮, Pinto M☮, Cesselli D, Zaninello M, Lazarevic D, Roncaglia P, Simone R, Vlachouli C, Plessy C, Bertin N, Beltrami A, Kobayashi K, Gallo V, Santoro C, Ferrer I, Rivella S, Beltrami CA, Carninci P, Raviola E, Gustincich S. Unexpected expression of alpha- and beta-globin in mesencephalic dopaminergic neurons and glial cells. Proc Natl Acad Sci U S A. 2009 Sep 8;106(36):15454-9 PubMed: 19717439