Researchers develop a new, more accurate computational tool for long-read RNA sequencing


In the gene-to-protein journey, a nascent RNA molecule can be cut and joined, or spliced, in a number of ways before being translated into a protein. This process, known as alternative splicing, allows a single gene to code for several different proteins. Alternative splicing occurs in many biological processes, such as when stem cells mature into tissue-specific cells. In disease, however, alternative splicing may be dysregulated. It is therefore important to examine the transcriptome, that is, all the RNA molecules that can come from genes, to understand the root cause of a condition.

However, RNA molecules have historically been difficult to “read” in their entirety because they are often thousands of bases long. Instead, researchers have relied on so-called short-read RNA sequencing, which breaks RNA molecules up and sequences them in much shorter pieces, between 200 and 600 bases long, depending on the platform and protocol. Computer programs are then used to reconstruct the complete sequences of the RNA molecules.

Short-read RNA sequencing can provide highly accurate sequencing data, with a low per-base error rate of approximately 0.1% (meaning one base is incorrectly determined for every 1,000 bases sequenced). However, it is limited in the information it can provide because the sequencing reads are so short. In many ways, short-read RNA sequencing is like dividing a large picture into many puzzle pieces that have the same shape and size, and then trying to put the picture back together.

Recently, “long-read” platforms have become available that can sequence RNA molecules more than 10,000 bases in length end-to-end. These platforms do not require the RNA molecules to be broken up before sequencing, but they have a much higher per-base error rate, typically between 5% and 20%. This known limitation has severely hampered the widespread adoption of long-read RNA sequencing. In particular, the high error rate has made it difficult to determine the validity of novel, previously unknown RNA molecules discovered in a particular condition or disease.
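To make the gap between the two error rates concrete, the short Python sketch below works through the arithmetic. It assumes, purely for illustration, that errors occur independently and uniformly along a read, which real sequencing platforms do not strictly obey. The function names and the 10-base window are hypothetical choices, not values from the study.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in a read of the given length."""
    return read_length * error_rate


def prob_error_free(window: int, error_rate: float) -> float:
    """Probability that `window` consecutive bases contain no error,
    assuming independent errors at a constant per-base rate."""
    return (1.0 - error_rate) ** window


if __name__ == "__main__":
    # Read lengths and error rates quoted in the article.
    scenarios = [
        ("short read, 0.1% error", 200, 0.001),
        ("long read, 5% error", 10_000, 0.05),
        ("long read, 20% error", 10_000, 0.20),
    ]
    for label, length, rate in scenarios:
        errors = expected_errors(length, rate)
        clean = prob_error_free(10, rate)  # hypothetical 10-base window
        print(f"{label}: ~{errors:.1f} expected errors, "
              f"P(10-base window is error-free) = {clean:.3f}")
```

Under these assumptions, a 10,000-base read at a 5% error rate carries about 500 expected errors, which is why an apparent splice junction seen in a single long read cannot simply be taken at face value.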

To circumvent this problem, researchers at Children’s Hospital of Philadelphia (CHOP) have developed a new computational tool that can more accurately discover and quantify RNA molecules from this error-prone long-read RNA sequencing data. The tool, called ESPRESSO (Error Statistics Promoted Evaluator of Splice Site Options), was reported today in Science Advances.

“Long-read RNA sequencing is a powerful technology that will allow us to discover RNA variation in rare genetic diseases and other conditions, such as cancer,” said Yi Xing, Ph.D., director of the Center for Computational and Genomic Medicine at CHOP and senior author of the study.

“We are probably at an inflection point in the way we discover and analyze RNA molecules. The transition from short-read to long-read RNA sequencing represents an exciting technological transformation, and computational tools that reliably interpret long-read RNA sequencing data are urgently needed.”

ESPRESSO can accurately discover and quantify different RNA molecules from the same gene, known as RNA isoforms, using only error-prone long-read RNA sequencing data. To do so, the computational tool compares all of the long RNA sequencing reads for a given gene with the corresponding genomic DNA, and then uses the error patterns of individual long reads to confidently identify splice junctions (places where the nascent RNA molecule has been cut and joined) as well as their corresponding full-length RNA isoforms.

By finding areas of perfect matches between long RNA sequencing reads and genomic DNA, as well as borrowing information from all long RNA sequencing reads for a gene, the tool can identify highly reliable RNA splice junctions and isoforms, including those that have not been previously documented in existing databases.
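The idea described above can be illustrated with a toy Python sketch. This is not ESPRESSO's actual algorithm or API. It is a deliberately simplified model in which each read proposes a junction, junctions backed by reads with perfectly matching flanks are trusted, and noisier reads are reassigned to a nearby trusted junction. All names and thresholds here (`JunctionCall`, `min_perfect_reads`, `max_shift`) are hypothetical.

```python
from collections import Counter
from typing import NamedTuple, Optional


class JunctionCall(NamedTuple):
    """One read's proposed splice junction (hypothetical representation)."""
    donor: int             # genomic coordinate where the intron starts
    acceptor: int          # genomic coordinate where the intron ends
    flank_mismatches: int  # mismatches/indels in the bases flanking the junction


def confident_junctions(calls, min_perfect_reads=2):
    """Junctions supported by enough reads whose flanks match the genome perfectly."""
    perfect = Counter((c.donor, c.acceptor) for c in calls if c.flank_mismatches == 0)
    return {junction for junction, n in perfect.items() if n >= min_perfect_reads}


def rescue(call, trusted, max_shift=5) -> Optional[tuple]:
    """Return the closest trusted junction within max_shift bases at both ends, else None."""
    best = None
    for donor, acceptor in sorted(trusted):  # sorted for deterministic tie-breaking
        shift = max(abs(call.donor - donor), abs(call.acceptor - acceptor))
        if shift <= max_shift and (best is None or shift < best[0]):
            best = (shift, (donor, acceptor))
    return best[1] if best else None


def junction_support(calls, min_perfect_reads=2, max_shift=5):
    """Count reads per trusted junction, pooling information across all reads."""
    trusted = confident_junctions(calls, min_perfect_reads)
    support = Counter()
    for call in calls:
        key = (call.donor, call.acceptor)
        if key not in trusted:
            key = rescue(call, trusted, max_shift)
        if key is not None:
            support[key] += 1
    return support


if __name__ == "__main__":
    calls = [
        JunctionCall(1000, 2000, 0),  # perfect flanks
        JunctionCall(1000, 2000, 0),  # perfect flanks
        JunctionCall(1002, 2000, 3),  # noisy, 2 bases off: reassigned
        JunctionCall(1000, 1997, 1),  # noisy, 3 bases off: reassigned
        JunctionCall(1500, 2600, 4),  # no trusted junction nearby: dropped
        JunctionCall(3000, 4000, 0),  # only one perfect read: not trusted
    ]
    print(junction_support(calls))
```

In this toy model, the two noisy reads add support to the junction at (1000, 2000) instead of creating spurious new junctions, which mirrors in miniature the notion of borrowing information across all reads for a gene.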

The researchers evaluated the performance of ESPRESSO using simulated data and data from real biological samples. They found that ESPRESSO outperforms several currently available tools in both the discovery and the quantification of RNA isoforms. The researchers also generated and analyzed more than one billion long RNA sequencing reads covering 30 human tissue types and three human cell lines, providing a useful resource for studying human transcriptome variation at the resolution of full-length RNA isoforms.

“ESPRESSO addresses a longstanding problem of long-read RNA sequencing and could usher in new discovery opportunities,” said Dr. Xing. “We envision that ESPRESSO will be a useful tool for researchers to explore the RNA repertoire of cells in various biomedical and clinical settings.”

More information:
Yuan Gao et al, ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data, Science Advances (2023). DOI: 10.1126/sciadv.abq5072.

Provided by the Children’s Hospital of Philadelphia

Citation: Researchers Develop New, More Accurate Computational Tool for Long Read RNA Sequencing (Jan 20, 2023) Accessed 2023 Jan 20 at tool-long-read-rna-sequencing.html
