New, more accurate computational tool for long-read RNA sequencing — ScienceDaily

In the gene-to-protein journey, a nascent RNA molecule can be cut and joined, or spliced, in a number of ways before being translated into a protein. This process, known as alternative splicing, allows a single gene to encode several different proteins. Alternative splicing occurs in many biological processes, such as when stem cells mature into tissue-specific cells. In disease, however, alternative splicing may be dysregulated. It is therefore important to examine the transcriptome, that is, all the RNA molecules that may be produced from genes, to understand the root cause of a condition.

However, RNA molecules have historically been difficult to “read” in their entirety because they are often thousands of bases long. Instead, researchers have relied on so-called short-read RNA sequencing, which breaks RNA molecules into fragments and sequences much shorter pieces, between 200 and 600 bases long depending on platform and protocol. Computer programs are then used to reconstruct the complete sequences of the RNA molecules. Short-read RNA sequencing can provide highly accurate data, with a low per-base error rate of approximately 0.1% (meaning one base is incorrectly determined for every 1,000 bases sequenced). However, the information it can provide is limited by the short length of the sequencing reads. In many ways, short-read RNA sequencing is like cutting a large picture into many puzzle pieces that all have the same shape and size, and then trying to put the picture back together.

Recently, “long-read” platforms have become available that can sequence RNA molecules more than 10,000 bases long from end to end. These platforms do not require the RNA molecules to be fragmented before sequencing, but they have a much higher per-base error rate, typically between 5% and 20%. This well-known limitation has severely hampered the widespread adoption of long-read RNA sequencing. In particular, the high error rate has made it difficult to determine whether novel, previously undocumented RNA molecules discovered in a particular condition or disease are real.
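To put the two error rates side by side, here is a rough back-of-the-envelope sketch. It assumes errors are independent and evenly spread along a read, which real platforms do not strictly satisfy, and the function names and the 20-base window are illustrative choices rather than anything from the study.

```python
# Back-of-the-envelope comparison using the per-base error rates quoted above.
# Assumption: errors are independent and uniform along the read, which real
# sequencing platforms do not strictly satisfy.

def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in a read of the given length."""
    return read_length * error_rate


def prob_error_free(window: int, error_rate: float) -> float:
    """Probability that `window` consecutive bases contain no error."""
    return (1.0 - error_rate) ** window


scenarios = [
    ("short read, 600 bases, 0.1% error", 600, 0.001),
    ("long read, 10,000 bases, 5% error", 10_000, 0.05),
    ("long read, 10,000 bases, 20% error", 10_000, 0.20),
]

for label, length, rate in scenarios:
    print(
        f"{label}: ~{expected_errors(length, rate):.1f} expected errors, "
        f"P(20-base window is error-free) = {prob_error_free(20, rate):.3f}"
    )
```

Under these simplified assumptions, a 10,000-base long read carries on the order of 500 to 2,000 miscalled bases, while a 600-base short read carries fewer than one on average. This is why an apparently new RNA molecule in long-read data can be hard to tell apart from a sequencing artifact.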

To circumvent this problem, researchers at Children’s Hospital of Philadelphia (CHOP) have developed a new computational tool that can more accurately discover and quantify RNA molecules from error-prone long-read RNA sequencing data. The tool, called ESPRESSO (Error Statistics Promoted Evaluator of Splice Site Options), was reported today in Science Advances.

“Long-read RNA sequencing is a powerful technology that will allow us to discover RNA variation in rare genetic diseases and other conditions, such as cancer,” said Yi Xing, PhD, director of the Center for Computational and Genomic Medicine at CHOP and senior author of the study. “We are probably at an inflection point in the way we discover and analyze RNA molecules. The transition from short-read to long-read RNA sequencing represents an exciting technological transformation, and computational tools that reliably interpret long-read RNA sequencing data are urgently needed.”

ESPRESSO can accurately discover and quantify different RNA molecules from the same gene, known as RNA isoforms, using only error-prone long-read RNA sequencing data. To do so, the tool compares all of the long RNA sequencing reads for a given gene with the corresponding genomic DNA, and then uses the error patterns of individual long reads to confidently identify splice junctions (the places where the nascent RNA molecule has been cut and joined) as well as the corresponding full-length RNA isoforms. By finding regions of perfect match between long RNA sequencing reads and genomic DNA, and by borrowing information from all the long reads for a gene, the tool can identify highly reliable splice junctions and RNA isoforms, including those that have not been previously documented in existing databases.
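The general idea of combining perfect matches near a junction with evidence pooled across many reads can be illustrated with a toy sketch. This is not ESPRESSO's actual algorithm, which works on read alignments and error statistics. The genome, reads, coordinates, function names and thresholds below are all hypothetical, and exact substring matching stands in for alignment.

```python
from collections import Counter

# Toy genome: exon 1 (0-10), intron (10-26), exon 2 (26-36). All hypothetical.
EXON1 = "ACGTTGCAAG"
INTRON = "GTAAGTCCCTTTTCAG"
EXON2 = "CTGACCATGG"
GENOME = EXON1 + INTRON + EXON2


def junction_signature(genome: str, donor: int, acceptor: int, flank: int) -> str:
    """Sequence a spliced read should show across a junction: the last `flank`
    genomic bases before the intron plus the first `flank` bases after it."""
    return genome[donor - flank:donor] + genome[acceptor:acceptor + flank]


def score_junctions(genome, reads, candidates, flank=5):
    """Count, for each candidate junction, the reads that match the genome
    perfectly across the junction (exact substring match in this toy)."""
    support = Counter()
    for donor, acceptor in candidates:
        signature = junction_signature(genome, donor, acceptor, flank)
        support[(donor, acceptor)] = sum(signature in read for read in reads)
    return support


def confident_junctions(support, min_reads=2):
    """Keep junctions backed by at least `min_reads` perfectly matching reads."""
    return sorted(j for j, n in support.items() if n >= min_reads)


reads = [
    "ACGTTGCAAGCTGACCATGG",  # error-free spliced read
    "ACCTTGCAAGCTGACCATGG",  # substitution far from the junction
    "ACGTTGCAAGCTGACCTTGG",  # substitution far from the junction
    "ACGTTGCAAGGACCATGG",    # 2-base deletion at the junction, mimics (10, 28)
]
candidates = [(10, 26), (10, 28)]  # true junction and an error-induced decoy

support = score_junctions(GENOME, reads, candidates)
confident = confident_junctions(support)
print(dict(support))
print(confident)
```

In this toy data the true junction is supported by three reads that match the genome perfectly around it, while the decoy is supported only by the single read whose error falls at the junction, so pooling across reads filters the decoy out.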

The researchers evaluated the performance of ESPRESSO using simulated data and data from real biological samples. They found that ESPRESSO outperforms several currently available tools in both the discovery and the quantification of RNA isoforms. The researchers also generated and analyzed more than one billion long RNA sequencing reads covering 30 human tissue types and three human cell lines, providing a useful resource for studying human transcriptome variation at the resolution of full-length RNA isoforms.

“ESPRESSO addresses a longstanding problem of long-read RNA sequencing and could usher in new discovery opportunities,” said Dr. Xing. “We envision that ESPRESSO will be a useful tool for researchers to explore the RNA repertoire of cells in various biomedical and clinical settings.”

This work was supported in part by the Immuno-Oncology Translational Network (IOTN) of the National Cancer Institute’s Cancer Moonshot Initiative (U01CA233074), other National Institutes of Health funding (R01GM088342, R01GM121827, and R56HG012310), and a National Institutes of Health T32 Training Grant in Computational Genomics (T32HG000046).
