Evolinc: New CyVerse Tool Reveals Evolutionary History of lincRNA

Undergraduate students at Centenary College of Louisiana use Evolinc in the classroom.

By Shelley Littin

Long intergenic non-coding RNA transcripts, or lincRNAs, are distributed throughout eukaryotic genomes, providing essential genetic code for numerous biological processes at the cellular level and forming the basis for tissue development and other functions in multicellular organisms.  

Despite their abundance and biological importance, both the function of lincRNAs and the evolutionary processes governing them remain mysteries, in part because of a lack of computing tools and processing power sufficient to assess the massive amounts of data contained in lincRNA transcripts.

Now, a research team led by Mark Beilstein at the University of Arizona (UA), which is also the headquarters of CyVerse, has developed a tool called Evolinc for identification, analysis, visualization, and comparison of lincRNA transcripts.

Evolinc is described in an article published in Frontiers in Genetics and is now available as an open-access pipeline hosted by CyVerse’s Discovery Environment data management service, giving researchers worldwide round-the-clock access to the tool for their own lincRNA studies.

“Processing the hundreds of gigabytes of RNA sequencing data has been a limiting step in lincRNA discovery,” noted Beilstein. “Having an analytical pipeline integrated with CyVerse makes it much easier for researchers to process and share these data.”

Evolinc consists of two modules. Module one, called Evolinc-I, is a workflow for identifying lincRNAs that also enables differential expression analysis and visualization of the identified transcripts in a genome browser. Module two, called Evolinc-II, is a genomic and transcriptomic comparative analysis workflow. For a user-defined group of related species, Evolinc-II determines the phylogenetic depth to which a lincRNA locus is conserved.

“By leveraging CyVerse’s computational resources, Evolinc-II searches through the genomes of closely related organisms to identify whether a particular lincRNA is present,” said Beilstein. “When lincRNAs are shared between species, Evolinc-II can also determine if the lincRNAs are derived from protein coding sequences or other types of genomic features, information which can provide clues about their functions.”
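To make the idea of "phylogenetic depth" concrete, here is a minimal toy sketch in Python. It is not Evolinc-II's actual code or algorithm. It assumes a hypothetical list of species ordered from the closest to the most distant relative of the query species, along with made-up presence/absence calls for a single lincRNA locus. The function name and all inputs are illustrative only.

```python
from typing import Dict, List, Optional


def conservation_depth(ranked_species: List[str],
                       hits: Dict[str, bool]) -> Optional[str]:
    """Return the most distantly related species in which the locus was found.

    ranked_species: hypothetical list ordered from the closest to the most
        distant relative of the query species.
    hits: hypothetical mapping of species name to whether a homologous
        locus was detected there. Species missing from the mapping are
        treated as having no detected homolog.
    """
    deepest = None
    for species in ranked_species:
        if hits.get(species, False):
            # Later entries are more distant, so keep overwriting.
            deepest = species
    return deepest


# Made-up example data for one lincRNA locus (not real results).
ranked = ["species_A", "species_B", "species_C", "species_D"]
hits = {"species_A": True, "species_B": True,
        "species_C": False, "species_D": False}

print(conservation_depth(ranked, hits))
```

In this toy case the locus is found in the two closest relatives only, so its conservation would be reported as extending to species_B. A real comparative workflow would first have to establish homology from sequence and genomic context, which this sketch takes as given.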

One of the key advantages of combining Evolinc with CyVerse’s cyberinfrastructure, writes Evolinc lead author Andrew Nelson, a research associate at the UA’s College of Agriculture and Life Sciences, “is the ability to combine various applications together in one streamlined workflow, making the workflow easier to implement by interested researchers.”

Ribonucleic acids, or RNAs, are an important class of molecules because they not only carry genetic information but also catalyze biochemical reactions and regulate gene expression. “It’s similar to DNA, but more flexible,” said study co-author and CyVerse co-principal investigator Eric Lyons. “RNA isn’t just this intermediary of genetic information flowing from the genome into proteins. RNA has a huge role in the regulation of nearly all cellular processes, including the activity of individual genes, chromosomes, and whole genomes.”

“The classes and origins of lincRNAs are virtually unknown,” said Nelson. “Having tools that enable researchers to easily identify lincRNAs and assess their evolutionary origins will permit effective classification of lincRNAs. We anticipate that there are common rules governing lincRNA evolution across all of life, but that each lineage of organisms may also have unique patterns. Evolinc will allow us to uncover the rules, and the exceptions.”

“To perform such a comprehensive analysis of lincRNAs requires experts from many different disciplines of biology, each expert in different groups of organisms,” Nelson added.

The authors believe that, by combining high-performance computing resources with access to the increasing amounts of RNA-Seq data available through the National Institutes of Health’s National Center for Biotechnology Information (NCBI) Sequence Read Archive, “Evolinc can uncover broad and fine-scale patterns in the way that lincRNAs evolve and ultimately help in linking lincRNAs to their function.”


The project has already become the focus of an outreach program in collaboration with Rebecca Murphy at Centenary College of Louisiana, in which 17 undergraduate genetics students were introduced to basic big data analysis. Each student identified transcriptomic data for a selected plant species and used the Evolinc pipelines within the CyVerse Discovery Environment to identify lincRNAs and test them for evolutionary conservation.

Evolinc can be accessed by creating a CyVerse account and logging in through the Discovery Environment. Additional information on Evolinc is available through the CyVerse Wiki.

The team’s work on Evolinc was completed with a $2.5 million award from the National Science Foundation’s Plant Genome Research Program, in addition to a separate NSF award to team member Mark Beilstein. Both Beilstein and Lyons are assistant professors in the UA’s School of Plant Sciences in the College of Agriculture and Life Sciences. Lyons is a member of the UA’s BIO5 Institute.

Image credits: Rebecca Murphy/Centenary College of Louisiana