Next-Gen Bioinformatics: The Expanding Role of Deep-Learning Algorithms

Posted by Jen Brown on Aug 10, 2017 11:40:41 AM

Deep-learning algorithms have shown significant promise in natural language understanding, decision making, and speech and image recognition. These algorithms are now being applied to bioinformatics within the biopharma industry to manage the increasing volume of data from high-throughput techniques. As a bioinformaticist, I am particularly fascinated by recent applications of these algorithms to predicting a variety of biological processes and interactions, especially those involving proteins.

Many methods exist to identify novel protein-protein interactions (PPIs), but because of their low efficacy they have contributed only a small percentage of the entries in the whole PPI database. Researchers at the Center for Quantitative Biology in Beijing have now applied a deep-learning algorithm to sequence-based prediction of human PPIs. Their best model had an average training accuracy of 97.19%. Predictive accuracies on diverse external datasets ranged from 87.99% to 99.21%, and the approach also showed promise in other species.

PPIs and the Need for High-throughput Computational Methods

Most proteins interact with other proteins in order to function properly, and thus should be studied in the context of those interactions if their function is to be fully understood. PPIs are known to play a critical role in many biological processes, including signal transduction, protein folding, cellular organization, and immune response. Transient PPIs are expected to be involved in the entire range of cellular processes and to control the majority of them.

As a result, the analysis of PPIs may shed light on drug target detection and aid in therapy design. Many methods are commonly used to analyze PPIs, ranging from co-immunoprecipitation (co-IP) for stable or strong PPIs to crosslinking protein interaction analysis for transient or weak PPIs. High-throughput technologies such as mass spectrometric protein complex identification (MS-PCI) and yeast two-hybrid screens can generate copious amounts of data, but they tend to be expensive and time-consuming and may not be applicable to proteins from all organisms. This has created a need in the industry for high-throughput computational methods that identify PPIs with high quality and accuracy.

Harnessing the Power of Deep-learning Algorithms

Some bioinformatics studies have used computational methods to mine new information about PPIs, while others have developed new machine-learning algorithms. Most of these studies, however, have reported only cross-validation results and have not tested the algorithms' ability to predict results on external datasets. Deep-learning algorithms are a class of machine-learning algorithms that operate by mimicking the deep neural networks and learning processes of the human brain, and they are capable of handling large-scale, raw, and complex data. Recent applications of deep-learning algorithms in bioinformatics include the prediction of DNA variants causing aberrant splicing, sequence specificities of RNA- and DNA-binding proteins, binding motifs, chromatin effects of sequence alterations with single-nucleotide sensitivity, protein secondary structures, protein backbone angles, and solvent-accessible surface areas.
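To make the distinction between cross-validation and external testing concrete, here is a minimal Python sketch. Everything in it is illustrative and assumed for the example: the function names (`k_fold_indices`, `cross_validate`, `accuracy`, `fit_majority`), the convention that `fit` returns a prediction function, and the toy data. It is not the evaluation code from any of the studies mentioned.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def accuracy(predict, examples):
    """Fraction of (features, label) examples that predict() labels correctly."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


def cross_validate(fit, examples, k=10, seed=0):
    """Mean validation accuracy over k folds drawn from the same dataset."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(examples), k, seed):
        predict = fit([examples[i] for i in train_idx])
        scores.append(accuracy(predict, [examples[i] for i in val_idx]))
    return sum(scores) / len(scores)


def fit_majority(train):
    """Hypothetical stand-in model: always predicts the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority


# Toy data (assumed): label 1 = interacting pair, label 0 = non-interacting pair.
internal = [((float(i),), 1) for i in range(15)] + [((float(i),), 0) for i in range(5)]
external = [((float(i),), 0) for i in range(4)]  # differently distributed dataset

cv_score = cross_validate(fit_majority, internal, k=10)
external_score = accuracy(fit_majority(internal), external)
print(f"10-fold CV accuracy: {cv_score:.2f}, external accuracy: {external_score:.2f}")
```

The toy data are deliberately built so that the cross-validation score looks respectable while the score on the differently distributed external set collapses, which is the gap that cross-validation alone cannot reveal.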

The Model

The deep-learning model developed by Sun et al. applied a stacked autoencoder (SAE) specifically to sequence-based prediction of human PPIs, but it was also shown to achieve accuracies higher than or comparable to those of other predictive models for E. coli, Drosophila, and C. elegans. An autoencoder is defined as an “artificial neural network that applies an unsupervised learning algorithm which infers a function to construct hidden structures from unlabeled data.” An SAE consists of multiple autoencoder layers that are trained one layer at a time, with the output of each layer serving as the input to the next. Sun et al. found that the best method of coding protein sequences was the autocovariance (AC) method. The AC method describes how variables at different positions in a sequence are correlated and interact, and it is widely used for coding proteins. The models developed using AC coding achieved the best results on 10-fold cross-validation (10-CV) and on predicting hold-out test sets.
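As a rough illustration of the autocovariance coding described above, the following Python sketch turns a protein sequence into a fixed-length numeric vector. It rests on several assumptions that are not from the paper: it uses a single property scale (the Kyte-Doolittle hydropathy index, chosen only as a familiar stand-in), the common definition of autocovariance at a given lag, and hypothetical function names (`autocovariance`, `encode_sequence`, `encode_pair`). The property scales, normalization, and maximum lag used by Sun et al. may differ.

```python
# Kyte-Doolittle hydropathy index, used here only as a stand-in property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocovariance(values, max_lag):
    """Return [AC(1), ..., AC(max_lag)] for one numeric property profile.

    AC(lag) is the mean product of mean-centered values that sit
    `lag` positions apart along the sequence.
    """
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def encode_sequence(sequence, scales, max_lag):
    """Concatenate the autocovariance vectors of every property scale.

    Raises KeyError for residues missing from a scale (e.g. 'X').
    """
    features = []
    for scale in scales:
        profile = [scale[residue] for residue in sequence]
        features.extend(autocovariance(profile, max_lag))
    return features


def encode_pair(seq_a, seq_b, scales, max_lag):
    """Represent a candidate protein pair by joining the two encodings."""
    return encode_sequence(seq_a, scales, max_lag) + encode_sequence(seq_b, scales, max_lag)


# Two made-up peptide fragments, for illustration only.
pair_vector = encode_pair("MKTAYIAKQR", "GSHMLEDPVA", [HYDROPATHY], max_lag=3)
print(len(pair_vector))
```

Because each sequence contributes `len(scales) * max_lag` values no matter how long it is, proteins of different lengths map to vectors of the same size, which is what a fixed-width network input layer requires.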

Future work in this field may focus on developing better ways to represent protein sequence information and will need to improve overall prediction accuracy on external datasets. Even so, the application of a deep-learning algorithm to the prediction of PPIs holds promise for a deeper understanding of the vast network of protein interactions.

Are you interested in learning more about how new technologies are used in conjunction with science to accelerate therapy development and reduce costs? If so, download our eBook Next Generation Cohort Studies and Biobanking: How Cloud Technology is Accelerating Translational Research.


Sun, T., Zhou, B., Lai, L., and Pei, J. (2017). Sequence-based prediction of protein protein interaction using a deep-learning algorithm. BMC Bioinformatics, 18(1), 277. doi:10.1186/s12859-017-1700-2

Thermo Scientific. (2010). Thermo Scientific Pierce Protein Interaction Technical Handbook. v2.