Deep clustering of protein folding simulations


We examine the problem of clustering bio-molecular simulations using deep learning techniques. Since bio-molecular simulation datasets are inherently high dimensional, it is often necessary to build low-dimensional representations that can be used to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes. In this paper, we use a convolutional variational auto-encoder (CVAE) to learn low-dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely the Fs-peptide (14 µs aggregate sampling), villin headpiece (single trajectory of 125 µs) and the mixed β-β-α (BBA) protein (223 + 102 µs sampling across two independent trajectories). In these systems, we show that the learned CVAE latent features correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the latent features of protein folding learned by the CVAE model can be applied to other independent trajectories, making the model particularly attractive for identifying intrinsic features of conformational substates that share similar structural characteristics. Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.


deep learning, variational auto-encoder, protein folding, conformational substates


The phenomenal growth of computing capabilities has accelerated our ability to precisely model and understand complex bio-molecular events at the atomistic scale [1, 2, 3]. Several recent studies have demonstrated how long timescale molecular dynamics (MD) simulations can provide detailed insights into events driving complex biological phenomena such as protein folding, ligand binding, and membrane transport, often complementing experimental results. MD simulations are governed by a potential energy function that includes both bonded and non-bonded terms, whose gradient defines a force field applied to every atom in the bio-molecular system [4]. These simulations integrate Newton's laws of motion for every atom in the system using time steps that are typically of the order of a femtosecond (10⁻¹⁵ s). Even small simulation systems can consist of thousands of atoms; given that bio-molecular events of interest typically occur at micro- and millisecond timescales, the increase in the size and complexity of these simulations is quickly becoming a limiting factor for extracting quantitative insights that are also biologically meaningful [5].
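To make the integration step concrete, the following is a minimal sketch of velocity Verlet, a commonly used MD integrator, applied to a single particle in a one-dimensional harmonic potential. The potential, mass, spring constant, step size and function names are illustrative assumptions, not details of the simulations analyzed here; production MD codes apply the same kind of update to every atom using a full force field.

```python
def force(x, k=1.0):
    """Force as the negative gradient of U(x) = 0.5 * k * x**2 (toy potential)."""
    return -k * x


def velocity_verlet(x, v, dt, n_steps, mass=1.0, k=1.0):
    """Integrate Newton's equations of motion with the velocity Verlet scheme."""
    f = force(x, k)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x, k)                    # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


if __name__ == "__main__":
    traj = velocity_verlet(x=1.0, v=0.0, dt=0.01, n_steps=1000)
    drift = abs(total_energy(*traj[-1]) - total_energy(*traj[0]))
    print(f"energy drift over {len(traj) - 1} steps: {drift:.2e}")
```

At a time step of 1 fs, one microsecond of simulated time corresponds to 10⁹ such updates for every atom, which is why long timescale trajectories are both expensive to generate and large to analyze.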

To overcome this challenge, a number of machine learning (ML) techniques are being developed to extract quantitative, biophysically relevant information from MD simulations. In particular, machine learning tools are able to quantify statistical insights into the time-dependent structural changes a biomolecule undergoes in simulations, identify events that characterize large-scale conformational changes at multiple timescales, build low-dimensional representations of simulation data capturing biophysical/biochemical/biological information, use these low-dimensional representations to infer kinetically and energetically coherent conformational substates, and obtain quantitative comparisons with experiments [6].

Since the dimensionality of MD simulations is large (3 × N, where N is the number of atoms, or 2 × (φ, ψ, χ) dihedral angles in the system of interest), ML techniques have focused on building low-dimensional representations of MD simulations. These dimensionality reduction techniques have used linear (e.g., principal component analysis [7], anharmonic conformational analysis [8, 9, 10]), non-linear (e.g., isometric mapping/isomap [11], diffusion maps [12]) or hybrid approaches (e.g., locally linear embedding [13]) to characterize the conformational landscape sampled within simulations. Traditional ML approaches for analyzing long time-scale simulations typically require well-designed and often hand-crafted features. This in turn requires extensive prior knowledge about the system that is being simulated (e.g., biophysically relevant reaction coordinates such as contacts between a ligand and its receptor). Often, certain ML techniques artificially restrict the simulation data being examined (e.g., by isolating only a subset of atoms from the simulations), or make pre-/post-processing of the data prohibitively expensive. Finally, many of these approaches also require pairwise comparison of individual conformers within the simulation, with a similarity/dissimilarity measure that may be computationally expensive.

Deep structured learning approaches, on the other hand, overcome these challenges by automatically learning lower-level representations (or features) from the input data and successively aggregating them such that they can be used in a variety of supervised, semi-supervised and unsupervised machine learning tasks [14]. Deep learning techniques have proven useful for a variety of structural bioinformatics applications, including protein structure prediction from biological sequences, and virtual screening/drug discovery applications [15, 16, 17]. Doerr and colleagues evaluated a variety of dimensionality reduction techniques for MD simulations and demonstrated that a shallow auto-encoder could be used to visualize folding events within protein folding trajectories [18]. More recently, Pande and colleagues demonstrated how a reduced dimensionality representation from simulations built using tICA could be propagated using a time-dependent variational auto-encoder [19].

In this paper, we develop a convolutional variational auto-encoder (CVAE) that can automatically reduce the high dimensionality of protein folding trajectories and cluster conformations from MD simulations into a small number of conformational states that share similar structural and energetic characteristics. Using equilibrium folding simulations of Fs-peptide, villin headpiece, and BBA, all model systems for protein folding, our CVAE discovers latent features that automatically capture folding intermediate states, including misfolded states that can be challenging to characterize. We further demonstrate that the learned latent features from the CVAE can be 'transferred' across simulations, making it relevant for succinctly summarizing large-scale simulations and comparing behaviors across trajectories. Together, we show that deep learning techniques can be used for unsupervised learning of biophysically relevant latent features from long timescale MD simulations.


We have demonstrated how deep learning algorithms can be used to analyze and interpret protein folding simulations. We designed a CVAE that can encode the inherent high dimensionality of the folding trajectories into a low-dimensional embedding that is biophysically relevant. We demonstrated our approach on three prototypical systems, namely Fs-peptide, villin headpiece (VHP) and BBA, all of which have been extensively characterized in previous studies. In all cases, we note that the learned CVAE embeddings captured the distinction between potentially folded, partially folded, and misfolded states.

In this paper, we used contact matrices determined from the simulations as inputs to the CVAE. Contact matrices are a practical way to represent simulation datasets and have been widely used to characterize protein folding pathways [32, 33]. However, the resolution of the information captured by contact maps is fairly low and may not be specific enough (e.g., to distinguish between native, non-native and misfolded states). Although the CVAE identified the presence of folded/unfolded and misfolded states in the simulations, there is significant scope for directly using coordinate information (or other physical quantities such as dihedral angles) from simulations for characterizing these pathways.
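As an illustration of this input representation, the sketch below builds a binary contact matrix from one frame of coordinates. The 8 Å cutoff, the use of one representative atom per residue (for example Cα), and the function name are assumptions made for the example; the text above does not specify them.

```python
import math


def contact_matrix(coords, cutoff=8.0):
    """Return an n x n binary matrix for one simulation frame.

    coords: list of (x, y, z) positions, one representative atom per residue.
    Entry (i, j) is 1 when residues i and j lie within `cutoff` of each other
    (same length units as coords), and 0 otherwise.
    """
    n = len(coords)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                matrix[i][j] = matrix[j][i] = 1  # contacts are symmetric
    return matrix


if __name__ == "__main__":
    # Four hypothetical residues spaced 3.8 Å apart along a straight line.
    frame = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    for row in contact_matrix(frame):
        print(row)
```

Each frame of a trajectory yields one such matrix, so a trajectory becomes a stack of image-like inputs, which is what makes a convolutional encoder a natural fit. The binary threshold also shows where resolution is lost: two conformations with different inter-residue distances on the same side of the cutoff map to the same entry.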

Complementary to the approach taken by Doerr and colleagues [18], we build an auto-encoder; however, augmenting it with a variational formulation allows us to obtain interpretable features from the latent space. As demonstrated in the three systems, the CVAE latent spaces capture a succinct model of protein folding with the ability to distinguish conformational substates that share similar structural features. We have yet to evaluate whether these substates share similar energetic profiles. Further, our CVAE could potentially be used to augment propagators in time [19] such that temporal correlations within these trajectories are captured.
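For readers unfamiliar with the variational formulation, the sketch below shows the two ingredients that distinguish a variational auto-encoder from a plain one: sampling the latent vector with the reparameterization trick, and the Kullback-Leibler penalty that pulls the encoder's diagonal Gaussian toward a standard normal prior. It is written in plain Python for a single sample, and the function names are hypothetical; an actual CVAE would compute these quantities inside a deep learning framework, alongside a reconstruction loss on the contact matrices.

```python
import math
import random


def sample_latent(mu, log_var, rng=random):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_var are the per-dimension mean and log-variance that an
    encoder would output for one input.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    mu, log_var = [0.5, -1.0, 0.0], [0.0, -0.5, 0.2]
    print("z  =", sample_latent(mu, log_var, rng=random.Random(0)))
    print("KL =", kl_to_standard_normal(mu, log_var))
```

The KL term is zero only when the encoder outputs exactly the prior (zero mean, unit variance) and grows as the encoding moves away from it. This regularization keeps the latent space smooth enough that nearby points decode to similar structures.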

The selection of hyperparameters, such as the size/stride of the convolutional filters and the number of latent dimensions used to embed the simulations, was based on empirical evaluations. Ideally, the choice of the latent space representation should be a parameter that can be learned from the simulation data itself (instead of being specified by the user). Further, these latent dimensions should correspond to directions in the landscape that enable the bio-molecular system to sample folded/misfolded states, which has been previously demonstrated by pursuing higher order statistical dependencies in atomic fluctuations in the simulations [8, 5]. We plan to extend our CVAE to automatically learn and infer the dimensionality of the latent space.

Further studies are essential to establish the biophysical relevance of the learned CVAE embeddings. Specifically, we have not evaluated whether the CVAE embeddings for these folding trajectories correspond to biophysical reaction coordinates, i.e., whether the unique directions proposed by the CVAE can 'fold' a protein system. Temporal correlations are known to significantly influence bio-molecular events [34]. Although we trained our model to include temporal information (i.e., the frames used for training were successive conformations in the trajectory), the embeddings learned do not necessarily correspond to detectable bio-molecular events. For example, in a protein folding trajectory, a typical event corresponds to 'whether a β-strand was formed'; our CVAE is currently unable to identify timepoints where significant structural or dynamical changes have occurred within trajectories. Leveraging our previous experience in developing techniques for event detection [10, 35], we will explore deep learning models for bio-molecular event detection in the near future.

Debsindhu Bhowmik, Shang Gao, Michael T Young and Arvind Ramanathan



