Combinatorial Pattern Discovery Approach for the Folding Trajectory Analysis of a β-Hairpin
Public Library of Science (PLoS) -- PLOS Computational Biology
DOI 10.1371/journal.pcbi.0010008

The study of protein folding mechanisms continues to be one of the most challenging problems in computational biology. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus various reaction coordinates, such as the fraction of native contacts, the radius of gyration, RMSD from the native structure, and so on. In this paper, we present a combinatorial pattern discovery approach toward understanding the global state changes during the folding process. This is a first step toward an unsupervised (and perhaps eventually automated) approach toward identification of global states. The approach is based on computing biclusters (or patterned clusters)—each cluster is a combination of various reaction coordinates, and its signature pattern facilitates the computation of the Z-score for the cluster. For this discovery process, we present an algorithm of time complexity c∈RO((N + nm) log n), where N is the size of the output patterns and (n × m) is the size of the input with n time frames and m reaction coordinates. To date, this is the best time complexity for this problem. We next apply this to a β-hairpin folding trajectory and demonstrate that this approach extracts crucial information about protein folding intermediate states and mechanism. We make three observations about the approach: (1) The method recovers states previously obtained by visually analyzing free energy surfaces. (2) It also succeeds in extracting meaningful patterns and structures that had been overlooked in previous works, which provides a better understanding of the folding mechanism of the β-hairpin. These new patterns also interconnect various states in existing free energy surfaces versus different reaction coordinates. (3) The approach does not require calculating the free energy values, yet it offers an analysis comparable to, and sometimes better than, the methods that use free energy landscapes, thus validating the choice of reaction coordinates. (An version of this work was presented at the 2005 Asia Pacific Bioinformatics Conference [1].)