eLife

eLife Sciences Publications, Ltd

Top-down machine learning approach for high-throughput single-molecule analysis

David S White
Marcel P Goldschen-Ohm
Randall H Goldsmith
Baron Chanda
Olga Boudker
Leon D Islas
Leon D Islas
Leon D Islas
Fred Sigworth

Volume:
9

DOI 10.7554/eLife.53357

Abstract

During a chemical or biological process, a molecule may transition through a series of states, many of which are rare or short-lived. Advances in technology have made it easier to detect these states by gathering large amounts of data on individual molecules. However, the increasing size of these datasets has put a strain on the algorithms and software used to identify different molecular states.

Now, White et al. have developed a new algorithm called DISC which overcomes this technical limitation. Unlike most other algorithms, DISC requires minimal input from the user and uses a new method to group the data into categories that represent distinct molecular states. Although this new approach produces a similar end-result, it reaches this conclusion much faster than more commonly used algorithms.

To test the effectiveness of the algorithm, White et al. studied how individual molecules of a chemical known as cAMP bind to parts of proteins called cyclic nucleotide binding domains (or CNDBs for short). A fluorescent tag was attached to single molecules of cAMP and data were collected on the behavior of each molecule. Previous evidence suggested that when four CNDBs join together to form a so-called tetramer complex, this affects the binding of cAMP. Using the DISC system, White et al. showed that individual cAMP molecules interact with all four domains in a similar way, suggesting that the binding of cAMP is not impacted by the formation of a tetramer complex.

Analyzing this data took DISC less than 20 minutes compared to existing algorithms which took anywhere between four hours and two weeks to complete. The enhanced speed of the DISC algorithm could make it easier to analyze much larger datasets from other techniques in addition to fluorescence. This means that a greater number of states can be sampled, providing a deeper insight into the inner workings of biological and chemical processes.

Keywords

Single-molecule methods are powerful tools for providing insight into heterogeneous dynamics underlying chemical and biological processes otherwise obscured in bulk-averaged measurements (Moerner et al., 2015). Use of these techniques has expanded rapidly, with modalities spanning electrophysiology, fluorescence, and force spectroscopy to probe diverse physical phenomena. Generally, single-molecule data are obtained as a time trajectory where molecular behavior is observed as a series of transitions between a set of discrete states obscured by experimental noise. Following the growing realization that molecules involved in physiological and chemical processes exhibit complex kinetics and a diversity of behavior, there is an increasing demand for high-throughput technologies to adequately sample different sub-populations and rare but important events (Hill et al., 2017). As a result, there has been tremendous progress in improving both the number of single molecules that can be observed simultaneously and the total observation time of each molecule. For example, the observation window prior to photobleaching in conventional fluorescence paradigms such as single-molecule Förster resonance energy transfer (smFRET) or colocalization single-molecule spectroscopy (CoSMoS) can be dramatically extended with recently developed photostable dyes (Grimm et al., 2015; Altman et al., 2012). The current generation of metal-oxide semiconductor (sCMOS) detectors enables simultaneous imaging of 1 × 10^{4} molecules in a total internal fluorescence microscopy (TIRFM) configuration and can be coupled with nanofabricated zero-mode waveguides (ZMWs) to enable access to high concentrations (Levene et al., 2003; Chen et al., 2014; Juette et al., 2016). Non-fluorescence-based single-molecule experiments such as plasmon rulers, scattering, magnetic tweezers, and single-molecule centrifugation generate a tremendous amount of data through parallel measurement of hundreds of molecules with orders of magnitude longer recordings than a typical fluorescence experiment (Berghuis et al., 2016; Ye et al., 2018; Yang et al., 2016; Popa et al., 2016; Young and Kukura, 2019).

Unfortunately, despite these advances in generating statistically robust data sets, standard analysis algorithms impose a computational bottleneck at this scale of data generation (Hill et al., 2017; Juette et al., 2016). This is particularly true when the dynamics and physical states of a system are unknown.

Typical statistical modeling of single-molecule trajectories often adopts one of two approaches. The first is a probabilistic approach that models a molecule’s behavior as a Markov chain, wherein the molecule transitions between hidden discrete states whose outputs are measured experimentally (hidden Markov model, HMM). This involves estimating the transition probabilities between a small set of postulated states with defined outputs using methods to maximize the likelihood of the model given the observations or Bayesian inference to estimate model parameter distributions. Numerous software packages have been developed for implementing HMMs, such as QuB (Nicolai and Sachs, 2013; Qin et al., 2000), HaMMy (McKinney et al., 2006), SMART (Greenfeld et al., 2012), vbFRET (Bronson et al., 2009), ebFRET (van de Meent et al., 2014) and SPARTAN (Juette et al., 2016), each of which utilize a different HMM training method. For example, QuB implements the fast segmental k-means algorithm (SKM) which combines k-means clustering and the Viterbi algorithm to identify transitions between postulated states (Juang and Rabiner, 1990), whereas vbFRET adapts variational Bayesian inference for parameter estimation at faster speeds than traditional HMM training in both smFRET and single-particle tracking experiments (Blanco and Walter, 2010; Persson et al., 2013). Although powerful statistical tools are very useful for single-molecule analysis, HMMs have notable limitations, especially in the context of high-throughput analysis and unknown system dynamics. For example, HMMs are often used in a supervised manner where the user postulates model parameters such as the number of states, their measured outputs, and the allowed transitions between them. As this information is often not known a priori, it is desirable to test multiple models and rank them according to Bayesian probabilistic approaches or objective functions, such as the Bayesian Information Criterion (BIC). This process can dramatically increase the analysis time to ensure the parameters space has been sufficiently explored, which restrict their usefulness in high-throughput single-molecule analysis. Although variants such as infinite HMMs using Bayesian nonparametric inference try to naturally learn the trajectory of the states without the typical parametric model selection, these too are often computationally prohibitive for large data sets (Hines et al., 2015; Sgouralis and Pressé, 2017; Sgouralis et al., 2018).

The second class of single-molecule analysis approaches idealization as an unsupervised clustering problem from machine learning (Li and Yang, 2019). Typically, clusters of intensity values (e.g. states) are determined using bottom-up hierarchical agglomerative clustering (HAC) algorithms, which begin by treating each observation of *N* total observations as singleton clusters and perform *N*-1 iterations wherein pairs of clusters are merged until all data-points belong to a single cluster. For each number of possible clusters, an objective function can be minimized to find the optimal trade-off between the complexity and fit. This implementation results in time complexity of O(*N*^{2}) owning to the need of computing a *N*x*N* similarity matrix to determine which clusters should be merged at each iteration. In practice, a separate algorithm called change-point (CP) detection precedes HAC to reduce the solution domain of the objective function by identifying statistically significant stepwise changes in signal over time. We denote this combination of algorithms as CP-HAC. At each identified CP, the data are divided into two segments, each described by the mean values of the data-points between sequential change-points. By using the segments as initial clusters rather than all *N* data-points, the HAC computation can be dramatically reduced. The pioneering application of CP-HAC to single-molecule data addressed CP detection and clustering in the presence of Poisson noise (Watkins and Yang, 2005). Variants of this framework such as STaSI use other merit functions for Gaussian noise, including the Student’s t-test for fast CP detection and minimum description length for state selection (Shuang et al., 2014). An advantage of CP-HAC methods is that they only require a confidence interval and/or an objective function for the analysis, unlike HMMs which require a model to fit. This makes them very attractive in situations where there is no prior knowledge about the different physical states. In common experimental modalities such as smFRET, CP-HAC methods offer superior computational speed over HMM approaches; however, their quadratic time-complexity renders them inefficient on long trajectories (Shuang et al., 2014). In addition, simulation studies have suggested CP-HAC algorithms yield lower event detection accuracy than HMM approaches (Hadzic et al., 2018).

Despite the utility of HMM and CP-HAC methods, there is an outstanding need for an analysis platform to provide accurate model-free idealization with sufficiently high computational performance to keep up with the increasing scale of data generation. Although advances in computing hardware can, to a degree, mitigate these issues (Smith et al., 2019; Song and Yang, 2017), there remains a pressing need for more computationally efficient algorithms. Here, we present a new algorithm for efficient and accurate idealization of large single-molecule datasets in a model-independent manner. Our method, DISC (DIvisive Segmentation and Clustering), enhances existing statistical learning methods and enables rapid state and event detection. The DISC algorithm draws inspiration from other model-free algorithms like CP-HAC and SKM that rely on unsupervised algorithms, such as k-means and hierarchical clustering. We advance these ideas by adapting divisive clustering algorithms from data mining and information theory to improve the rate and accuracy of identifying signal amplitude clusters (states) in a top-down process as opposed to the typical bottom-up clustering (Pelleg and Moore, 2000; Karypis et al., 2000; Hamerly and Elkan, 2003). We further couple our model-free state detection with the Viterbi algorithm to enable robust event detection on par with HMM methods (Juang and Rabiner, 1990; Qin, 2004; Viterbi, 1967). Overall, DISC is an unsupervised method that combines statistical learning approaches with the high event detection accuracy of HMMs at a fraction of the computational cost, enabling convenient application to large datasets.

The goal of the DISC algorithm is time series idealization: the hard assignment of data points into discrete states. DISC approaches the problem of idealization as an unsupervised problem in machine learning wherein the number of significant states for a given single-molecule trajectory are not known a priori. This process of learning both the significant states and the transitions between them is accomplished in three phases: 1) divisive segmentation, 2) HAC, and 3) the Viterbi algorithm (Figure 1a, Figure 1—figure supplement 1). The first two phases use unsupervised statistical learning to identify the intensities of states following an appropriate user-specified objective function. The second phase uses the Viterbi algorithm to decode the most probable sequence of transitions between the identified states.

The primary distinction of DISC vs other model-free single-molecule analysis algorithms is the use of divisive clustering. Rather than being limited by the quadratic time-complexity of bottom-up HAC, DISC adopts the exact opposite hierarchical clustering algorithm: top-down divisive clustering. Divisive clustering initializes all data-points to a single cluster and iteratively splits data into sub-clusters until each data point is its own singleton cluster. Given that there are 2^{N}^{−1}-1 ways of spitting *N* data points into two sub-clusters, the complexity of top-down processes has led to their infrequent use. However, efficient implementations involving the use of sub-routines to determine how clusters should be split and whether the new clusters should be accepted has resulted in more efficient and accurate algorithms. For example, clusters can be sub-partitioned using k-means clustering and an objective function can determine if the fit of the data improves with an additional sub-cluster (Pelleg and Moore, 2000; Karypis et al., 2000; Hamerly and Elkan, 2003). Therefore, unlike HAC which makes a final decision at the end of the clustering, top-down processes make a binary decision during each iteration. For each iteration, if the objective function does not improve with the addition of new sub-clusters, the split is rejected. If the fit does improve, the split is accepted and each sub-cluster is again further partitioned into two new sub-clusters. The algorithm terminates once clusters cannot be split for further improvement of the objective function. Given this internal heuristic of cluster acceptance or rejection, the most probable number of states is typically identified within a few iterations, as opposed to the *N*-1 iterations necessary for HAC. In addition, when using k-means for partitioning data points into sub-clusters, the top-down algorithms can reach O(*N*) time complexity (Karypis et al., 2000). This simple implementation and accelerated computational speed make divisive clustering an attractive alternative that enables high-throughput single-molecule idealization.

The first phase of DISC is divisive segmentation. Consider an observed single-molecule trajectory $x=\left\{{x}_{1},\text{}\dots {x}_{N}\right\}$ where each ${x}_{n}$ is the observed intensity value $x$ at time-stamp *n* for *N* total observations contaminated by Gaussian noise. Like CP-HAC, the goal of the divisive segmentation is to identify and allocate each data-point into the optimal number of idealized states denoted by *K*. Following our Gaussian assumption, each state ${\varphi}_{j}\in \{{\varphi}_{1},\dots ,{\varphi}_{K}\}$ is described by the mean (µ) and standard deviation ($\sigma $) of data points allocated to the state ${\varphi}_{j}=\left({\mu}_{j},{\sigma}_{j}\right)$. We denote a series of transitions between states as $y=\left\{{y}_{1},\text{}\dots {y}_{N}\right\}$ where ${y}_{n}$ ∈ $\left\{{\varphi}_{1},\dots ,{\varphi}_{K}\right\}$ and 1 ≤ *K* ≤ *N*. Divisive segmentation aims to iteratively yield $x$ by determining whether data-points in a given cluster are better described by one or two states. At the onset of divisive segmentation, it is assumed that $x$ is described by a single idealized state. We will denote this initial fit as $y}_{0}=\left\{{y}_{{0}_{1}},\text{}\dots {y}_{{0}_{N}}\right\$ where ${y}_{{0}_{i}}$ ∈ $\left\{{\varphi}_{0}\right\}$ and ${\varphi}_{0}=\left({\mu}_{0},{\sigma}_{0}\right)$.

Allocating each data-point into two unique states is accomplished in two sequential phases: CP detection and k-means clustering. As opposed to standard divisive algorithms that allocate data points via k-means clustering only, we find CP identification prior to clustering advantageous. Not only does it reduce the solution domain of the objective function, but it also speeds up subsequent clustering while providing a reasonable estimate of state transitions. CP detection is performed with the popular recursive binary segmentation algorithm (Watkins and Yang, 2005; Scott and Knott, 1974). For each time stamp *n* in $x$, a hypothesis test is conducted to evaluate the probability that a CP occurred at position *n* via

In the context of single-molecule idealization, a CP is the location indicating a significant difference in mean intensity values between the data segments separated at location *n*, where the mean values of each segment are computed by

$\mu}_{1}=\frac{1}{n}\sum _{i=1}^{n}{x}_{i}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}{\mu}_{2}=\frac{1}{N-n}\sum _{i=n+1}^{N}{x}_{i$

To determine whether there is a statistically significant difference between the two segments, we use a two-way Student’s t-test of unequal sample size but uniform variance to evaluate the differences in mean. This is the same approach used in STaSI (Shuang et al., 2014). Specifically, a *t*-value is computed for each position *n* by

$t}_{k}=\frac{\left|{\mu}_{1}-\text{}{\mu}_{2}\right|}{\sigma \sqrt{\frac{1}{n}+\text{}\frac{1}{N-n}}$

where $c=\underset{k}{\mathit{a}\mathit{r}\mathit{g}\mathit{m}\mathit{a}\mathit{x}}\phantom{\rule{thinmathspace}{0ex}}k(t)$

${t}_{max}=\underset{k}{\mathit{m}\mathit{a}\mathit{x}}\phantom{\rule{thinmathspace}{0ex}}k(t)$

For a user specified confidence interval, a critical value is used to determine whether to accept the change-point. If *t _{max}* > critical-value, we reject

Following the completion of CP detection, times-series $x$ can be described as a series of *C*+1 intensity segments where *C* is the total number of CP identified given by an idealized state trajectory where ${y}_{n}$ ∈ $\{{\varphi}_{1},\dots ,{\varphi}_{K}\}$ and 1 ≤ *K* ≤ *C*+1. Like CP-HAC, the next goal is to discover the optimal number of states *K* into which to cluster the *C*+1 intensity segments generated by CP detection. Rather than iteratively merging each segment like bottom-up algorithms, we use divisive segmentation to cluster the data points in a top-down fashion. For divisive segmentation, all identified segments are partitioned into two unique clusters using the k-means algorithm, where the center of each cluster is described by the mean values of the CP-idealized data points within the cluster. For efficiency, DISC uses a modified k-means algorithm that is both deterministic and faster than standard implementations through use of triangle inequality for computational reduction (Elkan, 2003). Overall, this results in a series of transitions between two states $\text{}{y}_{1}=\left\{{y}_{{1}_{1}},\text{}\dots {y}_{{1}_{N}}\right\}$, where $y}_{{1}_{n}$ ∈ {$\varphi $* _{1},*$\varphi $

Now that the data-points are allocated to two separate clusters with identified transitions, the goal is to determine if $x$ is better fit with one or two states (${y}_{0}$ vs ${y}_{1}$). Like CP detection, this decision follows a hypothesis test where

To determine whether one or two states provides a better fit, we use the Bayesian Information Criterion (BIC) which is defined in a general form as

$BIC=-2\mathrm{ln}(\hat{\mathcal{\mathcal{L}}})+M\phantom{\rule{thinmathspace}{0ex}}ln(N)$

where $\hat{\mathcal{\mathcal{L}}}$ is the likelihood for the estimated model with $\mathcal{\mathcal{N}}\left(x\text{}\right|\text{}\mu ,\text{}\sigma )=\text{}\frac{1}{\sigma \sqrt{2\pi}}\mathrm{e}\mathrm{x}\mathrm{p}\left(\frac{-{\left(x-\mu \right)}^{2}}{{\sigma}^{2}}\right)$

$\hat{\mathcal{\mathcal{L}}}=\text{}\prod _{i=1}^{N}\sum _{j=1}^{K}{\pi}_{j}\ast \mathcal{\mathcal{N}}\left({x}_{i}\text{}\right|\text{}{\mu}_{j},\text{}{\sigma}_{j})$

To test the null hypothesis that $x$ is described by one state instead of two, BIC values are computed for $x$ with a fit of a single-state (BIC_{1}) and fit with two-states from divisive segmentation (BIC_{2}). If BIC_{2} > BIC_{1}, *H _{0}* is accepted and we believe $x$ is sufficiently described by a single-state. If the BIC

Self-termination of divisive segmentation results in a series of estimated states and transitions within the trajectory. While this algorithm is exceptionally fast owing to its top-down greedy design, the reliance on local choices for state assignments rather than evaluation of the entire trajectory for an optimal decision can produce sub-optimal results. This error often surfaces as an over-sampling of the number of states and an under-sampling of the kinetic transitions. Over-fitting the number of states results from a downward dissemination of error from early splits: if bisecting a given cluster is suboptimal, two different parent clusters may each produce highly similar and redundant sub-clusters thereby overfitting the number of states. Fortunately, this over-estimate of the number of states can be corrected using bottom-up clustering. Therefore, the second phase of DISC uses HAC to compute the similarities between all identified states at a global level and assess the fit of the whole trajectory rather than segmented portions. Like CP-HAC schemes, an objective function is used to determine the overall fit and number of states in the trace by merging highly similar clusters whose separation may arise during divisive segmentation (Figure 1b). For a general application, we continue to use BIC for evaluating fit vs complexity. The similarity between neighboring states $\varphi $* _{i}* and $\varphi $

$d({\varphi}_{i},\phantom{\rule{thinmathspace}{0ex}}{\varphi}_{j})=\sqrt{\frac{2{n}_{\varphi i}{n}_{\varphi j}}{{n}_{\varphi i}+{n}_{\varphi j}}}{\Vert {\mu}_{\varphi i}-{\mu}_{\varphi j}\Vert}_{2}$

The improvement of HAC on divisive segmentation for state detection is shown in Figure 2. Although divisive segmentation alone tends to slightly over-estimate the number of states, it provides a more reasonable estimate than CP detection alone (Figure 2a). The comparative performances in terms of speed and accuracy of these algorithms is further explored in Figure 2b. While CP detection alone (Figure 2b, blue) is very fast, it consistently yields a higher number of total states as compared to the ground truth. As CP-HAC frameworks must explore this large state space in its entirety, they can achieve higher accuracy than CP detection alone, but they are much slower algorithms (Figure 2b, green). In contrast, the use of top-down clustering in divisive segmentation dramatically reduces the total state-space for exploration, resulting in a faster algorithm than CP-HAC with much higher accuracy than CP detection alone (Figure 2b, purple). Finally, the sequential combination of divisive segmentation and HAC used in DISC lead to the highest state detection accuracy with minimal computational cost (Figure 2b, orange).

Following state refinement with HAC, the trajectory is again described as a series of temporal transitions between identified intensity states. Although the overall states are well estimated at this point, fast transitions are often missed during CP analysis of single-molecule trajectories (Hadzic et al., 2018). To ensure events are accurately detected, the final phase of DISC applies the Viterbi algorithm (Viterbi, 1967).

The goal of the Viterbi algorithm is to identify the most probable sequence of hidden states through a series of observations. In our scenario, we have *K* total states and *N* total observations in our trajectory . In a naïve manner, determining the most probable sequence of hidden states *y* could be accomplished by evaluating the likelihood of every possible hidden state sequence and choosing the most probable. However, as there are *K ^{N}* possible paths though the trajectory, this quickly becomes computationally intractable. A solution to this problem is the Viterbi algorithm, which makes use of dynamics programming to store only the most optimal state sequencing leading up to a given time point. In general, the Viterbi algorithm uses the observation that that the most probable state sequence leading up to data point n can be deduced by examining the most probable path leading up to the previous time point,

Formally, the Viterbi algorithm is described with a *K x N* trellis for states $j\in K$ and observations $n\in N$ (Figure 1a). Each cell of trellis ${v}_{n}\left(j\right)$ represents the probability of being in state *j* after seeing the first *n* observations and passing through the most probable state sequence for the given model parameters, $\lambda $. The value ${v}_{n}\left(j\right)$ is computed by recursively taking the most probable path up to this cell by

${v}_{n}\left(j\right)=\underset{{y}_{1},\dots {y}_{n-1}}{max}p({y}_{1}\cdot \cdot \cdot {y}_{n-1},{x}_{1}\cdot \cdot \cdot {x}_{n},{y}_{n}=j|\lambda )$

where $\lambda $ is first order Markov process of $\lambda =(\pi ,a,b)$. The primary components of a hidden Markov model include the initial probability of observing each state $\varphi $$v}_{n}\left(j\right)=\underset{\mathit{i}\phantom{\rule{thinmathspace}{0ex}}\in \phantom{\rule{thinmathspace}{0ex}}\mathit{N}}{max}\left\{{v}_{n-1}(i){a}_{ij}{b}_{j}({x}_{n})\right\$

$\psi}_{j}\left(n\right)=\underset{\mathit{i}\phantom{\rule{thinmathspace}{0ex}}\in \phantom{\rule{thinmathspace}{0ex}}\mathit{N}}{\mathrm{a}\mathrm{r}\mathrm{g}\mathrm{m}\mathrm{a}\mathrm{x}}\left\{{v}_{n-1}(i){a}_{ij}{b}_{j}({x}_{n})\right\$

where ${\psi}_{j}\left(n\right)$ is a helper function to store the $n-1$ state index $i$ on the highest probability path.Upon termination, the forward likelihood of the entire state sequence *y* up to time point N+1 having been produced by the given observations and HMM parameters is

$P\left(y|x,\lambda \right)=\underset{\mathit{i}\phantom{\rule{thinmathspace}{0ex}}\in \phantom{\rule{thinmathspace}{0ex}}\mathit{N}}{max}\left\{{v}_{T}(i)\right\}$

Deducing the most probable hidden state sequence $y$ through observations $x$ can be accomplished in a backtracking step by

${y}_{n}=\text{}{\psi}_{n}({y}_{n}+1)\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}N\ge n\ge 2$

To assess the improvement of the Viterbi algorithm for event detection, we simulated a two-state system with a constant k_{on} and varying k_{off} rate (Figure 2c). As shown previously (Hadzic et al., 2018), we found that results from CP detection alone were accurate for slower events, but often failed to identify faster transitions (Figure 2d). By refining the results of unsupervised clustering with the Viterbi algorithm, we found that event detection accuracy was significantly improved over CP detection and clustering alone across two-orders of magnitude of varying k_{off} (Figure 2d). Notably, as the changes in rates also affect the change in state occupancy, the high accuracy values returned after Viterbi refinement further highlight the power of DISC for resolving short-lived and rare transitions (Figure 2—figure supplement 1). In general, this improvement was anticipated since we are not the first to apply the Viterbi algorithm to the problem of idealization using unsupervised clustering. Although commonly used in the application of HMMs for hard assignment of data-points into *K* discrete states, the SKM algorithm has shown that the Viterbi algorithm can successfully decode a path sequence following state clustering using k-means as opposed to more rigorous HMM training procedures (Juang and Rabiner, 1990; Qin, 2004). Therefore, both SKM and DISC can yield the event detection power of standard HMM approaches without the need of rigorous model training. However, unlike SKM, DISC has the added benefit of identifying the states naturally without the need for any user supervision such as initial state specification. This makes DISC a powerful alternative as a computationally efficient unsupervised single-molecule analysis algorithm.

While the parameters used above are valid for trajectories with Gaussian noise, we do not claim they are optimal for all experimental modalities. We have intentionally developed DISC as a flexible framework for adaptation to different types of data. Although we used the Student’s t-test for CP detection in the presence of Gaussian noise, additional merit functions may be more appropriate in different situations (Song and Yang, 2017; Watkins and Yang, 2005; Li and Yang, 2019). The same holds true for the use of BIC for state selection as other objective function can be substituted as needed. For example, the harsh penalty for parameters in BIC can lead to underfitting in certain cases; therefore, less stringent Akaike information criterion (AIC) or Hannan-Quinn information criterion (HQC) may be more appropriate depending on the separation of states and noise (Akaike, 1974; Hannan and Quinn, 1979). It is important to note that while DISC performs idealization through unsupervised clustering, obtaining accurate results does require the user to determine the appropriate information criterion and CP detection methods as idealization results heavily depend on these variables. Critically, the central innovation of DISC is to take advantage of the best features of both top-down and bottom-up forms of cluster identification that leads to both fast and accurate state detection.

We validate DISC using simulated single-molecule trajectories using kinetic parameters obtained from our recent studies exploring the regulatory mechanisms of cyclic nucleotide binding domains (CNBDs) from hyperpolarization-activated cyclic nucleotide gated ion channels (HCN) which regulate pacemaking in heart and brain cells (Materials and methods) (Goldschen-Ohm et al., 2017; Goldschen-Ohm et al., 2016). In these experiments, isolated CNBDs are tethered into ZMWs whereupon we monitor the binding and unbinding dynamics of fluorescent cyclic nucleotides (e.g. fcAMP) at physiological concentrations to uncover the elementary dynamics associated with channel gating. While ligand binding has been observed at the single-molecule level via both FRET and CoSMoS (co-localization), we adapt our simulations to the latter case so our dynamics are not limited in time by acceptor photobleaching. Notably, trajectories obtained with CoSMoS exhibit heterogeneous bound intensity values which vary with each binding event (Figure 3—figure supplement 1). While we are uncertain as to the exact source of this fluctuation, it is likely caused by shifts of the molecule in the heterogeneous excitation field of the ZMW or dye photodynamics (Levene et al., 2003; Dempsey et al., 2009). While the excitation field changes particularly sharply in ZMWs, TIRF and confocal microscopy also contain a heterogenous excitation field (Moerner and Fromm, 2003). Minor changes in apparent dye brightness due to dye conformational or photodynamics (such as in Protein-induced fluorescence enhancement, PIFE), shifts of dye orientation, or partial quenching via electron transfer are all commonly observed (Stennett et al., 2015). Thus, heterogeneous intensity values are a common and inconvenient feature in real life single-molecule fluorescence data. Including this additional noise source in our simulations yields a closer representation of experimentally obtained data.

In total, we simulated 4000 trajectories composed of 2000 data points each, totaling 8 × 10^{6} data points. Each trajectory is 200 s in duration collected at frame rate of 10 Hz. We varied the complexity of the trajectory by simulating one to four independent CNBDs inside a given ZMW (two to five intensity states) and vary the signal to noise ratio (SNR) according to typical values from a ZMW experiment using Gaussian noise (Materials and methods) (Goldschen-Ohm et al., 2017; Goldschen-Ohm et al., 2016). We include the observed heterogenous bound intensities by randomly modulating each binding event according to our fit of the experimentally observed data (Figure 3—figure supplement 2). Given that each simulation features a different number of possible states, the total time spent within each state changes; therefore, these simulations also address the ability to capture states with unequal and even rare observation probabilities (see Figure 3—source data 1). We benchmark the results of DISC against commonly used HMM and CP-HAC methods: vbFRET and STaSI (Bronson et al., 2009; Shuang et al., 2014). These algorithms were chosen following the results of a recent comparative study that determined these to be the best performers among their class of analysis methods (Hadzic et al., 2018). In addition, DISC, STaSI and vbFRET all perform trajectory-by-trajectory idealization and are written entirely in MATLAB (MathWorks) which standardizes computational performance (Materials and methods).

Across all the simulations, DISC provides the highest average accuracy, precision and recall (Figure 3a, terms defined in Materials and methods). While no algorithm can idealize a trajectory in the presence of SNR = 1, DISC returns the lowest accuracy at SNR = 2. We suspect this result is due to the use of robust BIC for state detection; accuracy would likely be improved with less penalizing objective functions, such as AIC. While vbFRET performs the best at SNR = 2, the overall accuracy is still quite low: an average accuracy value for each number of simulated states is near chance. This low value demonstrates the inability of many algorithms to analyze data in presence of high noise and reinforces the common practice of discarding noisy data to create a more reliable dataset. For SNR > 3, which accounts for most of our experimentally obtained data (Figure 3—figure supplement 1c), DISC performs exceptionally well with highest average accuracy (0.91 ± 0.05) and is robust against false positives (precision = 0.96 ± 0.04) and false negatives (recall = 0.93 ± 0.03) across all simulated conditions (Figure 3a). While vbFRET matches the recall of DISC in this SNR range (0.94 ± 0.05), the tendency to overfit the number of states at higher SNR lowers precision (0.80 ± 0.18) and overall accuracy (0.76 ± 0.19). We find STaSI returns the lowest overall accuracy (0.47 ± 0.17) likely do to an overfitting the number of states (precision = 0.57 ± 0.2) and a tendency to miss transitions (recall = 0.75 ± 0.10). Notably, DISC is the only method unaffected by inclusion of heterogeneous state intensities of fcAMP likely due to the use of a Gaussian derived BIC for state selection (Figure 3—figure supplement 3).

Critically, DISC not only returned high accuracy results, DISC was also much faster than the other methods. Idealization of all 4000 trajectories by DISC was completed in just over two minutes, whereas STaSI took over fifteen minutes and vbFRET took over twelve hours. To thoroughly explore the computational efficiency of DISC, we simulated data with increasing durations per trajectory at a constant SNR and number of states. Remarkably, we find DISC is 400-fold to 1,200-fold faster than vbFRET and 2-fold to 8,700-fold faster than STaSI due to STaSI’s quadratic time dependence (Figure 4a; Shuang et al., 2014). For example, a trajectory of 10^{6} data points can be analyzed by DISC in 10 s compared to 3 hr for vbFRET and 27 hr for STaSI. Thus, DISC can handle the analysis of long trajectories, unlike CP-HAC methods. While fluorescence measurements from a single fluorophore at room-temperature rarely contain this many data points, large trajectory lengths are common in non-fluorescence experiments or fluorescence experiments with replenishing fluorescent labels such as in single-molecule genome sequencing and studies of catalysts via fluorogenic reactions (Eid et al., 2009; English et al., 2006; Sambur et al., 2016). To evaluate performance on more typical data, we compared the results of each algorithm on simulated smFRET trajectories that are limited in duration by acceptor photobleaching (Figure 4—figure supplement 1, Materials and methods). For simulations featuring two or three states FRET, we find DISC 5.5-fold faster than STaSI and 235-fold faster than vbFRET while maintaining the highest accuracy. This result validates the use of DISC for the analysis of large volumes of shorter fluorescence trajectories, especially compared to HMM approaches. This feature is particularly important as advances in hardware such as CMOS cameras and lab-on-chip methods generate larger smFRET data sets (Juette et al., 2016).

Finally, we evaluate the effect of trajectory duration on DISC accuracy. As expected, we find that the accuracy increases with increasing number of data points per trajectory. This result also indicates the minimum number of data points needed for an accurate idealization for a given SNR (Figure 4b). Overall, the results of our simulations suggest DISC is more or comparably accurate and critically, is substantially faster than standard idealization approaches, making it an enabling technology for analysis of high-throughput single-molecule experiments.

To verify performance of DISC in an experimental configuration with high volumes of experimental data, we analyzed a large single-molecule data set obtained from ZMWs that explore HCN dynamics (Figure 5a). Previous macroscopic studies of HCN channel gating have revealed that ligand binding to CNBDs exhibits both positive and negative cooperativity depending on the ligation state and the membrane potential (Kusch et al., 2012; Thon et al., 2015). Cyclic AMP regulates cardiac pacemaking via HCN channels and, therefore, this unusual allostery has significant physiological implications. However, as this allosteric analysis was based on global fits of ensemble binding data, the reliability of model parameters remains an open question (Hines et al., 2015). Evaluating cooperativity is an experiment well-suited for single-molecule investigation given the ability to observe the total time a molecule spends in each liganded state and directly extract state transition probabilities. Therefore, to directly assess the cooperativity between HCN2 CNBDs upon ligand binding, we use single-molecule fluorescence and monitor the binding of individual fcAMP molecules to our previously described tetramerized CNBDs inside ZMWs. (Figure 5a; Goldschen-Ohm et al., 2016).

Our initial dataset included 13,670 ZMWs each monitored for 800 s at a sampling rate of 10 Hz (Materials and methods). All trajectories were obtained in the presence of 1 µM fcAMP which is near the ligand dissociation constant for individual CNBDs (Goldschen-Ohm et al., 2016). As shown with other high-throughput collection platforms, an essential part of analysis at this scale is the application of stringent criteria to select traces that yield meaningful information about the system (Chen et al., 2014; Juette et al., 2016). Therefore, we first analyzed all trajectories with DISC to find reliable data prior to trace selection. DISC successfully processed this entire data set within 20 min using a standard MacBook Air (1.6 GHz Intel Core i5). The same analysis completed with STaSI yielded unphysical results in 4 hr (Figure 5—figure supplement 1). We estimated analysis with vbFRET would take weeks to complete and was therefore not performed. While correcting for non-specific binding is often a necessity in CoSMoS experiments, we find the passivated surfaces within the ZMWs greatly reduce non-specific absorption of fcAMP to either the metallic or glass surfaces, thus minimizing this concern (Figure 5—figure supplement 2; Smith et al., 2019; Eid et al., 2009; Foquet et al., 2008). Using the idealized fits obtained from DISC, we screened our data to select reliable trajectories for our analysis. Standard cut-offs in state separation, the total number of observed states, and a filter for kinetic activity ensured that each trajectory arose from a ZMW featuring a singly occupied and functional tetrameric CNBD. Notably, we noticed an asynchronous decay of protein activity over excitation time that may be caused by singlet-oxygen formation and subsequent inhibition of cyclic nucleotide binding to CNBDs (Figure 5—figure supplement 3; Idikuda et al., 2018). In conclusion, we retained 293 molecules totaling 1.2 × 10^{5} s of combined protein activity across 53,474 events.

To determine if the binding of cAMP to CNBDs is cooperative, we first calculated the total time each molecule spends in each of the liganded states (0 to 4 fcAMPs) using the state assignments from the idealized fits (Figure 5b, Figure 5—figure supplement 4). The resulting distribution of state occupancies is fit with a binomial distribution to evaluate the independence of each CNBD (Figure 5c, Materials and methods). Our binomial fit matches the distribution well and returns the probability of occupancy at 1 µM fcAMP for a single CNBD in the tetramer as 31% which is similar to our previous monomeric CNBDs studies (Goldschen-Ohm et al., 2016), suggesting a lack of cooperativity between the CNBDs. Notably, our measured state-occupancy distribution is strikingly different that than the unusual cooperativity modeled from either activated or non-activated channels (Figure 5—figure supplement 5; Kusch et al., 2012; Thon et al., 2015). We further explored the underlying dynamics of our data using the idealized single-molecule transitions obtained from DISC. Using QuB, we built a simple HMM of sequential ligand binding across four binding sites that was globally optimized across each molecule’s idealized state trajecotry (Nicolai and Sachs, 2013; Qin et al., 2000; Figure 5d, Materials and methods). As expected for non-cooperative processes, the optimized k_{on} and k_{off} rates for each state transition exhibit a strong linear relationship (Figure 5e). Combined, these results strongly suggest that CNBD units act independently during ligand binding. We postulate that the macroscopically observed cooperativity is either an artifact of model fitting or that it requires the presence of the transmembrane domains of the HCN channel and is not an intrinsic property of the CNBDs.

We developed a new algorithm for rapid and accurate unsupervised idealization of single-molecule trajectories. Our approach combines unsupervised statistical learning of discrete states with the event detection power of the Viterbi algorithm to quickly identify both significant states and transitions in a model-independent manner. Software implementing the DISC algorithm that includes a graphical user interface is available at https://github.com/ChandaLab/DISC (Chanda et al., 2019; copy archived at https://github.com/elifesciences-publications/DISC).

Like CP-HAC methods, DISC is not a fully probabilistic approach. While fully probabilistic HMM training approaches are beneficial for providing unbiased estimates of the parameter distributions, their high accuracy comes at the cost of significantly increased computational time. This cost is especially apparent for the recently developed infinite HMM approaches that aim to learn the true number of states from a potentially infinite number of possibilities. However, accomplishing this task costs hundreds of iterations per trace to provide a reproducible fit with some approaches taking days to analyze single trajectories (Hines et al., 2015; Sgouralis and Pressé, 2017; Sgouralis et al., 2018). Thus, while exhaustive search algorithms may be desirable in other contexts, they are clearly not suited for large datasets associated with high-throughput experiments. In contrast, by simulating data closely resembling the binding of fcAMP to pacemaker channels, we find that DISC surpasses the accuracy of common CP-HAC and HMM algorithms with a dramatic improvement in computational speed. Therefore, DISC satisfies the need for accuracy and speed in high-throughput analysis. In this regard, DISC is like the SKM algorithm for estimating the parameters of an HMM without direct HMM training. However, unlike SKM which relies on user-specified states, DISC uses unsupervised statistics to learn the states. Therefore, DISC offers the idealization power of SKM with the state-learning capabilities of CP-HAC.

We used DISC to analyze a large dataset obtained from ZMWs to evaluate cooperativity between CNBDs from pacemaker ion channels upon ligand binding. The rapid and robust idealization provided by DISC enabled stringent trace selection to ensure only reliable trajectories were analyzed. These data show that, in contrast to model inferences from ensemble measurements of HCN2 channels, CNBD tetramers do not exhibit cooperative ligand binding. This result suggests that allosteric interactions between binding sites may be coordinated by the channel’s transmembrane domains.

Although we have demonstrated the use of DISC on single-molecule fluorescence data, the framework can be easily extended to other data paradigms due to its modular nature. For example, the use of BIC for state determination or the Student’s t-test for change-point analysis could be interchanged with other information theoretic approaches or merit functions where appropriate. To allow for easy comparison of a given data set, the provided software and graphical user interface (GUI) allows the user to select from several options the desired parameters such as choice of information criteria. This flexibility makes DISC suitable for a wide array of experimental data provided it can be described as a series of transitions between discrete states, including, for example, single-channel current recordings, force spectroscopy and smFRET. However, there is no inherent knowledge within DISC to consider various sources of experimental noise, such as photo-blinking or baseline drift; therefore, correcting for these noise sources prior to DISC analysis will likely improve idealization accuracy.

Finally, our results show that DISC provides a dramatic improvement in computational speed over current state-of-the-art approaches while either improving or maintaining high accuracy for both state determination and event detection. This increase in speed is directly applicable to analyzing the growing datasets obtained in single-molecule fluorescence paradigms to adequately sample population dynamics. For example, the use of sCMOS camera enables smFRET measurements of tRNA conformational changes during protein translations across thousands of molecules simultaneously with millisecond resolution (Juette et al., 2016). Additionally, magnetic tweezers have enabled week-long mechanical measurements of single-protein folding and unfolding, shifting observable dynamics to pathological time-scales and allowing the detection of rare events (Popa et al., 2016). Thus, highly computationally efficient and robust algorithms such as DISC may be well suited for analysis of a wide variety of single molecule datasets beyond the standard smFRET data.

Reagent type (species) or resource | Designation | Source or reference | Identifiers | Additional information |
---|---|---|---|---|

Biological sample | GCN4pLI-HCN2 tetramer | DOI: 10.7554/eLife.20797 | ||

Chemical compound, drug | Protocatechuate 3,4-Dioxygenase | Millipore-Sigma | cas no. 9029-47-4 | from Pseudomonas sp. |

Chemical compound, drug | 3,4-Dihydroxybenzoic acid | Millipore-Sigma | cas no. 99-50-3 | Protocatechuic Acid |

Chemical compound, drug | (±)-6-Hydroxy-2,5,7,8-tetramethylchromane-2-carboxylic acid | Millipore-Sigma | cas no. 53188-07-1 | Trolox |

Chemical compound, drug | 8-[DY-547]-AET-cAMP | BIOLOG | Cat. No.: D 109 | DOI:10.1016/j.neuron.2010.05.022 |

Software, algorithm | MATLAB | MathWorks | RID:SCR_001622 | |

Software, algorithm | QuB | DOI: 10.1142/1793048013300053 | https://qub.mandelics.com/ | |

Software, algorithm | DISC | This work | https://github.com/ChandaLab/DISC | |

Software, algorithm | vbFRET | DOI: 10.1016/j.bpj.2009.09.031 | http://vbfret.sourceforge.net/ | |

Software, algorithm | STaSI | DOI: 10.1021/jz501435p | https://github.com/LandesLab/STaSI | |

Other | Zero-Mode Waveguide | Pacific Biosciences | Non-Commercial Zero-Mode Waveguides (Astro Chips) |

Single-molecule trajectories were simulated as a Markov process of transitions between discrete states. All simulations were performed with a frame rate of 10 Hz and featured variable total durations, SNR, and number of states. The primary kinetic scheme used was adapted from our recent studies of fcAMP binding to isolated monomeric CNBDs (Goldschen-Ohm et al., 2016). This model is a four-state scheme where both the unbound (U) and bound states (B) exhibit conformational changes (U’ ⇔ U ⇔ B ⇔ B’), yet exhibit only two different observable states (ie, U’/U are indistinguishable via fluorescence intensity, as are B/B’). fcAMP binding occurs between U and B. The rate constants (s^{−1} or M^{−1} s^{−1}) are: k_{U’U} = 0.15; k_{U,U’}=0.04; k_{U,B} = 2.3x10^{−6} * [fcAMP]; k_{B,U} = 0.95; k_{B,B’}=0.51; k_{B’,B} = 0.31 at 1 µM fcAMP. To mimic the tetrameric nature of HCN channels with no cooperativity, we extrapolated up to four bound states by summing independent CNBD trajectories prior to the addition of noise. To include realistic SNR, state-intensities, and heterogeneity distribution of bound intensities, we analyzed the direct fcAMP excitation and emission trajectories following acceptor photobleaching from the monomeric CNBD dataset used in our previous work (Figure 3—figure supplement 1; Goldschen-Ohm et al., 2016). This dataset consisted of 861 single molecules for a combined acquisition time of 44,090 s (4775 total binding events). All trajectories had a SNR >2 and all events persisted for longer than two frames, which resulted in an imbalance in the bound and unbound events. For each simulated trajectory, state intensities were each drawn from log normal distributions fit to monomeric CNBD single-molecule data, with average intensities between subsequent states being uniform. Gaussian noise was applied to trajectories at specified SNR. To quantitate the heterogeneous intensities from fcAMP binding, the mean of individual bound event intensities were taken for each identified event, so long as the event was >2 frames in duration. Heterogeneity was computed as the absolute percent difference for each event vs the mean bond intensity for the given trajectory by:

$Percent\text{}Heterogenity\text{}=\left|\text{}\frac{I{}_{event}-\text{}I{}_{bound}}{\text{}I{}_{bound}}\text{}\right|\times \text{}100\mathrm{\%}$

The heterogeneity of unbound events was minimal and was therefore not included in the simulations. For each simulated event, heterogenous bound intensity emissions were each drawn from an exponential fit monomeric CNBD single-molecule data. Gaussian noise was added to trajectories as specified.

Simulated smFRET data were downloaded from the kinSoftChallenge on June 11^{th}, 2019 (https://sites.google.com/view/kinsoftchallenge/home). Data used came from the provided training data sets titled: ‘Level 1’ and ‘Level 2’ with folder names ‘sim_190212_194543_level1’ and ‘sim_190212_202530_level2’.

DISC, STaSI, and vbFRET are all written entirely in MATLAB (MathWorks). Each algorithm was used outside of their graphical user interfaces (GUIs) to more accurately compare the computational time of native functions within each algorithm. User parameters in DISC include: the confidence interval of CP detection and the objective function for clustering. Unless otherwise stated, a 95% confidence interval was applied for CP detection and BIC was used for all clustering. For analysis with STaSI and vbFRET, we used the recommended default values set by their authors (Bronson et al., 2009; Shuang et al., 2014). For STaSI, this means a 99.8% confidence interval of CP detection. In vbFRET, users must provide the number of states and fitting attempts per trace (left at the default value of 10). To circumvent providing the number of states, we modified the provided *vbFRET_no_gui.m* script to perform analysis outside of the vbFRET GUI. The modified script begins by fitting the trace to one state and increases the number of states until two more beyond the number of states with the maximum evidence to ensure the maximum fit has been obtained. As no changes were made to native vbFRET functions, implementing this script has no effect on vbFRET’s accuracy. We expect changing parameters in both STaSI and vbFRET may lead to different results; however, it was not our goal to optimize the use of these algorithms. Also, as a thorough investigation into the performance of STaSI and vbFRET has been conducted elsewhere, we did not investigate why these algorithms presented lower performance than DISC (Hadzic et al., 2018).

All quantifications of computational time were performed using the tic and toc functions in MATLAB. For idealization accuracy, each event returned by a given algorithm is classified as a True Positive (TP), False positive (FP), or False Negative (FN). We define a TP as being in the correct state (±10% the correct intensity level’s standard deviation) and correct event duration (±1 frame) for a given simulated event. FPs are either added events or correct events in the wrong state. FNs are missed events. For each trajectory, we computed accuracy, precision, and recall as:

$Accuracy=\frac{TP}{(TP+FP+FN)}$

$Precision=\frac{TP}{(TP+FP)}$

$Recall=\frac{TP}{(TP+FN)}$

Accuracy represents the overall performance, whereas precision and recall highlight the false positive error rate (overfitting the data) and false negative rate (underfitting the data), respectively.

The expression, purification, biotinylation, and fluorescence labeling of tetrameric CNBDs were performed as previously described (Goldschen-Ohm et al., 2016). Non-commercial arrays of ZMWs were purchased from Pacific Biosciences. These waveguides featured a polyphosphonate passivation layer on the aluminum walls and a biotinylated polyethylene glycol (PEG) layer on the glass surface to reduce non-specific binding (Foquet et al., 2008; Eid et al., 2009). The PEG-Biotin surface was incubated with 0.05 mg/mL streptavidin (Prospec, cat # PRO-791) for 5 min in a buffer containing: 40 mM HEPES, 600 mM NaCl, 20% glycerol, 2 mM TCEP, 0.1% LDAO (Sigma, cat no. 40236), 2 mg/mL bovine serum albumin (BSA), 1 mM Trolox (Sigma, cas no. 53188-07-1), 2.5 mM protocatechuic acid (Sigma, cas no. 99-50-3) (PCA), pH 7.5 (Buffer A). After incubation, the ZMW chip was thoroughly rinsed with Buffer A to remove unbound streptavidin. Next, biotinylated tetrameric-CNBDs were diluted in Buffer A with the addition of the PCA/PCD oxygen scavenging system by adding 250 nM of protocatechuate 3,4-dioxygenase (PCD) from Pseudomonas sp. (Sigma, cas no. 9029-47-4) to between 100 pM and 2 nM for surface immobilization in ZMWs (Buffer B) (Aitken et al., 2008). This resulted in ≈100 occupied ZMWs out of the total ≈1000 ZMWs per field of view identified by fluorescence bleach steps of DY-650 that labels each of the four CNBDs. Fluorescently labeled cAMP (fcAMP; 8-(2-DY-547]-aminoethylthio) adenosine-3’,5’-cylic monophosphate) (BioLog, cat # D 109) was added at 1 µM for all single-molecule experiments in Buffer B.

ZMW chips were placed on top of an inverted microscope (Olympus IX-71, 100X, NA 1.49) and imaged under 532 nm (60 W/ cm^{2}) or 640 nm (25 W/ cm^{2}) (Coherent) as described previously (Goldschen-Ohm et al., 2017; Goldschen-Ohm et al., 2016). The only notable difference is that unlike previous experiments we did not use FRET to monitor binding in order to obtain data for extended periods (Goldschen-Ohm et al., 2016). We excited DY-650 with 640 nm to identify ZMWs featuring DY-650-labeled tetrameric-CNBDs. Next, fcAMP was continuously imaged with 532 nm for 8000 frames at 10 Hz to monitor binding activity. All emission spectra were split with a 650 nm long pass dichroic (Semrock Brightline FF650) and bandpass filtered using pairs of edge filters (532–623.8 nm, 632.9–945 nm; Semrock Cy3/Cy5-A-OMF) and imaged onto two separate EMCCDs (Andor iXon Ultra X-9899) using Metamorph software (Molecular Devices). All data were collected using ZMWs of 150–200 nm diameter which are large enough to accommodate tetrameric CNBD complex (each monomeric CNBD is 4 × 6×20 nm) (Goldschen-Ohm et al., 2016).

All analysis was performed using custom software written in MATLAB (Mathworks) or ImageJ. Single-molecule trajectories of each ZMW were extracted from tiff stacks saved by Metamorph software using MATLAB. Locations of ZMWs were obtained using a threshold mask of the brightfield image of the whole ZMW array. ZMW locations were refined with a 2D Gaussian fit to the local intensity height map. The time-dependent fluorescence at each ZMW was obtained by projecting the average image intensity in a 5-pixel diameter circle onto the ZMW location throughout each image in the stack.

A total of 13,670 individual ZMWs (1.1 × 10^{8} data points) were processed from the single-molecule tetrameric CNBD experiments. Each trajectory was idealized with DISC using a 95% confidence interval for CP detection and BIC for state selection (Figure 5—figure supplement 1). To reflect the ability to cleanly resolve the individual occupation states, we computed the separation of each sequential state vs the noise within a state by:

$State\text{}Separation\text{}=\text{}\frac{1}{K-1}\sum _{i=2}^{K}\frac{\left({\mu}_{i}-\text{}{\mu}_{i-1}\right)}{{\sigma}_{i-1}}$

where K is the total number of states, µ is the mean intensity value of a state, and σ is the standard deviation of the data points belonging to a state. This ensures that states are separated well enough to resolve, as would be expected for sequential ligand binding. Traces featuring 4 to 6 identified states with state separation ≥ 3 were retained for further analysis.To ensure a given trajectory contained a functional tetrameric CNBD, we kept traces that spent less than 50% of the time in the unbound state, resulting in a total of 480 trajectories for visual inspection. The observed asynchronous decay of protein activity was corrected using the CP detection method to identify the most likely point in a given trajectory where protein behavior dramatically changed. This was accomplished using MATLAB's findchangepoint function using the change in standard deviation as the statistic. Data points following the identified CP location were discarded from the analysis to include only the frames of consistent fcAMP binding to presumably functional proteins (Figure 5—figure supplement 3).

In conclusion, a total of 293 molecules totaling 1.2 × 10^{5} s (≈34.5 hr) of combined protein activity across 53,474 events was included for the final analysis. Each trajectory exhibited four or five conformational states (3 to 4 fcAMPs bound). Binomial fitting of the total time spent in each state was performed using MATLAB’s mle function. HMM modeling of single-molecule binding events was performed with QuB (Nicolai and Sachs, 2013; Qin et al., 2000). Idealized trajectories from DISC were exported to QuB with the first and last events removed. A sequential model of 0 to 4 ligand binding sites was globally optimized to simultaneously describe the idealized binding trajectories for all molecules.

This paper was supported by the following grants:

We thank Dr. Mike Sanguinetti for the wild-type HCN2 plasmid and Dr. Vadim A Klenchin for the purification of the tetrameric-CNBD. We also thank Owen Rafferty for his assistance in the development of the GUI for running the DISC algorithm.

Reviewing editor, *eLife*.

No competing interests declared.

Conceptualization, Resources, Data curation, Software, Formal analysis, Investigation, Methodology.

Conceptualization, Software, Formal analysis, Investigation, Methodology.

Conceptualization, Supervision, Investigation, Methodology, Project administration.

Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Project administration.

Simulated and raw data in addition to analysis scripts are available at https://doi.org/10.5281/zenodo.3727917.

The following dataset was generated:

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

**Acceptance summary:**

This manuscript by White et al. describes a new statistical method for event detection in single-molecule biophysics. The performance of the new algorithm and workflow is higher than that of other well-established methods of single-molecule statistical analysis when probed against the same benchmarks. This new methodology combines a new top-down segmentation approach with other well-established event detection methods to yield a procedure that should accelerate and facilitate parameter extraction from single-molecule experiments.

**Decision letter after peer review:**

Thank you for submitting your article "High-Throughput Single-Molecule Analysis via Divisive Segmentation and Clustering" for consideration by *eLife*. Your article has been reviewed by three peer reviewers, including Leon D Islas as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by Olga Boudker as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Fred Sigworth (Reviewer #2).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

The authors present us with a manuscript describing a new method to carry out idealization and parameter estimation of single-molecule data. The method merges several good-performing algorithms to segment the data and find transitions between states and finally discover the most probable number of states giving rise to the data. Importantly, they compare the performance of their DISC algorithm to other two well established methods and discover that DISC is faster and perhaps more accurate in certain experimental conditions.

The reviewers agree the DISC method is an important contribution to the arsenal of single-molecule trajectories analysis methods and the authors have provided a good characterization of its performance. The paper has merits and the method is an important contribution to the field, however, there are a number of important comments that were raised that need to be dealt with by the authors.

Essential revisions:

The paper needs to be rewritten in a manner that enough formal description of the method is given to allow the target readership (SM experimentalists, presumably) to understand how it works and what exactly it does. As it is, the paper reads more like an "advertisement" of an algorithm described in detail elsewhere.

1) The new Divisive Segmentation (DS) algorithm is described in general terms in the text but a formal description is not provided. A reasonable description of the CP analysis portion of DS is provided in Supplemental Note 2 (though a suitable pointer to this from the main text is not given). In the text you then note that you employ a modified k-means algorithm. The reader asks, what is the modification? The citations are unhelpful. Finally, the HAC part is also fairly well described in Supplemental Note 3. But the heart of the DS algorithm does not receive a similarly formal description. A similar problem arises with the mention of the Viterbi algorithm. We are told that you "use" it, but the only references are to the ion-channel and speech-recognition literature. Beside not explaining briefly what it does (for the naive reader's benefit), you do not explain what emission probability densities you use, nor how you set transition probabilities.

2) A question that arises regarding assumptions and parameters: how well does the DS/HSC perform for traces with rarely-visited states? Your selection of total fraction bound > 0.5 (subsection “Trace selection and analysis of binding activity”) suggests that your approach might be best for situations where the various states are roughly equi-probable. An important point to check is the performance of DISC as a function of the kinetics of the underlying process. How well does DISC estimates the number of states for fast kinetics, with a small number of data points in each state (short dwell-times)? One of the goals of single-molecule analysis is the determination not only of the most probable number of states, but also the dwell-times in each state and the transition rates between them. Was an effort made to include dwell-time estimation in DISC?

3) The authors need to be careful using the term high throughput as this means different things to different scientific fields. For instance, high throughput drug discovery often means looking at millions of compounds. Certainly 4000 molecules is relatively high throughput compared to a handful, but even the earliest single-molecule fluorescence papers included a hundred traces, so does a 40x increase justify the semantic transition to high throughput?

4) As authors show in the assessment of simulated traces in Figure 3, none of the methods compared work in a meaningful way when looking at SNR less than or equal to 3. The data simulated in the manuscript appears to be >10:1 according to a first approximation. So none of the methods assessed does a meaningful job revealing the 'true' number of states even when the data appear to be very good.

5) In the Introduction and Discussion sections, the authors downplay the general usefulness of existing Markov modeling methods. This is unfortunately misleading. The value of Markov modeling for time-series data analysis is proven in the extensive literature across a large number of fields using a variety of experimental approaches. While recent advancements in data throughput (e.g., sCMOS cameras) have put pressure on these algorithms to be efficient, this is a relatively minor problem that can be overcome by better implementations. Related to the above, Supplementary Note 1 includes a great deal of important introductory information that does a better job of placing DISC into the context of existing methods. This information belongs in the main text to substantively review the subject matter.

6) Given that the authors know (or should reasonably expect) that their experimental system has exactly five states, how does their "model free" approach compare to a traditional, model-based approach with a reasonable starting state (for example using SKM or any of the many HMM algorithms implemented in QuB, SPARTAN, or SMART)? This is important to justify the method being widely valuable, rather than just an incrementally better change-point algorithm.

7) The manuscript is lacking algorithm and implementation details. For example, the authors claim to use the Viterbi algorithm for idealization of time traces using the output of DISC, but the details of this step are never described. Looking at the code, it appears that the authors actually implemented the segmental k-means algorithm (SKM; Qin, 2004) using the state sequence from DISC as the input for the first iteration. SKM iteratively runs the Viterbi algorithm to identify the optimal state sequence followed by re-estimation of model parameters using this state sequence. Strictly speaking DISC does use Viterbi, but the description in the manuscript is not as novel as would appear. This is a critical point because it suggests that DISC just implements an existing, widely-used approach (SKM), and only differs in that it provides a means to initialize the starting model. If this is the case, it should simple be stated in this way. If this is not the case, then further clarity is definitely needed.

8) The SNR of the simulated traces in Figure 2A and Figure 3B seem to be far higher than specified. The most meaningful SNR, as it's usually defined, is the fluorescence intensity relative to the standard deviation of fluorescence background noise (or noise within the signal, if it is static). Correspondingly, SNR is a measure of experimental noise and is independent of the interpretation of the data. The authors instead define SNR in an unusual way that measures the noise in the signal relative to the separation between states (see Equation 2 in the Materials and methods). This may be a valuable metric, but it should not be called SNR. This is very misleading for the field. As used, it is also problematic because it has a circular dependence on the interpretation of the data.

9) In the system that was used to demonstrate the power of the DISC method the authors select for high SNR data and they know that there are five states. The question they sought to answer was whether or not the data exhibited random or cooperative transitions between states, hence, it is not clear that the data even needed to be idealized to assess this question. Gaussian fitting can be used to define the number of states – particularly when the SNR is this high; the idealization thus only provides access to kinetic information, which the authors do not appear as interested in. Hence, the biological question that appears to be of interest relates to the question of cooperativity potentially arising from the transmembrane domains, and this would have been a good way to showcase the method, but this assessment is not included.

10) Some important performance benchmarks are not discussed in the manuscript. For example, how well does it deal with changing or drifting baselines? It is a common feature of fluorescence measurements to be affected by steady reduction in the average fluorescence (drift). This is sometimes also encountered in other type of single-molecule experiments. It would be important for the authors to comment on the performance of DISC under such circumstances.

Essential revisions:

The paper needs to be rewritten in a manner that enough formal description of the method is given to allow the target readership (SM experimentalists, presumably) to understand how it works and what exactly it does. As it is, the paper reads more like an "advertisement" of an algorithm described in detail elsewhere.

We have taken this comment seriously and have made dramatic rewrites to the paper, particularity in the describing how the DISC algorithm works. The motivation and mathematics behind the DISC algorithm have been added in a newly written “Theory” section of the manuscript.

1) The new Divisive Segmentation (DS) algorithm is described in general terms in the text but a formal description is not provided. A reasonable description of the CP analysis portion of DS is provided in Supplemental Note 2 (though a suitable pointer to this from the main text is not given). In the text you then note that you employ a modified k-means algorithm. The reader asks, what is the modification? The citations are unhelpful. Finally, the HAC part is also fairly well described in Supplemental Note 3. But the heart of the DS algorithm does not receive a similarly formal description. A similar problem arises with the mention of the Viterbi algorithm. We are told that you "use" it, but the only references are to the ion-channel and speech-recognition literature. Beside not explaining briefly what it does (for the naive reader's benefit), you do not explain what emission probability densities you use, nor how you set transition probabilities.

We thank the reviewers for pointing out these shortcomings in our earlier version. We have added a new Theory section to clarify all aspects of the DISC algorithm. Details from the previous Supplementary Notes 2 and 3 have been added to the main text to elaborate on change-point detection, unsupervised clustering, and the use of the Viterbi algorithm. New text has been added to provide a formal introduction to the novel divisive segmentation algorithm. All modifications are included in the new Theory section.

2) A question that arises regarding assumptions and parameters: how well does the DS/HSC perform for traces with rarely-visited states? Your selection of total fraction bound > 0.5 (subsection “Trace selection and analysis of binding activity”) suggests that your approach might be best for situations where the various states are roughly equi-probable. An important point to check is the performance of DISC as a function of the kinetics of the underlying process. How well does DISC estimates the number of states for fast kinetics, with a small number of data points in each state (short dwell-times)? One of the goals of single-molecule analysis is the determination not only of the most probable number of states, but also the dwell-times in each state and the transition rates between them. Was an effort made to include dwell-time estimation in DISC?

We agree with the reviewers that the ability to detect rarely-visited states and fast dwell times are important features of any idealization algorithm. We would like to clarify that our use of trace selection based on at least 50% bound fraction criteria was intended to ensure that we are analyzing trajectories of functional proteins. This criterion was applied only to the experimental data but not to simulated data traces which were used for benchmarking DISC with other existing idealization algorithms. Therefore, we do not think that the utility of DISC is limited to trajectories containing at least 50% bound fraction.

Our simulations also highlight the ability of DISC to resolve short lived states and fast dwell times. Figure 2 plots event detection vs varying rates across 2 orders of magnitude, wherein the fastest rate of k_{off} = 1 frames^{-1} is present on average ~3% of the time. This figure shows that DISC maintains a high precision and recall for fast rates. Similarly, our multiple ligand binding simulations show, as expected, that the state occupancies for each additional bound state is progressively lower and the state with four ligands appears only 1% of the time. DISC can resolve these states with high accuracy and precision.

To emphasize the changing state occupancies in each simulation, we have added Figure 2—figure supplement 1 and Figure 3—source data 1. Figure 2—figure supplement 1 shows the occupancy of each of the two states with varying k_{off} values and plots accuracy values as a function of state occupancy. Figure 3—source data 1 contains a table that shows the average time spent in each liganded state for each simulated number of CNBDs. In regards to Figure 2—figure supplement 1, the following text has been added:

“Notably, as the changes in rates also affect the change in state occupancy, the high accuracy values returned after Viterbi refinement further highlight the power of DISC for resolving short-lived and rare transitions (Figure 2—figure supplement 1).”

3) The authors need to be careful using the term high throughput as this means different things to different scientific fields. For instance, high throughput drug discovery often means looking at millions of compounds. Certainly 4000 molecules is relatively high throughput compared to a handful, but even the earliest single-molecule fluorescence papers included a hundred traces, so does a 40x increase justify the semantic transition to high throughput?

We have toned down our use of the word “high-throughput” throughout the manuscript. However, we do wish to empathize that our analysis involved an initial 13,670 trajectories, each comprised of 8000 frames, totaling 1 x 10^{8} data points. This dataset is dramatically larger than standard single-molecule experiments. This is not a 40x increase in the number of trajectories, but a >100x increase in the number of trajectories, and with each trajectory recorded for 10x longer than typical single-molecule fluorescence trajectories. Therefore, the total experimental time analyzed was 1,000x larger than the norm. Importantly, other algorithms are insufficient to process data at this scale with accurate results, as we state in our text about this dataset:

“DISC successfully processed this entire data set within 20 minutes using a standard MacBook Air (1.6 GHz Intel Core i5). The same analysis completed with STaSI yielded unphysical results in 4 hours (Figure 5—figure supplement 1). We estimated analysis by vbFRET would take weeks to complete and was therefore not performed.”

4) As authors show in the assessment of simulated traces in Figure 3, none of the methods compared work in a meaningful way when looking at SNR less than or equal to 3. The data simulated in the manuscript appears to be >10:1 according to a first approximation. So none of the methods assessed does a meaningful job revealing the 'true' number of states even when the data appear to be very good.

We agree with the reviewers that this is indeed a surprising result. The simulated data plotted in Figure 3 is of SNR = 6 as indicated in the figure caption. We believe STaSI and vbFRET overfit the number of states because of heterogenous intensities included in our simulations (Figure 3—figure supplements 1-3). Without the inclusion of this additional noise source, the accuracy of each algorithm improves (Figure 3—figure supplement 4); however, this is real noise we observe and its therefore a more accurate representation of the algorithm’s performance on real data, and a major advantage of our new method. The tendency of STaSI to overfit the data is additionally highlighted in Figure 5—figure supplement 1. We discuss this phenomenon in our text:

“For SNR > 3, which accounts for most of our experimentally obtained data (Figure 2—figure supplement 1C), DISC performs exceptionally well with highest average accuracy (0.91 ± 0.05) and is robust against false positives (precision = 0.96 ± 0.04) and false negatives (recall = 0.93 ± 0.03) across all simulated conditions (Figure 3A). While vbFRET matches the recall of DISC in this SNR range (0.94 ± 0.05), the tendency to overfit the number of states at higher SNR lowers precision (0.80 ± 0.18) and overall accuracy (0.76 ± 0.19). We find STaSI returns the lowest overall accuracy (0.47 ± 0.17) likely do to an overfitting the number of states (precision = 0.57 ± 0.2) and a tendency to miss transitions (recall = 0.75 ± 0.10). Notably, DISC is the only method unaffected by inclusion of heterogeneous state intensities of fcAMP likely due to the use of a Gaussian derived BIC for state selection (Figure 3—figure supplement 4).”

For data less than an SNR of 2, we find different reasons for low accuracy. In the case of DISC and STaSI, often the algorithms return a fit of 1 state owing to the high noise level. In the case of vbFRET, we find that the states are better identified, but the events are poorly located. The ability to detect the states does lead to vbFRET having the highest accuracy for SNR = 2. As we state in our manuscript:

“While no algorithm can idealize a trajectory in the presence of SNR = 1, DISC returns the lowest accuracy at SNR = 2. We suspect this result is due to using the robust BIC for state detection and accuracy would likely be improved with less penalizing objective functions, such as AIC. While vbFRET performs the best at SNR = 2, the overall accuracy is still quite low: an average accuracy value for each number of simulated states is near chance. This low value demonstrates the inability of many algorithms to analyze data in presence of high noise and reinforces the common practice of discarding noisy data to create a more reliable data set.”

5) In the Introduction and Discussion sections, the authors downplay the general usefulness of existing Markov modeling methods. This is unfortunately misleading. The value of Markov modeling for time-series data analysis is proven in the extensive literature across a large number of fields using a variety of experimental approaches. While recent advancements in data throughput (e.g., sCMOS cameras) have put pressure on these algorithms to be efficient, this is a relatively minor problem that can be overcome by better implementations. Related to the above, Supplementary Note 1 includes a great deal of important introductory information that does a better job of placing DISC into the context of existing methods. This information belongs in the main text to substantively review the subject matter.

Your point is well-taken. We have rewritten our Introduction and Discussion sections to address these concerns. Specifically, Supplementary Note #1 was removed from the Supplementary Materials and reworked into our Introduction section. We believe the new section paints a fair portrait of the power of HMM for standard single-molecule analysis while also highlighting their current limitations for application to large datasets.

6) Given that the authors know (or should reasonably expect) that their experimental system has exactly five states, how does their "model free" approach compare to a traditional, model-based approach with a reasonable starting state (for example using SKM or any of the many HMM algorithms implemented in QuB, SPARTAN, or SMART)? This is important to justify the method being widely valuable, rather than just an incrementally better change-point algorithm.

We did reasonably expect our system to have 5 states which is one reason we chose this system to showcase our DISC results. By fitting all > 13,000 trajectories with DISC, we see that 89% of the trajectories returned a fit between 1-5 states, which suggests DISC is doing a good job in identifying states of experimental data. The same analysis with STaSI yielded poor results (Figure 5—figure supplement 1).

Importantly, the inclusions of empty ZMWs, protein decay during excitation, and the probability of more than one protein per ZMW led to distribution of observed number of states. Since our data was heterogenous, it needed to be idealized on a trajectory-by-trajectory basis. If every trace was homogenous, we expect other model-building software such as QuB, SPARTAN, or SMART would do a better job, albeit perhaps slower. However, none of these algorithms can perform trajectory-by-trajectory analysis without a user manually building a separate model for each trace, making the analysis of >13,000 unfeasible. It is for this reason we compared DISC against STaSI and vbFRET, which also perform trajectory-by-trajectory idealization. We state our rationale for choosing these algorithms in the manuscript:

“We benchmark the results of DISC against commonly used HMM and CP-HAC methods: vbFRET and STaSI (Bronson et al., 2009; Shuang et al., 2014). These algorithms were chosen following the results of a recent comparative study that determined these to be the best performers among their class of analysis methods (Hadzic et al., 2018). In addition, DISC, STaSI and vbFRET all perform trajectory-by-trajectory idealization and are written entirely in MATLAB (MathWorks) which standardizes computational performance (Materials and methods).”

7) The manuscript is lacking algorithm and implementation details. For example, the authors claim to use the Viterbi algorithm for idealization of time traces using the output of DISC, but the details of this step are never described. Looking at the code, it appears that the authors actually implemented the segmental k-means algorithm (SKM; Qin 2004) using the state sequence from DISC as the input for the first iteration. SKM iteratively runs the Viterbi algorithm to identify the optimal state sequence followed by re-estimation of model parameters using this state sequence. Strictly speaking DISC does use Viterbi, but the description in the manuscript is not as novel as would appear. This is a critical point because it suggests that DISC just implements an existing, widely-used approach (SKM), and only differs in that it provides a means to initialize the starting model. If this is the case, it should simple be stated in this way. If this is not the case, then further clarity is definitely needed.

We agree with the reviewer that our previous version did not draw close enough parallel between DISC and SKM. We have added the following text to our Theory section when discussing the resolving power of the Viterbi algorithm:

“In general, this improvement was anticipated since we are not the first to apply the Viterbi algorithm to the problem of idealization using unsupervised clustering. Although commonly used in the application of HMMs for hard assignment of data-points into *K* discrete states, the SKM algorithm has shown that the Viterbi algorithm can successfully decode a path sequence following state clustering using k-means as opposed to more rigorous HMM training procedures (Juang and Rabiner, 1990; Qin et al., 2004). Therefore, both SKM and DISC can yield the event detection power of standard HMM approaches without the need of rigorous model training. However, unlike SKM, DISC has the added benefit of identifying the states naturally without the need for any user supervision such as initial state specification. This makes DISC a powerful alternative as a computationally efficient unsupervised single-molecule analysis algorithm.”

8) The SNR of the simulated traces in Figure 2A and Figure 3B seem to be far higher than specified. The most meaningful SNR, as it's usually defined, is the fluorescence intensity relative to the standard deviation of fluorescence background noise (or noise within the signal, if it is static). Correspondingly, SNR is a measure of experimental noise and is independent of the interpretation of the data. The authors instead define SNR in an unusual way that measures the noise in the signal relative to the separation between states (see Equation 2 in the Materials and methods). This may be a valuable metric, but it should not be called SNR. This is very misleading for the field. As used, it is also problematic because it has a circular dependence on the interpretation of the data.

We agree with the reviewer that this term is indeed misleading for experimental data. Our goal was to create a metric that accounted for the separation between state intensities while also accounting for the noise. For the experimental traces, we have changed the term in the manuscript to “state separation” since the computed SNR value would indeed be a result of the idealization and not a true SNR. The term SNR was left unchanged for simulations as it is the appropriate term in that context, and uses the traditional definition that the reviewer describes. The Materials and methods section has been updated with the following text:

“To reflect the ability to cleanly resolve the individual occupation states, we computed the separation of each sequential state vs the noise within a state by:

where K is the total number of states, μ is the mean intensity value of a state, and σ is the standard deviation of the data points belonging to a state. This ensures that states are separated well enough to resolve, as would be expected for sequential ligand binding. Therefore, traces featuring between 4-6 identified states with state separation ≥ 3 were retained for analysis.”

9) In the system that was used to demonstrate the power of the DISC method the authors select for high SNR data and they know that there are five states. The question they sought to answer was whether or not the data exhibited random or cooperative transitions between states, hence, it is not clear that the data even needed to be idealized to assess this question. Gaussian fitting can be used to define the number of states -particularly when the SNR is this high; the idealization thus only provides access to kinetic information, which the authors do not appear as interested in. Hence, the biological question that appears to be of interest relates to the question of cooperativity potentially arising from the transmembrane domains, and this would have been a good way to showcase the method, but this assessment is not included.

We agree with the reviewers that we did not use our idealization to their full potential. Therefore, we modeled our idealized trajectories in QuB to directly extract global rates for a five-state sequential HMM. The results of this analysis are now shown in Figure 5D,E. This process optimized on and off rates from our idealized trajectories show a linear relationship. This provides further evidence to the non-cooperative nature of our system. The manuscript has been updated to address this new analysis.

10) Some important performance benchmarks are not discussed in the manuscript. For example, how well does it deal with changing or drifting baselines? It is a common feature of fluorescence measurements to be affected by steady reduction in the average fluorescence (drift). This is sometimes also encountered in other type of single-molecule experiments. It would be important for the authors to comment on the performance of DISC under such circumstances.

We acknowledge that not all possible performance benchmarks have been addressed in this manuscript; however, we believe our simulations have thoroughly explored DISC accuracy across a variety of dynamics, number of states, signal-to-noise to be confident in its performance on typical single-molecule data. As with most idealization algorithms, pre-processing the data (e.g. filtering, baseline correction, etc.) prior to idealization will likely increase idealization accuracy. There is nothing inherently built in to the DISC algorithm to be aware of a drifting baseline and therefore DISC accuracy will likely decrease with the magnitude of drift. We have added the following to our Discussion section to ensure the readers are aware of this limitation:

“However, there is no inherent knowledge within DISC to consider varying sources of experimental noise, such as photo-blinking or baseline drift. Therefore, the idealization accuracy of DISC will likely be improved when used after typical pre-processing of single-molecule trajectories.”

[Editors' note: we include below the reviews that the authors received from another journal, along with the authors’ responses.]

We thank the reviewers for their continued feedback and critical assessment of our work. We have addressed the provided comments and, in the process, significantly strengthened the manuscript. Our new changes are specifically aimed to more concisely explain our approach and provide additional context for the use of DISC to a more general audience.

Reviewer #1:

1) The revised manuscript has addressed my concerns regarding generality of the approach and improved presentation. The manuscript is now acceptable for publication in Nature Communications.

We thank the reviewer for their kind words and thoughtful comments that have helped strengthen our manuscript.

Reviewer #2:

I have reviewed this paper before. While the authors have addressed minor comments, the main difficulty lies with the heart of the paper itself.

Let me explain briefly again the heart of the problem. The Hidden Markov model (HMM) has been a paradigm shifting model across fields because, given a noise model and the assumption of Markov transitions between states, it finds: 1) noise parameters; 2) transition matrices; 3) trajectories consistent with those assumptions.

Because the HMM has a coupled kinetic and observation model (which is supported by Mathematics which are 100% clear), as we apply the HMM to data, it will sometimes adjust the noise level in one state to make sure that the Markovian (exponential waiting time) assumption is not violated overall.

Now the authors do not use the HMM as, in any HMM, the number of states must be specified. Fine.

Suppose, for argument’s sake, that the number of states was known and, for whatever reason, we chose not to use a (normal) HMM because perhaps the number of states was too large and the forward filter was too slow (e.g., too long a time trace). Now, suppose, as the authors do, that the data were to be messy, and we were first to denoise traces before learning kinetics. Then what?

The short answer is that any denoising/clustering step that were not simultaneously consistent with the Markov assumption and the noise model would give results that are then inconsistent with our starting assumptions. In other words, garbage in, garbage out.

In the case of the current paper, what I highlight here is just the very beginning (the divisive segmentation step) of the very many steps that run against the basic assumptions of the (unknown state) HMM that forms the heart of their problem.

Indeed, for the current work, the problems are much worse as the number of states themselves are unknown.

I understand the need, as the authors do, for an efficient state number identifier within the HMM paradigm. I think there are many approximate (or even deterministic/non-Bayesian) ways forward (perhaps a Laplacian approximation on a likelihood or an ABC within a Bayesian paradigm) that could be conceived of. But the math needs to make sense. Right now, the math makes no sense.

I have no doubt that the method proposed by the authors is efficient at what it does. But I would never use it, nor would I recommend anyone use it because the steps in the algorithm are mutually contradictory and the method (as captured in Figure 1 of the supplement) is just plain wrong.

While we accept the criticism that our method is not based on completely self-consistent first principle arguments (e.g. theoretically optimal scenario), we strongly assert that this tradeoff is completely acceptable given the significant consequent gains in performance (e.g. practicality and efficiency). In fact, the tradeoff for functionality over optimality in analysis is crucial for scientific progress. Consider super-resolution localization microscopy wherein in single emitters are identified beyond the diffraction limit by fitting to the point-spread function. Although the theoretically optimal scenario is to fit each emitter using a 2D Bessel Function, the entire community fits with 2D Gaussians. Not only would the more rigorous fitting be impractically slow, but no additional benefit can be derived by fitting with 2D Bessel functions under the relevant experimental conditions (finite noise, etc.). The use of 2D Gaussians is not as mathematically rigorous, but it is incredibly practical. Similarly, the semi-empirical force fields that make molecular dynamics simulations so accessible and powerful are not as rigorous as ab initio methods, but again, the community of users is fine with the compromise: a negligible degree of accuracy is sacrificed for a tremendous benefit in practicality. If anything, these less than optimal approaches are much more prevalent, by a wide margin, than the theoretically self-consistent methods.

In the case of DISC, we show the value of practical applicability provides tremendous and enabling advantages for large data sets. The fact that accuracy does not suffer in our method is direct evidence that the sacrifice of first principles rigor is miniscule and completely acceptable. In fact, the common tradeoffs that we found were being made in the larger big-data communities served as the inspiration for this work. For example, although reviewer 2 has clear issues with our use of a hybrid change-point and clustering algorithm for determining states in a trajectory, similar divisive clustering approaches minimizing “ad-hoc” objective-functions are extremely common in statistics (e.g. X-Means^{1} [2,641 citations] and Bisecting K-Means^{2} [3,157 citations]) and companies such as Twitter rely on change-point detection algorithms daily to analyze production and user experience data^{3}. Further, the segmental k-means algorithm (SKM, 564 citations)^{4} assigns data points into a set number of user-provided states using k-means clustering and uses the Viterbi algorithm to find events. Pioneering software such as QuB for analyzing single-ion channel trajectories heavily depend on the SKM algorithm owning to its computational speed and accuracy (202 primary citations)^{5}. Indeed, a recent groundbreaking study in Nature Methods described the use of SKM to analyze their large dataset of >10,000 molecules^{6}. These authors make it abundantly clear that:

“State assignment is performed using SKM. Although several other Markov modeling utilities are available and could be used for this purpose (e.g., HaMMy, vbFRET, SMART, iSMS and TwoTone), SKM is much faster than the alternatives, and this facilitates analysis of the large data sets associated with sCMOS.”

Experimental labs on the forefront of technological breakthroughs need more efficient algorithms to keep up with the growing scale of data generation. Although some degree of efficiency can be achieved by more sophisticated hardware (as discussed below by reviewer 3), this by no means diminishes the value of new algorithms capable of performing high throughput analysis on standard computers. We believe our manuscript demonstrates that DISC offers dramatically superior speed with maintained accuracy for both simulated and real data sets. We expect the majority of biophysical investigators will have no issue trading theoretical optimality for practical applicability.Many groups have already contacted us about using DISC, providing conspicuous evidence of this prediction.

Also, in regards to Supplemental Figure 1, to our knowledge this figure is correct. However, if something is indeed wrong, we would appreciate a reviewer to tell what is wrong (and why) to strengthen our manuscript and ensure we properly convey information.

Reviewer #3:

Authors have done a good job with providing new material to answer the initial questions and technical/experimental concerns of this reviewer and in my opinion this article is now almost ready to showcase the DISC method in a fair manner. My main concern, that was brought upon with the authors before and here again to a lesser degree would remain, is the way information is presented for the most probable group of users: biologists. Since the goal in publishing any scientific paper is eventually to benefit the researchers who would want to implement and take advantage of the knowledge or techniques presented, I would suggest authors to name multiple real-world biology related application examples in the text so a user that doesn’t carry much technical knowledge in analysis techniques and machine learning, would be able to grasp the limitations and applications for their own class of projects. That said, my opinion about the scope and presentation of DISC is positive and would like to see it in print. There are still few areas that need attention and I will list them below. Out of the fourteen points this reviewer brought up with the authors before, nine are resolved and five need attention. Four of these five points will be resolved by adding a paragraph or changing parts of statements and more clarification/specification in the paper. The last one needs a new run of the simulation. Other than that, there are two suggestions with respect to figures that need to be addressed properly. To briefly point these out: Comments#1, #2, #5, and #6 can be resolved by adding/altering text in the article while comment#9 is of technical nature and needs to be resolved by plotting the requested information. With respect to Figures, Figure 1 needs details for the axis and Figure 4 needs a new run of simulation to present a vital information for the limiting case of lack of detection. Details come below.

Note: "Comment#N" is a comment to the provided answer by the authors to question #N in the former round of review by this reviewer.

We appreciate the thorough and thoughtful comments from this reviewer which have helped us strengthen our manuscript. We agree that the tone of our manuscript and the reliance on technical jargon (e.g. unsupervised clustering) is not friendly to non-expert audiences that will be the main users of our technology. As a result, a large portion of our “The DISC Algorithm” section has been rewritten to reduce unfamiliar jargon and make the assumptions and limitations of DISC clearer.

1) Comment#1: Including the reference and mentioning it in the text satisfies the question. A small point would be, the speed is not a main issue since high throughput data analysis can always get addressed with the introduction of new better hardware which happens every year.

We thank the reviewer for pointing this out. Although technological progress will certainly continue to improve computational speed, faster algorithms benefit everyone now rather than sometime in the future, and also benefit many labs that do not consistently have the very latest hardware or experience to build the optimal hardware platform.

2) Comment#2: although the answer provided is not convincing, the added part in the paper is good enough to cover this issue. The reason the answer is not convincing is: hardware advances happen every year and within a couple of years, there will be more options for faster analysis and on the other hand, in many labs, working with GPU and parallel processing is becoming routine and a computer science/electrical engineering/physics PhD student or postdoc can implement methods to achieve shorter analysis times this way.

See response to point #1 about hardware and other technological advances.

3) Comment#5: what the word “unsupervised” bears is on dispute here. The fact that after picking a function based on the parameters of the system and the hardware you don’t need supervision, doesn’t remove the need for picking that function and make sure that it is the correct function. That said, unsupervised learning has a specific meaning used in analysis which is in line with what authors include. So the issue would be mainly explaining this clearly for the readers. Author’s usage of the technical term unsupervised learning might be misleading for the reader who is not familiar with machine learning terminology and in search of a method for data analysis. So maybe authors want to rephrase their description to clearly state that the unsupervised analysis doesn’t mean that the user can’t implement the code in a wrong way and achieve results that are not trustable, but the user can only trust the results if the proper function is known and implemented initially. The fact that they took it from supplementary note 5 to the main text, is good and removes the initial apparent contradiction in their text. As long as the discussed point is stated clearly in the main text, the issue is resolved in the opinion of this reviewer.

We agree with the reviewer that our language was unclear and potentially misleading. We have therefore made corrections throughout the manuscript when appropriate to ensure the definition of “unsupervised” is conveyed and discuss exactly what the user has to do to provide a reasonable answer. We have also added the following text to the final paragraph of our “The DISC Algorithm” section to highlight the importance of implementing DISC correctly:

“However, it is important to note that while DISC performs idealization using unsupervised clustering, obtaining accurate results does require the user to determine the appropriate information criterion and change-point detection methods to use as obtained results heavily depend on these variables.”

4) Comment#6: A satisfying answer to this question would take into account different type of problems a biologist would be able to investigate with DISC method and give examples of real systems and problems. Authors have done this to some extent in discussion but giving real biological examples to show the abilities and limitations would always help. I clarify that this is not a suggestion for the authors to perform biological experiments but to showcase and discuss possible and specific applications and why they can benefit or not by this method.

We agree with the reviewer direct applications of DISC to specific biological problems was poorly showcased. We have modified our Discussion section to include the new text.

“Finally, our results show that DISC provides a dramatic improvement in computational speed over current state-of-the-art approaches while either improving or maintaining high accuracy for both state determination and event detection. This speed increase is directly applicable to analyzing the growing data-sets obtained in single-molecule fluorescence paradigms to adequately sample population dynamics. For example, the use of sCMOS cameras enabled smFRET measures of tRNA conformational changes during protein translations across thousands of molecules simultaneously with millisecond resolution^{7}. Additionally, magnetic tweezers have enabled week-long mechanical measurements of single-protein folding and misfolding, shifting observable dynamics to pathological timescales and allowing the detection of incredibly rare events(Popa et al., 2016).”

5) Comment#9: Based on the answer authors prepared to this point, this question rises: if two points are closer than the detection ability of the method, the method wouldn’t be able to resolve them so the precision would fall, or as the authors rephrased in their answer to the review, the signal will reduce to noise (this case is not showcased in the color map in Figure 3.) meaning no detection happened for at least one of the points (since two or more points are seen as one) so since authors answered this question with a description related to noise level, such a result is not seen in the color panel since we see only signal to noise levels of above 3. It would be useful if the authors write a paragraph about this issue (meaning about the limitation of detection in too dense areas or too noisy areas) and then bring up their interpretation that states in such a case they interpret the unseen emitter as noise which then needs to be shown in their figure as signal to noise ratios of 1 and 2 (currently it shows 3 and above). This will clarify the resolution limit question the reviewer had before.

We remind the reviewer that we are looking at a *signal-to-noise ratio*, and not signal and/or noise separately. We therefore believe our simulations have addressed the valid concern of the reviewer, albeit in a different manner.

We compute signal-to-noise (SNR) as the height between states divided by the standard deviation of the data points in the previous state (as described in our Materials and methods). For example, if we have a two-state system with observed intensities of state [1] = 100 and state [2] = 200, then we have a step height of 100. If we apply Gaussian noise to this process at σ [1] = 20, then SNR = (200100) / 20 = 5. Now, if we were to only change the height of state [2] as the reviewer has suggested, say to state [2] = 150, then we would have: SNR = (150-100) / 20 = 2.5. Therefore, changing the step height without changing the σ [1] results in a changing of SNR. However, if we were to change σ [1], say to σ [1] = 10, we would have a SNR = (150-100) / 10 = 5. Thus, detecting a step height detection completely dependent on the noise in the trajectory and is not an independent entity.

Also, detecting a minimal step height is less translatable than assessing SNR ratios given that different experiments have different expected observed values. For example, while our EMCCD experiments obtain intensities values on the scale of 100s arbitrary units, correlative FRET experiments [e.g. smFRET] only has a range of 0 to 1.

In regards to Figure 3, we have updated the figure to include simulations are SNR = 1 and 2. The text has also been updated to address this change.

“Across all the simulations, DISC provides the highest average accuracy, precision and recall (Figure 3A). While no algorithm can idealize a trajectory in the presence of SNR = 1, DISC returns the lowest accuracy at SNR = 2 as result of using the robust BIC for state detection and would likely be improved with less penalizing objective functions, such as AIC. While vbFRET performs the best at SNR = 2, an average accuracy value for each number of simulated states is close to chance. This demonstrates the inability of many algorithms to analyze data in presence of high noise and reinforces the common practice of discarding noisy data to create a more reliable data set. For SNR > 3, which accounts for the majority of our experimentally obtained data (Supplemental Figure 2), DISC performs exceptionally well with highest average accuracy (0.91 ± 0.05) and is robust against false positives (precision = 0.96 ± 0.04) and false negatives (recall = 0.93 ± 0.03) across all simulated conditions (Figure 3A). While vbFRET matches the recall of DISC in this SNR range (0.93 ± 0.05), the tendency to overfit the number of states at higher SNR lowers precision (0.80 ± 0.18) and overall accuracy (0.76 ± 0.19) (Figure 3A). Notably, DISC is the only method unaffected by inclusion of heterogenous state intensities of fcAMP likely due to the use of a Gaussian derived BIC for state selection (Supplemental Figure 8).”

6) Figure 1: sub figures a and b need specific indication for intensity and time so we know what is the time period we are looking at and if the intensity is normalized or not.

The goal of Figure 1 is only to convey how the algorithm works on an example trajectory and therefore have removed details for clarity (e.g. only showing 5 date points the Viterbi Trellis of Figure 1A and not showing BIC in Figure 1C). As we show the results of DISC on plenty of figures in the main text and supplemental, we prefer not to update this figure with additional details simply to reduce clutter.

7) Figure 4: It’s necessary that this plot includes the case for signal to noise ratio of 1. This would be a good test to see if the method is working fine in the limiting case where emitters are not recognized. If it provides an unreasonable accuracy at signal to noise ratio of 1, there would be a flaw in the model that needs to be taken care of before other people start to use the code. This is particularly important since in practical cases, samples might provide a range of signal to noise ratio in different locations even in a single experiment, so the user wouldn’t want to mistakenly take noise for data specifically if the method is running unsupervised and provides wrongly high accuracy for such situations. The current plot doesn’t investigate this possibility and takes it for granted.

We agree that showing that DISC will indeed fail in a scenario of low SNR is important to assess the reliability of the method. We have updated Figure 4 with a new simulation showing accuracy results of DISC at a SNR = 1 across varying number of data points. As expected, DISC performs poorly at this signal-to-noise ratio.

References

1) Pelleg, D. & Moore, A. W. in *Icml.* 727-734.

2) Steinbach, M., Karypis, G. & Kumar, V. in *KDD workshop on text mining.* 525-526 (Boston).

3) James, N. A., Kejariwal, A. & Matteson, D. S. Leveraging Cloud Data to Mitigate User Experience from 'Breaking Bad'. *2016 Ieee International Conference on Big Data (Big Data)*, 3499-3508 (2016).

4) Juang, B.-H. & Rabiner, L. R. The segmental K-means algorithm for estimating parameters of hidden Markov models. *IEEE Transactions on acoustics, speech, and signal Processing***38**, 1639-1641 (1990).

5) Qin, F. Restoration of single-channel currents using the segmental k-means method based on hidden Markov modeling. *Biophysical Journal***86**, 1488-1501, doi:10.1016/s00063495(04)74217-4 (2004).

6) Juette, M. F.et al. Single-molecule imaging of non-equilibrium molecular ensembles on the millisecond timescale. *Nature Methods***13**, 341-344, doi:10.1038/nmeth.3769 (2016).

Supplementary materials

Citing articles via

Tweets

https://www.researchpad.co/tools/openurl?pubtype=article&doi=10.7554/eLife.53357&title=Top-down machine learning approach for high-throughput single-molecule analysis&author=David S White,Marcel P Goldschen-Ohm,Randall H Goldsmith,Baron Chanda,Olga Boudker,Leon D Islas,Leon D Islas,Leon D Islas,Fred Sigworth,&keyword=divisive segmentation,zero mode waveguides,HCN channels,cooperativity,unsupervised analysis,Human,&subject=Tools and Resources,Structural Biology and Molecular Biophysics,

© 2020 Newgen KnowledgeWorks | Privacy & Cookie Policy | Powered by: Nova