PLoS Computational Biology

Public Library of Science

Abstract

High-throughput immune repertoire sequencing (RepSeq) experiments are becoming a common way to study the diversity, structure and composition of lymphocyte repertoires, promising to yield unique insight into individuals’ past infection history. However, the analysis of these sequences remains challenging, especially when comparing two different temporal or tissue samples. Here we develop a new theoretical approach and methodology to extract the characteristics of the lymphocyte repertoire response from different samples. The method is specifically tailored to RepSeq experiments and accounts for the multiple sources of noise present in these experiments. Its output provides expansion parameters, as well as a list of potentially responding clonotypes. We apply the method to describe the response to yellow fever vaccine obtained from samples taken at different time points. We also use our results to estimate the diversity and clone size statistics from data.

Next generation sequencing allows us to gain access to repertoire-wide data supporting more comprehensive repertoire analysis and more robust vaccine design [1]. Despite large-scale efforts [2], how repertoire statistics respond to such acute perturbations is unknown. Longitudinal repertoire sequencing (RepSeq) makes possible the characterization of repertoire dynamics. Despite the large number of samples (clones) in these datasets lending it to model-based inference, there are few existing model-based approaches to this analysis. Most current approaches (e.g. [3]) quantify repertoire response properties using measurement statistics that are limited to what is observed in the sample, rather than what transpires in the individual. Model-based approaches, in contrast, can in principle capture features of the actual repertoire response to, for instance ongoing, natural stimuli, modeled as a point process of infections, and giving rise to diffusion-like response dynamics. Another regime for model-based approaches is the response to a single, strong perturbation, such as a vaccine, giving rise to a stereotyped, transient response dynamics. In either case, a measurement model is needed since what is observed (molecule counts) is indirect. We also only observe a small fraction of the total number of clones, so some extrapolation is necessary. Finally, both the underlying clonal population dynamics and the transformation applied by the measurement is stochastic, each contributing its own variability, making inferences based on sample ratios of molecule counts inaccurate.

Inference of frequency variation from sequencing data has been intensely researched in other areas of systems biology, such as in RNAseq studies. There, approaches are becoming standardized (DESEQ2 [4], EdgeR [5], *etc*.) and technical problems have been formulated and partly addressed. The differences between RNAseq and RepSeq data, however, means that direct translation of these methods is questionable. Moreover, the known structure of clonal populations may be leveraged for model-based inference using RepSeq, potentially providing advantages over existing RNAseq-based approaches.

Here, we take a generative modeling approach to repertoire dynamics. Our model incorporates known features of clonal frequency statistics and the statistics of the sequencing process. The models we consider are designed to be learnable using RepSeq data, and then used to infer properties of the repertoires of the individuals providing the samples. To guide its development, we have analyzed a longitudinal dataset around yellow fever vaccination (some results of this analysis are published [6]). Yellow fever serves as model of acute infection in humans and here we present analyses of this data set that highlights the inferential power of our approach to uncover perturbed repertoire dynamics.

To describe the stochastic dynamics of an individual clone, we define a probabilistic rule relating its frequency *f*′ at time *t*′ to its frequency *f* at an earlier time *t*: *G*(*f*′, *t*′|*f*, *t*). In this paper, *t* and *t*′ will be pre- and post-vaccination time points, but more general cases may be considered. It is also useful to define the probability distribution for the clone frequency at time *t*, *ρ*(*f* ) (Fig 1A).

The true frequencies of clones are not directly accessible experimentally. Instead, sequencing experiments give us number of reads for each clonotypes, *n*, which is a noisy function of the true frequency *f*, described by the conditional probability *P*(*n*|*f* ) (Fig 1B). Correcting for this noise to uncover the dynamics of clones is essential and is a central focus of this paper.

Our method proceeds in two inference steps, followed by a prediction step. First, using same-day replicates at time *t*, we jointly learn the characteristics of the frequency distribution *ρ*(*f* ) (Fig 1A) and the noise model *P*(*n*|*f* ) (Fig 1B). Second, by comparing repertoires between two time points *t* and *t*′, we infer the parameters of the evolution operator *G*(*f*′, *t*′|*f*, *t* ), using the noise model and frequency distribution learned in the first step (Fig 1C). Once these two inferences have been performed, the dynamics of individual clones can be estimated by Bayesian posterior inference. These steps are described in the remaining Results sections. In the rest of this section, we define and motivate the classes of model that we chose to parametrize the three building blocks of the model, schematized in Fig 1: the clone size distribution *ρ*(*f*), the noise model *P*(*n*|*f*), and the dynamical model *G*(*f*′, *t*′|*f*, *t*).

This method differs from existing approaches of differential expression detection [4, 5] in at least three ways. First, it can explicitly account for the finite count of cells with a given clonotype. That level of description does not exist in differential expression. Second, it follows a Bayesian approach, which gives the posterior probability of expansion of particular clones, rather than a *p* -value (although see [7] for a recent Bayesian approach to differential expression). Third, it includes information about the clone size distribution as a prior to assess the likelihood of expansion, and can thus extract information about clonal structure and diversity. A detailed description of classical differential expression analysis is given in the Methods section.

The distribution of clone sizes in memory or unfractioned TCR repertoires has been observed to follow a power law in human [8–10] and mice [11, 12]. These observations justify parametrizing the clone size distribution as

$$

and
$\begin{array}{c}\hfill \rho \left(f\right)=C{f}^{-\nu},\phantom{\rule{2.em}{0ex}}{f}_{\text{min}}\le f<1,\end{array}$

$$

$\begin{array}{c}\hfill {\rho}_{N}({f}_{1},\dots ,{f}_{N})\propto \underset{i=1}{\overset{N}{\prod}}\rho \left({f}_{i}\right)\delta (\underset{i=1}{\overset{N}{\sum}}{f}_{i}-1).\end{array}$

This condition,
$$
${\sum}_{i=1}^{N}{f}_{i}=1$
, will be typically satisfied for large *N* as long as 〈*f*〉 = ∫d*ffρ*(*f*) = *N*^{−1} (see Methods), but we will need to enforce it explicitly during the inference procedure.

The noise model captures the variability in the number of sequenced reads as a function of the true frequency of its clonotypes in the considered repertoire or subrepertoire. The simplest and lowest-dispersion noise model assumes random sampling of reads from the distribution of clonotypes. This results in *P*(*n*|*f*) being given by a Poisson distribution of mean *fN*_{read}, where *N*_{read} is the total number of sequence reads. Note that for the data analyzed in this paper, reads are collapsed by unique barcodes corresponding to individual mRNA molecules.

Variability in mRNA expression as well as library preparation introduces uncertainty that is far larger than predicted by the Poisson distribution. This motivated us to model the variability in read counts by a negative binomial of mean
$$
$\stackrel{\xaf}{n}=f{N}_{\text{read}}$
and variance
$$
$\stackrel{\xaf}{n}+a{\stackrel{\xaf}{n}}^{\gamma}$
, where *a* and *γ* control the over-dispersion of the noise. Negative binomial distributions were chosen because they allow us to control the mean and variance independently, and reduce to Poisson when *a* = 0. These distributions are also popular choices for modeling RNAseq variability in differential expression methods [4, 13].

A third noise model was considered to account explicitly for the number of cells representing the clone in the sample, *m*. In this two-step model, *P*(*m*|*f*) is given by a negative binomial distribution of mean
$$
$\stackrel{\xaf}{m}=fM$
and variance
$$
$\stackrel{\xaf}{m}+a{\stackrel{\xaf}{m}}^{\gamma}$
, where *M* is the total number of cells represented in the sample. *P*(*n*|*m*) is a Poisson distribution of mean *mN*_{read}/*M*. The resulting noise model is then given by *P*(*n*|*f*) = ∑_{m} *P*(*n*|*m*)*P*(*m*|*f*). The number of sampled cells, *M*, is unknown and is a parameter of the model. Note that this two-step process with the number of cells as an intermediate variable is specific to repertoire sequencing, and has no equivalent in RNAseq differential expression analysis. The choice of order between the Poisson distribution and the negative binomial is mainly one of tractability. Ultimately the main motivation for the model is that it performs better empirically (see below).

Finally, we must specify the dynamical model for the clonal frequencies. In the context of vaccination or infection, it is reasonable to assume that only a fraction *α* of clones respond by either expanding or contracting. We also assume that expansion or contraction does not depend on the size of the clone itself. Defining *s* = ln(*f*′/*f*) as the log-fold factor of expansion or contraction, we define:

$$

with
$\begin{array}{c}\hfill G({f}^{\prime}=f{e}^{s},{t}^{\prime}|f,t)\mathrm{d}{f}^{\prime}={\rho}_{s}\left(s\right)\mathrm{d}s.\end{array}$

$$

where
$\begin{array}{c}\hfill {\rho}_{s}\left(s\right)=(1-\alpha )\delta (s-{s}_{0})+\alpha {\rho}_{\text{exp}}(s-{s}_{0}),\end{array}$

To study variations arising from experimental noise, we analysed replicates of repertoire sequencing experiments. The tasks of learning the noise model and the distribution of clone sizes are impossible to dissociate. To infer *P*(*n*|*f*), one needs to get a handle on *f*, which is unobserved, and for which the prior distribution *ρ*(*f*) is essential. Conversely, to learn *ρ*(*f*) from the read counts *n*, we need to deconvolve the experimental noise, for which *P*(*n*|*f*) is needed. Both can be learned simultaneously from replicate experiments (i.e. *f*′ = *f*), using maximum likelihood estimation. For each clone, the probability of observing *n* read counts in the first replicate and *n*′ read counts in the second replicate reads:

$$

where
$\begin{array}{c}\hfill P(n,{n}^{\prime}|{\theta}_{\text{null}})={\int}_{{f}_{\text{min}}}^{1}\mathrm{d}f\phantom{\rule{0.166667em}{0ex}}\rho \left(f\right|{\theta}_{\text{null}})P\left(n\right|f,{\theta}_{\text{null}})P\left({n}^{\prime}\right|f,{\theta}_{\text{null}}),\end{array}$

While Eq 5 gives the likelihood of a given read count pair (*n*, *n*′), we need to correct for the fact that we only observe pairs for which *n* + *n*′ > 0. In general, many clones in the repertoire are small and missed in the acquisition process. In any realization, we expect *n* + *n*′ > 0 for only a relatively small number of clones, *N*_{obs} ⪡ *N*. Typically, *N*_{obs} is of order 10^{5}, while *N* is unknown but probably ranges from 10^{7} for mouse to 10^{8} − 10^{10} for humans [14, 15]. Since we have no experimental access to the unobserved clones (*n* = *n*′ = 0), we maximize the likelihood of the read count pairs
$$
$({n}_{i},{n}_{i}^{\prime})$
, *i* = 1, …, *N*_{obs}, conditioned on the clones appearing in the sample:

$$

$\begin{array}{c}\hfill {\stackrel{^}{\theta}}_{\text{null}}=\underset{{\theta}_{\text{null}}}{argmax}\underset{i=1}{\overset{{N}_{\text{obs}}}{\prod}}\frac{P({n}_{i},{n}_{i}^{\prime}|{\theta}_{\text{null}})}{1-P(0,0|{\theta}_{\text{null}})}.\end{array}$

While the condition *N*〈*f*〉 = 1 ensures normalization on average, we may instead require that normalization be satisfied for the particular realization of the data, by imposing:

$$

where
$\begin{array}{c}\hfill Z=NP(0,0){\u27e8f\u27e9}_{\rho \left(f\right|n+{n}^{\prime}=0)}+\underset{i=1}{\overset{{N}_{\text{obs}}}{\sum}}{\u27e8f\u27e9}_{\rho \left(f\right|{n}_{i},{n}_{i}^{\prime})}=1,\end{array}$

To test the validity of the maximum likelihood estimator, Eq 6, we created synthetic data for two replicate sequencing experiments with known parameters *θ*_{null} under the two-step noise model, and approximately the same number of reads as in the real data. To do so efficiently, we developed a sampling protocol that deals with the large number of unobserved clones implicitly (see Methods). Applying the maximum likelihood estimator to these synthetic data, we correctly inferred the ground truth over a wide range of parameter choices (S1 Fig).

Next, we applied the method to replicate sequencing experiments of unfractioned repertoires of 6 donors over 5 time points spanning a 1.5 month period (30 donor-day replicate pairs in total). For a typical pair of replicates, a visual comparison of the (*n*, *n* ′) pairs generated by the Poisson and two-step noise models with the data shows that the Poisson distribution fails to explain the large observed variability between the two replicates, while the two-step model can (Fig 2A–2C). The normalized log-likelihood of the two-step model was slightly but significantly higher than that of the negative binomial model, and much larger than that of the Poisson model (Fig 2D). The two-step model was able to reproduce accurately the distribution of read counts *P*(*n* ) (Fig 3A), as well as the conditional distribution *P*(*n*′|*n* ) (Fig 3B), even though those observables were not explicitly constrained by the fitting procedure. In particular, *P*(*n*) inherits the power law of the clone frequency distribution *ρ*(*f* ), but with deviations at low count numbers due to experimental noise, which agree with the data. Also, the two-step model outperformed the negative binomial noise model at describing the long tail of the read count distribution for clones that were not seen in one of the two replicates (see S2 Fig).

Fig 4 shows the learned values of the parameters for all 30 pairs of replicates across donors and timepoints. While there is variability across donors and days, probably due to unknown sources of biological and methodological variability, there is a surprising degree of consistency. Despite being inferred indirectly from the characteristics of the noise model, estimates for the number of cells in the samples, *M*, are within one order of magnitude of their expected value based on the known concentration of lymphocytes in blood (about one million cells per sample). Likewise, *f*_{min} is very close to the smallest possible clonal frequency, *N*_{cell}, where *N*_{cell} ≈ 4 · 10_{11} is the total number of T cells in the organism [16].

The inferred models can also be used to estimate the diversity of the entire repertoire (observed or unobserved). The clone frequency distribution, *ρ*(*f*), together with the estimate of *N* can be used to estimate Hill diversities (see Methods):

$$

$\begin{array}{c}\hfill {D}_{\beta}={\left(\underset{i=1}{\overset{N}{\sum}}{f}_{i}^{\beta}\right)}^{\frac{1}{1-\beta}}={(N\u27e8{f}^{\beta}\u27e9)}^{\frac{1}{1-\beta}}.\end{array}$

In Fig 5 estimates, we show the values, across donor and days, of three different diversities: species richness, i.e. the total number of clones *N* (*β* = 0); Shannon diversity, equal to the exponential of the Shannon entropy (*β* = 1); and Simpson diversity, equal to the inverse probability that two cells belong to the same clone (*β* = 2). In particular, estimates of *N* ≈ 10^{9} fall between the lower bound of 10^{8} unique TCRs reported in humans using extrapolation techniques [14] and theoretical considerations giving upper-bound estimates of 10^{10} [15] or more [17].

Now that the baseline for repertoire variation has been learned from replicates, we can learn something about its dynamics following immunization. The parameters of the expansion model (Eq 4) can be set based on prior knowledge about the typical fraction of responding clones and effect size. Alternatively, they can be inferred from the data using maximum likelihood estimation (Empirical Bayes approach). We define the likelihood of the read count pairs (*n*, *n*′) between time points *t* and *t*′ as:

$$

where
$$
${\theta}_{\text{exp}}=\{\alpha ,{s}_{0},\stackrel{\xaf}{s}\}$
characterizes
$\begin{array}{c}\hfill \begin{array}{cc}& {P}_{\text{exp}}(n,{n}^{\prime}|{\theta}_{\text{null}},{\theta}_{\text{exp}})=\hfill \\ & {\int}_{{f}_{\text{min}}}^{1}\mathrm{d}f\rho \left(f\right)\int \mathrm{d}s{\rho}_{s}\left(s\right|{\theta}_{\text{exp}})P\left(n\right|f,{\theta}_{\text{null}})P\left({n}^{\prime}\right|f{e}^{s},{\theta}_{\text{null}}),\hfill \end{array}\end{array}$

$$

$\begin{array}{c}\hfill {\stackrel{^}{\theta}}_{\text{exp}}=\underset{{\theta}_{\text{exp}}}{argmax}\underset{i=1}{\overset{{N}_{\text{obs}}}{\prod}}\frac{{P}_{\text{exp}}({n}_{i},{n}_{i}^{\prime}|{\stackrel{^}{\theta}}_{\text{null}},{\theta}_{\text{exp}})}{1-{P}_{\text{exp}}(0,0|{\stackrel{^}{\theta}}_{\text{null}},{\theta}_{\text{exp}})}.\end{array}$

This maximization was performed via gradient-based methods. In Methods we give an example of an alternative semi-analytic approach to finding the optimum using the expectation maximization algorithm.

In addition to normalization at *t*, we also need to impose normalization at *t*′:

$$

with
$\begin{array}{c}\hfill {Z}^{\prime}=NP(0,0){\u27e8{f}^{\prime}\u27e9}_{\rho \left({f}^{\prime}\right|n+{n}^{\prime}=0)}+\underset{i=1}{\overset{{N}_{\text{obs}}}{\sum}}{\u27e8{f}^{\prime}\u27e9}_{{\rho}^{\prime}\left({f}^{\prime}\right|{n}_{i},{n}_{i}^{\prime})},\end{array}$

We first tested the method on synthetic data generated with the expansion model of Eq 9, with an exponentially distributed effect size for the expansion with scale parameter, $$ $\stackrel{\xaf}{s}$ :

$$

where Θ(
$\begin{array}{c}\hfill {\rho}_{\text{exp}}\left({s}^{\prime}\right)=\frac{1}{\stackrel{\xaf}{s}}{e}^{-{s}^{\prime}/\stackrel{\xaf}{s}}\Theta \left({s}^{\prime}\right),\end{array}$

Once learned, the model can be used to compute the posterior probability of a given expansion factor by marginalizing *f*, and using Bayes’ rule,

$$

$\begin{array}{c}\hfill \rho \left(s\right|n,{n}^{\prime})\propto {\rho}_{s}\left(s\right)\int P\left(n\right|f)P\left({n}^{\prime}\right|f{e}^{s})\rho \left(f\right)\mathrm{d}f.\end{array}$

We illustrate different posterior shapes from synthetic data as a function of the observed count pairs in S3 Fig. We see for instance that the width of the posterior narrows when counts are both large, and that the model ascribes a fold-change of *s*_{0} to clones with *n*′ ⪅ n.

Note that the value of the true responding fraction *α* is correctly learned from our procedure, regardless of our ability to tell with perfect certainty which particular clones responded. By contrast, a direct estimate of the responding fraction from the number of significantly responding clones, as determined by differential expression software such as EdgeR [13], is likely to misestimate that fraction. We applied EdgeR (see Methods) to a synthetic repertoire of *N* = 10^{9} clones, a fraction *α* = 0.01 of which responded with mean effect
$$
$\stackrel{\xaf}{s}=1$
, and sampled with *N*_{read} = 10^{6}. EdgeR found 6, 880 significantly responding clones (corrected p-value 0.05) out of *N*_{obs} = 1,995,139, i.e. a responding fraction 6, 880/1, 995, 139 ≈ 3 ⋅ 10^{−3} of the observed repertoire, and a responding fraction 6, 880/10^{9} ≈ 7 ⋅ 10^{−6} of the total repertoire, underestimating the true fraction *α* = 10^{−2}.

Next, we ran the inference procedure on sequences obtained from human blood samples across time points following yellow fever vaccination. To guide the choice of prior for *s*, we plotted the histograms of the naive log fold-change ln *n*′/*n* (Fig 6B). These distributions show symmetric exponential tails, although we should recall that these are likely dominated by measurement and sampling noise. Yet, the difference between the pair of replicates (black) and the pre- and post-vaccination timepoints (red) motivates us to model the statistics of expansion factors as:

$$

with typical effect size
$$
$\stackrel{\xaf}{s}$
. We also tested other forms of the prior (asymmetric exponential, centered and off-centered Gaussian), but they all yielded lower likelihoods of the data (Table 1).
$\begin{array}{c}\hfill {\rho}_{\text{exp}}\left(s\right)=\frac{1}{2\stackrel{\xaf}{s}}{e}^{-\left|s\right|/\stackrel{\xaf}{s}},\end{array}$

Form of prior | Average data likelihood | |
---|---|---|

full asymmetric exp. | $$ $\n \n \n (\n \n \n 1\n \n \n -\n \n \n \alpha \n \n \n )\n \n \n \n \delta \n \n \n \n (\n \n \n s\n \n \n -\n \n \n \n s\n \n \n 0\n \n \n \n )\n \n \n \n +\n \n \n \alpha \n \n \n \Theta \n \n \n \n (\n \n \n s\n \n \n -\n \n \n \n s\n \n \n 0\n \n \n \n )\n \n \n \n \n e\n \n \n \n -\n \n \n \n \n s\n \n \n -\n \n \n \n s\n \n \n 0\n \n \n \n \n \n s\n \n \n \xaf\n \n \n \n \n \n \n /\n \n \n \n s\n \n \n \xaf\n \n \n $ | -1.894891 |

symmetric exp. | $$ $\n \n \n (\n \n \n 1\n \n \n -\n \n \n \alpha \n \n \n )\n \n \n \n \delta \n \n \n \n (\n \n \n s\n \n \n -\n \n \n \n s\n \n \n 0\n \n \n \n )\n \n \n \n +\n \n \n \alpha \n \n \n \n e\n \n \n \n -\n \n \n \n \n \n |\n \n \n s\n \n \n -\n \n \n \n \n s\n \n \n 0\n \n \n \n \n |\n \n \n \n \n \n s\n \n \n \xaf\n \n \n \n \n \n \n /\n \n \n 2\n \n \n \n s\n \n \n \xaf\n \n \n $ | -1.894303 |

centered Gaussian | $$ $\n \n \n (\n \n \n 1\n \n \n -\n \n \n \alpha \n \n \n )\n \n \n \n \delta \n \n \n \n (\n \n \n s\n \n \n -\n \n \n \n s\n \n \n 0\n \n \n \n )\n \n \n \n +\n \n \n \alpha \n \n \n \n e\n \n \n \n -\n \n \n \n \n \n (\n \n \n s\n \n \n -\n \n \n \n s\n \n \n 0\n \n \n \n )\n \n \n \n 2\n \n \n \n \n 2\n \n \n \n \sigma \n \n \n 2\n \n \n \n \n \n \n \n /\n \n \n \n \n 2\n \n \n \pi \n \n \n \n \n \sigma \n \n $ | -1.894723 |

off-centered Gaussian | $$ $\n \n \n (\n \n \n 1\n \n \n -\n \n \n \alpha \n \n \n )\n \n \n \n \delta \n \n \n \n (\n \n \n s\n \n \n -\n \n \n \n s\n \n \n 0\n \n \n \n )\n \n \n \n +\n \n \n \alpha \n \n \n \n e\n \n \n \n -\n \n \n \n \n \n (\n \n \n s\n \n \n -\n \n \n \n (\n \n \n \n s\n \n \n 0\n \n \n \n +\n \n \n \n s\n \n \n 1\n \n \n \n )\n \n \n \n )\n \n \n \n 2\n \n \n \n \n 2\n \n \n \n \sigma \n \n \n 2\n \n \n \n \n \n \n \n /\n \n \n \n \n 2\n \n \n \pi \n \n \n \n \n \sigma \n \n $ | -1.895101 (s_{1} ≥ 0.1) |

We applied the inference procedure (Eq 10) between the repertoires taken the day of vaccination (day 0), and at one of the other time points (day -7, day 7, day 15, and day 45) after vaccination. Since there are two replicates at each time point, we can make 4 comparisons between any pair of time points. The results are shown in Fig 6C in log-scale.

Same-day comparisons (day 0 vs day 0) gave effectively zero mean effect sizes (
$$
$\stackrel{\xaf}{s}<0.1$
, below the discretization step of the integration procedure), or equivalently *α* ≈ 0, as expected. Comparisons with other days yielded inferred values of *α* and
$$
$\stackrel{\xaf}{s}$
mostly distributed along the same ‘ridge’, as observed on synthetic data (Fig 6A), with variations across replicates and donors. The mean effect size
$$
$\stackrel{\xaf}{s}$
is highest at day 15, where the peak of the response occurs, but is also substantially different from 0 at all time points except day 0 (including before vaccination at day −7), with often high values of *α*. We speculate that these fluctuations reflect natural variations of the repertoire across time, experimental batch effects, as well as biological variability due to differences in the affinities and precursor frequencies of responding clonotypes. As a consequence of the natural diversity, values of the responding fraction *α* are not learned with great precision, as can be seen from the variability across the 4 choices of replicate pair, and are probably gross overestimations of the true probability that a naive T cell responds to an infection, which is believed to be of order 10^{−5} − 10^{−3} [18].

The posterior probability on expansion factors *ρ*(*s*|*n*, *n* ′) (Eq 13, S4 Fig) can be used to study the fate and dynamics of particular clones. For instance, we can identify responding clones as having a low posterior probability of being not expanding *P*_{null} = *P*(*s* ≤ 0|*n*, n′) < 0.025. *P*_{null} is the Bayesian counterpart of a p-value but differs from it in a fundamental way: it gives the probability that expansion happened given the observations, when a p-value would give the probability of the observations in absence of expansion. We can define a similar criterion for contracting clones.

To get the expansion or contraction factor of each clone, we can compute the posterior average and median, 〈*s*〉_{n,n′} = ∫d*s sρ*(*s*|*n*, *n*′) and *s*_{median} (*F*(*s*_{median}|*n*, *n*′) = 0.5, for the cumulative density function,
$$
$F\left(s\right|n,{n}^{\prime})={\int}_{-\infty}^{s}\rho \left(\stackrel{\u02dc}{s}\right|n,{n}^{\prime})\mathrm{d}\stackrel{\u02dc}{s}$
), corresponding to our best estimate for the log fold-change. In Fig 7A, we show how the median Bayesian estimator differs from the naive estimator *s*_{naive} = ln *n*′/*n*. While the two agree for large clones for which relative noise is smaller, the naive estimator over-estimates the magnitude of log fold-changes for small clones because of the noise. The Bayesian estimator accounts for that noise and gives a more conservative and more realistic estimate.

Fig 7B shows all count pairs (*n*, *n*′) between day 0 and day 15 following yellow fever vaccination, with red clones above the significance threshold line *P*_{null} = 0.025 being identified as responding. Expanded clones can also be read off a plot showing how both *P*_{null} and 〈*s*〉_{n,n′} vary as one scans values of the count pairs (*n*, *n* ′) (S5 Fig).

Given the uncertainty in the expansion model parameters
$$
${\theta}_{\text{exp}}=(\stackrel{\xaf}{s},\alpha )$
, we wondered how robust our list of responding clonotypes was to those variations. In Fig 7C, we show the overlap of lists of strictly expanding clones (*P*(*s* ≤ 0|*n*, *n*′)<0.025) as a function of *θ*_{exp}, relative to the optimal value
$$
${\stackrel{^}{\theta}}_{\text{exp}}$
(black circle). The ridge of high overlap values exactly mirrors the ridge of high likelihood values onto which the learned parameters fall (Fig 6D). Values of
$$
${\stackrel{^}{\theta}}_{\text{exp}}$
obtained for other replicate pairs (square symbols) fall onto the same ridge, meaning that these parameters lead to virtually identical lists of candidates for response.

The list of identified responding clones can be used to test hypotheses about the structure of the response. For example, recent work has highlighted a power law relationship between the initial clone size and clones subsequent fold change response in a particular experimental setting [19]. We can plot the relationship in our data as the posterior mean log fold change versus the posterior initial frequency, *f* (Fig 7D). While the relationship is very noisy, emphasizing the diversity of the response, it is consistent with a decreasing dependency of the fold change with the clone size prior to the immune response.

The robustness of our candidate lists rests on their insensitivity to the details of how the model explains typical expansion. In S3 Fig, we show how the posterior belief varies significantly for count pairs (0, *n*′), *n*′ > 0, across a range of values of
$$
$\stackrel{\xaf}{s}$
and *α* passing along the ridge of plausible models (Fig 7C). A transition from a low to high value of the most probable estimate for *s* characterizes their shapes and arises as
$$
$\stackrel{\xaf}{s}$
becomes large enough that expansion from frequencies near *f*_{min} is plausible, and the dominant mass of clones there makes this the dominate posterior belief. Thus, these posteriors are shaped by *ρ*_{s}(*s*) at low
$$
$\stackrel{\xaf}{s}$
, and *ρ*(*f*) at high
$$
$\stackrel{\xaf}{s}$
. Our lists vary negligibly over this transition, and thus are robust to it.

Our probabilistic framework describes two sources of variability in clonotype abundance from repertoire sequencing experiments: biological repertoire variations and spurious variations due to noise. We found that in a typical experiment, noise is over-dispersed relative to Poisson sampling noise. This makes the use of classical difference tests such as Fisher’s exact test or a chi-squared test inappropriate in this context, and justifies the development of specific methods. Even in very precise single-cell experiments that do not suffer from expression noise and PCR biases (but are often limited to smaller repertoires owing to high costs), the discrete nature of cell counts creates an irreducible source of Poisson noise. In that case our method would offer a Bayesian alternative to existing approaches.

As a byproduct, our method learned the properties of the clone size distribution, which is consistent with a power law of exponent ≈ − 2.1 robust across individuals and timepoints, consistent with previous reports [8–10]. Using these parameters, various diversity measures could be computed, such as the species richness (10^{8}–10^{9} ), which agrees with previous bounds [14, 15], or the “true diversity” (the exponential of the Shannon entropy), found to range between 10^{6} and 10^{8}. The inferred null models were found to be conserved across donors and time, indicating that they should be valid for other datasets obtained with the same protocol. This implies that our method could be applicable to situations where replicate experiments are not available, as is often the case. On the other hand, the procedure for learning the null model should be repeated for each distinct protocol using different technologies, using replicate experiments. We applied our method to data from mRNA sequencing experiments, which has the advantage over current DNA immune repertoire sequencing methods of being able to incoorporate unique molecular barcodes. Genomic DNA-based sequencing does not suffer from expression noise, however the technology is prone to PCR and statistical noise and primer biases. Given that our ultimate choice of noise distributions is often empirically motivated, different modeling choices may be applicable to gDNA datasets.

The proposed probabilistic model of clonal expansion is described by two parameters: the fraction of clones that respond to the immune challenge, and the typical effect size (log fold-change). While these two parameters were difficult to infer precisely individually, a combination of them could be robustly learned. Despite this ambiguity in the model inference, the list of candidate responding clonotypes is largely insensitive to the parameter details. For clonotypes that rose from very small read counts to large ones, the inferred fold-change expansion factor depended strongly on the priors, and resulted from a delicate balance between the tail of small clones in the clone size distribution and the tail of large expansion events in the distribution of fold-changes.

While similar approaches have been proposed for differential expression analysis of RNA sequencing data [4, 5, 13, 20], the presented framework was specifically built to address the specific challenges of repertoire sequencing data. Here, the aim is to count proliferating cells, as opposed to evaluating average expression of genes in a population of cells. We specifically describe two steps that translate cell numbers into the observed TCR read counts: random sampling of cells that themselves carry a random number of mRNA molecules, which are also amplified and sampled stochastically. Another difference with previous methods is the explicit Bayesian treatment, which allows us to calculate a posterior probability of expansion, rather than a less interpretable *p*-value.

Here we applied the presented methodology to an acute infection. We have previously shown that it can successfully identify both expanding (from day 0 to 15 after vaccination) and contracting (from day 15 to day 45) clonotypes after administering a yellow fever vaccine. However the procedure is more general and can also be extended to be used in other contexts. For instance, this type of approach could be used to identify response in B-cells during acute infections, by tracking variations in the size of immunoglobulin sequence lineages (instead of clonotypes), using lineage reconstruction methods such as Partis [21]. The framework could also be adapted to describe not just expansion, but also switching between different cellular phenoypes during the immune response, *e.g.* between the naive, memory, effector memory, *etc.* phenotypes, which can obtained by flow-sorting cells before sequencing [22]. Another possible application would be to track the clones across different tissues and organs, and detect migrations and local expansions [23]. The approach requires replicates to quantify natural variability, but this need only be quantified once for the same experimental conditions.

The proposed framework is not limited to identifying a response during an acute infection, but can also be used as method for learning the dynamics from time dependent data even in the absence of an external stimulus [3]. Here we specifically assumed expansion dynamics with strong selection. However, the propagator function can be replaced by a non-biased random walk term, such as genetic drift. In this context the goal is not to identify responding clonotypes but it can be used to discriminate different dynamical models in a way that accounts for different sources of noise inherently present in the experiment. Alternatively, the framework can also be adapted to describe chronic infections such as HIV [24], where expansion events may be less dramatic and more continuous or sparse, as the immune system tries to control the infection over long periods of time.

All code used to produce the results in this work was custom written in Python 3 and is publicly available online at https://github.com/mptouzel/bayes_diffexpr.

Here we derive the condition for which the normalization in the joint density is implicitly satisfied. The normalization constant of the joint density is

$$

with
$\begin{array}{c}\hfill \mathcal{Z}={\int}_{{f}_{\text{min}}}^{1}\cdots {\int}_{{f}_{\text{min}}}^{1}\underset{i=1}{\overset{N}{\prod}}\rho \left({f}_{i}\right)\delta (Z-1){\mathrm{d}}^{N}\stackrel{\to}{f}\phantom{\rule{0.277778em}{0ex}},\end{array}$

$$

$\begin{array}{c}\hfill \delta (Z-1)={\int}_{-i\infty}^{i\infty}\frac{\mathrm{d}\mu}{2\pi}{e}^{\mu (Z-1)}\end{array}$

$$

$\begin{array}{cc}& ={\int}_{-i\infty}^{i\infty}\frac{\mathrm{d}\mu}{2\pi}{e}^{-\mu}\underset{i=1}{\overset{N}{\prod}}{e}^{\mu {f}_{i}}\phantom{\rule{0.277778em}{0ex}}.\end{array}$

Crucially, the multi-clone integral in Eq (15) over $$ $\stackrel{\to}{f}$ then factorizes. Exchanging the order of the integrations we obtain

$$

with
$$
$\langle {e}^{\mu f}\rangle ={\int}_{{f}_{\text{min}}}^{1}\rho \left(f\right){e}^{\mu f}\mathrm{d}f$
. Now define the large deviation function,
$$
$I\left(\mu \right)\u2254-\frac{\mu}{N}+log\langle {e}^{\mu f}\rangle $
, so that
$\begin{array}{c}\hfill \mathcal{Z}={\int}_{-i\infty}^{i\infty}\frac{\mathrm{d}\mu}{2\pi}{e}^{-\mu}{\u27e8{e}^{\mu f}\u27e9}^{N}\phantom{\rule{0.277778em}{0ex}},\end{array}$

$$

$\begin{array}{c}\hfill \mathcal{Z}={\int}_{-i\infty}^{i\infty}\frac{\mathrm{d}\mu}{2\pi}{e}^{-NI\left(\mu \right)}\phantom{\rule{0.277778em}{0ex}}.\end{array}$

Note that *I*(0) = 0. With *N* large, this integral is well-approximated by the integrand’s value at its saddle point, located at *μ** satisfying *I*′(*μ**) = 0. Evaluating the latter gives

$$

$\begin{array}{c}\hfill \frac{1}{N}=\frac{\u27e8f{e}^{{\mu}^{*}f}\u27e9}{\u27e8{e}^{{\mu}^{*}f}\u27e9}\phantom{\rule{0.277778em}{0ex}}.\end{array}$

If the left-hand side is equal to 〈*f*〉, the equality holds only for *μ* * = 0 since expectations of products of correlated random variables are not generally products of their expectations. In this case, we see from Eq (19) that
$$
$\mathcal{Z}=1$
, and so the constraint *N*〈*f*〉 = 1 imposes normalization.

The procedure for null model sampling is summarized as (1) fix main model parameters, (2) solve for remaining parameters using the normalization constraint, *N*〈*f*〉 = 1, and (3) starting with frequencies, sample and use to specify the distribution of the next random variable in the chain.

In detail, we first fix: (a) the model parameters (e.g. {*α*, *a*, *γ*, *M*}), excluding *f*_{min}; (b) the desired size of the full repertoire, *N*; (c) the sequencing efficiency (average number of UMI per cell), *ϵ*, for each replicate. From the latter we get the mean number of reads per sample,
$$
${N}_{\text{reads}}^{\text{eff}}=\u03f5M$
. Note that the actual sampled number of reads is stochastic and so will differ from this fixed value.

We then solve for remaining parameters. Specifically, *f*_{min} is fixed by the constraint that the average sum of all frequencies, under the assumption that their distribution factorizes, is unity:

$$

$\begin{array}{c}\hfill N{\u27e8f\u27e9}_{\rho \left(f\right)}=1\end{array}$

This completes the parameter specification.

We then sample from the corresponding chain of random variables. Sampling the chain of random variables of the null model can be performed efficiently by only sampling the *N*_{obs} = *N*(1 − *P*(0, 0)) observed clones. This is done separately for each replicate, once conditioned on whether or not the other count is zero. Samples with 0 molecule counts can in principle be produced with any number of cells, so cell counts must be marginalized when implementing this constraint. We thus used the conditional probability distributions *P*(*n*|*f*) = ∑_{m} *P*(*n*|*m*)*P*(*m*|*f*) with *m*, *n* = 0, 1, …. *P*(*n*′|*f*) is defined similarly. Note that these two conditional distributions differ only in their sampling efficiency, *ϵ*. Together with *ρ*(*f*), these distributions form the full joint distribution, which is conditioned on the clone appearing in the sample, i.e. *n*+ *n*′ > 0 (denoted
$$
$\mathcal{O}$
),

$$

with the renormalization accounting for the fact that (
$\begin{array}{c}\hfill P(n,{n}^{\prime},f|\mathcal{O})=\frac{P\left(n\right|f)P\left({n}^{\prime}\right|f)\rho \left(f\right)}{1-\int \mathrm{d}f\rho \left(f\right)\mathrm{d}fP(n=0|f)P({n}^{\prime}=0|f)}\phantom{\rule{0.277778em}{0ex}},\end{array}$

$$

$\begin{array}{ccc}\hfill P\left({q}_{x0}\right|\mathcal{O})& =& \underset{n>0}{\sum}\int \mathrm{d}fP(n,{n}^{\prime}=0,f|\mathcal{O})\phantom{\rule{0.277778em}{0ex}},\hfill \end{array}$

$$

$\begin{array}{ccc}\hfill P\left({q}_{0x}\right|\mathcal{O})& =& \underset{{n}^{\prime}>0}{\sum}\int \mathrm{d}fP(n=0,{n}^{\prime},f|\mathcal{O})\phantom{\rule{0.277778em}{0ex}},\hfill \end{array}$

$$

$\begin{array}{ccc}P({q}_{xx}|\mathcal{O})& =& {\displaystyle \underset{\underset{{n}^{\prime}>0}{n>0,}}{\sum}\int \text{d}fP(n,{n}^{\prime},f|\mathcal{O})}.\end{array}$

Conditioning on $$ $\mathcal{O}$ ensures normalization, $$ $P\left({q}_{x0}\right|\mathcal{O})+P\left({q}_{0x}\right|\mathcal{O})+P\left({q}_{xx}\right|\mathcal{O})=1$ . Each sampled clone falls in one the three regions according to these probabilities. Their clone frequencies are then drawn conditioned on the respective region,

$$

$\begin{array}{ccc}\hfill P\left(f\right|{q}_{x0})& =& \underset{n>0}{\sum}P(n,{n}^{\prime}=0,f|\mathcal{O})/P\left({q}_{x0}\right|\mathcal{O})\phantom{\rule{0.277778em}{0ex}},\hfill \end{array}$

$$

$\begin{array}{ccc}\hfill P\left(f\right|{q}_{0x})& =& \underset{{n}^{\prime}>0}{\sum}P(n=0,{n}^{\prime},f|\mathcal{O})/P\left({q}_{0x}\right|\mathcal{O})\phantom{\rule{0.277778em}{0ex}},\hfill \end{array}$

$$

$\begin{array}{ccc}\hfill P\left(f\right|{q}_{xx})& =& \underset{n>0,{n}^{\prime}>0}{\sum}P(n,{n}^{\prime},f|\mathcal{O})/P\left({q}_{xx}\right|\mathcal{O}).\hfill \end{array}$

Using the sampled frequency, a pair of molecule counts for the three quadrants are then sampled as (*n*, 0), (0, *n*′), and (*n*, *n*′), respectively, with *n* and *n*′ drawn from the renormalized, finite-count domain of the conditional distributions, *P*(*n*|*f*, *n* > 0).

Using this sampling procedure we demonstrate the validity of the null model and its inference by sampling across the observed range of parameters and re-inferring their values (see S1 Fig).

The replicate model parameters are *θ* = (*ν*, *a*, *γ*, log_{10} *M*, log_{10} *f*_{min}). Let *C*(*θ*) = *Z*(*θ*) − 1 be the constraint equation such that we wish to satisfy *C*(*θ*) = 0. Let *θ** denote the parameters maximizing the likelihood subject to *C*(*θ*) = 0. Then the hyperplane orthogonal to the gradient ∇_{θ} *C*(*θ**) and passing through *θ** is the local subspace in which the constraint is satisfied. The projection of Hessian of the log likelihood, *H*, into this subspace is given by,

$$

where the matrix
$$
$P=\stackrel{\to}{n}{\stackrel{\to}{n}}^{\top}$
projects onto
$$
$\stackrel{\to}{n}$
, the unit vector co-linear with ∇
$\begin{array}{c}\hfill \stackrel{^}{H}=H-PH-HP+PHP\end{array}$

When computing error bars for the diversities, we use the standard deviation of the statistics of a Monte Carlo estimate of the log diversities obtained via parameter value samples from the multivariate Gaussian approximation of the likelihood using the projected Hessian, $$ $\stackrel{^}{H}$ .

Differential expression deals with RNA-seq data, which reports the bulk expression of a large number of genes in a population of cells, and aims to detect significant differences in expression across different populations, either at different times, or under different conditions.

Repertoire sequencing (RepSeq) and expression analysis aim at inferring fundamentally different quantities, although both do it through the number of reads per gene. In differential expression analysis, one is interested in reconstructing the level of expression of particular genes, which are the same in all cells, while in RepSeq one is interested in the number of cells expressing a given clonotype. Thus, in RepSeq the number of transcripts will depend on the number of cells carrying that clonotypes, but also on their expression level, which is assumed to be clonotype-independent but noisy. There are thus three levels of noise in RepSeq: cell sampling noise, expression noise, and mRNA capture noise. By constrast in differential expression there is expression noise, cell-to-cell variability, and capture noise. These sources of noises combine in a different manner than in RepSeq.

edgeR [5], a classical differential expression analysis software, proceeds by learning a noise model using a negative binomial model for expression noise from two identical conditions. Then, comparing RNA-seq data from two datasets, it evaluates a p-value corresponding to the probability that the observed difference in expression between the two datasets has occured just because of noise. We applied edgeR treating each clonotype as a separate gene.

For a set of clone frequencies,
$$
${\left\{{f}_{i}\right\}}_{i=1}^{N}$
, the Hill family of diversities are obtained from the Rényi entropies, as *D*_{β} = exp *H*_{β}, with
$$
${H}_{\beta}=\frac{1}{1-\beta}ln\left[{\sum}_{i=1}^{N}{f}_{i}^{\beta}\right]$
. We use *ρ*(*f*) to compute their ensemble averages over *f*, again under the assumption that the joint distribution of frequencies factorizes. We obtain an estimate for *D*_{0} = *N* using the model-derived expression, *N*_{obs} + *P*(*n* = 0)*N* = *N*, where *N*_{obs} is the number of clones observed in one sample, and
$$
$P(n=0)={\int}_{{f}_{\text{min}}}^{1}P(n=0|f)\rho \left(f\right)\mathrm{d}f$
. For *β* = 1, we compute exp(*N*〈 − *f* log *f*〉_{ρ(f)}) and for *β* = 2, we use 1/(*N*〈*f*^{2}〉_{ρ(f)}).

Since the differential expression model involves expansion and contraction in the test condition, some normalization in this condition is needed such that it produces roughly the same total number of cells as those in the reference condition, consistent with the observed data. One approach (the one taken below) is to normalize at the level of clone frequencies. Here, we instead perform the inefficient but more straightforward procedure of sampling all *N* clones and discarding those clones for which (*n*, *n*′) = (0, 0). A slight difference in the two procedures is that *N*_{obs} is fixed in the former, while is stochastic in the latter.

The frequencies of the first condition, *f*_{i}, are sampled from *ρ*(*f*) until they sum to 1 (i.e. until before they surpass 1, with a final frequency added that takes the sum exactly to 1). An equal number of log-frequency fold-changes, *s*_{i}, are sampled from *ρ*(*s*). The normalized frequencies of the second condition are then
$$
${f}_{i}^{\prime}={f}_{i}{e}^{{s}_{i}}/{\sum}_{j}\phantom{\rule{4pt}{0ex}}{f}_{j}{e}^{{s}_{j}}$
. Counts from the two conditions are then sampled from *P*(*n*|*f*) and *P*(*n*′|*f*′), respectively. Unobserved clones, i.e. those with (*n*, *n*′) = (0, 0), are then discarded.

To learn the parameters of *ρ*(*s* ), we performed a grid search, refined by an iterative, gradient-based search to obtain the maximum likelihood. We tested different forms of prior shown in Table 1.

For a more formal approach, expectation maximization (EM) can be employed when tractable. Here in a simple setting, we demonstrate this approach of obtaining the optimal parameter estimates from the data by calculating the expected log likelihood over the posterior and then maximizing with respect to the parameters. In practice, we first perform the latter analytically and then evaluate the former numerically. We choose a symmetric exponential as a tractable prior for this purpose:

$$

with
$$
$\stackrel{\xaf}{s}>0$
, and no shift,
$\begin{array}{c}\hfill {\rho}_{\text{exp}}\left(s\right|\stackrel{\xaf}{s})={e}^{-\left|s\right|/\stackrel{\xaf}{s}}/2\stackrel{\xaf}{s}\end{array}$

$$

where
$$
${\stackrel{\xaf}{s}}^{\prime}$
is the current estimate. Maximizing
$\begin{array}{c}\hfill Q\left(\stackrel{\xaf}{s}\right|{\stackrel{\xaf}{s}}^{\prime})=\underset{i=1}{\overset{{N}_{\text{obs}}}{\sum}}{\int}_{-\infty}^{\infty}\mathrm{d}s\rho \left(s\right|{n}_{i},{n}_{i}^{\prime},{\stackrel{\xaf}{s}}^{\prime})log\left[P\right({n}_{i},{n}_{i}^{\prime},s\left|\stackrel{\xaf}{s}\right)]\phantom{\rule{0.277778em}{0ex}},\end{array}$

$$

$\begin{array}{cc}\hfill \frac{\partial log\left[{\rho}_{\text{exp}}\left(s\right|\stackrel{\xaf}{s}))\right]}{\partial \stackrel{\xaf}{s}}& =\frac{1}{{\rho}_{\text{exp}}\left(s\right|\stackrel{\xaf}{s})}\frac{\partial {\rho}_{\text{exp}}\left(s\right|\stackrel{\xaf}{s})}{\partial \stackrel{\xaf}{s}}\end{array}$

$$

so that
$$
$\frac{\partial Q\left(\stackrel{\xaf}{s}\right|{\stackrel{\xaf}{s}}^{\prime})}{\partial \stackrel{\xaf}{s}}={\sum}_{i=1}^{{N}_{\text{obs}}}{\int}_{-\infty}^{\infty}\mathrm{d}s\rho \left(s\right|{n}_{i},{n}_{i}^{\prime},{\stackrel{\xaf}{s}}^{\prime})\frac{\partial log\left[{\rho}_{\text{exp}}\left(s\right|\stackrel{\xaf}{s}))\right]}{\partial \stackrel{\xaf}{s}}=0$
implies
$\begin{array}{cc}& =\frac{\left|s\right|-\stackrel{\xaf}{s}}{{\stackrel{\xaf}{s}}^{2}}\phantom{\rule{0.277778em}{0ex}},\end{array}$

$$

so that
$$
${\stackrel{\xaf}{s}}^{*}=\frac{1}{{N}_{\text{obs}}}{\sum}_{i=1}^{{N}_{\text{obs}}}{\stackrel{\xaf}{s}}_{({n}_{i},{n}_{i}^{\prime})}$
, where
$\begin{array}{c}\hfill \underset{i=1}{\overset{{N}_{\text{obs}}}{\sum}}{\int}_{-\infty}^{\infty}\mathrm{d}s\rho \left(s\right|{n}_{i},{n}_{i}^{\prime},{\stackrel{\xaf}{s}}^{\prime})\frac{\left|s\right|-{\stackrel{\xaf}{s}}^{*}}{{\stackrel{\xaf}{s}}^{*2}}=0\end{array}$

$$

$\begin{array}{c}\hfill {\stackrel{\xaf}{s}}_{(n,{n}^{\prime})}={\int}_{-\infty}^{\infty}\mathrm{d}s\left|s\right|\rho \left(s\right|n,{n}^{\prime},{\stackrel{\xaf}{s}}^{\prime}).\end{array}$

The latter integral is computed numerically from the model using
$$
$\rho \left(s\right|n,{n}^{\prime},{\stackrel{\xaf}{s}}^{\prime})=P(n,{n}^{\prime},s|{\stackrel{\xaf}{s}}^{\prime})/{\int}_{-\infty}^{\infty}P(n,{n}^{\prime},s|{\stackrel{\xaf}{s}}^{\prime})\mathrm{d}s$
. *Q* is maximized at
$$
$\stackrel{\xaf}{s}={\stackrel{\xaf}{s}}^{*}$
since
$$
$\frac{{\partial}^{2}log\left[{\rho}_{\text{exp}}\left(s\right|\stackrel{\xaf}{s}))\right]}{\partial {\stackrel{\xaf}{s}}^{2}}{|}_{\stackrel{\xaf}{s}={\stackrel{\xaf}{s}}^{*}}=-{\stackrel{\xaf}{s}}^{*-2}<0$
. Thus, we update
$$
${\rho}_{\text{exp}}\left(s\right|\stackrel{\xaf}{s})$
with
$$
$\stackrel{\xaf}{s}\leftarrow {\stackrel{\xaf}{s}}^{*}$
. The number of updates typically required for convergence was small.

The constraint of equal repertoire size, *Z*′ = *Z* can be satisfied with a suitable choice of the shift parameter, *s*_{0}, in the prior for differential expression, *ρ*_{s}(*s*), namely *s*_{0} = −ln *Z*′/*Z*. The latter arises from the coordinate transformation *s* ← Δ*s* + *s*_{0}, and adds a factor of
$$
${e}^{{s}_{0}}$
to all terms of *Z*′.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Citing articles via

Tweets

https://www.researchpad.co/tools/openurl?pubtype=article&doi=10.1371/journal.pcbi.1007873&title=Inferring the immune response from repertoire sequencing&author=&keyword=&subject=Research Article,Biology and Life Sciences,Molecular Biology,Molecular Biology Techniques,Cloning,Research and Analysis Methods,Molecular Biology Techniques,Cloning,Biology and Life Sciences,Immunology,Vaccination and Immunization,Medicine and Health Sciences,Immunology,Vaccination and Immunization,Medicine and Health Sciences,Public and Occupational Health,Preventive Medicine,Vaccination and Immunization,Research and Analysis Methods,Mathematical and Statistical Techniques,Mathematical Models,Biology and Life Sciences,Immunology,Immune Response,Medicine and Health Sciences,Immunology,Immune Response,Medicine and Health Sciences,Infectious Diseases,Viral Diseases,Yellow Fever,Biology and Life Sciences,Genetics,Gene Expression,Biology and life sciences,Molecular biology,Molecular biology techniques,Sequencing techniques,RNA sequencing,Research and analysis methods,Molecular biology techniques,Sequencing techniques,RNA sequencing,Biology and Life Sciences,Cell Biology,Cellular Types,Animal Cells,Blood Cells,White Blood Cells,Lymphocytes,Biology and Life Sciences,Cell Biology,Cellular Types,Animal Cells,Immune Cells,White Blood Cells,Lymphocytes,Biology and Life Sciences,Immunology,Immune Cells,White Blood Cells,Lymphocytes,Medicine and Health Sciences,Immunology,Immune Cells,White Blood Cells,Lymphocytes,

© 2023 Newgen KnowledgeWorks | Privacy & Cookie Policy | Powered by: Nova