PLoS Genetics

Public Library of Science

Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models

Sahir R. Bhatnagar
Yi Yang
Tianyuan Lu
Erwin Schurr
JC Loredo-Osti
Marie Forest
Karim Oualkacha
Celia M. T. Greenwood
Heather J Cordell
Heather J Cordell
Scott M. Williams
Heather J Cordell
Scott M. Williams
Heather J Cordell
Scott M. Williams
Heather J Cordell
Scott M. Williams

DOI 10.1371/journal.pgen.1008766,
Volume: 16,
Issue: 5,

Abstract

This work addresses a recurring challenge in the analysis and interpretation of genetic association studies: which genetic variants can best predict and are independently associated with a given phenotype in the presence of population structure? Not controlling confounding due to geographic population structure, family and/or cryptic relatedness can lead to spurious associations. Much of the existing research has therefore focused on modeling the association between a phenotype and a single genetic variant in a linear mixed model with a random effect. However, this univariate approach may miss true associations due to the stringent significance thresholds required to reduce the number of false positives and also ignores the correlations between markers. We propose an alternative method for fitting high-dimensional multivariable models, which selects SNPs that are independently associated with the phenotype while also accounting for population structure. We provide an efficient implementation of our algorithm and show through simulation studies and real data examples that our method outperforms existing methods in terms of prediction accuracy and controlling the false discovery rate.

Genome-wide association studies (GWAS) have become the standard method for analyzing genetic datasets owing to their success in identifying thousands of genetic variants associated with complex diseases (https://www.genome.gov/gwastudies/). Despite these impressive findings, the discovered markers have only been able to explain a small proportion of the phenotypic variance; this is known as the missing heritability problem [1]. One plausible reason is that there are many causal variants that each explain a small amount of variation with small effect sizes [2]. Methods such as GWAS, which test each variant or single nucleotide polymorphism (SNP) independently, may miss these true associations due to the stringent significance thresholds required to reduce the number of false positives [1]. Another major issue to overcome is that of confounding due to geographic population structure, family and/or cryptic relatedness which can lead to spurious associations [3]. For example, there may be subpopulations within a study that differ with respect to their genotype frequencies at a particular locus due to geographical location or their ancestry. This heterogeneity in genotype frequency can cause correlations with other loci and consequently mimic the signal of association even though there is no biological association [4, 5]. Studies that separate their sample by ethnicity to address this confounding suffer from a loss in statistical power due to the drop in sample size.

To address the first problem, multivariable regression methods have been proposed which simultaneously fit many SNPs in a single model [6, 7]. Indeed, the power to detect an association for a given SNP may be increased when other causal SNPs have been accounted for. Conversely, a stronger signal from a causal SNP may weaken false signals when modeled jointly [6].

Solutions for confounding by population structure have also received significant attention in the literature [8–11]. There are two main approaches to account for the relatedness between subjects: 1) the principal component (PC) adjustment method and 2) the linear mixed model (LMM). The PC adjustment method includes the top PCs of genome-wide SNP genotypes as additional covariates in the model [12]. The LMM uses an estimated covariance matrix from the individuals’ genotypes and includes this information in the form of a random effect [3].

While these problems have been addressed in isolation, there has been relatively little progress towards addressing them jointly at a large scale. Region-based tests of association have been developed where a linear combination of *p* variants is regressed on the response variable in a mixed model framework [13]. In case-control data, a stepwise logistic-regression procedure was used to evaluate the relative importance of variants within a small genetic region [14]. These methods however are not applicable in the high-dimensional setting, i.e., when the number of variables *p* is much larger than the sample size *n*, as is often the case in genetic studies where millions of variants are measured on thousands of individuals.

There has been recent interest in using penalized linear mixed models, which place a constraint on the magnitude of the effect sizes while controlling for confounding factors such as population structure. For example, the LMM-lasso [15] places a Laplace prior on all main effects while the adaptive mixed lasso [16] uses the *L*_{1} penalty [17] with adaptively chosen weights [18] to allow for differential shrinkage amongst the variables in the model. Another method applied a combination of both the lasso and group lasso penalties in order to select variants within a gene most associated with the response [19]. However, methods such as the LMM-lasso are normally performed in two steps. First, the variance components are estimated once from a LMM with a single random effect. These LMMs normally use the estimated covariance matrix from the individuals’ genotypes to account for the relatedness but assumes no SNP main effects (i.e. a null model). The residuals from this null model with a single random effect can be treated as independent observations because the relatedness has been effectively removed from the original response. In the second step, these residuals are used as the response in any high-dimensional model that assumes uncorrelated errors. This approach has both computational and practical advantages since existing penalized regression software such as `glmnet` [20] and `gglasso` [21], which assume independent observations, can be applied directly to the residuals. However, recent work has shown that there can be a loss in power if a causal variant is included in the calculation of the covariance matrix as its effect will have been removed in the first step [13, 22].

In this paper we develop a general penalized LMM framework called `ggmix` that simultaneously selects variables and estimates their effects, accounting for between-individual correlations. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Our method can handle several sparsity inducing penalties such as the lasso [17] and elastic net [23]. Through simulations and three real data examples, we show that `ggmix` leads to more parsimonious models compared to the two-stage approach or principal component adjustment with better prediction accuracy. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix.

All of our algorithms are implemented in the `ggmix` R package hosted on CRAN with extensive documentation (https://sahirbhatnagar.com/ggmix). We provide a brief demonstration of the `ggmix` package in S2 Text.

The rest of the paper is organized as follows. In Results, we compare the performance of our proposed approach and demonstrate the scenarios where it can be advantageous to use over existing methods through simulation studies and three real data analyses. This is followed by a discussion of our results, some limitations and future directions in Discussion. Materials and methods describes the `ggmix` model, the optimization procedure and the algorithm used to fit it.

In this section we demonstrate the performance of `ggmix` in a simulation study and three real data applications.

We evaluated the performance of `ggmix` in a variety of simulated scenarios. For each simulation scenario we compared `ggmix` to the `lasso` and the `twostep` method. For the `lasso`, we included the top 10 principal components from the simulated genotypes used to calculate the kinship matrix as unpenalized predictors in the design matrix. For the `twostep` method, we first fitted an intercept only model with a single random effect using the average information restricted maximum likelihood (AIREML) algorithm [24] as implemented in the `gaston` R package [25]. The residuals from this model were then used as the response in a regular `lasso` model. Note that in the `twostep` method, we removed the kinship effect in the first step and therefore did not need to make any further adjustments when fitting the penalized model. We fitted the `lasso` using the default settings and `standardize = FALSE` in the `glmnet` package [20], with 10-fold cross-validation (CV) to select the optimal tuning parameter. For other parameters in our simulation study, we defined the following quantities:

- •

- •

- •

- •

- •

- •

- •

We simulated data from the model

$\begin{array}{c}\hfill \mathbf{\text{Y}}=\mathbf{\text{X}}\mathit{\beta}+\mathbf{\text{P}}+\mathit{\epsilon},\end{array}$

where $\mathbf{\text{P}}\sim \mathcal{N}(0,\eta {\sigma}^{2}\mathbf{\Phi})$ is the polygenic effect and $\mathit{\epsilon}\sim \mathcal{N}(0,(1-\eta ){\sigma}^{2}\mathbf{\text{I}})$ is the error term. Here, - None of the

*causal*SNPs are included in*kinship*set.- All of the

*causal*SNPs are included in the*kinship*set.

Both kinship matrices were meant to contrast the model behavior when the causal SNPs are included in both the main effects and random effects (referred to as proximal contamination [8]) versus when the causal SNPs are only included in the main effects. These scenarios are motivated by the current standard of practice in GWAS where the candidate marker is excluded from the calculation of the kinship matrix [8]. This approach becomes much more difficult to apply in large-scale multivariable models where there is likely to be overlap between the variables in the design matrix and kinship matrix. We simulated random genotypes from the BN-PSD admixture model with 1D geography and 10 subpopulations using the `bnpsd` package [26, 27]. In Fig 1, we plot the estimated kinship matrix from a single simulated dataset in the form of a heatmap where a darker color indicates a closer genetic relationship.

In Fig 2 we plot the first two principal component scores calculated from the simulated genotypes used to calculate the kinship matrix in Fig 1, and color each point by subpopulation membership. We can see that the PCs can identify the subpopulations which is why including them as additional covariates in a regression model has been considered a reasonable approach to control for confounding.

Using this set-up, we randomly partitioned 1000 simulated observations into 80% for training and 20% for testing. The training set was used to fit the model and select the optimal tuning parameter only, and the resulting model was evaluated on the test set. Let $\widehat{\lambda}$ be the estimated value of the optimal regularization parameter, ${\widehat{\mathit{\beta}}}_{\widehat{\lambda}}$ the estimate of * β* at regularization parameter $\widehat{\lambda}$, and ${\widehat{S}}_{\widehat{\lambda}}=\{j;{\left({\widehat{\mathit{\beta}}}_{\widehat{\lambda}}\right)}_{j}\ne 0\}$ the index of the set of non-zero estimated coefficients. To compare the methods in the context of true positive rate (TPR), we selected the largest tuning parameter that would result in a false positive rate (FPR) closest to 5%, but not more. Note that in practice, this approach to selecting the tuning parameter is generally not possible since we do not know the underlying true model in advance. For real data, we suggest an information criterion approach described in S1 Text or a sample splitting approach such as the one we used for the UK Biobank analysis. We also compared the model size ($|{\widehat{S}}_{\widehat{\lambda}}|$), test set prediction error based on the refitted unpenalized estimates for each selected model, the estimation error ($\Vert \widehat{\mathit{\beta}}-{\mathit{\beta}\Vert}_{2}^{2}$), and the variance components (

The results are summarized in Table 1. We see that `ggmix` outperformed the `twostep` in terms of TPR, and was comparable to the `lasso`. This was the case, regardless of true heritability and whether the causal SNPs were included in the calculation of the kinship matrix. For the `twostep` however, the TPR at a FPR of 5%, drops, on average, from 0.84 (when causal SNPs are not in the kinship) to 0.76 (when causal SNPs are in the kinship). Across all simulation scenarios, `ggmix` had the smallest estimation error, and smallest root mean squared prediction error (RMSE) on the test set while also producing the most parsimonious models. Both the `lasso` and `twostep` selected more false positives, even in the null model scenario. Both the `twostep` and `ggmix` overestimated the heritability though `ggmix` was closer to the true value. When none of the causal SNPs were in the kinship, both methods tended to overestimate the truth when *η* = 10% and underestimate when *η* = 30%. Across all simulation scenarios `ggmix` was able to (on average) correctly estimate the error variance. The `lasso` tended to overestimate *σ*^{2} in the null model while the `twostep` overestimated *σ*^{2} when none of the causal SNPs were in the kinship matrix.

Note: median (inter-quartile range) is given for model size.

Overall, we observed that variable selection results and RMSE for `ggmix` were similar regardless of whether the causal SNPs were in the kinship matrix or not. This result is encouraging since in practice the kinship matrix is constructed from a random sample of SNPs across the genome, some of which are likely to be causal, particularly in polygenic traits.

In particular, our simulation results show that the principal component adjustment method may not be the best approach to control for confounding by population structure, particularly when variable selection is of interest.

Three datasets with different features were used to illustrate the potential advantages of `ggmix` over existing approaches such as PC adjustment in a `lasso` regression. In the first two datasets, family structure induced low levels of correlation and sparsity in signals. In the last, a dataset involving mouse crosses, correlations were extremely strong and could confound signals.

With more than 500,000 participants, the UK Biobank is one of the largest genotyped health care registries in the world. Among these participants, 147,731 have been inferred to be related to at least one individual in this cohort [29]. Such a widespread genetic relatedness may confound association studies and bias trait predictions if not properly accounted for. Among these related individuals, 18,150 have a documented familial relationship (parent-offspring, full siblings, second degree or third degree) that was previously inferred in [30]. We attempted to derive a polygenic risk score for height among these individuals. As suggested by a reviewer, the goal of this analysis was to see how the different methods performed for a highly polygenic trait in a set of related individuals. We compared the `ggmix`-derived polygenic risk score to those derived by the `twostep` and `lasso` methods.

We first estimated the pairwise kinship coefficient among the 18,150 reportedly related individuals based on 784,256 genotyped SNPs using KING [31]. We grouped related individuals with a kinship coefficient > 0.044 [31] into 8,300 pedigrees. We then randomly split the dataset into a training set, a model selection set and a test set of roughly equal sample size, ensuring all individuals in the same pedigree were assigned into the same set. We inverse normalized the standing height after adjusting for age, sex, genotyping array, and assessment center following Yengo et al. [32].

To reduce computational complexity, we selected 10,000 SNPs with the largest effect sizes associated with height from a recent large meta-analysis [32]. Among these 10,000 SNPs, 1,233 were genotyped and used for estimating the kinship whereas the other 8,767 SNPs were imputed based on the Haplotype Reference Consortium reference panel [33]. The distribution of the 10,000 SNPs by chromosome and whether or not the SNP was imputed is shown in S1 Fig. We see that every chromosome contributed SNPs to the model with 15% coming from chromosome 6. The markers we used are theoretically independent since Yengo et al. performed a COJO analysis which should have tuned down signals due to linkage disequilibrium [32]. We used `ggmix`, `twostep` and `lasso` to select SNPs most predictive of the inverse normalized height on the training set, and chose the λ with the lowest prediction RMSE on the model selection set for each method. We then examined the performance of each derived polygenic risk score on the test set. Similar to simulation study, we adjusted for the top 10 genetic PCs as unpenalized predictors when fitting the `lasso` models, and supplied the kinship matrix based on 784,256 genotyped SNPs to `ggmix` and `twostep`.

We found that with a kinship matrix estimated using all genotyped SNPs, `ggmix` had the possibility to achieve a lower RMSE on the model selection set compared to the `twostep` and `lasso` methods (Fig 3A). An optimized `ggmix` -derived polygenic risk score that utilized the least number of SNPs was also able to better predict the trait with lower RMSE on the test set (Fig 3B).

We additionally applied a Bayesian Sparse Linear Mixed Model (`BSLMM` ) [34] implemented in the GEMMA package [35] to derive a polygenic risk score on the training set. A posterior probability of inclusion of each SNP was provided and prediction was based on all SNPs with a positive posterior probability. We found that although the `BSLMM` -based polygenic risk score leveraged the most SNPs, it did not achieve a comparable prediction accuracy as the other three methods (Fig 3B). Likely due to the small effect sizes of these SNPs, only 94, 35 and 1 SNPs had a posterior inclusion probability above 0.05, 0.10 and 0.50, respectively. The model would have further reduced prediction accuracy if the prediction was based only on these SNPs.

In the most recent Genetic Analysis Workship 20 (GAW20), the causal modeling group investigated causal relationships between DNA methylation (exposure) within some genes and the change in high-density lipoproteins ΔHDL (outcome) using Mendelian Randomization (MR) [36]. Penalized regression methods were used to select SNPs strongly associated with the exposure in order to be used as an instrumental variable (IV) [37, 38]. However, since GAW20 data consisted of families, `twostep` methods were used which could have resulted in a large number of false positives or false negatives. `ggmix` now provides an alternative approach that could be used for selecting the IV while accounting for the family structure of the data.

We applied `ggmix` to all 200 GAW20 simulation datasets, each of 679 observations, and compared its performance to the `twostep` and `lasso` methods. Using a Factored Spectrally Transformed Linear Mixed Model (FaST-LMM) [39] adjusted for age and sex, we validated the effect of rs9661059 on blood lipid trait to be significant (genome-wide *p* = 6.29 × 10^{−9} ). Though several other SNPs were also associated with the phenotype, these associations were probably mediated by CpG-SNP interaction pairs and did not reach statistical significance. Therefore, to avoid ambiguity, we only focused on chromosome 1 containing 51,104 SNPs, including rs9661059. Given that population admixture in the GAW20 data was likely, we estimated the population kinship using REAP [40] after decomposing population compositions using ADMIXTURE [41]. We used 100,276 LD-pruned whole-genome genotyped SNPs for estimating the kinship. Among these, 8100 were included as covariates in our models based on chromosome 1. The causal SNP was also among the 100,276 SNPs. All methods were fit according to the same settings described in our simulation study, and adjusting for age and sex. We calculated the median (inter-quartile range) number of active variables, and RMSE (standard deviation) based on five-fold CV on each simulated dataset.

On each simulated replicate, we calibrated the methods so that they could be easily compared by fixing the true positive rate to 1 and then minimizing the false positive rate. Hence, the selected SNP, rs9661059, was likely to be the true positive for each method, and non-causal SNPs were excluded to the greatest extent. All three methods precisely chose the correct predictor without any false positives in more than half of the replicates, as the causal signal was strong. However, when some false positives were selected (i.e. when the number of active variables > 1), `ggmix` performed comparably to `twostep`, while the `lasso` was inclined to select more false positives as suggested by the larger third quartile number of active variables (Table 2). We also observed that `ggmix` outperformed the `twostep` method with lower CV RMSE using the same number of SNPs. Meanwhile, it achieved roughly the same prediction accuracy as `lasso` but with fewer non-causal SNPs (Table 2). It is also worth mentioning that there was very little correlation between the causal SNP and SNPs within a 1Mb-window around it (see S2 Fig), making it an ideal scenario for the `lasso` and related methods.

Note: median (inter-quartile range) is given for model size.

We also applied the `BSLMM` method by performing five-fold CV on each of the 200 simulated replicates. We found that while `BSLMM` achieved a lower CV RMSE compared to the other methods (Table 2), this higher prediction accuracy relied on approximately 80% of the 51,104 SNPs with a positive posterior inclusion probability. This may suggest overfitting in this dataset. We additionally tried imposing a stricter posterior inclusion probability threshold (0.05, 0.10 and 0.50) in order to improve feature selection. These thresholds however, resulted in overly sparse models as most SNPs had a low posterior probability. It is also noteworthy that we did not adjust for age and sex in the `BSLMM` model, as the current implementation of the method in the GEMMA package does not allow adjustment for covariates.

Mouse inbred strains of genetically identical individuals are extensively used in research. Crosses of different inbred strains are useful for various studies of heritability focusing on either observable phenotypes or molecular mechanisms, and in particular, recombinant congenic strains have been an extremely useful resource for many years [42]. However, ignoring complex genetic relationships in association studies can lead to inflated false positives in genetic association studies when different inbred strains and their crosses are investigated [43–45]. Therefore, a previous study developed and implemented a mixed model to find loci associated with mouse sensitivity to mycobacterial infection [46]. The random effects in the model captured complex correlations between the recombinant congenic mouse strains based on the proportion of the DNA shared identical by descent. Through a series of mixed model fits at each marker, new loci that impact growth of mycobacteria on chromosome 1 and chromosome 11 were identified.

Here we show that `ggmix` can identify these loci, as well as potentially others, in a single analysis. We reanalyzed the growth permissiveness in the spleen, as measured by colony forming units (CFUs), 6 weeks after infection from *Mycobacterium**bovis* Bacille Calmette-Guerin (BCG) Russia strain as reported in [46].

By taking the consensus between the “main model” and the “conditional model” of the original study, we regarded markers D1Mit435 on chromosome 1 and D11Mit119 on chromosome 11 as two true positive loci. We directly estimated the kinship between mice using genotypes at 625 microsatellite markers. The estimated kinship was entered directly into `ggmix` and `twostep`. For the `lasso` , we calculated and included the first 10 principal components of the estimated kinship. To evaluate the robustness of different models, we bootstrapped the 189-sample dataset and repeated the analysis 200 times. We then conceived a two-fold criteria to evaluate performance of each model. We first examined whether a model could pick up both true positive loci using some λ. If the model failed to pick up both loci simultaneously with any λ, we counted as modeling failure on the corresponding boostrap replicate; otherwise, we counted as modeling success and recorded which other loci were picked up given the largest λ. Consequently, similar to the strategy used in the GAW20 analysis, we optimized the models by tuning the penalty factor such that these two true positive loci were picked up, while the number of other active loci was minimized. Significant markers were defined as those captured in at least half of the successful bootstrap replicates (Fig 4).

We demonstrated that `ggmix` recognized the true associations more robustly than `twostep` and `lasso`. In almost all (99%) bootstrap replicates, `ggmix` was able to capture both true positives, while the `twostep` failed in 19% of the replicates and the `lasso` failed in 56% of the replicates by missing at least one of the two true positives (Fig 4). The robustness of `ggmix` is particularly noteworthy due to the strong correlations between all microsatellite markers in this dataset (see S3 Fig). These strong correlations with the causal markers, partially explain the poor performance of the `lasso` as it suffers from unstable selections in the presence of correlated variables (e.g. [47]).

We also identified several other loci that might also be associated with susceptibility to mycobacterial infection (Table 3). Among these new potentially-associated markers, D2Mit156 was found to play a role in control of parasite numbers of *Leishmania**tropica* in lymph nodes [48]. An earlier study identified a parent-of-origin effect at D17Mit221 on CD4M levels [49]. This effect was more visible in crosses than in parental strains. In addition, D14Mit131, selected only by `ggmix` , was found to have a 9% loss of heterozygosity in hybrids of two inbred mouse strains [50], indicating the potential presence of putative suppressor genes pertaining to immune surveillance and tumor progression [51]. This result might also suggest association with anti-bacterial responses yet to be discovered.

Note: median (inter-quartile range) is given for model size.

We have developed a general penalized LMM framework called `ggmix` which simultaneously selects SNPs and adjusts for population structure in high dimensional prediction models. We compared our method to the `twostage` procedure, where in the first stage, the dependence between observations is adjusted for in a LMM with a single random effect and no covariates (i.e. null model). The residuals from this null model can then be used in any model for independent observations because the relatedness has been effectively removed from the original response. We also compared our method to the `lasso` and `BSLMM` which are closely related to `ggmix` since they also jointly model the relatedness and SNPs in a single step. The key differences are that the `lasso` uses a principal component adjustment and `BSLMM` is a Bayesian method focused on phenotype prediction.

Through an extensive simulation study and three real data analyses that mimic many experimental designs in genetics, we show that the current approaches of PC adjustment and two-stage procedures are not necessarily sufficient to control for confounding by population structure leading to a high number of false positives. Our simulation results show that `ggmix` outperforms existing methods in terms of sparsity and prediction error even when the causal variants are included in the kinship matrix (Table 1). Many methods for single-SNP analyses avoid this proximal contamination [8] by using a leave-one-chromosome-out scheme [52], i.e., construct the kinship matrix using all chromosomes except the one on which the marker being tested is located. However, this approach is not possible if we want to model many SNPs (across many chromosomes) jointly to create, for example, a polygenic risk score. For the purposes of variable selection, we would also want to model all chromosomes together since the power to detect an association for a given SNP may be increased when other causal SNPs have been accounted for. Conversely, a stronger signal from a causal SNP may weaken false signals when modeled jointly [6], particularly when the markers are highly correlated as in the mouse crosses example.

In the UK Biobank, we found that with a kinship matrix estimated using all genotyped SNPs, `ggmix` had achieved a lower RMSE on the model selection set compared to the `twostep` and `lasso` methods. Furthermore, an optimized `ggmix`-derived polygenic risk score that utilized the least number of SNPs was also able to better predict the trait with lower RMSE on the test set. In the GAW20 example, we showed that while all methods were able to select the strongest causal SNP, `ggmix` did so with the least amount of false positives while also maintaining good predictive ability. In the mouse crosses example, we showed that `ggmix` is robust to perturbations in the data using a bootstrap analysis. Indeed, `ggmix` was able to consistently select the true positives across bootstrap replicates, while `twostep` failed in 19% of the replicates and `lasso` failed in 56% of the replicates by missing of at least one of the two true positives. Our re-analysis of the data also lead to some potentially new findings, not found by existing methods, that may warrant further study. This particular example had many markers that were strongly correlated with each other (see S3 Fig). Nevertheless, we observed that the two true positive loci were the most often selected while none of the nearby markers were picked up in more than 50% of the 200 bootstrap replicates. This shows that our method does recognize the true positives in the presence of highly correlated markers. Nevertheless, we think the issue of variable selection for correlated SNPs warrants further study. The recently proposed Precision Lasso [47] seeks to address this problem in the high-dimensional fixed effects model.

We emphasize here that previously developed methods such as the LMM-lasso [15] use a two-stage fitting procedure without any convergence details. From a practical point of view, there is currently no implementation that provides a principled way of determining the sequence of tuning parameters to fit, nor a procedure that automatically selects the optimal value of the tuning parameter. To our knowledge, we are the first to develop a coordinate gradient descent (CGD) algorithm in the specific context of fitting a penalized LMM for population structure correction with theoretical guarantees of convergence. Furthermore, we develop a principled method for automatic tuning parameter selection and provide an easy-to-use software implementation in order to promote wider uptake of these more complex methods by applied practitioners.

Although we derive a CGD algorithm for the *ℓ*_{1} penalty, our approach can also be easily extended to other penalties such as the elastic net and group lasso with the same guarantees of convergence. A limitation of `ggmix` is that it first requires computing the covariance matrix with a computation time of $\mathcal{O}\left({n}^{2}k\right)$ followed by a spectral decomposition of this matrix in $\mathcal{O}\left({n}^{3}\right)$ time where *k* is the number of SNP genotypes used to construct the covariance matrix. This computation becomes prohibitive for large cohorts such as the UK Biobank [53] which have collected genetic information on half a million individuals. When the matrix of genotypes used to construct the covariance matrix is low rank, there are additional computational speedups that can be implemented. While this has been developed for the univariate case [8], to our knowledge, this has not been explored in the multivariable case. We are currently developing a low rank version of the penalized LMM developed here, which reduces the time complexity from $\mathcal{O}\left({n}^{2}k\right)$ to $\mathcal{O}\left(n{k}^{2}\right)$. There is also the issue of how our model scales with an increasing number of covariates (*p*). Due to the coordinate-wise optimization procedure, we expect this to be less of an issue, but still prohibitive for *p* > 1 × 10^{5}. The `biglasso` package [54] uses memory mapping strategies for large *p*, and this is something we are exploring for `ggmix`.

As was brought up by a reviewer, the simulations and real data analyses presented here contained many more markers used to estimate the kinship than the sample size (*n*/*k* ≤ 0.1). In the single locus association test, Yang el al. [22] found that proximal contamination was an issue when *n*/*k* ≈ 1. We believe further theoretical study is needed to see if these results can be generalized to the multivariable models being fit here. Once the computational limitations of sample size mentioned above have been addressed, these theoretical results can be supported by simulation studies.

There are other applications in which our method could be used as well. For example, there has been a renewed interest in polygenic risk scores (PRS) which aim to predict complex diseases from genotypes. `ggmix` could be used to build a PRS with the distinct advantage of modeling SNPs jointly, allowing for main effects as well as interactions to be accounted for. Based on our results, `ggmix` has the potential to produce more robust and parsimonious models than the `lasso` with better predictive accuracy.

Our method is also suitable for fine mapping SNP association signals in genomic regions, where the goal is to pinpoint individual variants most likely to impact the undelying biological mechanisms of disease [55].

Let *i* = 1, …, *N* be a grouping index, *j* = 1, …, *n*_{i} the observation index within a group and ${N}_{T}={\sum}_{i=1}^{N}{n}_{i}$ the total number of observations. For each group let ${\mathit{y}}_{i}=({y}_{1},\dots ,{y}_{{n}_{i}})$ be the observed vector of responses or phenotypes, **X**_{i} an *n*_{i} × (*p* + 1) design matrix (with the column of 1s for the intercept), ${\mathit{b}}_{i}$ a group-specific random effect vector of length *n*_{i} and ${\epsilon}_{i}=({\epsilon}_{i1},\dots ,{\epsilon}_{i{n}_{i}})$ the individual error terms. Denote the stacked vectors $\mathbf{Y}={({\mathit{y}}_{i},\dots ,{\mathit{y}}_{N})}^{T}\in {\mathbb{R}}^{{N}_{T}\times 1}$, $\mathit{b}={({\mathit{b}}_{i},\dots ,{\mathit{b}}_{N})}^{T}\in {\mathbb{R}}^{{N}_{T}\times 1}$, $\epsilon ={({\epsilon}_{i},\dots ,{\epsilon}_{N})}^{T}\in {\mathbb{R}}^{{N}_{T}\times 1}$, and the stacked matrix $\mathbf{X}=({\mathbf{X}}_{1}^{T},\dots ,{\mathbf{X}}_{N}^{T})\in {\mathbb{R}}^{{N}_{T}\times (p+1)}$. Furthermore, let $\mathit{\beta}={({\mathit{\beta}}_{0},{\mathit{\beta}}_{1},\dots ,{\mathit{\beta}}_{p})}^{T}\in {\mathbb{R}}^{(p+1)\times 1}$ be a vector of fixed effects regression coefficients corresponding to **X** . We consider the following linear mixed model with a single random effect [56]:

$\begin{array}{c}\hfill \mathbf{Y}=\mathbf{X}\mathit{\beta}+\mathit{b}+\epsilon ,\end{array}$

where the random effect $\begin{array}{c}\hfill \mathit{b}\sim \mathcal{N}(0,\eta {\sigma}^{2}\mathbf{\Phi})\phantom{\rule{2em}{0ex}}\epsilon \sim \mathcal{N}(0,(1-\eta ){\sigma}^{2}\mathbf{I}).\end{array}$

Here, ${\mathbf{\Phi}}_{{N}_{T}\times {N}_{T}}$ is a known positive semi-definite and symmetric covariance or kinship matrix calculated from SNPs sampled across the genome, ${\mathbf{I}}_{{N}_{T}\times {N}_{T}}$ is the identity matrix and parameters $\begin{array}{c}\hfill \mathbf{Y}|(\mathit{\beta},\eta ,{\sigma}^{2})\sim \mathcal{N}(\mathbf{X}\mathit{\beta},\eta {\sigma}^{2}\mathbf{\Phi}+(1-\eta ){\sigma}^{2}\mathbf{I}).\end{array}$

The LMM-Lasso method [15] considers an alternative but equivalent parameterization given by:

$\begin{array}{c}\hfill \mathbf{Y}|(\mathit{\beta},\delta ,{\sigma}_{g}^{2})\sim \mathcal{N}(\mathbf{X}\mathit{\beta},{\sigma}_{g}^{2}(\mathbf{\Phi}+\delta \mathbf{I})),\end{array}$

where $\delta ={\sigma}_{e}^{2}/{\sigma}_{g}^{2}$, ${\sigma}_{g}^{2}$ is the genetic variance and ${\sigma}_{e}^{2}$ is the residual variance. We instead consider the parameterization in Eq 1 since maximization is easier over the compact set $\begin{array}{c}\hfill -\ell (\mathbf{\Theta})\propto \frac{{N}_{T}}{2}\text{log}\left({\sigma}^{2}\right)+\frac{1}{2}\text{log}\left(\text{det}\right(\mathbf{V}\left)\right)+\frac{1}{2{\sigma}^{2}}{(\mathbf{Y}-\mathbf{X}\mathit{\beta})}^{T}{\mathbf{V}}^{-1}(\mathbf{Y}-\mathbf{X}\mathit{\beta}),\end{array}$

where Let **Φ** = **UDU**^{T} be the eigen (spectral) decomposition of the kinship matrix **Φ**, where ${\mathbf{U}}_{{N}_{T}\times {N}_{T}}$ is an orthonormal matrix of eigenvectors (i.e. **UU**^{T} = **I**) and ${\mathbf{D}}_{{N}_{T}\times {N}_{T}}$ is a diagonal matrix of eigenvalues Λ_{i}. **V** can then be further simplified [56]

$\begin{array}{cc}\hfill \mathbf{V}& =\eta \mathbf{\Phi}+(1-\eta )\mathbf{I}\hfill \\ & =\eta \mathbf{U}\mathbf{D}{\mathbf{U}}^{T}+(1-\eta )\mathbf{U}\mathbf{I}{\mathbf{U}}^{T}\hfill \\ & =\mathbf{U}\eta \mathbf{D}{\mathbf{U}}^{T}+\mathbf{U}(1-\eta )\mathbf{I}{\mathbf{U}}^{T}\hfill \\ & =\mathbf{U}(\eta \mathbf{D}+(1-\eta \left)\mathbf{I}\right){\mathbf{U}}^{T}\hfill \\ & =\mathbf{U}\tilde{\mathbf{D}}{\mathbf{U}}^{T},\hfill \end{array}$

where
$\begin{array}{cc}\hfill \tilde{\mathbf{D}}& =\eta \mathbf{D}+(1-\eta )\mathbf{I}\hfill \\ & =\eta \left[\begin{array}{cccc}{\Lambda}_{1}& \hfill & \hfill & \hfill \\ \hfill & {\Lambda}_{2}& \hfill & \hfill \\ \hfill & \hfill & \ddots & \hfill \\ \hfill & \hfill & \hfill & {\Lambda}_{{N}_{T}}\end{array}\right]+(1-\eta )\left[\begin{array}{cccc}1& \hfill & \hfill & \hfill \\ \hfill & 1& \hfill & \hfill \\ \hfill & \hfill & \ddots & \hfill \\ \hfill & \hfill & \hfill & 1\end{array}\right]\hfill \\ & =\left[\begin{array}{cccc}1+\eta ({\Lambda}_{1}-1)& \hfill & \hfill & \hfill \\ \hfill & 1+\eta ({\Lambda}_{2}-1)& \hfill & \hfill \\ \hfill & \hfill & \ddots & \hfill \\ \hfill & \hfill & \hfill & 1+\eta ({\Lambda}_{{N}_{T}}-1)\end{array}\right]\hfill \end{array}$

$\begin{array}{cc}& =\text{diag}\{1+\eta ({\Lambda}_{1}-1),1+\eta ({\Lambda}_{2}-1),\dots ,1+\eta ({\Lambda}_{{N}_{T}}-1)\}.\end{array}$

Since Eq 5 is a diagonal matrix, its inverse is also a diagonal matrix:
$\begin{array}{c}\hfill {\tilde{\mathbf{D}}}^{-1}=\text{diag}\{\frac{1}{1+\eta ({\Lambda}_{1}-1)},\frac{1}{1+\eta ({\Lambda}_{2}-1)},\dots ,\frac{1}{1+\eta ({\Lambda}_{{N}_{T}}-1)}\}.\end{array}$

From Eqs 4 and 6, log(det(**V**)) simplifies to

$\begin{array}{cc}\hfill \text{log}\left(\text{det}\right(\mathbf{V}\left)\right)& =\text{log}\left(\text{det}\left(\mathbf{U}\right)\text{det}\right(\tilde{\mathbf{D}}\left)\text{det}\left({\mathbf{U}}^{T}\right)\right)\\ & =\text{log}\left\{\prod _{i=1}^{{N}_{T}}\right(1+\eta ({\Lambda}_{i}-1)\left)\right\}\\ & =\sum _{i=1}^{{N}_{T}}\text{log}(1+\eta ({\Lambda}_{i}-1)),\end{array}$

since det($\begin{array}{cc}\hfill {\mathbf{V}}^{-1}& ={\left(\mathbf{U}\tilde{\mathbf{D}}{\mathbf{U}}^{T}\right)}^{-1}\hfill \\ & ={\left({\mathbf{U}}^{T}\right)}^{-1}{\left(\tilde{\mathbf{D}}\right)}^{-1}{\mathbf{U}}^{-1}\hfill \\ & =\mathbf{U}{\tilde{\mathbf{D}}}^{-1}{\mathbf{U}}^{T},\hfill \end{array}$

since for an orthonormal matrix $\begin{array}{cc}\hfill -\ell (\mathbf{\Theta})& \propto \frac{{N}_{T}}{2}\text{log}\left({\sigma}^{2}\right)+\frac{1}{2}\sum _{i=1}^{{N}_{T}}\text{log}(1+\eta ({\Lambda}_{i}-1))+\frac{1}{2{\sigma}^{2}}{(\mathbf{Y}-\mathbf{X}\mathit{\beta})}^{T}\mathbf{U}{\tilde{\mathbf{D}}}^{-1}{\mathbf{U}}^{T}(\mathbf{Y}-\mathbf{X}\mathit{\beta})\hfill \\ & =\frac{{N}_{T}}{2}\text{log}\left({\sigma}^{2}\right)+\frac{1}{2}\sum _{i=1}^{{N}_{T}}\text{log}(1+\eta ({\Lambda}_{i}-1))+\frac{1}{2{\sigma}^{2}}{({\mathbf{U}}^{T}\mathbf{Y}-{\mathbf{U}}^{T}\mathbf{X}\mathit{\beta})}^{T}{\tilde{\mathbf{D}}}^{-1}({\mathbf{U}}^{T}\mathbf{Y}-{\mathbf{U}}^{T}\mathbf{X}\mathit{\beta})\hfill \\ & =\frac{{N}_{T}}{2}\text{log}\left({\sigma}^{2}\right)+\frac{1}{2}\sum _{i=1}^{{N}_{T}}\text{log}(1+\eta ({\Lambda}_{i}-1))+\frac{1}{2{\sigma}^{2}}{(\tilde{\mathbf{Y}}-\tilde{\mathbf{X}}\mathit{\beta})}^{T}{\tilde{\mathbf{D}}}^{-1}(\tilde{\mathbf{Y}}-\tilde{\mathbf{X}}\mathit{\beta})\hfill \\ & =\frac{{N}_{T}}{2}\text{log}\left({\sigma}^{2}\right)+\frac{1}{2}\sum _{i=1}^{{N}_{T}}\text{log}(1+\eta ({\Lambda}_{i}-1))+\frac{1}{2{\sigma}^{2}}\sum _{i=1}^{{N}_{T}}\frac{{({\tilde{Y}}_{i}-{\sum}_{j=0}^{p}{\tilde{X}}_{ij+1}{\mathit{\beta}}_{j})}^{2}}{1+\eta ({\Lambda}_{i}-1)},\hfill \end{array}$

where $\tilde{\mathbf{Y}}={\mathbf{U}}^{T}\mathbf{Y}$, $\tilde{\mathbf{X}}={\mathbf{U}}^{T}\mathbf{X}$, ${\tilde{Y}}_{i}$ denotes the We define the *p* + 3 length vector of parameters **Θ** ≔ (Θ_{0}, Θ_{1}, …, Θ_{p+1}, Θ_{p+2}, Θ_{p+3}) = (* β*,

$\begin{array}{c}\hfill {Q}_{\lambda}(\mathbf{\Theta})=f(\mathbf{\Theta})+\lambda \sum _{j\ne 0}{v}_{j}{P}_{j}\left({\beta}_{j}\right),\end{array}$

where $\begin{array}{c}\hfill {\widehat{\mathbf{\Theta}}}_{\lambda}=\underset{\mathbf{\Theta}}{\text{arg}\phantom{\rule{2pt}{0ex}}\text{min}}\phantom{\rule{2pt}{0ex}}{Q}_{\lambda}(\mathbf{\Theta}).\end{array}$

This is the general set-up for our model. In the next Section we provide more specific details on how we solve Eq 11. We note here that the main difference between the proposed model, and the We use a general purpose block coordinate gradient descent algorithm (CGD) [58] to solve Eq 11. At each iteration, we cycle through the coordinates and minimize the objective function with respect to one coordinate only. For continuously differentiable *f*(⋅) and convex and block-separable *P*(⋅) (i.e. *P*(* β*) = ∑

**Algorithm 1**: Block Coordinate Gradient Descent

Set the iteration counter *k* ← 0, initial values for the parameter vector **Θ**^{(0)} and convergence threshold *ϵ*;

**for** λ ∈ {λ_{max}, …, λ_{min}} **do**

**repeat**

$\begin{array}{c}{\mathit{\beta}}^{(k+1)}\leftarrow \underset{\mathit{\beta}}{\text{arg}\phantom{\rule{4pt}{0ex}}\text{min}}\phantom{\rule{4pt}{0ex}}{Q}_{\lambda}(\mathit{\beta},{\eta}^{\left(k\right)},{{\sigma}^{2}}^{\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}\left(k\right)})\\ {\eta}^{(k+1)}\leftarrow \underset{\eta}{\text{arg}\phantom{\rule{4pt}{0ex}}\text{min}}\phantom{\rule{4pt}{0ex}}{Q}_{\lambda}({\mathit{\beta}}^{(k+1)},\eta ,{{\sigma}^{2}}^{\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}\left(k\right)})\\ {{\sigma}^{2}}^{\phantom{\rule{0.166667em}{0ex}}\phantom{\rule{0.166667em}{0ex}}(k+1)}\leftarrow \underset{{\sigma}^{2}}{\text{arg}\phantom{\rule{4pt}{0ex}}\text{min}}\phantom{\rule{4pt}{0ex}}{Q}_{\lambda}({\mathit{\beta}}^{(k+1)},{\eta}^{(k+1)},{\sigma}^{2})\end{array}$

*k* ← *k* + 1

**until***convergence criterion is satisfied*: ‖**Θ**^{(k+1)} − **Θ**^{(k)}‖_{2} < *ϵ*;

end

Recall that the part of the objective function that depends on * β* has the form

$\begin{array}{c}\hfill {Q}_{\lambda}(\mathbf{\Theta})=\frac{1}{2}\sum _{i=1}^{{N}_{T}}{w}_{i}{({\tilde{Y}}_{i}-\sum _{j=0}^{p}{\tilde{X}}_{ij+1}{\mathit{\beta}}_{j})}^{2}+\lambda \sum _{j=1}^{p}{v}_{j}\left|{\mathit{\beta}}_{j}\right|,\end{array}$

where
$\begin{array}{c}\hfill {w}_{i}\u2254\frac{1}{{\sigma}^{2}(1+\eta ({\Lambda}_{i}-1\left)\right)}.\end{array}$

Conditional on *η*^{(k)} and *σ*^{2}^{(k)}, it can be shown that the solution for *β*_{j}, *j* = 1, …, *p* is given by

$\begin{array}{c}\hfill {\mathit{\beta}}_{j}^{(k+1)}\leftarrow \frac{{\mathcal{S}}_{\lambda}\left({\sum}_{i=1}^{{N}_{T}}{w}_{i}{\tilde{X}}_{ij}\right({\tilde{Y}}_{i}-{\sum}_{\ell \ne j}{\tilde{X}}_{i\ell}{\mathit{\beta}}_{\ell}^{\left(k\right)}\left)\right)}{{\sum}_{i=1}^{{N}_{T}}{w}_{i}{\tilde{X}}_{ij}^{2}},\end{array}$

where ${\mathcal{S}}_{\lambda}\left(x\right)$ is the soft-thresholding operator
$\begin{array}{c}\hfill {\mathcal{S}}_{\lambda}\left(x\right)=\text{sign}\left(x\right){\left(\right|x|-\lambda )}_{+},\end{array}$

sign($\begin{array}{c}\hfill \text{sign}\left(x\right)=\{\begin{array}{cc}-1\hfill & x<0\hfill \\ 0\hfill & x=0\hfill \\ 1\hfill & x>0\hfill \end{array},\end{array}$

and (Given **β**^{(k+1)} and *σ*^{2}^{(k)}, solving for *η*^{(k+1)} becomes a univariate optimization problem:

$\begin{array}{c}\hfill {\eta}^{(k+1)}\leftarrow \underset{\eta}{\text{arg}\phantom{\rule{4pt}{0ex}}\text{min}}\frac{1}{2}\sum _{i=1}^{{N}_{T}}\text{log}(1+\eta ({\Lambda}_{i}-1))+\frac{1}{2{{\sigma}^{2}}^{\phantom{\rule{0.166667em}{0ex}}\left(k\right)}}\sum _{i=1}^{{N}_{T}}\frac{{({\tilde{Y}}_{i}-{\sum}_{j=0}^{p}{\tilde{X}}_{ij+1}{\mathit{\beta}}_{j}^{(k+1)})}^{2}}{1+\eta ({\Lambda}_{i}-1)}.\end{array}$

We use a bound constrained optimization algorithm [60] implemented in the Conditional on **β**^{(k+1)} and *η*^{(k+1)}, *σ*^{2}^{(k+1)} can be solved for using the following equation:

$\begin{array}{c}\hfill {{\sigma}^{2}}^{\phantom{\rule{0.166667em}{0ex}}(k+1)}\leftarrow \underset{{\sigma}^{2}}{\text{arg}\phantom{\rule{4pt}{0ex}}\text{min}}\frac{{N}_{T}}{2}\text{log}\left({\sigma}^{2}\right)+\frac{1}{2{\sigma}^{2}}\sum _{i=1}^{{N}_{T}}\frac{{({\tilde{Y}}_{i}-{\sum}_{j=0}^{p}{\tilde{X}}_{ij+1}{\beta}_{j})}^{2}}{1+\eta ({\Lambda}_{i}-1)}.\end{array}$

There exists an analytic solution for Eq 12 given by:

$\begin{array}{c}\hfill {{\sigma}^{2}}^{\phantom{\rule{0.166667em}{0ex}}(k+1)}\leftarrow \frac{1}{{N}_{T}}\sum _{i=1}^{{N}_{T}}\frac{{({\tilde{Y}}_{i}-{\sum}_{j=0}^{p}{\tilde{X}}_{ij+1}{\beta}_{j}^{(k+1)})}^{2}}{1+{\eta}^{(k+1)}({\Lambda}_{i}-1)}.\end{array}$

In this section we describe how to determine the sequence of tuning parameters λ at which to fit the model. Recall that our objective function has the form

$\begin{array}{c}\hfill {Q}_{\lambda}(\mathbf{\Theta})=\frac{{N}_{T}}{2}\text{log}\left({\sigma}^{2}\right)+\frac{1}{2}\sum _{i=1}^{{N}_{T}}\text{log}(1+\eta ({\Lambda}_{i}-1))+\frac{1}{2}\sum _{i=1}^{{N}_{T}}{w}_{i}{({\tilde{Y}}_{i}-\sum _{j=0}^{p}{\tilde{X}}_{ij+1}{\mathit{\beta}}_{j})}^{2}+\lambda \sum _{j=1}^{p}{v}_{j}\left|{\beta}_{j}\right|.\end{array}$

The Karush-Kuhn-Tucker (KKT) optimality conditions for Eq 13 are given by:
$\begin{array}{c}\hfill \begin{array}{cc}\hfill \frac{\partial}{\partial {\beta}_{1},\dots ,{\beta}_{p}}{Q}_{\lambda}(\mathbf{\Theta})& ={\mathbf{0}}_{p}\hfill \\ \hfill \frac{\partial}{\partial {\beta}_{0}}{Q}_{\lambda}(\mathbf{\Theta})& =0\hfill \\ \hfill \frac{\partial}{\partial \eta}{Q}_{\lambda}(\mathbf{\Theta})& =0\hfill \\ \hfill \frac{\partial}{\partial {\sigma}^{2}}{Q}_{\lambda}(\mathbf{\Theta})& =0.\hfill \end{array}\end{array}$

The equations in Eq 14 are equivalent to

$\begin{array}{c}\hfill \begin{array}{c}\hfill \sum _{i=1}^{{N}_{T}}{w}_{i}{\tilde{X}}_{i1}({\tilde{Y}}_{i}-\sum _{j=0}^{p}{\tilde{X}}_{ij+1}{\beta}_{j})=0\\ \hfill \frac{1}{{v}_{j}}\sum _{i=1}^{{N}_{T}}{w}_{i}{\tilde{X}}_{ij}({\tilde{Y}}_{i}-\sum _{j=0}^{p}{\tilde{X}}_{ij+1}{\beta}_{j})=\lambda {\gamma}_{j}\\ \hfill {\gamma}_{j}\in \{\begin{array}{cc}\text{sign}\left({\widehat{\beta}}_{j}\right)\hfill & \text{if}\phantom{\rule{1em}{0ex}}{\widehat{\beta}}_{j}\ne 0\hfill \\ [-1,1]& \text{if}\phantom{\rule{1em}{0ex}}{\widehat{\beta}}_{j}=0\hfill \end{array},\phantom{\rule{2em}{0ex}}\text{for}\phantom{\rule{2em}{0ex}}j=1,\dots ,p\\ \hfill \frac{1}{2}\sum _{i=1}^{{N}_{T}}\frac{{\Lambda}_{i}-1}{1+\eta ({\Lambda}_{i}-1)}(1-\frac{{({\tilde{Y}}_{i}-{\sum}_{j=0}^{p}{\tilde{X}}_{ij+1}{\beta}_{j})}^{2}}{{\sigma}^{2}(1+\eta ({\Lambda}_{i}-1))})=0\\ \hfill {\sigma}^{2}-\frac{1}{{N}_{T}}\sum _{i=1}^{{N}_{T}}\frac{{({\tilde{Y}}_{i}-{\sum}_{j=0}^{p}{\tilde{X}}_{ij+1}{\beta}_{j})}^{2}}{1+\eta ({\Lambda}_{i}-1)}=0,\end{array}\end{array}$

where $\begin{array}{c}\hfill \begin{array}{c}\hfill \frac{1}{{v}_{j}}\sum _{i=1}^{{N}_{T}}|{w}_{i}{\tilde{X}}_{ij}({\tilde{Y}}_{i}-{\tilde{X}}_{i1}{\beta}_{0})|\le \lambda ,\phantom{\rule{1em}{0ex}}\forall j=1,\dots ,p\\ \hfill {\beta}_{0}=\frac{{\sum}_{i=1}^{{N}_{T}}{w}_{i}{\tilde{X}}_{i1}{\tilde{Y}}_{i}}{{\sum}_{i=1}^{{N}_{T}}{w}_{i}{\tilde{X}}_{i1}^{2}}\\ \hfill \frac{1}{2}\sum _{i=1}^{{N}_{T}}\frac{{\Lambda}_{i}-1}{1+\eta ({\Lambda}_{i}-1)}(1-\frac{{({\tilde{Y}}_{i}-{\tilde{X}}_{i1}{\beta}_{0})}^{2}}{{\sigma}^{2}(1+\eta ({\Lambda}_{i}-1))})=0\\ \hfill {\sigma}^{2}=\frac{1}{{N}_{T}}\sum _{i=1}^{{N}_{T}}\frac{{({\tilde{Y}}_{i}-{\tilde{X}}_{i1}{\beta}_{0})}^{2}}{1+\eta ({\Lambda}_{i}-1)}.\end{array}\end{array}$

We can solve the KKT system of equations in Eq 16 (with a numerical solution for $\begin{array}{c}\hfill {\lambda}_{max}=\underset{j}{\text{max}}\left\{\right|\frac{1}{{v}_{j}}\sum _{i=1}^{{N}_{T}}\widehat{{w}_{i}}{\tilde{X}}_{ij}({\tilde{Y}}_{i}-{\tilde{X}}_{i1}{\widehat{\beta}}_{0})\left|\right\},\phantom{\rule{1em}{0ex}}j=1,\dots ,p.\end{array}$

Following Friedman et al. [20], we choose The way in which we have derived the sequence of tuning parameters using the KKT conditions, allows us to implement warm starts. That is, the solution $\widehat{\mathbf{\Theta}}$ for λ_{k} is used as the initial value Θ^{(0)} for λ_{k+1}. This strategy leads to computational speedups and has been implemented in the `ggmix` R package.

We use an empirical Bayes approach (e.g. [61]) to predict the random effects * b*. Let the maximum a posteriori (MAP) estimate be defined as

$\begin{array}{c}\hfill \widehat{\mathit{b}}=\underset{\mathit{b}}{arg\; max}f\left(\mathit{b}\right|\mathbf{Y},\mathit{\beta},\eta ,{\sigma}^{2}),\end{array}$

where, by using Bayes rule, $\begin{array}{cc}\hfill f\left(\mathit{b}\right|\mathbf{Y},\mathit{\beta},\eta ,{\sigma}^{2})& =\frac{f\left(\mathbf{Y}\right|\mathit{b},\mathit{\beta},\eta ,{\sigma}^{2})\pi \left(\mathit{b}\right|\eta ,{\sigma}^{2})}{f\left(\mathbf{Y}\right|\mathit{\beta},\eta ,{\sigma}^{2})}\hfill \\ & \propto f\left(\mathbf{Y}\right|\mathit{b},\mathit{\beta},\eta ,{\sigma}^{2})\pi \left(\mathit{b}\right|\eta ,{\sigma}^{2})\hfill \\ & \propto \text{exp}\{-\frac{1}{2{\sigma}^{2}}{(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathit{b})}^{T}(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathit{b})-\frac{1}{2\eta {\sigma}^{2}}{\mathit{b}}^{T}{\mathbf{\Phi}}^{-1}\mathit{b}\}\hfill \\ & =\text{exp}\{-\frac{1}{2{\sigma}^{2}}[{(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathit{b})}^{T}(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathit{b})+\frac{1}{\eta}{\mathit{b}}^{T}{\mathbf{\Phi}}^{-1}\mathit{b}\left]\right\}.\hfill \end{array}$

Solving for Eq 17 is equivalent to minimizing the exponent in Eq 18:
$\begin{array}{c}\hfill \widehat{\mathit{b}}=\underset{\mathit{b}}{\text{arg}\phantom{\rule{4pt}{0ex}}\text{min}}\{{(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathit{b})}^{T}(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathit{b})+\frac{1}{\eta}{\mathit{b}}^{T}{\mathbf{\Phi}}^{-1}\mathit{b}\}.\end{array}$

Taking the derivative of Eq 19 with respect to $\begin{array}{cc}\hfill 0& =-2(\mathbf{Y}-\mathbf{X}\mathit{\beta}-\mathit{b})+\frac{2}{\eta}{\mathbf{\Phi}}^{-1}\mathit{b}\hfill \\ & =-(\mathbf{Y}-\mathbf{X}\mathit{\beta})+\mathit{b}+\left(\frac{1}{\eta}{\mathbf{\Phi}}^{-1}\right)\mathit{b}\hfill \\ \hfill (\mathbf{Y}-\mathbf{X}\mathit{\beta})& =({\mathbf{I}}_{{N}_{T}\times {N}_{T}}+\frac{1}{\eta}{\mathbf{\Phi}}^{-1})\mathit{b}\hfill \\ \hfill \widehat{\mathit{b}}& ={({\mathbf{I}}_{{N}_{T}\times {N}_{T}}+\frac{1}{\widehat{\eta}}{\mathbf{\Phi}}^{-1})}^{-1}(\mathbf{Y}-\mathbf{X}\widehat{\mathit{\beta}})\hfill \\ & ={({\mathbf{I}}_{{N}_{T}\times {N}_{T}}+\frac{1}{\widehat{\eta}}\mathbf{U}{\mathbf{D}}^{-1}{\mathbf{U}}^{T})}^{-1}(\mathbf{Y}-\mathbf{X}\widehat{\mathit{\beta}}),\hfill \end{array}$

where $(\widehat{\mathit{\beta}},\widehat{\eta})$ are the estimates obtained from Algorithm 1.Here we describe the method used for predicting the unobserved phenotype **Y**^{⋆} in a set of individuals with predictor set **X**^{⋆} that were not used in the model training e.g. a testing set. Let *q* denote the number of observations in the testing set and *N* − *q* the number of observations in the training set. We assume that a `ggmix` model has been fit on a set of training individuals with observed phenotype **Y** and predictor set **X**. We further assume that **Y** and **Y**^{⋆} are jointly multivariate Normal:

$\left[{}_{\mathbf{Y}}^{\mathbf{Y}\star}\right]~N\left(\left[\begin{array}{c}{\mathit{\mu}}_{{\text{1}}_{\left(q\times 1\right)}}\\ {\mathit{\mu}}_{{\text{2}}_{\left(N-q\right)\times 1}}\end{array}\right],\left[\begin{array}{cc}{\mathbf{\Sigma}}_{{11}_{\left(q\times q\right)}}& {\mathbf{\Sigma}}_{{12}_{q\times \left(N-q\right)}}\\ {\mathbf{\Sigma}}_{{21}_{\left(N-q\right)\times q}}& {\mathbf{\Sigma}}_{{22}_{\left(N-q\right)\times \left(N-q\right)}}\end{array}\right]\right).$

Then, from standard multivariate Normal theory, the conditional distribution **Y**^{⋆}|**Y**, *η*, *σ*^{2}, *β*, **X**, **X**^{⋆} is $\mathcal{N}({\mu}^{\star},{\mathbf{\Sigma}}^{\star})$ where

$\begin{array}{cc}\hfill {\mathit{\mu}}^{\star}& ={\mathit{\mu}}_{1}+{\mathbf{\Sigma}}_{12}{\mathbf{\Sigma}}_{22}^{-1}(\mathbf{Y}-{\mu}_{2})\\ \hfill {\mathbf{\Sigma}}^{\star}& ={\mathbf{\Sigma}}_{11}-{\mathbf{\Sigma}}_{12}{\mathbf{\Sigma}}_{22}^{-1}{\mathbf{\Sigma}}_{21}.\end{array}$

The phenotype prediction is thus given by:

$\begin{array}{cc}\hfill {\mathit{\mu}}_{q\times 1}^{\star}& ={\mathbf{X}}^{\star}\mathit{\beta}+\frac{1}{{\sigma}^{2}}{\mathbf{\Sigma}}_{12}{\mathbf{V}}^{-1}(\mathbf{Y}-\mathbf{X}\mathit{\beta})\hfill \\ & ={\mathbf{X}}^{\star}\mathit{\beta}+\frac{1}{{\sigma}^{2}}{\mathbf{\Sigma}}_{12}\mathbf{U}{\tilde{\mathbf{D}}}^{-1}{\mathbf{U}}^{T}(\mathbf{Y}-\mathbf{X}\mathit{\beta})\hfill \\ & ={\mathbf{X}}^{\star}\mathit{\beta}+\frac{1}{{\sigma}^{2}}{\mathbf{\Sigma}}_{12}\mathbf{U}{\tilde{\mathbf{D}}}^{-1}(\tilde{\mathbf{Y}}-\tilde{\mathbf{X}}\mathit{\beta})\hfill \\ & ={\mathbf{X}}^{\star}\mathit{\beta}+\frac{1}{{\sigma}^{2}}\eta {\sigma}^{2}{\mathbf{\Phi}}^{\star}\mathbf{U}{\tilde{\mathbf{D}}}^{-1}(\tilde{\mathbf{Y}}-\tilde{\mathbf{X}}\mathit{\beta})\hfill \\ & ={\mathbf{X}}^{\star}\mathit{\beta}+\eta {\mathbf{\Phi}}^{\star}\mathbf{U}{\tilde{\mathbf{D}}}^{-1}(\tilde{\mathbf{Y}}-\tilde{\mathbf{X}}\mathit{\beta}),\hfill \end{array}$

where In order to choose the optimal value of the tuning parameter λ, we use the generalized information criterion [62] (GIC):

$\begin{array}{c}\hfill GI{C}_{\lambda}=-2\ell (\widehat{\mathit{\beta}},{\widehat{\sigma}}^{2},\widehat{\eta})+{a}_{n}\xb7{\widehat{df}}_{\lambda},\end{array}$

where ${\widehat{df}}_{\lambda}$ is the number of non-zero elements in ${\widehat{\mathit{\beta}}}_{\lambda}$ [63] plus two (representing the variance parameters The `ggmix` method is written in an R package, which is freely available on CRAN at https://cran.r-project.org/package=ggmix. The complete documentation for this package is available at https://sahirbhatnagar.com/ggmix/. Scripts for running the analyses and reproducing the tables and figures reported in the manuscript are available in an `RMarkdown` document at https://github.com/sahirbhatnagar/ggmix/blob/master/manuscript/bin/tables-figures.Rmd.

Part of this research has been conducted using the UK Biobank Resource under project number 27449. We appreciate the generosity of UK Biobank volunteers. MF was at the Lady Davis Institute when she undertook this research. We appreciate advice on an earlier version of the manuscript provided by Dr. Simon Gravel, Dr. David Fardo and Dr. Abbas Khalili.

TA Manolio, FS Collins, NJ Cox, DB Goldstein, LA Hindorff, DJ Hunter, et al . Finding the missing heritability of complex diseases.

J Yang, B Benyamin, BP McEvoy, S Gordon, AK Henders, DR Nyholt, et al . Common SNPs explain a large proportion of the heritability for human height.

W Astle, DJ Balding, et al . Population structure and cryptic relatedness in genetic association studies.

M Song, W Hao, JD Storey. . Testing for genetic associations in arbitrarily structured populations.

J Marchini, LR Cardon, MS Phillips, P Donnelly. . The effects of human population structure on large genetic association studies.

CJ Hoggart, JC Whittaker, M De Iorio, DJ Balding. . Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies.

C Lippert, J Listgarten, Y Liu, CM Kadie, RI Davidson, D Heckerman. . FaST linear mixed models for genome-wide association studies.

HM Kang, JH Sul, NA Zaitlen, Sy Kong, NB Freimer, C Sabatti, et al . Variance component model to account for sample structure in genome-wide association studies.

J Yu, G Pressoir, WH Briggs, IV Bi, M Yamasaki, JF Doebley, et al . A unified mixed-model method for association mapping that accounts for multiple levels of relatedness.

J Eu-Ahsunthornwattana, EN Miller, M Fakiola, SM Jeronimo, JM Blackwell, HJ Cordell, et al . Comparison of methods to account for relatedness in genome-wide association studies with family-based data.

AL Price, NJ Patterson, RM Plenge, ME Weinblatt, NA Shadick, D Reich. . Principal components analysis corrects for stratification in genome-wide association studies.

K Oualkacha, Z Dastani, R Li, PE Cingolani, TD Spector, CJ Hammond, et al . Adjusted sequence kernel association test for rare variants controlling for cryptic and family relatedness.

HJ Cordell, DG Clayton. . A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes.

B Rakitsch, C Lippert, O Stegle, K Borgwardt. . A Lasso multi-marker mixed model for association mapping with population structure correction.

D Wang, KM Eskridge, J Crossa. . Identifying QTLs and epistasis in structured plant populations using adaptive mixed LASSO.

J Friedman, T Hastie, R Tibshirani. . Regularization paths for generalized linear models via coordinate descent.

J Yang, NA Zaitlen, ME Goddard, PM Visscher, AL Price. . Advantages and pitfalls in the application of mixed-model association methods.

AR Gilmour, R Thompson, BR Cullis. . Average information REML: an efficient algorithm for variance parameter estimation in linear mixed models.

Dandine-Roulland C. gaston: Genetic Data Handling (QC, GRM, LD, PCA) and Linear Mixed Models; 2018. Available from: https://CRAN.R-project.org/package=gaston.

C Bycroft, C Freeman, D Petkova, G Band, LT Elliott, K Sharp, et al . The UK Biobank resource with deep phenotyping and genomic data.

U Biobank. . Genotyping and quality control of UK Biobank, a large-scale, extensively phenotyped prospective resource.

A Manichaikul, JC Mychaleckyj, SS Rich, K Daly, M Sale, WM Chen. . Robust relationship inference in genome-wide association studies.

L Yengo, J Sidorenko, KE Kemper, Z Zheng, AR Wood, MN Weedon, et al . Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of European ancestry.

S McCarthy, S Das, W Kretzschmar, O Delaneau, AR Wood, A Teumer, et al . A reference panel of 64,976 haplotypes for genotype imputation.

G Davey Smith, S Ebrahim. . ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease?

T Thornton, H Tang, TJ Hoffmann, HM Ochs-Balcom, BJ Caan, N Risch. . Estimating kinship in admixed populations.

DH Alexander, J Novembre, K Lange. . Fast model-based estimation of ancestry in unrelated individuals.

A Fortin, E Diez, D Rochefort, L Laroche, D Malo, GA Rouleau, et al . Recombinant congenic strains derived from A/J and C57BL/6J: a tool for genetic dissection of complex traits.

BJ Bennett, CR Farber, L Orozco, HM Kang, A Ghazalpour, N Siemers, et al . A high-resolution association mapping panel for the dissection of complex traits in mice.

R Cheng, JE Lim, KE Samocha, G Sokoloff, M Abney, AD Skol, et al . Genome-wide association studies and the problem of relatedness among advanced intercross lines and other highly recombinant populations.

T Di Pietrantonio, C Hernandez, M Girard, A Verville, M Orlova, A Belley, et al . Strain-specific differences in the genetic control of two closely related mycobacteria.

H Wang, BJ Lengerich, B Aragam, EP Xing. . Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data.

Y Sohrabi, H Havelková, T Kobets, M Šíma, V Volkova, I Grekov, et al . Mapping the Genes for Susceptibility and Response to Leishmania tropica in Mouse.

AU Jackson, A Fornés, A Galecki, RA Miller, DT Burke. . Multiple-trait quantitative trait loci analysis using a large mouse sibship.

MC Stern, F Benavides, EA Klingelberger, CJ Conti. . Allelotype analysis of chemically induced squamous cell carcinomas in F1 hybrids of two inbred mouse strains with different susceptibility to tumor progression.

PR Loh, G Tucker, BK Bulik-Sullivan, BJ Vilhjalmsson, HK Finucane, RM Salem, et al . Efficient Bayesian mixed-model analysis increases association power in large cohorts.

N Allen, C Sudlow, P Downey, T Peakman, J Danesh, P Elliott, et al . UK Biobank: Current status and what it means for epidemiology.

M Pirinen, P Donnelly, CC Spencer, et al . Efficient computation with a linear mixed model on large-scale data sets with applications to genetic studies.

J Schelldorfer, P Bühlmann, G DE, S VAN. . Estimation for High-Dimensional Linear Mixed-Effects Models Using L1-Penalization.

RH Byrd, P Lu, J Nocedal, C Zhu. . A limited memory algorithm for bound constrained optimization.

HD Bondell, A Krishna, SK Ghosh. . Joint Variable Selection for Fixed and Random Effects in Linear Mixed-Effects Models.

8 Sep 2019

Dear Dr Bhatnagar,

Thank you very much for submitting your Research Article entitled 'Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important problem, but raised some substantial concerns about the current manuscript. Based on the reviews, we will not be able to accept this version of the manuscript, but we would be willing to review again a much-revised version. We cannot, of course, promise publication at that time.

Should you decide to revise the manuscript for further consideration here, your revisions should address the specific points made by each reviewer. We will also require a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

If you decide to revise the manuscript for further consideration at PLOS Genetics, please aim to resubmit within the next 60 days, unless it will take extra time to address the concerns of the reviewers, in which case we would appreciate an expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments are included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see our guidelines.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, use the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

We are sorry that we cannot be more positive about your manuscript at this stage. Please do not hesitate to contact us if you have any concerns or questions.

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

**Comments to the Authors:**

**Please note here if the review is uploaded as an attachment.**

Reviewer #1: The authors demonstrate a method for genome-wide association that combines mixed-model based control of population structure or genetic relatedness with multi-variate regressions. The idea is to increase power for GWAS applications by including multiple markers at once while accounting for structure. Previous tools for this application are somewhat limited, so this method aims to provide a more comprehensive and computationally efficient tools. They include a well-structured R package to implement their method.

This is an important topic, and their software is much faster and easier to use than existing tools. However I have several comments on their presentation, particularly the comparisons with existing methods and some limitations of the simulations:

Major issues:

1. I think that the novelty of the method is not made clear in the paper. The model itself appears virtually identical to that of the glmmlasso R package (Schelldorfer et al 2011) except limited to one random effect – the objective function appears identical, and those authors also derived a block coordinate gradient descent algorithm, though I don’t know if they are identical. The addition here seems mostly in terms of computational efficiency and flexibility (combinations of L1 and L2 penalties vs just L1 for Schellendorfer). Using the SVD to rotate and diagonalize the LMM so that the random effect covariance matrix is diagonal is a very useful addition and makes the software very fast, but this is not really emphasized in the methods. Also, the BSLMM model (Zhou et al 2013 Plos Genetics) is also effectively a variable selection LMM, though it’s Bayesian instead of frequentist and requires MCMC so is slower. But it’s performance could be compared to ggmix and the two-step LASSO presented in this paper.

2. While this method is much faster than glmmlasso (and presumably BSLMM), how does it scale to large numbers of markers? Typical genomics datasets today contain >1e5-1e6 markers. The datasets used here seem to contain 10K-50K. There is some discussion about limitations of scaling to large N, but not large p.

3. The first set of simulations include population structure and admixture. But all 10K markers are simulated in linkage equilibrium. This is an ideal situation for LASSO. But in real data, nearby markers will be partially correlated. How well does this method select the correct marker when there are other correlated markers? What does it do when the true causal variant is not in the data, but imperfectly correlated markers are (the typical GWAS setting)? Does it tend to select the nearest marker, or does it select a set of nearby markers? I can’t tell in the GAW20 and mouse datasets the extent of the LD among markers and whether this is addressed there. Also, the simulation result that including the causal markers in the kinship matrix has little impact is encouraging. But the Yang et al 2014 paper, which discusses the impact of including causal markers in the estimated kinship matrix, suggests that this is only likely when N/M is large (ie as many or more individuals as markers). In the simulations presented here, N/M is <0.1, which is probably small enough that proximal contamination is not an issue. If N/M were ~1, then the impact might be greater. Note: M is the effective number of markers after accounting for LD among markers.

Minor issues

• 97: Wang et al 2011 does not treat the variance components as fixed, but iteratively estimates them along with beta.

• 142: How much of the total variance is accounted for by causal SNPs vs the background in the simulations?

• 355: X_1,…,X_N should each be transposed

• Fig 5: I don’t think that the red line is a p-value threshold. Maybe “significance threshold”? How can this be <100 if there were 200 bootstraps and significance required >50% inclusion?

• I believe Eq (30) is wrong. The term V^{-1} shouldn’t be present in the likelihood if b is included.

Reviewer #2: I have a few comments and suggestions:

Page 2, Line 33: It this true: better sensitivity and specificity? What are you using to define sensitivity?

Page 7, line 143-163. I found the definitions of the different X matrices confusing. I think you can simplify to indexes: eg: causal for list of causal snp indexes, kinship for list of snp index in the kinship matrix, etc. I’d also refer the the snps as covariates and not label them as fixed, and state when the causal set is in the kinship set.

Page 7, line 159: Did you try a larger number of SNPs? In GWAS the number is much larger and PRS methods use much more than 50 independent SNPs for estimation.

Page 10, line 183: if I understand your sparsity estimate correctly, with a causal rate of 1%, setting all B to zero would give a value of 99%?

Page 11, 188: I am curious why you only reported the ‘optimal’ value of the penalty parameter. Is your method outperforming in terms of sparsity because it just does a better job of selecting a sparse model? Your false positive rate is lower but the true positive rate is much lower. If one is searching for the set of true causal variants, they are usually willing to take the tradeoff of better sensitivity for weeding through the false positives. I would prefer to see curves of FP versus TP rates with the values at the optimal tuning parameter marked. In practice I found AIC/BIC somewhat conservative compared to CV or controlling for error rate via phenotype permutation.

Page 11, Line 196: I am sorry if I missed this somewhere, but how was the model tuned for the lasso and twostep in the training data? Did ggmix use the GIC in the materials and methods?

Page 14, Line 243 typo- methods

Page 12: Figure 3. This data might be better summarized in a table that could include the additional data in the supplementary files.

The math is a bit beyond my abilities, but I have previously read a paper that suggests maximizing the log likelihood, −12[ln|V|+(Y−βX)TV−1(Y−BX)] subject to the L1/L2 norm penalty to control for relatedness in penalized regression methods for genetic data (where V is the variance covariance matrix). Is this essentially what you are doing?

It would have been interesting to use the set of related individuals in the UK Biobank on a few traits where PRS works well.

Reviewer #3: The authors present a penalized multi-variate regression model, ggmix, that jointly models multiple genotypes in mixed-model setup that incorporates the kinship or the genetic relationship matrix (GRM). It is an important problem in statistical genetics and I appreciate that the authors developed a comprehensive algorithm that simultaneously incorporates population structure and variable selection problem. The methods are well described and the paper was easy to follow, but I have some concerns in the experiments and evaluation metric. Overall, I think the paper presents an important problem, but the results are not convincing.

First, in simulation results, the ‘correct sparsity’ measure is basically ‘accuracy’ measure in binary classification problem of whether the regression coefficients are zero or non-zero. This ‘accuracy’ measure is often misleading when class distribution is imbalanced. For example, if 99% of coefficients are zero, you can get 99% accuracy by just classifying everything to be zero, which is clearly not a good model. I suggest adapting different measure, such as MCC. I can see that twostep and LASSO both maintains FPR at 0.05, which is why ‘correct sparsity’ is around 0.95. At the same time, we can see that LASSO and twostep achieve higher TPR. Also from a slightly different point of view, in Figure 3(D), comparing TPR at different points on FPR is not a fair comparison. To compare different methods in the context of TPR and FPR, either AUC or TPR at the same rate of FPR should be considered.

Intro is slightly misleading, especially in lines 107-109, because I first thought that ggmix takes out causal (i.e. selected) variables out of the relationship matrix, then I later realized that loss of power due to causal SNPs included in the GRM still happens in ggmix, but you aim to minimize the loss by joint modeling – but it is not clear why this would be the case. Is there any theoretical or simulation-based evidence that joint modeling achieves higher power in such a case?

In the mouse data, the model parameters were optimized so that the two loci are picked up, and then the evaluation metric is based on whether these two loci are picked up, which is circular.

In discussion, it was mentioned that leave-one-chromosome-out approach is possible, but has not been tried. What would be the compelling reason to model all chromosomes together in the proposed problem, especially when the model is still additive and trans-interaction term is not directly modeled?

**********

**Have all data underlying the figures and results presented in the manuscript been provided?**

Large-scale datasets should be made available via a public repository as described in the *PLOS Genetics*data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

**Do you want your identity to be public for this peer review?** For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

11 Feb 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Bhatnagar,

Thank you very much for submitting your Research Article entitled 'Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models' to PLOS Genetics. Your manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some aspects of the manuscript that should be improved.

We therefore ask you to modify the manuscript according to the review recommendations before we can consider your manuscript for acceptance. Your revisions should address the specific points made by each reviewer.

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

If present, accompanying reviewer attachments should be included with this email; please notify the journal office if any appear to be missing. They will also be available for download from the link below. You can use this link to log into the system when you are ready to submit a revised version, having first consulted our Submission Checklist.

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Please be aware that our data availability policy requires that all numerical data underlying graphs or summary statistics are included with the submission, and you will need to provide this upon resubmission if not already present. In addition, we do not permit the inclusion of phrases such as "data not shown" or "unpublished results" in manuscripts. All points should be backed up by data provided with the submission.

PLOS has incorporated Similarity Check, powered by iThenticate, into its journal-wide submission system in order to screen submitted content for originality before publication. Each PLOS journal undertakes screening on a proportion of submitted articles. You will be contacted if needed following the screening process.

To resubmit, you will need to go to the link below and 'Revise Submission' in the 'Submissions Needing Revision' folder.

[LINK]

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Reviewer's Responses to Questions

**Comments to the Authors:**

**Please note here if the review is uploaded as an attachment.**

Reviewer #1: The authors have significantly revised their manuscript. I think that the inclusions of BSLMM in the comparisons is useful. However, I have a couple additional concerns.

1. Are you sure you’re extracting the model fit from BSLMM correctly? The specifics of use are not described in the methods. It’s a Bayesian model, so gives a probability of inclusion of each SNP. From the prefix.param.txt file, you should use the gamma column to report the number of SNPs that cross a posterior inclusion probability threshold. If you count how many betas are != 0, that will likely be large. But this is not an accurate estimate of the number of markers included in the model. You can also get the posterior on the number of SNPs included from the prefix.gamma.txt file. If you’re counting SNPs that are included, what posterior probability threshold are you using for inclusion? If you’re reporting the % of SNPs included, are you reporting the posterior mean?

2. Also, the authors didn’t apply BSLMM to several of the analyses. There is a write.plink function in the snpStats package that could be used to write GEMMA-compatible input files from R. Also, the PhenotypeSimulator package seems to have a writeStandardOutput function that can write bimbam or Gemma output. Especially if the extraction of estimates of needs to be revised and in fact works more similarly to the other methods, then I would recommend applying it to all analyses (Biobank may be too large).

3. I am still a bit confused about when tuning parameters were set based on TPR / FPR and when based on GIC or CV or other direct methods. In real data, it’s generally not possible to set based on TPR/FPR, but I think most of your comparisons now are done that way. I understand that the goal is to show that your model can work well, and so comparing to the truth is useful. But I think more clarity is needed about when you’re demonstrating the true performance of the model vs when you’re demonstrating how the model would actually be run by a practitioner who did not know any true positive effects going in.

4. 412-414. Is this statement true? Schelldorfer et al 2011 used what they called a "Block Coordinate Gradient Descent method" for their penalized LMM

5. 501: I think that this is underselling your method. The main difference is that your method is orders of magnitude faster than lmmlasso, so it’s actually usable. A secondary difference is that it is limited to only one random effect.

6. 288: how do you get a p-value from ggmix?

Reviewer #3: The authors addressed most of my major concerns in this revision. I have one more comment:

The authors first state they could not run BSLMM on simulated data and mouse data because of data format issues. Converting genotypes to PLINK format should be straightforward and should not be an issue. I understand for simulating large amount of data, it might be practically difficult, but I cannot understand why mouse microsatellite data cannot be converted to PLINK format. I do not think you need additional experiments with BSLMM but please remove the statement that converting the data to PLINK format is not possible.

**********

**Have all data underlying the figures and results presented in the manuscript been provided?**

Large-scale datasets should be made available via a public repository as described in the *PLOS Genetics*data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

**Do you want your identity to be public for this peer review?** For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: No

8 Apr 2020

Dear Dr Bhatnagar,

We are pleased to inform you that your manuscript entitled "Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional accept, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about one way to make your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Heather J Cordell

Associate Editor

PLOS Genetics

Scott Williams

Section Editor: Natural Variation

PLOS Genetics

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

**Comments to the Authors:**

**Please note here if the review is uploaded as an attachment.**

Reviewer #1: I am satisfied with the changes made by the authors

**********

**Have all data underlying the figures and results presented in the manuscript been provided?**

Large-scale datasets should be made available via a public repository as described in the *PLOS Genetics*data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

**Do you want your identity to be public for this peer review?** For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

----------------------------------------------------

**Data Deposition**

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly:

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-19-01153R2

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

**Press Queries**

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

24 Apr 2020

PGENETICS-D-19-01153R2

Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models

Dear Dr Bhatnagar,

We are pleased to inform you that your manuscript entitled "Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Matt Lyles

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Citing articles via

Tweets

https://www.researchpad.co/tools/openurl?pubtype=article&doi=10.1371/journal.pgen.1008766&title=Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models&author=Sahir R. Bhatnagar,Yi Yang,Tianyuan Lu,Erwin Schurr,JC Loredo-Osti,Marie Forest,Karim Oualkacha,Celia M. T. Greenwood,Heather J Cordell,Heather J Cordell,Scott M. Williams,Heather J Cordell,Scott M. Williams,Heather J Cordell,Scott M. Williams,Heather J Cordell,Scott M. Williams,&keyword=&subject=Research Article,Research and Analysis Methods,Simulation and Modeling,Physical Sciences,Mathematics,Applied Mathematics,Algorithms,Research and Analysis Methods,Simulation and Modeling,Algorithms,Biology and Life Sciences,Computational Biology,Genome Analysis,Genome-Wide Association Studies,Biology and Life Sciences,Genetics,Genomics,Genome Analysis,Genome-Wide Association Studies,Biology and Life Sciences,Genetics,Human Genetics,Genome-Wide Association Studies,Biology and Life Sciences,Genetics,Molecular Genetics,Biology and Life Sciences,Molecular Biology,Molecular Genetics,Physical Sciences,Mathematics,Probability Theory,Random Variables,Covariance,Research and Analysis Methods,Mathematical and Statistical Techniques,Mathematical Models,Biology and Life Sciences,Genetics,Heredity,Genetic Mapping,Variant Genotypes,Biology and Life Sciences,Genetics,Genetic Loci,

Newgen KnowledgeWorks | Privacy & Cookie Policy | Powered by: Nova