Copy number variants comprise a large proportion of variation in human genomes. Large rare CNVs, especially those disrupting genes or changing the dosages of genes, can carry relatively strong risks for neurodevelopmental and neuropsychiatric disorders. Kernel-based association methods have been developed for the analysis of rare CNVs and shown to be a valuable tool. Kernel methods model the collective effect of rare CNVs using flexible kernel functions that capture the characteristics of CNVs and measure CNV similarity of individual pairs. Typically kernels are created by summarizing similarity within an artificially defined “CNV locus” and then collapsing across all loci. In this work, we propose a new kernel-based test, CONCUR, that is based on the CNV location information contained in standard processing of the variants and which obviates the need for arbitrarily defined CNV loci. CONCUR quantifies similarity between individual pairs as the common area under their copy number profile curves and is designed to detect CNV dosage, length and dosage-length interaction effects. In simulation studies and real data analysis, we demonstrate the ability of the CONCUR test to detect CNV effects under diverse CNV architectures with power and robustness over existing methods.
Copy number variants (CNVs) are unbalanced structural variants that are typically 1 kilobase pair (kb) in size or larger and are comprised of more or fewer copies of a region of DNA with respect to the reference genome. CNVs are typically characterized by two descriptive features. The first feature is CNV dosage, or the total number of copies present, with > 2 copies corresponding to duplications and < 2 copies corresponding to deletions. The second is the CNV length, typically measured in base pairs (bp) or kilobase pairs.
CNVs are important risk factors for many human diseases and traits, including Crohn’s disease, HIV susceptibility, and body mass index [1–3]. Large and rare CNVs are particularly implicated in neuropsychiatric disorders including autism spectrum disorder, schizophrenia, bipolar disorder, and attention deficit disorder . For example, multiple studies have confirmed a greater burden of rare CNVs in schizophrenia cases compared with normal controls, both genome-wide and in specific neurobiological pathways important to schizophrenia (e.g., calcium channel signaling and binding partners of the fragile X mental retardation protein).
Rare CNVs (e.g., < 1% frequency) in the genome are intractable to test individually for disease association and instead are examined with collapsing methods. Collapsing methods summarize variant characteristics across multiple variants in a targeted region, such as a gene set, a chromosome or the whole genome, and perform a test of the collective CNV effects on the phenotype. By accumulating information across multiple rare variants, collapsing methods can have enhanced power to detect the effects of rare CNVs that are difficult to detect individually but collectively have a significant impact. Collapsing tests for rare CNVs are primarily built on the foundation of rare single nucleotide polymorphism (SNP) association tests but with additional complexity to accommodate the length and dosage features of CNVs. As with SNPs, the effects of CNVs can vary between loci, and collapsing methods demonstrate improved power as they account for this heterogeneity. However, CNV collapsing tests also need to account for within-locus heterogeneity due to differential dosage effects or length effects within a CNV region.
Similar to SNP collapsing tests, there are two families of tests for rare CNV analysis: burden-based methods and kernel-based methods. Burden-based tests, e.g., Raychaudhuri et al. , summarize the CNV features of an individual via total CNV counts or average length and model the CNV effects as fixed effects, assuming etiological homogeneity of features across multiple CNVs in a targeted region. Kernel-based tests, such as CCRET  and CKAT , aggregate CNV information via genetic similarity based on certain CNV features and model CNV effects as random effects to account for the between-locus etiological heterogeneity. By design, burden tests are optimal when the association signal is driven by homogeneous effects across CNVs, and kernel-based tests are optimal in the presence of etiological heterogeneity. Burden testing often involves multiple analyses on subsets of CNVs according to their dosage (e.g., deletions only or duplications only) or size (e.g. > 100kb, > 500kb) to increase homogeneity; whereas, kernel-based tests do not have such requirements.
In this work, we focus on kernel-based methods, as etiological heterogeneity is becoming a more practically encountered scenario due to high-resolution CNV detection technologies permitting the detection of CNVs of smaller length. In kernel-based association tests, the association between CNVs and the trait is evaluated by examining the correlation between trait similarity and CNV similarity quantified in a kernel. For kernel construction, we can refer to kernel-based tests for SNPs. Since SNPs are evaluated at the same single base-pair position (referred to as a locus) across individuals, it is natural to assess similarity locus-by-locus and aggregate the locus-level similarity over all loci in the target region to obtain an overall measure of SNP similarity. A locus unit for CNVs, however, is not so obvious since CNVs span multiple base pairs and may overlap partially between individuals.
To address this issue, standard CNV kernel-based tests construct CNV regions (CNVRs). For example, the CNV Collapsing Random Effects Test (CCRET) , developed previously by our group, defines CNVRs by a user-specified amount of CNV overlapping among different individuals (e.g., Fig 1 of Tzeng et al. ). CNV similarity between an individual pair is quantified first within each CNVR, and then CNVR-level similarity is summed over all CNVRs in the target region to characterize overall CNV similarity. However, a drawback of this approach is that CNVRs defined in this fashion are contingent on the unique CNV overlapping patterns among individuals in a study, and the defined CNVRs can vary from one study to another. The arbitrary choice of overlapping threshold also impacts the formation of locus units and consequently how the “between-locus” and “within-locus” heterogeneous effects of CNVs are accounted for.
To avoid the issues introduced by arbitrarily defined CNVRs as in CCRET, the CNV Kernel Association Test (CKAT)  adopts a different strategy to quantify CNV similarity between two individuals. Specifically, CKAT allows users to define the CNVR as a biologically relevant region, e.g., a chromosome. CKAT also introduces a new kernel function to measure CNV similarity based on both dosage and length features between two CNV events. This CNV-level similarity is then aggregated to derive a measure of CNVR-level similarity using a shift-by-one scanning algorithm that “aligns” CNVs in two profiles based on their ordinal position. A multiple-testing correction is applied if multiple CNVRs are involved in the targeted region. Although the new strategy bypasses the need for an arbitrarily defined locus unit, the scanning alignment may yield unreliable results if CNVRs are too large. Additionally, CKAT aligns pairs of CNVs based on their ordinal position, which may or may not optimally capture similarity dependent on the CNV signal sources. There may also be computational considerations with a scanning algorithm when n is large.
To address these challenges in quantifying CNV similarity using kernel-based methods, in this work we propose a new approach called the Copy Number profile Cur ve-based (CONCUR) association test. Based on the concept of copy number (CN) profile curves (Fig 1), the CONCUR association test naturally incorporates both CNV dosage and length features and can capture their main effects as well as dosage-length interactions. Additionally, building the kernel based on CN profile curves permits the quantification of CNV similarity without the need of pre-specified locus units. Moreover, CNV length may be incorporated flexibly in units which are supported in good resolution by the sequencing technology or which are computationally stable. Like CCRET and CKAT, the test is built in the framework of kernel machine regression and is powerful under heterogeneous signals and can adjust for confounders. We use simulation studies to demonstrate the improved power CONCUR over existing kernel-based methods in a variety of settings and illustrate the practical utility of CONCUR by conducting pathway analysis on data from the Swedish Schizophrenia Study and the Taiwan Biobank.
CONCUR assesses the collective effects of rare CNVs on a phenotype in a kernel machine regression framework where the kernel construction does not require a pre-defined CNV locus or region. As such, CONCUR is built on two major components: (a) the CN profile curve, with which we describe an individual’s CNVs across the genome or a region of interest; and (b) the common area under the curve (cAUC) kernel, with which we measure CNV similarity between individuals and characterize the CNV effects on the phenotype. In a CN profile curve (e.g., Fig 1), CNV dosage is shown on the y-axis as jumps or troughs diverging from a baseline of 2 copies. The start and end points of the jumps and troughs correspond to the start and end locations of the CNV and are shown on the x-axis. At genomic locations where there are no CNV events, the y-axis (dosage) takes value 2 (i.e., the baseline value). CN profile curves are intended to be a visualization of CNV activity and concurrence across samples and contribute to the CONCUR method through the concept of cAUC.
By superimposing two CN profile curves, we identify regions of overlapping CNVs of the same type (i.e., deletion or duplication) and propose to use the common area under the curve (cAUC) to quantify CNV similarity between two individuals. To implement the idea, first the raw dosage values in the CN profile curve are centered and scaled to obtain the duplication profile curve and deletion profile curve. Let d denote the dosage of a CNV. The scaling and centering can be achieved by the dosage transform functions: aDup(d) = |2 − d|c for duplications and 0 otherwise, and aDel(d) = |2 − d|c for deletions and 0 otherwise, where c is some pre-specified constant. Second, we superimpose the duplication profile curves of two individuals and note the overlapping regions where both curves are non-zero. Third, for each overlapping region, we multiply the minimum of the two respective transformed dosage values by the length of the overlap, and save this measure of “area of commonality”. Finally, we calculate the cAUC between two individuals as the sum of all such areas of commonality in their duplication profile curves plus the sum of all areas in their deletion profile curves. Fig 1 and S1 Fig. illustrate the cAUCs between various pairs of individuals when setting c = 1 in the dosage transform functions, aDup(d) and aDel(d). Generally speaking, the cAUC between individuals with overlapping CNVs of the same dosage d is the overlapping length times |d − 2|. For example, for individuals with overlapping CNVs of dosage 0 (Fig 1(b)), the cAUC is the overlapping length times |0 − 2|. The cAUC between individuals with overlapping CNVs of the same type but different dosages, d1 and d2, is the length of the overlap times |d1 − d2 | (e.g., dosages of 3 versus 4 in Fig 1(c)). If there are multiple overlaps in the individuals’ CN profile curves, the cAUC between two individuals is the sum of all areas of commonality (e.g., sum of shaded regions in Fig 1(d)). The cAUC kernel measures similarity in both CNV length and dosage and hence characterizes the joint dosage and length effects. Using the semi-parametric kernel machine regression framework, CONCUR regresses the trait values on CNV effects captured by the cAUC kernel, and evaluates the association between traits and CNV profiles via a score-based variance component test.
We conducted three sets of simulations: whole genome analysis based on TwinGene pseudo-CNV (TGP) data (referred to as TGP-WG simulations); chromosome 1 analysis based on TGP data (referred to as TGP-Chr1 simulations); and chromosome 1 analysis based on Taiwan B iobank (TWB) CNV data (referred to as TWB-Chr1 simulations). The dosage values are integers in the TGP dataset and are continuous in TWB dataset. With these simulations, we evaluated the performance of CONCUR, CCRET, and CKAT under various signal patterns and different sources of effect heterogeneity. For reference, a table of all simulation settings is provided in S1 Table.
To implement CCRET, we applied the functions from the CCRET package to convert the PLINK data to CCRET design matrices and computed the dosage kernel matrix. We used 1-bp overlapping of CNVs among different individuals to form CNVRs as in the CCRET paper ; that is, as long as CNVs of different individuals overlapped by ≥1bp, it was considered an “overlap”. For CKAT, we designated a chromosome as a CNVR and performed an association test for each CNVR using the CKAT package. CNV lengths within each chromosome were scaled to be in [0, 1] by dividing by the range of CNV activity in each chromosome, i.e., the maximal ending position minus the minimal starting position of observed CNVs on each chromosome. The Gaussian kernel scaling parameter was set to be 1. In the TGP-WG simulations, as there were 22 CKAT p-values corresponding to the 22 chromosomes, we took the minimum p-value and used Bonferroni’s procedure to compute the adjusted p-value for multiple testing. Finally, we built the CCRET and CKAT kernels using CNVs’ categorical dosage (duplication, deletion, and normal) as required in the software packages. To assure comparability, CONCUR was built using categorical dosages (referred to as CONCUR_cat). We also implemented CONCUR using the original integer dosage values (referred to as CONCUR_int) in the TGP-Chr1 simulations, and using the original continuous dosages (referred to as CONCUR_cont) in the TWB-Chr1 simulation.
The TwinGene pseudo (TGP) CNV dataset of 2000 individuals is publicly available at https://www4.stat.ncsu.edu/~jytzeng/Software/CCRET/software_ccret.php. Autosome-wide pseudo-CNV data were simulated by mimicking the CNV profiles of unrelated individuals in the TwinGene study , and the details are described in Tzeng et al. . Briefly, the TwinGene study used a cross-sectional sampling design and included over 6,000 unrelated subjects born between 1911 and 1958 from the Swedish Twin Registry [9, 10]. CNV calls were generated using the Illumina OmniExpress BeadChip for 72,881 SNP markers and using PennCNV (version June 2011)  as the CNV calling algorithm with recommended model parameters. From the full callset, high quality rare CNVs (frequency < 1% and size > 100kb) were extracted to form the simulation pool for the pseudo-CNV data. CNV dosages in this dataset are integers (0, 1, 2, 3, and 4). In the TGP whole genome data (chromosome 1 to chromosome 22), each of the 2000 individuals has at least one CNV, and in the chromosome 1 analysis, 291 individuals have CNV activity.
The Taiwan Biobank (TWB) CNV dataset is from the Taiwan Biobank project https://www.twbiobank.org.tw/new_web/. This data was studied in the real data analysis and further details regarding it are shared in the section CNV analysis on triglycerides in the Taiwan Biobank. CNV dosage values in this dataset are continuous with dosages ≥ 2.3 indicating duplications and ≤ 1.7 indicating deletions. Out of the 11,664 individuals who were included in the real data analysis, we took a random sample of 2000 individuals and kept their data for all CNVs in chromosome 1. In this sample, 1432 individuals out of the 2000 had CNV activity in chromosome 1.
For the purpose of simulating phenotypes, we constructed “CNV segments” based on the CNVs in the focal dataset. The endpoints of the segments correspond to locations where a CNV in any one of the samples begins or ends, resulting in segments that contain either one or more intersecting CNVs. Within a segment, CNV dosage of an individual is a constant, and CNVs across individuals may have different dosages though they share the same starting and ending positions defined by the boundaries of the segment. Note that different segments will naturally have different lengths. In the simulation studies, we built design matrices ZDup, ZDel, and ZLen which codified CNV features by segment in the CNV dataset. The dosage matrices, ZDup and ZDel, took value 0 for those individuals without CNVs in the segment and were coded as the number of additional or missing copies comprising the CNV otherwise. ZLen was the length of the CNV segment in kb for individuals with CNV events and was 0 for individuals without CNVs in the segment.
A case-control phenotype was generated from the logistic modelwhere Zij• is the (i,j) entry of matrix Z•, i = 1, ⋯, N indexes individuals, and j = 1, ⋯, R indexes CNV segments. A binary covariate Xi was simulated from Bernoulli(0.5) for each individual. and are the log odds ratios of segment j for the presence of a CNV versus the absence. Likewise, controls the effect of CNV length in segment j, and and allow the effects of CNV length to differ by dosage. βj• (or < 0) corresponds to a deleterious (or protective) CNV effect, and βj• was set to 0 in non-causal segments. We set βX = log(1.1) and γ0 = -2, which corresponds to a baseline disease rate of roughly 0.12. We also fixed to reflect the observation that length tends to act like an effect modifier of dosage effects.
In the TGP-WG simulations, we generated phenotypes from CNV dosage × length effects and from dosage-only effects. We chose these signals to roughly replicate the simulation settings applied to assess CKAT in  (dosage × length signal) and CCRET in  (dosage signal). In the TGP-Chr1 and TWB-Chr1 simulations, signals were generated from CNV dosage × length effects. Below we describe the settings for the Chr1-based simulations; the TGP-WG simulations basic settings are similar and are detailed in S2 Appendix.
In the Chr1-based simulations, we considered three types of causal effects: causal effects from both duplications and deletions, causal effects from duplications only, and causal effects from deletions only. Under each effect type, we designated varying percentages of the causal segments to be deleterious (D) or protective (P). When both duplications and deletions were causal, the settings included (DDup, PDup, DDel, PDel) = (90, 10, 90, 10), i.e., both causal duplications and causal deletions had 90% deleterious and 10% protective effects, as well as (90, 10, 10, 90) and (10, 90, 90, 10). In the scenarios where duplications or deletions alone were causal, settings included (D•, P•) = (90,10), (50,50), and (10,90).
In the TGP-Chr1 simulations, we randomly selected 40 segments across chromosome 1 to be causal, comprised of 20 segments containing ≥1 duplication and 20 segments containing ≥1 deletion. As the segments were formed purely based on the relative CNV patterns among individuals and could be very short in length, we also required causal segments to be at least as long as the median length of all segments of that type (35kb for duplications, 46kb for deletions) to ensure that they had realistic lengths. We allowed for the possibility of duplication and deletion effects arising from the same location, and used categorical dosages to simulate the length-dosage effects. That is, when simulating phenotypes using Eq 1, for individual i and segment j, we set if a duplication was present in the segment and 0 otherwise, and set if a deletion was present in the segment and 0 otherwise. We implemented CONCUR in all TGP-Chr1 simulations using the CONCUR_cat kernel as well as the CONCUR_int kernel. We refer to the three scenarios described here (i.e., effects from duplications only, from deletions only, and from combined effects) as TGP-Chr1(a) to distinguish it from the sensitivity analyses introduced below.
In the scenario with causal effects from duplications and deletions combined, we considered two additional scenarios as sensitivity analyses: TGP-Chr1(b) examines the methods’ performance under inaccuracy in the called end-points of CNVs; and TGP-Chr1(c) imposes a more rare baseline disease rate. In TGP-Chr1(b), we added random uniformly distributed errors to the endpoints (BP1 and BP2) of all CNVs. The CNV endpoints in the error-added data differed from the CNV endpoints in the true data used to generate phenotypes by up to ±2.5% of the total length of the CNV. In TGP-Chr1(c), we set γ0 = −3 to lower the baseline disease rate to roughly 0.05.
In the TWB-Chr1 simulations, we randomly selected 600 CNV segments across chromosome 1 to be causal, comprised of 300 segments containing ≥1 duplication and 300 segments containing ≥1 deletion. We imposed similar criteria on the length of causal segments as in the TGP-Chr1 simulations, and we allowed for duplication and deletion effects from the same location. Unlike the TGP-Chr1 simulations, here we used continuous dosages to simulate the length-dosage effects in all three scenarios (referred to as TWB-Chr1(a) simulations). In the scenario with combined duplication and deletion causal effects, we also generated signals from categorical dosages (referred to as TWB-Chr1(b) simulations). When simulating phenotypes based on continuous dosage signals using Eq 1, we constructed the dosage matrices such that for a CNV of dosage d in segment j for individual i, if a duplication was present and 0 otherwise, and if a deletion was present and 0 otherwise. We applied the CONCUR_cont kernel and CONCUR_cat kernel in both settings (a) and (b) of the TWB-Chr1 simulations to evaluate their robustness to signals arising from dosage values of the same type (categorical or continuous) or of an incongruent type.
We implemented case-control sampling to obtain 2000 cases and 2000 controls for each simulation replication. Type I error rates were evaluated based on 5000 replications, and power was estimated based on 300 replications at each effect size. For all methods, we adjusted for a simulated binary covariate as a fixed effect in the kernel machine regression. We employed the small-sample variance components test of Chen et al.  and obtained p-values using Davies’ method  as implemented in the CKAT R functions.
The type I error rates of CONCUR, CCRET, and CKAT were examined at nominal levels of 0.01, 0.05, and 0.1 in the TGP-WG simulations (Table 1). All methods had type I error rates around the nominal level.
The power of the methods under causal dosage × length effects in TGP-WG is shown in S2 Fig. We observe that CONCUR has the best or comparable power with the second best method (CCRET) across different patterns of deleterious-protective effects and in the duplication, deletion, and combined effects scenarios. Both CONCUR and CKAT are designed to detect dosage × length signals, but CKAT struggled to pick up the signals. One possible reason might be the multiple testing penalty applied to CKAT. In addition, the CNV signals in these simulations originate from aligned genomic regions; such signals may or may not be well-captured by CKAT, since its scanning algorithm may incorporate CNV similarity from off-position CNV events or same-position CNV events dependent on the data.
We note that we do not expect the methods to display similar relative performance in the duplication-only causal effects and deletion-only causal effects scenarios. This is because the causal segments for duplications versus deletions have different characteristics, due to differences in the patterns of duplications overlapping versus deletions overlapping in the data. In addition, the strength of the signal in the duplication-only and deletion-only simulations differs due to differences in the length of the segments as well as the frequency of CNVs in the protective versus deleterious segments. For example, in the duplication-only effects (D,P) = (10,90) setting, the combination of longer segments and more CNV activity in the protective segments leads to a proportionally larger protective signal than in the corresponding deletion-only setting. The result of these differences is asymmetry in the methods’ performance in the two settings.
The power under causal dosage effects is shown in S3 Fig. As expected, the dosage-based CCRET kernel performs the best, with CONCUR following CCRET or having comparable power.
The results of the TGP-Chr1(a) simulations are shown in Fig 2. We observe that CONCUR has higher than or comparable power to the second best method, and here the second best method is CKAT or CCRET depending on the scenario. After further exploration of the TGP-Chr1(a) simulations as well as TWB-Chr1 dosage × length simulations, it appears that the relative performance between CKAT and CCRET depends more heavily on whether the causal signals can be well-captured by the CKAT kernel than what the patterns of causal effects are (i.e., causal effects from duplications, deletions, or both, or the proportion of deleterious vs. protective effects). The power from CONCUR_cat and CONCUR_int appear to be identical in all settings, which is not surprising given that there are very few CNVs of larger magnitude (0 or 4+) in the data.
Fig 3 shows the performance of the methods applied to data with inaccurate CNV endpoint information (top panel) and under a lower baseline disease rate (5%; lower panel). For both analyses, the top row of Fig 2 is a useful reference showing the methods’ performance under error-free CNV data and a higher baseline disease rate (12%). The top panel of Fig 3 shows that CONCUR still has higher power than the baseline methods, although the gap between CONCUR and CCRET is smaller compared to the error-free scenario. The lower panel of Fig 3 demonstrates that the performance of the methods under a lower baseline disease rate is very comparable to that under a higher disease rate.
The results of TWB-Chr1(a) are shown in Fig 4. We observe little difference in the power of the two CONCUR approaches. We also observe that CONCUR had stronger power than CCRET and CKAT, with the exception of some degenerate behavior in CONCUR under (DDup, PDup, DDel, PDel) = (10, 90, 0, 0). In this setting, we suspect that CONCUR’s behavior is due to the combination of a couple factors. There are few duplication events to begin with in the TWB simulated data, and that combined with the 90% protective effects leads to the simulated controls being comprised primarily of “random” controls with relatively few individuals carrying causal protective CNVs. This results in an extremely weak signal-to-noise ratio. All methods were affected in this setting, such that we needed to significantly boost the range of effect sizes to observe power in any method (e.g., 1.01-1.10 here vs. 1.0005-1.0035 under (DDup, PDup, DDel, PDel) = (0, 0, 10, 90)). We believe that CCRET and CKAT are more robust in this setting due to borrowing information across loci through CNVRs and through across-position alignment, respectively. Finally, as in the TGP-Chr1 simulations, under some scenarios CKAT had higher power than CCRET, e.g., in the settings of (DDup, PDup, DDel, PDel) = (50, 50, 0, 0) and (0, 0, 50, 50). The relative performance of the two methods is again likely dependent on whether the causal signals were well captured by the CKAT kernel.
Fig 5 evaluates the ability of the CONCUR_cont approach to detect a signal from categorical dosages, on which CONCUR_cat, CCRET, and CKAT are built. We observe nearly identical power in CONCUR_cont and CONCUR_cat.
In real data applications, we first conducted CNV association tests on a previously analyzed CNV dataset from the Swedish Schizophrenia Study as a proof of concept. We next conducted a CNV-triglyceride (TG) association analysis on data from the Taiwan Biobank. We constructed kernels for all methods (CONCUR, CKAT and CCRET) based on categorical dosages, and dropped the “_cat” suffix in our discussion of the CONCUR method in this section for simplicity. We used the step-wise Holm method to adjust for multiple testing, which yields more powerful results than the Bonferroni procedure and remains valid when applied to dependent p-values .
We conducted pathway-based CNV analysis on data from the Swedish Schizophrenia Study . The Swedish Schizophrenia Study used a case-control sampling design. Genotyping was done in six batches using Affymetrix 5.0 (3.9% of the subjects), Affymetrix 6.0 (38.6%), and Illumina OmniExpress (57.4%). PennCNV  was used to generate CNV calls. After quality control, we obtained a high quality rare CNV (frequency < 1% and size > 100kb) dataset in 8,457 subjects (3,637 cases and 4,820 controls) . Previous analyses of this data  indicated significant associations of large rare CNVs with schizophrenia risk for both genome-wide dosage effects and gene intersecting effects of selected gene sets.
To evaluate the practical utility of the three kernel-based tests, we performed analysis on the gene sets previously examined in , excluding the PSD pathway as it overlaps the other three PSD-related pathways considered. In the 8 gene sets, large (> 500kb) rare CNVs were found to be associated with schizophrenia by Szatkiewicz et al. , and these associations were corroborated by Tzeng et al.  in a gene-interruption analysis with CNVs > 100kb. In each pathway analysis, we performed association tests for joint dosage and length effects of rare CNVs > 100kb, using a fixed effect term to adjust for batch effects. CONCUR and CKAT kernels were constructed from the raw PLINK data and the CCRET dosage kernel was created using the functions available on the CCRET website. For CKAT, we used pathways as the CNVR unit instead of chromosomes because there were multiple chromosomes with only one gene.
After applying Holm’s procedure to adjust for multiple testing on the 8 pathways, CONCUR and CKAT found significant associations in all 8 pathways, while CCRET identified 4 pathways as significant (Table 2). We observed stronger power in CKAT in the analysis here compared to the power observed in the simulation studies. CKAT and CONCUR are more sensitive to dosage-length effects while CCRET is more sensitive to dosage effects; thus, these results suggest significant CNV effects from dosage × length or length affecting schizophrenia risk in several pathways.
|Gene-set Name||# Genes||# Genes Interrupted in Cases||# Genes Interrupted in Controls||CONCUR||CCRET||CKAT|
|Cytoplasm (Kirov et al. )||266||28||32||0.00124⋆||0.01408||0.00030⋆|
|FMRP targets (Darnell et al. )||810||149||152||2.29E-05⋆||0.00044⋆||0.00026⋆|
|PSD/mGluR5 (Kirov et al. )||38||4||7||0.00040⋆||0.10540||0.00129⋆|
|PSD/NMDAR (Kirov et al. )||61||12||12||0.00102⋆||0.00922⋆||0.00046⋆|
|PSD/PSD-95 (Kirov et al. )||65||13||10||0.00052⋆||0.00144⋆||0.00903⋆|
|Synaptic genes (Ruano et al. )||718||154||164||5.45E-06⋆||0.02005||0.00766⋆|
|Synaptic Proteome (G2Cdb)||1023||121||106||0.00067⋆||0.00010⋆||0.00736⋆|
We applied the proposed CONCUR test to the Taiwan Biobank (TWB) data https://www.twbiobank.org.tw/new_web/ and conducted CNV association analysis with triglyceride (TG) levels on lipid-related pathways. The nationwide biobank project was initiated in 2012 and has recruited more than 15,995 individuals. Peripheral blood specimens were extracted from healthy donors and genotyped using the Affymetrix Genomewide Axiom TWB array, which was designed specifically for a Taiwanese population. The TWB array contains 653,291 SNPs and was used to generate calls for genome-wide CNVs in the following process. First, Affymetrix Power Tools version 1.18.0 was used to produce a summary file of the intensity values of all probes, and the file was input into the Partek Genomic Suite version 6.6 to call CNVs based on the following criteria: at least 35 consecutive SNP markers, p-values of different CN values between two consecutive segments < 0.001, and signal-to-noise ratio (SNR) ≥ 0.3. A duplication was called if its copy number was ≥ 2.3, and a deletion was called if its copy number was ≤ 1.7. Several previous studies   have demonstrated appropriate CNV calls with these parameters. After quality control, we obtained CNV data in 14,595 unrelated individuals. Our CNV association analyses focused on a subset of 11,664 individuals who had non-missing TG levels.
We referenced the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database  to identify lipid-related pathways. Among the 17 pathways related to “lipid metabolism”, 15 pathways included genes intersected by the TWB CNV data and were selected as candidate pathways in our analysis. For each pathway analysis, we adjusted for sex, age, BMI, and the top 10 principal components representing the population structure as covariates with fixed effects. As before, CKAT was implemented with each pathway comprising a single CNVR.
Similar to the Swedish Schizophrenia Study analysis, after applying Holm’s procedure to adjust for multiple testing on the 15 pathways, all 15 pathways were found significant by CONCUR and CKAT, and 1 pathway was found significant by CCRET (Table 3). CKAT again demonstrated much better power than in the simulation studies. The relative performance among the three methods may be due to more dominant length or dosage × length signals.
|Gene-set Names||# Genes||# Genes Interrupted||CONCUR||CCRET||CKAT||Neg. Control|
(Fatty acid biosynthesis)
(Fatty acid biosynthesis)
(Fatty acid degradation)
(Synthesis and degradation of ketone bodies)
(Primary acid bile biosynthesis)
(Steroid hormone biosynthesis)
(Ether lipid metabolism)
(Arachnidonic acid metabolism)
(Linoleic acid metabolism)
(alpha-Linolenic acid metabolism)
(Biosynthesis of unsaturated fatty acids)
Next, we explored associations with TG levels in the TWB data on a by-chromosome basis (Table 4). Using the Holm method to adjust for multiple testing on 22 chromosomes, CONCUR found all 22 chromosomes significantly associated with TG, CKAT found 12 significant chromosomes, and CCRET found none. It is not unexpected to see all 22 chromosomes identified to be significantly associated with TG, since the genes in the 15 significant lipid metabolism pathways examined are located across all 22 chromosomes. For example, for the 12 chromosomes identified by both CONCUR and CKAT, the number of pathway genes intersected by CNVs ranges from 8 to 41, and for those chromosomes uniquely identified by CONCUR (excluding chromosome 13), the number of intersected genes ranges from 4 to 36. In chromosome 13, while there is only one pathway gene, all CNVs are located in chr13q, which is a well-studied region related to cholesterol metabolism   . Since cholesterol is strongly related to TG levels, CNVs in chr13q and chr13q22-q32 may impact TG levels by affecting the metabolism efficiency of TG and cholesterol. To further interpret the significant CONCUR test result, we examined the subregion chr13q22-q32 that is highlighted in  and contains or overlaps with the markers in  and . By applying CONCUR, CKAT and CCRET to this subregion, we obtained the p-values as 0.0000143, 0.0004388 and 0.0242815, respectively. These results suggest a length or dosage × length signal arising from chr13q22-q32, which CONCUR and CKAT can detect with good power. This length-driven CNV signal is not well captured by CCRET in both of the subregion and chromosome-wide analyses, since CCRET does not account for CNV length features. The strength of the signal from chr13q22-q32 may be diluted for CKAT when the entire chromosome is treated as a CNVR.
|Chr||# Genes Interrupted||CONCUR||CCRET||CKAT||Neg. Control|
As further assurance that these associations are less likely due to false positives, we conducted a CONCUR negative-control analysis by repeating the by-chromosome analysis using permuted TG levels. The resulting p-values are shown in Table 4; the p-value range of those chromosomes identified by both CONCUR and CKAT (i.e., 0.483 to 0.888) is similar to that of those uniquely identified by CONCUR (i.e., 0.485 to 0.963). In addition, we also examined the quantile-quantile plots (Q-Q plots) of CONCUR p-values from negative-control analyses, by generating (a) 20 TG-permuted datasets for each of the 15 pathways, and (b) 1000 TG-permuted datasets for chromosome 13. The Q-Q plots, shown in S4 Fig, suggest that the p-values follow the expected null distribution, Uniform(0,1).
Finally, we illustrate in S3 Appendix possible CONCUR post-hoc analyses probing the potential sources of a CNV association identified at the aggregate-level. As an example, we looked more closely at pathway hsa01040 (biosynthesis of unsaturated fatty acids), for which both CONCUR and CKAT were significant but not CCRET. In short, we calculated summary statistics describing CNV length and dosage in hsa01040 for individuals with different levels of TG (low, medium, and high), and examined CNV features in all CNVs together and in duplications and deletions separately. We also used heatmaps to visualize CNVs in the 23 genes in hsa01040 (Fig A in S3 Appendix), displaying the duplications and deletions intersecting genes in all CNV profiles categorized by their TG level. These exploratory analyses suggested that for duplications only, there may exist “promising” differences in CNV length and relatively weaker differences in dosage across TG levels. Because these “promising” associations from a stratified analysis reflected only marginal associations of a CNV feature and did not account for the effect heterogeneity that motivates the application of kernel-based methods, we also applied CONCUR to duplications and deletions separately, and found a very significant association with TG in duplications (p-value < 1 × 10−8) and a weaker signal in deletions (p-value = 0.0313).
We introduce CONCUR to leverage the strength of kernel-based methods to assess the collective effects of rare CNVs on disease risk and incorporate several desired features. First, CONCUR permits the quantification of CNV similarity in an CNVR-free manner, avoiding the need of arbitrarily defining CNVRs as in current practice. Second, CONCUR incorporates both length and dosage information via the cAUC kernel, and is capable of detecting dosage, length and length-dosage interaction effects. Third, as the technology for detecting smaller CNVs improves, we expect to observe more length variation in CNVs and an increasing need to accommodate length effects in CNV association studies. However, there are shortcomings in the standard kernel choices for handling CNV length. For example, a linear (or polynomial) kernel, which scores length similarity in a multiplicative fashion, cannot always reflect the true level of length similarity between an individual pair: e.g., a pair of CNVs of length 20 would be equally similar to two CNVs with lengths 1kb and 400kb (as 20 × 20 = 1 × 400). The alternative Gaussian kernel as in CKAT would still require a pre-specified scaling factor. CONCUR addresses these issues by using the common AUC of the CN profile curves of an individual pair, quantifying CNV similarity in dosage and length simultaneously. Finally, unlike current kernel methods which require discretized copy numbers, CONCUR is directly applicable to continuous and discrete copy numbers. We implement the CONCUR test in the R package ‘CONCUR’.
CONCUR shares some philosophy with several CNV analysis strategies in the literature. For example, Aguirre et al.  characterized the copy number changes in the pancreatic adenocarcinoma genome by detecting the minimum common regions (MCR) of recurrent copy number changes across tumor samples and using MCRs to prioritize genes that might be involved in pancreatic carcinogenesis. Harada et al.  also examined the minimal overlapping/common regions of frequent CNV activities among pancreatic cancer samples and among normal samples to identify candidate regions that might contain critical oncogenes or tumor suppressor genes. Furthermore, Mei et al.  proposed algorithms for identifying common CNV regions across individuals of homogeneous phenotypes for downstream association analysis. Built on similar concepts to these “common regions”, CONCUR quantifies CNV similarity between sample pairs based on the “size” of the common regions as reflected in congruent location and dosage, and provides an association test to evaluate dosage and length effects.
In the analyses performed in this study, we calculated the cAUC using CNV dosage values transformed by the functions aDup(d) = |d − 2| for duplications and 0 otherwise, and aDel(d) = |d − 2| for deletions and 0 otherwise. That is, we used copy number 2 as a reference value, and defined CNV similarity as the overlapping CNV length scaled linearly according to the magnitude of dosage deviation from the reference value. As indicated in the method section, CONCUR can be flexibly extended to accommodate other schemes of quantifying common area by adopting different a•(⋅) functions in the calculation of the cAUC, e.g., a•(d) = |d − 2|c with c ≠ 1. Finally, overlapping area may be further weighted by inverse frequencies or according to CNV type (e.g., deletion) when needed, to augment the contribution from overlapping regions from rarer CNVs or from CNVs with more severe impact, respectively.
We note that although both CONCUR and CKAT are designed to capture CNV dosage and length information, the two kernels are constructed based on different philosophies, and each method has sensitivity to certain effect mechanisms. CONCUR first quantifies the similarity in an individual pair within the same genomic locations, and then collapses the similarity information across different locations in a user-specified region (e.g., whole genome, chromosome, or gene set). In contrast, CKAT treats the user-specified region as a CNVR, and collapses CNV information across different locations by incorporating off-position similarities and/or same-position similarities to obtain a CNVR-level measure of similarity.
The different philosophies of quantifying CNV similarity may also explain the differences in CKAT’s relative performance compared to CCRET here versus in the CKAT paper . Here we observe that CKAT has higher or comparable power compared to CCRET in some scenarios and lower power in other scenarios; however, in the CKAT paper, CKAT outperforms CCRET in the majority of the considered scenarios. This discrepancy is likely due to differences in the assumed effect mechanisms in our approach versus those in the CKAT paper. In the simulation study here, certain genomic regions are selected to be causal. Whereas, in the CKAT paper simulations, causal effects arise from CNVs, not genomic regions. Each individual has either 0 or 1 CNV with randomly generated endpoints; CNVs of the same type (i.e., duplication or deletion) are either all causal or all non-causal, depending on the scenario. Under this design, causal CNVs in different individuals may fall in different genomic locations, yet CNVs of the same type will have similar effects. CKAT powerfully detects these signals because its similarity quantification approach (i.e., based on pairs of CNVs rather than aligned CNV activity in a fixed genomic location) better captures the CNV-driven (as opposed to locus-driven) signal. Whereas, methods that quantify similarity based on aligned positions, such as CCRET and CONCUR, can suffer from power loss under this signal, as exemplified by the performance of CCRET in .
In the Swedish Schizophrenia Study, all procedures were approved by ethical committees at the Karolinska Institutet (Dnr No. 04/-449/4 and No. 2015/2081-31/2) and University of North Carolina (No. 04-1465 and No. 18-1938). All subjects provided written informed consent (or legal guardian consent and subject assent). The Taiwan Biobank study was approved by the ethical committee at Taichung Veterans General Hospital (IRB TCVGH No. CE16270B-2). Consent was not obtained because the data were de-identified.
For individual i, i = 1, ⋯, n, denote Yi the phenotype of individual i. Codify the CNV information in matrix Zi with dimension Pi × 4 as in the standard PLINK format of CNV data, where Pi is the number of CNVs that individual i has, and each row of Zi records four features of CNV p, p = 1, ⋯, Pi: dosage (denoted as dp) chromosome (denoted as Chrp), start location (denoted as BP1p), and end location (denoted as BP2p). The dosage dp can be integer, continuous or categorical values. For example, in the Swedish Schizophrenia pathway analysis, an individual might have between 1-88 CNVs, and CNV lengths might range from 100kb up to 7841kb. Let Xi = (Xi1, ⋯, Xir)T be the r covariates. Under the kernel machine regression framework, we model the association between phenotypes and CNVs as followswhere μi = E(Yi|Xi, Zi), the conditional phenotype mean given covariates and CNVs; g(⋅) is the canonical link, which transforms the conditional phenotype mean μi so that the mean is on the same scale of the linear predictors formed by covariates and CNV data. For continuous phenotypes, g(μi) = μi; for binary phenotypes, g(μi) = log[(μi/(1 − μi)]. h(Zi) is an unknown smooth function of the variant features characterized by a kernel function k(⋅, ⋅).
The proposed cAUC kernel is built on the concept of a CN profile curve as shown in Fig 1. Consider the genomic location x from chromosome k for individual i. Given the CN profile curve, we define the duplication profile curve, , and the deletion profile curve, , which recenter and rescale the CN values in CN profile curves through the “dosage transform functions” as described below, and allow us to compute cAUC similarity from duplications and from deletions in a more flexible manner. Specifically, let q = 1, ⋯, Pik index the CNV features (dq, BP1q, BP2q) occurring on chromosome k of individual i. Then we construct duplication and deletion profile curves respectively describing duplications and deletions on chromosome k for individual i as follows:where x is a location on the genome on the same scale as BP1q and BP2q; I is the indicator function such that I(⋅) = 1 if the condition contained within is satisfied and equals 0 if otherwise; and a•(d) is a dosage transform function which determines the reference copy number value and controls how different copy number values contribute more or less to similarity in profiles. If an individual has no CNVs in chromosome k, then their duplication and deletion profile curves are identically equal to zero, i.e., for all x. Although not explicitly shown, and are functions of Zi as the information of dq, BP1q, BP2q and chromosome k for subject i is obtained from Zi.
In this study, we designated aDup(dq) = |dq − 2| if dq is from a duplication and 0 otherwise. That is, for a given chromosome k and individual i, the function equals the magnitude of the duplication (i.e., number of additional copies compared to the reference copy number 2) for x inside a duplication and equals 0 otherwise. For deletions, aDel(dq) and can be obtained in an analogous way.
We propose to quantify the similarity between individuals i and j by comparing vs. and vs. over chromosomes k = 1, ⋯, 22 using the following kernel function:where captures the minimum of the two functions evaluated at x and μ(x) is the counting measure. We refer to the kernel function as the cAUC kernel as it computes the minimal common area under the two individuals’ duplication and deletion profile curves. The cAUC kernel matrix KcAUC is constructed such that its (i, j)th element is kcAUC(Zi, Zj ). The cAUC kernel is a valid kernel as shown in S1 Appendix.
As an illustrating example, we calculate the cAUC between individuals 1 and 2 from Fig 1. Both individuals have duplication profile curves on chromosome 1 as , since they have no duplications. For individual 1, the deletion profile curve is if x ∈ [200, 400] on chromosome 1 and 0 otherwise; for individual 2, if x ∈ [100, 500], 1 if x ∈ [600, 800], and 0 otherwise. To compute their cAUC, we characterize the individuals’ curves in 2 genomic regions: (1) x ∈ [200, 400], in which ; and (2) x ∉ [200, 400], in which . This discretization allows us to compute the cAUC by multiplying the minimum value of the two curves in each region by the length of the region to obtain kcAUC(Z1, Z2) = (2 × 200) + 0 = 400.
The intuition of the cAUC kernel is to quantify similarity using the length of overlapping CNVs between two individuals, with dosage information of the two overlapping CNVs determining how the overlapping length is scaled. The minimum operator enforces that the overlapping length is scaled by the CNV of smaller magnitude in a pair with different magnitudes. The similarity between CNVs of different types (i.e., duplication vs. deletion) is 0. The similarity between CNVs of the same type depends on the copy number values via the dosage transform function, a•(d). Legal choices of a•(d) will upweight the contribution from similar CNVs of greater magnitude in duplication or deletion, which are often more rare and have higher impact. The family of dosage transform functions a•(d) = |d − 2|c provides a spectrum of weighting schemes, with c < 1 down-weighting and c > 1 upweighting the contribution of higher magnitude CNVs. Across copy number data of varying types and varying sample-level characteristics, the a•(⋅) dosage transform function allows for flexible scaling of dosage to appropriately customize the cAUC measure of similarity.
The association between phenotype and CNVs is examined by testing the hypothesis H0: h(⋅) = 0. To do so, we define the vector of subject-specific CNV effects H = (h(Z1), ⋯, h(Zn)) and treat H as random effects which follow N(0, τK), where τ ≥ 0 is a variance component and K is a n × n kernel matrix with its (i, j)th entry being k(Zi, Zj ). Following Liu et al.  , testing H0: h(⋅) = 0 is equivalent to testing τ = 0 under a generalized linear mixed model. As in  , we use a score-based test, which isfor continuous phenotypes, and is for binary phenotypes. In the score statistic T, Y = (Y1, ⋯, Yn)T and σ2 is the variance of the continuous Y. Define μ = (μ1, ⋯, μn)T with i-th element μi = E(Y∣Xi, Zi); then is the estimate of μ in Eq 2 under H0: h(⋅) = 0. Specifically, for continuous phenotypes and for binary phenotypes. The score statistic asymptotically follows a weighted chi-square distribution  . Recently, Chen et al.  derived the corresponding small-sample distribution, which is used to calculate the p-value in this work.