Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors.

Results: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines.

Availability and implementation: The source code is written in C and is available at https://github.com/dfguan/purge_dups.

Supplementary information: Supplementary data are available at Bioinformatics online.
The superior and increasing throughput of long-read sequencing technologies, such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), is revolutionizing the sequencing of genomes for new species (Phillippy, 2017). Long-read assemblers, such as Falcon (Chin et al., 2016) and Canu (Koren et al., 2017), typically generate haplotype-fused paths of a diploid genome, with Falcon-unzip (Chin et al., 2016) further able to separate the initial assembly into primary contigs and haplotigs. However, when heterozygosity is high, as in many outbred species (for example, most insects and marine animals), the allelic relationships between haplotypic regions can be hard to identify. As a result, haplotigs are mislabeled as primary contigs and overlaps are retained among the primary contigs. The majority of these retained overlaps are between homologous chromosomes, and the resulting duplication causes errors in downstream processes such as scaffolding and gene annotation.
Tools such as purge_haplotigs (Roach et al., 2018) and HaploMerger2 (Huang et al., 2017) have been designed to resolve this problem. Purge_haplotigs makes use of both read depth and sequence similarity to identify haplotigs. However, it does not identify heterozygous overlaps, and requires users to specify read-depth cutoffs manually. HaploMerger2 seeks to identify both haplotigs and overlaps, but it ignores read depth and relies only on the alignment of contigs to each other.
Here we describe a novel purging tool, purge_dups, to resolve the haplotigs and overlaps in a primary assembly, using both sequence similarity and read depth. Purge_dups is now being used routinely in the Vertebrate Genomes Project assembly pipeline.
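To illustrate why combining the two signals helps, the following is a toy Python sketch, not the purge_dups algorithm itself. It assumes that a duplicated region with roughly half the diploid read depth is haplotypic, while a duplicated region at or above full depth is more likely a repeat. The `Contig` fields, the `classify` function and every threshold are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    # All fields are hypothetical summaries invented for this illustration.
    name: str
    length: int
    aligned_depth: float  # mean read depth inside the self-aligned region
    aligned_start: int    # start of the best alignment to another contig
    aligned_end: int      # end (exclusive) of that alignment


def classify(contig, diploid_depth, contained_frac=0.8, end_margin=0.05):
    """Label a contig 'haplotig', 'overlap' or 'keep' (toy rule only)."""
    aligned = max(0, contig.aligned_end - contig.aligned_start)
    # Assumed cutoff: below 75% of diploid depth counts as haploid depth.
    haploid_like = contig.aligned_depth < 0.75 * diploid_depth
    if aligned == 0 or not haploid_like:
        # No similarity, or depth suggests a repeat: leave untouched.
        return "keep"
    if aligned / contig.length >= contained_frac:
        # Nearly the whole contig duplicates another one.
        return "haplotig"
    margin = end_margin * contig.length
    touches_end = (contig.aligned_start <= margin
                   or contig.aligned_end >= contig.length - margin)
    # A partial match at a contig end looks like a heterozygous overlap.
    return "overlap" if touches_end else "keep"


if __name__ == "__main__":
    examples = [
        Contig("contained", 1000, 20.0, 0, 950),
        Contig("end_overlap", 1000, 22.0, 800, 1000),
        Contig("repeat", 1000, 80.0, 0, 1000),
    ]
    for c in examples:
        print(c.name, classify(c, diploid_depth=40.0))
```

In this sketch a similarity-only rule would flag the `repeat` contig for removal, whereas the depth check keeps it; a containment-only rule would miss the `end_overlap` contig.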
Given a primary assembly and long-read sequencing data, we apply the following steps to identify haplotigs and overlaps. A more detailed description of the methods is available in the Supplementary Material.
We evaluated the performance of purge_dups (v1.0.0) on four Falcon-unzip primary assemblies: Arabidopsis thaliana (At) (Chin et al., 2016), Anopheles coluzzii (Ac) (Kingan et al., 2019), grape Vitis vinifera L. cv. Cabernet Sauvignon (Vv) and pinecone soldierfish Myripristis murdjan (Mm), and compared our results to those of purge_haplotigs (v1.0.4) and HaploMerger2. The expected genome sizes and heterozygosities of these genomes calculated by GenomeScope (Vurture et al., 2017) are given in Supplementary Table S1, with heterozygosity ranging from 0.6% (Ac) to 1.6% (Vv).
K-mer comparison analysis (Mapleson et al., 2017) shows that purge_dups removes 96.4% of duplicated haploid-unique k-mers in the Falcon-unzip assembly of Mm (Fig. 1). Comparable figures for HaploMerger2 and purge_haplotigs are 95.7% and 81.2% respectively (Supplementary Fig. S1) and for At are 88.4%, 87.3% and 80.7% respectively (Supplementary Fig. S2). Supplementary Figures S3 and S4 show examples of regions where purge_dups removes both contained and overlapping duplication, whereas purge_haplotigs only removes fully contained duplication.
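The percentages above are relative reductions in the number of duplicated k-mers. The Python sketch below shows only that arithmetic on toy sequences. It is a simplification and not the k-mer comparison analysis cited above: it counts every k-mer that occurs two or more times in an assembly, and it does not restrict the count to haploid-unique k-mers from the read spectrum or collapse reverse complements. All function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every k-mer across a list of sequences (forward strand only)."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def duplicated_kmers(seqs, k):
    """Number of distinct k-mers present two or more times."""
    return sum(1 for c in kmer_counts(seqs, k).values() if c >= 2)


def percent_removed(before, after, k):
    """Relative reduction (%) in duplicated k-mers after purging."""
    dup_before = duplicated_kmers(before, k)
    if dup_before == 0:
        return 0.0
    dup_after = duplicated_kmers(after, k)
    return 100.0 * (dup_before - dup_after) / dup_before


if __name__ == "__main__":
    primary = "ACGTTGCAAG"
    other = "GGGCCCTTTA"
    unpurged = [primary, primary, other]       # second copy is a toy haplotig
    partly_purged = [primary, primary[:6], other]
    fully_purged = [primary, other]
    print(percent_removed(unpurged, partly_purged, k=4))
    print(percent_removed(unpurged, fully_purged, k=4))
```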
Table 1 presents assembly statistics for the four assemblies, using Benchmarking Universal Single-Copy Orthologs (BUSCOs) (Simão et al., 2015) to assess the consequences of purging for gene set completeness and duplication. Results are given for the original assemblies, purge_haplotigs, purge_dups and HaploMerger2 (with and without repeat masking). All purging methods remove a substantial amount of sequence from the primary assembly and decrease BUSCO duplication. No single method performs uniformly best across all assemblies and all metrics. However, purge_haplotigs consistently leaves more duplicated sequence and genes. For all assemblies other than Mm, purge_dups gives the highest fraction of single-copy complete genes and the lowest fraction of missing genes. Although purge_dups has only a limited ability to explicitly handle repeats, it does not exhibit signs of significant overpurging.
Table 1. Columns: BUSCO scores (%); Assembly size (Mb); Num. contigs.
For Mm, we also had 10X Genomics linked read data, and used this for scaffolding using Scaff10x (https://github.com/wtsi-hpag/Scaff10X). Following this with a round of polishing with Arrow closed a number of gaps, reducing contig number further and increasing contig N50. For the purge_haplotigs assembly, this resulted in 221 scaffolds with N50 8.17 Mb, and the final contig N50 3.48 Mb, whereas scaffolding the purge_dups assembly generated 222 scaffolds with N50 23.68 Mb, and contig N50 increased substantially from 2.63 Mb to 11.98 Mb. The nominal contiguity was even greater for the scaffolded HaploMerger2 masked assembly with scaffold N50 34.53 Mb, and contig N50 16.39 Mb. However, when we further assessed the scaffolds with QUAST (Gurevich et al., 2013), the purge_dups scaffolds had the highest NGA50 (characteristic length of material correctly aligned to the genome) of 16.73 Mb, while HaploMerger2 scaffolds only had 7.86 Mb NGA50, with 126 scaffold misassemblies compared to 22 for purge_dups (Supplementary Table S2).
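For readers unfamiliar with the contiguity metrics quoted above, the following Python sketch shows the standard N50 calculation: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the assembly size. Passing a reference genome size instead of the assembly total gives NG50. NGA50, as reported by QUAST, applies the same idea to aligned blocks after contigs are broken at misassemblies, which this sketch does not attempt. The function name and the toy lengths are illustrative only.

```python
def n50(lengths, genome_size=None):
    """Return N50 of `lengths`, or NG50 when `genome_size` is given."""
    if not lengths:
        return 0
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    # The sequences cover less than half of the stated genome size.
    return 0


if __name__ == "__main__":
    # Toy contig lengths in Mb; the three shortest stand in for
    # duplicated contigs that a purging step might remove.
    before = [8, 6, 5, 4, 3, 2, 2]
    after = [8, 6, 5, 4]
    print(n50(before))  # 5
    print(n50(after))   # 6
```

Because N50 depends on total assembly size, removing duplicated sequence alone can raise it, which is one reason the alignment-based NGA50 is the more conservative comparison.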
The improvements that purging makes to contiguity following scaffolding indicate that divergent heterozygous overlaps can be a significant barrier to scaffolding, and that it is important to remove them as well as removing contained haplotigs. To our knowledge, scaffolders that use long-range information, such as Scaff10X with linked reads or SALSA with Hi-C data, do not handle heterozygous overlaps. We therefore recommend applying purge_dups directly after initial assembly, prior to scaffolding. Although HaploMerger2 can also link adjacent contigs using overlap information after purging, our tests suggest that it makes false joins, perhaps because it does not use read depth to distinguish haplotypic duplication from repeat duplication.
In conclusion, purge_dups can significantly improve genome assemblies by removing overlaps and haplotigs caused by sequence divergence in heterozygous regions. This removes false duplications from primary draft assemblies while retaining completeness and sequence integrity, and it can also improve scaffolding. It runs autonomously without requiring user specification of cutoff thresholds, allowing it to be included in an automated assembly pipeline.
We thank members of the Vertebrate Genomes Project assembly group for input and advice, including Arang Rhie, Zemin Ning, William Chow, Ying Yan, Adam Phillippy and Erich Jarvis. The Mm genome was sequenced at the Sanger Institute as part of the Vertebrate Genomes Project; we thank members of the Sanger Institute DNA pipelines group for generating the sequence data and Byrappa Venkatesh for providing the sample. We also thank Jonas Korlach, Mara Lawniczak, Haynes Heaton and Christine Lambert for supplying raw data for Ac.
This work was supported by the National Key Research and Development Program of China [2017YFC0907503, 2018YFC0910504 and 2017YFC1201201 to D.G. and Y.W.]; China Scholarship Council to D.G.; Wellcome Trust [WT207492 to S.A.M. and R.D., and WT206194 to J.W. and K.H.].
Conflict of Interest: R.D. is a consultant for Dovetail Inc. All other authors declared no conflict of interest.