The genome of Chenopodium pallidicaule : An emerging Andean super grain

Luz Rayda Gomez Pando

GENOMIC RESOURCES ARTICLE The genome of Chenopodium pallidicaule: An emerging Andean super grain Hayley Mangelson1, David E. Jarvis1 , Patricia Mollinedo2, Oscar M. Rollano-Penaloza2, Valeria D. Palma-Encinas2, Luz Rayda Gomez-Pando3, 1 , and Peter J. Maughan1,4 Eric N. Jellen Manuscript received 9 July 2019; revision accepted 24 September 2019. Department of Plant and Wildlife Sciences, Brigham Young University, 5144 LSB, Provo, Utah 84602, USA 1 Institute of Natural Product Research, Universidad Mayor de San Andrés, La Paz, Bolivia 2 Departamento de Fitotecnia, Facultad de Agronomía, Universidad Nacional Agraria de La Molina, La Molina, Peru 3 4 Author for correspondence: Jeff_Maughan@byu.edu Citation: Mangelson, H., D. E. Jarvis, P. Mollinedo, O. M. Rollano-Penaloza, V. D. Palma-Encinas, L. R. Gomez-Pando, E. N. Jellen, and P. J. Maughan. 2019. The genome of Chenopodium pallidicaule: An emerging Andean super grain. Applications in Plant Sciences 7(11): e11300. doi:10.1002/aps3.11300 PREMISE: Cañahua is a semi-domesticated crop grown in high-altitude regions of the Andes. It is an A-genome diploid (2n = 2x = 18) relative of the allotetraploid (AABB) Chenopodium quinoa and shares many of its nutritional benefits. Cañahua seed contains a complete protein, a low glycemic index, and offers a wide variety of nutritionally important vitamins and minerals. METHODS: The reference assembly was developed using a combination of short- and longread sequencing techniques, including multiple rounds of Hi-C–based proximity-guided assembly. RESULTS: The final assembly of the ~363-Mbp genome consists of 4633 scaffolds, with 96.6% of the assembly contained in nine scaffolds representing the nine haploid chromosomes of the species. Repetitive element analysis classified 52.3% of the assembly as repetitive, with the most common repeat identified as long terminal repeat retrotransposons. MAKER annotation of the final assembly yielded 22,832 putative gene models. DISCUSSION: When compared with quinoa, strong patterns of synteny support the hypothesis that cañahua is a close A-genome diploid relative, and thus potentially a simplified model diploid species for genetic analysis and improvement of quinoa. Resequencing and phylogenetic analysis of a diversity panel of cañahua accessions suggests that coordinated efforts are needed to enhance genetic diversity conservation within ex situ germplasm collections. KEY WORDS Amaranthaceae; Andean crops; Chenopodium pallidicaule; Hi-C; proximity-guided assembly. Chenopodium pallidicaule Aellen (also known as cañahua) is a species of the goosefoot family (Amaranthaceae) related to the increasingly popular seed crop, quinoa (C. quinoa Willd). Gade (1970) noted that cañahua is a partially domesticated crop that provides food security to many subsistence farmers across the Andean Altiplano—the high plateau situated at 3500–4200 m above sea level between the western and eastern Andean Cordilleras of west-central South America. Cultivation of cañahua dates back more than 7000 years when it was a staple crop in ancient Incan and pre-Incan societies. It has several common names in native languages, including cañahua, cañigua, cañihua, cañawa, and kañiwa (Gade, 1970). After the Spanish Conquest, cultivation was likely discouraged in colonial society due to its association with indigenous cultures (Ruas et al., 1999). Although it never completely regained its former status, subsistence farmers across the Andean region continue to grow cañahua due to its tolerance to abiotic stress (i.e., frost, drought, and salinity) in addition to its high nutritional quality. Despite the increasing popularity of its close relative quinoa, cañahua remains practically unknown and underutilized as a food resource (Rastrelli et al., 1996) outside of the Andes. Cañahua has a unique nutritional profile that is ideal for human consumption in areas where protein is limited. Its seed contains 15–18% protein, with a complete set of essential amino acids, including 5–6% lysine, which is typically limiting in monocotyledonous grain crops (Penarrieta et al., 2008). In addition to high-quality protein, cañahua offers a wide variety of other health-promoting compounds, including antioxidants, phenols, and flavonoids (RepoCarrasco-Valencia et al., 2010). Cañahua seeds contain vanillic acid, a phenolic compound which acts as a flavor enhancer and lends a pleasant taste to cañahua, particularly when ground and toasted as a flour called cañihuaco (Penarrieta et al., 2008). With a poverty rate of nearly 50% in the rural highlands of the Altiplano, cañahua represents an incredibly important resource in the prevention of poverty-induced malnutrition and in improving food security throughout the region (Repo-Carrasco et al., 2003). Applications in Plant Sciences 2019 7(11): e11300; http://www.wileyonlinelibrary.com/journal/AppsPlantSci © 2019 Mangelson et al. Applications in Plant Sciences is published by Wiley Periodicals, Inc. on behalf of the Botanical Society of America. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. 1 of 12 Applications in Plant Sciences 2019 7(11): e11300 Gade (1970) noted nearly half a century ago that the continued presence of cañahua in the Altiplano will depend on its genetic transformation into a more efficient crop. Agronomic issues that have prevented more extensive cultivation of cañahua include non-uniform seed ripening and small seed size that make harvesting and processing of the seed difficult (Mujica, 1994). Despite its unique agronomic and nutritional qualities, very few of the genetic resources needed to accelerate the improvement of cañahua have been investigated. Ruas et al. (1999) published a phylogenetic study of 19 Chenopodium L. species based on RAPD markers, including two cañahua accessions that were found to be nearly identical. Vargas et al. (2011) developed the first microsatellite markers for cañahua, including 34 polymorphic markers, exhibiting a total of 154 different alleles. A phylogeny of 43 cañahua accessions showed clear distinctions between wild and cultivated lines, including a distinct subclade of erect morphotypes. Kolano et al. (2011) cytologically characterized the genome size and rDNA loci of cañahua. Their findings predicted a 2C value for the cañahua genome of 0.886 ± 0.034 pg (~433 Mbp per haploid genome) with a single copy of both 35S (subterminal) and 5S (interstitial) rDNA loci. As a part of the genome analysis of quinoa, Jarvis et al. (2017) reported a draft assembly of the cañahua genome (accession PI 478407). Quinoa is an allotetraploid (2n = 4x = 36), presumably resulting from a relatively recent (3.3–6.3 mya) polyploidization event between North American and Eurasian diploids representing the A and B subgenomes of modern quinoa, respectively (Štorchová et al., 2015). Although cañahua is not believed to be the direct A-genome donor of quinoa, it is a related A-genome diploid. The draft genome reported by Jarvis et al. (2017) was based solely on Illumina short reads and was thus highly fragmented, consisting of 3015 scaffolds and spanning a total length of 337 Mbp, with an N50 of 356 kbp. Here we report the use of PacBio long reads and Hi-C–based proximity-guided assembly to develop a reference-quality, chromosome-scale assembly of cañahua. The genome was fully annotated using a deeply sequenced transcriptome developed from six combinations of tissue types and abiotic stresses. Additionally, genetic diversity within the species was characterized with a panel of 30 cultivated and wild cañahua varieties. METHODS Plant material The cañahua accession PI 478407 was used to develop the reference assembly. It was originally collected in 1981 at the Instituto Boliviano de Tecnologia, Patacamaya, Bolivia, and is freely available from the United States Department of Agriculture (USDA; Ames, Iowa, USA; https://npgsweb.ars-grin.gov/). The diversity panel consisted of 30 accessions from three germplasm collections: specifically, eight cañahua varieties from the USDA collection (https:// npgsweb.ars-grin.gov/), one landrace and two wild accessions from the Universidad Nacional Agraria La Molina (UNALM; Lima, Peru), and 21 accessions from Universidad Major de San Andrés (UMSA; La Paz, Bolivia). A complete list of all plant materials used is provided in Table 1. Whole genome assembly In vivo Hi-C and proximity-guided assembly techniques were used to improve the previously published short-read draft assembly reported http://www.wileyonlinelibrary.com/journal/AppsPlantSci Mangelson et al.—The cañahua genome • 2 of 12 by Jarvis et al. (2017), referred to hereafter as the ALLPATHS-LG shortread assembly (ASRA). Fresh leaf tissue from a single, dark-treated (72 h) 3-week-old plant that was derived directly from selfing of the original cañahua ‘PI 478407’ plant used by Jarvis et al. (2017) was sent to Phase Genomics (Seattle, Washington, USA) for in vivo Hi-C–based proximity-guided ligation and 80-bp paired-end sequencing followed by alignment to the ASRA assembly using BWA version 0.7 (Li and Durbin, 2010). Only reads that aligned uniquely to the scaffolds were retained. Proximo, a proximity-guided assembly method based on the ligating adjacent chromatin enables scaffolding in situ (LACHESIS) assembler (Burton et al., 2013), was used to cluster, order, and orient scaffolds from the ASRA assembly, producing the first proximity-guided assembly (PGA1). Following the development of PGA1, long reads were used for gap-filling. High-molecular-weight DNA was extracted from leaf tissue of a single, 72-h dark-treated cañahua (PI 478407) plant using the QIAGEN Genomic-tip 500/G Kit (Hilden, Germany). Single-molecule, real-time sequencing using the PacBio Sequel platform (Menlo Park, California, USA) was performed at the Brigham Young University DNA Sequencing Center (Provo, Utah, USA). The PBJelly2 pipeline from PBSuite version 15.8.24 (English et al., 2012) was used to align the long reads to PGA1 in order to gap-fill the assembly. Arrow version 0.22.0 (Chin et al., 2013) and Pilon version 1.22 (Walker et al., 2014) were used for genome polishing with the previously described PacBio long reads and Illumina paired-end reads, respectively. This gap-filled and polished assembly is henceforth referred to as PGA1.5. To correct for possible errors introduced by low PacBio read coverage and relaxed PBJelly2 parameters, a contig-breaking tool, Polar Star (https://github.com/phasegenomics/polar_star), was employed. Polar Star aligns long reads to an assembly, then calculates the read depth at each base. Read depth is smoothed in a 100-bp sliding window, then regions of high, low, and normal read depth are merged. These classifications are made based on the read depth distribution. Low-read-depth outliers are identified, and the assembly is broken at each such location. Following Polar Star, PGA1.5 underwent a second de novo, proximity-guided assembly. Assembly errors (inversions and rearrangements) were identified and adjusted manually using Juicebox version 1.9.8 (Durand et al., 2016). The result was a chromosome-scale, polished assembly referred to as PGA2 (Appendix S1). Transcriptome assembly, gene annotation, and repeat modeling RNA-Seq data was generated on the Illumina Hi-Seq platform from cañahua (PI 478407) leaf, root, inflorescence, and apical meristem tissues grown in both non-stressed and salt-stressed conditions, as detailed by Jarvis et al. (2017). The reads were trimmed using Trimmomatic version 0.32 (Bolger et al., 2014) to remove Illumina adapters and trailing bases with a quality score below 20, then aligned to the PGA2 reference using HiSat2 version 2.0.4 (Kim et al., 2015; Pertea et al., 2016) with default parameters except the max intron length was set to 50,000 bp. After alignment, the resulting sequence alignment map (SAM) file was sorted and indexed using SAMtools version 1.6 (Li et al., 2009) and assembled into putative transcripts using StringTie version 1.3.4 (Pertea et al., 2016). Whole-genome annotation of the PGA2 assembly was performed by MAKER version 2.31.10 (Holt and Yandell, 2011) using the cañahua transcriptome as expressed sequence tag (EST) evidence, the uniprot_sprot database (downloaded 25 September 2018) and quinoa protein sequences (Jarvis et al., 2017) as protein homology evidence, and the consensi.fa.classified output from RepeatModeler © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 Mangelson et al.—The cañahua genome • 3 of 12 TABLE 1. Passport and sequence archive information for plant materials used. Raw sequencing data for each accession are deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI). Name Collectiona Accession ID Collection location Altitude (m a.s.l.) Sequencing technology SRA IDb WGS reference information PI 478407 PI 478407 PI 478407 PI 478407 USDA USDA USDA USDA PI 478407 PI 478407 PI 478407 PI 478407 −17.2333, −67.9166 −17.2333, −67.9166 −17.2333, −67.9166 −17.2333, −67.9166 3800 3800 3800 3800 PacBio Hi-C (Illumina) WGS (Illumina) RNA-Seq SRR9661228 SRR9661229 SRR4425239c SRR4425240– SRR4425243c Diversity panel information P1 P2 P4 U7 U8 U9 U12 U13 U14 U15 U16 B17 B18 B20 B21 B22 B23 B24 B25 B26 B27 B28 B29 B30 B31 B32 B33 B34 B35 B36 UNALM UNALM UNALM USDA USDA USDA USDA USDA USDA USDA USDA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA UMSA BYU 1780 BYU 1781 BYU 1785 PI 510525 PI 510526 PI 510527 PI 510530 PI 665279 PI 665280 PI 665281 PI 665282 Bol-1.1 Bol-3.1 Bol-19.1 Bol-20.123 Bol-21.123 Bol-22.123 Bol-23.123 Bol-24.123 Bol-25.123 Bol-26.123 Bol-28.123 Bol-29.123 Bol-30.123 Bol-4.3 Bol-6.2 Bol-7.1 Bol-8.1 Bol-13.3 Bol-27.123 −15.6967, −70.20510 −15.7268, −70.23560 −15.7693, −70.27050 −16.3628, −69.2765 −16.2833, −69.2833 −16.0000, −69.7833 −16.4500, −70.2333 −17.2333, −67.9166 −17.2333, −67.9166 −17.2333, −67.9166 −17.2333, −67.9166 −15.7472, −68.8091 −16.5344, −68.0622 −17.8241, −67.7702 −17.7850, −68.1447 −17.6483, −67.2072 −18.2166, −67.0333 −16.5344, −68.0622 −16.6740, −68.3183 −16.5344, −68.0622 −16.5344, −68.0622 −16.6740, −68.3183 −16.5344, −68.0622 −17.2500, −67.9166 −16.6740, −68.3183 −16.6740, −68.3183 −16.6740, −68.3183 −16.6740, −68.3183 −16.6740, −68.3183 −16.6740, −68.3183 3830 3838 3860 NA NA 3810 NA 3700 3700 3700 3700 3845 3445 3721 4025 3777 3707 3445 3900 3445 3445 3900 3445 3800 3900 3900 3900 3900 3900 3900 WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) WGS (Illumina) SRR9620980 SRR9640749 SRR9640748 SRR9640742 SRR9640741 SRR9640740 SRR9640747 SRR9640746 SRR9640745 SRR9640744 SRR9640743 SRR9640755 SRR9640754 SRR9640757 SRR9640756 SRR9640751 SRR9640750 SRR9640753 SRR9640752 SRR9640759 SRR9640758 SRR9640732 SRR9640733 SRR9640734 SRR9640735 SRR9640736 SRR9640737 SRR9640738 SRR9640739 SRR9640731 Note: m a.s.l. = meters above sea level; NA = not available. a Germplasm collection center. USDA = United States Department of Agriculture, Ames, Iowa, USA; UNALM = Universidad Nacional Agraria La Molina, Lima, Peru; UMSA = Universidad Major de San Andrés La Paz, Bolivia; BYU = Brigham Young University, Provo, Utah, USA. b Sequence Read Archive (SRA) identifier. c Deposited in BioProject ID PRJNA326220. All other sequences are deposited in BioProject ID PRJNA552289. for soft repeat masking. Gene prediction models included an Augustus model for cañahua produced by Benchmarking Universal Single-Copy Orthologs (BUSCO) version 3.0.2 (Simao et al., 2015) and the Arabidopsis thaliana SNAP HMM file (Korf, 2004) for gene prediction. BUSCO version 3.0.2 assessed the completeness of the assembly and annotation using the embryophyta odb10 data set. RepeatModeler version 1.0.11 and RepeatMasker version 4.0.7 (Smit et al., 2013–2015) were used to identify and classify repetitive elements in the final (PGA2) assembly relative to Repbase-derived RepeatMasker libraries version 20181026 (Bao et al., 2015). Chloroplast genome assembly and annotation A reference-guided assembly of the cañahua chloroplast genome was constructed by the Assembly by Reduced Complexity (ARC) http://www.wileyonlinelibrary.com/journal/AppsPlantSci assembler version 1.1.4 (Hunter et al., 2015) using a subset of six million whole-genome, paired-end Illumina reads with the quinoa chloroplast genome (Maughan et al., 2019) as a target. The ARC algorithm uses Bowtie2 (Langmead and Salzberg, 2012) with relaxed parameters to map reads against targets, extract mapped reads from each target, and assemble mapped reads using the SPAdes assembler (Bankevich et al., 2012). The targets are then replaced with newly assembled contigs, and the process is iterated for a predetermined number of cycles or until no additional reads can be incorporated. The ARC pipeline extended the assembled cañahua chloroplast contigs through four (numcycles = 4) successive rounds of mapping and re-assembly. Because chloroplast read depth should be significantly higher than nuclear genome read depth, only assembled contigs with read depth >50× coverage were selected for further assembly. Pacific Biosciences long reads (>15 kbp; n = 246,847) were © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 used to fill gaps between contigs using PBJelly2, a subprogram from PBSuite version 15.8.24 (English et al., 2012). A circularized contig representing the complete plastid genome was constructed using the circularize tool from Geneious (version 11.1.5; https://www. geneious.com/), then the assembly was polished with the same six million paired-end Illumina reads as used in the initial assembly. Annotation of the cañahua chloroplast genome was performed using GeSeq version 1.65 (Tillich et al., 2017) with the quinoa chloroplast annotation (Maughan et al., 2019) and the Max Planck Institute of Molecular Plant Physiology (MPI-MP) chloroplast database as references. ARAGORN version 1.2.3 and HMMER profile searches were enabled, the latter using the embryophyta chloroplast (CDS + rRNA) database. Comparison to the quinoa plastid genome was performed by the nucmer tool from MUMmer version 4.0beta (Marcais et al., 2018) followed by MUMmerplot with all default parameters. Resequencing and single-nucleotide polymorphism discovery DNA was extracted from single plants for each of 30 cañahua accessions using a cetyltrimethylammonium bromide (CTAB) extraction method as described by Doyle and Doyle (1987). Samples were sent to Novogene (San Diego, California, USA) for whole-genome Illumina HiSeq (150-bp paired-end) sequencing from 500-bp insert libraries. Trimmomatic version 0.32 (Bolger et al., 2014) was used to remove Illumina adapters and trailing bases with a quality score below 20 or average per-base quality of 20 over a four-nucleotide sliding window. Reads from each accession were aligned to PGA2 using BWA-MEM version 0.7.17 (Li, 2013) to produce SAM files that were converted to binary alignment map (BAM) format, sorted, and indexed using SAMtools version 1.9 (Li et al., 2009). The BAM files were used as input for InterSnp, a subprogram of the BamBam version 1.4 pipeline (Page et al., 2014), for single-nucleotide polymorphism (SNP) genotyping. SNPhylo version 20160204 (Lee et al., 2014) used the HapMap output files produced by InterSnp to filter and remove SNPs with >10% missing data and minor allele frequency <5%. SNPhylo also filters SNP data sets using linkage disequilibrium (LD) estimates (SNPs with LD < 40% are removed) prior to building bootstrapped (n = 1000) phylogenies based on MUSCLE (Edgar, 2004) sequence alignments. The resulting tree was visualized in FigTree version 1.4.3 (http://tree.bio.ed.ac.uk/soft ware/figtree). Population structure was evaluated using Structure version 2.3.4 (Pritchard et al., 2000) with a range of K = 1 through K = 5 from a 1000 SNP subset of the InterSnp output. ArcMap version 10.3.1 (ArcGIS Desktop, release 10; Esri, Redlands, California, USA) mapping software was used to map the geographic locations of the source materials. The clustering partitions produced by Structure were used to construct a pie chart representing the allelic composition of each mapped individual. Genome comparison Comparisons of coding sequences for quinoa (C. quinoa; CoGe id53523), beet (Beta vulgaris L.; CoGe id37197; Funk et al., 2018), and amaranth (Amaranthus hypochondriacus L.; CoGe id34733; Lightfoot et al., 2017) were made using the CoGe SynMap tool (https://genomevolution.org/coge/) applying the Last algorithm with the recommended DAGChainer option (relative gene order) and Merge syntenic blocks option (quota align merge). The syntenic depth was set to quota align merge, at a ratio of coverage depth of http://www.wileyonlinelibrary.com/journal/AppsPlantSci Mangelson et al.—The cañahua genome • 4 of 12 1 : 1 (beet) or 1 : 2 (quinoa, amaranth). The DAGchainer (Haas et al., 2004) output files were used as input for the MCScanX (Wang et al., 2012) and VGSC toolkit (Xu et al., 2016) for data visualization. RESULTS Whole genome assembly The previous draft assembly of PI 478407 reported in Jarvis et al. (2017) was based solely on Illumina short reads assembled using the ALLPATHS-LG assembler (Gnerre et al., 2011). Although this was an excellent draft assembly, the lack of long-jump libraries (i.e., fosmid) or bacterial artificial chromosome (BAC)-end sequencing resulted in a highly fragmented assembly. The ASRA assembly consisted of 8982 contigs in 3013 scaffolds with a contig and scaffold N50 of 84 kbp and 357 kbp, respectively, spanning a total length of 337 Mbp (Table 2). To improve the ASRA, 179 million Hi-C–based paired-end reads were generated and used to scaffold the ASRA using the Proximo pipeline (Phase Genomics). Seventynine percent (2392) of the ASRA scaffolds were clustered into nine pseudomolecules, corresponding to the nine haploid chromosomes of cañahua (2n = 2x = 18; Appendix S1), producing a substantially improved proximity-guided assembly (PGA1). The number of scaffolds clustered to specific chromosomes ranged from 203 to 317, and the length of the assembled chromosomes ranged from 31.3 to 40.4 Mbp. The PGA1 scaffolds contained 95.3% of the total sequence length (99.7% excluding N gaps) with an N50 and L50 of 35.6 Mbp and 5, respectively (Table 2). Ns occupied 12.3 Mbp (4%) of the assembly, with an average of 1047 gaps (20 or more contiguous Ns) per scaffold. The unincorporated scaffolds (621) were small, representing <5% of the total sequence length of the ASRA, with a mean scaffold size of 25.8 kbp, making them much more difficult to incorporate accurately into chromosomes. PGA1 was further improved by applying a combination of gap-filling and genome-polishing techniques. To close gaps, 10.21 Gbp (1,101,202 reads) of PacBio long reads were generated with a mean read length of 9.3 kbp, providing 23.6× coverage of the cañahua genome. PacBio long reads were aligned to PGA1 using PBJelly2 (English et al., 2012), closing 75% of existing N gaps. Due to potential errors introduced into gaps because of the inherent high error rate of PacBio reads, the assembly quality was improved using TABLE 2. Assembly statistics for the ASRA, PGA1, PGA1.5, and PGA2 assemblies. Assembly statistic ASRA PGA1 PGA1.5 PGA2 Assembly size (Mbp) No. of scaffolds Scaffold N50 size (Mbp) Scaffold L50 count Longest scaffold (Mbp) No. of contigs Contig N50 size (Mbp) Contig L50 count % missing bases Assembly size (Mbp) in top 9 scaffolds Assembly % in top 9 scaffolds 337 3015 0.357 243 2.9 8984 0.083 1096 2.5 20 337 623 35.6 5 40.4 8984 0.083 1096 2.6 321 363 591 37.8 5 43.2 2580 0.516 168 0.2 344 363 4633 38.1 5 45.5 8210 0.236 401 0.1 350 5.8 95.4 94.8 96.5 Note: ASRA = ALLPATHS-LG Short-Read Assembly; PGA1 = Proximity-Guided Assembly 1; PGA1.5 = Proximity-Guided Assembly 1.5; PGA2 = Proximity-Guided Assembly 2. © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 two genome-polishing tools: Arrow (Chin et al., 2013), which produces consensus-quality assemblies from PacBio sequences, followed by Pilon (Walker et al., 2014), which performs a similar function but takes advantage of the significantly lower error rate of Illumina reads to improve the consensus assembly. These polishing steps made changes at 593,821 positions, representing <0.165% of PGA1. The resulting assembly, PGA1.5, had a total size of 363 Mbp, an approximately 7.7% increase from the ASRA. The scaffold N50 of PGA1.5 increased slightly to 37.8 Mbp, while the number of gaps decreased dramatically from 8013 to 2007, which is also reflected in a 10-fold decrease in the number of Ns in the assembly (4% to 0.2%; Table 2). A second round of proximity-guided assembly using PGA1.5 improved the chromosome-scale assembly. Polar Star, which aggressively breaks contigs at low PacBio depth locations based on deviation from mean depth, introduced 5241 breaks that were then tested for rescaffolding using Hi-C–based proximity-guided assembly. This acts as a check on the error-prone PacBio reads and low coverage depth used in the gap-filling process. The result is a dramatically improved proximity-guided assembly, evident by the consistent pattern of Hi-C crosslink density along chromosomes and the resolution of erroneous inversions and rearrangements. The final assembly (PGA2) spans 362.5 Mbp, has a scaffold N50 and L50 of 38.1 Mbp and 5, respectively, with <0.1% of the assembled sequence found in 3586 gaps. Eighty-four percent of the estimated genome size is represented; the remaining 16% is likely composed of repetitive sequence that has collapsed in regions such as centromeres and telomeres due to the use of short reads for the initial assembly. The nine chromosomes contain 96.7% of the total sequence length (99.9% excluding N gaps), ranging in size from 33.5 Mbp to 45.4 Mbp (Appendix S2). Repeat identification and genome annotation RepeatModeler and RepeatMasker were employed to identify and classify repetitive elements in the cañahua genome. Fifty-three percent (191 Mbp) of the cañahua genome was classified as repetitive, with an additional 1.9% (7 Mbp) classified as low complexity (satellites, simple repeats, and small RNAs). A total of 129 Mbp (35.5%) was identified as retrotransposons or DNA elements, with an additional 61 Mbp (16.8%) classified as unknown elements. The most common elements identified were long terminal repeat (LTR) retrotransposons, specifically copia-like and gypsy-like elements, which spanned 67 Mbp (27.1%) of the genome (Appendix S3). The large fraction of unknown elements is not surprising given that the only published studies of repetitive elements in the Chenopodium genus have been limited to rDNA sequences (Maughan et al., 2006; Kolano et al., 2011) and two repetitive sequences, 18-24J and 12-13P, that were only recently characterized cytogenetically (Orzechowska et al., 2018). BLASTn was used to locate the 5S rDNA sequence and the two Chenopodium repetitive elements within the final assembly. Consistent with the findings of Kolano et al. (2011), the 5S rDNA sequence was found in a single genomic location in the centromeric region of chromosome Cp8 (Fig. 1). Orzechowska et al. (2018) previously reported that the 18-24J repeat was almost exclusively found in the Chenopodium B-genome, whereas 12-13P was located at pericentric positions in both the A- and B-genomes. BLASTn searches of the cañahua genome confirmed these observations, with 18-24J identified in only 0.012% of the cañahua genome while the 12-13P repetitive element, occupying 124.6 kbp (0.027%), was localized to putative centromeric regions on all nine chromosomes (Fig. 1). http://www.wileyonlinelibrary.com/journal/AppsPlantSci Mangelson et al.—The cañahua genome • 5 of 12 A transcriptome assembly of cañahua was developed by sequencing RNA-Seq libraries from six unique tissue and abiotic stress combinations. The resulting RNA-Seq libraries generated 66.3 Gbp of data from 663,493,956 paired-end reads with an average of 11.05 Gbp per library. Ninety-eight percent (649,273,284) of the paired RNA-Seq reads aligned to the final PGA2 assembly, with 97.9% of those read pairs aligning concordantly—suggestive of a high-quality genome assembly. A Stringtie (Pertea et al., 2016) reconstruction of the cañahua transcriptome identified 255,893 features, including 214,170 exons in 41,723 primary and alternative transcripts with a mean transcript length of 2.19 kbp and an average of 28,246 features per chromosome. The MAKER pipeline was used to annotate PGA2 using as evidence the cañahua transcriptome (described above) and, as alternative evidence, the transcripts and protein sequences for quinoa (Jarvis et al., 2017), as well as the complete uniprot_sprot database. A total of 22,832 gene models were annotated, which is slightly more than half of the 44,776 gene models annotated in the allotetraploid quinoa (Fig. 1). The average transcript length was 4.6 kbp, with the longest protein sequence spanning 4769 amino acids (annotation ID: CP013000), which is predicted to encode the large sacsin-like gene found in many eukaryotes, including other Amaranthaceae species such as quinoa (XP_021735414), beet (XP_010688704), and spinach (XP_021846357). The mean annotation edit distance (AED), which is a quality measure combining values for sensitivity, specificity, and accuracy to give evidence of a high-quality annotation, was 0.23. AED values <0.3 are indicative of high-quality annotations (Holt and Yandell, 2011). Completeness of the gene space was assessed using the BUSCO platform, which quantifies functional gene content using a large core set of highly conserved orthologous genes (COGs). Of the 1375 plant-specific COGs in the embryophyta database, 1341 (97.5%) were identified in the cañahua genome as complete, with another nine (0.7%) COGs classified as fragmented (complete: 97.5% [single: 95.9%, duplicated: 1.6%], fragmented: 0.7%, missing: 1.8%). Relative to the MAKER de novo annotated proteins and transcripts, BUSCO identified 1260 (91.6%) and 1303 (94.8%) complete COGs, respectively. The discrepancies between the whole genome, protein, and transcript BUSCO findings may be attributed to the difference in gene annotation method between BUSCO and MAKER. Whereas BUSCO uses BLAST to identify known genes, MAKER uses an approach that requires sufficient evidence from a combination of protein, EST, and ab initio gene prediction inputs. The annotation could potentially be improved by further training of the input gene prediction model (Augustus, SNAP). Chloroplast genome reconstruction The cañahua chloroplast assembly spans 151,799 bp in a single circular molecule. Annotation of the chloroplast genome revealed a quadripartite structure, including two copies of an inverted repeat (IR) region separating large and small single-copy regions. One hundred thirty-two genes were identified, including 88 protein-coding genes, 36 tRNA genes, and eight rRNA genes (Fig. 2). Twenty-one genes were located in each IR, including a pseudogene previously characterized in other Amaranthaceae species as rpl23 (Park et al., 2018; Maughan et al., 2019). With a length of 151,799 bp, the cañahua plastid genome is of a similar size to quinoa, which has been reported for multiple quinoa accessions ranging in size from 152,079–152,282 bp, with an average length of 152,134 bp (Hong © 2019 Mangelson et al. M M 35 25M 30 5M 25M 30M 20M 5M 8 M 10 Cp Cp 10M M 20 25M 30M 2 M M 15 M 15 M M 25 20 Cp1 Cp9 • 6 of 12 M 35 M 15M 10M 5M 40 30 15M 20M Mangelson et al.—The cañahua genome 10M Applications in Plant Sciences 2019 7(11): e11300 5M 35M 35M 5M 30M Chenopodium pallidicaule nuclear genome 363 Mbp; 22,832 gene models 15M 10M Cp3 20M Cp7 25M 15M 20M 25M 10M 5M 30M 35M M 40 5M M 35 10 M 4 M 25M 35M 30M 45M 40M 5M 10 M 15 20 M 10M 20M 15M 5M Cp5 M 15 M M 20 M 25 M 30 M 35 Cp p6 M 25 C 30 FIGURE 1. Genome annotation overview. An overview of gene and repetitive element annotations in the Chenopodium pallidicaule genome. Track 1: chromosome names and sizes; Track 2: frequency of pericentromeric 12-13P repetitive elements (purple); Track 3: frequency of 18-24J repetitive element (blue) and the 5S rRNA locus (red); Track 4: frequency of canonical telomeric repeat; Track 5: gene density. et al., 2017; Maughan et al., 2019). Due to the lack of recombination in chloroplast genomes and the relatively recent allotetraploidization event leading to quinoa (3.3–6.3 mya; Jarvis et al., 2017), the high degree of similarity between the cañahua and quinoa chloroplasts supports the hypothesis that the maternal parent in the polyploidization event that led to modern quinoa was an A-genome species (Maughan et al., 2019). Diversity panel resequencing A diversity panel consisting of 30 varieties of cañahua, including 28 landrace varieties and two wild accessions, was sequenced to an average depth of 10.9× coverage (4.7 Gbp) per accession. After read alignment to the PGA2 final assembly, a total of 358,461 SNPs were identified in the diversity panel, which were then filtered to 16,194 SNPs, based on minor allele frequency, missing data, and linkage disequilibrium, with an average of 1799 SNPs per chromosome. Analysis of the consensus, 1000-bootstrap phylogeny of the cañahua diversity panel suggests several major points of interest (Fig. 3A). First, the USDA collection of the species is limited to only two of three major groups, with the majority (seven out of eight accessions) on a single group, suggesting limited diversity within the USDA collection and highlighting the need for international collection efforts to preserve diversity within the species. Second, the Mantel test suggests that there is no correlation between http://www.wileyonlinelibrary.com/journal/AppsPlantSci collection site and genotype (Z = 11,296.22, r = −0.12326, and P = 0.837). This is likely due to a lack of good collection site data for many of the accessions. Indeed, eight of the accessions have as their passport data the latitude and longitude coordinates of the research facilities where they are curated instead of the coordinates of the original collection site (Fig. 3B, Table 1). Another potentially complicating issue is the well-known cultural practice of seed trading among indigenous Andean societies that was an important part of agriculture in the pre-Columbian Altiplano region for thousands of years (Vargas et al., 2011). Lastly, the two wild accessions (P1, P4) are found by themselves on a distinct clade within the phylogeny. A structure analysis (Pritchard et al., 2000) suggests that they are distinct from the landrace and cultivated accessions, showing little admixture with cultivated accessions, even though they were collected in close proximity to a cultivated type (P2, Fig. 3C). This finding agrees with those of Vargas et al. (2011) and is further evidence that the wild accessions may be useful sources of novel genetic diversity for improving cañahua. Genome comparison Syntenic relationships between cañahua and other Amaranthaceae species were explored using DAGChainer (Haas et al., 2004), which identifies colinear sets of homologous gene pairs (syntenic blocks) between genomes. The first species of the family with a © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 Mangelson et al.—The cañahua genome C A pe trn tN CGC rp o B trn tr E-U tr nE- UC trn nY-GUUC trn D-G UA D- U G C ps UC bM K ndh C ndh ycf3* L U psaA psaB-fragment Final-fragm ent psaA-f psaB ragmen t rps14 trnM-C trnfM AU -CAU trnS -UG A rps4 GU trnT-U ndhJ rbc CA Mtrn A m ce LSC 1* oC rp p p sb p sbL J ps sbF trn bE trn Wrpl2 trn P-U CCA 0 P- G UG G G rps 12fra gm ent clpP psbB psbT psbH psb D p sb B atp E a tp cD ac aI ps f4 yc tA ps r aJ rpspl33 18 trn trn T-GG T- G U GU trnG-G CC trnG psbZ -GCC trnS-GGA trnS-GGA AA trnL-U AA trnF-G C UA V- AC trn V-U trn pe p pe etL tG C2 rpo 2 rps I atp H atp CU trnR-U CU trnR-U GA trnS-C atpF atpA psbN petB psbI psbK U petD trnS-GC Chenopodium pallidicaule rpoA rps11 rpl36 infA rps8 rpl14 rpl16 rps3 rpl22 rps19 rpl2 rpl23-CAU trnM AU trnI-C trnQ-UUG rps16 chloroplast genome trnK-UUU matK psbA trnH-GU G 151,799 bp rps19-fra gment rpl2 rpl23 trnM-C A U trnI-C AU ycf2 trn IR A A A L-C ent IR gm B -fra ycf2 hB nd 7 s rp s12 l rp ina F trn psaC ndhE ndhG UU trn rp nd s7 hB rp s1 rrn 23 ndhD ndh F N-G 16 rrn AU I-G trn 16 AG ccsA rpl32 ycf trnL-U UU N-G 1 trn rrn 4 trn r .5 trn R-ACrn5 R-A G CG C AC t UG AA G men rps15 ndhH ndhI ndhA trn A- V- 2 -frag G V- trn ycf1 AC L-C trn rrn AUal I-G in C trn F UG A23 trn rrn 4.5 rrn rrn5 G AC R- CG trn R-A trn S SC photosystem I photosystem II cytochrome b/f complex ATP synthase NADH dehydrogenase RuBisCO large subunit RNA polymerase ribosomal proteins (SSU) ribosomal proteins (LSU) clpP, matK other genes hypothetical chloroplast reading frames (ycf) ORFs transfer RNAs ribosomal RNAs origin of replication polycistronic transcripts • 7 of 12 FIGURE 2. Chloroplast annotation overview. The outside track shows genes transcribed in a clockwise direction, the second track shows genes transcribed in a counterclockwise direction, and the inside track shows G/C content levels. Annotation reveals a quadripartite structure, including two copies of the inverted repeat (bolded line) dividing large and small single-copy regions. reference-quality chromosome-scale assembly was beet (B. vulgaris; 2n = 2x = 18; Dohm et al., 2014). A genomic comparison between cañahua and beet identified 162 syntenic blocks with 11,659 colinear gene pairs (average of 72 genes/block) spanning 271 Mbp. As expected, given the relatively close ancestry of the species, the size (in base pairs) of the syntenic blocks between species was highly correlated (R2 = 0.80). The large blocks of syntenic genes are suggestive of homologous relationships between the chromosomes of the two species (Fig. 4A). Indeed, homologous chromosome pairs can easily be identified for all cañahua and beet chromosomes: http://www.wileyonlinelibrary.com/journal/AppsPlantSci Cp1 = Bv1 (93% shared syntenic block sequence), Cp2 = Bv2 (100%), Cp3 = Bv3 (99%), Cp4 = Bv4 (100%), Cp5 = Bv5 (100%), Cp6 = Bv6 (100%), Cp7 = Bv7 (100%), Cp8 = Bv8 (100%), Cp9 = Bv9 (96%). To maintain the family naming convention, we have assigned the cañahua chromosomes with the same number as their beet homologs (i.e., Cp1 = Bv1, etc.). Beet and cañahua are diploids that share a base chromosome number of x = 9, whereas the base number in Amaranthus is x = 8. Lightfoot et al. (2017) identified evidence of chromosome loss (the homoeolog of Ah5) and chromosome fusion (Ah1) in the amaranth © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 A Mangelson et al.—The cañahua genome • 8 of 12 B C FIGURE 3. Diversity panel. (A) The unrooted tree was developed using 16,194 single-nucleotide polymorphisms (SNPs) filtered to remove SNPs with >10% missing data, minor allele frequency <5%, and linkage disequilibrium <40%. Colors represent the collection source (purple = United States Department of Agriculture, green = Universidad Nacional Agraria La Molina, blue = Universidad Major de San Andrés La Paz), and bolded lines indicate wild accessions. (B) Geographic location (see Table 1 for passport information) combined with population structure information developed by Structure with K = 4. There is no significant correlation between collection site and genetic distance (P = 0.837). The wild Chenopodium pallidicaule accessions are identified with arrows. (C) Population structure and admixture in the diversity panel. genome that likely led to the observed base chromosome number reduction in the amaranths. Our comparison of cañahua to the amaranth genome identified 285 syntenic blocks with 13,200 colinear gene pairs. Although there was an increase in syntenic blocks identified in the cañahua–amaranth comparison, the number of genes per block dropped (46 genes/block) and was accompanied by a lower syntenic block size correlation (R2 = 0.25). The decrease in block size and correlation is reflective of the more distant evolutionary relationship between these two species. Our analysis confirms the chromosome loss and fusion events in the amaranth genome. Indeed, the entirety of Cp9 aligns twice (end-to-end) with Ah1, and one homolog of Cp1 is largely missing (Fig. 4B). The synteny observed among the cañahua and amaranth chromosomes suggests several homoeologous relationships within the amaranth genome. For example, cañahua chromosome Cp6 is clearly homologous across the entirety of amaranth chromosomes Ah2 and Ah3. Indeed, of the 54 Mbp of amaranth sequence that is syntenic with Cp6, 48% (26 Mbp) is syntenic to Ah2 and 52% (28 Mbp) is syntenic to Ah3—clearly suggestive that Ah2 and Ah3 are homoeologs. Using the syntenic data from the cañahua–amaranth comparison and a simple majority rule (>75% syntenic sequence), we identify the following orthologous relationships: Cp1 = Ah5 (homoeolog loss), http://www.wileyonlinelibrary.com/journal/AppsPlantSci Cp2 = Ah10/Ah16, Cp4 = Ah6/Ah4, Cp6 = Ah3/Ah2, Cp7 = Ah8/Ah15, Cp8 = Ah14/Ah9, and Cp9 = Ah1/Ah1 (homoeolog fusion). Only one of the amaranth homoeologs of Cp3 is clearly identifiable in the data, specifically Ah13, with Ah7 likely the homoeolog, but obscured by a large translocation between Ah7 and Ah4. Similarly, the orthologous relationship with Cp5 likely involves Ah11 and Ah12 but is complicated by translocations with Ah2 (Fig. 4B, Appendix S4). Comparison of cañahua to the quinoa genome identified 418 syntenic blocks with 23,410 colinear gene pairs (Appendix S5). When analyzed on a subgenome basis, cañahua had considerably more and gene-dense syntenic blocks with subgenome A (13,073 gene pairs, 71.1 genes/block) relative to subgenome B (10,337 gene pairs, 46.2 genes/block; Appendix S5). The size of the sytenic blocks were also more highly conserved with the A subgenome relative to the B subgenome (R2 = 0.82 and 0.35, respectively) as well as the total number of syntenic genes (Table 3), confirming that cañahua is representative of the A-genome species in the genus. Although both the A and B subgenomes have maintained similar chromosomal structure, the A-subgenome homoeologs in quinoa can be clearly identified via visual inspection of the syntenic sequence dot plots and are supported by the amount of syntenic bases shared © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 A Mangelson et al.—The cañahua genome • 9 of 12 B C FIGURE 4. Genomic comparison of cañahua with beet, amaranth, and quinoa. Synteny dot plot (left) and dual syteny plots (right) show syntenic regions between cañahua and beet (A), amaranth (B), and quinoa (C) coding sequences. The dual synteny plot of the quinoa genome is divided into A- and B-subgenomes with cañahua in the center. Increasing color intensity is associated with increasing homology in the dot plots. The arrows identify the chromosomal fusion (red) and loss (blue) in amaranth. (Fig. 4C, Appendix S4). All quinoa A chromosomes share a higher number of syntenic genes with cañahua than their B homoeologs, except for Cq4A and Cq4B where 1376 and 1416 syntenic genes were identified, respectively. However, the number of genes per syntenic block shared with Cp4 is higher in Cq4A (76.4) than for Cq4B (70.8), as is the total amount of syntenic bases shared with Cp4 (28.3 Mbp), which is suggestive that the assignment of Cq4A to the A-subgenome is likely correct. This is even more significant considering that the B-subgenome of quinoa is larger than the A-subgenome (531 Mbp in the A-subgenome and 670 Mbp in the B-subgenome; Jarvis et al., 2017). We calculated the rate of synonymous substitutions per synonymous site (KS) in duplicate gene pairs between cañahua and the A- and B-subgenome chromosomes found in quinoa (Jarvis et al., 2017). Clear peaks are present at KS = 0.025 and 0.05 for the A- and B-subgenome comparisons, respectively, reflecting a notable difference in the estimated time since the A and B subgenomes of quinoa last shared a common ancestor with cañahua (Appendix S6). Indeed, depending on whether an A. thaliana–based synonymous mutation rate (Koch et al., 2000) or a core eukaryotic–based rate is used (Lynch and Conery, 2000) in the calculation, the A-subgenome of quinoa last shared a common ancestor with cañahua approximately http://www.wileyonlinelibrary.com/journal/AppsPlantSci 0.8–1.5 mya, whereas the B-subgenome and cañahua have been diverged for nearly twice as long (1.7–3.1 mya). KS values suggest that the last common ancestor between cañahua and beet was approximately 16–29.6 mya, whereas the last common ancestor between cañahua and amaranth was more distant at 21.3–39.5 mya (Appendix S6, Table 3). DISCUSSION The value of incorporating Hi-C data and long reads into the assembly is clear when comparing ASRA and PGA2 assemblies. The Hi-C data increased contiguity of PGA2 significantly by reducing the assembly from 3015 scaffolds to nine chromosome-scale scaffolds, while the long-read sequence dramatically reduced the number of gaps (by 75%) in the assembly as well as increasing the total assembly size. One notable disadvantage to developing a genome assembly based on short reads is the difficulty of properly assembling repetitive elements (Richards, 2018). When the read length is shorter than the repeat element, the reads collapse into single contigs, resulting in genome assemblies that can be significantly smaller than the actual size of the genome such as seen here. For example, © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 Mangelson et al.—The cañahua genome • 10 of 12 TABLE 3. Comparison of gene synteny, synonymous substitutions rate, and divergence since the last common ancestor relative to cañahua. Metric Amaranth Beet Quinoa A-subgenome Quinoa B-subgenome Total no. of genesa Unique syntenic genesb % of syntenic genes Syntenic genes/block Average syntenic block size (Mbp) Ks peak value Last common ancestor (mya) 45,947 23,878 52.0 46.3 1.3 0.64 21.33–39.51 45,334 23,075 50.9 71.9 3.1 0.48 16–29.63 43,663 26,230 60.1 71.1 4.7 0.025 0.830–1.54 44,638 25,327 56.7 46.1 4.9 0.05 1.67–3.09 Note: Ks = synonymous substitutions per synonymous site. Total number of annotated genes in cañahua and the comparison species. b Total number of unique syntenic genes in cañahua and the comparison species. a the telomeric repeat in PGA2 was largely collapsed into a single contig that was not scaffolded to any of the chromosomes. Although there are traces of telomere sequence on several of the nine scaffolds (Fig. 1), the integrity of this element was largely lost. Nonetheless, the assembly method reported here is cost-effective, requiring only inexpensive Illumina short-read technology for the initial assembly and Hi-C scaffolding, while the more expensive long reads (PacBio) necessary for gap-filling are only needed at low coverage. The high level of synteny between cañahua chromosomes and the A-subgenome chromosomes of quinoa, as well as the high chloroplast similarity and KS values, provides strong evidence supporting a New World A-genome diploid as the maternal cytoplasm donor of the A-subgenome in the allopolyploidization of quinoa. However, given the closer proximity between the Eurasian landmass (B-subgenome origin) with North America versus South America, a North American A-genome diploid donor is more logical than a South American origin donor, such as cañahua. Thus it is unlikely that cañahua is the direct ancestor of the A-subgenome in quinoa, suggesting that future genomic analyses of the more than 45 putative A-genome diploid Chenopodium species should provide important insight into the polyploidization events that underlie the evolution and domestication of the New World AABB Chenopodium species complex that includes free-living C. berlandieri Moq. subsp. berlandieri, C. quinoa var. melanospermum Hunz., C. quinoa subsp. milleanum Aellen, and C. hircinum Schrad., along with their domesticated forms C. quinoa and C. berlandieri subsp. nuttalliae (Saff.) H. Dan. Wilson & Heiser (Wilson, 1990). Indeed, recently reported read-mapping percentages reveal that C. watsonii A. Nelson and C. sonorense Benet-Pierce & M. G. Simpson, both wild diploids collected in the southwestern United States, align more closely to the quinoa A-subgenome than does cañahua, with C. watsonii exhibiting the highest mapping percentage (Jellen et al., 2019). Whole-genome sequencing of an additional 24 putative A-genome diploid Chenopodium species, originating from across North, Central, and South America, is currently underway in our laboratory. Careful evaluation of chromosomes within the Amaranthaceae family can shed light on how these genomes evolved over time and what role structural changes have played in biological function. For example, homologs of Cp5 are highly conserved in both the A and B subgenomes of quinoa (Cq5A and Cq5B), but there is clear structural variation in comparison to the homolog in beet, Bv5 (Fig. 4A). One of the amaranth homologs of Cp5 is collinear (Ah2), whereas the second homolog is split between two chromosomes (Ah11 and Ah12) but also reflects a similar order. This may be evidence that a terminal inversion occurred in the evolution of http://www.wileyonlinelibrary.com/journal/AppsPlantSci beet after the divergence from a common ancestor. Homologs of Cp9 also show an evolutionarily interesting pattern. Whereas Cp9 is conserved in the A-subgenome of quinoa (Cq4A), demonstrated both by a syntenic dot plot (Fig. 4A) and a high number of syntenic genes (1323; Appendix S5), the B-subgenome homolog has a much different structure and less than half the number of syntenic genes (536). Meanwhile, beet and amaranth both have unique rearrangements of this homolog (Bv9 and Ah1, respectively), suggesting that the order of genes along this molecule may not hold significant biological importance. In conclusion, the reference-quality, chromosome-scale assembly of cañahua presented here dramatically improves the existing resources for this regionally important subsistence crop. The reference genome provides a critical genomic tool needed to draw attention to cañahua, which should lead to renewed interest in improved varietal development using modern plant breeding techniques, including marker-assisted selection and genomic selection (Jannink et al., 2010; Brachi et al., 2011). The genome annotation reported here will undoubtedly facilitate gene discovery efforts within the species, allowing researchers to move quickly from genetic linkage/linkage disequilibrium experiments to possible candidate gene targets. Indeed, once target genomic regions are identified, enhanced marker-assisted breeding methods can be more effectively employed. Given that the only other domesticated Chenopodium species are complex allotetraploids (C. quinoa and C. berlandieri subsp. nuttalliae), we anticipate that cañahua will serve as a simplified genetic model for the family (Bertioli et al., 2016; Du et al., 2018). Lastly, the resequencing information presented here for a large diversity panel of cañahua accessions collected from across the Altiplano should provide the preliminary data needed for germplasm conservation and core-collection development. ACKNOWLEDGMENTS The authors would like to thank PROINPA Foundation for the generous donation to the Universidad Mayor de San Andrés, La Paz, Bolivia, of 10 cañahua lines belonging to the “Programa de mejoramiento genético de cañahua.” DATA AVAILABILITY The raw sequences are deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive database under the BioProject ID PRJNA552289 with the following © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 accession numbers: SRR9661228 (PacBio), SRR9661229 (Hi-C), and SRR9640731–SRR9640759 and SRR9620980 (resequencing panel). Bulk data downloads, including annotations and BLAST analysis, and JBrowse viewing of the final proximity-guided assembly are available at CoGe (https://genomevolution.org/coge/; Genome id53872). SUPPORTING INFORMATION Additional Supporting Information may be found online in the supporting information tab for this article. APPENDIX S1. Outline of the genome assembly process. APPENDIX S2. Length and contig number for each chromosome-scale scaffold in PGA2. APPENDIX S3. Repetitive element classification for final assembly (PGA2) as reported by RepeatMasker. APPENDIX S4. Orthologous genes were identified between cañahua and beet (A), amaranth (B), and quinoa (C) to detect orthologous chromosome relationships. APPENDIX S5. Comparison of gene synteny between cañahua and the two subgenomes of quinoa. APPENDIX S6. Rate of synonymous substitutions per synonymous site (Ks). Ks values within duplicated gene pairs between cañahua with amaranth (red), beet (yellow-brown), tetraploid quinoa (green), the A-subgenome of quinoa (blue), and the B-subgenome of quinoa (purple). LITERATURE CITED Bankevich, A., S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, et al. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19: 455–477. Bao, W. D., K. K. Kojima, and O. Kohany. 2015. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6: 11. Bertioli, D. J., S. B. Cannon, L. Froenicke, G. D. Huang, A. D. Farmer, E. K. S. Cannon, X. Liu, et al. 2016. The genome sequences of Arachis duranensis and Arachis ipaensis, the diploid ancestors of cultivated peanut. Nature Genetics 48: 438. Bolger, A. M., M. Lohse, and B. Usadel. 2014. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30: 2114–2120. Brachi, B., G. P. Morris, and J. O. Borevitz. 2011. Genome-wide association studies in plants: The missing heritability is in the field. Genome Biology 12: 232. Burton, J. N., A. Adey, R. P. Patwardhan, R. L. Qiu, J. O. Kitzman, and J. Shendure. 2013. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nature Biotechnology 31: 1119–1125. Chin, C. S., D. H. Alexander, P. Marks, A. A. Klammer, J. Drake, C. Heiner, A. Clum, et al. 2013. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10: 563–569. Dohm, J. C., A. E. Minoche, D. Holtgrawe, S. Capella-Gutierrez, F. Zakrzewski, H. Tafer, O. Rupp, et al. 2014. The genome of the recently domesticated crop plant sugar beet (Beta vulgaris). Nature 505: 546–549. Doyle, J. J., and J. L. Doyle. 1987. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochemical Bulletin 19: 11–15. http://www.wileyonlinelibrary.com/journal/AppsPlantSci Mangelson et al.—The cañahua genome • 11 of 12 Du, X. M., G. Huang, S. P. He, Z. E. Yang, G. F. Sun, X. F. Ma, N. Li, et al. 2018. Resequencing of 243 diploid cotton accessions based on an updated A genome identifies the genetic basis of key agronomic traits. Nature Genetics 50: 796–802. Durand, N. C., J. T. Robinson, M. S. Shamim, I. Machol, J. P. Mesirov, E. S. Lander, and E. L. Aiden. 2016. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Systems 3: 99–101. Edgar, R. C. 2004. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research 32: 1792–1797. English, A. C., S. Richards, Y. Han, M. Wang, V. Vee, J. X. Qu, X. Qin, et al. 2012. Mind the gap: Upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE 7: e47768. Funk, A., P. Galewski, and J. M. McGrath. 2018. Nucleotide-binding resistance gene signatures in sugar beet, insights from a new reference genome. Plant Journal 95: 659–671. Gade, D. W. 1970. Ethnobotany of cañihua (Chenopodium pallidicaule), rustic seed crop of the Altiplano. Economic Botany 24: 55–61. Gnerre, S., I. MacCallum, D. Przybylski, F. J. Ribeiro, J. N. Burton, B. J. Walker, T. Sharpe, et al. 2011. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences USA 108: 1513–1518. Haas, B. J., A. L. Delcher, J. R. Wortman, and S. L. Salzberg. 2004. DAGchainer: A tool for mining segmental genome duplications and synteny. Bioinformatics 20: 3643–3646. Holt, C., and M. Yandell. 2011. MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12: 491. Hong, S. Y., K. S. Cheon, K. O. Yoo, H. O. Lee, K. S. Cho, J. T. Suh, S. J. Kim, et al. 2017. Complete chloroplast genome sequences and comparative analysis of Chenopodium quinoa and C. album. Frontiers in Plant Science 8: https://doi. org/10.3389/fpls.2017.01696. Hunter, S. S., R. T. Lyon, B. A. J. Sarver, K. Hardwick, L. J. Forney, and M. L. Settles. 2015. Assembly by Reduced Complexity (ARC): A hybrid approach for targeted assembly of homologous sequences. bioRxiv 014662 [Preprint]. 31 January 2015 [cited 5 July 2018]. Available from: https://doi. org/10.1101/014662. Jannink, J. L., A. J. Lorenz, and H. Iwata. 2010. Genomic selection in plant breeding: From theory to practice. Briefings in Functional Genomics 9: 166–177. Jarvis, D. E., Y. S. Ho, D. J. Lightfoot, S. M. Schmockel, B. Li, T. J. A. Borm, H. Ohyanagi, et al. 2017. The genome of Chenopodium quinoa. Nature 542: 307–312. Jellen, E. N., D. E. Jarvis, S. P. Hunt, H. H. Mangelsen, and P. J. Maughan. 2019. New seed collections of North American pitseed goosefoot (Chenopodium berlandieri) and efforts to identify its diploid ancestors through whole-genome sequencing. Ciencia e Investigacion Agraria 46: 187–196. Kim, D., B. Langmead, and S. L. Salzberg. 2015. HISAT: A fast spliced aligner with low memory requirements. Nature Methods 12: 357–360. Koch, M. A., B. Haubold, and T. Mitchell-Olds. 2000. Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Molecular Biology and Evolution 17: 1483–1498. Kolano, B., B. W. Gardunia, M. Michalska, A. Bonifacio, D. Fairbanks, P. J. Maughan, C. E. Coleman, et al. 2011. Chromosomal localization of two novel repetitive sequences isolated from the Chenopodium quinoa Willd. genome. Genome 54: 710–717. Korf, I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5: 59. Langmead, B., and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9: 357–359. Lee, T. H., H. Guo, X. Y. Wang, C. Kim, and A. H. Paterson. 2014. SNPhylo: A pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15: 162. Li, H. 2013. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997 [Preprint]. 16 March 2013 [cited 14 September 2018]. Available from: https://arxiv.org/abs/1303.3997v2. Li, H., and R. Durbin. 2010. Fast and accurate long-read alignment with BurrowsWheeler transform. Bioinformatics 26: 589–595. © 2019 Mangelson et al. Applications in Plant Sciences 2019 7(11): e11300 Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, et al. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079. Lightfoot, D. J., D. E. Jarvis, T. Ramaraj, R. Lee, E. N. Jellen, and P. J. Maughan. 2017. Single-molecule sequencing and Hi-C-based proximity-guided assembly of amaranth (Amaranthus hypochondriacus) chromosomes provide insights into genome evolution. BMC Biology 15: 74. Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290: 1151–1155. Marcais, G., A. L. Delcher, A. M. Phillippy, R. Coston, S. L. Salzberg, and A. Zimin. 2018. MUMmer4: A fast and versatile genome alignment system. PLoS Computational Biology 14: e1005944. Maughan, P. J., B. A. Kolano, J. Maluszynska, N. D. Coles, A. Bonifacio, J. Rojas, C. E. Coleman, et al. 2006. Molecular and cytological characterization of ribosomal RNA genes in Chenopodium quinoa and Chenopodium berlandieri. Genome 49: 825–839. Maughan, P. J., L. Chaney, D. J. Lightfoot, B. J. Cox, M. Tester, E. N. Jellen, and D. E. Jarvis. 2019. Mitochondrial and chloroplast genomes provide insights into the evolutionary origins of quinoa (Chenopodium quinoa Willd.). Scientific Reports 9: 185. Mujica, A. 1994. Andean grains and legumes. In J. E. H. B. a. J. León [ed.], Neglected crops: 1492 from a different perspective. FAO, Rome, Italy. Orzechowska, M., M. Majka, H. Weiss-Schneeweiss, A. Kovarik, N. BorowskaZuchowska, and B. Kolano. 2018. Organization and evolution of two repetitive sequences, 18-24J and 12-13P, in the genome of Chenopodium (Amaranthaceae). Genome 61: 643–652. Page, J. T., Z. S. Liechty, M. D. Huynh, and J. A. Udall. 2014. BamBam: Genome sequence analysis tools for biologists. BMC Research Notes 7: 829. Park, J.-S., I.-S. Choi, D.-H. Lee, and B.-H. Choi. 2018. The complete plastid genome of Suaeda malacosperma (Amaranthaceae/Chenopodiaceae), a vulnerable halophyte in coastal regions of Korea and Japan. Mitochondrial DNA Part B 3: 382–383. https://doi.org/10.1080/23802359.2018.1437822. Penarrieta, J. M., J. A. Alvarado, B. Akesson, and B. Bergenstahl. 2008. Total antioxidant capacity and content of flavonoids and other phenolic compounds in canihua (Chenopodium pallidicaule): An Andean pseudocereal. Molecular Nutrition & Food Research 52: 708–717. Pertea, M., D. Kim, G. M. Pertea, J. T. Leek, and S. L. Salzberg. 2016. Transcriptlevel expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols 11: 1650–1667. Pritchard, J. K., M. Stephens, and P. Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155: 945–959. Rastrelli, L., F. DeSimone, O. Schettino, and A. Dini. 1996. Constituents of Chenopodium pallidicaule (Canihua) seeds: Isolation and characterization http://www.wileyonlinelibrary.com/journal/AppsPlantSci Mangelson et al.—The cañahua genome • 12 of 12 of new triterpene saponins. Journal of Agricultural and Food Chemistry 44: 3528–3533. Repo-Carrasco, R., C. Espinoza, and S. E. Jacobsen. 2003. Nutritional value and use of the Andean crops quinoa (Chenopodium quinoa) and kaniwa (Chenopodium pallidicaule). Food Reviews International 19: 179–189. Repo-Carrasco-Valencia, R., J. K. Hellstrom, J. M. Pihlava, and P. H. Mattila. 2010. Flavonoids and other phenolic compounds in Andean indigenous grains: Quinoa (Chenopodium quinoa), kañiwa (Chenopodium pallidicaule) and kiwicha (Amaranthus caudatus). Food Chemistry 120: 128–133. Richards, S. 2018. Full disclosure: Genome assembly is still hard. PLoS Biology 16: e2005894. Ruas, P. M., A. Bonifacio, C. F. Ruas, D. J. Fairbanks, and W. R. Andersen. 1999. Genetic relationship among 19 accessions of six species of Chenopodium L., by Random Amplified Polymorphic DNA fragments (RAPD). Euphytica 105: 25–32. Simao, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva, and E. M. Zdobnov. 2015. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31: 3210–3212. Smit, A. F. A., R. Hubley, and P. Green. 2013–2015. RepeatMasker Open-4.0. Website http://www.repeatmasker.org [accessed 10 August 2018]. Štorchová, H., J. Drabesova, D. Chab, J. Kolar, and E. N. Jellen. 2015. The introns in FLOWERING LOCUS T-LIKE (FTL) genes are useful markers for tracking paternity in tetraploid Chenopodium quinoa Willd. Genetic Resources and Crop Evolution 62: 913–925. Tillich, M., P. Lehwark, T. Pellizzer, E. S. Ulbricht-Jones, A. Fischer, R. Bock, and S. Greiner. 2017. GeSeq: Versatile and accurate annotation of organelle genomes. Nucleic Acids Research 45: W6–W11. Vargas, A., D. B. Elzinga, J. A. Rojas-Beltran, A. Bonifacio, B. Geary, M. R. Stevens, E. N. Jellen, and P. J. Maughan. 2011. Development and use of microsatellite markers for genetic diversity analysis of cañahua (Chenopodium pallidicaule Aellen). Genetic Resources and Crop Evolution 58: 727–739. Walker, B. J., T. Abeel, T. Shea, M. Priest, A. Abouelliel, S. Sakthikumar, C. A. Cuomo, et al. 2014. Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9: e112963. Wang, Y. P., H. B. Tang, J. D. DeBarry, X. Tan, J. P. Li, X. Y. Wang, T. H. Lee, et al. 2012. MCScanX: A toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research 40: e49. Wilson, H. D. 1990. Quinua and relatives (Chenopodium sect. Chenopodium subsect. Celluloid). Economic Botany 44: 92–110. https://doi.org/10.1007/ BF02860478. Xu, Y. Q., C. W. Bi, G. X. Wu, S. Y. Wei, X. G. Dai, T. M. Yin, and N. Ye. 2016. VGSC: A web-based vector graph toolkit of genome synteny and collinearity. BioMed Research International 2016: 7823429. © 2019 Mangelson et al.

Log In

The genome of Chenopodium pallidicaule : An emerging Andean super grain

The genome of Chenopodium pallidicaule : An emerging Andean super grain