We are searching data for your request:
Upon completion, a link will appear to access the found materials.
My question concerns the incorporation of individual sequence reads into chromosomes during gene sequencing projects, especially those with larger genomes such as Drosophila melanogaster or Homo sapiens.
There are some sequences or short contigs that it has not yet been possible to incorporate into the larger chromosome assemblies.
What are the reasons for this failure?
I'm going to re-interpret this question slightly to make it easier to answer:
"Why do genome assemblies frequently consist of large numbers of short contigs rather than a relatively small number of long chromosomes (or full replicons of other types)? And how could I make my assembly better?"
De novo sequence assemblies (usually) consist of a set of contiguous sequence fragments ("contigs") that have been derived from shotgun sequencing by some heuristic. If sequence assembly did a perfect job, we would expect that these contigs would be full chromosomes. Instead, what we observe is lots of sequence fragments with uncertain relationships to one another. In this common case, we think that some of the contigs could be stitched together ("localized", "anchored", "scaffolded") meaningfully into chromosomes, but we aren't sure how.
For de novo sequence assembly, there is no ground truth to compare to. So all of the information for generating assemblies comes from the starting sequences (which are usually "reads", small shreds of sequence that come off of a sequencing instrument). The heuristics that generate contigs from reads consist of finding overlaps between different starting reads and trying to extend them into the largest possible partially overlapping groups of reads, using some assumptions about rates of error in the sequence of the reads (and other variables). See wikipedia or this paper (figure inset) for a little more information on this.
It turns out that there is a body of statistical theory that governs how large these overlaps will be, and therefore how big/complete your contigs will be. This body of theory says that the size and number of contigs is predictable to some degree from the size of the genome you're trying to assemble, the length of the reads, and the number of reads you have.
Some genomes, like maize, wheat, or the human genome, are also harder to assemble because they are "repetitive". This means that there are lots of very similar sequences in different places in the genome. So when you take a read from one of these places, you are uncertain which one of these very similar regions it comes from. These repetitive regions are hard to resolve unless you have very long reads or some parallel information, so they also lead to gaps in sequence. Such regions will also tend to be "collapsed", meaning that you only have one contig representing a repeat that should actually be present as several contigs with different locations in the assembly.
Sources of parallel information that are helpful are proximity ligation (methods such as Hi-C), genetic maps from crosses, or optical mapping. These orthogonal sources of information express relationships between contigs in terms of their closeness on a chromosome. Here is a paper with more information on these topics. Recently, long read technologies such as PacBio or nanopore sequencing are starting to read through large repeats, but they are still usually not reliable enough to fully assemble even smallish genomes "telomere-to-telomere".
So, if there is a recipe for localizing/anchoring/scaffolding fragmented assemblies, here is roughly what can be done:
- get more reads
- get longer reads
- get orthogonal information about contig placement (e.g. proximity ligation, genetic maps, a closely related assembly, etc.)
Hope that helps.
Assembly and Analysis of Unmapped Genome Sequence Reads Reveal Novel Sequence and Variation in Dogs
Dogs are excellent animal models for human disease. They have extensive veterinary histories, pedigrees, and a unique genetic system due to breeding practices. Despite these advantages, one factor limiting their usefulness is the canine genome reference (CGR) which was assembled using a single purebred Boxer. Although a common practice, this results in many high-quality reads remaining unmapped. To address this whole-genome sequence data from three breeds, Border Collie (n = 26), Bearded Collie (n = 7), and Entlebucher Sennenhund (n = 8), were analyzed to identify novel, non-CGR genomic contigs using the previously validated pseudo-de novo assembly pipeline. We identified 256,957 novel contigs and paired-end relationships together with BLAT scores provided 126,555 (49%) high-quality contigs with genomic coordinates containing 4.6 Mb of novel sequence absent from the CGR. These contigs close 12,503 known gaps, including 2.4 Mb containing partially missing sequences for 11.5% of Ensembl, 16.4% of RefSeq and 12.2% of canFam3.1+ CGR annotated genes and 1,748 unmapped contigs containing 2,366 novel gene variants. Examples for six disease-associated genes (SCARF2, RD3, COL9A3, FAM161A, RASGRP1 and DLX6) containing gaps or alternate splice variants missing from the CGR are also presented. These findings from non-reference breeds support the need for improvement of the current Boxer-only CGR to avoid missing important biological information. The inclusion of the missing gene sequences into the CGR will facilitate identification of putative disease mutations across diverse breeds and phenotypes.
Hevea brasiliensis (Willd.) Müll.-Arg. (the Para rubber tree) is a deciduous perennial tree crop indigenous to the rain forests of the Amazonian basin in South America. It is a monoecious, outcrossing species belonging to the family Euphorbiaceae and has a chromosome number of 2n = 2x = 36. Rubber tree produces high-quality isoprenoid polymers (cis-1,4-polyisoprene) with unique physical properties such as elasticity, resilience and efficient heat dispersion that are unsurpassed by any petroleum-based synthetic substitutes 1 . Of the 2500 latex-producing plant species, H. brasiliensis is the only species that produces commercially viable quantities of high-quality natural rubber, accounting for more than 98% of total production worldwide 2,3 . Southeast Asian countries currently dominate rubber production and trade, accounting for more than 76% of the 11.96 tons produced globally, with Thailand and Indonesia being the two largest producers and exporters of natural rubber (http://faostat3.fao.org/ Accessed September 14, 2016). Despite originating from the Amazonian basin, rubber production in South America represents merely 2% of the total production worldwide due to the devastating spread of South American leaf blight (SALB) disease caused by ascomycete Microcyclus ulei in 1930s 4 . The infestation resulted in the collapse of most rubber tree plantations and the cessation of the commercial scale rubber production in Brazil and other South American countries 5 . Even though Hevea cultivation in Southeast Asia has not been affected by SALB, other native pathogenic fungi are still major threats to rubber production in this area. A number of commercially cultivated, high-yielding clones are susceptible to abnormal leaf fall disease caused by various Phytophthora species and up to 40% loss in yield has been observed upon infection of highly susceptible clones 6 . Another widespread pathogen, Corynespora cassiicola, is a necrotrophic fungus that causes Corynespora leaf fall disease, resulting in significant losses in natural rubber yield 7 .
In the past two decades, plant breeders have used conventional approaches of recurrent selections to obtain clones with improved latex yield and resistance to fungal pathogens 8 . However, traditional breeding is inherently time-consuming owing to long selection cycle in this species. The ability to select for commercially valuable traits at an early stage will have a tremendous impact on reducing time and resources needed to develop superior clones. The advent of DNA-based genetic markers provided an alternative strategy that allowed plant breeders to accelerate the process of cultivar development. Recent advances in sequencing technology have simplified and accelerated the discovery of sequence variants, enabling the shift from anonymous markers such as microsatellites to more ubiquitous single nucleotide polymorphisms (SNPs) 9,10 . The availability of SNP markers facilitates the construction of high-density genetic linkage maps, which are instrumental for quantitative trait loci (QTL) analyses and association studies with valuable agronomic traits 11 .
Several genomic resources for H. brasiliensis have been developed recently, including transcriptome sequencing of various tissues and the sequencing of the commercially cultivated RRIM600 genome 10,12,13 . While the first draft of the genome sequence 13 provided an excellent source of genomic information, the assembly is highly fragmented, containing over a million contigs with an N50 length of 2.9 kb. Assembly of a complex genome is challenging in part owing to the presence of highly repetitive DNA sequences, which introduce ambiguity during the genome reconstruction. It has been estimated that repetitive DNA accounts for 78% of the rubber tree genome 13 , posing a major difficulty in the de novo assembly particularly when it is built exclusively from short-read data.
Although it is possible to construct a high-quality assembly for plant species with large genomes through hierarchical shotgun sequencing, the utilization of a sequencing platform capable of generating long reads that can span regions of complex repeats is likely to be more cost-effective and less labour-intensive/time-consuming. Recent development and applications of long-read sequencing technologies have shown substantial improvement in genome assemblies, primarily by joining contigs and scaffolds and spanning gaps around repetitive regions 14,15 . The Pacific Biosciences (PacBio) sequencing technology offers kilobase-sized reads without GC-bias or systematic errors however, the raw data generated from this system is inherently error-prone, with an average accuracy of
82% 16 . Availability of hybrid assembly software allows biologists to combine PacBio long-read data with complementary, high-accuracy short-read data from other platforms 17 . Nowak et al. 18 demonstrated that the incorporation of additional 8 × coverage of PacBio data to the existing short-read-only assembly of the Primula veris genome resulted in 21% of the gaps being closed and 38% of the ambiguous positions in the gaps being filled, in addition to the 40% improvement in N50 contig size 18 . Similarly, a set of 16.5 × PacBio long reads was used to fill in 68% of the gaps existed in the Illumina-only assembly of the African cichlid 14 .
Recently, high-throughput short-read sequencing has been employed to explore DNA linkage up to several hundred kilobases in proximity ligation libraries constructed from in vitro reconstituted chromatin 19 . The proprietary in vitro chromatin conformation capture-based long-range genome assembly method called “Chicago” developed by Dovetail Genomics has been used in combination with a standard whole genome sequencing approach to generate a de novo human genome assembly with comparable accuracy and contiguity to the assemblies built using more expensive methods. The Chicago long-range scaffolding technique has also been utilized to improve the existing genome assemblies in the American alligator 19 , African clawed frog 20 and cassava 21 .
Moving beyond fragmented genome assembly requires a high-quality and high-density linkage map that can help order and orient de novo assembled scaffolds into pseudo-chromosome scale sequences 22 . Complementary information in a de novo shotgun assembly and a genetic map can be integrated to generate a high-quality reference sequence. High-density linkage maps have successfully been employed to anchor scaffolds from whole genome shotgun assemblies in plant species ranging from a diploid common bean 23 to a hexaploid bread wheat 24 . Integration of a genetic map and a genome assembly allows us to investigate ancient and recent whole genome duplications and perform comparative genomics to study genome architecture and genome evolution across species 25 .
Here, we employed complementary technologies to generate deep-coverage 454/Illumina short reads and medium-coverage PacBio long-read data for a de novo hybrid assembly of the H. brasiliensis genome from BPM24 (Bank Pertanian Malaysia 24) clone. We subsequently applied a long-range Chicago assembly technique 19 to scaffold our preliminary assembly to obtain 1.26 Gb of assembled rubber tree sequences. BPM24 accession exhibits a high degree of resistance to two fungal pathogens, Phytophthora and Corynespora, commonly found in Southeast Asia 26 , and it is currently being exploited as genetic sources of fungal resistance in rubber tree breeding programs in Thailand. The availability of the high-density consensus linkage map constructed from two populations derived from BPM24 11 allowed us to anchor and orient a large number of scaffolds in this new assembly. These genetically anchored sequences provided the first concrete evidence to demonstrate the presence of paleotetraploidy in rubber tree and enabled us to perform comparative analyses among the Euphorbiaceae.
II. Complexity of plant genomes
Early targets for genome sequencing, apart from the human genome, were genomes that were relatively small and, therefore, easier and less expensive to sequence, such as Arabidopsis– a five-chromosome, c. 120 Mbp haploid genome plant ( Meinke et al., 1998 ). In fact, Arabidopsis is much smaller than most plant genomes, by orders of magnitude. The average plant genome is > 6000 Mbp per haploid genome for angiosperms ( Gregory et al., 2007 ), approximately twice the size of the human genome. Many economically important plant genomes are even larger. Wheat, for instance, is c. 15 Gbp per haploid genome and pine has at least a 26 Gbp genome ( Valkonen et al., 1993 ). Genome size is one contributor to plant genome complexity other contributors include polyploidy and repetitive DNA sequences, and, in particular, transposable elements. Together, these attributes of plant genomes increase the cost of sequencing and negatively impact the quality of the resulting sequence, especially as the field migrates from map-based sequencing (to be described later) to short-read whole-genome shotgun (WGS) sequencing.
Two primary factors that contribute to plant genome size and complexity are polypoidy and repetitive DNA sequences (crop examplars shown in Fig. 1). Polyploidy is the accumulation of additional sets of chromosomes through either autopolyploidy, doubling of the same genome, or allopolyploidy, two diverged genomes in the same nucleus. Increased chromosome number and DNA content are immediate consequences of polyploidy, but depending on when the polyploidy event occurred, increased chromosome number may not be immediately apparent as ancient polyploidy events are likely to be shared by sister taxa and/or diploidization of chromosome number may have occurred (reduction of chromosome number via loss and rearrangements). Most, if not all, land plants have undergone polyploidy events at various times in their evolution (reviewed in Soltis et al., 2004 ). For example, soybean (Glycine max) has undergone at least three polypoid events that, as a consequence of having a high-quality genome sequence ( Schmutz et al., 2010 ), can now be examined. The first, and most difficult, event, to detect was one early in plant evolution shared by many land plants ( Bowers et al., 2003 ). The second event was c. 45–55 million yr ago (Mya) and should be shared with legumes that diverged after that event, such as Medicago ( Cannon et al., 2006 ). The most recent event, c. 5 Ma, was most likely an allopolyploid event ( Gill et al., 2009 ) that was coincident with the emergence of the Glycine genus ( Innes et al., 2008 ). Thus, the 1.1 Gbp soybean genome has relics of at least three polyploidy events that resulted in a genome that is a mosaic of duplicated segments ( Schmutz et al., 2010 ). In the Glycine genus, however, there is an even more recent allopolyploid event that occurred in perennial species found in Australia ( Doyle et al., 2002 ). Thus, polyploidy is a recurrent process that molds and shapes plant genomes during evolution.
Diagram of the evolutionary relationship of several major crop species showing polyploidy events (red octagons), relative genome sizes to rice (size of circles), and percentage of transposable elements (color of circle). The estimated time of divergence and polyploid events are shown based on the literature ( Gaut & Doebley, 1997 Huang et al., 2002 Blanc & Wolfe, 2004 Paterson et al., 2004, 2009 Schlueter et al., 2004 Swigonova et al., 2004 IRGSP, 2005 Jaillon et al., 2007 Zhu et al., 2008 Schnable et al., 2009 Wicker et al., 2009 Abrouk et al., 2010 Choulet et al., 2010 Schmutz et al., 2010 ). Mya, million years ago.
In addition to polyploidy, repetitive DNA sequences and, in particular, transposable elements (TEs) compose large fractions of most plant genomes and are impediments to efficient genome sequencing. TEs have been reviewed in depth ( Bennetzen et al., 2005 ) here we will focus only on contribution to genome obesity in plants and organization in plant genomes as it contributes to obtaining accurate genome sequences. There are several instances of rapid amplification of a few TE families that have resulted in increased genome size. Oryza australiensis, for example, is approximately twice the size of its nearest relatives as a result of the amplification of three TE families ( Piegu et al., 2006 ). Maize is the most prominent example of genome obesity resulting from TE amplification ( SanMiguel et al., 1996 ).
The complicating factor of TE amplification on genome sequencing is not primarily the increase in the amount of DNA to sequence, but rather the effect of many copies of the same sequence throughout the genome that make mapping and assembly difficult. If a TE family amplified recently, it can have thousands of copies scattered throughout the genome, all with very high sequence identity. If a genome is sequenced via a shotgun approach, as most genomes are nowadays, then these highly similar TEs will complicate assembly unless there are sufficient mate-pair reads that span the repeats (Fig. 2).
Diagram of whole genome shotgun (WGS) sequencing. The chromosome segment is shown across the bottom in black with transposable elements (TEs) in orange and short-read sequences above. Short reads are idealized in both coverage and arrangement. (a) Assembly of short reads results in three unordered sequence contigs, as reads from TEs cannot be placed unambiguously. (b) Using a variety of larger pieces of DNA (2–4, 6–10 and 20 Kbp shown) with mate pair sequences, the TEs can be spanned and ordered contigs result that can then be placed in a chromosomal context. , repetitive DNA sequences, > 98% sequence identity , shotgun sequences.
Barcode swapping observed on the HiSeq 4000 was probably a result of the new design using patterned flowcells, and therefore probably should be expected to also occur on the HiSeq 3000 and HiSeq X Ten and the flagship NovaSeq machine. The use of patterned flowcells necessitates the need for a new type of sequencing chemistry, called exclusion amplification (ExAmp), replacing the cluster generation by bridge amplification used on the conventional non-patterned flow cell.
Technical details describing the new sequencing chemistry have not been made publicly available, but examination of the associated patents made it possible to understand the overall process and it is reasonable to assume therefore that index hopping occurs before cluster generation, since all the reagents needed for cluster generation are present in that reaction mix. Free index primers in that mix have the potential to prime library fragments and be extended by DNA polymerase. These molecules, incorrectly assigned barcodes, are free to generate DNA clusters. This is not the case for the conventional flowcell design in which free barcodes are washed away after the DNA is hybridised to the flowcell. Only after this step is DNA polymerase added and sequence extension can initiate.
It is not clear at present how to implement a general purpose bioinformatics solution to remove such artefacts post-sequencing, and indeed experimental solutions may only be possible. To combat the problem, when working with a small number of samples, researchers should stop employing a single-barcode strategy, but instead should double-barcode samples. Following sequencing, reads should then be filtered to allow only those in which both barcodes are identical and have an expected sequence. Since double-barcode swapping should be a rare event, taking this step should greatly reduce sample misclassification.
When multiplexing many samples (already requiring double-barcoding, irrespective of the barcode hopping issue), Stanford researchers advised using barcode pairs in which each individual single barcode was used only once. Although that strategy reduced the number of available barcode combinations, it still allowed many samples to be run on the same lane. In addition, they proposed purification strategies to remove free primers from the library pool.
Illumina recently published a paper confirming the occurrence of index-hopping. They stated that with the proper clean-up of primers and implementing other experimental techniques, it should be possible to minimise barcode swapping to negligible levels for most applications. The report also advised only multiplexing similar conditions in a single lanes. A brain vs liver RNA-Seq example was given in which a transcript is present in liver at a high level, but not at all in the brain. Owing to index hopping that transcript may actually appear to be expressed at a low level in the brain. Illumina suggests that by only combining similar samples (i.e. either brain or liver) in a lane this problem will be prevented. This is far from an ideal solution however, since researchers will need to know in advance what to expect with regard to the expression profiles of their samples and furthermore this strategy would compound the problem of batch effects.
Pneumocystis organisms from rodents and humans
P. murina-infected lung samples were obtained from CD40 ligand knock-out female mice 26 . P. carinii-infected lung samples were obtained from immunosuppressed Sprague-Dawley male rats 24 . P. jirovecii-infected autopsy lung samples were from one patient with AIDS, who was infected with only one P. jirovecii strain based on genotyping at five different genomic loci 53 . Animal and human subject experimentation guidelines of the National Institutes of Health were followed in the conduct of these studies.
Preparation of genomic DNA samples
Pneumocystis-infected lung tissues were cut into small pieces, homogenized in Qiagen Tissuelyser (5 times for 20 s at 1/30 frequency), then centrifuged at 15,000g for 6 min. The pellet was resuspended in Trypsin/EDTA Solution (Lonza), incubated at 37 °C for 30 min and centrifuged at 15,000g for 6 min. The pellet was washed once in PBS, resuspended in lysis buffer containing 2.9% (w/v) collagenase type I (Gibco) and 0.1% (w/v) DNase I (Sigma), incubated at 37 °C for 30 min, and centrifuged at 15,000g for 6 min. This sequential enzyme digestion was expected to remove most of the host cells and DNA as well as potentially some of Pneumocystis trophic forms. The pellet was washed three times in PBS, resuspended in 100 μl of 0.5% (w/v) Zymolyase solution (Zymo Research) and incubated at 37 °C for 1 h. Genomic DNA was extracted using the MasterPure Yeast DNA Purification Kit (Epicentre). RNA was removed using DNase-free RNAase (Epicentre).
All DNA samples were analysed by quantitative real-time PCR (qPCR) assays using fluorescence resonance energy transfer probes 54 to measure the fraction of Pneumocystis DNA compared with host DNA in each sample. In qPCR for both P. murina 20 and P. carinii 54 , the target was the single-copy dihydrofolate reductase (dhfr) gene of Pneumocystis and the target for the host was a highly conserved region of the single-copy polycystic kidney disease 1 (pkd1) gene 22 . The targets for P. jirovecii and human qPCR were the msg gene family 55 and the single-copy β-globin gene 56 , respectively. The purity of each DNA sample was estimated by the ratio of Pneumocystis to host genome copy number and based on an estimated genome size of 8 Mb for Pneumocystis and 2.7–3.3 Gb for the host. The qPCR results were validated by 454 or Illumina MiSeq sequencing of selected samples. DNA samples extracted from enriched P. murina or P. carinii preparations contained up to 90% Pneumocystis DNA, and from enriched P. jirovecii preparations, 10–25% P. jirovecii DNA, while those extracted directly from heavily infected lung tissues contained <0.4–1% Pneumocystis DNA.
P. murina genome sequencing and assembly
To assess the quality of the P. murina DNA preparations for whole-genome sequencing, small-scale sequencing was first conducted using a 454 GS FLX Titanium Sequencer (Roche Applied Science) at Leidos, Inc. (Frederick, MD) according to the 454 standard shotgun sequencing protocols. A total of 4 million reads (with a mean length of 344 bases) were generated, with ∼ 40% of them having blast hits with P. carinii 7 . Subsequently, deep sequencing was performed using Illumina HiSeq technology at the Broad Institute (Cambridge, MA). Seven different P. murina DNA preparations (each with 60–90% purity for P. murina by qPCR) were used to construct small insert libraries each was sequenced, and the library (preparation B123 from a single mouse, center project G11228) with the lowest percent of contaminating host mouse DNA was used for assembly. From this library, a total of 34 million 101 base paired-end reads with a mean insert size of 153 bases were generated. A second larger insert library (preparation C2 from another mouse, center project G11230) with a mean insert size of 1,247 bases was prepared, and a total of 83 million 101 base paired-end reads were generated. After removal of mouse sequences (version mm9) and P. murina mitochondrial sequences 22 , these reads were assembled with Allpaths (version R37380) with default parameters 57 the resulting assembly was screened for contaminating sequences (mouse and bacteria) and to remove P. murina mitochondrial sequences 22 by aligning to the non-redundant nucleotide database from NCBI and by removing the scaffolds with high coverage that matched non-fungal organisms. The draft assembly of 24 scaffolds was further improved using long 454 reads and PCR to support scaffold joins. The final assembly contained 17 scaffolds. All internal contig gaps were closed by PCR and Sanger sequencing. The ends of scaffolds without msg genes or telomere repeats were extended by PCR and/or PacBio sequencing (Supplementary Methods).
P. carinii genome sequencing and assembly
After preliminary 454 sequencing demonstrated a purity of 63%, 1 P. carinii DNA preparation (no. B80 from 2 heavily infected rats, 80% purity for P. carinii by qPCR) was used to construct 2 libraries with a mean insert size of 180 bases and 793 bases, respectively. Each was sequenced using 101 base paired-end reads on the Illumina MiSeq platform at the Broad Institute, and, after removal of host rat sequences and P. carinii mitochondrial sequences 22 , assembled using AllPaths-LG (version R47825) with default parameters, resulting in a total of 53 scaffolds. After joining scaffolds using long 454 reads and confirming by PCR, the final genome assembly contained 17 scaffolds. All internal contig gaps were closed by PCR and Sanger sequencing. The ends of six scaffolds overlapped with the seven telomere sequences reported elsewhere 58 they were merged and confirmed by PCR and/or aligning merged scaffolds with Illumina and 454 raw reads. The ends of several scaffolds without msg genes, kexin genes or telomere repeats were extended by PCR and/or PacBio sequencing (Supplementary Methods).
P. jirovecii genome sequencing and assembly
Three enriched DNA samples (RU7, RU12 and RU817) from a single autopsy lung sample RU (1.4–2 μg each, with 10–25% P. jirovecii DNA by qPCR) were used to construct three libraries (156 base average insert size) each was sequenced on the Illumina HiSeq platform at the Broad Institute. To improve on initial assemblies with high numbers of contigs, additional sequence was generated from hybrid-selected DNA samples as previously described 59 120 base oligo bait probes were designed to target regions present in the genome assembly as well as to pull in uncovered regions and transcripts probes were designed for: existing assemblies 8 including higher probe density at contig ends, unassembled reads 8 that did not match the human genome, assembled RNA-Seq transcripts 8 that did not match existing assemblies or the human genome and unanchored msg sequences 25 . Host sequences were removed by aligning to human genome assembly 19 (GRCh37/hg19), and P. jirovecii mitochondrial sequences 22 were removed. The remaining 101 base paired-end reads were assembled separately with Spades 2.5.1 (at the Broad Institute), Spades 3.0 (at Leidos) and Abyss (Biowulf at NIH), which resulted in 400, 149 and 312 contigs, respectively. All these contigs were merged with Sequencher (Gene Codes Co., Ann Arbour, MI), resulting in a total of 20 scaffolds. All internal gaps were closed by PCR and Sanger sequencing. Scaffolds were validated by examining raw reads aligned by Seqman Pro (DNAStar, Madison, WI) to ensure there were no false joins. To improve the coverage of msg genes in the assembly, PacBio sequencing was performed as described in Supplementary Methods.
Chromosome electrophoresis and Southern hybridization
To compare the P. murina assembly with chromosome number and length, we used contour clamped homogeneous electrical field (CHEF) electrophoresis to separate chromosomes. P. murina organisms were partially purified from fresh mouse lungs by Ficoll-Hypaque density gradient centrifugation 60 , and then processed for CHEF 61 using the CHEF Yeast Genomic DNA Plug Kit (Bio-Rad). Briefly, partially purified P. murina organisms were embedded in 0.8% CleanCut agarose, treated with 24 U ml −1 proteinase K overnight at 50 °C, then washed four times in 1 × Wash Buffer and stored at 4 °C. Electrophoresis was performed in 1% agarose gels (14 × 21 cm) using the CHEF DR II apparatus (Bio-Rad). Gels were run in 0.5 × TBE buffer for 144 h at 135 V and 12.5 °C, with a 50 s initial pulse that was gradually increased to 100 s. The gel was stained with ethidium bromide (Sigma), then DNA was transferred to Nytran membranes (Schleicher & Schuell) under neutral conditions 61 . We used two blots prepared from the same DNA plug, and each blot was hybridized consecutively to different probes, with stripping of the blot between hybridizations. All probes were PCR-amplified double-stranded DNA fragments labelled using the PCR DIG Probe Synthesis Kit (Roche Applied Science) except for the probe for telomere repeats, which was a synthesized oligonucleotide labelled using the DIG Oligonucleotide Tailing kit (Roche Applied Science). All primer and probe sequences are provided in Supplementary Data 27. Hybridization and signal detection were performed by using the DIG Probe Hybridization system (Roche Applied Science) as described previously 22 .
RNA-Seq, expression levels and gene prediction
Three independent samples of P. murina and P. carinii organisms were partially purified from heavily infected lungs of three mice and rats, respectively, by Ficoll-Hypaque density gradient centrifugation 60 . The partially purified pellets were expected to contain both cyst and trophic forms. Total RNA was isolated using the RNeasy Mini Kit (Qiagen). The ratio of Pneumocystis to host RNA was estimated to be from 1:4 to 8:1 based on the density of RNA bands in agarose gels for the host 28S rRNA and Pneumocystis 26S rRNA. Strand-specific libraries were constructed using the dUTP second strand marking method 62,63 , except as noted below. For all samples, poly(A) RNA was purified using the Dynabeads mRNA Purification kit (Life Technologies), with two rounds of selection (with bead regeneration), resulting in <5% rRNA as measured by the Bioanalyzer RNA 6000 Pico chip (Agilent) program. The resulting mRNA (200 ng per sample) was treated with the Turbo DNA-free kit (Ambion), checked by qPCR for the absence of detectable genomic DNA (Supplementary Table 5), and fragmented in 1 × RNA fragmentation buffer (Affymetrix) for 4 min at 80 °C. After first-strand complementary DNA (cDNA) synthesis a 2.0 × volume of RNAClean SPRI beads (Beckman Coulter Genomics) clean-up was used instead of phenol/chloroform/isoamyl alcohol (25:24:1) extraction and ethanol precipitation. Size selection after adapter ligation was done with two 0.7 × Ampure beads (Beckman Coulter Genomics) clean-up steps and eight cycles of PCR were used to generate the library. Libraries were sequenced on the HiSeq2000 platform generating an average of 26.1 million paired 76 base reads for each of the 6 samples.
To examine expression levels, RNA-Seq reads from each of the three samples for each organism were aligned to the extracted coding DNA sequences (CDSs) using bowtie 65 . The alignment bam files were then used to quantify transcript abundances by RSEM 66 . We selected the top 5% of expression levels to examine functional enrichment by GSEA 45 . For each organism, we examined GSEA of gene sets classified based on Pfam, TIGRFAM, KEGG and SignalP functional annotations. We also ran GSEA on some manually curated gene sets such as msg genes.
The relative expression level for each gene was expressed as fragments per kb of exon per million fragments mapped (FPKM). For both P. murina and P. carinii, 99% of all annotated genes were expressed (FPKM>2), with a median coverage of 151 FPKM in both species. Further, expressed genes showed good alignment depth across the transcript, with 98% of all genes in each species having at least fivefold RNA-Seq depth. All genes without detectable expression (26 in P. murina and 23 in P. carinii) contain short (average size of ∼ 250 bp, potentially pseudogenes) or incomplete CDSs.
The P. murina and P. carinii genomes were annotated using a combination of expression data (RNA-Seq), homology information and ab initio gene finding methods (Supplementary Methods) as previously described 67 . These methods were also used to annotate the RU assembly of P. jirovecii, with the exception that RNA-Seq data was not used (Supplementary Methods). Gene sets were compared based on functional annotation and used as the basis for syntenic and phylogenetic analysis (Supplementary Methods).
Analysis of strain variation in P. jirovecii
To examine genome-wide single nucleotide polymorphisms (SNPs) between sequenced isolates, the RU7 assembly generated in this study was used as the reference compared with the SE8 raw reads 8 . SE8 reads were retrieved from NCBI (ERR135854 to ERR135863) and converted to fastq format. Individual fastq files were concatenated, and the full read set aligned to the RU7 assembly using BWA-MEM 68 and converted to sorted bam using SAMtools 64 . This resulted in very high alignment depth across all assembly bases, with a median alignment depth of 1,046.
The Genome Analysis Toolkit 69 (GATK) v2.7–4-g6f46d11 was used to call both variant and reference bases from the alignments. Briefly, the Picard tools (http://picard.sourceforge.net) AddOrReplaceReadGroups, MarkDuplicates, CreateSequenceDictionary and ReorderSamwere were used to preprocess the alignments, followed by GATK RealignerTargetCreator and IndelRealigner for resolving misaligned reads close to indels. Next, GATK UnifiedGenotyper (with the haploid genotype likelihood model (GLM)) was run with both SNP and INDEL GLM. We additionally ran BaseRecalibrator and PrintReads for base quality score recalibration on sites called using GLM SNP and re-called variants with UnifiedGenotyper emitting all sites. A final filtering step was used to remove any position that was called by both GLMs (that is, incompatible indels and SNPs) and to remove positions with low read depth support (<10). The 24,902 SNPs were then mapped to the gene set predicted on the SE8 assembly a total of 14,135 SNPs are located in CDS regions, with substitutions affecting protein coding regions as follows: 7,410 nonsynonymous, 6,690 synonymous, 26 nonsense and 9 extensions. SNP density across the SE8 assembly was calculated for 5-kb windows.
Detection of chitin by fluorescence labelling
The cDNA sequence encoding the CBD of Bacillus circulans chitinase A1 gene 70 (2,183–2,338 bp, GenBank accession code M57601.1) was optimized for bacterial expression (GenScript USA Inc.) and modified by the addition of ATG and 3 HA tags. The modified sequence was synthesized (GenScript USA Inc., Supplementary Fig. 15) and cloned into pET-28 vector (EMD Biosciences). The cDNA construct was transformed into Escherichia coli strain BL21 (DE3) RIL (Agilent technologies) and expressed as a His tag fusion protein. The expressed protein was purified in two steps, first using anti-CBD antibody (New England Biolabs) that was immobilized using AminoLink Plus Immobilization Kit (Thermo Scientific), and then the Hispur cobalt purification kit (Thermo Scientific). The purified protein was biotinylated using Lightning-Link Biotin conjugation kit (Type A) (Innova Biosciences Ltd) and used for fluorescence labelling (Histoserv, Inc., Germantown, MD) of P. murina-infected lung tissue sections fixed in Histochoice (Amresco, Inc.). Alexafluor 488-conjugated streptavidin was used for the detection of CBD. Pneumocystis organisms were detected by an anti-Msg antibody 26 and cysts by a dectin-Fc recombinant protein (kindly provided by Dr Chad Steele, the University of Alabama at Birmingham, Alabama) 71 . Mouse kidney sections infected with Candida (kindly provided by Dr Michail Lionakis, the National Institute of Allergy and Infectious Diseases, Bethesda, Maryland) 72 were used as a positive control for chitin staining (Fig. 5).
Chitin glycosyl linkage analysis
To prepare cell wall fraction, P. carinii organisms were partially purified from infected rat lungs by Ficoll-Hypaque density gradient centrifugation 60 S. cerevisiae organisms were obtained from Stratagene (strain YPH499) and grown in standard yeast culture. Cells were resuspended in denaturing buffer (2% SDS, 1 mM EDTA in 50 mM Tris pH 8.0) and heated at 100 °C for 10 min. The cell wall pellet was recovered by centrifugation at 5,000g for 5 min., washed twice with PBS, and dried in a speed vacuum concentrator.
For chitin analysis, cell wall pellets were incubated with 0.1% (w/v) chitinase (Sigma) in 50 mM sodium acetate buffer, pH 5.6, at 37 °C for 72 h. Insoluble materials after the initial digestions were further treated sequentially with pronase (Roche), lyticase (Sigma) and chitinase to look for residual chitin. Protein digestion was performed using 0.01% (w/v) pronase in Tris-HCl buffer pH 8.2 with 10 mM CaCl2 at 37 °C for 48 h. Glucan digestion was performed with 0.04% (w/v) lyticase in 50 mM sodium phosphate buffer, pH 7.5 at 37 °C for 24 h. Chitinase digestion was performed as above. After each digestion, the enzyme was inactivated by heating at 100 °C for 5 min.
Digested cell wall components were analysed by matrix-assisted laser-desorption ionization time-of-flight MS (MALDI/TOF-MS) and nanospray ionization MS (NSI-MSn). An aliquot of supernatant after each enzyme digestion was analysed by MS to determine oligosaccharide content. Prior to MS analysis, enzymes were removed by passing through a C18 Sep-PAK cartridge. The flow-through was collected in 5% acetic acid and lyophilized. The dried samples were permethylated using previously published methods 73 and profiled by MS. MALDI/TOF-MS was performed in the reflector positive ion mode using DHBA (α-dihyroxybenzoic acid, 20 mg ml −1 solution in 50% methanol/water) as a matrix. The spectrum was obtained by using an AB SCIEX TOF/TOF 5800 (AB SCIEX). NSI-MSn analysis was performed on an LTQ Orbitrap XL mass spectrometer (Thermo Fisher) equipped with a nanospray ion source. Permethylated sample was dissolved in 1 mM NaOH in 50% methanol and infused directly into the instrument at a constant flow rate of 0.5 μl min −1 . A full Fourier transform MS spectrum was collected at 30,000 resolution. The capillary temperature was set at 210 °C and MS analysis was performed in the positive ion mode. For total ion mapping (automated MS/MS analysis), an m/z range of 200–2,000 was scanned with ion trap MS mode in successive 2.8 mass unit windows that overlapped the preceding window by 2 MU .
For determination of glycosyl linkages, partially methylated alditol acetates (PMAA) were prepared from fully permethylated glycans. The permethylated glycans were hydrolyzed with HCl/water/acetic acid (0.5:1.5:8, by vol.) at 80 °C for 18 h, followed by reduction with 1% NaBD4 in 30 mM NaOH overnight, then acetylation with acetic anhydride/pyridine (1:1, v/v) at 100 °C for 15 min and analysis by gas chromatograph-MS (GC–MS) on an Agilent 7890A GC interfaced to a 5975C MSD (mass selective detector, electron impact ionization mode). The separation of the PMAA was performed on a 30-m SP2331 fused silica capillary column (Supelco) for the neutral sugar derivatives, and a DB-1 (Agilent) column for amino sugar derivatives.
Msg protein glycosylation identification
To purify P. carinii Msg proteins, partially purified P. carinii organisms 60 were resuspended in 2% SDS in 50 mM Tris-HCl buffer, pH 8.0, boiled for 10 min and centrifuged. The supernatant was collected. The extraction procedure was repeated two more times, the supernatants were pooled and SDS was removed using the SDS-Out SDS Precipitation Kit (Thermo Scientific). Solubilized Msg was affinity purified using an anti-Msg monoclonal antibody (RA-E7, gift of Drs Peter Walzer and Michael Linke 74 , the University of Cincinnati, Cincinnati, Ohio) immobilized on to AminoLink plus column (AminoLink plus immobilization kit) following manufacturer’s instructions. The protein extract was diluted with an equal volume of Tris buffered saline and incubated with the immobilized antibody in the column by end-over-end mixing overnight at 4 °C. The column was then drained and washed with 0.1 M sodium phosphate buffer pH 6.9. The Msg was eluted with 150 mM ammonium hydroxide and fractions were neutralized with 1 M NaH2PO4. The fractions containing Msg were pooled and concentrated using Microcon-30 kDa centrifugal filter unit (Millipore) and the buffer was exchanged to 0.1 M sodium phosphate buffer pH 6.9. Aliquots were stored at −80 °C.
Purified Msg proteins were subjected to glycopeptide detection by LC-MS. Glycoproteomic analysis was carried out as previously described 75 . Briefly, the tryptic digested peptides of purified Msg proteins were separated with a Magic C18 column (15 cm × 75 μm, Bruker-Michrom, CA) and analysed with an Orbitrap Fusion (Thermo Scientific) mass spectrometer after Msg was denatured, reduced, alkylated and desalted. Gradient elution was performed with a 30 min linear gradient of 5–35% acetonitrile in 0.1% formic acid at a flow rate of 300 nl min −1 . The MS data were processed automatically by a data-dependent MS scan pipeline, which integrated three dissociation methods, including higher energy collisional dissociation (HCD), electron transfer dissociation (ETD) and collision-induced dissociation (CID). In this pipeline, the full MS spectrum was first acquired from most abundant ions between m/z 350 and m/z 1,550 with a 3-second cycle time at 120,000 resolutions in FT mode. Subsequently the acquired MS data were assessed by MS/MS with HCD-product-dependent ETD or HCD-product-dependent ETD/CID mode with most abundant ions within 3-s cycle time. For the HCD-product ion trigger MS/MS, glycan oxonium ions at m/z 204.0867 (HexNAc), m/z 138.0545 (HexNAc fragment) or m/z 366.1396 (Hexose-HexNAc) in the HCD spectra were used to trigger ETD or CID acquisition. The MS/MS were measured at a resolution of 15,000.
For the mapping of glycopeptides from P. carinii Msg digests, LC-MS data were analysed using Byonic software (Protein Metrics) and the data annotation by the software was manually validated. Byonic parameters were set to allow 3.0 p.p.m. of precursor ion mass tolerance and 3.0 p.p.m. of fragment ion tolerance with monoisotopic mass. Digested peptides were allowed with up to three missed internal cleavage sites, and the differential modifications of 57.02146 and 15.9949 Da were allowed for alkylated cysteine and oxidation of methionine, respectively. Protein database used for protein blast included the protein set of the rat genome at NCBI (version 104) and of the P. carinii genome B80 generated in this study. Glycan database used was 309 mammalian N-glycan database (Default database in the Byonic software). Released N-linked glycan analysis of Msg was performed prior to the glycopeptide mapping.
Data annotation by the software was further filtered as follows. Any protein identification with FDR >1%, log probability <4 or best Byonic score <500 was excluded. Moreover, any glycopeptide spectra annotations with Byonic score <400 were excluded. Glycopeptide identifications that remained after these strict filters were manually validated by looking at fragment ions matched to the theoretical glycopeptide fragmentation, the presence of a series of expected glycan oxonium ions and neutral loss of the glycan moiety for MS/MS-HCD data.
Identification of UCEs
We identified UCEs by screening whole genome alignments of the chicken (Gallus gallus) and Carolina anole (Anolis carolinensis) prepared by the UCSC genome bioinformatics group using a custom Python (http://www.python.org/) script to identify runs of at least 60 bases having 100% sequence identity. We then aligned each conserved region from the chicken–lizard alignments to the zebra finch (UCSC taeGut1) genome using a custom Python program and BLAST ( Altschul et al. 1997), and we stored metadata for matches having an e v a l u e ≤ 1 × 1 0 − 15 in a relational database (RDB) along with the initial screening results. We removed duplicates from the group of matches containing data from chicken, lizard, and zebra finch, and we defined the remaining set of 5599 unique sequences as UCEs. We estimated the average distance (±95% CI) between each of these UCEs using positions in the chicken genome (UCSC galgal3) because the chicken genome is currently the most complete and best assembled avian or reptile genome.
Design of Probes from UCEs
We designed target enrichment probes by selecting UCEs from the RDB, adding sequence to those UCEs shorter than 120 bp in length by selecting equal amounts of 5′ and 3′ flanking sequence from a repeat-masked chicken genome assembly, and recording the length of flanking sequence, if any, added to each. We masked all buffered UCEs containing repeat-like regions using RepeatMasker (http://www.repeatmasker.org/) prior to probe design. If UCEs were >180 bp, we tiled 120 bp probes across target regions at 2× density (i.e., probes overlapped by 60 bp). If UCEs were <180 bp total length, we selected a single probe from the center of the UCE. We used LASTZ (available at http://www.bx.psu.edu/miller_lab/) to align probes to themselves and to identify and remove duplicates arising as a result of probe design. We inserted these 5561 probes into the RBD, and we updated each probe record with additional data indicating if probes contained ambiguous (N) bases, the Tm and GC content of the probe, the number of bases added to buffer a particular UCE to probe length (120 bp), the number of masked bases within designed probes, and the types of mismatches we observed for each probe's parent UCE when BLASTing chicken-anole UCEs against zebra finch.
Alignment of Designed Probes to Ten Amniote Genomes
We aligned 5561 probes to ten amniote genomes using a Python wrapper-program around LASTZ to facilitate parallel data processing. We retained only those matches having ≥92.5% identity across ≥100 bp of the 120 bp (83%) probe sequence. We used a custom Python program to screen LASTZ matches for reciprocal and non-reciprocal duplicates, and we also excluded matches where the observed number of matches was less than the number of designed probes. For example, if we tiled two probes across a UCE locus, but LASTZ only matched a single probe to the genome sequence, we dropped the parent UCE locus from further consideration.
In Silico Target Enrichment of UCE Loci from Primates
To test our putative in vitro workflow, we aligned 5561 probes to nine primate genomes and one mouse outgroup using LASTZ. As above, we retained only those matches having ≥92.5% identity across ≥100 bp of the 120 bp (83%) probe sequence, and we ignored reciprocal and non-reciprocal duplicates to filter out potential paralogs. We also excluded UCE loci where the number of probes matching the genome was less than expected. Across this reduced set of matches and each primate taxon, we sliced the alignment location of remaining probes from the reference genomes plus 200 bp of flanking sequence upstream (5′) and downstream (3′) of the alignment location to yield a total slice of approximately 520 bp. We chose this length of flanking sequence because preliminary calculations revealed that this length was likely close to the maximum contig size we could expect using Illumina Nextera library preparation techniques. We assembled probes + flank back into the parent UCEs they represented using a custom Python program that integrated LASTZ—to match probes to their UCE—and MUSCLE ( Edgar 2004) to assemble multiple probes designed from the same parent UCE. After assembly, we refer to each UCE plus flanking sequence as a locus. For each locus, we aligned the data across primate species using a custom Python program and MUSCLE. We used a moving average across a 20-bp window to trim the ends of all alignments, ensure ends contained at least 50% sequence identity, and remove poorly aligned sequence. We allowed insertions at alignment ends as long as the insertions were present in fewer than 30% of individual taxa within a given alignment. We excluded loci that were not found in all primate species, and we dropped alignments that were missing any primate species. This resulted in a complete data matrix with no missing loci.
In Vitro Target Enrichment of UCE Loci from Birds
From the set of 5561 probes, we selected a subset of 2560 probes for synthesis where probes had <10 masked bases and <50 added bases (25 to each side). These probes represented 2386 UCEs conserved among chicken, lizard, and zebra finch. We had these probes commercially synthesized into a custom MySelect kit (Mycroarray, Inc.). We performed phenol–chloroform DNA extractions on bird tissue from vouchered museum specimens (Supplementary Table 1, available from doi:10.5061/dryad.64dv0tg1), and we prepared libraries for Illumina sequencing using Illumina Nextera library preparation kits (Epicentre Biotechnologies).
To enrich the targeted UCEs, we followed the basic workflow for solution-based target enrichment ( Gnirke et al. 2009) with several modifications for Nextera-prepared libraries. First, we used 1.8X AMPure XP in place of column clean up following the Nextera tagmentation reaction because the recommended Zymo column clean up yielded lower final DNA concentrations than AMPure. Second, we amplified tagmented libraries using reduced length, HPLC-purified PCR primers complementary to the adapters attached to each DNA fragment during tagmentation. We did not attach sequence tags to libraries at this point to reduce potential complications during enrichment introduced by longer, individually variable, adapters. We increased the number of PCR cycles during the post-tagmentation PCR to 16 or 19 to yield sufficient (∼500 ng) template for enrichment. Following library preparation, we substituted a blocking mix of 500 μM (each) oligos composed of the forward and reverse complements of the Nextera adapters for the kit-provided adapter blocking mix (#3), and we individually enriched species-specific libraries using the synthetic RNA probes described above. We incubated hybridization reactions at 65C for 24 h on a thermal cycler. Following hybridization, we enriched samples by binding hybridized DNA to streptavidin-coated beads, and we eluted DNA from streptavidin-coated beads using 50 μL of NaOH, which we neutralized with an additional 50 μL of Tris–HCl. We cleaned eluted DNA by binding to 1.8X (v/v) AMPure XP Beads, and we resuspended clean enriched DNA in 30 μL nuclease-free water. We amplified 14 μL of clean, enriched DNA in a 50 μL PCR reaction combining forward, reverse, and indexing primers with either Nextera Taq or Phusion DNA Polymerase (New England Biolabs) to add a custom set of 24 Nextera indexing adapters, robust to insertions, deletions, and substitutions to each sample. Following PCR, we cleaned reactions by binding amplicons to 1.8X (v/v) AMPure XP. We quantified enriched indexed libraries using a Bioanalyzer, and we combined libraries for sequencing at equimolar concentrations assuming all adapter-ligated fragments were at the mean length across all libraries.
We sequenced libraries using the Nextera sequencing primers and a 100 bp single-end Illumina Genome Analyzer IIx run conducted by the LSU Genomics Facility. Following sequencing, the LSU Genomics Facility demultiplexed libraries using Illumina-provided software, and we used a pipeline (https://github.com/faircloth-lab/illumiprocessor) implemented in Python to trim adapter contamination from reads using SCYTHE (https://github.com/vsbuffalo/scythe), adaptively quality-trim reads using SICKLE (https://github.com/najoshi/sickle), exclude reads containing ambiguous (N) bases, and collect metadata for each cluster of reads analyzed.
After pre-processing reads, we assembled species-specific reads into contigs using VELVET ( Zerbino and Birney 2008) via VelvetOptimiser (http://bioinformatics.net.au/software.velvetoptimiser.shtml), which we used with default parameters ( k r a n g e = 59 –79) to optimize kmer length, coverage, and cutoff for the assembly of data from each species. Velvet resolves potentially variable sites to the majority allele ( Zerbino and Birney 2008). We converted contigs output by VELVET to AMOS bank format, and we computed coverage within contigs using custom Python code to parse output from the cvgStat and analyze-read-depth programs provided within the AMOS 3.0.0 software package (http://sourceforge.net/projects/amos). We used a custom Python program integrating LASTZ (match_contigs_to_probes.py) to align assembled contigs back to their respective probe/ UCE region while removing reciprocal and non-reciprocal duplicates and probes having fewer matches than expected, as described above.
This program (match_contigs_to_probes.py) creates a relational database of matches to UCE loci by taxon, and we used a second program (get_match_counts.py) to query this database and produce a fasta file containing only those contigs built from UCEs present in every taxon. This program also has the ability to include UCE loci drawn from existing genome sequences, for the primary purpose of adding high-quality outgroup data from genome-enabled taxa. We used this feature to include UCE loci identified in Carolina anole (UCSC anoCar2) as outgroup data for the bird phylogeny. We aligned and trimmed reads as described above. We used multimodel inference and model averaging ( Burnham and Anderson 2002) of binomial-family generalized linear models ( Calcagno and de Mazancount 2010 R Core Development Team 2011) to evaluate the effect of reasonable combinations of the following parameters on enrichment of UCE loci (detected, undetected): UCE GC content, UCE length, probe TM, probe count, masked bases included within probes, bases added to buffer probes, and taxon.
Estimating Substitution Models
We used custom Python programs (run_mraic.py) wrapping a modified MrAIC 1.4.4 ( Nylander 2004) to estimate, in parallel, the most likely finite-sites substitution models for each of the alignments generated for primates (2030 loci) and birds (854 loci). We selected the appropriate substitution model for all loci using AIC ( Akaike 1974).
Bayesian Analysis of Concatenated Data
We grouped genes with the same substitution model into different partitions, we assigned an appropriate substitution model to each partition, and we concatenated partitions and analyzed these data using MrBayes 3.1 ( Huelsenbeck and Ronquist 2001). We conducted all MrBayes analyses using two independent runs (four chains each) of 5,000,000 iterations each, sampling trees every 100 iterations to yield a total of 50,000 trees. We sampled the last 25,000 trees after checking results for convergence by visualizing the log of posterior probability within and between the independent runs for each analysis, ensuring the average standard deviation of split frequencies was <0.00001, and ensuring the potential scale reduction factor for estimated parameters was approximately 1.0.
Analysis of Gene Trees and Species Trees
We estimated gene trees under maximum likelihood with PhyML 3.0 ( Guindon et al. 2010) using the most likely substitution model for each tree, which we estimated as described above. We estimated species trees from these gene trees using the STAR (Species Trees based on Average Ranks of coalescences) and STEAC (Species Trees Estimated from Average Coalescent times) methods implemented in the R package Phybase ( Liu and Yu 2010). STAR and STEAC calculate a species tree topology analytically based on average ranks or times of coalescent events in collections of gene trees ( Liu et al. 2009). STAR and STEAC perform similarly to probabilistic coalescent-based species-tree methods (e.g., BEST), which are unsuited, from a practical perspective, for the size of data sets used here. STAR also performs well when gene trees deviate from equal evolutionary rates, likely the case in the deep and taxonomically diverse phylogenies we investigated a benefit of STEAC, on the other hand, is that it provides an estimate of branch lengths, although they can be somewhat biased ( Liu et al. 2009). After generating a single species tree, we used a custom Python program on 250 nodes of a Hadoop (http://hadoop.apache.org/) cluster (Amazon Elastic Map Reduce) to perform 1000 nonparametric bootstrap replicates by resampling nucleotides within loci as well as resampling loci within the data set ( Seo 2008).
Materials and Methods
Defining Inversion Categories
In our analysis, we define three classes of inversion population frequency. Previous work in D. melanogaster has typically referred to four categories of inversion, “common cosmopolitan,” “rare cosmopolitan,” “recurrent endemic,” and “unique endemic” ( Mettler et al. 1977 Krimbas and Powell 1992). The latter half of each of these terms refers to the geographic distribution of the inversion. As long as an inversion reached high frequency in any population, it has not been strongly impacted by negative selection. We label these high-frequency inversions “common” inversions. We use “rare” to refer to inversions which were found in only single samples (with the exception of In(2R)Mal, which is present in three samples studied here). The distribution of rare inversions, while possibly containing high-fitness inversions that could eventually spread to high frequencies, are likely to primarily reflect mutational biases in their overall breakpoint distribution. To summarize, “common cosmopolitan,” “rare cosmopolitan,” and “recurrent endemic” will all fall under our label “common,” whereas we refer to “unique endemic” as “rare” inversions, similarly to the analysis in Corbett-Detig (2016).
The third class in our framework, “fixed” inversions, are inversions that have gone to fixation within one lineage during divergence of the D. melanogaster subgroup ( Ranz et al. 2007). Originally all fixed inversions occurred as unique events in a Drosophila ancestor. They subsequently spread until they reached fixation in populations ancestral to contemporary species in the melanogaster subgroup. These fixed inversions were discovered by comparing the locations of homologous sequences in the genomes of between D. melanogaster and its relatives ( Lemeunier and Ashburner 1976) and have been molecularly characterized previously ( Ranz et al. 2007). It is important to note that the vast majority of these fixed inversions occurred on the Drosophila yakuba branch and not in a direct D. melanogaster ancestor ( Krimbas and Powell 1992 Ranz et al. 2007). The reference genome of D. melanogaster should therefore generally reflect the ancestral state and the genetic background on which these inversions originated rather than a derived state evolved after fixation. Common and rare inversions annotated here occurred in contemporary D. melanogaster populations and thus in the absence of additional changes unrelated to genome structure, on a similar genetic background to that on which the D. yakuba inversions were fixed. The functional annotations used here are also based on the D. melanogaster standard arrangement, meaning these annotations should represent the genetic background of all three inversion frequency categories.
We obtained short-read data as fastq files from the Sequence Read Archive. All short-read data are described in Lack et al. (2016) and was originally produced in Pool et al. (2012), Lack et al. (2015), Mackay et al. (2012), Kao et al. (2015), and Grenier et al. (2015). We aligned the short-read data using bwa v0.7.15 using the “mem” function and default parameters ( Li 2013). All postprocessing (sorting, conversion to BAM format, and filtering) was performed in SAMtools v1.3.1 ( Li et al. 2009). We filtered these BAM files to include only those alignments with a minimum mapping quality of 20 or more.
Rare Breakpoint Identification
As in previous works that characterized structural variation using short-insert paired-end Illumina libraries ( Cridland and Thornton 2010 Rogers et al. 2014 Corbett-Detig et al. 2019), we first identified aberrantly mapped read “clusters.” Briefly, here, a cluster is defined as three or more read pairs that align in the same orientation (for inversions, this is either both forward-mapping or both reverse-mapping) and for which all reads at one edge of the cluster map to within 1 kb of all other reads in the cluster. We considered only aberrant clusters where both ends mapped to the same chromosome arm as the vast majority of inversions in Drosophila are paracentric ( Krimbas and Powell 1992). We required that all read pairs included in a cluster map a minimum of 500 kb apart. We then retained only those potential inversions for which we recovered both forward- and reverse-mapping clusters there were within 100 kb of one another. The choice of a maximum distance between possible breakpoint coordinates was included to reduce the possible rates of false-positives and because none of the known inversions whose breakpoints have previously been characterized included a duplicated region of 100 kb or more ( Ranz et al. 2007 Corbett-Detig and Hartl 2012). When breakpoint assemblies existed in very close proximity or appeared to delete short sequences, we set the duplication size to 1 base. We further filtered all breakpoint assemblies that overlapped annotation transposable elements as these are the primary source of aberrantly mapping read clusters in previous works ( Corbett-Detig and Hartl 2012).
As an additional check for the accuracy of our newly discovered breakpoints, we compared our distribution of rare breakpoints to the known cytogenetic distribution and found no chromosomal or by-region differences (P = 0.7, χ 2 test cytogenetic data from Corbett-Detig (2016) who summarized Krimbas and Powell (1992)). The short insert size from previous sequencing experiments ranged from ∼200 to ∼600 bp, which may have led to a nontrivial false-negative rate of breakpoint discovery particularly if the breakpoints contain repetitive elements or other large DNA insertions. However, we do not expect that these potential false-negatives will bias our downstream analyses, and all previously characterized inversion breakpoints in the Melanogaster species complex occurred in unique sequences ( Ranz et al. 2007 Corbett-Detig and Hartl 2012). All software used to perform these analyses is available from the github repositories associated with this project. Specifically, scripts used for breakpoint detection and assembly are in https://github.com/dliang5/breakpoint-assembly (last accessed May 26, 2020).
De Novo Rare Breakpoint Assembly
For each putative inversion, we then extracted all reads for which either pair mapped to within 5 kb of the predicted breakpoint position. We converted all fastq read files to fasta and qual files as is required by Phrap, and we assembled each using otherwise default parameters but including the “-vector_bound 0 -forcelevel 10” command line options ( Corbett-Detig and Hartl 2012 Rogers et al. 2014). We then used BLAST to align the resulting de novo assembled contigs to the D. melanogaster reference genome to identify the contig that overlapped the predicted breakpoint using the flybase BLAST tool (https://flybase.org/blast/, last accessed May 26, 2020). We retained only inversions for which we could de novo assemble contigs overlapping both breakpoints, and we further discarded any contigs where the sequence intervening two distant genomic regions contained sequence with homology to known transposable elements. All of the assembled breakpoint sequences are available in supplementary file S1 , Supplementary Material online. Assembly scripts are available from https://github.com/dliang5/breakpoint-assembly (last accessed May 26, 2020).
Overlapping Inversions and In(2R)Mal
We also attempted to find sets of overlapping inversions. Briefly, for overlapping inversions, where one inversion arises on a background that contains another inversion with one breakpoint inside and one outside of the inverted region, the breakpoint-spanning read clusters should be largely the same as inversions that arose on a standard arrangement chromosome. However, the key difference is that rather than pairs of forward- and reverse-mapping read clusters, we expect to observe two distantly mapping read clusters in the reverse–forward and forward–reverse arrangements. We applied this approach for the 17 rare inversions that we initially discovered as well as to all samples that contained common inversions that are known from previous work ( Corbett-Detig and Hartl 2012 Lack et al. 2015). We found only one such overlapping rare inversion, which is consistent with the known segregation distorter-associated chromosomal inversion In(2R)Mal, which is composed of two overlapping inversions ( Presgraves et al. 2009). In our analysis here, we treat these overlapping inversions as independent, but our results are qualitatively unaffected if we simply exclude the second inversion.
Genome Version, Insulator, and Gene Annotations
All our analyses are based on alignments to D. melanogaster genome version 6.26 ( Hoskins et al. 2015). We obtained genome annotation data including gene locations from flybase. We treated long noncoding RNAs as genes for our purposes, as they perform essential functions and can be disrupted in the same way as protein-coding genes. We obtained insulator-binding site positions from Nègre et al. (2010, accession GSE16245). As necessary, we converted the coordinates of genomic features from genome version 5 to 6 using the flybase coordinate batch conversion tool (https://flybase.org/convert/coordinates, last accessed May 26, 2020).
Selection of Public Data Sets for Topological Domains and Chromatin Marks
We obtained TAD data including annotations of chromatin state from Sexton et al. (2012). This data set is composed of domains detected by genome-wide chromosome conformation capture sequencing, HiC, on early stage embryos, and annotated with an epigenetic state using a clustering method applied to another source of linear epigenomic data ( Sexton et al. 2012). Their annotations include four categories: “active,” “null,” “PcG” (polycomb), and “HP1” (centromeric heterochromatin). For the sake of consistency, we refer to Sexton et al.’s “null” domains as “inactive.” Early stage embryos are likely to be the environment in which any regulatory disruption induced by inversions is most deleterious given the sensitive nature of development, which makes this a promising source of context for our analysis of inversion frequency. This data set also allows us to separately analyze breakpoint occurrence within TADs and chromatin states in tandem, because they are derived from the same source. It should be noted, however, that the annotations of these TADs are relatively coarse and may not reflect the more local environment of an inversion breakpoint.
We therefore performed a second analysis on finer scales using the data set of Kharchenko et al. (2011, accession GSE25321). This data set in its raw form consists of short spans marked with one of a set of chromatin markers, in both a nine-state model and a 30-state model. As we desired a representation of the local chromatin environment around inversion breakpoints, we chose to bin the nine-state representation into total counts of bases assigned to a state of the given type over windows of 10 kb. About 10 kb was selected based on the average heterogeneity of the windows we wanted our window size to be as small as possible but for most windows to contain at least one region with an annotated chromatin state. This yielded a distribution of values for each window which represented the overall enrichment of each state in each 10-kb span. As we lacked statistical power to evaluate these mark types individually with our relatively small inversion breakpoint data sets, we further assigned each 10-kb window an activity state based on the majority of present marks. Windows in which the vast majority of sites were assigned states one through five, annotated by Kharchenko et al. (2011) as being various components of genes including promoters, exons, and introns, were designated “active.” Windows where states six through nine, which include PcG, HP1, and other heterochromatic marks, were most prominent, were designated “inactive.” Windows in which both groups each constituted at least 5% of all marks were designated “mixed.” This yields an alternative representation of chromatin environments surrounding inversion breakpoints that is much finer-grained than the annotations of Sexton et al. (2012).
We compared this representation to Sexton et al.’s annotated chromatin states as an additional check for the validity of our approach. We found that 10-kb windows located within each annotated TAD generally aligned with the annotation of that TAD, but that substantial heterogeneity of chromatin marks exists within each TAD span ( supplementary fig. S3 , Supplementary Material online). For example, ∼19% of windows within TADs annotated as “active” are enriched for chromatin state 9, which is associated with extended silenced regions, and conversely 26% of windows within TADs annotated as inactive are enriched for chromatin state 2, which is associated with the active transcription. This indicates that one cannot be treated as a direct substitute for the other.
As a final check on the validity of the domains obtained from Sexton et al. (2012), we obtained polytene domain data from Eagen et al. (2015), repeated our analysis, and found them to be generally consistent with our conclusions. These results may be found in supplementary text S1 , Supplementary Material online.
Permutations and Statistical Tests
To compare inversion breakpoint positions to a randomized distribution, permutations for all categories of inversions (rare, common, and fixed) were performed with 1,000 iterations of a group of randomly located breakpoints, holding the inversion number, duplication lengths, and chromosome arms constant. Specifically, for each inversion breakpoint, 1,000 starting positions were chosen from a uniform distribution between the start of that chromosome arm and the end minus the length of the duplication—that is, from the entire set of possible points for that size of breakpoint. Random breakpoints were located independently for most tests, as most values were calculated for each breakpoint individually rather than the inversion as a whole. The exception is the chromatin-blending test, in which we additionally controlled for inversion lengths to account for the role of inversion length in biasing pairs of chromatin environments. Features of the genome at each of these breakpoints were recorded as our expected value for the random distribution of breakpoints.
Tests were divided by the nature of the factor. For factors that are a discrete numerical value for each break, such as distance to an element or length of a duplication, P values were calculated as percentiles of real values within a large set of random distributions. Tests between categories of the distance-based factors and the duplication length test were performed distribution to distribution with pairwise Mann–Whitney rank-sum tests.
For categorical values, such as disrupting a gene span or not, rates of category occurrence were calculated for 1,000 permutations. We define disruptions of genes and other elements as both forward and reverse single-strand breaks occurring within a single-annotated functional element. It is important to note that our method of defining disruption is likely to overestimate the proportion of fixed inversion breakpoints that truly disrupt genic sequences. Ranz et al.’s (2007) method to identify sequences duplicated by the original break relies on sequence homology, and in fixed inversions divergence of noncoding sequences can interfere with the precise identification of breakpoint regions. For example, if the original duplicated region includes a gene coding span and some noncoding bases, a complete gene copy will be produced along with a partial duplication. Over time, the noncoding region will tend to accumulate more mutations than the intact gene copy. In this case, coordinates obtained from BLAST alignments may not detect the homology between the noncoding regions and instead only yield apparent homology from duplication within the conserved gene span. This would be counted as a gene disruption event by our analysis. This bias will tend to make our analysis conservative with respect to identifying the impacts of natural selection, because breakpoints are more likely to be identified within coding regions and because we should tend to underestimate the sizes of breakpoint-adjacent duplicated regions after sequence homology has decreased. All scripts used to produce the results of the permutation tests described above are available from the github repository associated with this project https://github.com/jmcbroome/breakpoint_analysis (last accessed May 26, 2020).
Lethal and Sterile Phenotype Analysis
Additionally, we obtained phenotype data from Flybase using the query builder (https://flybase.org/cgi-bin/qb.pl, last accessed May 26, 2020) to get the IDs of all genes which have lethal phenotypes and sterile phenotypes. These data were incorporated into the gene disruption analysis and we sought evidence of difference in disruption rates between genes annotated with these phenotypes and the overall set of annotated genes. Supplementary table S2 , Supplementary Material online, contains the set of inversion breakpoints which appear to disrupt these genes.
Detection of alternative splicing
Our analysis of alternative splicing is based strictly on experimental data, not theoretical models. Rather than seeking to predict alternative splices, we directly detect them as large inserts in EST data from the publicly available dbEST ( 20) and UNIGENE ( 18) databases. We measure the evidence for a genuine alternative splice via a series of criteria (Fig. 1). First, a set of ESTs must match over their full lengths, on both sides of a putative alternative splice (allowing for sequence error). A large insert in the middle of such a perfect match is a candidate alternative splice. Unlike many other types of genomics results such as SNPs and variations in expression level, alternative splicing does not resemble common experimental noise (such as sequencing error).
Next, the EST consensus sequence is mapped to the draft human genome sequence by homology search. Because human genes are broken into short exons, a genomic hit typically consists of many short matches. To be valid, these matches must be perfect (again allowing only for sequencing error), must all be in the same orientation (strand) and form a complete, correctly ordered walk through the EST consensus sequence. We require that each genomic–EST match region (putative exon) be bounded by consensus splice donor site and acceptor site sequences in the neighboring genomic (intron) sequence. Our results give an average internal exon size of 144 bp, with only 4% of internal exons >300 bp in length, similar to results obtained for known genes ( 21). Only 0.2% (79/39 862) of our introns were <60 bp, and the median intron length was 935 bp. The typical gene pattern of short internal exons ending in a single, long 3′ exon can usually be verified because 3′-end sequences are highly represented in the EST data, and because 3′ ESTs can be identified by their conspicuous poly(A) tails, which directly indicate the end of the 3′ exon.
To assess the accuracy of our gene mapping and exon/intron structure, we have compared against the completely independent data produced by NCBI’s Acembly, a human curated gene annotation effort (data downloaded from ftp://ncbi.nlm.nih.gov/genomes/H_sapiens). LocusLink provides an independent linkage between individual RefSeq genes and UNIGENE clusters ( 22). For genes mapped independently to the genomic sequence by RefSeq and our procedure, 97.3% mapped to the same genomic contig. Moreover, of those genes, 95% were mapped to the same nucleotides of the contig. While Acembly’s mapping should not be assumed to be perfect, this high level of agreement between independent efforts is encouraging. Our exon details (derived in our procedure from our splice detection) match the NCBI Acembly exons in 97% of cases at the 5′ splice site, and 96% at the 3′ splice site (overall, 94% of the exons were identical). Our splice details matched the NCBI Acembly introns in 93% of cases at the 5′-end, and 92% at the 3′-end (86% matched exactly at both ends). Because of alternative splicing, a 100% correspondence is not expected.
A candidate alternative splice insert (from the EST) must pass a series of tests. First, it must also be found in the genomic sequence, matching an exonic region in the genomic sequence whose boundaries correspond to known splice site sequences. Since these splice site sequences are mostly intronic, this provides an independent validation of the alternative splice. It should be emphasized that differences in where ESTs begin and end in a gene (e.g. a shorter EST might give the appearance of a truncated gene product) will never be interpreted as an alternative splice by our procedure. We focus exclusively on detecting splicing, i.e. a contiguous region of the transcript that has been removed during mRNA processing. Detecting a splice in an EST requires extensive matches to both upstream and downstream exons. Our analysis identified 39 862 splices in 8429 clusters. Our analysis only reports alternative splices, i.e. pairs of validated splices that are mutually exclusive. Thus unspliced introns or other genomic contaminants will never be reported, since they result in the absence of a splice, not the creation of a new, mutually exclusive splice. To call an alternative splice, our procedure requires a pair of splices that match exactly at one splice site, and differ at the second splice site. This procedure can detect exon skipping, alternative 5′ donor sites, and alternative 3′ acceptor sites (Fig. 1B). 6201 such alternative splice relationships were identified in 2272 clusters. These diverse forms of evidence produce strong log odds scores for each detected alternative splice. A detailed statistical analysis of this evidence will be presented elsewhere (D.Miller, J.Aten, C.Grasso, B.Modrek and C.Lee, manuscript in preparation).
As a typical validation example from our database, we illustrate the dystrophia myotonica protein kinase (DMPK) gene (Fig. 2), whose alternative splicing has previously been studied extensively. In DMPK, we identified three alternative splices in the EST data, all of which are verified by independent experimental results in the existing literature ( 23). Of the three alternative splices, one deletes the last 15 bp of exon 8, another skips exon 12 and exon 13, and the last deletes just 4 bp in exon 14. Figure 2 shows one of these alternative splice forms including junction and quality of match of the EST evidence versus the genomic sequence.
Novel alternative splice forms of a known gene
Figure 3 shows several novel alternative splices detected in a well-studied gene, HLA-DM β. Eighty ESTs from UNIGENE cluster Hs.1162 align to form a consensus sequence, which in turn matches an ordered series of segments on one strand of chromosome 6. The EST sequences match the genomic sequence closely, consistent with sequencing error. The EST sequences mark out a long 3′ exon (359 bp) plus a series of five short exons, whose sizes (36–288 bp) match the range expected for internal exons. This matches the known gene location and structure for HLA-DM β ( 24, 25). Eight splices are observed in these ESTs, where sequence matching one exonic region skips directly to a downstream exonic region as indicated in Figure 3A. The 16 putative exon boundaries implied by the ESTs map precisely to strong consensus splice acceptor and donor sites in the genomic sequence (Fig. 3C).
Four different alternatively spliced forms of HLA-DM β are observed: splices 3+4+5 (including exons IV and V in the mRNA product) splices 6+5 (skipping exon IV) splices 3+7 (excluding exon V) splice 8 (skipping exons IV and V). Exons IV and V are 117 and 36 bp in length, and thus these alternative splices are all in-frame. The protein coding region begins in exon I and ends in exon VI, so these splices produce four different forms of the HLA-DM β chain that differ at their C-terminus.
Analysis of these forms reveals a remarkably simple and intriguing functional effect. HLA-DM is essential for the loading of class II MHC molecules with exogenous peptide antigens, a key step in antigen presentation and activation of the humoral immune response. This is thought to occur in early lysosomal compartments. HLA-DM is normally targeted to lysosomes, and its β chain contains a transmembrane domain anchoring its C-terminus ( 26, 27). Exon IV is short, and corresponds precisely to the transmembrane domain. Exon V is very short, and encodes the lysosomal targeting signal YTPL, whose first residue begins at the start of the exon. Thus, the alternative splice regulates HLA-DM’s targeting to endosomal compartments (by including or excluding the YTPL signal), as well as its anchoring to the membrane. Given HLA-DM’s importance in antigen processing and presentation by class II MHC, this regulation is functionally interesting. Removing its targeting signal would likely redirect HLA-DM first to the plasma membrane, so that it would travel to lysosomes via endocytic pathways, altering the kinetics and conditions in which it first encounters class II MHC. It appears that the gene structure of the HLA-DM β gene has been carefully ‘designed’ to enable control of HLA-DM function, by pulling out both the transmembrane helix and the lysosomal targeting signal into separate short exons (IV, V) that can be alternatively spliced in-frame (exon VI supplies the last 4 amino acids of the protein, identical in all forms). The alternatively spliced forms were detected in uterus (two ESTs), placenta, lymph, stomach and colon. Despite the fact that HLA-DM is the subject of intense research, we have not been able to find any report of such alternative splicing in the published literature, and it is thought to be novel by an expert on HLA-DM (E.Mellins, personal communication).
Scope of alternative splicing in human genes
Our genome-wide analysis detected thousands of alternative splices in the current, publicly available human genome data (Table 1). 6201 alternative splice relationships were detected in which two splices shared a common donor or acceptor site, but spliced to a different site on their other end (i.e. exon skipping, alternative 5′ splice donor site or alternative 3′ splice acceptor site Fig. 1B). We found alternative splices in 27% of genes for which we had enough expressed sequence to cover more than a single exon. However, this estimate, based on analysis of all EST clusters, likely underestimates the real occurrence of alternative splicing, because the available EST data typically cover only a small part of the complete gene. To test this hypothesis, we analyzed the alternative splicing rate in genes for which mRNA sequence was available (representing all or part of the full gene). We detected one or more alternative splice forms in 42% of these genes, significantly higher than the rate observed in EST-only clusters. This is in close agreement with a previous study of mRNA-based expressed sequence clusters ( 8). Since fragmentation of the genomic sequence can also block complete coverage of a gene, we assessed the rate of alternative splice detection in genes mapped to chromosome 22. Of these, 43% contained alternative splices, including both mRNA and EST-only clusters.
The current EST data appears to be incomplete. Our procedure identified splices (i.e. multiple exons) in only 18% of the mapped EST clusters. However, for clusters that we mapped to chromosome 22 (full genomic) that also had an mRNA sequence, 88% contained at least one splice. A variety of factors such as the fragmentation of the draft human genome sequence, the large size of introns and the tendency of the ESTs to cluster at the 3′-end bias the current dataset against finding full-length genes, and probably underestimate the true level of alternative splicing. Moreover, since the current EST data for each gene represent only a subset of the tissues and cell types in which that gene is expressed, it is likely that the total occurrence of alternative splicing is much greater than what our analyses can detect. A large fraction of the EST alternative splice forms were observed multiple times (from different clones and different libraries), indicating that they constitute a relatively high fraction of total mRNA. Of our alternative splices, 2892 (47%) were observed in two or more EST sequences. These data represent a ‘high confidence’ subset of the detected alternative splices.
Our analysis indicates that the vast majority of our database represents novel findings (Fig. 5D). Only 13% of our alternative splices were detected in mRNA sequences from GenBank, which presumably have been thoroughly studied. The remaining 87% could be detected only with ESTs. Our procedure also detected large numbers of alternative splicing events in completely novel genes. Approximately 1200 alternative splices were detected in clusters containing ESTs only.
Alternative splicing in a novel human gene
Figure 4 illustrates an example of alternative splice detection in a novel gene mapped in the human genome by our procedure. This gene has 33% identity to rat FCε receptor I β chain, and 25% identity to CD20, and has a pattern of four predicted transmembrane domains characteristic of both proteins. At least seven different forms are detectable, all of which affect the protein product. In a pattern strikingly reminiscent of HLA-DM β, the C-terminal transmembrane region and cytoplasmic tail of the major form (form 1) are placed on a single, short exon (exon VI), that can be included or excluded to create different forms. One particularly interesting form is created by ignoring the normal splice from exon V to exon VI, extending the coding region from exon Va for 142 bp (which we have designated exon Vb). A polyadenylation site is predicted at the end of this sequence, and the ESTs are observed to terminate in poly(A) at this point. This alternative termination replaces the coding region of exons VI and VII with 40 amino acids encoded by exon Vb [terminated by a STOP codon 23 bp before the poly(A) site]. Intriguingly, this replacement C-terminal sequence also contains a predicted transmembrane sequence, and thus neatly substitutes a new C-terminal transmembrane domain and cytoplasmic tail. The cytoplasmic tail in equivalent FC receptor chains plays a key role in activating cytoplasmic signal transduction molecules ( 28, 29), so this alternative form likely modulates the signal transduction activity of this receptor. This form is detected in placenta and kidney, while the majority form was detected in many different libraries.
Single-molecule, real-time sequencing developed by Pacific BioSciences offers longer read lengths than the second-generation sequencing (SGS) technologies, making it well-suited for unsolved problems in genome, transcriptome, and epigenetics research. The highly-contiguous de novo assemblies using PacBio sequencing can close gaps in current reference assemblies and characterize structural variation (SV) in personal genomes. With longer reads, we can sequence through extended repetitive regions and detect mutations, many of which are associated with diseases. Moreover, PacBio transcriptome sequencing is advantageous for the identification of gene isoforms and facilitates reliable discoveries of novel genes and novel isoforms of annotated genes, due to its ability to sequence full-length transcripts or fragments with significant lengths. Additionally, PacBio’s sequencing technique provides information that is useful for the direct detection of base modifications, such as methylation. In addition to using PacBio sequencing alone, many hybrid sequencing strategies have been developed to make use of more accurate short reads in conjunction with PacBio long reads. In general, hybrid sequencing strategies are more affordable and scalable especially for small-size laboratories than using PacBio Sequencing alone. The advent of PacBio sequencing has made available much information that could not be obtained via SGS alone.