Search
Advanced Search
Average Rating (0 User Ratings)
    • Currently 0/5 Stars.
    See all categories
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
      • Currently 0/5 Stars.
    Rate This Article
We are still in beta! Help us make the site better and report bugs.

Open Access

Research Article

SPA: A Probabilistic Algorithm for Spliced Alignment

Erik van Nimwegen1,2*, Nicodeme Paul1,2, Robert Sheridan3¤, Mihaela Zavolan1,2

1 Biozentrum, University of Basel, Basel, Switzerland, 2 Swiss Institute of Bioinformatics, Switzerland, 3 The Rockefeller University, New York, New York, United States of America

Abstract

Recent large-scale cDNA sequencing efforts show that elaborate patterns of splice variation are responsible for much of the proteome diversity in higher eukaryotes. To obtain an accurate account of the repertoire of splice variants, and to gain insight into the mechanisms of alternative splicing, it is essential that cDNAs are very accurately mapped to their respective genomes. Currently available algorithms for cDNA-to-genome alignment do not reach the necessary level of accuracy because they use ad hoc scoring models that cannot correctly trade off the likelihoods of various sequencing errors against the probabilities of different gene structures. Here we develop a Bayesian probabilistic approach to cDNA-to-genome alignment. Gene structures are assigned prior probabilities based on the lengths of their introns and exons, and based on the sequences at their splice boundaries. A likelihood model for sequencing errors takes into account the rates at which misincorporation, as well as insertions and deletions of different lengths, occurs during sequencing. The parameters of both the prior and likelihood model can be automatically estimated from a set of cDNAs, thus enabling our method to adapt itself to different organisms and experimental procedures. We implemented our method in a fast cDNA-to-genome alignment program, SPA, and applied it to the FANTOM3 dataset of over 100,000 full-length mouse cDNAs and a dataset of over 20,000 full-length human cDNAs. Comparison with the results of four other mapping programs shows that SPA produces alignments of significantly higher quality. In particular, the quality of the SPA alignments near splice boundaries and SPA's mapping of the 5′ and 3′ ends of the cDNAs are highly improved, allowing for more accurate identification of transcript starts and ends, and accurate identification of subtle splice variations. Finally, our splice boundary analysis on the human dataset suggests the existence of a novel non-canonical splice site that we also find in the mouse dataset. The SPA software package is available at http://www.biozentrum.unibas.ch/personal/nimwegen/cgi-bin/spa.cgi.

Synopsis

A prerequisite for the identification and analysis of splice variation in the transcriptomes of higher eukaryotes is the very accurate mapping of cDNAs to their genomes. However, current algorithms use ad hoc scoring schemes that cannot correctly trade off the likelihoods of different sequencing errors against the likelihoods of different gene structures.

In this paper the authors develop a Bayesian probabilistic approach to cDNA-to-genome mapping that combines explicit models for the prior probabilities of different gene structures with the likelihoods of different sequencing errors. The parameters of these probabilistic models can be estimated automatically from the input such that the mapping procedure is automatically adapted to the organism and sequencing technology of the data under study.

The authors implement their approach in a fast mapping algorithm called SPA and apply it to a dataset of human full-length cDNAs and the FANTOM3 dataset of mouse full-length cDNAs. Comparisons with four other mapping algorithms show that SPA produces mappings that are significantly more accurate, with the largest improvements in the mappings of the 5′ and 3′ ends of the cDNAs, and the mappings around splice boundaries. The authors also identify a novel set of putative splice sites in the human dataset.

Introduction

Recent large-scale sequencing projects such as FANTOM3 [1] have started to unveil the complexity of the mammalian transcriptome. Data from such projects provide a unique opportunity for an in-depth investigation of the mechanisms that lead a single region in the genome to produce a myriad of transcript forms through variation in the transcription initiation site, the transcription termination site, and the combination of splice sites used. In order to perform such an analysis it is essential that the mapping of the observed transcripts to the genome from which they derive is highly accurate. For instance, the study of the structure of basal promoters and of the mechanism of alternative transcription initiation requires that the starts of transcripts are correctly mapped. Similarly, the study of important regulatory elements such as microRNA binding sites requires that the ends of the transcripts are accurately mapped and the 3′ UTRs are correctly identified. For learning more about the mechanism and regulation of alternative splicing, it might well be that the most informative transcripts are those with rare splice variations, potentially including errors of the splicing machinery. The mapping program thus should reliably identify these cases as well.

Currently available algorithms for mapping cDNAs to the genome fail to reach the required level of accuracy for various reasons. Structurally, the most significant problem with existing mapping algorithms is that they use ad hoc scoring schemes for defining the quality of different alignments. Some programs, e.g., Sim4 [2], cannot properly deal with non-canonical splice boundaries and sometimes introduce multiple errors in the alignment around the splice boundary to force the canonical, GT-AG, pair of splice signals. Other programs, e.g., BLAT [3], do not explicitly distinguish between small introns and small deletions caused by sequencing errors. Finally, a significant fraction of all mapping errors are caused by the general inability of the algorithms to correctly map the starts and ends of the transcripts.

Here we introduce SPA, a novel algorithm for spliced alignment that is based on a Bayesian probabilistic model for mapping cDNAs to the genome and identifying their gene structures. Using a set of parameters that can be estimated in a dataset- and organism-specific way, SPA searches for the mapping with maximal posterior probability. Extensive comparisons between SPA, Sim4 [2], GMAP [4], BLAT [3], and Spidey [5] on mappings of full-length human cDNAs show that SPA performs significantly better than each of the other algorithms. Our main application is the mapping of the more than 100,000 full-length cDNAs of the FANTOM3 dataset [1]. We show that SPA's alignments are significantly more accurate than those that were obtained for the FANTOM3 project [1] using the BLAT program [3] with the –fine option and post-processing of the output to identify the intron/exon structure.

Results

Trading Off Sequencing Errors and Gene Structures: Bayesian Probabilistic Model

For a given cDNA we want to infer the set of exons in the genome from which it derives. That is, we want to align the cDNA with the genome and indicate in this alignment where exons start and end. For simplicity, we will refer to such a combination of an alignment and identification of exon boundaries as a mapping of the cDNA. To evaluate the quality of different mappings it is essential to take into account both the sequencing errors that they imply as well as the gene structures that they imply. On the one hand, we know that sequencing errors are quite rare and that, roughly speaking, the fewer mismatches, insertions, and deletions there are in the alignment, the more likely the alignment is. On the other hand, one can always “perfectly” align a cDNA to the genome as long as one allows an arbitrary number of arbitrarily small exons separated by arbitrarily small introns, e.g., by letting each nucleotide form an exon by itself. However, it is clear that the resulting gene structure would be extremely unlikely. To correctly evaluate the quality of different mappings one thus has to trade off the likelihood of various sequencing errors against the likelihood of different gene structures.

To combine the probabilities of the gene structure and the sequencing errors in a systematic way we use a Bayesian approach. Depending on the organism under study, we assign different prior probabilities to different gene structures. In our model the prior probabilities of gene structures depend on the lengths of their exons and introns, and on the sequences occurring at their splice boundaries. This prior over gene structures is then combined with a likelihood model for different sequencing errors. Each possible mapping implies a transcript, and this transcript generally differs from the observed cDNA by a number of single-base mutations, insertions of bases into the cDNA that were not present in the transcript, and deletions of bases from the cDNA that were present in the transcript. The likelihood model assigns the probability that, starting from a given transcript, sequencing errors would lead to the observed cDNA. In this model we assume that bases are mutated at a constant rate during sequencing, and that insertions and deletions of different lengths occur at different rates.

Formally, given a cDNA c, the genome g, and a hypothesized mapping m, we have a prior probability P(m|g) for the gene structure that the mapping implies, and a probability P(c|m) for the sequencing errors that the mapping implies. Using Bayes's theorem the posterior probability P(m|c,g) for a mapping given both cDNA and genome is



Our alignment algorithm attempts to find, for a given cDNA c, the mapping m that maximizes this posterior probability P(m|c,g).

Estimating Model Parameters

The likelihood of different gene structures generally varies between organisms. Similarly, the rate at which different errors occur during sequencing generally varies between different sequencing technologies. In contrast to existing algorithms that use a single scoring function for all organisms and experimental technologies, our algorithm adapts its scoring function by adapting the prior P(m|g) over gene structures to the organism from which the cDNAs derive, and the likelihood model of sequencing errors P(c|m) to the experimental technology used.

As detailed in the Materials and Methods, in our model the distributions P(m|g) and P(c|m) are specified by a number of parameters such as the distribution of intron lengths, the probabilities of different sequences at the splice boundaries, and the rates at which mutations, insertions, and deletions of various lengths are introduced during sequencing. These parameters are either estimated directly from the dataset under study or estimated from an external set of reference alignments. In the first approach, we start by mapping the dataset of cDNAs with a default set of parameters and then estimate the parameters of the distributions P(m|g) and P(c|m) from the resulting “reference” alignments. A second round of alignments is then performed with these new parameters. If necessary this procedure can be iterated more than once. If an external set of reference alignments is provided, we estimate the parameters of the distributions P(m|g) and P(c|m) directly from these reference alignments.

Part of the estimation procedure involves the estimation of the splice boundary probabilities P(s1s2s3s4), i.e., the probability that an intron will start with bases s1s2 and end with bases s3s4. To this end we could of course simply count the frequency with which the boundaries s1s2s3s4 occur in the set of reference alignments. However, as illustrated in Figure 1, many of these boundaries are ambiguous. That is, the boundary can be put in multiple places without introducing any mismatches or gaps into the alignment. Since the placement of these ambiguous boundaries in the reference alignments is very sensitive to the details of the scoring function that produced the reference alignments, the frequency of boundaries s1s2s3s4 in the reference alignments might be highly biased.

thumbnail

Figure 1. Example of an Ambiguous Splice Boundary

Because a few nucleotides at the start of the intron match the nucleotides at the start of the neighboring exon, the splice boundary can be positioned in multiple ways without introducing mismatches. The different possible boundaries are indicated in different colors.

doi:10.1371/journal.pgen.0020024.g001

As detailed in the Materials and Methods we calculate a probability for all possible ways in which ambiguous splice boundaries in the reference alignments can be assigned (i.e., shifted left or right) and use a Monte-Carlo Markov chain to sample the space of all possible boundary assignments in proportion to their probability. We set P(s1s2s3s4) equal to the frequency with which each boundary s1s2s3s4 occurs during this sampling. Roughly speaking, the assignments that contribute most to the average are those that lead to distributions P(s1s2s3s4) that have low entropy. We confirm that our sampling has converged by performing multiple sampling runs.

Finally, we have observed that in both the human and mouse full-length cDNA datasets, a small fraction of the cDNAs show a much higher rate of misincorporation than the others. To optimize the mappings of these cDNAs, our parameter estimation procedure also identifies the subset of cDNAs that belong in this high misincorporation-rate class, estimates their average misincorporation rate, and remaps them with the misincorporation-rate parameter set to this higher value.

Alignment Algorithm

As shown in the Materials and Methods, finding the optimal mapping m with maximal posterior probability P(m|c,g) reduces to finding the optimal alignment of the cDNA c to the genome g under an appropriate scoring scheme derived from our probabilistic model. In addition, the maximally scoring alignment can be determined by dynamic programming.

Formally, with s(i,j) being the score of an optimal alignment that ends with position i in the cDNA being mapped to position j in the genome, this score can be determined by the recursion relation



where the score s[(i,j)|(i′,j′)] is the change in alignment score associated with extending the optimal alignment ending at (i′,j′) by adding the single pair of mapped bases (i,j). Apart from the score associated with the matching or mismatching of the bases at i and j, the score s[(i,j)|(i′,j′)] also incorporates the contribution of the (possible) gaps associated with leaving bases i′ + 1 through i − 1 of the cDNA unmapped, and skipping bases j′ + 1 through j − 1 in the genome. The latter may involve an intron.

Note that, in contrast to scoring models that use affine gap penalties, our much more general scoring of insertion, deletion, and intron lengths forces us to consider all pairs of positions (i′,j′) with i′ < i and j′ < j. Therefore, although we can theoretically determine the optimal score rigorously using the dynamic programming recursion (Equation 2) on the full dynamic programming matrices formed by the cDNA and each of the chromosomes, it is computationally infeasible to map cDNAs using this scheme and we therefore developed a heuristic approach, which is illustrated in Figure 2.

thumbnail

Figure 2. Schematic Summary of the SPA cDNA-to-Genome Mapping Algorithm

doi:10.1371/journal.pgen.0020024.g002

We first use the BLAT gfServer [3] to determine a set of genomic loci to which the cDNA might map. We separately find the best alignment to each of these loci and choose the overall optimal alignment at the end. For each of the loci we then find a set of “defined positions.” The defined positions (i,j) consist of all areas in the alignment matrix that show significant homology between the cDNA and the genomic locus and are shown as the red diagonals in the top alignment matrix on the right in Figure 2. The defined positions are identified by first finding all k-mer matches between locus and cDNA, extending these diagonally up to the first mismatch on each side, and adding a “fuzz” of defined positions at both ends of the diagonal. In the next step an optimal alignment is found by applying the dynamic programming recursion (Equation 2) only to the set of defined positions. For each defined position (i,j), shown as a green square in the second diagram on the right in Figure 2, we find the maximum alignment score obtained by extending the optimal alignment ending at any of the previous defined positions (i′,j′), shown as the red squares in the same diagram. We also record the position (i′,j′) that leads to the optimal mapping at (i,j). When all s(i,j) are calculated we can trace back the optimal alignment as usual.

We then iteratively check the alignment for problem areas such as unmapped cDNA nucleotides and non-canonical splice boundaries, and add defined positions in the problem areas. When there are unmapped cDNA bases at the 5′ or 3′ end we also extend the genomic locus at the corresponding end or ends and search for defined positions between the unmapped 5′ or 3′ end of the cDNA and the added genomic segment. This “re-tiling” procedure is illustrated in the diagram at the bottom right in Figure 2. The blue line shows the best alignment through the current set of defined positions. There are two segments of the cDNA that remain unmapped in this alignment and these define the “problem areas” shown as orange boxes in the diagram. We then identify additional defined positions in these areas at a finer resolution than before, i.e., we decrease k. The new defined positions are shown as the two red diagonals. Every time new defined positions are added to the matrix we recalculate the globally optimal alignment afresh and check it for problem areas. This iterative procedure ends when no more problem areas exist, when no more defined positions can be added to the matrix, or when the number of defined positions exceeds a prespecified maximum. The last condition guarantees that the time that the algorithm spends on a given cDNA stays within a strict upper bound.

Finally, since a small fraction of the cDNAs are erroneously reverse-complemented with respect to the transcript from which they derive we also determine the optimal alignment for the reverse-complement of the cDNA, and we take into account the small prior probability that the cDNA was reverse-complemented in the sequencing process. SPA reports the alignment with maximal posterior probability over all loci and orientations. Details of the procedure are described in the Materials and Methods.

Running Times

The speed of the mapping is a considerable challenge. For example, to map 1 million cDNAs within a week on a cluster with 100 CPUs, the algorithm cannot take more than 1 min per cDNA. Ideally the mapping should take no more than seconds per cDNA on average. However, some transcripts are very long, derive from large paralogous gene families, and can therefore be mapped to many different loci in the genome, some of which may be as long as 1,000,000 nucleotides. In addition, some loci may contain highly repetitive areas that give a very large number of significant local alignments that all need to be checked in order to determine which alignment gives the globally optimal score. When multiple sequencing errors fall within a small exon that is flanked by large introns in the genome the algorithm will have to search the entire genomic region at a fine resolution to discover the exon. Given these complications, it is simply impossible to produce accurate alignments in a matter of seconds for all transcripts. However, there are also many short single-exon transcripts that map without any, or with very few, errors to only one place in the genome. These transcripts can obviously be accurately mapped in much less than a second.

An efficient alignment algorithm should thus take into account the large variation in mapping difficulty for different cDNAs, quickly dispensing with cDNAs that are easy to map and detecting automatically when more time is needed for more complicated cases. As illustrated in Figure 3, this is naturally achieved by our iterative scheme that checks the current best alignment for problem areas and extends the number of defined positions in these areas by performing a finer resolution homology search. The figure shows the distribution of SPA's running time on all cDNAs of the FANTOM3 dataset. The average time per cDNA for this set of 102,793 cDNAs was 23.6 s. This is comparable to the time BLAT takes when run with the –fine option. The running time of individual cDNAs varies over six orders of magnitude from less than 0.01 s to several hours. The vast majority of cDNAs map in less than a second, a few percent take on the order of minutes, and a very small number of cDNAs take more than an hour. More detailed analysis of a number of examples shows that there are several reasons why some cDNAs take such a long time to map. Most of these cDNAs can be mapped to multiple locations in the genome (often more than ten), and SPA attempts an alignment at each locus. At many of these loci only part of the cDNA can be mapped, and this will lead to unmapped pieces at the 5′ or 3′ end in the initial alignment. SPA will then extend the loci at the corresponding ends and re-tile these regions at ever finer resolution in an attempt to map the unmapped 5′ and 3′ ends. At every step of this re-tiling a significant number of defined positions are added, and SPA has to perform the dynamic programming to identify the best alignment through these positions at every step. Another reason for long running times is that some cDNAs contain a significant amount of repetitive sequence, which produces a very large number of defined positions in the dynamic programming matrix.

thumbnail

Figure 3. Distribution of Running Times of SPA on All 102,143 cDNAs of the FANTOM3 Dataset

The horizontal axis shows the time in seconds, and the vertical axis shows the fraction of all cDNAs that took longer than the corresponding time to finish. Both axes are shown on a logarithmic scale.

doi:10.1371/journal.pgen.0020024.g003

To give a specific example, the cDNA with RIKEN identifier F830048L18 took almost 4 h to run. SPA considered 20 different loci for this cDNA, one of which was on Chromosome 1 and the other 19 of which were on pieces of unassembled chromosome. The initial alignments of 15 of these loci all had unmapped 3′ ends of about 75 nucleotides. For all 15, SPA extended the locus at the 3′ end (by 50 kilobases), and then re-tiled this area with finer and finer tile size in an attempt to map the 3′ end piece. This failed in all 15 cases. For the other five loci the initial alignment had an unmapped piece at the 5′ end. For these loci SPA extended the loci at the 5′ end and re-tiled with finer tile size. For one of the loci (locus 4) this led to a successful mapping of the piece at the 5′ end. This final alignment contains three exons in a genomic locus of 2,013 nucleotides.

Results for Human Full-Length cDNAs

As an initial test of SPA, and to extensively compare its performance with other algorithms, we used a recent dataset [6] of 20,207 human full-length cDNAs. (The paper mentions over 21,000 cDNAs, but we only found GenBank records for 20,207.) We mapped the cDNAs of this dataset using SPA, Sim4 [2], Spidey [5], BLAT [3], and the recently published algorithm GMAP [4] (see Materials and Methods for details). Sim4, BLAT, and Spidey were chosen because they are the most commonly used spliced alignment algorithms, and GMAP was chosen because it is a recent fast algorithm that is reported [4] to produce mappings that are in quality comparable to or better than those that BLAT produces.

Evaluating the quality of the mappings produced by the different algorithms is complicated by the fact that we obviously do not know what the “correct” mapping is for each cDNA. In fact, deciding which of several mappings for a single cDNA is the best is equivalent to defining a scoring model for cDNA mappings. Since we are of the opinion that SPA's scoring model is by far the most realistic and sophisticated model available, we believe that a higher scoring under this model is a good proxy for higher quality of the mappings. Thus, as a first test we scored each algorithm's mappings using SPA's scoring model, and compared the scores with the scores of SPA's mappings. The results are shown as the solid lines in Figure 4. In each panel the solid green line shows the distribution of the score differences for cDNAs for which SPA has a higher scoring mapping, and the solid red line shows the distribution of score differences for cDNAs for which the other algorithm scores higher. For example, the top left panel shows that there are over 6,000 cDNAs for which SPA's mapping scored better than Sim4′s mapping, versus only 210 cDNAs for which Sim4 scored better. There are almost 4,000 cDNAs for which the score difference in SPA's favor is at least ten, and a little over 700 cDNAs for which the score difference is at least 100. In contrast, there are only about 200 cDNAs for which Sim4′s mapping is better by at least ten, and less than 40 for which the difference is at least 100 in Sim4′s favor. To give an indication of the scale of these score differences, a difference of 100 corresponds to between 30 and 60 more matched nucleotides. We see that, for all four algorithms, the number of cDNAs for which SPA scores better is at least ten times as high as the number of cDNAs for which the other algorithm scores better.

thumbnail

Figure 4. Distributions of Differences in the Log-Posterior Probability of the Mappings of the Human cDNAs

Shown are the differences between SPA and Sim4 (top left), GMAP (top right), BLAT (bottom left), and Spidey (bottom right). The green curves show the distributions of score differences for the cases in which SPA had a higher scoring mapping, and the red curves show the distributions of score differences for mappings where the other algorithm had a higher scoring alignment. The solid lines were obtained using SPA's scoring function and the dashed lines using for each algorithm the scoring function that maximizes the posterior probability of the entire set of mappings of that algorithm. All axes are shown on logarithmic scales.

doi:10.1371/journal.pgen.0020024.g004

Although we believe these results provide strong evidence that SPA produces higher quality mappings, one could always argue that the apparent higher quality of SPA's mappings under this scoring model is simply a result of the fact that SPA used the same model to produce the mappings and does not necessarily indicate higher quality mapping. To address this we performed the same comparisons using scoring models that are adapted to the mappings of each of the other algorithms. That is, for each other algorithm, we ran our parameter estimation procedure to infer the mismatch probability pmm, the splice boundary probabilities Psb(s1s2s3s4), and the probability distributions of insertions Pi(k), deletions Pδ(k), and intron lengths Pint(k) that maximize the overall probability of this algorithm's mappings. We then compared the mappings of SPA and each other algorithm under the optimal scoring model for that algorithm. The results are shown as the dashed lines in Figure 4. One can see that the distributions of score differences between SPA and the other algorithms are highly robust under the change of the scoring function.

The dashed lines substantially deviate from the solid lines only for very high score differences in SPA's favor and for small score differences in the other algorithm's favor. These observations can be understood as follows. Almost all mappings that have a much higher score under SPA's scoring contain one or more very short introns of length less than 30 in the other algorithm's mapping. SPA's scoring highly penalizes these short introns (see Materials and Methods), but under the algorithm's own scoring model this penalty is significantly reduced. This causes the difference in the green lines at very high score differences.

Mappings with very small score differences generally only differ in their choice of splice sites or by subtle differences in the assignment of mismatches and gaps. For example, one mapping may have a single insertion of length six and the other mapping two insertions of length three. For such cases SPA's scoring slightly prefers SPA's mapping and the other algorithm's scoring slightly prefers its own mapping. Because the large majority of cases for which the other algorithm scored better involved only small differences, changing from SPA's scoring model to the other algorithm's scoring model causes the number of such cases to increase significantly. In contrast, because the large majority of mappings for which SPA scored better involved larger differences, the total number of cDNAs for which SPA scored better is affected relatively little by changing the scoring function. Mappings with differences of ten or more differ substantially, e.g., by the mapping of an additional exon or a 5′ or 3′ end that is unmapped in the other mapping, and changes to the scoring hardly affect the score difference between such mappings.

In summary, for each algorithm, and independent of the scoring function's parameters, SPA produces a better alignment on about 25% (over 5,000) of the cDNAs, a substantially better alignment on at least 5% (1,000 or more) of the cDNAs, and a very large improvement for a few percent of the cDNAs (200 or more).

Given that, for the solid lines in Figure 4, we use our own scoring function for assessing the quality of the mappings, and given that SPA of course attempts to find the mapping with maximal score, one may wonder why there are any cDNAs at all for which the other algorithms find mappings that are considered better under SPA's scoring model (the solid red lines). The reason is that, because of time constraints, we used a heuristic algorithm to find the best alignment. The red curves thus indicate cases for which our heuristic algorithm failed to find the alignment with optimal score. Fortunately, these cases are rare, amounting to about 1% of the cDNAs. Cases where SPA's mapping has a lower score by a large amount are rarer still.

All these comparisons still depend on a scoring function for evaluating the quality of the mappings. We additionally compared the mappings in a number of ways that are completely independent of a scoring model. First, in Figure 5 we compare a number of global statistics across the mappings of the different algorithms (a much more detailed table of global statistics is provided in Table S1). In the bar chart in the upper left of Figure 5 we show the relative numbers of errors of different types in the mappings of the five algorithms. We have rescaled the bars such that, for each type of error, the total number of errors in the SPA mappings is set to one.

thumbnail

Figure 5. Comparison of Global Mapping Statistics for the Mappings of the Human cDNAs for SPA, Sim4, GMAP, BLAT, and Spidey

The bar chart shows the relative numbers of nucleotides in different types of errors between SPA and the other algorithms. The number of splice boundary errors is defined as the number of nucleotides in mismatches, insertions, and deletions that occur within ten alignment positions of a splice boundary. The fraction of unknown splice boundaries is the fraction of all splice boundaries that do not contain a GT-AG, GC-AG, or AT-AC splice boundary. The numbers of errors in SPA are scaled to one. The pie charts show the percentages of the unmatched nucleotides in each of the algorithms' mappings that are the result of unmapped cDNAs (dark blue), unmapped 5′ and 3′ ends (light blue), internal insertions (green), and mismatches (red).

doi:10.1371/journal.pgen.0020024.g005

In an ideal situation every cDNA nucleotide would be mapped to a matching nucleotide in the genome. SPA has a total of 472,392 nucleotides that are not mapped to matching nucleotides in the genome, which corresponds to just under 1% of all nucleotides in the data. The top five bars in the bar chart in Figure 5 show the relative numbers of unmatched bases in the tested algorithms. We see that all other algorithms have over 20% more unmatched bases than SPA. The pie charts in Figure 5 show how these unmatched nucleotides are distributed over different types of errors for each of the algorithms. All algorithms failed to produce mappings for some of the cDNAs (see Materials and Methods). As the pie chart for Spidey shows, unmapped cDNAs account for almost half of Spidey's unmatched nucleotides. Unmapped cDNAs also account for a significant fraction of GMAP's unmatched nucleotides, and for a moderate fraction of BLAT's unmatched nucleotides. For the large majority of these cDNAs, BLAT and GMAP simply did not report a significant alignment. SPA and Sim4 did produce alignments for most of these cDNAs, but manual inspection revealed that these were typically very low quality alignments with only a small fraction of the cDNA mapped. We suspect that many of these cDNAs are experimental artifacts, including chimeric cDNAs.

Unmapped nucleotides at the 5′ and 3′ ends of cDNAs account for more than half of the unmatched nucleotides for all algorithms but Spidey. As shown in the second set of bars in the upper left in Figure 5, all other algorithms have substantially more nucleotides in these unmapped ends than SPA. As shown in Table S1, unmapped poly-A tails account for only a small proportion of these unmapped ends (see also Materials and Methods). The difference between SPA and the other algorithms is caused mostly by SPA's ability to identify additional 5′ or 3′ exons that the other algorithms miss. For these cases we often found that one or more initial/terminal exons were separated from the other exons by a region in which the quality of the mapping was poor, containing an internal piece of the cDNA that could not be mapped at all. Frequently there are assembly gaps in these regions as well, and we suspect that misassemblies occur in some of these regions. The other algorithms, but in particular GMAP, tend not to extend their mappings beyond problematic areas, which causes them to miss the initial/terminal exons that lie beyond this region and that can be accurately mapped. As a result of this, the number of internal insertions is much smaller for GMAP than for the other algorithms.

The bottom two sets of bars in Figure 5 compare the quality of the mappings around the splice boundaries (see Materials and Methods for details). We see that SPA had fewer errors (mismatches, insertions, and deletions occurring within ten nucleotides of splice boundaries) and that the fraction of splice boundaries that do not match any of the known boundaries, i.e., the canonical GT-AG, the non-canonical GC-AG, or the U12 spliceosome boundary AT-AC, was lower for SPA than for any of the other algorithms. Sim4′s mappings contained almost twice as many errors around splice boundaries as SPA's mappings. The main reason for this is that Sim4 attempts to find canonical GT-AG splice sites and that it is willing to introduce multiple mismatches and insertions to accomplish this. In contrast, BLAT's mappings contained only a few more errors around splice boundaries, but BLAT's mappings contained a much higher fraction of boundaries that do not match any known splice site.

Table 1 shows the ten most frequent splice sites in the SPA, Sim4, and Spidey alignments with their associated frequencies. We found that the canonical GT-AG signal and non-canonical GC-AG signal used by the U2 spliceosome [7] were the two most abundant boundaries for all five algorithms. Interestingly, in spite of Sim4′s strong preference for canonical boundaries, and its willingness to allow errors around the splice boundaries to accommodate them (see Figure 5), it still ended up with a lower frequency of GT-AG boundaries than the mappings of SPA, GMAP, and BLAT. Spidey had an even lower frequency of canonical GT-AG. The overall frequency of GT-AG boundaries was a bit higher in GMAP's mappings than in SPA's mappings. This is the result of the relatively high frequency of non-canonical boundaries in areas of the cDNAs that remain unmapped in GMAP's mappings. When we restricted the analysis to splice boundaries that lay in areas of the cDNAs that were mapped by both SPA and GMAP, SPA had a higher frequency of GT-AG boundaries than GMAP.

thumbnail

Table 1.

The Ten Most Frequently Occurring Splice Sites in the Mappings of the Human Dataset by SPA, Sim4, GMAP, BLAT, and Spidey

doi:10.1371/journal.pgen.0020024.t001

Both BLAT and Sim4 showed a disproportionately high number of CT-AC boundaries. In the case of BLAT this is a result of the fact that it does not consider the possibility that the cDNA has been misoriented, whereas SPA, GMAP, and Spidey do consider this possibility. Sim4 also does not explicitly consider the possibility that the cDNA has been misoriented but instead attempts to find either canonical GT-AG or reverse-complemented CT-AC boundaries independently for each splice boundary. This frequently leads to mappings for which, in a single cDNA, some boundaries are GT-AG and others are CT-AC. Other frequent alternative boundaries of Sim4 and BLAT are mostly single-point mutant versions of the canonical GT-AG boundary. Spidey finds splice boundaries by first looking for a donor site and then choosing the acceptor site independently. This is reflected in the fact that all of its top ten boundaries use a donor site that matches either GT or GC. The AT-AC signal used by the U12 spliceosome [8] occurs near the top of the list of splice sites for SPA (0.17%), GMAP (0.17%), and BLAT (0.14%), whereas Sim4 and Spidey seem to miss this experimentally verified alternative boundary.

Finally, the splice boundary inference procedure employed by SPA leads to the emergence of a set of related splice sites, i.e., AG-CC, TG-CC, AG-GC, TG-GC, AC-CC, and TC-CC, that are even more abundant in SPA's mappings than the AT-AC boundaries and that together account for almost 1% of all splice boundaries. In SPA's initial parameter set, only the known boundaries GT-AG, GC-AG, and AT-AC were given high probabilities (see Materials and Methods). Thus, in the first round of mappings, whenever ambiguous boundaries occurred that did not allow for any of the known splice sites, SPA placed the boundary essentially at random. This seems to be the behavior of GMAP as well. However, the splice boundary inference revealed that many of these unknown boundaries can be explained by the novel set of splice sites just mentioned. These splice sites were thus given higher probability in the parameter set, and in the second round of mappings SPA now chose these boundaries with much higher frequency, as shown in Table 1. Since none of the other algorithms employ such splice boundary inference, none of them recover these novel alternative boundaries. However, when we applied our splice boundary inference procedure to GMAP's or BLAT's mappings, these alternative boundaries also appeared with relatively high frequency in their mappings. We are currently investigating these new splice sites in more detail (T. M. Chern, E. van Nimwegen, and M. Zavolan, unpublished data).

Another way of evaluating the quality of the mappings of the different algorithms is to compare the mappings with a reference set of human gene structures. Unfortunately, obtaining a large collection of human gene structures will always involve computational methods for mapping transcripts to the genome, which could introduce a systematic bias that favors the type of algorithm that was used in producing the reference set. However, the Consensus CDS project [9] combines a number of different lines of evidence with manual curation to produce a set of highly trusted protein coding exons in the human genome that are hopefully not significantly affected by systematic biases. As a further comparison of the mappings we intersected the mappings of all five algorithms with the full set of CCDS exons. The results are shown in Table 2.

thumbnail

Table 2.

The Intersection between the Mappings of the Different Algorithms and the CCDS Exons

doi:10.1371/journal.pgen.0020024.t002

We see that all mappings intersect roughly half of the nucleotides in CCDS exons. In this test as well SPA outperformed all other algorithms, although the difference is less than 1% in all cases. As in all other tests so far, Spidey performed worst on this test. It is noteworthy that Sim4 outperformed BLAT significantly, and that BLAT in turn outperformed GMAP on this test.

As a final test of the mappings we compared how well the sequences within the mappings of each algorithm are conserved across vertebrates. To this end we used the phastcon profiles [10], which assign a conservation score (a number between zero and one) to each nucleotide in the human genome based on a comparison with the genomes of chimpanzee, dog, mouse, rat, chicken, zebrafish, and fugu (see Materials and Methods for details). The results are shown in Figure 6. The left panel shows that, at any level of conservation, the number of nucleotides with at least that level of conservation is larger in SPA's mappings than in the mappings of any other algorithm. Thus, the higher number of mapped nucleotides in SPA's mappings are not concentrated in badly conserved regions but include highly conserved regions. In fact, as the right panel of Figure 6 shows, the nucleotides that are unique to each of the algorithms' mappings do not differ substantially in their distribution of conservation level. The right panel of Figure 6 shows the ratio of conservation distributions of nucleotides that are unique to SPA's mappings and nucleotides that are unique to each of the other algorithms' mappings. We see that, with the exception of Spidey, which has a significantly underrepresented fraction of highly conserved nucleotides, the ratio stays within the range 0.8–1.2 over the entire range of conservation level. These results show that the relatively large number of nucleotides that are mapped only by SPA have roughly the same distribution of conservation level as the relatively small number of nucleotides that are mapped only by the other algorithms.

thumbnail

Figure 6. Comparison of Conservation Statistics for the Mappings of the Human Dataset

Shown are the data for SPA versus Sim4 (green), GMAP (blue), BLAT (orange), and Spidey (pink). The left panel shows, as a function of conservation score c, the difference in the number of nucleotides with conservation score at least c between the SPA mappings and the mappings of each other algorithm. The right panel shows the relative distribution of conservation scores for the nucleotides that were unique to SPA's mappings and the nucleotides that were unique to the other algorithm's mappings. The panel shows, for each conservation score c, the ratio between the fraction of all nucleotides unique to SPA's mappings that have conservation score c and the fraction of all nucleotides unique to the other algorithm's mappings that have conservation score c.

doi:10.1371/journal.pgen.0020024.g006

Results on FANTOM3 Mouse Full-Length cDNAs

We ran SPA on all 102,143 cDNAs of the FANTOM3 dataset that had been mapped for the FANTOM3 project using BLAT with the –fine option, followed by post-processing to distinguish deletions from introns. As in the mappings of the human dataset, we compared the log-posterior probabilities of all SPA mappings with the FANTOM3 mappings using both SPA's scoring function as well as a scoring function that maximizes the overall posterior probability of the FANTOM3 mappings. The results are shown in Figure 7. As in the mappings of the human cDNAs, the change in scoring function affected these distributions relatively little. At almost any score difference there are more than ten times as many mappings with at least such a score difference in SPA's favor as there are in BLAT's favor. Overall, SPA had a better mapping on more than 70% of the cDNAs, a substantially better mapping (difference of ten or more) on more than 10% of the cDNAs, and a much better mapping (difference of 100 or more) on almost 5% of the cDNAs. In contrast, BLAT's mapping was much better than SPA's mapping on only 0.3% of the cDNAs.

thumbnail

Figure 7. Distributions of Differences in the Log-Posterior Probability of SPA and Post-Processed BLAT Alignments on the FANTOM3 Mouse cDNAs

The green curve shows the distribution of score differences for the cases in which SPA had a higher mapping score. The red curve shows the distribution of score differences for mappings where post-processed BLAT had a higher alignment score. Solid lines were obtained using SPA's scoring function and dashed lines using the scoring function that maximizes the posterior probability of BLAT's mappings. Both axes are shown on logarithmic scales.

doi:10.1371/journal.pgen.0020024.g007

To compare the quality of the mappings independently of a scoring function we again calculated a number of global mapping statistics. The relative numbers of nucleotides in errors of various kinds are shown in Figure 8. A more detailed set of statistics is shown in Table S2. One can see that all types of errors are significantly overrepresented in BLAT's mappings compared to SPA's mappings. The largest differences are in the number of unmapped cDNAs, the number of nucleotides in unmapped 5′ and 3′ ends, and the fraction of splice boundaries that do not match any of the known splice sites. However, the number of internal mismatches (internal insertions and mismatches) is also larger by about 20% in BLAT's mappings, and the number of nucleotides in errors around splice boundaries is larger by more than 50% in BLAT's mappings. Thus, the global statistics confirm the overall higher quality of SPA's mapping especially in the mapping of 5′ and 3′ ends and in the correct mapping of the splice boundaries.

thumbnail

Figure 8. Comparison of Global Statistics of the SPA Mappings and Post-Processed BLAT Mappings of the FANTOM3 Mouse cDNAs

The bars show, from top to bottom, the total number of unmatched nucleotides, the number of internally unmatched nucleotides (mismatches plus internal insertions), the number of nucleotides in unmapped 5′ and 3′ ends, the number of nucleotides in unmapped cDNAs, the number of errors (mismatches, insertions, and deletions) within ten alignment positions of the splice boundaries, and the fraction of splice boundaries not matching any known splice site (GT-AG, GC-AG, or AT-AC). The bars are scaled such that SPA's total numbers of errors are set to one for each type.

doi:10.1371/journal.pgen.0020024.g008

The ten most common splice boundaries in the SPA and BLAT mappings of the FANTOM3 cDNAs are shown in Table 3.

thumbnail

Table 3.

The Ten Most Frequently Occurring Splice Sites in the Mappings of the FANTOM3 Dataset by SPA and BLAT

doi:10.1371/journal.pgen.0020024.t003

The canonical GT-AG boundary was the most frequent in both SPA's and BLAT's mappings, but its frequency was 1.2% higher in SPA's mappings. The known non-canonical boundary GC-AG was about equally common in the SPA and the BLAT mappings. Both SPA and BLAT found the boundary AT-AC of the U12 spliceosome. Also very abundant in the SPA mappings were boundaries where one or both of the ends of the intron were in an assembly gap (where the wildcard nucleotide N occurs in the genome). As we describe below, BLAT often has problems with mappings in these areas. The fourth most abundant boundary in the BLAT mappings was CT-AC, i.e., the reverse-complement of the canonical boundary. As with the mappings of the human cDNAs, this boundary's high abundance is a result of the fact that BLAT does not explicitly recognize reverse-complemented cDNAs. It is interesting that one of the set of novel related splice boundaries that we identified in the human cDNA mappings also occurs with relatively high frequency in the mappings of the mouse cDNA, i.e., the TG-CC boundary. It is also interesting that many of the other high frequency boundaries, e.g., GT-CA, GG-AG, and GT-TG, occur in the top ten of both algorithms.

Figure 9 shows examples of the most common differences between the SPA and BLAT mappings that we have identified through manual inspection. Figure 9A and 9F show examples with large differences in score between the