ASSP

Alternative Splice Site Predictor



Evaluation of ASSP Results


Identification of exons


Figure 1. Example of exon 6 (2586..2713) of aryl-sulfatase A precursor.

Coding sequences are generally characterized by trinucleotide frequencies that differ markedly from those observed in non-coding sequences. Codon usage estimates the probability that a sequence is coding by comparing its observed trinucleotide frequencies with those of known coding sequences. ASSP calculates log-likelihood values: codon usage values above zero indicate that a subsequence is probably coding, while sequence stretches with values below zero are probably non-coding. Codon usage is calculated for a sliding window of a given size. Window sizes of 20 to 50 nucleotides are usually adequate for identifying exons; for short exons, small window sizes may be preferred. Codon usage and stop codons are calculated for all three possible reading frames (F1: frame 1, F2: frame 2, F3: frame 3). Stop codons should usually not be observed within exons read in the corresponding frame. Splice sites, especially constitutive ones, are usually detected near the putative boundaries of exons, which are indicated by high codon usage values (figure 1).
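As a rough illustration of the sliding-window calculation described above, the following Python sketch sums per-codon log-likelihood ratios over each window in all three reading frames and lists in-frame stop codons. The function names, the pseudocount, the use of a log ratio between a coding and a non-coding frequency table, and the window step of one codon are assumptions made for this example; ASSP's own frequency tables and scoring details are not given on this page.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
STOPS = {"TAA", "TAG", "TGA"}


def codon_frequencies(seqs, pseudocount=1.0):
    """Relative in-frame trinucleotide frequencies (hypothetical training helper)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 2, 3):
            codon = s[i:i + 3]
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def codon_usage_profile(seq, coding, noncoding, window=30):
    """Return {frame: [(window_start, score), ...]} for frames 1 to 3.

    A score above zero suggests the window is coding in that frame.
    """
    seq = seq.upper()
    llr = {c: math.log(coding[c] / noncoding[c]) for c in CODONS}
    profile = {}
    for frame in range(3):
        scores = []
        # Slide the window one codon at a time, staying in the same frame.
        for start in range(frame, len(seq) - window + 1, 3):
            codons = (seq[i:i + 3] for i in range(start, start + window - 2, 3))
            # Codons with ambiguous bases contribute nothing to the score.
            scores.append((start, sum(llr.get(c, 0.0) for c in codons)))
        profile[frame + 1] = scores
    return profile


def stop_codons(seq, frame):
    """0-based positions of stop codons read in the given frame (1, 2 or 3)."""
    seq = seq.upper()
    return [i for i in range(frame - 1, len(seq) - 2, 3) if seq[i:i + 3] in STOPS]
```

With frequency tables trained on real coding and non-coding sequences, a run of positive window scores in one frame that contains no stop codon in that frame would mark a candidate exon.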

 

Interpretation of splice site hits

Alternative isoform and cryptic splice sites are characterized by several sequence features which allow some separation from constitutive and most skipped ones (for details see Wang and Marín 2006). First, cryptic and especially alternative isoform splice sites show low splice site scores (figure 2). Second, most cryptic exons are flanked by GC-poor introns (GC < 0.45; ASSP provides the GC value of the introns adjacent to each putative splice site). Third, introns adjacent to cryptic splice sites are usually AG-independent (figure 3), i.e. the recognition of the acceptor site during the first step of splicing does not depend on the binding of the splicing factor U2AF35, and most of them contain the splice site core AG|H (with H being any nucleotide other than G). In contrast, most skipped and constitutive splice sites that are preceded by GC-poor introns are AG-dependent (most of them contain the splice site core AG|G). GC-rich alternative isoform, skipped, and constitutive splice sites are either AG-dependent or AG-independent.
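The two sequence features named above that can be read directly from the sequence, the GC content of the adjacent intron and the acceptor core (AG|G versus AG|H), can be sketched in Python as follows. The function names are hypothetical, and taking the 70 nucleotides that end with the AG as the "adjacent intron" is an assumption for this example. The sketch only reports the core; it does not compute a U2AF35 binding site score and therefore does not decide AG-dependence.

```python
def gc_content(seq):
    """Fraction of G and C in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def acceptor_features(seq, ag_pos, flank=70, gc_poor_below=0.45):
    """Describe a putative acceptor site.

    ag_pos is the 0-based index of the A in the acceptor AG dinucleotide.
    The intron window is the `flank` nucleotides ending with that AG
    (an assumption; the page only says "adjacent 70 nt").
    """
    seq = seq.upper()
    if seq[ag_pos:ag_pos + 2] != "AG":
        raise ValueError("no AG dinucleotide at ag_pos")
    intron = seq[max(0, ag_pos + 2 - flank):ag_pos + 2]
    first_exon_base = seq[ag_pos + 2:ag_pos + 3]
    if not first_exon_base:
        core = None  # AG lies at the very end of the sequence
    elif first_exon_base == "G":
        core = "AG|G"
    else:
        core = "AG|H"
    gc = gc_content(intron)
    return {"intron_gc": gc, "gc_poor": gc < gc_poor_below, "core": core}
```

A site reported as GC-poor with an AG|H core matches the profile the text gives for most cryptic acceptor sites, but these features allow only partial separation of the classes.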


Figure 2. Splice site scores of constitutive, skipped, cryptic and alternative isoform acceptor and donor splice sites.



Figure 3. Relation of U2AF35 binding site score (AG-dependence) and intronic GC content (adjacent 70 nt) of constitutive, skipped, cryptic and alternative exon isoform acceptor and donor splice sites.

 

Cutoff values of pre-processing models

ASSP uses pre-processing models to scan a sequence for putative splice sites: a position-specific score matrix (PSSM) for acceptor sites and a combination of 22 PSSMs, built using maximum dependence decomposition, for donor sites. A putative splice site is passed on to a backpropagation network for classification only when the score of the respective matrix indicates a hit. The pre-processing matrices thus eliminate false splice sites (subsequences containing AG or GT, respectively, which do not correspond to real splice sites), which are characterized by a low splice site strength. The cutoff value (score threshold) of these matrices may be selected by the user and should be chosen depending on the application. Since many alternative exon isoform splice sites, and to a lesser extent cryptic splice sites, have low splice site scores, high cutoff values will result in reduced detection of alternative isoform and some cryptic splice sites. The following graphs (figure 4) show the relation between cutoff values and the percentages of false positives (false splice sites) and false negatives (missed alternative isoform/cryptic splice sites). The default cutoff values of 2.2 for acceptor sites and 4.5 for donor sites correspond to a correct identification of about 75 percent (acceptor site) and 80 percent (donor site) of false and alternative isoform/cryptic splice sites.
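The filtering step for acceptor sites can be sketched as a single PSSM scan with a score threshold. This is a minimal Python illustration, not ASSP's implementation: the function name, the five-column toy matrix and its values are invented for the example, and only the default cutoff of 2.2 is taken from the text. The donor model (22 PSSMs combined by maximum dependence decomposition) is not sketched here.

```python
# Toy log-odds matrix, one dict per window position. The values are made up
# for illustration and are NOT the matrix used by ASSP.
TOY_PSSM = [
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": 1.0},  # pyrimidine-rich position
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": 1.0},  # pyrimidine-rich position
    {"A": 1.0},                                  # A of the AG dinucleotide
    {"G": 1.0},                                  # G of the AG dinucleotide
    {"A": 0.0, "C": 0.0, "G": 0.5, "T": 0.0},    # first exon nucleotide
]


def scan_acceptors(seq, pssm, ag_offset, cutoff=2.2):
    """Return [(position_of_AG, score), ...] for windows scoring >= cutoff.

    ag_offset is the index within the window at which the AG starts.
    Windows without an AG at that offset are skipped entirely.
    """
    seq = seq.upper()
    width = len(pssm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if window[ag_offset:ag_offset + 2] != "AG":
            continue
        # Bases missing from a column (e.g. N) score 0.0.
        score = sum(col.get(base, 0.0) for col, base in zip(pssm, window))
        if score >= cutoff:
            hits.append((start + ag_offset, score))
    return hits
```

Lowering `cutoff` lets weaker sites through to the classifier, which is the trade-off the text describes: more alternative isoform and cryptic sites are retained, at the price of more false splice sites.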

 


Figure 4. The relation between cutoff values (score thresholds) of the pre-processing models of ASSP and the percentages of false negatives (missed alternative isoform/cryptic splice sites) and false positives (false splice sites).

 

Classification

Once splice sites are recognized by the pre-processing models, they are classified by the corresponding backpropagation network. Details about the neural network's output activations and the confidence of the classification are given in the ASSP output. Since neural networks are not probabilistic models, the output activations, which indicate the class a putative splice site is assigned to, do not correspond to probabilities and do not sum to one. Output activations range between zero and one. Activations of constitutive: 0.95 and alternative isoform/cryptic: 0.10 indicate a fairly reliable classification, while activations of constitutive: 0.44 and alternative isoform/cryptic: 0.51 indicate an unreliable classification. A simple confidence measure expresses the relation of the output activations to the optimal classification. The classification performance for constitutive, skipped, cryptic, and alternative isoform splice sites is listed in table 1.
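The page does not state the formula behind the confidence measure. Purely as an illustration of how two activations can be compared with the optimal output (1 for the assigned class, 0 for the other), the Python sketch below uses one possible definition: one minus the mean absolute deviation from that optimum. Both the formula and the function name are assumptions and should not be read as ASSP's actual measure.

```python
def classify(constitutive, alternative):
    """Pick the class with the larger activation and rate the decision.

    The confidence formula is a hypothetical stand-in:
    1 - mean absolute deviation from the ideal output (winner = 1, loser = 0).
    """
    if constitutive >= alternative:
        label, win, lose = "constitutive", constitutive, alternative
    else:
        label, win, lose = "alternative isoform/cryptic", alternative, constitutive
    confidence = 1.0 - ((1.0 - win) + lose) / 2.0
    return label, confidence
```

Under this definition the first example from the text (0.95 and 0.10) is assigned to the constitutive class with a value close to one, while the second (0.44 and 0.51) is assigned to the alternative isoform/cryptic class with a value near one half, reflecting an unreliable call.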

                Acceptor Sites                       Donor Sites
                Classification                       Classification
                Constitutive   Isoform / Cryptic     Constitutive   Isoform / Cryptic
Constitutive    70.77 %        29.23 %               73.40 %        26.60 %
Skipped         64.39 %        35.61 %               66.39 %        33.61 %
Cryptic         40.88 %        59.22 %               41.48 %        58.52 %
Isoform         27.64 %        72.36 %               17.86 %        82.14 %

Table 1. Classification performance of the backpropagation networks for acceptor and donor sites implemented in ASSP.

 


Reference

Wang M. and Marín A. 2006. Characterization and prediction of alternative splice sites. Gene 366: 219-227.




Last Changes 10.01.2011