Motivation: As more transgenic proteins are introduced, predicting their potential allergenicity has become an important issue. Previous studies relied either on sequence similarity or on protein motifs identified from databases of known allergens. Similarity-based approaches achieve high recall but usually have low precision. Previous motif-based approaches have been shown to improve precision in cross-validation experiments.
We present an algorithm to detect protein sub-structural motifs from primary sequence. Its input is a set of multiply aligned protein sequences. Each sequence is represented numerically using an amino acid index (such as polarity, accessible surface area or electron-ion interaction potential), and the resulting signal is decomposed with wavelet transforms. Because the numerical representation of a protein sequence correlates significantly with its biological activity, common motifs are expected to be observable in the wavelet spectrum.
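The encoding and decomposition steps can be illustrated with a minimal sketch. It is not the implementation used in this work, and several choices in it are assumptions made only for illustration: the Kyte-Doolittle hydropathy scale stands in for the indices named above, the Haar wavelet stands in for whichever wavelet family is actually used, gap characters are encoded as 0.0, and aligned sequences are combined by a simple column average. All function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values. ASSUMPTION: used only as a stand-in
# amino acid index; it is not one of the indices named in the text.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, index=HYDROPATHY):
    """Map each residue to its index value.

    ASSUMPTION: gaps ('-') and unknown residues are encoded as 0.0.
    """
    return [index.get(residue, 0.0) for residue in sequence.upper()]


def column_average(aligned_sequences, index=HYDROPATHY):
    """Average the encoded values in each alignment column.

    ASSUMPTION: a plain mean per column; all sequences must have
    the same aligned length.
    """
    encoded = [encode(seq, index) for seq in aligned_sequences]
    return [sum(column) / len(column) for column in zip(*encoded)]


def haar_step(signal):
    """One level of the orthonormal Haar transform.

    Returns (approximation, detail). An odd-length signal is padded
    by repeating its last value.
    """
    if len(signal) % 2:
        signal = signal + [signal[-1]]
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Multi-level Haar decomposition.

    Returns (final approximation, list of detail bands), with the
    finest-scale detail band first.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details
```

For a toy alignment such as `["MKVILAAG", "MKV-LAAG"]` (made-up sequences), `haar_decompose(column_average(alignment), 3)` returns one approximation coefficient and three detail bands. Positions where detail coefficients are large at the same scale would be candidate motif locations in this simplified picture.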