Design and implementation of algorithms for searching for motifs in nucleotide and amino acid sequences, and a construction of phylogenetic trees

Coordinator: Prof. Jacek Blazewicz
The Laboratory of Bioinformatics, The Institute of Bioorganic Chemistry, The Polish Academy of Sciences, Poznan

As a result of the realization of the Human Genome Project a rapid growth of the amount of biological sequences stored in databases is observed. While collecting these sequences is an important milestone in the genome research, the biggest challenge is to understand their meaning. Since there is no method of inferring such a meaning directly from a sequence of nucleotides or amino acids the problem is extremely hard. Obviously, there are methods used to discover the biological meaning of DNA, RNA and amino acid chains, but they are "indirect". Most of them are based on the comparison of many such sequences and looking for some analogies [1, 3, 4]. Such methods are useful especially if the meaning of one of the analyzed sequences is known. In case of the nucleotide sequences one usually searches for fragments of genes as well as the promotor regions. In case of the amino acid sequences it is very important to find these fragments of the polypeptide chains which are crucial for a function of a given protein.
These fragments are called motifs. They can be described in many different ways, which are dependent on the amount of information available [1]. From the algorithmic point of view the easiest way is to find fully defined substrings or subsequences. Such motifs can be efficiently found even in large sets of sequences. The reason why this approach is not frequently used is that it does not often happen that exactly the same subsequence appears many times in different biosequences. In case the information of an approximate structure of a motif is available, the problem is equivalent to the well-known problem of searching for regular expressions. Such a problem is important from the biological point of view, since it is applicable to finding counterparts of a gene in chromosomes coming from different species. The most difficult variant of the problem is when there is no information available about the structure of a motif. It is also the most important variant from the biological viewpoint. This problem appears if one looks for fragments of the polypeptide chains which are essential for a function of a given protein family. The same problem appears also in searching for the promotor fragments in DNA chains. Moreover, the importance of this problem comes from the fact that its solution brings the biggest amount of information. On the other hand, no information of the structure of a motif is required.
One more application of searching for motif problem is worth mentioning. Namely, the solution of this problem may give a hint which regions in a given set of sequences are especially useful in the phylogenetic analysis. It is known that different fragments of nucleotide and amino acid sequences evolve with different speed. So, it is very important to choose proper fragments for a process of construction of a phylogenetic tree. The subsequences that are crucial for important biological functions usually evolve slower than other fragments. Rapid changes in such subsequences might lead to serious malfunctioning of proteins. They are often lethal, which obviously limits the range of such mutations.
The phylogenetic analysis mentioned above is a very broad area of biology [1, 3, 4]. Its goal is to reconstruct the evolution history of species. Such a history is usually described as a tree. The root of the tree denotes a hypothetical ancestor of all species from the analyzed set. Leaves are the species which are, in a given point in time, the last stage of an evolution. All nodes of the tree located between the root and the leaves are the hypothetical species which appeared on different stages of the evolution.
There are two main classes of phylogenetic trees:
? trees built on the basis of the similarity among the sets of characters of the analyzed species,
? trees built on the basis of evolution distances between the given species.
Today phylogenetic trees are usually built on the basis of the nucleotide or amino acid sequences.
Many methods for the phylogenetic tree construction are known. However, the efficient algorithms are known only for simple models of the evolution with many unrealistic constraints. In such models a number of species or a number of possible character values are usually limited. The combinatorial problems connected with phylogenetic trees are usually computationally hard.
As a result of the project new sequential and parallel algorithms for searching for motifs will be proposed. In particular, they will be looking for the promotor regions in the nucleotide sequences. Although finding such regions is a very important problem very few papers have appeared which are devoted to the algorithmic aspects of this problem (cf. [2]).
Additionally, an algorithm for finding subsequences of the nucleotide and amino acid sequences, which are especially useful in phylogenetic analysis, will be proposed.
Parallel exact and heuristic algorithms for phylogenetic tree construction using the above algorithms for the motif search will also be proposed. In order to reach this goal it will be necessary to develop a method of input data decomposition. Such a decomposition is crucial for the effective work of an algorithm in a multiprocessor environment. Moreover, the research will be carried out to discover regions in the nucleotide and amino acid sequences that are evolutionary important.
The algorithms will be implemented on a multiprocessor Sun machine under the Solaris operating system. It will be possible to use this software through the Internet using a standard browser.

References
[1] D. Gusfield. (1997). Algorithms for Strings, Trees, and Sequences, Cambridge University Press, Cambridge 1997.
[2] L. Marsan, M.-F. Sagot. (2000). Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification, Proceedings of RECOMB 2000, 210-219.
[3] J. Setubal, J. Meidanis. (1997). Introduction to Computational Molecular Biology, PWS Publishing Company, Boston 1997.
[4] M. S. Waterman. (1995). Introduction to Computational Biology, Chapman & Hall, London 1995.