Coordinator: Prof. Jacek Blazewicz
The Laboratory of Bioinformatics, The Institute of Bioorganic Chemistry, The
Polish Academy of Sciences, Poznan
As a result of the realization of the Human Genome Project a rapid growth of
the amount of biological sequences stored in databases is observed. While collecting
these sequences is an important milestone in the genome research, the biggest
challenge is to understand their meaning. Since there is no method of inferring
such a meaning directly from a sequence of nucleotides or amino acids the problem
is extremely hard. Obviously, there are methods used to discover the biological
meaning of DNA, RNA and amino acid chains, but they are "indirect".
Most of them are based on the comparison of many such sequences and looking
for some analogies [1, 3, 4]. Such methods are useful especially if the meaning
of one of the analyzed sequences is known. In case of the nucleotide sequences
one usually searches for fragments of genes as well as the promotor regions.
In case of the amino acid sequences it is very important to find these fragments
of the polypeptide chains which are crucial for a function of a given protein.
These fragments are called motifs. They can be described in many different ways,
which are dependent on the amount of information available [1]. From the algorithmic
point of view the easiest way is to find fully defined substrings or subsequences.
Such motifs can be efficiently found even in large sets of sequences. The reason
why this approach is not frequently used is that it does not often happen that
exactly the same subsequence appears many times in different biosequences. In
case the information of an approximate structure of a motif is available, the
problem is equivalent to the well-known problem of searching for regular expressions.
Such a problem is important from the biological point of view, since it is applicable
to finding counterparts of a gene in chromosomes coming from different species.
The most difficult variant of the problem is when there is no information available
about the structure of a motif. It is also the most important variant from the
biological viewpoint. This problem appears if one looks for fragments of the
polypeptide chains which are essential for a function of a given protein family.
The same problem appears also in searching for the promotor fragments in DNA
chains. Moreover, the importance of this problem comes from the fact that its
solution brings the biggest amount of information. On the other hand, no information
of the structure of a motif is required.
One more application of searching for motif problem is worth mentioning. Namely,
the solution of this problem may give a hint which regions in a given set of
sequences are especially useful in the phylogenetic analysis. It is known that
different fragments of nucleotide and amino acid sequences evolve with different
speed. So, it is very important to choose proper fragments for a process of
construction of a phylogenetic tree. The subsequences that are crucial for important
biological functions usually evolve slower than other fragments. Rapid changes
in such subsequences might lead to serious malfunctioning of proteins. They
are often lethal, which obviously limits the range of such mutations.
The phylogenetic analysis mentioned above is a very broad area of biology [1,
3, 4]. Its goal is to reconstruct the evolution history of species. Such a history
is usually described as a tree. The root of the tree denotes a hypothetical
ancestor of all species from the analyzed set. Leaves are the species which
are, in a given point in time, the last stage of an evolution. All nodes of
the tree located between the root and the leaves are the hypothetical species
which appeared on different stages of the evolution.
There are two main classes of phylogenetic trees:
? trees built on the basis of the similarity among the sets of characters of
the analyzed species,
? trees built on the basis of evolution distances between the given species.
Today phylogenetic trees are usually built on the basis of the nucleotide or
amino acid sequences.
Many methods for the phylogenetic tree construction are known. However, the
efficient algorithms are known only for simple models of the evolution with
many unrealistic constraints. In such models a number of species or a number
of possible character values are usually limited. The combinatorial problems
connected with phylogenetic trees are usually computationally hard.
As a result of the project new sequential and parallel algorithms for searching
for motifs will be proposed. In particular, they will be looking for the promotor
regions in the nucleotide sequences. Although finding such regions is a very
important problem very few papers have appeared which are devoted to the algorithmic
aspects of this problem (cf. [2]).
Additionally, an algorithm for finding subsequences of the nucleotide and amino
acid sequences, which are especially useful in phylogenetic analysis, will be
proposed.
Parallel exact and heuristic algorithms for phylogenetic tree construction using
the above algorithms for the motif search will also be proposed. In order to
reach this goal it will be necessary to develop a method of input data decomposition.
Such a decomposition is crucial for the effective work of an algorithm in a
multiprocessor environment. Moreover, the research will be carried out to discover
regions in the nucleotide and amino acid sequences that are evolutionary important.
The algorithms will be implemented on a multiprocessor Sun machine under the
Solaris operating system. It will be possible to use this software through the
Internet using a standard browser.
References
[1] D. Gusfield. (1997). Algorithms for Strings, Trees, and Sequences, Cambridge
University Press, Cambridge 1997.
[2] L. Marsan, M.-F. Sagot. (2000). Extracting structured motifs using a suffix
tree - algorithms and application to promoter consensus identification, Proceedings
of RECOMB 2000, 210-219.
[3] J. Setubal, J. Meidanis. (1997). Introduction to Computational Molecular
Biology, PWS Publishing Company, Boston 1997.
[4] M. S. Waterman. (1995). Introduction to Computational Biology, Chapman &
Hall, London 1995.