Coordinator: Prof. Jacek Blazewicz
The Laboratory of Bioinformatics, The Institute of Bioorganic Chemistry, The
Polish Academy of Sciences, Poznan
The problem of assembling of DNA chains is one of the most important questions
in the computational biology domain. It consists of combining into a longer
piece, for example a gene, many DNA fragments cut out from copies of the gene
using, for example, the shotgun approach. During the shotgun process, one obtains
a set of overlapping fragments from both strands of the gene that differ in
their lengths. These fragments must be previously reconstructed in the DNA sequencing
process (see [1, 2, 3] for methods of DNA sequencing with the presence of negative
and positive errors). The assembling stage, just after the sequencing level,
is a fundamental method of recognizing the genetic information of organisms.
Next, the information is the subject of detailed analysis, e.g. in the genome
comparison or in the prediction of protein functions. The assembling problem
is, from the computational point of view, intractable. Thus, the construction
of an algorithm returning exact or near-exact solutions for the problem, in
the amount of time acceptable by the user, is an important topic.
That algorithm can be widely used in the scientific community because of the
necessity of frequent assembling of DNA chains in molecular biology labs. For
instance, most of the genetic information stored in databases all over the world
had to be previously assembled [5]. In order to solve the computational part
of the assembling problem effectively, it is necessary to use the most up-to-date
computers, especially composed of many processing units.
The DNA assembling problem can be defined as follows. As the input we have a
set of DNA fragments differing in their lengths (from a few dozens to a few
thousands of nucleotides). As the output we return a DNA chain, as short as
possible, composed of all given DNA fragments. The additional assumptions, as
well as experimental errors listed below, should also be taken into account.
The fragments from the input set can be derived from both strands of the original
DNA chain, and the information about their orientation is not accessible. The
fragments can overlap with some shifts, but they can also include one another
or even be duplicated in the input set. It can happen that two fragments neighboring
in the original DNA chain are disjoint if the information about a part of the
chain is missing in the input data. Because the fragments come from the DNA
sequencing stage, they include errors that arise at that stage, i.e. some misreadings
in the nucleotide sequences of the fragments. Hence, one cannot think about
the exact alignment (letter-to-letter) of two sequences overlapping during the
assembling process. Moreover, it is assumed that some parts of the fragments
from the input set can be unknown. The above difficulties can of course cumulate,
so in an extreme case in the input set we can find a DNA fragment and a fragment
from the other strand corresponding to it (reversed and complementary), both
having improperly read nucleotide sequences, even of different lengths. Recognizing
the two fragments as the same part of an original DNA chain can be very hard,
especially if the fragments overlap each other (or with another fragments from
the opposite strand) by an acceptable number of nucleotides. All the above assumptions
make the assembling process computationally hard, in the sense of time as well
as conceptually.
The program which will be implemented during the realization of the discussed
subject, will lead to the assembling of a DNA chain on the basis of the input
set as described above. In addition to the input data, the program will accept
a set of parameters defining options of a user, and the construction of the
solution will depend on the values of the parameters. The software will operate
in a multiprocessor environment, on a Sun computer, under the Solaris operating
system. It will be completed by a Windows user interface accessible via Internet.
(Earlier works concerning the WWW computational biology server from the Institute
of Computing Science, Poznan University of Technology, have been described in
[4].)
References
[1] J. Blazewicz, J. Kaczmarek, M. Kasprzak, W.T. Markiewicz, and J. Weglarz.
(1997). Sequential and parallel algorithms for DNA sequencing, Computer Applications
in the Biosciences 13, 151-158.
[2] J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, and J.Weglarz.
(1999). DNA sequencing with positive and negative errors, Journal of Computational
Biology 6, 113-123.
[3] J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz and J.Weglarz.
(2000). Tabu search for DNA sequencing with false negatives and false positives,
European Journal of Operational Research 125, 257-265.
[4] J. Blazewicz, P. Formanowicz, M. Kasprzak, and P. Wierzejewski. (2000).
Web server for computational biology problems, Proceedings of ISThmus 2000,
Poznan, 77-85.
[5] E. W. Myers et al.. (2000). A whole-genome assembly of Drosophila. Science
287, 2196-2204.