Design and implementation of the software for DNA assembling

Coordinator: Prof. Jacek Blazewicz
The Laboratory of Bioinformatics, The Institute of Bioorganic Chemistry, The Polish Academy of Sciences, Poznan

The problem of assembling of DNA chains is one of the most important questions in the computational biology domain. It consists of combining into a longer piece, for example a gene, many DNA fragments cut out from copies of the gene using, for example, the shotgun approach. During the shotgun process, one obtains a set of overlapping fragments from both strands of the gene that differ in their lengths. These fragments must be previously reconstructed in the DNA sequencing process (see [1, 2, 3] for methods of DNA sequencing with the presence of negative and positive errors). The assembling stage, just after the sequencing level, is a fundamental method of recognizing the genetic information of organisms. Next, the information is the subject of detailed analysis, e.g. in the genome comparison or in the prediction of protein functions. The assembling problem is, from the computational point of view, intractable. Thus, the construction of an algorithm returning exact or near-exact solutions for the problem, in the amount of time acceptable by the user, is an important topic.
That algorithm can be widely used in the scientific community because of the necessity of frequent assembling of DNA chains in molecular biology labs. For instance, most of the genetic information stored in databases all over the world had to be previously assembled [5]. In order to solve the computational part of the assembling problem effectively, it is necessary to use the most up-to-date computers, especially composed of many processing units.
The DNA assembling problem can be defined as follows. As the input we have a set of DNA fragments differing in their lengths (from a few dozens to a few thousands of nucleotides). As the output we return a DNA chain, as short as possible, composed of all given DNA fragments. The additional assumptions, as well as experimental errors listed below, should also be taken into account. The fragments from the input set can be derived from both strands of the original DNA chain, and the information about their orientation is not accessible. The fragments can overlap with some shifts, but they can also include one another or even be duplicated in the input set. It can happen that two fragments neighboring in the original DNA chain are disjoint if the information about a part of the chain is missing in the input data. Because the fragments come from the DNA sequencing stage, they include errors that arise at that stage, i.e. some misreadings in the nucleotide sequences of the fragments. Hence, one cannot think about the exact alignment (letter-to-letter) of two sequences overlapping during the assembling process. Moreover, it is assumed that some parts of the fragments from the input set can be unknown. The above difficulties can of course cumulate, so in an extreme case in the input set we can find a DNA fragment and a fragment from the other strand corresponding to it (reversed and complementary), both having improperly read nucleotide sequences, even of different lengths. Recognizing the two fragments as the same part of an original DNA chain can be very hard, especially if the fragments overlap each other (or with another fragments from the opposite strand) by an acceptable number of nucleotides. All the above assumptions make the assembling process computationally hard, in the sense of time as well as conceptually.
The program which will be implemented during the realization of the discussed subject, will lead to the assembling of a DNA chain on the basis of the input set as described above. In addition to the input data, the program will accept a set of parameters defining options of a user, and the construction of the solution will depend on the values of the parameters. The software will operate in a multiprocessor environment, on a Sun computer, under the Solaris operating system. It will be completed by a Windows user interface accessible via Internet. (Earlier works concerning the WWW computational biology server from the Institute of Computing Science, Poznan University of Technology, have been described in [4].)

[1] J. Blazewicz, J. Kaczmarek, M. Kasprzak, W.T. Markiewicz, and J. Weglarz. (1997). Sequential and parallel algorithms for DNA sequencing, Computer Applications in the Biosciences 13, 151-158.
[2] J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz, and J.Weglarz. (1999). DNA sequencing with positive and negative errors, Journal of Computational Biology 6, 113-123.
[3] J. Blazewicz, P. Formanowicz, M. Kasprzak, W.T. Markiewicz and J.Weglarz. (2000). Tabu search for DNA sequencing with false negatives and false positives, European Journal of Operational Research 125, 257-265.
[4] J. Blazewicz, P. Formanowicz, M. Kasprzak, and P. Wierzejewski. (2000). Web server for computational biology problems, Proceedings of ISThmus 2000, Poznan, 77-85.
[5] E. W. Myers et al.. (2000). A whole-genome assembly of Drosophila. Science 287, 2196-2204.