Coordinator: Prof. Jacek Blazewicz
The Laboratory of Bioinformatics, The Institute of Bioorganic Chemistry, The
Polish Academy of Sciences, Poznan
Determining the structure of a protein is not easy. It often takes months, and
sometimes years, to find the right mix of chemicals that will make protein
crystals form. It is known that the information that tells a protein how to fold
is contained in its own sequence. With all of the sequences being collected
in large projects such as the Human Genome Project and through individual
research efforts, it should be possible to predict the structure of a protein
from its sequence.
However, there is a problem with this approach. The code by which DNA instructs
the cell to make proteins was worked out very quickly. No similar code in the
amino acid sequence of proteins, one that tells them how to fold, has been
recognized yet. It is still not completely understood how the amino acid
sequence causes the protein to fold. This is called the "Protein Folding Problem".
The protein folding problem has now stood for over 30 years. The coding of the
protein structure by the amino acid sequence was first proposed by Anfinsen
and colleagues in 1961 [1]. The first attempts to predict the protein structure
from a sequence were made in the late 1960’s following the solution of
the first few 3-dimensional structures of proteins by X-ray crystallography.
The first notable successes in prediction came with the work of Chou and Fasman
[5] and of Robson and colleagues (known as GOR) [6]. Their programs predicted
whether each amino acid was part of a helix, a sheet or a coil region of a
protein. Chou and Fasman's method was correct for about 52% of amino acids, and
the GOR method for 54%. The GOR method relied on statistics gathered from
existing protein structures, whereas Chou and Fasman's method relied on simple
rules developed from the observation of structures. Both methods were applied
to single sequences.
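To illustrate how statistics gathered from known structures can drive a
per-residue prediction, here is a minimal sketch in Python. It is not the GOR
or the Chou and Fasman algorithm: it only counts how often each amino acid
occurs in each state and assigns the most frequent one. The training pairs, the
function names and the fallback to coil are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Toy training data (hypothetical, not real proteins): each pair is
# (amino acid sequence, secondary structure string), where
# H = helix, E = sheet and C = coil.
TRAINING = [
    ("AEELLKKA", "CHHHHHHC"),
    ("VIVTVGSG", "EEEEECCC"),
]

def count_states(pairs):
    """Count how often each amino acid is observed in each state."""
    counts = defaultdict(Counter)
    for sequence, states in pairs:
        if len(sequence) != len(states):
            raise ValueError("sequence and state string must have equal length")
        for residue, state in zip(sequence, states):
            counts[residue][state] += 1
    return counts

def predict(sequence, counts, default="C"):
    """Assign each residue the state in which it was most often seen."""
    predicted = []
    for residue in sequence:
        if residue in counts:
            predicted.append(counts[residue].most_common(1)[0][0])
        else:
            predicted.append(default)  # unseen residue: fall back to coil
    return "".join(predicted)

counts = count_states(TRAINING)
print(predict("ELKVIG", counts))
```

Because each residue is judged in isolation, a sketch like this ignores the
neighbouring residues, which is one reason single-sequence methods of this kind
are limited in accuracy.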
The importance of understanding protein structure comes from two factors
working together. The first is that the function of a protein depends on its
structure. The second is that it is extremely difficult to determine the
structure of a protein experimentally. The first level of protein structure,
termed the primary structure, is the sequence of amino acids in the protein.
Some parts of this chain can fold into regular structures, that is, structures
which have the same shape in different subchains of proteins. These form the
second level of protein structure, the secondary structure. The final shape is
made up of secondary structures, possibly super-secondary structural features,
and some apparently random conformations. This overall structure is referred to
as the tertiary structure. The first, and probably the most important, step in
predicting the tertiary structure from the primary structure is to predict as
many secondary structures as possible. Currently, the best method for protein
secondary structure prediction is based on neural networks and evolutionary
information [7]. It gives a prediction accuracy of over 70% for three-state
prediction (helix, sheet or coil). Unfortunately, it requires the existence of
similar proteins with known structures, which are not always available.
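The accuracy figures quoted for three-state prediction are per-residue
percentages: the share of amino acids whose predicted state matches the
observed one (a measure commonly called Q3). A minimal sketch of that
calculation follows; the function name and the two example strings are
hypothetical and serve only as illustration.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted state (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty structure string")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Hypothetical example: 14 residues, 3 of them predicted wrongly.
observed = "CHHHHHHCCEEEEC"
predicted = "CHHHHCCCCEEECC"
print(f"{q3_accuracy(predicted, observed):.1%}")
```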
The wish to create a tool to aid molecular biologists was the main reason for
choosing a new rule-based method, the Logical Analysis of Data (LAD), known for
its high accuracy [3]. It generates simple and strong rules which can be easily
interpreted by a domain expert. The Logical Analysis of Data gives impressive
results (accuracy over 90%) in many fields of science [2], [4], so it seemed
possible that a similar accuracy could be obtained for the problem in question.
The goal of the analysis is to create a system which predicts the secondary
structure of a protein from its primary structure and finds the rules
responsible for it. The system will be written for a parallel environment. The
LAD method will be implemented in C++ under the Solaris operating system. This
should make it possible to obtain results quickly and to process very large
amounts of biological data.
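The kind of rule that a LAD-style analysis produces can be sketched as follows.
This is a toy illustration in Python, not the planned C++ system and not the
implementation described in [3]. It assumes that each observation is a short
window of residues labelled helix or non-helix, that every (position, residue)
pair is one binary attribute, and that a pattern is a conjunction of attributes
which holds for at least one helix window and for no non-helix window. The
windows, the function names and the "any pattern matches" decision rule are
assumptions made for the example.

```python
from itertools import combinations

def binarize(window):
    """Turn a residue window into a set of binary attributes (position, residue)."""
    return frozenset(enumerate(window))

def find_patterns(positives, negatives, max_degree=2):
    """Enumerate conjunctions of up to max_degree attributes that hold for at
    least one positive window and for no negative window."""
    negative_sets = [binarize(w) for w in negatives]
    patterns = set()
    for window in positives:
        attributes = sorted(binarize(window))
        for degree in range(1, max_degree + 1):
            for combo in combinations(attributes, degree):
                candidate = frozenset(combo)
                if any(candidate <= neg for neg in negative_sets):
                    continue  # also true for a negative window: not a pattern
                if any(p < candidate for p in patterns):
                    continue  # a shorter pattern already implies this one
                patterns.add(candidate)
    return patterns

def predict_helix(window, patterns):
    """Predict helix when at least one pattern holds for the window."""
    attributes = binarize(window)
    return any(p <= attributes for p in patterns)

# Hypothetical three-residue windows; the label refers to the central residue.
helix_windows = ["ALK", "ELK", "ALE"]
other_windows = ["GPK", "VLG", "APG"]

patterns = find_patterns(helix_windows, other_windows)
for pattern in sorted(patterns, key=sorted):
    print(" AND ".join(f"pos{i}={aa}" for i, aa in sorted(pattern)))
print(predict_helix("ALG", patterns), predict_helix("GPG", patterns))
```

Each printed line is a rule a domain expert can read directly, for example
"position 0 is A and position 1 is L". A full LAD system would also select
among patterns and weigh patterns for both classes against each other; that
part is omitted here.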
This system will provide the following functionality:
- the ability to predict the secondary structure of a protein from its primary
structure;
- information and knowledge about computational and molecular biology tools
and algorithms for the protein structure prediction problem;
- results delivered to scientists almost immediately;
- wide accessibility via the Internet.
References
[1] Anfinsen C.B., 1973, Principles that govern the folding of protein chains,
Science 181, 223-230;
[2] Boros E., Hammer P., Kogan A., Mayoraz E., Muchnik I., 1994, Logical Analysis
of Data – overview, RUTCOR Research Report 01-94;
[3] Boros E., Hammer P., Ibaraki T., Kogan A., Mayoraz E., Muchnik I., 1996,
An implementation of logical analysis of data, RUTCOR Research Report 22-96;
[4] Boros E., Hammer P., Ibaraki T., Kogan A., 1997, Logical Analysis of Numerical
Data, RUTCOR Research Report 04-97;
[5] Chou P.Y., Fasman G.D., 1974, Conformational parameters for amino acids in
helical, beta-sheet, and random coil regions calculated from proteins,
Biochemistry 13, 211-222;
[6] Garnier J., Osguthorpe D.J., Robson B., 1978, Analysis of the accuracy and
implications of simple methods for predicting the secondary structure of globular
proteins, JMB 120, 97-120;
[7] Rost B., Sander C., 1993, Prediction of protein secondary structure at better
than 70% accuracy, JMB 232, 584-599;