Design and implementation of the system for prediction of protein secondary structure using Logical Analysis of Data

Coordinator: Prof. Jacek Blazewicz
The Laboratory of Bioinformatics, The Institute of Bioorganic Chemistry, The Polish Academy of Sciences, Poznan

Determining the structure of a protein is not easy. It often takes months, and sometimes years, to find the right mix of chemicals that will make protein crystals form. It is known that the information that tells a protein how to fold is contained in its own sequence. With all the sequences being collected in large projects such as the Human Genome Project, and in the work of individual laboratories, it should be possible to predict the structure of a protein from its sequence. However, there is a problem with this approach. The code by which DNA instructs the cell to make proteins was worked out very quickly, but no similar code in the amino acid sequence that tells a protein how to fold has been recognized yet. It is still not completely understood how the amino acid sequence causes the protein to fold. This problem is called the "Protein Folding Problem".
The protein folding problem has now stood for over 30 years. The idea that the amino acid sequence encodes the protein structure was first proposed by Anfinsen and colleagues in 1961 [1]. The first attempts to predict protein structure from a sequence were made in the late 1960s, following the solution of the first few 3-dimensional protein structures by X-ray crystallography. The first notable successes in prediction came with the work of Chou and Fasman [5] and of Robson and colleagues (known as GOR) [6]. These programs were capable of predicting whether each amino acid was part of a helix, a sheet or a coil region in a protein. Chou and Fasman's method was correct for about 52% of amino acids, and the GOR method for 54%. The GOR method relied on statistics gathered from existing protein structures, whereas Chou and Fasman's method relied on simple rules developed from observation of structures. Both methods were applied to single sequences.
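The per-residue accuracy figures quoted above can be illustrated with a short sketch. It assumes that accuracy means the fraction of residues assigned the correct state out of three (helix, sheet, coil), a measure commonly called Q3. The function name `q3_accuracy` and the one-letter state codes `H`, `E` and `C` are assumptions made for this illustration and are not taken from the cited programs.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Fraction of residues whose predicted state matches the observed state.
// Both strings hold one state letter per residue (assumed codes:
// 'H' = helix, 'E' = sheet, 'C' = coil).
double q3_accuracy(const std::string& predicted, const std::string& observed) {
    if (predicted.empty() || predicted.size() != observed.size()) {
        throw std::invalid_argument("state strings must be non-empty and of equal length");
    }
    std::size_t correct = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        if (predicted[i] == observed[i]) {
            ++correct;
        }
    }
    return static_cast<double>(correct) / static_cast<double>(predicted.size());
}
```

For instance, `q3_accuracy("HHHC", "HHCC")` compares four residues, three of which agree, and so returns 0.75.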
The importance of understanding protein structure comes from two factors working together. The first is that the function of a protein depends on its structure. The second is that it is extremely difficult to determine the structure of a protein experimentally. The first level of protein structure, termed the primary structure, is the sequence of amino acids in the protein. Some parts of this chain can fold into regular structures, that is, structures which have the same shape in different subchains of proteins. These regular structures form the second level, the secondary structure. The final shape is made up of secondary structures, possibly super-secondary structural features, and some apparently random conformations. This overall structure is referred to as the tertiary structure. The first and probably the most important step in predicting the tertiary structure of a protein from its primary structure is to predict as many secondary structures as possible. Nowadays, the best method for protein secondary structure prediction is based on neural networks and evolutionary information [7]. It gives a prediction accuracy of over 70% for three-state prediction. Unfortunately, it requires similar proteins with known structures, which are not always available.
The aim of creating a tool to aid molecular biologists was the main reason for choosing a new rule-based method, the Logical Analysis of Data (LAD), which is known for its high accuracy [3]. It generates simple and strong rules that can be easily interpreted by a domain expert. LAD gives impressive results (accuracy over 90%) in many fields of science [2], [4], so it seemed possible that similar accuracy could be obtained for the problem in question. The goal of the project is to create a system that predicts the secondary structure of a protein from its primary structure and finds the rules responsible for the prediction. The system will be written for a parallel environment. The LAD method will be implemented in C++ under the Solaris operating system. This should make it possible to obtain results quickly and to process large amounts of biological data.
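To make the notion of an interpretable rule more concrete, the sketch below shows one way a LAD-style pattern could be applied to a sequence. It is a sketch under assumptions, not the system's implementation: a pattern is modelled as a conjunction of conditions on the residues in a window around a position, and the weights of all patterns covering that position are summed into a score. The names `Literal`, `Pattern`, `pattern_covers` and `discriminant` are hypothetical, and the residue sets in the usage note are placeholders, not rules learned from data.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One condition of a pattern (hypothetical encoding): the residue at
// `offset` from the window centre must belong to `residues`, or must
// not belong to it when `negated` is true.
struct Literal {
    int offset;
    std::string residues;
    bool negated;
};

// A pattern is a conjunction of literals with a signed weight:
// a positive weight votes for the class, a negative one against it.
struct Pattern {
    std::vector<Literal> literals;
    double weight;
};

// A literal that points outside the sequence is treated as not satisfied.
bool literal_holds(const Literal& lit, const std::string& seq, std::size_t centre) {
    const long pos = static_cast<long>(centre) + lit.offset;
    if (pos < 0 || pos >= static_cast<long>(seq.size())) {
        return false;
    }
    const bool member =
        lit.residues.find(seq[static_cast<std::size_t>(pos)]) != std::string::npos;
    return member != lit.negated;
}

// A pattern covers a position only if every one of its literals holds there.
bool pattern_covers(const Pattern& p, const std::string& seq, std::size_t centre) {
    for (const Literal& lit : p.literals) {
        if (!literal_holds(lit, seq, centre)) {
            return false;
        }
    }
    return true;
}

// Sum of the weights of all patterns covering the position; the sign of
// the result can serve as the class decision.
double discriminant(const std::vector<Pattern>& patterns,
                    const std::string& seq, std::size_t centre) {
    double score = 0.0;
    for (const Pattern& p : patterns) {
        if (pattern_covers(p, seq, centre)) {
            score += p.weight;
        }
    }
    return score;
}
```

As an illustration with placeholder rules, take one pattern of weight 1.0 requiring the residues at offsets 0 and 1 to be in the set "AELM", and one pattern of weight -1.0 requiring the residue at offset 0 to be in "GP". For the sequence "AEGP", the score is 1.0 at position 0, 0.0 at position 1 and -1.0 at position 2. Because each pattern is a short list of readable conditions, a domain expert can inspect which conditions produced a given prediction.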
This system will provide the following functionality:
- prediction of protein secondary structure from its primary structure;
- information and knowledge about computational and molecular biology tools and algorithms for the protein structure prediction problem;
- results delivered to the scientist almost immediately;
- wide accessibility via the Internet.

[1] Anfinsen C.B., 1973, Principles that govern the folding of protein chains, Science 181, 223-230;
[2] Boros E., Hammer P., Kogan A., Mayoraz E., Muchnik I., 1994, Logical Analysis of Data – overview, RUTCOR Research Report 01-94;
[3] Boros E., Hammer P., Ibaraki T., Kogan A., Mayoraz E., Muchnik I., 1996, An implementation of logical analysis of data, RUTCOR Research Report 22-96;
[4] Boros E., Hammer P., Ibaraki T., Kogan A., 1997, Logical Analysis of Numerical Data, RUTCOR Research Report 04-97;
[5] Chou P.Y., Fasman G.D., 1974, Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins, Biochemistry 13, 211-222;
[6] Garnier J., Osguthorpe D.J., Robson B., 1978, Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins, JMB 120, 97-120;
[7] Rost B., Sander C., 1993, Prediction of protein secondary structure at better than 70% accuracy, JMB 232, 584-599;