ARTIFICIAL INTELLIGENCE TECHNIQUES FOR  ANALYZING THE 3-D STRUCTURE OF PROTEINS: Designing New Materials and Proteins

 

Bhavin V. Mehta,  Department of Mechanical Engineering

Shruti B. Mehta, Department of Electrical Engineering & CS

Luis C. Rabelo, Department of Industrial Engineering

John J. Kopchick, Edison Biotechnology Institute

Ohio University, Athens, Ohio 45701

 

ABSTRACT

 

Proteins are an essential part of all living organisms and are associated with the growth, repair and reproduction of living cells. All proteins are polymers built from a finite set of twenty naturally occurring amino acids. A protein is made up of one or more polypeptide chains, and each polypeptide chain comprises many amino acids arranged in a particular sequence. These amino acids can combine in a virtually infinite number of ways to create a myriad of protein species, each with a unique three dimensional structure. The many specific tasks served by proteins are largely dependent on their structural form. The structure of proteins can be classified into three main levels: (a) Primary, (b) Secondary and (c) Tertiary. A fourth structural level (Quaternary) is considered if the protein consists of more than a single chain. The three dimensional structure of a protein is its secondary and tertiary structure. The sequence of amino acids in a protein determines how a particular portion of the polypeptide chain folds into one of the secondary structures. The uncertainty in the way a protein sequence folds has been the subject of significant research. Several techniques have been developed for protein secondary structure prediction using different types of statistical and neural network methodologies. In this paper, the three dimensional structure of a newly designed (or mutated) protein is predicted by coupling an energy minimization technique with either an artificial neural network scheme or a statistical technique. The Fuzzy ARTMAP paradigm, the Probabilistic Neural Network (PNN) and the Generalized Regression Neural Network (GRNN) are used for the neural network scheme, and a modified Chou-Fasman algorithm is used for the statistical technique. The method was applied to the bovine (bGH) and human (hGH) growth hormones and their agonists and antagonists.
The in-house molecular modeling package was used to perform mutations on the first and third alpha helices of the bGH and hGH molecules, simulating both breaking and strengthening of the helical structure, and the new structure was predicted using the coupled minimization-NN and minimization-CF techniques. The results of this new technique are compared with those of the existing techniques. The growth patterns observed in genetically altered mice in experiments conducted in the Edison biotechnology laboratory were compared with the same mutations performed on the CAD system using the modeling and prediction software.

 

1.0     Introduction to Computer Aided Molecular Simulation

 

Computer aided molecular simulation is the basis for the design of new proteins. It uses analysis, prediction and modeling to design new proteins and visualize them in 3-D on a Computer Aided Design (CAD) system. The primary structure of a protein is characterized by the amino acid sequence of the constituent polypeptide. Structural motifs within the polypeptide chain are termed the secondary structures. The three most common types of secondary structure in proteins are helices, pleated sheets, and random coils. The sequence of amino acids determines whether a particular portion of the polypeptide chain will fold into one of the three secondary structures. The problem of predicting secondary structures of proteins is fairly complex. It has been tackled using statistical methods which exploit the correlation between the sequence of amino acids and their local secondary structure. One of the most commonly used statistical methods was developed by Chou and Fasman (1,2). However, only a few studies have employed artificial intelligence techniques to investigate the protein prediction problem. Fuzzy ARTMAP is an incremental supervised learning algorithm which combines fuzzy logic and an adaptive resonance theory neural network for the recognition of pattern categories (3,4). PNN networks are three layer networks wherein the training patterns are presented to the input layer. The output layer has one neuron for each possible category. PNN is a direct outgrowth of Bayesian classifiers and has been successfully used to solve a diverse group of classification problems (16).

 

With the aid of molecular computer simulation, the human growth hormone (hGH) gene was altered to reflect the change of one amino acid (G120R) within the third α-helix of the hGH molecule. This hGH analog, which is secreted by recombinant mouse L cells, possesses the same binding affinity to mouse liver membrane preparations as the wild type hGH. Transgenic mice which express this mutated hGH gene showed a significant growth-suppressed phenotype. The hGH analog, named hGHG120R, acts as an hGH antagonist. It may have important implications in treating human conditions with abnormally high hGH levels, such as acromegaly and diabetic eye and kidney problems.

 

Molecular simulation is a technique by which models of proteins, chemicals and polymers are constructed in 3-dimensional space so that their structural and functional properties may be predicted before they are synthesized. This process is termed "Rational Drug Design" when the compounds may have therapeutic value. The traditional method for simulating molecular structure uses plastic space filling models to predict the most stable conformation of the structure. It is time-consuming and inflexible. In recent years, computer modeling of the 3-dimensional structures of proteins has grown rapidly, because significant breakthroughs in computational chemistry and computer architecture have generated considerable research potential in the area of molecular simulation. Computer modeling systems have made it possible to develop 3-dimensional computer models of molecular structures. Computerized molecular simulation and graphics are an enormous improvement over the earlier, more traditional methods because the computer models can be generated in real time and the modeling parameters can be varied interactively with little effort. Moreover, the effects of small changes in recombinant proteins arising from genetic mutations can be characterized and analyzed using computer simulation.

 

The molecular simulation facility at Ohio University has therefore been designed to model GH analog structures as well as the structures of other proteins. The commercial package for molecular modeling (Charmm21), developed by the Chemistry Department of Harvard University, has been interfaced with the Intergraph CAD system at Ohio University (14). The graphical interface allows the construction of complex protein structures such as α-helices, β-sheets, or linear chains in a menu-driven, interactive and user-friendly manner. The 3-dimensional structure can be dynamically rotated, zoomed or moved, and energy minimization can be easily performed. The third α-helix of bGH was modeled using this system. A cleft was observed in the third α-helix because Gly 119 and Ala 122, the amino acid residues located in this cleft, possess the smallest side chains relative to the amino acid side chains which surround the cleft. The Gly 119 residue was replaced by several other amino acids with longer side chains using the computer modeling software. The Arg residue, which has a relatively long side chain, alters the molecular structure and fills the cleft. It was observed that it is this change in the molecule which is responsible for converting GH from an agonist to an antagonist.

 

The modeling software enables the designer to visualize the 3-dimensional structure and to perform energy minimization and vibrational analysis to determine whether the new or modified structure is stable. The prediction program can also provide information on the secondary structure of the molecule. Thus, out of the thousands of possible alterations in a protein molecule, modeling and simulation can narrow down the number of possibilities which need to be investigated experimentally. Using the modeling technique, one can significantly reduce the number of hGH analogs that need to be evaluated for biological activity.

 

1.1     Why Predict the 3-D Structure (Particularly the newly designed ones)

 

Due to the interaction of amino acid residues, some conformations are more stable than others. The major amino acid interactions are :

            i)         Amino acids with polar or hydrophilic side chains tend to be on the protein surface, in contact with water.

            ii)        Amino acids with non-polar or hydrophobic side chains tend to remain in the protein interior.

            iii)       Hydrogen bonds are formed between the carbonyl oxygen of one peptide bond and the hydrogen attached to the nitrogen atom of another peptide bond. This hydrogen bonding gives rise to two fundamental polypeptide structures, the alpha helix and the beta sheet.

            iv)       The sulfhydryl (-SH) group of amino acid cysteine tends to react with the sulfhydryl group of a second cysteine to form a disulfide (S-S) bond. Most proteins have several cysteines and it is not generally possible to predict which cysteines will be paired.

 

These rules are not exclusive, nor are they always obeyed. However they form a guideline for predicting protein secondary structure. Thus to recapitulate, if there were free rotations about every bond and no interactions between atoms, the protein chain would be a random coil. However due to restrictive rotations about certain bonds and various inter-atomic interactions, the protein chain folds in a definite manner resulting in its secondary structure. The secondary structure is very difficult to predict, owing to the various factors affecting it.

 

It is an established fact that the function and biological action of a protein is governed by its 3-D structure. Thus, it is of paramount importance to decipher the physical structure of proteins that perform crucial functions in living organisms. Only after resolving the physical configuration of proteins, can we understand their exact role and significance in biological systems. There is a certain relationship between the way the amino acids are arranged in a sequence and its corresponding structure but there is no definite rule or an algorithm which governs the relationship. Since there is no explicit rule underlying the relation between amino acids and the secondary structure of proteins, procedural programming does not seem to provide a solution.

 

The experimental determination of the primary structure of proteins, that is, the sequence of their amino acids, is a relatively easy task. The amino acid sequences of thousands of proteins have been decoded. However, the experimental determination of the secondary and tertiary structure is a difficult and painstaking task. Although spectroscopic measurements can estimate the amount of secondary structure in a protein, as well as detect conformational changes, they cannot show where the helical, beta sheet and random coil regions are located along the chain. Only high resolution X-ray crystallography and some newer methods, such as neutron diffraction analysis and electron microscopy, can determine the location of secondary structures precisely. However, these techniques are expensive and time consuming. Furthermore, many proteins such as histone have not been crystallized, so that the above techniques cannot be applied (1). In the case of newly designed proteins, determining the 3-D structure prior to manufacture is very cost effective, which makes correct prediction powerful and efficient.

 

2.0     Chou-Fasman Algorithm

 

The Chou-Fasman algorithm for the prediction of protein secondary structure is one of the most widely used predictive schemes (1). This is because of its simplicity and high degree of accuracy. It can also be programmed and simulated on the computer with ease. The Chou-Fasman (C-F) algorithm is a statistical procedure based on assigning conformation potentials to all the amino acid residues (1,2). The conformation potentials, one for each conformation state, are obtained from the statistical analysis of proteins of known secondary structure. The folding mechanism is based on the assumption that nucleation for a particular conformation site starts at the region of maximum conformation potential and continues until a region of low conformation potential is reached. The C-F method is a very simple procedure devoid of complex computer calculations.

 

The C-F model includes short range interactions (in the form of conformation potentials of single residues) and medium range interactions (in the form of average conformation potentials of a group of residues). Effects of long range interactions are not included. 

 

            Simply stated the algorithm is as follows :

            Conformation potentials Pa, PB and Pt are assigned to each residue, which are a measure of its potential to form a helix, beta sheet and beta turn respectively. Accordingly a residue is termed as a helix former or breaker, beta former or breaker.

            When four helix formers out of six residues, or three beta formers out of five residues appear in a cluster, the nucleation of these secondary structures begins and propagates in both directions until terminated by a sequence of tetrapeptide, designated as breakers.

 

2.1       Rules for Secondary Structure Prediction

 

            The C-F algorithm is governed by the following empirical rules (1) :

 

A         Conditions for helical regions :

1.         Helix nucleation

            -           locate clusters of four helical residues (ha or Ha) out of six residues along the protein chain.

            -           weak helical residues Ia count as 0.5 ha.

                        e.g. 3 ha and 2 Ia out of 6 residues could nucleate a helix.

            -           helix formation is unfavorable if the six residue segment contains one-third or more helix breakers (ba or Ba) or less than one-half helix formers.

2.         Helix termination

            -           extend the six residue helical segment in both directions until terminated by tetrapeptide with <Pa> less than 1.0.

            -           once the helix is defined, some of the residues in the terminating tetrapeptide (especially h or i) could be included within the helix.

            -           adjacent beta regions can also terminate helical regions.

3.         Proline cannot occur in the inner helix or at the C terminal helical end.

4.         Helix boundaries

            -           negative residues Asp and Glu, and Pro prefer N terminal helical end.

            -           positive residues His, Lys and Arg prefer C terminal helical end.

            -           Ia designations are given to Pro (normally Ba) and Asp (normally ia) near the N terminal helix, as well as to Arg (normally ia) near the C terminal helix, in order to satisfy condition A.1. However the Pa values are the actual values. This modification is done only for nucleation and not for propagation.

 

Rule 1:            Any segment of six residues or longer with <Pa> greater than or equal to 1.03 as well as <Pa> greater than <PB> and satisfying conditions A.1 through A.4 is helical.
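The helix conditions above can be sketched in a few lines of Python. This is only an illustrative reading of conditions A.1 and A.2, not the PREDICT program: the one-letter class string (H/h formers, I weak former, i indifferent, b/B breakers) and the function names are assumptions made for the example, and the proline and boundary conditions (A.3, A.4) and the final <Pa> test of Rule 1 are left out.

```python
from typing import List, Tuple


def helix_nuclei(classes: str) -> List[int]:
    """Start indices of six-residue windows meeting condition A.1.

    `classes` holds one (hypothetical) letter per residue:
    H/h = helix former, I = weak former, i = indifferent, b/B = breaker.
    """
    starts = []
    for s in range(len(classes) - 5):
        window = classes[s:s + 6]
        # Weak formers (I) count as half a former.
        score = sum(c in "Hh" for c in window) + 0.5 * window.count("I")
        breakers = sum(c in "bB" for c in window)
        # One-third or more breakers (2 of 6) makes nucleation unfavorable.
        if score >= 4 and breakers < 2:
            starts.append(s)
    return starts


def extend_helix(pa: List[float], start: int, end: int) -> Tuple[int, int]:
    """Grow the half-open segment [start, end) one residue at a time until
    the adjacent tetrapeptide has a mean Pa below 1.0 (condition A.2)."""
    n = len(pa)
    while start >= 4 and sum(pa[start - 4:start]) / 4 >= 1.0:
        start -= 1
    while end + 4 <= n and sum(pa[end:end + 4]) / 4 >= 1.0:
        end += 1
    return start, end
```

Here extension simply stops when fewer than four residues remain on a side; how chain ends are treated in practice is not stated in the rules and is a choice made for this sketch.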

 

B.        Conditions for beta sheets :

1.         Beta sheet nucleation

 

            -           locate clusters of three beta residues (hB or HB) out of five residues.

            -           beta sheet formation is unfavorable if the segment contains one-third or more beta sheet breakers (bB or BB) or less than one-half beta formers.

2.         Beta sheet termination

            -           extend the five residue beta segment in both directions until terminated by tetrapeptide with <PB> less than 1.0.

            -           once the beta sheet is defined, some of the residues in the terminating tetrapeptide (especially h or i) could be included within the beta sheet.

            -           adjacent helical regions can also terminate beta regions.

3.         -           Glu occurs rarely in beta regions.

            -           Pro occurs rarely in inner beta regions.

4.         Beta sheet boundaries

            -           charged residues occur rarely at the N terminal beta end and infrequently in the inner beta region and the C terminal beta end.

            -           Trp occurs mostly at the N terminal end and rarely at the C terminal end.

Rule 2:            Any segment of five residues or longer with <PB> greater than or equal to 1.05 as well as <PB> greater than <Pa> and satisfying conditions B.1 through B.4 is a beta sheet.

 

C.        Conditions for beta turns :

 

Rule 3:            A tetrapeptide with pt > 0.75 x 10^-4 and <Pt> > 1.0, as well as <Pt> greater than <PB> and greater than or equal to <Pa>, will form a beta turn.
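Rule 3 can be written as a single predicate over one tetrapeptide. The sketch below assumes, following the usual Chou-Fasman formulation, that pt is the product of the four positional bend frequencies of the residues; that definition is not spelled out in this paper, and the function and argument names are hypothetical.

```python
from math import prod
from typing import Sequence


def is_beta_turn(bend_freqs: Sequence[float],
                 pt_vals: Sequence[float],
                 pa_vals: Sequence[float],
                 pb_vals: Sequence[float]) -> bool:
    """Apply Rule 3 to one tetrapeptide (each argument has four entries)."""
    # Assumed definition: pt is the product of the positional bend frequencies.
    p_t = prod(bend_freqs)
    mean_pt = sum(pt_vals) / 4
    mean_pa = sum(pa_vals) / 4
    mean_pb = sum(pb_vals) / 4
    return (p_t > 0.75e-4 and mean_pt > 1.0
            and mean_pt > mean_pb and mean_pt >= mean_pa)
```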

 

2.2     Statistical Measure of Prediction

 

Two statistical quantities are defined for quantitatively measuring the prediction results. The structure determined by X-ray crystallography is taken as the standard against which the prediction results are compared. If the predicted conformation state of a residue matches the experimental state, the residue is counted as correctly predicted; otherwise it is counted as incorrect.

            %k       = % of residues correctly predicted in state k

                        = 100 (nk - number incorrect)/nk                                                                          

            %N      = % of total residues correctly predicted

                        = 100 (N - total incorrect)/N                                                                               
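As a small worked illustration of the two measures, the sketch below computes %k for every state and %N from two equal-length strings of per-residue states. The one-letter state codes and the function name are assumptions made for the example.

```python
from typing import Dict, Tuple


def prediction_scores(observed: str, predicted: str) -> Tuple[Dict[str, float], float]:
    """Return ({state k: %k}, %N) for two equal-length state strings."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("state strings must be non-empty and of equal length")
    per_state = {}
    for k in sorted(set(observed)):
        n_k = observed.count(k)
        # Residues observed in state k but predicted as something else.
        incorrect = sum(1 for o, p in zip(observed, predicted) if o == k and p != k)
        per_state[k] = 100.0 * (n_k - incorrect) / n_k
    total_incorrect = sum(1 for o, p in zip(observed, predicted) if o != p)
    n = len(observed)
    return per_state, 100.0 * (n - total_incorrect) / n
```

For an observed string "HHHHEECC" and a predicted string "HHHEEECC", one helical residue is missed, so %H is 75, the other states score 100, and %N is 87.5.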

 

The Chou-Fasman rules for protein secondary structure prediction have been incorporated in the program PREDICT which predicts the helical, beta sheet and beta turn regions as well as outputs the values of all conformation potentials. In addition, the program can also perform segment analysis and print the average segment conformation potentials for each state and the type of residues (positive, negative or neutral) occurring in the segment. There is also the option to predict a four-state (helix, beta sheet, coil and beta turn) or a three-state (helix, beta sheet and beta turn) model. Furthermore, the C-F rules can be modified to perform a sensitivity analysis. The program is coded in FORTRAN 77 language and is approximately 2500 lines in length [6].

 

3.0     Fuzzy ARTMap Technique

 

An artificial neural network technique based on Fuzzy ARTmap [3,4] has been selected for predicting the secondary structure of proteins because of its architectural and neurodynamical characteristics. Fuzzy ARTmap is an incremental supervised learning algorithm which combines fuzzy logic and an adaptive resonance neural network for the recognition of pattern categories and multidimensional maps in response to input vectors presented in arbitrary order. Its neurodynamics implement a new min-max learning rule which conjointly minimizes predictive error and maximizes code compression, and therefore generalization. In addition, Fuzzy ARTmap is easy to use. It has a small number of parameters, requires no problem specific crafting or choice of initial weights, and does not get trapped in local minima (especially with the large data sets that we are using). Fuzzy ARTmap has been implemented in a simulator developed in-house using the C language [5,7].

 

Fuzzy ARTMAP is a neural network that performs incremental supervised learning of recognition categories and multi-dimensional maps in response to input vectors presented in arbitrary order. This architecture executes a synthesis of fuzzy logic and Adaptive Resonance Theory (ART). It realizes a new minmax learning rule which conjointly minimizes predictive error and maximizes code compression. This is achieved by a match tracking process that increases the ART vigilance parameter (the fuzzy degree of membership of the input with respect to the category templates) by the minimum amount needed to correct a predictive error. A normalization procedure called complement coding leads to a symmetric theory in which the MIN operator (^) and MAX operator (v) of fuzzy logic play complementary roles. A Fuzzy ARTMAP neural network is composed of two Fuzzy ART modules.
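The building blocks named above (complement coding, the fuzzy MIN operator and the vigilance test) can be sketched for a single Fuzzy ART module as follows. The formulas are the standard Fuzzy ART choice, match and learning rules rather than the in-house simulator; the parameter names alpha, beta and rho and all function names are assumptions of this sketch, and the map field that links the two modules in Fuzzy ARTMAP is omitted.

```python
from typing import List, Optional, Sequence


def complement_code(a: Sequence[float]) -> List[float]:
    """Map a in [0,1]^M to (a, 1 - a), so every coded input has the same L1 norm."""
    return list(a) + [1.0 - x for x in a]


def fuzzy_and(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """Component-wise MIN operator of fuzzy logic."""
    return [min(xi, wi) for xi, wi in zip(x, w)]


def choice(i_vec: Sequence[float], w: Sequence[float], alpha: float = 0.001) -> float:
    """Category choice value |I ^ w| / (alpha + |w|)."""
    return sum(fuzzy_and(i_vec, w)) / (alpha + sum(w))


def match(i_vec: Sequence[float], w: Sequence[float]) -> float:
    """Degree of match |I ^ w| / |I|, compared against the vigilance rho."""
    return sum(fuzzy_and(i_vec, w)) / sum(i_vec)


def learn(i_vec: Sequence[float], w: Sequence[float], beta: float = 1.0) -> List[float]:
    """Weight update w <- beta (I ^ w) + (1 - beta) w; beta = 1 is fast learning."""
    return [beta * m + (1.0 - beta) * wi for m, wi in zip(fuzzy_and(i_vec, w), w)]


def select_category(i_vec: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    rho: float) -> Optional[int]:
    """Highest-choice category that passes vigilance, or None if a new one is needed."""
    order = sorted(range(len(weights)),
                   key=lambda j: choice(i_vec, weights[j]), reverse=True)
    for j in order:
        if match(i_vec, weights[j]) >= rho:
            return j
    return None
```

In the full ARTMAP scheme, match tracking would raise rho just above the match value of a category whose prediction turned out wrong and then repeat the search.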

 

4.0     Probabilistic Neural Networks (PNN)

 

PNN was developed by Donald F. Specht [16]. PNN networks are three layer networks wherein the training patterns are presented to the input layer. The output layer has one neuron for each possible category. The network produces activation in the output layer corresponding to the probability density function estimate for that category. The input units are distribution units that supply the same input value to all the pattern units. Each pattern unit forms a dot product of the input vector X with the weight vector Wi

\[ Z_i = X \cdot W_i \]

and then performs a nonlinear operation on Zi before outputting its activation level to the summation unit. Instead of the sigmoid function commonly used for backpropagation, the nonlinear operation used in PNN is:

\[ \exp\left[ (Z_i - 1) / \sigma^2 \right] \]

Both X and Wi are normalized to unit length, which makes this equivalent to using the probability density function given below.

\[ F(X) = \exp\left( -\frac{(W_i - X)^T (W_i - X)}{2\sigma^2} \right) \]

Where

i           =          pattern number

X         =          training pattern

sigma   =          smoothing parameter ( > 0 )

 

The network is trained by setting the Wi weight vector in one of the pattern units equal to each of the X patterns in the training set and then connecting the pattern unit's output to the appropriate summation unit. A separate neuron (also called a pattern unit) is required for every training pattern. The same pattern units can be grouped by different summation units to provide additional pairs of categories and additional bits of information to form the output vector.
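The layer structure described above can be condensed into a short sketch: each stored training pattern acts as a pattern unit, each category has one summation unit, and the category with the largest sum is reported. This is a minimal illustration under those assumptions, not the in-house C program; the function names are hypothetical, and no weighting by class size or prior probability is applied.

```python
import math
from typing import Dict, List, Sequence


def unit(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def pnn_classify(x: Sequence[float],
                 patterns: Sequence[Sequence[float]],
                 labels: Sequence[str],
                 sigma: float) -> str:
    """Return the category whose summation unit has the largest activation."""
    x = unit(x)
    sums: Dict[str, float] = {}
    for w, label in zip(patterns, labels):
        w = unit(w)                              # "training" just stores the pattern
        z = sum(a * b for a, b in zip(x, w))     # pattern unit dot product Zi
        sums[label] = sums.get(label, 0.0) + math.exp((z - 1.0) / sigma ** 2)
    return max(sums, key=sums.get)
```

Adding a training pattern amounts to appending one more entry to `patterns` and `labels`, which is why no retraining pass is needed.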

 

4.1     The PNN Methodology

 

PNN is well suited to real world problems. Compared with backpropagation, it trains significantly faster, and the addition of training data does not require retraining of the entire network. It is guaranteed to converge to a Bayesian classifier, whereas the backpropagation technique may terminate in a local minimum. The output provided by PNN contains the amount of evidence upon which its results are based. Noise or errors in the training set do not easily affect classification accuracy. Therefore inputs similar to those in the training set can be classified correctly within limits.

 

Each of the protein files for training contains a list of amino acids (the primary structure) and the corresponding classification of each (i.e., helix, sheet or coil, its secondary structure). Such lists are converted into sets of input patterns, each containing a window of a fixed number of contiguous amino acid residues. Since there are 20 naturally occurring amino acids, each of them is coded as a 20-bit string with all but one bit turned on (i.e. a '1'). Representing an amino acid by a sequential number running from 1 to 20 could imply that the numbering of amino acids is a quantitative measure, which is not true. Hence, if the window size is thirteen, there would be 13 x 20 = 260 binary input elements per window. To determine the dependence of secondary structure on the amino acid sequence, the windows were made from the amino acid sequence by using a certain number of residues before and after the amino acid under consideration. Therefore, to form a window of thirteen, six amino acid residues before and six residues after would be considered. Each input pattern is associated with an output symbol labeling it a helix, sheet or coil. Two ways of classifying were tried.
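The windowing and binary coding described above can be sketched as follows. The sketch assumes the common one-hot reading, in which exactly one of the 20 bits marks the residue (the complementary convention would simply invert every bit); the ordering of the residue alphabet, the function names, and the decision to skip residues too close to the chain ends to have full flanks are all assumptions of the example.

```python
from typing import List

# Assumed ordering of the 20 standard one-letter amino acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_residue(aa: str) -> List[int]:
    """20-bit code for one residue, with a single bit marking its identity."""
    bits = [0] * 20
    bits[AMINO_ACIDS.index(aa)] = 1
    return bits


def encode_windows(sequence: str, window: int = 13) -> List[List[int]]:
    """One binary input pattern per residue that has `window // 2` neighbours
    on both sides; each pattern has window x 20 elements."""
    half = window // 2
    patterns = []
    for center in range(half, len(sequence) - half):
        bits: List[int] = []
        for aa in sequence[center - half:center + half + 1]:
            bits.extend(encode_residue(aa))
        patterns.append(bits)
    return patterns
```

For a window of thirteen each pattern has 13 x 20 = 260 elements, matching the count given above.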

 

The center technique was first used by Sejnowski et al., who held that the central amino acid has a large influence on the structure classification of its window. In the first method, the classification of Sejnowski and Lapedes is followed: the pattern is labeled with the secondary structure of the central amino acid. Therefore, if the pattern is a sequence of 13 amino acids and the seventh amino acid has been classified (in the protein data from the Protein Data Bank) as an α-helix, then the entire window of amino acids is labeled an α-helix. For the sake of convenience, this technique will be referred to as “the center technique” (CT) of classification.

 

It also seems logical to assume that a helix would be most likely to form in a sequence where most residues are classified as helical, and more generally that the secondary structure occurring most commonly in a window would have the dominating effect. Hence this technique of classification has been included in this study. In this method, the secondary structure (α-helix, β-sheet or coil) that forms a majority is used to classify the entire pattern. For example, for a window of 13 amino acids, if six were classified as α-helix, three as β-sheet and four as coil, then that window would be classified as an α-helix. For the sake of convenience, this method of classification is referred to as “the majority technique” (MT).
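The two labeling schemes can be compared with a short sketch that takes the per-residue states of one window as a string. The state letters and function names are illustrative assumptions, and ties in the majority technique are broken here in favor of the state seen first in the window, a choice the text does not specify.

```python
from collections import Counter


def center_label(states: str) -> str:
    """Center technique (CT): the window takes the state of its middle residue."""
    return states[len(states) // 2]


def majority_label(states: str) -> str:
    """Majority technique (MT): the window takes its most frequent state."""
    return Counter(states).most_common(1)[0][0]
```

For the window "HHHHHHEEECCCC" (six helix, three sheet, four coil) the majority technique gives helix, while the center technique gives sheet because the seventh residue is a sheet residue, so the two schemes can disagree on the same window.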

 

The program is made up of seven subroutines that together convert the data into binary form, apply the windowing technique for the specified window sizes, train the network from a specified list of protein data files, and then test the network using a protein file of known secondary structure. The main routine reads the training and testing data files, trains the network and generates a weights file that it subsequently uses to test itself. The program ends after tabulating the percentage prediction achieved [18].

 

 

5.0     Generalized Regression Neural Network (GRNN)

 

GRNN [17] is based on nonlinear regression theory.  Similar to PNN, GRNN is also a one pass learning algorithm and subsumes the basis function methods. It approximates any arbitrary function between input and output vectors, drawing the function estimate directly from the training data. The method uses a multivariate Parzen estimator to approximate the probability density function. The regression of the dependent variable y on the independent variable X is given by

 

 

\[ E[y \mid x] = \frac{\int_{-\infty}^{\infty} y \, f(x, y) \, dy}{\int_{-\infty}^{\infty} f(x, y) \, dy} \]

 

where

f (x, y) is the joint continuous probability density function.

 

The GRNN architecture consists of four layers. The number of neurons in the input layer is the same as the number of inputs. Each feature (input) is scaled to lie between 0 and 1. The network is fully connected. The scaled inputs are assigned as weights between the input layer and the pattern layer. The second layer is the hidden or pattern layer. The number of neurons in the pattern layer is equal to the number of patterns; in the protein training set that number is around 10,000. The next layer, called the summation layer, has two neurons, which can be denoted numerator (A) and denominator (B). The weights between the pattern layer and neuron B in the summation layer are set to 1, and those between the pattern layer and neuron A are set to the scaled output. The methodology and the training and testing sets used for GRNN were similar to those used for PNN. The PNN program discussed above was modified for GRNN.
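The numerator and denominator neurons can be illustrated with a short sketch of the GRNN estimate for a single scalar output. It assumes a Gaussian Parzen kernel on the squared Euclidean distance, with sigma as the smoothing parameter, and inputs and outputs already scaled to [0, 1]; the function name is hypothetical and this is not the modified PNN program itself.

```python
import math
from typing import Sequence


def grnn_predict(x: Sequence[float],
                 train_x: Sequence[Sequence[float]],
                 train_y: Sequence[float],
                 sigma: float) -> float:
    """Kernel-weighted average of the stored outputs: A / B."""
    a = 0.0  # numerator neuron A: pattern activations weighted by the scaled output
    b = 0.0  # denominator neuron B: pattern activations with weight 1
    for xi, yi in zip(train_x, train_y):
        d2 = sum((p - q) ** 2 for p, q in zip(x, xi))
        k = math.exp(-d2 / (2.0 * sigma ** 2))   # pattern unit activation
        a += yi * k
        b += k
    return a / b
```

If the input lies very far from every stored pattern relative to sigma, B can underflow to zero, so a practical implementation would guard that division.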

 

6.0     Results and  Conclusion

 

The neural network simulation for the prediction of secondary structures of proteins has been attempted using four different algorithms, namely back-error propagation, Fuzzy ARTMAP, PNN and GRNN. The back-error propagation algorithm is limited by a number of drawbacks, as exemplified in the simulation. The first drawback is the occurrence of local minima, which prevent the network from converging and thereby waste the entire training time. The network was trained with around 16,000 samples of training data, but it failed to converge in 800 iterations during the 20 hours of CPU time allocated on the CRAY supercomputer (the maximum time allowed for a batch process on the CRAY is 20 hours). This has proven to be the most critical limitation of the back propagation algorithm. Another drawback, which turns out to be very critical for our application, is that the algorithm is not capable of incremental learning. This becomes significant because any addition to the training set requires the training to be performed from scratch. Since back-error propagation makes no provision for plastic learning, it is difficult to add new proteins.

 

The Fuzzy ARTMAP algorithm overcomes the drawbacks encountered with the back propagation algorithm. Since it is not based on the gradient descent algorithm, it does not face the local minima problem, and as a result the training is very rapid. It also supports incremental learning, which ensures that new proteins can be added to the training set at a later stage without any detriment to the previous learning. The training would then be carried on from the point at which it was left. Thus, a framework has been designed wherein information can be added and the performance improved gradually over a period of time. Currently, 85 proteins from the Brookhaven Protein Data Bank (PDB) [13] have been used to train the network and 10 proteins have been selected for testing the network.

 

Testing the network with proteins entirely unknown to it (that is, absent from the training set) gave a prediction accuracy of 60%-75%, based on ten proteins and nucleic acids. The structure recorded in the PDB was compared with the predictions of the Fuzzy ARTmap model and the PNN method. As noted above, Fuzzy ARTmap overcomes the drawbacks encountered with the back propagation algorithm [10]: it does not face the local minima problem, it trains rapidly, and it learns incrementally, whereas back propagation must be retrained from scratch whenever the training set is extended. All the proteins were selected from the PDB. About 85 proteins were used for training the network and ten more for testing it. The same ten proteins were used with the modified Chou-Fasman technique, the Fuzzy ARTmap technique, the PNN method and the GRNN method for comparison.

 

The performance of the PNN was carefully observed under a system  of controlled variables.  In order to isolate the effects of variables, they were changed one at a time. The biggest limiting factor in this experiment was memory constraint. The training sets were limited to given sizes, with a variation(+/-) of 300 patterns.

 

1.         Window size of 19 amino acid residues provides best prediction results.

2.         Center Technique of classification works better than majority technique.

3.         “PREDICT”, a program developed by Soni[6] based on Chou-Fasman algorithm shows prediction results slightly lower than the PNN algorithm, for the same testing set of proteins.

4.         From the above inference, and overall results, one can also conclude that prediction is dependent on the protein sample being tested. i.e. Some proteins show consistently better prediction results than others (e.g.  1ppd  is consistently better than 1pyp).

5.         Best setting for sigma (σ), the smoothing constant, is 0.09.  Changing sigma in any direction reduces the prediction accuracy by a relatively small amount, up to a limit, after which the network fails.

6.         α-Helices and β-Sheets follow the same degree of prediction as is characteristic of the protein.
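
The role of the smoothing constant in observation 5 can be made concrete with a minimal PNN sketch. It is written in Python rather than the in-house 'C' program, which is not reproduced here, and it assumes the usual formulation from [16] in which each stored pattern contributes exp((x·w - 1)/σ²) for unit-length vectors. The class name `PNN`, the toy three-dimensional patterns and the H/E labels are hypothetical.

```python
import math
from collections import defaultdict


class PNN:
    """Minimal probabilistic neural network (Parzen-window classifier)."""

    def __init__(self, sigma=0.09):
        self.sigma = sigma                 # smoothing constant
        self.patterns = defaultdict(list)  # class label -> stored patterns

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm > 0 else list(vec)

    def train(self, vectors, labels):
        # "Training" only stores each normalized pattern under its class.
        for vec, label in zip(vectors, labels):
            self.patterns[label].append(self._normalize(vec))

    def scores(self, vec):
        # Average Gaussian-kernel activation per class.
        x = self._normalize(vec)
        result = {}
        for label, stored in self.patterns.items():
            total = 0.0
            for w in stored:
                dot = sum(a * b for a, b in zip(x, w))
                total += math.exp((dot - 1.0) / self.sigma ** 2)
            result[label] = total / len(stored)
        return result

    def predict(self, vec):
        s = self.scores(vec)
        return max(s, key=s.get)


pnn = PNN(sigma=0.09)
pnn.train([[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.9, 0.1]],
          ["H", "H", "E", "E"])
helix_like = pnn.predict([1, 0.05, 0])
sheet_like = pnn.predict([0.05, 1, 0])
```

A small sigma makes each stored pattern influence only its close neighbours, while a large sigma blurs the classes together; the value 0.09 is the setting reported above, not a general recommendation.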

 

The advantage of PNN is that training is very fast. Because training a PNN only involves loading the weight matrices, it can be done in "real time". A 'C' program was designed in-house using the above PNN technique. A pre-processor was also developed to automatically create the input for different window sizes from Brookhaven Protein Data Bank (PDB) files. The training was done on the proteins used for Fuzzy ARTmap. These proteins were selected randomly from the data bank. Several window sizes were used for training and testing. The window of fifteen gave a prediction accuracy of 72% when compared to known results. The results for windows of 7, 9, 11, 13, 17, 21, and 25 ranged from about 58% to 65% accuracy.
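
The windowing step performed by such a pre-processor might look like the following Python sketch. The actual pre-processor and its input format are not described here, so several things are assumptions: the one-hot encoding of the twenty residues, the use of an H/E/C string as the known secondary structure, and the reading of the "center" technique as labelling a window by its central residue and of the "majority" technique as labelling it by the most frequent class in the window. The function names `encode_window` and `make_patterns` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def encode_window(window):
    """One-hot encode a window; padding or unknown residues become zeros."""
    vec = []
    for res in window:
        vec.extend(1.0 if res == aa else 0.0 for aa in AMINO_ACIDS)
    return vec


def make_patterns(sequence, structure, window=15, labeling="center"):
    """Return one (input_vector, label) pair per residue position."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    pad_seq = "-" * half + sequence + "-" * half
    pad_str = "-" * half + structure + "-" * half
    patterns = []
    for i in range(len(sequence)):
        seq_win = pad_seq[i:i + window]
        str_win = pad_str[i:i + window]
        if labeling == "center":
            label = str_win[half]          # class of the central residue
        else:
            # Most frequent class in the window, ignoring padding;
            # ties go to the class seen first in the window.
            counts = Counter(s for s in str_win if s != "-")
            label = counts.most_common(1)[0][0]
        patterns.append((encode_window(seq_win), label))
    return patterns


center = make_patterns("ACDEFGH", "HHCHHEE", window=3, labeling="center")
majority = make_patterns("ACDEFGH", "HHCHHEE", window=3, labeling="majority")
```

Each input vector has 20 entries per window position, so a window of fifteen yields 300 inputs per pattern under this encoding.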

 

The artificial neural network based on supervised adaptive resonance theory provides a system that can make 3D structure predictions comparable in accuracy to the back-propagation algorithm, while requiring a much shorter training time. The statistical technique based on Chou-Fasman rules gives good results, but with the Fuzzy ARTmap the network can be improved by adding more proteins to it. In the PNN technique the total time for training and testing is significantly less than for other neural network schemes. Training the PNN on the 85 proteins took about 5 minutes on an Intergraph workstation running at 80 MIPS and 12 MFLOPS. The testing of six proteins took about 14 hrs. for a window of fifteen. Training on the same number of proteins using Fuzzy ARTmap on the same workstation took about 170 hrs.

 

The analysis of the results has demonstrated that a better multi-expert system architecture with different representation schemes can yield a better and more promising solution. This multi-expert architecture would utilize rules based on statistical analysis and neural networks to grow with examples. In addition, the utilization of additional information such as energy levels and dynamics can be very beneficial. This research was performed in the Biomolecular modeling facility at Ohio University, which utilizes an Intergraph CAD system connected to the Ohio Supercomputer. The research is part of on-going developments in the field of genetic engineering of growth hormones at Ohio University, in collaboration with the chemical engineering and biotechnology departments.

 

7.0     References

 

1.         Chou P.Y., G.D. Fasman, Empirical Predictions of Protein Conformation, Annual Review of Biochemistry, Vol. 47, PP:251-276 (1978).

2.         Fasman G.D., P. Prevelige, Jr., Chou-Fasman Prediction of the Secondary Structure of Proteins, Prediction of Protein Structure and the Principles of Protein Conformation, edited by G.D. Fasman, Plenum Press, PP:391-416 (1989).

3.         Carpenter GA, Grossberg S, and Rosen DB,  Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system. Technical Report CAS/CNS‑TR‑91‑015, July 1991.

4.         Carpenter GA, Grossberg S, Markuzon N, Reynolds JH, Rosen DB,  Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog  multidimensional maps. Technical Report CAS/CNS‑TR‑91‑016, August 1991.

5.         Vij L (1993) Computer aided biomolecular modeling and protein secondary structure prediction using neural networks. M.S. Thesis, Ohio University, Athens, OH

6.         Soni R (1992) Computer aided modeling and simulation of molecular systems and protein secondary structure prediction, M.S. Thesis, Ohio University, Athens, OH

7.         Mehta BV, Vij L, Rabelo LC (1993), Prediction of Secondary Structures of Proteins using Fuzzy ARTmap, World Congress on Neural Networks, Vol. 1, pp. 228-232, Portland, Oregon.

8.         Chen WY, Wight DC, Mehta BV, Wagner TE, and Kopchick JJ (1991a), Glycine 119 of bovine growth hormone is critical  for growth promoting activity. Molecular Endocrinology 5:1845‑1852

9.         Chen WY, Wight DC, Wagner TE, Kopchick JJ (1990) Expression of a mutated bovine growth hormone gene suppresses growth of transgenic mice. Proc. Natl. Acad. Sci. 87:5061‑5065

10.       Chen WY, White ME, Wagner TE, Kopchick JJ (1991a) Functional antagonism between endogenous mouse growth hormone (GH) and GH analog results in dwarf transgenic mice. Endocrinology 129:1402‑1408

11.       Chen WY, Chen NY, Yun J, Wagner TE, Kopchick JJ (1993) In vivo and in vitro studies of the antagonistic effects of human growth hormone analogs. J. Biol. Chem. 269:15892-15897.

12.       Sherin S. Abdel-Meguid, Huey-Sheng Shieh, W.W. Smith, H.E. Dayringer, B.N. Violand, L.A. Bentle, Three-dimensional Structure of a Genetically Engineered Variant of Porcine Growth Hormone, Proceedings of the National Academy of Sciences USA, Vol. 84, PP:6434-6437 (1987).

13.       Protein Data Bank, Chemistry Department, Brookhaven National Laboratory, Upton, NY 11973 USA.

14.       Mehta B.V, R. Soni, J. Patel, J.J. Kopchick, W.Y. Chen, Computer Aided 3D Modeling of Proteins, Proceedings of the International Association of Science and Technology for Development International Symposium, Computers and Advanced Technology in Medicine, Healthcare and Bioengineering, PP:63-65 (1990).

15.       Mehta BV, Padture S, Kopchick JJ, and Chen W (1993) Computer modeling and energy minimization of the bovine growth hormone, Bio/Technology Winter Annual Symposium, Miami Beach, Florida, Jan. 17‑20.

16.       Specht, D., Probabilistic Neural Networks for Classification, Mapping, or Associative Memory, Proceedings of the IEEE International Conference on Neural Networks, Vol. 1, pp. 527-530, 1988.

17.       Specht, D., A General Regression Neural Network, IEEE Trans. on Neural Networks, Vol. 2, No. 6, 1991.