A Short Introduction to Biocomputing
This introduction is intended to give you some basic information on why biologists and biochemists
are becoming increasingly interested in using computational approaches for their daily work. You will
be guided through an explicit example that gives you the chance to gain
hands-on experience with advanced programs, used in the same way as modern biologists frequently use them.
Biocomputing, as the computational basis for fields such as genetic diagnostics, has a growing
influence on everybody's life, but most people are not aware of it. It provides the theoretical background
and practical tools for scientists to explore proteins and DNA. DNA and proteins are large molecules which
consist of a chain of smaller residues called
nucleotides or
amino acids, respectively.
They are nature's building blocks, but these building blocks are not simply used as interchangeable 'bricks':
the function of the final molecule depends strongly on the order of these blocks. It is therefore useful to think
of these residues as being numbered.
The 3D (three dimensional) structure of a protein depends on the individual sequence of these numbered residues. The order
of amino acids of a given protein is
derived
from the corresponding DNA. This piece of DNA consists of an ordered sequence of nucleotides.
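How an amino acid sequence is derived from an ordered DNA sequence can be sketched in a few lines of Python. This is a minimal illustration, not part of the original tutorial: it assumes the standard genetic code (three nucleotides, one codon, per amino acid), reads only the given strand from its first base, and uses a short toy sequence chosen for this example. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Reads codon by codon from the first base and stops at the first stop
    codon. Trailing bases that do not fill a whole codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence for illustration only (an assumption, not data from the text).
print(translate("ATGGTGCACCTGACTCCTGAGGAGAAGTAA"))  # prints MVHLTPEEK
```

Because each codon is read in order, changing a single nucleotide can change exactly one amino acid at a known position in the protein.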
About the History of Biocomputing
Over the last 20 years it has turned out that many proteins of different origin but with similar function also
have similar amino acid sequences. Consequently, the corresponding DNA sequences are similar as well, even
though the protein under analysis occurs in different species such as mice and humans. For many such
sequences, we can therefore look for differences and similarities at the DNA level between, for example,
a mouse and a human.
Since the beginning of the 1990s, many laboratories have been analyzing the full
genome
of several species such as bacteria, yeasts, mice, and humans.
During these collaborative efforts enormous amounts of data are collected and stored in databases, most of
which are publicly accessible. Besides gathering all these data, it is necessary to compare these nucleotide
or amino acid sequences to find similarities and differences.
Since it is not very convenient to compare the sequences of several (hundred) nucleotides or amino acids by
hand, several computational techniques were developed to approach this problem. In addition, these are less
error-prone than a manual approach.
Using computational techniques to analyse biological data is referred to as
Biocomputing.
Current State of Biocomputing
Several algorithms have been developed and implemented providing graphical user interfaces
to existing databases. Thus, comparing a newly found sequence with those already stored in a database has
become a matter of minutes. Nevertheless, it is still necessary to analyse the results
carefully and to fine-tune a database search if needed.
In this way, it is possible to quickly determine the differences among species and the
differences between a healthy and a diseased individual.
Biocomputing might therefore lead to a better understanding of life and the molecular causes of certain
diseases.
Examples for the Use of Biocomputing
We will start with a class of proteins which is well known and has been analysed for a long time.
These proteins are the so-called lectins which belong to
the group of carbohydrate-binding and recognizing biomolecules.
They are distinguished from enzymes since they cannot chemically modify the bound carbohydrates.
General Information about Lectins
For many lectins, which occur in plants, animals, and in humans, their
amino acid composition and their corresponding genes
have been determined. With the help of Biocomputing one can identify
similarities between proteins having a nearly identical amino acid sequence, and can
find out which amino acids have been replaced in otherwise highly conserved
regions. If these modifications occur together with differences in the binding
behavior of the protein, one can subsequently conclude which parts of the
lectin might be responsible for the binding of
ligands.
The term ligand usually refers to a molecule that is bound by another, bigger molecule.

Model of a lectin as constructed from the data of a crystallographic analysis (Data retrieved
from the Protein Data Bank, ID-code 7WGA. Image constructed with the help of the program
Rasmol.)
Different colors are used for different atom types: grey: C atoms; red: O atoms; blue: N atoms; yellow: S atoms.
The following figures show how similar the overall structures of two lectins from two different species
(pea and lentil) can be.


Left side: Image of a lectin from peas; Right side: Image of a lectin from lentils.
Data for the images were retrieved from the Protein Data Bank, ID codes: 1RIN and 2LAL, respectively.
Images were constructed with the help of the program
Rasmol.
The grey/blue strands symbolize the amino acid backbone, while the yellow and red cartoons symbolize
different types of folding of the amino acid chain; yellow: beta-sheet, red: alpha-helix.
With biocomputational approaches it is possible to search databases for the amino acid or corresponding DNA
sequences of proteins which show a high degree of similarity to a newly discovered protein by
comparing their sequences. This can be achieved by a so-called
pairwise alignment,
and might give useful hints to determine the possible function of the molecule.
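The idea behind a pairwise alignment can be sketched with the classic Needleman-Wunsch dynamic programming method for global alignment. This is an illustrative sketch only: database search tools such as BLAST and FASTA rely on faster heuristics, and the scoring values used here (match +1, mismatch -1, gap -2) as well as the function name `global_alignment` are assumptions made for this example.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Align two sequences globally (Needleman-Wunsch).

    Returns the two gapped sequences and the alignment score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), score[-1][-1]


# Toy sequences for illustration only.
aligned_a, aligned_b, total = global_alignment("ACGT", "AGT")
print(aligned_a)
print(aligned_b)
print("score:", total)
```

The table has one cell per pair of positions, so the work grows with the product of the two sequence lengths. That cost is one reason why searches against large databases use heuristic shortcuts instead.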
Here is a link
to an extensive database about lectins in France.
Life is Based on Different Functions of Biomolecules
Many biomolecules not only contain an active site, like enzymes, or a binding site, as lectins do, but
are also able to interact with other proteins. This interaction can lead to the formation of oligomers which can be
composed of several different proteins or different molecules of the same protein. It is also possible that
proteins bind to smaller molecules. Such proteins are referred to as receptors since their activity is
modulated by the presence of the smaller molecule.
The binding of a protein to another molecule does not occur at a random position, however. It takes place at a
specific location whose special 3D structure restricts access to this position to certain
molecules only.
Regions of a protein which have special conformations are called domains, and a single protein can have
several of them with different functions.
Lectins, for example, can contain several different domains, e.g. for binding carbohydrates, for
interacting with each other to form oligomers, for binding to other molecules, and for accepting small molecules.
These properties render them suitable for mediating interactions between carbohydrates and non-carbohydrate
binding proteins.
Comparison with other proteins can help to identify such functional domains of the protein under analysis.
Thus, when the location of a protein in the cell is known, the comparison of its amino acid sequence with
that of another protein might reveal where this second protein is located in the cell. Membrane proteins, for
example, usually have very characteristic domains and are grouped according to the number of these domains.
Even though the overall structure of these two symbols does not seem to be identical they both
share a similar function since one of the key elements is similar.
Malfunctions of Biomolecules Can Cause Diseases
As mentioned above, proteins with the same function but of different origin usually vary in their amino acid
sequence even though they are fully functional. These variations in the sequence of the building blocks are
caused by mutations in the corresponding DNA sequence, i.e. a 'normal' nucleotide is replaced by a different
one. While these mutations are normally not detrimental to the function of a molecule, there are also mutations
which render a molecule inactive.
Identifying a mutated sequence derived from a diseased person, i.e. an amino acid sequence with at least one
amino acid that differs from the original one, might indicate that a certain amino
acid is responsible for either the correct folding or the assembly of an oligomer. If it is replaced by another one,
even one closely related to the original, the monomers might fail to assemble into a fully
functional protein.
A well-known example of this is sickle-cell anemia, the first disease for which a molecular origin was identified.
This disease is caused by the replacement of the amino acid glutamic acid by valine in the beta chain of hemoglobin,
leading to a "malformed" deoxyhemoglobin.
This finally results in a deformation of the hemoglobin-containing erythrocytes.
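Finding such a single exchanged residue is a simple position-by-position comparison. The sketch below compares the first 18 residues of the mature human hemoglobin beta chain with the sickle-cell variant, in which glutamic acid (E) at position 6 is replaced by valine (V). The function name `substitutions` is hypothetical, and the comparison assumes two sequences of equal length without gaps.

```python
def substitutions(reference, variant):
    """List (position, reference residue, variant residue) for each difference.

    Positions are 1-based. Both sequences must have the same length.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        (position, ref, var)
        for position, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var
    ]


# N-terminal residues of the mature beta chain, one-letter code.
NORMAL = "VHLTPEEKSAVTALWGKV"
SICKLE = "VHLTPVEKSAVTALWGKV"

for position, ref, var in substitutions(NORMAL, SICKLE):
    print(f"position {position}: {ref} -> {var}")  # prints: position 6: E -> V
```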
Example of a Disease That Is Not Due to a Defective Protein
Another very prominent example is diabetes mellitus, a disease caused by the inability of the body to produce
enough insulin. In contrast to the former example,
diabetes is not caused by the malfunction of a protein: the protein is not defective but is
produced in insufficient amounts or not at all. The currently most effective treatment for this disease
is to supply the individual with insulin. Formerly this insulin was obtained from pigs, but
more recently it has become possible to produce human insulin in bacteria, in any required quantity and quality.
Image of human insulin. The two chains are shown in different colors.
Data for the image were retrieved from the Protein Data Bank, ID code: 1HIU.
The image was constructed with the help of the program
Rasmol.
The following pictures show why pig insulin or, more generally, animal-derived insulin can be used to treat diabetes: As
can be seen in these figures, the amino acid sequences of the animal insulins are very similar to the human
form. Amino acids, symbolized using the one-letter-code, which match exactly are displayed in the middle row
marked with blue dashes. Those which do not match are marked with red bars.
The sequence identity is 94% for rabbits, 89% for pigs, and 87% for cows.
Sequence comparison of human precursor insulin with the precursor insulins of rabbit, pig, and cow,
respectively. The images are derived from a BLASTp search against the
Swissprot Database.
For details of such a search have a look at the
Explicit Example below.
The sequences are written in the one-letter code for amino acids, so the same letter in two lines
indicates a full match. Gaps (spaces) are introduced into a sequence to stretch it for a better
overall match. The numbers next to the individual sequences indicate the positions of the amino acid residues.
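The middle row and the percentage of identical residues in such a comparison can be computed directly from two aligned sequences. The sketch below is a simplified illustration: the two sequences are hypothetical toy data (not the insulin sequences from the figures), gaps are written as '-', and identity is taken as the number of identical columns divided by the alignment length, which is an assumption about how the percentages are defined.

```python
def match_line(top, bottom):
    """Middle row of an alignment: the shared letter on a match, else a space."""
    return "".join(
        a if a == b and a != "-" else " " for a, b in zip(top, bottom)
    )


def percent_identity(top, bottom):
    """Identical columns as a percentage of all alignment columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * matches / len(top)


# Hypothetical aligned toy sequences; '-' marks a gap.
query = "MKT-AYIAKQR"
subject = "MKTLAYLAKQR"

print(query)
print(match_line(query, subject))
print(subject)
print(f"identity: {percent_identity(query, subject):.1f}%")
```

In this toy example 9 of 11 columns match, so the reported identity is 81.8%.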
Identifying the Function of a Molecule
Computational approaches that compare sequences against a vast number of already analysed ones
can help to classify them according to their similarities and differences and to draw conclusions
about their functions.
They might also reveal that two molecules share the same function because they have the same domains,
even though they occur in different compartments of the cell or even in different organs.
In addition, the subunits of an oligomeric structure can differ, leading to molecules with different functions.
These subunits sometimes vary in their amino acid composition even though they stem from a common ancestor.
Biocomputational methods can help to identify this ancestor molecule through the construction of a so-called
phylogenetic tree.
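One simple way to build such a tree is to cluster sequences by their pairwise distances. The sketch below uses UPGMA, a basic clustering method chosen here for illustration and not necessarily the method used by any particular program. It assumes the sequences are already aligned and of equal length, measures distance as the fraction of differing positions, and uses hypothetical toy sequences and names.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(sequences):
    """Cluster aligned sequences with UPGMA; return the tree in Newick style."""
    sizes = {name: 1 for name in sequences}  # cluster label -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    }
    while len(sizes) > 1:
        # Join the two closest clusters.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del dist[pair]
        # Distance to the new cluster is the size-weighted average.
        for other in sizes:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = (
                size_a * d_a + size_b * d_b
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


# Hypothetical toy sequences, invented for this example.
toy = {
    "human": "ACGTACGT",
    "mouse": "ACGTACGA",
    "yeast": "TCGAACCA",
}
print(upgma(toy))  # prints ((human,mouse),yeast);
```

The two most similar sequences are joined first, and the inner node that joins them stands for their inferred common ancestor.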
The Use of Databases
One important goal in the analysis of a protein, for example a lectin, is to determine its crystallographic structure,
which is often difficult to accomplish because most proteins are hard to crystallize. If the amino
acid sequence is known, a certain protein fold can be assigned to that sequence,
which permits predictions about the overall shape of other proteins with similar sequences.
Crystallographic methods may also allow the function of domains to be identified, e.g. binding
pockets in lectins, as they can actually be visualized.
This type of analysis has been done for many lectins and the results can be viewed in the
Protein Data Bank (PDB).
Classification of a Subclass of Proteins by Sequence Alignment
The DNA and amino acid sequences of many animal and plant lectins have been elucidated
and their comparison has led to the discovery that they can be classified into different categories.
This classification is made on the basis of certain amino acids thought to be responsible for binding a certain
oligosaccharide. These are identified by the fact that they
occur at a fixed distance from each other across different species, i.e. they are highly
conserved residues. This fixed distance is normally not the linear distance between two amino
acids in a sequence, but the distance through space between two amino acid residues in a 3D structure.
Thus, they act like a 'keyhole' into which another molecule fits as the 'key'.
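The difference between linear distance and distance through space can be made concrete with a few coordinates. In the sketch below, every value is hypothetical: the residue numbers, the coordinates (in angstroms, one representative atom per residue), the 6 angstrom cutoff, and the function name `close_in_space` are assumptions for illustration. Real coordinates would come from a structure file such as a PDB entry.

```python
import math
from itertools import combinations

# Hypothetical coordinates, keyed by residue number in the sequence.
COORDS = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    45: (30.0, 12.0, 5.0),
    98: (3.0, 4.0, 0.0),
}


def close_in_space(coords, cutoff=6.0, min_separation=10):
    """Find residue pairs far apart in the sequence but close in 3D.

    Returns (residue i, residue j, distance rounded to 2 decimals).
    """
    pairs = []
    for (i, p), (j, q) in combinations(sorted(coords.items()), 2):
        distance = math.dist(p, q)  # Euclidean distance through space
        if j - i >= min_separation and distance <= cutoff:
            pairs.append((i, j, round(distance, 2)))
    return pairs


for i, j, distance in close_in_space(COORDS):
    print(f"residues {i} and {j}: {j - i} positions apart, {distance} angstroms apart")
```

In this toy data, residues 10 and 98 are 88 positions apart in the chain but only 5 angstroms apart in space, which is the situation the folded protein creates for conserved binding residues.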
In such highly conserved regions of a protein, only conservative mutations are tolerated without disrupting
the function of the protein.
Conservative mutations, in which an amino acid is replaced by another one that has comparable
physical properties, can be taken into account by biocomputational programs, thereby enabling the program
to 'decide' to what degree a sequence is similar to another one.
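A very reduced version of this 'decision' can be sketched by grouping amino acids with comparable physical properties and counting an exchange within a group as conservative. The grouping below is a simplified assumption made for this example. Real programs typically use substitution matrices such as PAM or BLOSUM, which assign a graded score to every possible exchange. The names `GROUPS`, `classify`, and `similarity_report` are hypothetical.

```python
from collections import Counter

# Simplified, hypothetical grouping by physical properties (one-letter code).
GROUPS = [
    set("AVLIM"),  # aliphatic, hydrophobic
    set("FWY"),    # aromatic
    set("STNQC"),  # polar, uncharged
    set("KRH"),    # basic
    set("DE"),     # acidic
    set("GP"),     # small or conformationally special
]


def classify(a, b):
    """Classify one aligned residue pair."""
    if a == b:
        return "identical"
    if any(a in group and b in group for group in GROUPS):
        return "conservative"
    return "non-conservative"


def similarity_report(seq1, seq2):
    """Count pair classes and return them with the percent similarity.

    Similarity counts identical plus conservative positions. The sequences
    must be aligned without gaps and have the same length.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    counts = Counter(classify(a, b) for a, b in zip(seq1, seq2))
    similar = counts["identical"] + counts["conservative"]
    return counts, 100.0 * similar / len(seq1)


# Hypothetical toy sequences: E -> D is conservative, E -> V is not.
counts, percent = similarity_report("VHLTPEEK", "VHLTPDVK")
print(dict(counts))
print(f"similarity: {percent:.1f}%")
```

With this scheme the two toy sequences are 75% identical but 87.5% similar, because the exchange of glutamic acid for aspartic acid keeps the acidic character of the residue.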
Identification of Homology between Some Plant-Derived Lectins with Similar Functions
Plant lectins, especially legume lectins which have been intensively analysed, show a high degree of homology.
The sequences of the following lectins were aligned: soybean agglutinin, favin (from fava bean),
lentil lectin, pea lectin, Phaseolus lectin, and Concanavalin A from jack bean.
The comparison of all six sequences led to the discovery of a high degree of sequence similarity
between them.
You can check out the
Explicit example
using the BLAST algorithm to see how this can be done.
A similar comparison was done using the FASTA program, too. You can retrieve the result by choosing
392 protein neighbors,
which shows the result for each sequence separately.
A screenshot is available
here.
Multiple alignment
of all six sequences. This result was obtained by using the program
'Block Maker' which is available for use at the
BCM Search Launcher: Multiple Sequence Alignments. The sequences were retrieved from the Protein Data Bank (PDB) using the
Entrez browser.
Now you might want to have a look at our explicit example
to get some hands-on experience by doing some Biocomputing over the Internet all by yourself.
You will be guided back to this page to have a look at the
Even though the current approaches in Biocomputing are very helpful in identifying
patterns and functions of proteins and genes, they are still far from perfect. They are not only time-consuming
and dependent on Unix workstations to run on, but might also lead to false interpretations and assumptions due to
necessary simplifications. It is therefore still mandatory to use biological reasoning and common sense in evaluating
the results delivered by a biocomputing program. In addition, to judge the trustworthiness of the output
of a program it is necessary to understand its mathematical and theoretical background
in order to arrive at a useful and sensible analysis.
© Christian Frosch