A Short Introduction to Biocomputing

This introduction is intended to give you some basic information on why biologists and biochemists are becoming increasingly interested in using computational approaches in their daily work. You will be guided through an explicit example that gives you the chance to get hands-on experience with advanced programs, used in the same way that modern biologists frequently use them.

Biocomputing, as the computational basis for fields such as genetic diagnostics, has a growing influence on everybody's life, although most people are not aware of it. It provides the theoretical background and practical tools for scientists to explore proteins and DNA. DNA and proteins are large molecules that consist of a chain of smaller residues called nucleotides and amino acids, respectively. These residues are nature's building blocks, but they are not used as interchangeable 'bricks': the function of the final molecule depends strongly on the order of the blocks. It is therefore useful to think of the residues as being numbered.

The 3D (three-dimensional) structure of a protein depends on the individual sequence of these numbered residues. The order of the amino acids in a given protein is derived from the corresponding piece of DNA, which in turn consists of an ordered sequence of nucleotides.

About the History of Biocomputing

Over the last 20 years it has turned out that many proteins of different origin but with similar function also have similar amino acid sequences. The corresponding DNA sequences are therefore similar as well, even though the protein under analysis occurs in different species such as mice and humans. For many such sequences we can thus look for differences and similarities at the DNA level between, for example, a mouse and a human.

Since the beginning of the 1990s, many laboratories have been analyzing the full genomes of several species such as bacteria, yeasts, mice, and humans. During these collaborative efforts enormous amounts of data are collected and stored in databases, most of which are publicly accessible. Besides gathering all these data, it is necessary to compare the nucleotide or amino acid sequences to find similarities and differences. Since it is not very convenient to compare sequences of several hundred nucleotides or amino acids by hand, several computational techniques were developed to approach this problem. These techniques are also less error-prone than a manual approach. Using computational techniques to analyse biological data is referred to as Biocomputing.

Current State of Biocomputing

Several algorithms have been developed and implemented, together with graphical user interfaces to existing databases. Comparing a newly found sequence with those already stored in a database has thus become a matter of minutes. Nevertheless, it is still necessary to analyse the results carefully and to fine-tune a database search if needed. In this way it is possible to quickly determine the differences among species and the differences between a healthy and a diseased individual. Biocomputing might therefore lead to a better understanding of life and of the molecular causes of certain diseases.

Examples of the Use of Biocomputing

We will start with a class of proteins that is well known and has been analysed for a long time. These proteins are the so-called lectins, which belong to the group of biomolecules that recognize and bind carbohydrates. They are distinguished from enzymes in that they cannot chemically modify the bound carbohydrates.

General Information about Lectins

For many lectins, which occur in plants, animals, and humans, the amino acid composition and the corresponding genes have been determined. With the help of Biocomputing one can identify similarities between proteins that have nearly identical amino acid sequences, and can find out which amino acids have been replaced in otherwise highly conserved regions. If these modifications occur together with differences in the binding behavior of the protein, one can subsequently conclude which parts of the lectin might be responsible for the binding of ligands. The term ligand usually refers to a molecule that is bound by another, bigger molecule.

Wheat Germ Agglutinin

Model of a lectin constructed from the data of a crystallographic analysis. (Data retrieved from the Protein Data Bank, ID code 7WGA. Image constructed with the help of the program Rasmol.)
Different colors are used for different atom types: grey: C atoms; red: O atoms; blue: N atoms; yellow: S atoms


The following figures show how similar the overall structures of two lectins from two different species (pea and lentil) can be.

Left side: image of a lectin from peas (pea lectin); right side: image of a lectin from lentils (lentil lectin).
Data for the images were retrieved from the Protein Data Bank, ID codes 1RIN and 2LAL, respectively. Images were constructed with the help of the program Rasmol.
The grey/blue strands symbolize the amino acid backbone, while the yellow and red cartoons symbolize different folding patterns of the amino acid chain: yellow: beta sheet; red: alpha helix.

With biocomputational approaches it is possible to search databases for the amino acid sequences, or the corresponding DNA sequences, of proteins that show a high degree of similarity to a newly discovered protein. This is achieved by comparing the sequences in a so-called pairwise alignment, and the result might give useful hints about the possible function of the molecule. Here is a link to an extensive database about lectins in France.
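To make the idea of a pairwise alignment more concrete, here is a minimal sketch of a global alignment in the style of the Needleman-Wunsch algorithm. It is not the method used by database search programs such as BLAST, which rely on faster heuristics. The scoring values (match +1, mismatch -1, gap -1) and the example sequences are assumptions chosen only for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (aligned_a, aligned_b, score).

    The scoring values are illustrative assumptions, not recommended settings.
    """
    n, m = len(a), len(b)

    def pair_score(x, y):
        return match if x == y else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,   # gap in b
                score[i][j - 1] + gap,   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    aligned_a, aligned_b, total = needleman_wunsch("GATTACA", "GATACA")
    print(aligned_a)
    print(aligned_b)
    print("score:", total)
```

The two returned strings have the same length, and a dash marks a position where a gap was introduced in one sequence to improve the overall match.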

Life is Based on Different Functions of Biomolecules

Many biomolecules not only contain an active site, as enzymes do, or a binding site, as lectins do, but are also able to interact with other proteins. This interaction can lead to the formation of oligomers, which can be composed of several different proteins or of several molecules of the same protein. It is also possible for proteins to bind smaller molecules. Such proteins are referred to as receptors, since their activity is modulated by the presence of the smaller molecule. The binding of a protein to another molecule does not occur at a random position. It takes place at a specific location whose special 3D structure restricts access to certain molecules only. Regions of a protein that have such special conformations are called domains, and a single protein can have several of them with different functions. Lectins, for example, can contain several different domains, e.g. for binding carbohydrates, for interacting with each other to form oligomers, for binding to other molecules, and for accepting small molecules.

These properties render them suitable for mediating interactions between carbohydrates and proteins that do not bind carbohydrates themselves. Comparison with other proteins can help to identify such functional domains in the protein under analysis. Thus, when the location of one protein in the cell is known, comparing its amino acid sequence with that of a second protein might elucidate where this second protein is located in the cell. Membrane proteins, for example, usually have very characteristic domains and are grouped according to the number of these domains.

key

Even though the overall structures of these two symbols do not seem to be identical, they share a similar function because one of their key elements is similar.

Malfunctions of Biomolecules Can Cause Diseases

As mentioned above, proteins with the same function but of different origin usually vary in their amino acid sequence even though they are fully functional. These variations in the sequence of the building blocks are caused by mutations in the corresponding DNA sequence, i.e. the 'normal' nucleotide is replaced by a different one. While these mutations are normally not detrimental to the function of a molecule, there are also mutations which render a molecule inactive. Identifying a mutated sequence from a diseased person, i.e. an amino acid sequence with at least one amino acid that differs from the original one, might give evidence that a certain amino acid is responsible for either the correct folding of the protein or the assembly of an oligomer. If this amino acid is replaced by another one, even a closely related one, the monomers might fail to assemble into a fully functional protein.

A well known example of this is sickle-cell anemia, which was the first disease identified as having a molecular origin. This disease is caused by the replacement of the amino acid glutamic acid by valine in hemoglobin, leading to a malformed deoxyhemoglobin. This finally results in a deformation of the hemoglobin-containing erythrocytes.
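The effect of such a single nucleotide replacement can be sketched in a few lines of code. The example below translates a short piece of DNA into amino acids using the standard genetic code and compares a 'normal' with a mutated version. The DNA fragment is only illustrative: it is modeled on the start of the human beta-globin coding sequence and should be checked against a database entry before being relied on. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Translation stops at the first stop codon ('*') or at the end of the
    sequence; incomplete trailing codons are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def differences(protein_a, protein_b):
    """List (position, residue_a, residue_b) for every mismatching position."""
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(protein_a, protein_b), start=1)
        if a != b
    ]


# Illustrative fragments (assumption): they differ in one nucleotide (A -> T).
normal_dna = "ATGGTGCATCTGACTCCTGAGGAGAAG"
mutant_dna = "ATGGTGCATCTGACTCCTGTGGAGAAG"

if __name__ == "__main__":
    normal = translate(normal_dna)
    mutant = translate(mutant_dna)
    print(normal)
    print(mutant)
    print(differences(normal, mutant))
```

In this sketch the single changed nucleotide turns the codon GAG (glutamic acid, E) into GTG (valine, V). Positions are counted including the initial methionine, so they are shifted by one relative to the numbering of the mature protein.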

Example of a Disease That Is Not Due to a Defective Protein

Another very prominent example is diabetes mellitus, a disease caused by the inability of the body to produce enough insulin. In contrast to the former example, diabetes is not caused by the malfunction of a protein; rather, the protein is not produced at all or not in sufficient amounts. The currently most effective treatment for this disease is to supply the individual with insulin. Formerly this insulin was obtained from pigs, but more recently it has become possible to produce human insulin in bacteria, in any desired quantity and quality.

Human Insulin

Image of human insulin. The two chains are depicted in different colors. Data for the image were retrieved from the Protein Data Bank, ID code 1HIU. The image was constructed with the help of the program Rasmol.

The following pictures show why pig insulin or, more generally, animal-derived insulin can be used to treat diabetes: the amino acid sequences of the animal insulins are very similar to the human form. The amino acids are written in the one-letter code. Those which match exactly are displayed in the middle row and marked with blue dashes; those which do not match are marked with red bars. The sequence identity with human insulin is 94% for rabbits, 89% for pigs, and 87% for cows.

Rabbit Precursor Insulin


Pig Precursor Insulin


Bovine Precursor Insulin


Sequence comparison of human precursor insulin with the precursor insulins of rabbit, pig, and cow, respectively. The images are derived from a BLASTp search against the Swiss-Prot database. For details of such a search, have a look at the Explicit Example below.
The sequences are written in the one-letter code for amino acids, so the same letter in both lines indicates a full match. Spaces (gaps) are introduced into a sequence to extend it for a better overall match. The numbers next to the individual sequences indicate the positions of the amino acid residues.
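A sequence identity such as the percentages quoted above can be computed directly from two aligned sequences. The following is a minimal sketch. It assumes that both sequences have already been aligned to the same length, that gaps are written as '-', and that identity is defined as the number of identical columns divided by the alignment length; other definitions of the denominator exist. The two short sequences are made up for illustration and are not insulin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns in which both residues are identical.

    A column containing a gap never counts as a match. The denominator is
    the full alignment length (one common convention among several).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Hypothetical aligned fragments: one substitution (E/D) and one gap.
    print(percent_identity("GIVEQ-CT", "GIVDQACT"))
```

For the two toy fragments, six of the eight columns match, which gives an identity of 75%.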

Identifying the Function of a Molecule

Computational approaches that compare sequences against a vast number of already analysed ones can help to classify them by their similarities and differences and allow conclusions about their functions. They might also reveal that two molecules share the same function because they have the same domains, even though they occur in different compartments of the cell or even in different organs.

The subunits of an oligomeric structure can also differ, leading to molecules with different functions. These subunits sometimes vary in their amino acid composition even though they stem from a common ancestor. Biocomputational methods can help to identify this ancestor molecule by constructing a so-called phylogenetic tree.
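As a rough illustration of how a tree can be derived from sequences, the sketch below clusters aligned sequences by their pairwise distances using average linkage (the idea behind UPGMA, one of the simplest tree-building methods). This is an assumption made for illustration, not necessarily the method used by any particular program; the sequence names and sequences are invented, and real analyses use more refined distance measures and algorithms.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage_tree(seqs):
    """Build a tree as nested tuples from a dict {name: aligned sequence}.

    At every step the two clusters with the smallest average pairwise
    distance between their members are merged.
    """
    # Map each (sub)tree to the list of sequence names it contains.
    clusters = {name: [name] for name in seqs}

    def average_distance(c1, c2):
        pairs = [(m1, m2) for m1 in clusters[c1] for m2 in clusters[c2]]
        total = sum(p_distance(seqs[m1], seqs[m2]) for m1, m2 in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(
            combinations(list(clusters), 2),
            key=lambda pair: average_distance(*pair),
        )
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))


if __name__ == "__main__":
    # Hypothetical aligned toy sequences.
    toy = {"seqA": "AAAA", "seqB": "AAAT", "seqC": "TTTT"}
    print(average_linkage_tree(toy))
```

In the toy data, seqA and seqB differ in only one position and are therefore joined first; seqC is attached to that pair afterwards.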

The Use of Databases

One important goal in the analysis of a protein, for example a lectin, is to determine its crystallographic structure, which is often hard to accomplish because most proteins are difficult to crystallize. Once a structure has been determined, its protein fold can be assigned to the known amino acid sequence, which permits predictions about the overall shape of other proteins that have similar sequences.

Crystallographic methods may also allow the function of domains to be identified, e.g. binding pockets in lectins, since these can actually be visualized. This type of analysis has been done for many lectins, and the results can be viewed in the Protein Data Bank (PDB).

Classification of a Subclass of Proteins by Sequence Alignment

The DNA and amino acid sequences of many animal and plant lectins have been elucidated and their comparison has led to the discovery that they can be classified into different categories.

This classification is made on the basis of certain amino acids thought to be responsible for binding a certain oligosaccharide. These are identified by the fact that they occur at a fixed distance from each other and throughout different species; they are highly conserved residues. This fixed distance is normally not the linear distance between two amino acids in the sequence, but the distance through space between two amino acid residues in the 3D structure. Together they act like a 'keyhole' into which another molecule fits as the 'key'.
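The difference between the distance along the sequence and the distance through space can be shown with a small calculation. The coordinates below are invented for illustration (they are not taken from any PDB entry) and stand for the positions of three residues in angstroms; the function names are hypothetical.

```python
import math

# Hypothetical 3D coordinates (in angstroms), keyed by residue number.
coordinates = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    85: (3.0, 4.0, 0.0),
}


def sequence_separation(i, j):
    """Number of positions between residues i and j along the chain."""
    return abs(i - j)


def spatial_distance(coords, i, j):
    """Straight-line (Euclidean) distance between residues i and j."""
    return math.dist(coords[i], coords[j])


if __name__ == "__main__":
    for i, j in [(10, 11), (10, 85)]:
        print(i, j, sequence_separation(i, j), spatial_distance(coordinates, i, j))
```

In this made-up example residues 10 and 85 are 75 positions apart in the sequence but only 5 angstroms apart in space, which is how residues that are far apart in the chain can still form a common binding site.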

In such highly conserved regions of a protein only conservative mutations are tolerated without disrupting the function of the protein. A conservative mutation is one in which an amino acid is replaced by another that has comparable physical properties. Such mutations can be taken into account in biocomputational programs, thereby enabling the program to 'decide' to what degree one sequence is similar to another.
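One simple way in which a program could take conservative replacements into account is sketched below. Amino acids are sorted into rough groups with comparable properties, and a pair of residues scores highest if identical, lower if both belong to the same group, and negative otherwise. The grouping and the score values are simplifying assumptions made for this example; real programs use empirically derived substitution matrices such as PAM or BLOSUM instead.

```python
# Rough property groups (an assumption for illustration only).
GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("STNQ"),   # polar, uncharged
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("CGP"),    # special cases
]


def score_pair(x, y, identical=2, conservative=1, other=-1):
    """Score one pair of residues given in the one-letter code."""
    if x == y:
        return identical
    if any(x in group and y in group for group in GROUPS):
        return conservative
    return other


def similarity_score(aligned_a, aligned_b):
    """Sum of pair scores over two gap-free sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must have the same length")
    return sum(score_pair(x, y) for x, y in zip(aligned_a, aligned_b))


if __name__ == "__main__":
    print(score_pair("D", "E"))  # conservative: both negatively charged
    print(score_pair("E", "V"))  # non-conservative: charged vs hydrophobic
    print(similarity_score("LDKF", "IEKV"))
```

With this scheme a replacement of aspartic acid by glutamic acid is penalized less than a replacement of glutamic acid by valine, which reflects the idea that the former is less likely to disturb the function of the protein.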

Identification of Homology between Some Plant Derived Lectins with Similar Functions

Plant lectins, especially the intensively analysed legume lectins, show a high degree of homology. The sequences of the following lectins were aligned: soybean agglutinin, favin (from fava bean), lentil lectin, pea lectin, Phaseolus lectin, and Concanavalin A from jack bean.
The comparison of all six sequences revealed a high degree of sequence similarity between them. You can check out the Explicit Example using the BLAST algorithm to see how this can be done.

A similar comparison was also done using the FASTA program. You can retrieve the result by choosing 392 protein neighbors, which shows the result for each sequence separately. A screenshot is available here.

Multiple Alignment of 6 Lectin Sequences

Multiple alignment of all six sequences. This result was obtained by using the program 'Block Maker' which is available for use at the BCM Search Launcher: Multiple Sequence Alignments. The sequences were retrieved from the Protein Data Bank (PDB) using the Entrez browser.

Now you might want to have a look at our explicit example to get some hands-on experience by doing some Biocomputing over the Internet all by yourself.
You will be guided back to this page to have a look at the conclusions.

Conclusions

Even though the current approaches in Biocomputing are very helpful in identifying patterns and functions of proteins and genes, they are still far from perfect. They are not only time-consuming and require Unix workstations to run on, but might also lead to false interpretations and assumptions because of the simplifications they necessarily make. It is therefore still mandatory to use biological reasoning and common sense in evaluating the results delivered by a biocomputing program. To judge the trustworthiness of a program's output, it is also necessary to understand its mathematical and theoretical background in order to arrive at a useful and sensible analysis.
© Christian Frosch