Microbial Bioinformatics: A game changing tool/field for microbiology in the near future (By Shaunak Raole)

June 30, 2021

Revolutionary growth in computation speed and memory storage capacity has ushered in a new era in the analysis of biological data. The genomes of many microbes and eukaryotes have been sequenced, including a significantly clearer draft of the human genome. The major goals now within reach are the development of rational drugs and antimicrobial agents, enhanced bacterial strains for bioremediation and pollution control, more sophisticated and easily administered vaccines, protein biomarkers for a plethora of bacterial diseases, and a better understanding of host-bacteria interaction in order to prevent bacterial infections.

The development of new bioinformatics techniques has helped propel science towards these goals. Current research in bioinformatics can be classified into:

(i) Genomics – the sequencing and comparative analysis of genomes to identify genes and genome functionality

(ii) Proteomics – the identification and characterization of protein properties and the construction of metabolic pathways

(iii) Cell visualization and simulation, used to study and model cell behaviour

(iv) Applications in the development of drugs and antimicrobial agents

Bioinformatics research follows three main methods:

(1) Analysis based on experimental wet-lab data

(2) Deriving new information by using mathematical models

(3) An integrated approach that combines search techniques with mathematical modelling

Bioinformatics has had a major impact in automating genome sequencing, in the automated development and integration of genomics and proteomics databases, and in the automated comparison of genomes to identify gene and genome function. It is also used for deriving metabolic pathways and for analysing gene expression through statistical analysis and data mining, which has helped to uncover protein-protein and protein-DNA interactions. In addition, it has helped in modelling the 3D structures of proteins.

From a microbiological perspective, it has mainly helped in drug design and in analysing the differences between pathogenic and non-pathogenic strains, which ultimately helps to identify candidate genes for vaccines and antimicrobial agents. Whole-genome comparison also helps us understand microbial evolution.

Developing bioinformatics tools has increased the pace of biological discovery through analysis of microbial genomes.

Greatly improved computational capability, coupled with the large-scale miniaturization of biochemical techniques such as PCR, BAC cloning, electrophoresis and microarray chips, has generated an enormous amount of genomic and proteomic data across the globe. This in turn has led to a huge number of discoveries that would not have been possible with conventional wet-lab techniques alone.

The availability of this data has raised the expectation that we can control and harness genetics, and it has already helped us to manipulate the genetics of microbes. The advantages include improved diagnosis of diseases, especially with the help of protein biomarkers, and the production of cost-effective vaccines against those diseases. They also include rational drug design, improved agricultural quality and quantity, and an understanding of the microbial machinery at the systemic level.

Haemophilus influenzae was the first free-living organism to have its complete genome sequenced, in 1995. Since then, hundreds of genomes have been sequenced and archived online for public research (for example in GenBank), through the efforts of public agencies and institutes such as NIH, DOE, EMBL and EBI, along with other national laboratories and universities. Drug development and bioremediation companies have also played a major part.

GENOME SEQUENCING

Bioinformatics has contributed to genome sequencing mainly through:

1) Automated sequencing techniques that couple PCR or BAC-based amplification, 2D gel electrophoresis, and automated reading and analysis of nucleotides

2) Joining smaller fragments to form a full genome sequence

3) Predicting promoter and protein-coding regions in the genome

Amplification techniques based on PCR and BAC give us genome fragments of limited size. The fragment sequences contain nucleotide reading errors, repeats (short, similar stretches that occur in two or more parts of the genome), and chimaeras, which are two separate parts of the genome that become attached end to end because of contamination, giving an artefactual fragment.

Reading errors are resolved by producing multiple copies of each fragment, aligning them, and taking a majority vote at each nucleotide position, which is why multiple experimental copies are needed. Chimaeras and repeats are removed before the final assembly of the genome fragments.
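The majority-voting step can be illustrated with a minimal Python sketch. It assumes the copies of a fragment are already aligned and of equal length; the function name and the sample reads are made up for illustration and do not come from any particular sequencing pipeline.

```python
from collections import Counter


def consensus(reads):
    """Majority vote per position over pre-aligned, equal-length reads."""
    if not reads:
        return ""
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to the same length")
    result = []
    for column in zip(*reads):
        # Pick the most frequent nucleotide in this column.
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Three copies of the same fragment, two of them with one reading error each.
copies = ["ACGTAC", "ACGTTC", "ACCTAC"]
print(consensus(copies))  # ACGTAC
```

With only two copies a disagreement cannot be settled by voting, which is one reason several copies of each fragment are needed.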

The joining of fragments is modelled as a weighted graph, where the nodes are fragments and the weight of each edge is the number of overlapping nucleotides between two fragments. Fragments are joined on the basis of maximum overlap using a greedy algorithm, in which the pair of nodes with the best score is collapsed first. In other words, fragments with the largest nucleotide overlap are merged first, and the process repeats until contigs are formed.
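As a rough illustration of the greedy idea, the Python sketch below repeatedly merges the pair of fragments with the largest suffix-prefix overlap. It is a toy under simplifying assumptions (error-free fragments, a single strand, no repeats or chimaeras); the function names and sample fragments are hypothetical, and real assemblers are considerably more involved.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_overlap=1):
    """Merge the best-overlapping pair until no pair overlaps enough."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size < min_overlap:
            break  # remaining fragments stay as separate contigs
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGCG", "GCGTA", "GTACC"]))  # ['ATGCGTACC']
```

The greedy choice is simple and fast, but it can be misled by repeats, which is why repeats are handled before assembly.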

AUTOMATED IDENTIFICATION OF GENES

Once the contigs are joined, the next step is to identify the protein-coding regions, or open reading frames (ORFs), in the genome. This can be done by three methods:

1) Using Hidden Markov Model (HMM) based techniques like GLIMMER and GeneMark

These techniques use multiple probabilistic state machines, each of which is capable of identifying an ORF. Each machine predicts the next nucleotide by taking the state transition with the highest probability, and the prediction is matched against the actual nucleotide in the sequence.

In microbial genomes, GLIMMER has provided 95 to 97 per cent accuracy.

2) Searching databases like GenBank

3) Using decision-tree-based algorithms that identify the start and stop codons of the coding regions
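To make the notion of an ORF concrete, here is a deliberately naive Python sketch that scans the three forward reading frames for a start codon followed by an in-frame stop codon. It is not how GLIMMER, GeneMark or decision-tree methods work; the function name, the `min_codons` threshold and the example sequence are assumptions chosen for illustration, and the reverse strand is ignored.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) index pairs of forward-strand ORFs.

    `end` is exclusive and includes the stop codon. `min_codons` counts
    the codons before the stop codon, start codon included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START:
                start = pos  # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None  # look for the next start codon
    return orfs


sequence = "CCATGAAATTTTAGGG"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])  # 2 14 ATGAAATTTTAG
```

A scan like this finds many spurious ORFs in real genomes, which is why statistical models are used to tell genuine coding regions from chance ones.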

IDENTIFYING GENE FUNCTION

After the ORFs are identified, the function and structure of each gene are inferred, mainly by comparing its sequence with sequences of known function. Four popular algorithms and their variations are used for this: BLAST, Smith-Waterman alignment, FASTA and BLOCKS.

The basic unit of protein function is the protein domain. At the structural level, a domain is associated with a unique folding pattern of alpha helices, beta-pleated sheets or their variations. Multiple sequence alignment and HMMs are used to identify regions that are conserved across multiple homologous genes; these regions are probable domains. There are also domain-related databases such as ProDom, Pfam and SMART.
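Of the algorithms named above, Smith-Waterman is the easiest to show in a few lines. The Python sketch below computes only the best local alignment score, using a simple linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions for illustration, whereas real tools use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                         # start a fresh local alignment
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            best = max(best, score[i][j])
    return best


# The shared core "ACGT" gives 4 matches x 2 = 8.
print(smith_waterman_score("GGACGTCC", "TTACGTAA"))  # 8
```

Because every cell is floored at zero, the score reflects the best-matching local region rather than the whole sequences, which suits the search for shared functional regions between otherwise different genes.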

3D STRUCTURE MODELING AND DOCKING

A protein can exist in one or more low-free-energy conformational states, depending on its interactions with other proteins. When the protein is in a stable conformation, certain regions of it are exposed for protein-protein or protein-DNA interactions.

Function depends on the exposed active site, so the function of an unknown protein is predicted by matching its 3D structure with that of a known protein.

The term docking refers to identifying the best match between the 3D structures of two molecules, one of which is a ligand and the other a receptor. The binding of the two is modelled by simulating their interacting surfaces and minimizing the free energy at the domain level.

PAIRWISE GENOME COMPARISON

After identifying the gene functions, we perform pairwise genome comparisons. These provide us with:

1) Details of paralogous genes, which are duplicated genes with similar sequences but slightly different functions.

2) Information on orthologous genes, which are functionally equivalent genes that have diverged in two genomes because of speciation.

3) Information on lateral gene transfer, in which a gene is acquired from an evolutionarily distant microorganism.

4) Difference analysis for identifying genes that are specific to a group of genomes, such as pathogens.
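Once genes have been grouped into families of equivalent genes, the difference analysis in point 4 reduces to set operations. The Python sketch below assumes that grouping has already been done; the function name, strain names and family labels are hypothetical placeholders, not real data.

```python
def group_specific_genes(target_genomes, other_genomes):
    """Gene families found in every target genome and in no other genome.

    Both arguments map a genome name to the set of gene-family labels
    found in that genome.
    """
    if not target_genomes:
        return set()
    shared = set.intersection(*(set(g) for g in target_genomes.values()))
    elsewhere = set().union(*(set(g) for g in other_genomes.values()))
    return shared - elsewhere


# Toy input: labels are placeholders, not real gene families.
pathogens = {
    "strain_P1": {"fam1", "fam2", "fam3"},
    "strain_P2": {"fam1", "fam3", "fam4"},
}
non_pathogens = {
    "strain_N1": {"fam1", "fam5"},
}
print(group_specific_genes(pathogens, non_pathogens))  # {'fam3'}
```

Families that survive this filter are only candidates; whether they actually contribute to pathogenicity still has to be checked experimentally.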

RECONSTRUCTING METABOLIC PATHWAYS

This involves the automated reconstruction of the metabolic pathways of newly sequenced organisms and their comparison. It is done through three approaches:

1) A global network of enzyme-catalyzed reactions

This approach uses information on known biochemical pathways and enzymes, and identifies the enzyme functions of newly sequenced genes. It is quite powerful.

2) A network of gene groups connected through the reactions catalyzed by the enzymes that those gene groups encode

Here the enzymes and their functions are first identified in the newly sequenced genome. Groups of genes that share a common promoter are then identified by analysing the promoter regions of genes, and the newly sequenced genome is compared pairwise with multiple other genomes to derive gene groups. These gene groups are finally connected using biochemical knowledge of existing enzymes and pathways.

3) A global model of the biochemical reactions, including their products, by-products and the effects of cofactors
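The first approach can be pictured as a lookup followed by a graph search, as in the Python sketch below. The reaction table, enzyme labels (E1 to E4) and metabolite labels (A, B, C, D, X) are entirely hypothetical stand-ins for a curated pathway database, and real reactions have multiple substrates, products and cofactors that this toy ignores.

```python
from collections import deque

# Hypothetical reference table: enzyme label -> (substrate, product).
KNOWN_REACTIONS = {
    "E1": ("A", "B"),
    "E2": ("B", "C"),
    "E3": ("C", "D"),
    "E4": ("B", "X"),
}


def reconstruct_network(annotated_enzymes, known_reactions=KNOWN_REACTIONS):
    """Keep only the known reactions whose enzyme was found in the genome."""
    network = {}
    for enzyme in annotated_enzymes:
        if enzyme in known_reactions:
            substrate, product = known_reactions[enzyme]
            network.setdefault(substrate, []).append((enzyme, product))
    return network


def find_pathway(network, source, target):
    """Breadth-first search; returns the enzymes on a shortest route, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        metabolite, path = queue.popleft()
        if metabolite == target:
            return path
        for enzyme, product in network.get(metabolite, []):
            if product not in seen:
                seen.add(product)
                queue.append((product, path + [enzyme]))
    return None


complete = reconstruct_network({"E1", "E2", "E3"})
print(find_pathway(complete, "A", "D"))  # ['E1', 'E2', 'E3']

# If the gene for E2 was not found, the pathway has a gap.
partial = reconstruct_network({"E1", "E3"})
print(find_pathway(partial, "A", "D"))  # None
```

A missing route like the second one can point either to a genuinely absent pathway or to an enzyme gene that annotation failed to recognise.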

MICROBIAL EVOLUTION REVISITED

Multiple genomes have been compared extensively in order to correlate and classify them into families and to study evolution.

Researchers have confirmed that overall evolution combines point mutations, which give rise to speciation, with genome restructuring. The restructuring is driven by gene duplication, gene insertion, gene deletion, gene fusion and fission, horizontal gene transfer and domain-level restructuring.

Evolutionary studies follow three approaches:

1) 16S rRNA approach

This approach relies on point mutations in conserved genes, which mutate slowly. It uses a 16S rRNA database and multiple sequence alignment to build an evolutionary tree, which yields three distinct domains: bacteria, archaea and eukaryotes. For example, many archaea are hyperthermophilic, and the 16S rRNA of archaea differs from that of bacteria.

2) Genome rearrangement

Rearrangement happens through gene shuffling, and the amount of rearrangement is used as a measure of the genomic distance between two organisms.

3) Comparing overall gene content of functionally equivalent genes

This is done to identify the cumulative similarity of two genomes. This method assumes that there are very few conserved genes.
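The second and third approaches both boil down to simple distance measures once genes have been matched between genomes. The Python sketch below shows one simplified version of each: a count of broken gene adjacencies as a rearrangement measure, and a Jaccard index as a gene-content similarity. The function names and inputs are hypothetical, gene orientation and circular chromosomes are ignored, and published methods use more refined measures.

```python
def breakpoint_count(order_a, order_b):
    """Number of neighbouring gene pairs in order_a that are not
    neighbours in order_b (gene orientation is ignored)."""
    adjacent_in_b = {frozenset(pair) for pair in zip(order_b, order_b[1:])}
    return sum(
        1
        for pair in zip(order_a, order_a[1:])
        if frozenset(pair) not in adjacent_in_b
    )


def gene_content_similarity(genes_a, genes_b):
    """Jaccard index: shared gene families divided by all gene families."""
    a, b = set(genes_a), set(genes_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Swapping genes 2 and 3 breaks two adjacencies: (1,2) and (3,4).
print(breakpoint_count([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 2

# Two of four gene families are shared.
print(gene_content_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

Either number can be turned into a pairwise distance matrix over many genomes, from which a tree can then be built.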

WHY BIOINFORMATICS WILL HAVE A MAJOR IMPACT IN THE FIELD OF MICROBIOLOGY

Bioinformatics is a young field. Even so, it has advanced both fundamental microbiology and biotechnology through the development of algorithms and tools, and through discoveries that have improved the abstract model of how a microbial cell functions.

Its most significant achievement so far is the automation of microbial genome sequencing and of the analysis needed to understand genome function.

BLAST-based database search and Smith-Waterman-based pairwise gene alignment are widely used for comparing genes and genomes.

Better cell visualization techniques are expected. Combined with the abstract genome models that bioinformatics analysis has already yielded, and with their integration with existing biochemical knowledge, they will make microbial wet-lab techniques more focused on their goal. Progress in bioinformatics and in the wet lab has to stay interdependent, with each complementing the other.

In the near future, more and more attention will be given to applying both in an integrated way, which will help us manipulate microbial cells at a systemic level.

Microbial bioinformatics is and will remain a vibrant, creative discipline that keeps adding value to the ever-growing volume of sequence data, while embracing novel techniques and fresh approaches.

Microbial taxonomy will have to adapt to a scenario in which almost all microorganisms are discovered and characterized through sequence analysis. Genome sequencing is expected to become a routine approach in clinical and research laboratories.

The Internet of Things (IoT) will extend into healthcare settings, so that every component of a facility has its own IP address, which can be integrated with pathogen genome sequences. As we enter the third decade of the 21st century, microbial sequence space will remain the final frontier!

By Shaunak Raole

References:

1. Bansal AK. Bioinformatics in microbial biotechnology--a mini-review. Microb Cell Fact. 2005 Jun 28;4:19. doi: 10.1186/1475-2859-4-19. PMID: 15985162; PMCID: PMC1182391.

2. Pallen MJ. Microbial bioinformatics 2020. Microb Biotechnol. 2016 Sep;9(5):681-6. doi: 10.1111/1751-7915.12389. Epub 2016 Jul 29. PMID: 27471065; PMCID: PMC4993188.