Microbial Bioinformatics: A game changing tool/field for microbiology in the near future (By Shaunak Raole)
The development of new bioinformatics techniques has helped propel science towards the achievement of these goals. Current research in bioinformatics can be classified into:
(i) Genomics – the sequencing and comparative analysis of genomes to identify genes and genome functionality,
(ii) Proteomics – the identification and characterization of protein properties and the construction of metabolic pathways,
(iii) Visualization of cells and simulations that help in studying and modelling cell behaviour,
(iv) Application in the development of drugs and antimicrobial agents.
The approaches used can be grouped into three categories:
(1) Analysis based on experimental wet-lab data
(2) Derivation of new information using mathematical models
(3) An integrated approach that combines search techniques with mathematical modelling.
Bioinformatics has had a major impact on automating genome sequencing, on the automated development and integration of genomics and proteomics databases, and on the automated comparison of genomes to identify gene function. It is also used to derive metabolic pathways and to analyze gene expression, together with statistical analysis and data mining. This has helped to uncover protein-protein and protein-DNA interactions. It has also helped in modelling the 3D structures of proteins.
From a microbiological perspective, it has mainly helped in drug design and in analyzing the differences between pathogenic and non-pathogenic strains, which will ultimately help in identifying candidate genes for vaccines and antimicrobial agents. It also enables whole-genome comparison to understand microbial evolution.
Developing bioinformatics tools
has increased the pace of biological discovery through analysis of microbial
genomes.
Vastly improved computational capability, coupled with large-scale miniaturization of biochemical techniques such as PCR, BAC, electrophoresis and microarray chips, has produced an enormous amount of genomic and proteomic data across the globe. This in turn has led to a huge number of discoveries that would not have been possible with conventional wet-lab techniques alone.
The availability of this data has raised the expectation that we can control and harness genetics, and it has helped us to manipulate the genetics of microbes too. This brings several advantages, such as improved diagnosis of diseases, especially with the help of protein biomarkers, and the production of cost-effective vaccines to protect us against those diseases. Other advantages are rational drug design, improved agricultural quality and quantity, and an understanding of the microbial machinery at the systemic level.
Haemophilus influenzae was the first microbe to have its genome sequenced, in 1995. Since then, hundreds of genomes have been sequenced and archived online for public research (for example in GenBank), through the efforts of agencies and institutes such as NIH, DOE, EMBL and EBI, along with other national laboratories and universities. Drug development companies and bioremediation companies have also played a major part.
GENOME SEQUENCING
Bioinformatics has contributed to genome sequencing mainly through:
1) Automated sequencing techniques that couple PCR- and BAC-based amplification, 2D gel electrophoresis, and the automated reading and analysis of nucleotides
2) Joining smaller fragments to form a full genome sequence
3) Predicting promoter and protein-coding regions in the genome
Amplification techniques based on PCR and BAC give us genome fragments of limited size. The fragment sequences contain nucleotide reading errors, repeats (short, similar fragments that occur in two or more parts of the genome) and chimaeras, which are two separate parts of the genome that get attached end to end due to contamination, giving an artefactual fragment.
Producing multiple copies of these fragments, aligning them and taking a majority vote at each nucleotide position solves the problem of reading errors, which is why multiple experimental copies are needed. Chimaeras and repeats are removed prior to the final assembly of the genome fragments.
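The majority-voting step can be illustrated with a small Python sketch. It assumes the reads have already been aligned to the same length, with "-" marking a gap; the function name and the sample reads are made up for illustration and do not come from any particular sequencing tool.

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the majority-vote consensus of equal-length, pre-aligned reads."""
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    result = []
    # zip(*reads) walks the alignment one column (nucleotide position) at a time.
    for column in zip(*aligned_reads):
        # Gap characters do not get a vote.
        votes = Counter(base for base in column if base != "-")
        # Keep the most frequent base; a column made only of gaps stays a gap.
        result.append(votes.most_common(1)[0][0] if votes else "-")
    return "".join(result)


# Illustrative reads: two of the four copies carry a single reading error.
reads = ["ACGTTAGC", "ACGTAAGC", "ACGTTAGC", "ACCTTAGC"]
print(consensus(reads))
```

With only two copies a disagreement cannot be resolved by voting, which is one way to see why several experimental copies are needed.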
For assembly, the fragments are modelled as a weighted graph in which the nodes are fragments and the weight of an edge is the number of overlapping nucleotides between two fragments. The fragments are joined on the basis of maximum overlap using a greedy algorithm: the pair of nodes with the best score (here, the largest overlap) is collapsed first, so fragments with larger overlaps are joined into contigs before those with smaller ones.
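A minimal Python sketch of this greedy joining is given below. It assumes error-free fragments from a single strand with exact suffix-prefix overlaps; the function names, the minimum overlap of 3 and the sample fragments are illustrative assumptions. Real assemblers also have to handle reading errors, repeats and reverse complements.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        # Edge weight between node i and node j = overlap(frags[i], frags[j]).
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to join; the remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_size:]
        # Collapse the two nodes into one and continue.
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Always taking the largest overlap first is simple, but it can mis-join fragments around repeats, which is one reason repeats are removed before assembly.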
AUTOMATED IDENTIFICATION OF GENES
Once the contigs are joined, the next step is to identify the protein-coding regions, or ORFs (open reading frames), in the genome. The identification can be done by three methods:
1) Using Markov-model-based techniques (including Hidden Markov Models, HMMs) such as GLIMMER and GeneMark
These techniques use multiple probabilistic state machines, each of which is capable of identifying an ORF. Each machine predicts the next nucleotide by following its most probable state transition, and the prediction is matched against the nucleotide actually present in the sequence.
In microbial genomes, GLIMMER has achieved 95 to 97 per cent accuracy.
2) Searching databases such as GenBank
3) Using decision-tree-based algorithms that identify the start and stop codons of the coding regions
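To make the idea of an ORF concrete, here is a deliberately naive Python sketch that scans the three forward reading frames for a start codon followed by an in-frame stop codon. It is not how GLIMMER, GeneMark or the decision-tree methods work. The function name, the minimum length and the test sequence are assumptions for illustration, and it uses the standard start codon ATG and stop codons TAA, TAG and TGA.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the index of the A in ATG, end is exclusive and includes the
    stop codon, and frame is 0, 1 or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START:
                start = pos  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOPS:
                # Close the ORF at the first in-frame stop codon.
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

A real gene finder would also scan the reverse complement and use statistical models to decide which candidate ORFs are likely to be genuine genes.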
IDENTIFYING GENE FUNCTION
Once the ORFs are identified, the next step is to work out each gene's function and structure. Four popular tools, along with their variations, are used for this: BLAST, Smith-Waterman alignment, FASTA and BLOCKS.
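Of these, Smith-Waterman local alignment is compact enough to sketch in Python. The version below uses a linear gap penalty and simple match/mismatch scores; the scoring values (2, -1, -2) and the sample sequences are assumptions chosen for illustration, whereas real protein comparisons use substitution matrices and usually affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 is what makes the alignment local rather than global.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTACGTAA", "GGACGTCC"))
```

Filling the whole score matrix takes time proportional to the product of the two sequence lengths, which is why heuristic tools such as BLAST and FASTA are preferred for searching large databases.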
The basic unit of protein function is the protein domain. At the structural level, each domain is associated with a unique folding pattern of alpha helices, beta-pleated sheets or their variations. Multiple sequence alignment and HMMs are used to identify regions that are individually homologous across multiple homologous genes; these regions are probable domains. There are also domain-related databases such as PRODOM, Pfam and SMART.
3D STRUCTURE MODELING AND DOCKING
A protein can exist in one or more low-free-energy conformational states, depending on its interactions with other proteins. When the protein is in a stable conformation, certain regions of the protein are exposed for protein-protein or protein-DNA interactions.
Function depends on the exposed active site, so the function of an unknown protein is predicted by matching its 3D structure with that of a known protein.
The term docking refers to identifying the best match between the 3D structures of two molecules, one of which is a ligand and the other a receptor. How the two bind to each other is worked out by simulating their interacting surfaces and minimizing free energy at the domain level.
PAIRWISE GENOME COMPARISON
After identifying the gene functions, we perform pairwise genome comparisons. These provide us with:
1) Details of paralogous genes, which are duplicated genes with similar sequences but a little variation in function.
2) Information on orthologous genes, which are functionally equivalent but have diverged in two genomes because of speciation.
3) Information on lateral gene transfer, in which a gene is acquired from an evolutionarily distant microorganism.
4) Difference analysis for identifying genes that are specific to a group of genomes, such as pathogens.
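The difference analysis in point 4 reduces to set operations once each genome is represented as a set of gene-family identifiers. The Python sketch below assumes orthologous genes have already been grouped into families by an earlier comparison step. The function name, strain names and family labels are hypothetical.

```python
def group_specific_genes(target_group, background_group):
    """Gene families found in every target genome and in no background genome.

    Both arguments map a genome name to an iterable of gene-family IDs.
    """
    target_sets = [set(genes) for genes in target_group.values()]
    if not target_sets:
        return set()
    # Families shared by all genomes in the target group.
    core = set.intersection(*target_sets)
    # Families seen in at least one background genome.
    background = set().union(*(set(g) for g in background_group.values()))
    return core - background


# Hypothetical example: family IDs are placeholders, not real genes.
pathogens = {
    "strain_P1": {"famA", "famB", "famT"},
    "strain_P2": {"famA", "famC", "famT"},
}
non_pathogens = {
    "strain_N1": {"famA", "famB"},
    "strain_N2": {"famA", "famC"},
}
print(group_specific_genes(pathogens, non_pathogens))
```

Families that survive this filter are only candidates; whether they actually contribute to pathogenicity still has to be checked experimentally.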
RECONSTRUCTING METABOLIC PATHWAYS
This involves the automated reconstruction of the pathways of newly sequenced organisms and their comparison. It is done through three approaches:
1) A global network of reactions catalyzed by enzymes
This uses information on known biochemical pathways and enzymes to identify the enzyme functions of newly sequenced genes. This approach is quite powerful.
2) A network of gene groups connected through the reactions catalyzed by the enzymes embedded in those gene groups
Here, the enzymes and their functions are first identified in a newly sequenced genome. Groups of genes that share a common promoter are then identified by analyzing the promoter regions of the genes. The newly sequenced genome is then compared pairwise with multiple genomes to derive gene groups, and these gene groups are connected using biochemical knowledge of existing enzymes and pathways.
3) Modelling of the biochemical reactions globally, including products, by-products and the effects of cofactors
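As a rough illustration of the first approach, a reaction network can be stored as a table from enzyme to substrates and products, and then "expanded" from a set of starting metabolites using only the enzymes predicted in the new genome. This Python sketch is a toy model under that assumption. The enzyme and metabolite labels are placeholders and do not describe any real pathway.

```python
def reachable_metabolites(reactions, enzymes_present, seeds):
    """Return every metabolite that can be produced from the seed metabolites.

    reactions maps an enzyme label to a (substrates, products) pair of sets.
    Only reactions whose enzyme was identified in the genome may fire.
    """
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for enzyme, (substrates, products) in reactions.items():
            can_fire = enzyme in enzymes_present and substrates <= available
            if can_fire and not products <= available:
                available |= products
                changed = True
    return available


# Placeholder network: E1 turns A into B, E2 turns B into C, E3 turns C and X into D.
reference_reactions = {
    "E1": ({"A"}, {"B"}),
    "E2": ({"B"}, {"C"}),
    "E3": ({"C", "X"}, {"D"}),
}
# Suppose only E1 and E3 were identified in the newly sequenced genome.
print(sorted(reachable_metabolites(reference_reactions, {"E1", "E3"}, {"A", "X"})))
```

Here C and D are never reached because E2 is missing, which is the kind of gap a pathway comparison between organisms would highlight.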
MICROBIAL EVOLUTION REVISITED
Multiple genomes have been compared extensively in order to correlate and classify them into different families and to study evolution.
Researchers have confirmed that overall evolution is a combination of point mutations, which give rise to speciation, and genome restructuring, which is based on gene duplication, gene insertion, gene deletion, gene fusion/fission, horizontal gene transfer and domain-level restructuring.
The evolutionary study is classified into three approaches:
1) The 16S rRNA approach
This uses point mutations in conserved genes, which are suitable because they mutate slowly. It uses a 16S rRNA database and multiple sequence alignment to build an evolutionary tree. Three distinct domains are obtained: bacteria, archaea and eukaryotes. For example, the archaea, which include many hyperthermophiles, have a 16S rRNA that differs from that of bacteria.
2) Genome rearrangement
Rearrangement happens due to gene shuffling and is used as a measure of the genomic distance between two organisms.
3) Comparing the overall content of functionally equivalent genes
This is done to identify the cumulative similarity of two genomes. This method assumes that there are very few conserved genes.
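The second and third approaches can each be reduced to a simple number. The Python sketch below uses two stand-in measures that are assumptions on my part, not necessarily the ones used in the cited literature. The first is a breakpoint count for gene order (adjacent gene pairs in one genome that are not adjacent in the other), and the second is the Jaccard index for gene content. Genes are assumed to be labelled by shared family IDs, and the sample data is made up.

```python
def breakpoint_distance(order_a, order_b):
    """Count adjacent gene pairs in order_a that are not adjacent in order_b.

    Gene orientation is ignored and both genomes are treated as linear.
    """
    adjacent_in_b = {frozenset(pair) for pair in zip(order_b, order_b[1:])}
    return sum(
        frozenset(pair) not in adjacent_in_b
        for pair in zip(order_a, order_a[1:])
    )


def gene_content_similarity(genome_a, genome_b):
    """Jaccard index: shared gene families divided by all gene families."""
    a, b = set(genome_a), set(genome_b)
    if not a and not b:
        return 1.0  # two empty genomes are treated as identical
    return len(a & b) / len(a | b)


# Made-up gene orders: genes 3 and 4 have swapped places in the second genome.
print(breakpoint_distance([1, 2, 3, 4, 5], [1, 2, 4, 3, 5]))
# Made-up gene contents sharing two of four families.
print(gene_content_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```

A larger breakpoint count suggests more shuffling since the two organisms diverged, while a gene-content score close to 1 suggests the two genomes carry largely the same set of functions.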
WHY BIOINFORMATICS WILL HAVE A MAJOR IMPACT IN THE FIELD OF MICROBIOLOGY
Bioinformatics is a young field. Even so, it has helped both fundamental microbiology and biotechnology through the development of algorithms and tools, and through discoveries that have improved the abstract model of how a microbial cell functions.
Its most significant achievement is the automation of microbial genome sequencing and of the analysis needed to understand genome function.
BLAST-based database search and Smith-Waterman-based pairwise gene alignment algorithms are widely used in comparing genes and genomes.
Better cell visualization techniques are expected. As the abstract genome models produced by current bioinformatics analysis are integrated with existing biochemical knowledge, microbial wet-lab techniques will become more focused on their goals. Progress in bioinformatics and in the wet lab has to stay interdependent, with each complementing the other.
In the near future, more and more focus will be given to applying both in an integrated way, which will help us manipulate microbial cells at a systemic level.
Microbial bioinformatics is and will stay a vibrant, creative discipline, and will keep adding value as the demand for sequence data increases. At the same time, it will embrace novel techniques and fresh approaches.
Microbial taxonomy will adapt to a scenario in which almost all microorganisms are discovered and characterized through sequence analysis, which is expected to become a routine approach in clinical and research laboratories.
The Internet of Things (IoT) sector will merge into healthcare settings, such that every component of a facility will have its own IP address, integrated with pathogen genome sequences. As we enter the third decade of the 21st century, microbial sequence space will remain the final frontier!