Map of the human X chromosome (from the National Center for Biotechnology Information website).

Bioinformatics /ˌbaɪ.oʊˌɪnfərˈmætɪks/ is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

Bioinformatics includes biological studies that use computer programming as part of their methodology, as well as specific analysis "pipelines" that are repeatedly used, particularly in the field of genomics. Common uses of bioinformatics include the identification of candidate genes and single nucleotide polymorphisms (SNPs). Often, such identification is made with the aim of better understanding the genetic basis of disease, unique adaptations, desirable properties (especially in agricultural species), or differences between populations. In a less formal way, bioinformatics also tries to understand the organisational principles within nucleic acid and protein sequences, called proteomics.^[1]

Introduction

Bioinformatics has become an important part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow extraction of useful results from large amounts of raw data. In the field of genetics, it aids in sequencing and annotating genomes and their observed mutations. It plays a role in the text mining of biological literature and the development of biological and gene ontologies to organize and query biological data. It also plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools aid in comparing, analyzing and interpreting genetic and genomic data and, more generally, in the understanding of evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue the biological pathways and networks that are an important part of systems biology. In structural biology, it aids in the simulation and modeling of DNA,^[2] RNA,^[2]^[3] proteins^[4] as well as biomolecular interactions.^[5]^[6]^[7]^[8]

History

Historically, the term bioinformatics did not mean what it means today. Paulien Hogeweg and Ben Hesper coined it in 1970 to refer to the study of information processes in biotic systems.^[9]^[10]^[11] This definition placed bioinformatics as a field parallel to biochemistry (the study of chemical processes in biological systems).^[9]

Sequences

Sequences of genetic material are frequently used in bioinformatics and are easier to manage using computers than manually.

Computers became essential in molecular biology when protein sequences became available after Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences manually turned out to be impractical. A pioneer in the field was Margaret Oakley Dayhoff.^[12] She compiled one of the first protein sequence databases, initially published as books,^[13] and pioneered methods of sequence alignment and molecular evolution.^[14] Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with his comprehensive volumes of antibody sequences released with Tai Te Wu between 1980 and 1991.^[15]

Goals

To study how normal cellular activities are altered in different disease states, the biological data must be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data. This includes nucleotide and amino acid sequences, protein domains, and protein structures.^[16] The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

Development and implementation of computer programs that enable efficient access to, management and use of, various types of information
Development of new algorithms (mathematical formulas) and statistical measures that assess relationships among members of large data sets. For example, there are methods to locate a gene within a sequence, to predict protein structure and/or function, and to cluster protein sequences into families of related sequences.

The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques to achieve this goal. Examples include: pattern recognition, data mining, machine learning algorithms, and visualization. Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies, the modeling of evolution and cell division/mitosis.

Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

Over the past few decades, rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. Bioinformatics is the name given to these mathematical and computing approaches used to glean understanding of biological processes.

Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.

Relation to other fields

Bioinformatics is a field of science that is similar to but distinct from biological computation, while it is often considered synonymous with computational biology. Biological computation uses bioengineering and biology to build biological computers, whereas bioinformatics uses computation to better understand biology. Bioinformatics and computational biology involve the analysis of biological data, particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology.

Analyzing biological data to produce meaningful information involves writing and running software programs that use algorithms from graph theory, artificial intelligence, soft computing, data mining, image processing, and computer simulation. The algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, system theory, information theory, and statistics.

Sequence analysis

The sequences of different genes or proteins may be aligned side-by-side to measure their similarity. This alignment compares protein sequences and genomic sequences containing WPP domains.

Since the Phage Φ-X174 was sequenced in 1977,^[17] the DNA sequences of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode proteins, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer programs such as BLAST are used daily to search sequences from more than 260 000 organisms, containing over 190 billion nucleotides.^[18] These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, to identify sequences that are related, but not identical. A variant of this sequence alignment is used in the sequencing process itself. For the special task of taxonomic classification of sequence snippets, modern k-mer based software like Kraken achieves throughput unreachable by alignment methods.
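The idea of scoring two sequences while tolerating exchanged, deleted or inserted bases can be illustrated with a small dynamic-programming sketch. The example below computes a Needleman-Wunsch global alignment score; it is not how BLAST works internally (BLAST uses faster heuristics), and the function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not a standard.
    """
    # Row 0: aligning the empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j in range(1, len(b) + 1):
            # Exchanged or identical base: align a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Base present in a but missing from b.
            up = prev[j] + gap
            # Base present in b but missing from a.
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("GATTACA", "GATACA")
```

With these assumed scores, identical sequences score one point per base, and each inserted or deleted base lowers the score by the gap penalty, so related but non-identical sequences still score well above unrelated ones.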

DNA sequencing

Before sequences can be analyzed they have to be obtained from a data storage bank, for example GenBank. DNA sequencing is still a non-trivial problem as the raw data may be noisy or afflicted by weak signals. Algorithms have been developed for base calling for the various experimental approaches to DNA sequencing.

Sequence assembly

Most DNA sequencing techniques produce short fragments of sequence that need to be assembled to obtain complete gene or genome sequences. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research (TIGR) to sequence the first bacterial genome, Haemophilus influenzae)^[19] generates the sequences of many thousands of small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly usually contains numerous gaps that must be filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
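How overlapping fragment ends can be used to rebuild a longer sequence may be sketched with a toy greedy assembler. This is a minimal illustration assuming error-free reads from a single strand; production assemblers use far more elaborate graph-based methods, and the function names, the minimum overlap length and the example reads are all hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGA"])
```

When no pair of reads overlaps by at least `min_len` bases, the sketch returns several contigs, which corresponds to the gaps mentioned above that must be filled in later.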

Genome annotation

In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. This process needs to be automated because most genomes are too large to annotate by hand, not to mention the desire to annotate as many genomes as possible, as the rate of sequencing has ceased to pose a bottleneck. Annotation is made possible by the fact that genes have recognisable start and stop regions, although the exact sequence found in these regions can vary between genes.
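As a rough illustration of how recognisable start and stop signals can drive automated annotation, the sketch below scans the forward strand for open reading frames. It assumes the common ATG start codon and the TAA, TAG and TGA stop codons, ignores the reverse strand and introns, and is far simpler than trained gene finders such as GeneMark; the function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ATG...stop stretches, forward strand only."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons including the start and the stop codon.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
```

Each returned pair can be used to slice the original sequence, for instance `example[start:end]`, to retrieve the candidate coding region.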

The first description of a comprehensive genome annotation system was published in 1995^[19] by the team at The Institute for Genomic Research that performed the first complete sequencing and analysis of the genome of a free-living organism, the bacterium Haemophilus influenzae.^[19] Owen White designed and built a software system to identify the genes encoding all proteins, transfer RNAs, ribosomal RNAs (and other sites) and to make initial functional assignments. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA, such as the GeneMark program trained and used to find protein-coding genes in Haemophilus influenzae, are constantly changing and improving.

To pursue the goals that the Human Genome Project left unachieved at its closure in 2003, the National Human Genome Research Institute in the U.S. developed a new project. The so-called ENCODE project is a collaborative data collection of the functional elements of the human genome that uses next-generation DNA-sequencing technologies and genomic tiling arrays, technologies able to automatically generate large amounts of data at a dramatically reduced per-base cost but with the same accuracy (base call error) and fidelity (assembly error).

Computational evolutionary biology

Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists by enabling researchers to:

trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone,
compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer, and the prediction of factors important in bacterial speciation,
build complex computational population genetics models to predict the outcome of the system over time^[20]
track and share information on an increasingly large number of species and organisms
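A very small example of a computational population genetics model is a Wright-Fisher simulation of genetic drift, sketched below. It assumes a diploid population of constant size with no selection, mutation or migration; the function name, parameters and default seed are hypothetical choices for illustration, and real models are considerably richer.

```python
import random


def wright_fisher(pop_size, start_freq, generations, seed=0):
    """Allele frequency trajectory under pure genetic drift."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Each of the 2N gene copies in the next generation is drawn
        # independently from the current allele pool.
        copies = sum(rng.random() < freq for _ in range(2 * pop_size))
        freq = copies / (2 * pop_size)
        trajectory.append(freq)
    return trajectory


trajectory = wright_fisher(pop_size=50, start_freq=0.5, generations=20)
```

Running the sketch with different seeds gives different trajectories, and once the frequency reaches 0 or 1 it stays there, which is the loss or fixation of the allele.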

Future work endeavours to reconstruct the now more complex tree of life.

The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are not necessarily related.

Comparative genomics

The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion.^[21] Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristics, fixed parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.

Many of these studies are based on the detection of sequence homology to assign sequences to protein families.^[22]

Pan genomics

Pan genomics is a concept introduced in 2005 by Tettelin and Medini which eventually took root in bioinformatics. The pan genome is the complete gene repertoire of a particular taxonomic group: although initially applied to closely related strains of a species, it can be applied to a larger context such as a genus or phylum. It is divided into two parts: the core genome, the set of genes common to all the genomes under study (these are often housekeeping genes vital for survival), and the dispensable (or flexible) genome, the set of genes present in only one or some of the genomes under study rather than in all of them. The bioinformatics tool BPGA can be used to characterize the pan genome of bacterial species.^[23]
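The split into core and dispensable genes maps naturally onto set operations. The following sketch assumes that each genome has already been reduced to a set of gene family names; it does not reproduce how BPGA works, and the function name, strain names and gene labels are placeholders for illustration.

```python
def pan_genome(genomes):
    """Split gene families into pan, core and dispensable sets.

    genomes maps a strain name to the set of gene families it carries.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)  # every gene seen in any genome
    core = set.intersection(*gene_sets) if gene_sets else set()  # in all genomes
    return {"pan": pan, "core": core, "dispensable": pan - core}


# Hypothetical input: three strains with placeholder gene labels.
strains = {
    "strain_1": {"dnaA", "gyrB", "toxA"},
    "strain_2": {"dnaA", "gyrB"},
    "strain_3": {"dnaA", "gyrB", "resX"},
}
result = pan_genome(strains)
```

In this toy input the two genes shared by every strain form the core genome, while the genes found in a single strain fall into the dispensable genome.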

Genetics of disease

With the advent of next-generation sequencing we are obtaining enough sequence data to map the genes of complex diseases such as infertility,^[24] breast cancer^[25] or Alzheimer's disease.^[26] Genome-wide association studies are a useful approach to pinpoint the mutations responsible for such complex diseases.^[27] Through these studies, thousands of DNA variants have been identified that are associated with similar diseases and traits.^[28] Furthermore, the possible use of genes in prognosis, diagnosis or treatment is one of the most essential applications. Many studies discuss both the promising ways to choose the genes to be used and the problems and pitfalls of using genes to predict disease presence or prognosis.^[29]

Analysis of mutations in cancer

In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms. New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabytes of data per experiment. Again the massive amounts and new types of data generate new opportunities for bioinformaticians. The data is often found to contain considerable variability, or noise, and thus Hidden Markov model and change-point analysis methods are being developed to infer real copy number changes.

Two important principles can be used in the bioinformatic analysis of cancer genomes to identify mutations in the exome. First, cancer is a disease of accumulated somatic mutations in genes. Second, cancer contains driver mutations which need to be distinguished from passengers.^[30]

With the breakthroughs that next-generation sequencing technology is providing to the field of bioinformatics, cancer genomics could drastically change. These new methods and software allow bioinformaticians to sequence many cancer genomes quickly and affordably. This could create a more flexible process for classifying types of cancer by analysis of cancer-driven mutations in the genome. Furthermore, tracking of patients while the disease progresses may be possible in the future with the sequence of cancer samples.^[31]

Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors.

Gene and protein expression

Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies.^[32] Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
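The comparison of cancerous and non-cancerous samples can be sketched as a log2 fold-change calculation. This is a deliberately simplified illustration: it assumes already normalized expression values, uses a hypothetical pseudocount and threshold, and omits the statistical testing and multiple-testing correction that real studies need to separate signal from noise. All function and gene names are placeholders.

```python
import math
from statistics import mean


def log2_fold_changes(tumor, normal, pseudocount=1.0):
    """Map each shared gene to log2(mean tumor / mean normal expression).

    tumor and normal map a gene name to a list of values across samples.
    The pseudocount avoids division by zero for unexpressed genes.
    """
    return {
        gene: math.log2(
            (mean(tumor[gene]) + pseudocount) / (mean(normal[gene]) + pseudocount)
        )
        for gene in tumor.keys() & normal.keys()
    }


def classify(fold_changes, threshold=1.0):
    """Return sorted lists of up-regulated and down-regulated genes."""
    up = sorted(g for g, fc in fold_changes.items() if fc >= threshold)
    down = sorted(g for g, fc in fold_changes.items() if fc <= -threshold)
    return up, down


# Hypothetical expression values for three placeholder genes.
tumor = {"GENE_A": [7, 7], "GENE_B": [1, 1], "GENE_C": [3, 3]}
normal = {"GENE_A": [1, 1], "GENE_B": [7, 7], "GENE_C": [3, 3]}
up, down = classify(log2_fold_changes(tumor, normal))
```

A threshold of 1.0 on the log2 scale corresponds to at least a two-fold change in either direction; genes with similar expression in both groups are reported in neither list.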

Analysis of protein expression

Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as with microarrays targeted at mRNA, the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple, but incomplete peptides from each protein are detected. Cellular protein localization in a tissue context can be achieved through affinity proteomics displayed as spatial data based on immunohistochemistry and tissue microarrays.^[33]

Analysis of regulation

Gene regulation is the complex orchestration of events by which a signal, potentially an extracellular signal such as a hormone, eventually leads to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process.

For example, gene expression can be regulated by nearby elements in the genome. Promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Enhancer elements far away from the promoter can also regulate gene expression, through three-dimensional looping interactions. These interactions can be determined by bioinformatic analysis of chromosome conformation capture experiments.

Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements. Examples of clustering algorithms applied in gene clustering are k-means clustering, self-organizing maps (SOMs), hierarchical clustering, and consensus clustering methods.
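The step of grouping co-expressed genes can be illustrated with a simple correlation-based sketch. It links genes whose expression profiles have a Pearson correlation above a threshold and reports the connected groups, which is equivalent to cutting a single-linkage hierarchical clustering at a fixed level; it is not k-means or a self-organizing map. The function names, the `min_r` threshold and the gene profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles (0.0 if one is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_clusters(profiles, min_r=0.9):
    """Group genes whose profiles are linked by a correlation >= min_r."""
    genes = sorted(profiles)
    cluster_of = {g: {g} for g in genes}  # every gene starts in its own cluster
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            linked = pearson(profiles[g], profiles[h]) >= min_r
            if linked and cluster_of[g] is not cluster_of[h]:
                merged = cluster_of[g] | cluster_of[h]
                for member in merged:
                    cluster_of[member] = merged
    unique = {id(c): c for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique.values())


# Hypothetical expression profiles across four conditions.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 2.5],
    "g5": [5, 5, 5, 5],
}
clusters = coexpression_clusters(profiles)
```

The upstream regions of the genes in each resulting group could then be searched for shared regulatory elements, as described above.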

Analysis of cellular organization

Several approaches have been developed to analyze the location of organelles, genes, proteins, and other components within cells. This is relevant as the location of these components affects the events within a cell and thus helps us to predict the behavior of biological systems. A gene ontology category, cellular compartment, has been devised to capture subcellular localization in many biological databases.

Microscopy and image analysis

Microscopic pictures allow us to locate both organelles and molecules. They may also help us to distinguish between normal and abnormal cells, e.g. in cancer.

Protein localization

The localization of proteins helps us to evaluate the role of a protein. For instance, if a protein is found in the nucleus it may be involved in gene regulation or splicing. By contrast, if a protein is found in mitochondria, it may be involved in respiration or other metabolic processes. Protein localization is thus an important component of protein function prediction. There are well developed protein subcellular localization prediction resources available, including protein subcellular location databases, and prediction tools.^[34]^[35]

Nuclear organization of chromatin

Data from high-throughput chromosome conformation capture experiments, such as Hi-C and ChIA-PET, can provide information on the spatial proximity of DNA loci. Analysis of these experiments can determine the three-dimensional structure and nuclear organization of chromatin. Bioinformatic challenges in this field include partitioning the genome into domains, such as Topologically Associating Domains (TADs), that are organised together in three-dimensional space.^[36]

Structural bioinformatics

3-dimensional protein structures such as this one are common subjects in bioinformatic analyses.

Protein structure prediction is another important application of bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence on the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform encephalopathy (mad cow disease) prion.) Knowledge of this structure is vital in understanding the function of the protein. Structural information is usually classified as one of secondary, tertiary and quaternary structure. A viable general solution to such predictions remains an open problem. Most efforts have so far been directed towards heuristics that work most of the time.

One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.

One example of this is hemoglobin in humans and the hemoglobin in legumes (leghemoglobin), which are distant relatives from the same protein superfamily. Both serve the same purpose of transporting oxygen in the organism. Although both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near identical purposes and shared ancestor.^[37]

Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling.

Another aspect of structural bioinformatics includes the use of protein structures for virtual screening models such as Quantitative Structure-Activity Relationship models and proteochemometric models (PCM). Furthermore, a protein's crystal structure can be used in the simulation of, for example, ligand-binding studies and in silico mutagenesis studies.

Network and systems biology

Network analysis seeks to understand the relationships within biological networks such as metabolic or protein–protein interaction networks. Although biological networks can be constructed from a single type of molecule or entity (such as genes), network biology often attempts to integrate many different data types, such as proteins, small molecules, gene expression data, and others, which are all connected physically, functionally, or both.
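A basic form of network analysis is to store interactions as a graph and ask which nodes are most connected. The sketch below builds an undirected protein–protein interaction network from a list of pairs and ranks proteins by their number of partners; the function names and the protein identifiers are hypothetical, and real analyses typically rely on dedicated graph libraries and richer measures.

```python
from collections import defaultdict


def build_network(interactions):
    """Undirected adjacency sets from (protein_a, protein_b) pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        if a != b:  # self-interactions are ignored in this sketch
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hubs(graph, top=1):
    """Proteins with the most interaction partners (highest degree)."""
    # Ties are broken alphabetically so that the result is deterministic.
    ranked = sorted(graph, key=lambda p: (-len(graph[p]), p))
    return ranked[:top]


# Hypothetical interaction list with placeholder protein names.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(interactions)
top_hub = hubs(network)
```

Highly connected proteins found this way are only candidates for further study; degree alone does not establish biological importance.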

Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes that comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.

Molecular interaction networks

Interactions between proteins are frequently visualized and analyzed using networks. This network is made up of protein–protein interactions from Treponema pallidum, the causative agent of syphilis and other diseases.

Tens of thousands of three-dimensional protein structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR) and a central question in structural bioinformatics is whether it is practical to predict possible protein–protein interactions only based on these 3D shapes, without performing protein–protein interaction experiments. A variety of methods have been developed to tackle the protein–protein docking problem, though it seems that there is still much work to be done in this field.

Other interactions encountered in the field include protein–ligand (including drug) and protein–peptide interactions. Molecular dynamic simulation of movement of atoms about rotatable bonds is the fundamental principle behind computational algorithms, termed docking algorithms, for studying molecular interactions.

Others

Literature analysis

The growth in the volume of published literature makes it virtually impossible to read every paper, resulting in disjointed sub-fields of research. Literature analysis aims to employ computational and statistical linguistics to mine this growing library of text resources. For example:

Abbreviation recognition – identify the long-form and abbreviation of biological terms
Named entity recognition – recognizing biological terms such as gene names
Protein–protein interaction – identify which proteins interact with which proteins from text
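Abbreviation recognition, the first task listed above, can be approximated with a pattern-matching sketch. The example assumes the common "long form (ABBR)" writing convention and accepts an abbreviation only when its letters match the initials of the preceding words; it is a naive heuristic rather than a published method, and the function name is hypothetical.

```python
import re


def find_abbreviations(text):
    """Return {abbreviation: long form} for 'long form (ABBR)' patterns."""
    found = {}
    # Upper-case abbreviations in parentheses, optionally with a plural "s".
    for match in re.finditer(r"\(([A-Z]{2,10})s?\)", text):
        abbr = match.group(1)
        # Words that precede the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(abbr):]
        initials_match = len(candidate) == len(abbr) and all(
            word[0].upper() == letter for word, letter in zip(candidate, abbr)
        )
        if initials_match:
            found[abbr] = " ".join(candidate)
    return found


sentence = (
    "Common uses include the identification of single nucleotide "
    "polymorphisms (SNPs) and high throughput mass spectrometry (MS)."
)
abbreviations = find_abbreviations(sentence)
```

The heuristic misses long forms that contain unabbreviated function words, such as "serial analysis of gene expression (SAGE)", which is one reason practical systems use statistical or learned models instead.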

The area of research draws from statistics and computational linguistics.

High-throughput image analysis

Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:

high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology, Bioimage informatics)
morphometrics
clinical image analysis and visualization
determining the real-time air-flow patterns in breathing lungs of living animals
quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury
making behavioral observations from extended video recordings of laboratory animals
infrared measurements for metabolic activity determination
inferring clone overlaps in DNA mapping, e.g. the Sulston score

High-throughput single cell data analysis

Computational techniques are used to analyse high-throughput, low-measurement single cell data, such as that obtained from flow cytometry. These methods typically involve finding populations of cells that are relevant to a particular disease state or experimental condition.

Biodiversity informatics

Biodiversity informatics deals with the collection and analysis of biodiversity data, such as taxonomic databases, or microbiome data. Examples of such analyses include phylogenetics, niche modelling, species richness mapping, DNA barcoding, or species identification tools.

Ontologies and data integration

Biological ontologies are directed acyclic graphs of controlled vocabularies. They are designed to capture biological concepts and descriptions in a way that can be easily categorised and analysed with computers. When categorised in this way, it is possible to gain added value from holistic and integrated analysis.
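Because an ontology is a directed acyclic graph, a term can have several parents, and a typical query is to collect every more general term above it. The sketch below shows such a traversal on a toy hierarchy; the term names are simplified placeholders and are not copied from the Gene Ontology or any other real ontology, and the function name is hypothetical.

```python
def ancestors(term, parents):
    """All terms reachable from term by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # a term may be reached via several paths
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy hierarchy: each term maps to its direct parents (placeholder names).
toy_parents = {
    "nuclear_membrane": ["nucleus", "membrane"],
    "nucleus": ["organelle"],
    "membrane": ["cellular_component"],
    "organelle": ["cellular_component"],
}
more_general = ancestors("nuclear_membrane", toy_parents)
```

Annotating a gene with a specific term then implies all of the collected ancestor terms, which is what makes integrated analysis across differently detailed annotations possible.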

The OBO Foundry was an effort to standardise certain ontologies. One of the most widespread is the Gene ontology which describes gene function. There are also ontologies which describe phenotypes.

Databases

Databases are essential for bioinformatics research and applications. Many databases exist, covering various information types: for example, DNA and protein sequences, molecular structures, phenotypes and biodiversity. Databases may contain empirical data (obtained directly from experiments), predicted data (obtained from analysis), or, most commonly, both. They may be specific to a particular organism, pathway or molecule of interest. Alternatively, they can incorporate data compiled from multiple other databases. These databases vary in their format, access mechanism, and whether they are public or not.

Some of the most commonly used databases are listed below.

Used in biological sequence analysis: GenBank, UniProt
Used in structure analysis: Protein Data Bank (PDB)
Used in finding Protein Families and Motif Finding: InterPro, Pfam
Used for Next Generation Sequencing: Sequence Read Archive
Used in Network Analysis: Metabolic Pathway Databases (KEGG, BioCyc), Interaction Analysis Databases, Functional Networks
Used in design of synthetic genetic circuits: GenoCAD

Software and tools

Software tools for bioinformatics range from simple command-line tools, to more complex graphical programs and standalone web-services available from various bioinformatics companies or public institutions.

Open-source bioinformatics software[edit]

Many free and open-source software tools have existed since the 1980s, and their number has continued to grow.^[38] The combination of a continued need for new algorithms for the analysis of emerging types of biological readouts, the potential for innovative in silico experiments, and freely available open code bases has helped to create opportunities for all research groups to contribute to both bioinformatics and the range of open-source software available, regardless of their funding arrangements. The open-source tools often act as incubators of ideas, or as community-supported plug-ins in commercial applications. They may also provide de facto standards and shared object models for assisting with the challenge of bioinformation integration.

The range of open-source software packages includes titles such as Bioconductor, BioPerl, Biopython, BioJava, BioJS, BioRuby, Bioclipse, EMBOSS, .NET Bio, Orange with its bioinformatics add-on, Apache Taverna, UGENE and GenoCAD. To maintain this tradition and create further opportunities, the non-profit Open Bioinformatics Foundation^[38] have supported the annual Bioinformatics Open Source Conference (BOSC) since 2000.^[39]

An alternative method to build public bioinformatics databases is to use the MediaWiki engine with the WikiOpener extension. This system allows the database to be accessed and updated by all experts in the field.^[40]

Web services in bioinformatics[edit]

SOAP- and REST-based interfaces have been developed for a wide variety of bioinformatics applications allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world. The main advantages derive from the fact that end users do not have to deal with software and database maintenance overheads.

Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment), and BSA (Biological Sequence Analysis).^[41] The availability of these service-oriented bioinformatics resources demonstrates the applicability of web-based bioinformatics solutions, which range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems.

Bioinformatics workflow management systems[edit]

A bioinformatics workflow management system is a specialized form of workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or a workflow, in a bioinformatics application. Such systems are designed to

provide an easy-to-use environment for individual application scientists themselves to create their own workflows,
provide interactive tools for the scientists enabling them to execute their workflows and view their results in real-time,
simplify the process of sharing and reusing workflows between the scientists, and
enable scientists to track the provenance of the workflow execution results and the workflow creation steps.
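The provenance-tracking goal in the list above can be sketched in a few lines. This is a toy runner under stated assumptions, not the design of any of the platforms named in this section: the function names `fingerprint` and `run_workflow`, the two example steps, and the choice of a truncated SHA-256 hash are all illustrative.

```python
import hashlib
import json

def fingerprint(obj):
    """Short, stable hash of a JSON-serialisable object."""
    text = json.dumps(obj, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def run_workflow(steps, data):
    """Run (name, function) steps in order; return the result and a provenance log."""
    provenance = []
    for name, func in steps:
        result = func(data)
        # Record which step ran and what it consumed and produced.
        provenance.append({"step": name,
                           "input": fingerprint(data),
                           "output": fingerprint(result)})
        data = result
    return data, provenance

# Two hypothetical steps: normalise case, then drop very short reads.
steps = [
    ("uppercase", lambda seqs: [s.upper() for s in seqs]),
    ("drop_short", lambda seqs: [s for s in seqs if len(s) >= 4]),
]
result, log = run_workflow(steps, ["acgt", "ac", "ggcatt"])
```

Because each log entry records the hash of its input and output, the output of one step can be matched to the input of the next, which is the minimum needed to reconstruct how a result was produced.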

Platforms that provide this service include Galaxy, Kepler, Taverna, UGENE, Anduril, and HIVE.

BioCompute and BioCompute Objects[edit]

In 2014, the US Food and Drug Administration sponsored a conference held at the National Institutes of Health Bethesda Campus to discuss reproducibility in bioinformatics.^[42] Over the next three years, a consortium of stakeholders met regularly to discuss what would become the BioCompute paradigm.^[43] These stakeholders included representatives from government, industry, and academic entities. Session leaders represented numerous branches of the FDA and NIH Institutes and Centers, non-profit entities including the Human Variome Project and the European Federation for Medical Informatics, and research institutions including Stanford, the New York Genome Center, and the George Washington University.

It was decided that the BioCompute paradigm would take the form of digital ‘lab notebooks’ that allow for the reproducibility, replication, review, and reuse of bioinformatics protocols. This was proposed to enable greater continuity within a research group over the course of normal personnel flux while furthering the exchange of ideas between groups. The US FDA funded this work so that information on pipelines would be more transparent and accessible to their regulatory staff.^[44]

In 2016, the group reconvened at the NIH in Bethesda and discussed the potential for a BioCompute Object, an instance of the BioCompute paradigm. This work was copied as both a “standard trial use” document and a preprint paper uploaded to bioRxiv. The BioCompute object allows for the JSON-ized record to be shared among employees, collaborators, and regulators.^[45]^[46]

Education platforms[edit]

Software platforms designed to teach bioinformatics concepts and methods include Rosalind and online courses offered through the Swiss Institute of Bioinformatics Training Portal. The Canadian Bioinformatics Workshops provides videos and slides from training workshops on its website under a Creative Commons license. The 4273π project (also written 4273pi)^[47] also offers open-source educational materials for free. The course runs on low-cost Raspberry Pi computers and has been used to teach adults and school pupils.^[48]^[49] 4273π is actively developed by a consortium of academics and research staff who have run research-level bioinformatics using Raspberry Pi computers and the 4273π operating system.^[50]^[51]

MOOC platforms also provide online certifications in bioinformatics and related disciplines, including Coursera's Bioinformatics Specialization (UC San Diego) and Genomic Data Science Specialization (Johns Hopkins) as well as EdX's Data Analysis for Life Sciences XSeries (Harvard). The University of Southern California offers a Masters in Translational Bioinformatics focusing on biomedical applications.

Conferences[edit]

There are several large conferences that are concerned with bioinformatics. Some of the most notable examples are Intelligent Systems for Molecular Biology (ISMB), European Conference on Computational Biology (ECCB), and Research in Computational Molecular Biology (RECOMB).

Bioinformatics

Bioinformatics, a hybrid science that links biological data with techniques for information storage, distribution, and analysis to support multiple areas of scientific research, including biomedicine. Bioinformatics is fed by high-throughput data-generating experiments, including genomic sequence determinations and measurements of gene expression patterns. Database projects curate and annotate the data and then distribute it via the World Wide Web. Mining these data leads to scientific discoveries and to the identification of new clinical applications. In the field of medicine in particular, a number of important applications for bioinformatics have been discovered. For example, it is used to identify correlations between gene sequences and diseases, to predict protein structures from amino acid sequences, to aid in the design of novel drugs, and to tailor treatments to individual patients based on their DNA sequences (pharmacogenomics).

This computerized image of anthrax shows the various structural relationships of seven units within the protein and demonstrates the interaction of a drug (shown in yellow) bound to the protein to block the so-called lethal factor unit. Bioinformatics plays an important role in enabling scientists to predict where a drug molecule will bind within a protein, given the individual structures of the molecules. Credit: University of Oxford/Getty Images

The data of bioinformatics

The classic data of bioinformatics include DNA sequences of genes or full genomes; amino acid sequences of proteins; and three-dimensional structures of proteins, nucleic acids and protein–nucleic acid complexes. Additional “-omics” data streams include: transcriptomics, the pattern of RNA synthesis from DNA; proteomics, the distribution of proteins in cells; interactomics, the patterns of protein-protein and protein–nucleic acid interactions; and metabolomics, the nature and traffic patterns of transformations of small molecules by the biochemical pathways active in cells. In each case there is interest in obtaining comprehensive, accurate data for particular cell types and in identifying patterns of variation within the data. For example, data may fluctuate depending on cell type, timing of data collection (during the cell cycle, or diurnal, seasonal, or annual variations), developmental stage, and various external conditions. Metagenomics and metaproteomics extend these measurements to a comprehensive description of the organisms in an environmental sample, such as in a bucket of ocean water or in a soil sample.

Bioinformatics has been driven by the great acceleration in data-generation processes in biology. Genome sequencing methods show perhaps the most dramatic effects. In 1999 the nucleic acid sequence archives contained a total of 3.5 billion nucleotides, slightly more than the length of a single human genome; a decade later they contained more than 283 billion nucleotides, the length of about 95 human genomes. The U.S. National Institutes of Health has challenged researchers by setting a goal to reduce the cost of sequencing a human genome to $1,000; this would make DNA sequencing a more affordable and practical tool for U.S. hospitals and clinics, enabling it to become a standard component of diagnosis.
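The growth figures in this paragraph can be checked with a short calculation. The three constants are taken directly from the text; the derived quantities (the fold increase and the genome length implied by "about 95 human genomes") are simple arithmetic on them, not independent data.

```python
NUCLEOTIDES_1999 = 3.5e9   # archive size in 1999, from the text
NUCLEOTIDES_2009 = 283e9   # archive size a decade later, from the text
GENOME_EQUIVALENTS = 95    # "about 95 human genomes", from the text

# How many times larger the archives became over the decade.
fold_increase = NUCLEOTIDES_2009 / NUCLEOTIDES_1999

# Genome length implied by equating 283 billion nucleotides with 95 genomes.
implied_genome_length = NUCLEOTIDES_2009 / GENOME_EQUIVALENTS

print(f"fold increase over the decade: {fold_increase:.1f}")
print(f"implied genome length: {implied_genome_length / 1e9:.2f} billion nucleotides")
```

The archives grew roughly 81-fold, and the comparison implies a human genome of about 3 billion nucleotides, consistent with the statement that 3.5 billion is slightly more than one genome.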

Storage and retrieval of data

In bioinformatics, data banks are used to store and organize data. Many of these entities collect DNA and RNA sequences from scientific papers and genome projects. Many databases are in the hands of international consortia. For example, an advisory committee made up of members of the European Molecular Biology Laboratory Nucleotide Sequence Database (EMBL-Bank) in the United Kingdom, the DNA Data Bank of Japan (DDBJ), and GenBank of the National Center for Biotechnology Information (NCBI) in the United States oversees the International Nucleotide Sequence Database Collaboration (INSDC). To ensure that sequence data are freely available, scientific journals require that new nucleotide sequences be deposited in a publicly accessible database as a condition for publication of an article. (Similar conditions apply to nucleic acid and protein structures.) There also exist genome browsers, databases that bring together all the available genomic and molecular information about a particular species.

The major database of biological macromolecular structure is the worldwide Protein Data Bank (wwPDB), a joint effort of the Research Collaboratory for Structural Bioinformatics (RCSB) in the United States, the Protein Data Bank Europe (PDBe) at the European Bioinformatics Institute in the United Kingdom, and the Protein Data Bank Japan at Ōsaka University. The homepages of the wwPDB partners contain links to the data files themselves, to expository and tutorial material (including news items), to facilities for deposition of new entries, and to specialized search software for retrieving structures.

Information retrieval from the data archives utilizes standard tools for identification of data items by keyword; for instance, one can type “aardvark myoglobin” into Google and retrieve the molecule’s amino acid sequence. Other algorithms search data banks to detect similarities between data items. For example, a standard problem is to probe a sequence database with a gene or protein sequence of interest in order to detect entities with similar sequences.

Goals of bioinformatics

The development of efficient algorithms for measuring sequence similarity is an important goal of bioinformatics. The Needleman-Wunsch algorithm, which is based on dynamic programming, guarantees finding the optimal alignment of pairs of sequences. This algorithm essentially divides a large problem (the full sequence) into a series of smaller problems (short sequence segments) and uses the solutions of the smaller problems to construct a solution to the large problem. Similarities in sequences are scored in a matrix, and the algorithm allows for the detection of gaps in sequence alignment.
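The dynamic programming idea described above can be sketched as follows. The scoring values (match +1, mismatch -1, and a linear gap penalty of -2) are illustrative assumptions; in practice, substitution matrices and more elaborate gap penalties are commonly used.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        """Substitution score for aligning a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: each cell reuses the solutions of three smaller problems.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

best_score, top, bottom = needleman_wunsch("ACGT", "AGT")
```

For this pair the sketch returns a score of 1 with `ACGT` aligned against `A-GT`: three matches and one gap. The matrix has (n+1) × (m+1) cells, which is why the method becomes slow when one sequence must be compared against an entire database.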

Although the Needleman-Wunsch algorithm is effective, it is too slow for probing a large sequence database. Therefore, much attention has been given to finding fast information-retrieval algorithms that can deal with the vast amounts of data in the archives. An example is the program BLAST (Basic Local Alignment Search Tool). A development of BLAST, known as position-specific iterated- (or PSI-) BLAST, makes use of patterns of conservation in related sequences and combines the high speed of BLAST with very high sensitivity to find related sequences.
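The text does not describe how fast search programs work internally. One widely described idea behind heuristic tools of this kind is to index short words (k-mers) in the database and use exact word matches as "seeds" that identify candidate sequences, which can then be examined more carefully. The sketch below shows only that seeding step, with hypothetical sequence identifiers and an arbitrary word length; it is not an implementation of BLAST.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map every length-k word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def seed_hits(query, index, k):
    """Count exact k-mer matches between the query and each database sequence."""
    counts = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for seq_id, _offset in index.get(query[pos:pos + k], []):
            counts[seq_id] += 1
    return dict(counts)

# Hypothetical two-sequence "database".
database = {"s1": "ACGTACGTGA", "s2": "TTTTGGGGCC"}
index = build_kmer_index(database, k=4)
hits = seed_hits("CGTACG", index, k=4)
```

Only sequences that share at least one word with the query appear in `hits`, so the expensive alignment step can be restricted to a small fraction of the database.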

Another goal of bioinformatics is the extension of experimental data by predictions. A fundamental goal of computational biology is the prediction of protein structure from an amino acid sequence. The spontaneous folding of proteins shows that this should be possible. Progress in the development of methods to predict protein folding is measured by biennial Critical Assessment of Structure Prediction (CASP) programs, which involve blind tests of structure prediction methods.

Bioinformatics is also used to predict interactions between proteins, given individual structures of the partners. This is known as the “docking problem.” Protein-protein complexes show good complementarity in surface shape and polarity and are stabilized largely by weak interactions, such as burial of hydrophobic surface, hydrogen bonds, and van der Waals forces. Computer programs simulate these interactions to predict the optimal spatial relationship between binding partners. A particular challenge, one that could have important therapeutic applications, is to design an antibody that binds with high affinity to a target protein.

Initially, much bioinformatics research had a relatively narrow focus, concentrating on devising algorithms for analyzing particular types of data, such as gene sequences or protein structures. Now, however, the goals of bioinformatics are integrative and are aimed at figuring out how combinations of different types of data can be used to understand natural phenomena, including organisms and disease.

Arthur M. Lesk

CITATION INFORMATION

ARTICLE TITLE: Bioinformatics

WEBSITE NAME: Encyclopaedia Britannica

PUBLISHER: Encyclopaedia Britannica, Inc.

DATE PUBLISHED: 18 February 2019

URL: https://www.britannica.com/science/bioinformatics

ACCESS DATE: December 23, 2019

Abstract

Bioinformatics is an interdisciplinary scientific field of life sciences. Bioinformatics research and application include the analysis of molecular sequence and genomics data; genome annotation, gene/protein prediction, and expression profiling; molecular folding, modeling, and design; building biological networks; development of databases and data management systems; development of software and analysis tools; bioinformatics services and workflow; mining of biomedical literature and text; and bioinformatics education and training. Astronomical accumulation of genomics, proteomics, and metabolomics data as well as a need for their storage, analysis, annotation, organization, systematization, and integration into biological networks and database systems were the main driving forces for the emergence and development of bioinformatics. Current critical needs for bioinformatics among others highlighted in this chapter, however, are to understand basics and specifics of bioinformatics as well as to prepare new generation scientists and specialists with integrated, interdisciplinary, and multilingual knowledge who can use modern bioinformatics resources powered with sophisticated operating systems, software, and database/networking technologies. In this introductory chapter, I aim to give an overall picture on basics and developments of the bioinformatics field for readers with some future perspectives, highlighting chapters published in this book.

Keywords

bioinformatics
databases
molecular sequence analysis
software and analysis tools
bioinformatics training

1. Introduction

Biological data can be described as molecular sequence information and “wet-bench” experimental content of genome and gene product analyses [1]. As an interdisciplinary branch of the life sciences, bioinformatics aims to develop methodology and analysis tools to explore large volumes of biological data, helping to store, organize, systematize, annotate, visualize, query, mine, understand, and interpret complex data volumes. It uses conventional and modern computer science, cloud computing, statistics, and mathematics, as well as pattern recognition, reconstruction, machine learning, simulation and iterative approaches, and molecular modeling/folding algorithms [1, 2]. The emergence and advances of the bioinformatics field, however, are tightly associated with the computer programming and software developments needed for the handling and structural and functional analysis of large volumes of molecular sequences of DNA, RNA, proteins, and metabolites.

Presently, although still core to the fields of genomics and genetics, bioinformatics has become an umbrella for a wider range of biological studies that analyze various types of biological data and that structure, systematize, annotate, query, mine, and visualize available biological information and a variety of biomedical text records [1–3]. Although drawing a fine line between bioinformatics and some related fields is difficult because of the increased application of computers, statistics, and mathematics to scientific problem solving and experiments in the life sciences, there should be no misperception about the description and objectives of bioinformatics. Bioinformatics should not be confused with, for example, biometry and biostatistics, the development of DNA computers, or the computerized generation and filing of data from imaging.

Bioinformatics should also be differentiated from related scientific fields such as biological computation and computational biology [1, 2]. Biological computation aims to develop biological computers using advances in bioengineering, cybernetics, robotics, and molecular cell biology. In contrast, bioinformatics develops and utilizes computational algorithms to understand and interpret biological processes based on genome-derived molecular sequences and their interactions [2]. In many respects, therefore, the objectives of bioinformatics resemble those of computational biology. Computational biology concentrates on building and/or developing theoretical models for biological analyses [1, 2], whereas bioinformatics focuses on providing practical tools to organize and analyze basic genomic, proteomic and other “omics” data, including sequence analysis and its visualization [1, 2]. Admittedly, computational biology and bioinformatics both use genome data, for example, through multiple sequence alignment and/or genome assembly tools. This makes the boundary between the two fields hard to distinguish if their respective theoretical and practical emphases are forgotten [2]. Thus, as mentioned above, the common core aims of bioinformatics are to handle, analyze, and interpret genome-derived molecular sequence data and its organizational principles across a broad spectrum of comparative, simulative, and evolutionary/phylogenetic perspectives. These tools are applicable to and widely used in studies related to genetics, genomics, biochemistry, physiology, biophysics, all agricultural, medical, and environmental sciences as well as evolution, systems biology, and artificial intelligence [1–10].

For instance, bioinformatics tools such as the comparative analysis of genomic and genetic data and/or signal processing help to interpret and understand molecular and evolutionary processes [9] and interactions from large volumes of raw data in wet-bench experimental molecular biology [1, 2]. In the “omics” fields, bioinformatics helps to sequence and annotate genomes and to identify distinct patterns, mutation profiles, genetic epistasis, gene/protein expression and regulation, and gene ontologies [1, 2, 4, 8–11], and it is instrumental in mining and querying biological data and biomedical literature text [3, 4, 7]. When applied to systems biology [2, 6], bioinformatics is a key instrument for analyzing and cataloguing biochemical/genetic pathways and networks, which helps to integrate pieces of analyzed information to depict and model a full picture of life processes. Application of reconstruction, pattern recognition, folding, simulation, and molecular modeling with bioinformatic tools can identify structural peculiarities and interactions of molecular sequences important for structural biology and medicinal drug design [12, 13]. These large-scale, genome-derived molecular sequence datasets of raw “Big Data” cannot be analyzed manually [1, 2]. This prompted the biological research community to apply interdisciplinary methods and tools for “Big Data” analysis in combination with modern computing knowledge, which resulted in the emergence of the novel interdisciplinary science of bioinformatics. Let us first take a look at the historic developments in the bioinformatics field.

1.1. History of emergence and development

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1970 [2, 14]. Its meaning was very different from the current description and referred to the study of information processes in biotic systems, like biochemistry and biophysics [14–16]. However, the emergence of bioinformatics traces back to the 1960s. It appeared in concordance with the development of methods for sequencing proteins from a variety of organisms and with the availability of protein sequences after Frederick Sanger determined the sequence of insulin in the early 1950s [17, 18]. New computer methods to analyze and compare large numbers of protein sequences from different organisms were needed because handling many amino acid sequences manually was impractical. This led to the compilation of the first “Protein Information Resource” (PIR) [1, 19, 20] by Margaret Oakley Dayhoff and her collaborators at the National Biomedical Research Foundation [1]. Dayhoff's team successfully organized the protein sequences into distinct groups and sub-groups based on sequence similarity and percent accepted mutation (PAM) matrices [1]. This was published as a protein sequence atlas [21, 22] that has been widely used in performing protein sequence alignments and database similarity searches [1, 2, 23]. This work pioneered methods of protein sequence alignment and molecular evolution [22]. In the 1970s, Elvin A. Kabat further contributed to bioinformatics development through his extended analysis of comprehensive volumes of antibody sequences, released in collaboration with Tai Te Wu between 1980 and 1991 [2, 24].

With the objective of providing the theoretical background to immunology experiments in 1974, George Bell and colleagues initiated the collection of DNA sequences into GenBank [1]. During 1982–1992, the first version of GenBank was prepared by Walter Goad's group [1], and these efforts resulted in the development of the presently known and widely used DNA sequence databases GenBank [25], the European Molecular Biology Laboratory (EMBL) [26], and the DNA DataBank of Japan (DDBJ) [27] in 1979, 1980, and 1984, respectively [1]. The most important development in DNA sequence databases, however, was the incorporation of web-based search algorithms allowing researchers to find and compare target DNA sequences. The first such software, called “GENEINFO”, and its derivative “Entrez” were developed by David Benson, David Lipman, and colleagues [1]. This software allowed researchers to rapidly search database-indexed sequences and match them with a query sequence. The software became readily available through the web-based interface of the National Center for Biotechnology Information (NCBI) database [28]. Molecular sequence analysis, comparison, and visualization methods have since been improved, and many different methodologies have contributed to bioinformatics advances in this direction. Such advances are exemplified by the development of dot matrix and diagram methods [29], alignment of sequences by dynamic programming [30], finding of local alignments between sequences [31], multiple sequence alignment tools [32–35], prediction of the secondary structures of RNAs [36, 37], determination of evolutionary relationships of sequences [38, 39], and assignment of gene function based on sequence similarity to sequences of known function [40]. The development of FASTA [41, 42], BLAST [43, 44], and their various modifications [45–47] has further powered the bioinformatics field and greatly improved biological data analysis.
Development of tools for predicting the putative protein sequences, structure, and function of proteins/genes based on DNA sequences [48–58], completing full genome sequences, and building web-based genome databases for many prokaryotic and eukaryotic organisms [58] has provided a great advance in the bioinformatics field. In addition, rapid genome-wide gene expression profiling and analysis opportunities [59–62], biological pathway assignment and identification, data storing, and mining and querying for large volume of biological datasets [63–73] have further provided unprecedented popularity of bioinformatics in the scene of world science, which has been briefly reviewed below.

Figure 1.

Dynamics of bioinformatics-related publications over the past four decades. (A) Unquoted and (B) quoted keyword retrieved scientific publications from PubMed [74].

Since its emergence as an interdisciplinary scientific field in 1970, bioinformatics research has continuously increased over the past four decades. An unquoted search for the keyword bioinformatics in the PubMed database [74] found nearly 181,000 scientific publications covering the period from 1958 to March 2016. Repeating the search with the quoted keyword found 62,402 scientific publications over the four-decade period, showing that publication efforts began to increase at the end of the 1990s, with a first rise in 2000/2001 followed by significant peaks in 2003/2004 and after 2013 (Figure 1). In this introductory chapter, I aim to briefly highlight these four decades of development and to introduce the chapters presented in this book.

2. Bioinformatics help in handling and analysis of the genomics data, genome annotation, and expression profiling

Rapid and reliable determination of DNA sequences, made possible by the introduction of the sequencing techniques of Sanger and Coulson [75] and Maxam and Gilbert [76], provided large-scale DNA sequence data that needed to be analyzed by computer programs. This prompted the development of efficient bioinformatics methodologies. For example, the seminal efforts to sequence the genomes of phage Φ-X174 [2, 77] and Haemophilus influenzae [2, 78] using shotgun sequencing techniques generated the sequences of many thousands of small DNA fragments, ranging from 35 to 900 nucleotides [2], which then had to be assembled into a complete genome. The ends of sequenced shotgun clones overlap and can be assembled into the complete genome using computerized similarity search algorithms, although the assembly task is challenging because it requires powerful computers with sufficient memory and can leave multiple gaps in the assembled genome. Genome assembly algorithms are a critical area of bioinformatics research, as fragment-based genome sequencing methods have been the core approach for virtually all genomes sequenced to date [1, 2].
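The overlap-based assembly described above can be illustrated with a toy greedy merger. This is a minimal sketch assuming error-free reads from a single strand and a minimum overlap of three bases; real assemblers must also handle sequencing errors, repeats, and reverse complements, and generally use more sophisticated graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left: remaining reads stay as separate contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Three hypothetical short reads whose ends overlap by four bases.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
```

Here the three reads collapse into the single contig `ATGGCGTACCTGA`. When no overlap of sufficient length exists, the loop stops and leaves several contigs, which corresponds to the gaps mentioned in the text.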

Therefore, it is not possible to think about genome sequencing without bioinformatics tools, as present bioinformatics programs such as BLAST and sequence alignment tools not only provide rapid practical means to handle, analyze, compare, relate, and visualize DNA sequences but also help with the sequencing process itself. The development of cost-effective next-generation sequencing (NGS) platforms [79, 80] has helped to decode, within a short period, nearly the entire genome of many different organisms, including human and many other model and specialty organisms, as well as crop genomes with complex polyploidy levels. For example, according to the listings in the Genomes OnLine Database (GOLD) as of March 8, 2016, there were 79,650 genome sequencing projects, of which 8018 were completed projects, 33,489 were permanent drafts, 35,609 were incomplete projects, and 1553 were targeted projects [81]. There are 73,000 organisms, including archaea (1201), bacteria (55,303), eukaryotes (11,990), and viruses (4473), listed for sequencing. These numbers will increase further if the sequencing of the 100,000 whole human genomes [82] is added.

Bioinformatics tools are needed for the annotation and prediction of genes in sequenced genomes, which requires computerized approaches because, as mentioned above, genomes are too large to be annotated manually. Bioinformatics-based gene finding and annotation, including the search for protein-coding genes, RNA transcripts, and other functional sequences within a genome, is possible because there are recognizable patterns marking start and stop regions, introns, exons, motifs, repeats, and other regulatory, sensory, and signaling regions, with some variation between genes and among organisms. With the availability of the H. influenzae genome and the need to analyze it, the first genome annotation software system was designed in 1995 by Owen White [2, 78]; it provided tools to find the genes and identify putative functions of annotated sequences. White’s effort laid the foundation for all currently available gene annotation and prediction software, which continues to improve [2].
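The simplest pattern of the kind described above is an open reading frame (ORF): a start codon followed, in the same reading frame, by a stop codon. The sketch below scans the three forward frames of a DNA string for such stretches. It is a deliberately naive illustration, not the method of any particular annotation program: it ignores the reverse strand, introns, and all statistical signals, and the minimum length is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon;
    min_codons counts both the start and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None
    return sorted(orfs)

# Hypothetical 16-base sequence containing one short ORF in the third frame.
orfs = find_orfs("CCATGAAATTTTAGGG")
```

For this input the function reports a single ORF, `ATGAAATTTTAG`, spanning positions 2 to 14 (end exclusive).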

Bioinformatics tools are very important for analyzing gene and protein expression profiles. Large-scale sequencing of cDNA libraries has generated large volumes of data from serial analysis of gene expression (SAGE), expressed sequence tags (ESTs), massively parallel signature sequencing (MPSS), transcriptome profiling, or RNA-Seq, and various applications of multiplexed in-situ hybridization (microarray) profiling [83–95]. All of these gene expression techniques are extremely noise-prone and/or subject to bias in the biological measurement, which requires the application of statistical tools to separate signal from noise in high-throughput gene expression studies. In this context, the chapter by Zhao et al. in this book reviews and discusses the main tools and algorithms currently available for RNA-Seq data analysis, covering rapidly evolving RNA-Seq technologies such as stranded RNA-Seq, targeted RNA-Seq, and single-cell RNA-Seq. Moreover, Sripathy et al. comprehensively discuss transcriptome profiling, RNA-Seq, and micro-RNA expression studies in cotton (Gossypium species), whereas Younis et al. present a chapter on skin microbiome, transcriptome, and microarray data analyses. Readers can also find in this book an interesting chapter on bioinformatics challenges and tools for Hepatitis B genome analysis written by Bell and Kramvis, which highlights the features of this small-genome virus relevant to bioinformatics analysis.

Similarly, protein microarrays and high-throughput mass spectrometry require bioinformatics analysis to identify proteins through complex sequence similarity searches against protein sequence databases [96–103]. Bioinformatics is also a great help in the analysis of gene regulation, through searching for and comparing sequence motifs related to promoters and other regulatory elements. Using bioinformatics tools and sequence motifs/regulatory elements, genes can be clustered by function, and their co-expression characteristics can be determined. Examples of such bioinformatics tools include k-means clustering, hierarchical clustering, consensus clustering methods such as Bi-CoPaM, and self-organizing maps (SOMs), which can identify functionally active sequences from very complex microarray datasets [104–107].
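Of the clustering methods named above, k-means is the simplest to sketch. The version below groups expression profiles (one vector of measurements per gene) by squared Euclidean distance. The data are invented for illustration, and starting from the first k profiles is a simplifying assumption made to keep the example deterministic; practical analyses usually use random or k-means++ initialisation and a library implementation.

```python
def kmeans(profiles, k, iterations=20):
    """Cluster equal-length expression vectors; return one cluster label per profile."""
    centroids = [list(p) for p in profiles[:k]]  # naive deterministic start
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in profiles
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical expression levels of four genes across three conditions.
profiles = [
    [1.0, 1.1, 0.9],   # low expression
    [5.0, 5.2, 4.9],   # high expression
    [0.9, 1.0, 1.2],   # low expression
    [5.1, 4.8, 5.0],   # high expression
]
labels = kmeans(profiles, k=2)
```

Genes with the same label have similar expression profiles across conditions, which is the co-expression grouping referred to in the text.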

Beyond these, bioinformatics plays a major role in collecting data on the functional elements of sequenced genomes obtained with next-generation DNA-sequencing technologies and genomic tiling arrays. This is best exemplified by the “Encyclopedia of DNA Elements” (ENCODE) project [108], developed by the National Human Genome Research Institute, which describes the functional elements of the human genome. Thanks to bioinformatics and its tools, the genomes, genes, and protein sequences of different organisms can be rapidly compared, searched, and interpreted. In addition, mutations can be identified that help to assess and diagnose many complex human and plant diseases, to characterize crop traits, and to interpret complex evolutionary processes such as genome duplication, polyploidization, adaptation, and speciation.

3. Structural bioinformatics: molecular folding, modeling, and design

One widely used application of bioinformatics is the identification of three-dimensional protein structures through molecular modeling and folding. These approaches are used to predict the possible function of proteins or other molecular structures, model the behavior of molecules, fold a molecule into its native, biologically functional three-dimensional structure, and design drugs for many complex human diseases. They support de novo protein design, enzyme design, protein-ligand/drug docking, protein-peptide interaction studies, and structure prediction of biological macromolecules and macromolecular complexes [1, 2, 109].

The primary structure of a protein, which is vital for understanding its function, can be readily determined from the coding DNA sequence. Further, based on homology patterns in the primary structure and using homology modeling, important structural formations and interaction sites with other proteins can be determined. This helps to predict the structure of a protein reliably from the known structure of one or more homologous proteins. Moreover, identifying the secondary, tertiary, and quaternary structures of proteins is very important for understanding their function. The exact three-dimensional structure is essential for correct function, and failure to fold into the native structure generally produces inactive or misfolded proteins that can be toxic [108]. The bioinformatics of protein folding includes (1) the energy landscape of protein folding and (2) approaches for modeling protein folding [12, 13, 109].
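To make the first step concrete, the sketch below derives a primary structure from a coding DNA sequence using the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions; the sketch assumes the sequence is already in frame and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate an in-frame coding DNA sequence up to the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

# Hypothetical coding sequence used only for illustration.
peptide = translate("ATGGCCATTGTAATGGGCCGCTGA")
```

The resulting one-letter amino acid string is the kind of input that homology searches and structure prediction tools start from.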

A leading, freely available web server and stand-alone software tool for automated protein structure prediction and structure-based functional annotation is the “Iterative Threading ASSEmbly Refinement” (I-TASSER), which “first generates full-length atomic structural models from multiple threading alignments and iterative structural assembly simulations followed by atomic-level structure refinement” [110]. Using I-TASSER, all of the above-mentioned functional and structural characteristics of proteins, including ligand-binding sites, enzyme commission numbers, and gene ontology terms, can be explored on a comparative scale [110, 111].

Molecular modeling through molecular mechanics and/or quantum chemistry is a key bioinformatics approach for studying the behavior of molecules. These methods are routinely used to investigate the structure, dynamics, surface properties, and thermodynamics of inorganic, biological, and polymeric systems. They help to explore conformational changes associated with biomolecular function and the molecular recognition of proteins and membrane complexes. Protein folding, the catalytic sites of enzymes, and protein stability can all be studied using molecular modeling. A wide variety of bioinformatics tools for modeling and designing biomolecules is available [110–112]. In this book, the chapter by Leong et al. presents bioinformatics modeling and tools for biological membranes using molecular dynamics simulations with all-atom, united-atom, and coarse-grained membrane models of lipids and proteins. In addition, Filntisi et al. describe a computational method for generating antibody-drug conjugates through site-specific cysteine conjugation, using structural prediction methods based on PDB files of a drug, linker, and antibody. Moreover, Bórquez and González-Billault present a chapter on computational algorithms for predicting kinase-substrate relationships in protein kinases; this chapter compares prediction tools and methods and discusses how contextual information can improve substrate prediction.

4. Biological networks and systems biology

Watts and Strogatz in 1998 [113, 114] and Barabási and Albert in 1999 [115–117] promoted the view that complex systems can be treated as networks in which components are represented as nodes linked through their interactions (i.e., edges). The properties of the nodes and edges form the network topology. This approach has been widely applied in many scientific fields, including bioinformatics, where it has resulted in the construction of large-scale biological networks denoted as “omes,” such as the biome, interactome, and microbiome [2, 6].
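The node-and-edge view can be sketched with a small adjacency structure. The protein identifiers and interactions below are hypothetical, and node degree is used only as one simple example of a topological property.

```python
from collections import defaultdict

# Hypothetical protein-protein interactions: each pair is an undirected edge.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]

def build_network(edge_list):
    """Return an adjacency map: node -> set of neighbouring nodes."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

network = build_network(edges)

# Degree (number of interaction partners) is a basic topological property.
degree = {node: len(neighbours) for node, neighbours in network.items()}

# The most connected node is a candidate "hub" of the network.
hub = max(degree, key=degree.get)
```

In this toy network `P1` has the most partners. Dedicated graph libraries provide richer measures such as clustering coefficients and path lengths.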

The molecular sequence analysis, prediction and annotation, and molecular modeling approaches highlighted above are also the core for building, organizing, and systematizing biological networks of molecules (e.g., metabolic and protein-protein interaction networks) and the genetic and biochemical pathways of complex cellular processes. These processes include reception, signal transduction, gene regulation, and gene co-expression. Such molecular networks integrate many different data types, including DNA sequences, regulatory RNAs, proteins, secondary metabolites, gene expression data, and other small molecules, all of which may be connected physically and functionally. The construction and organization of such networks can be achieved only by combining simulative, iterative, and model-oriented bioinformatics approaches. These biological networks are useful for analyzing and visualizing the complex connections of cellular processes, and they help in understanding other biological networks such as neuronal networks, food webs, and between- and within-species interaction networks, which are central components of modern systems biology [2, 6]. Examples of “omes”-related network resources are the Kyoto Encyclopedia of Genes and Genomes (KEGG), the BioCyc database collection, the BRaunschweig ENzyme DAtabase (BRENDA), Reactome, the Comparative Toxicogenomics Database, and many others [118]. Some biological network databases and their use in plant genomics/epigenomics are discussed in the chapters by Sripathi et al. and Rahman et al. in this book.

5. Databases

A database is an organized collection of data, comprising schemas, tables, queries, reports, images, and other objects. Access to the information in a database is provided by an integrated set of computer software referred to as a “database management system” (DBMS) [119]. The DBMS allows users to access all of the data contained in the database. It provides general functions for the definition, entry, storage, update, administration, and retrieval of large quantities of information in an organized way, which requires data modeling (hierarchical and network models), clustering, query languages and query optimization, and visualization algorithms [1, 2, 119].
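The basic DBMS functions named above (definition, entry, update, and retrieval) can be sketched with SQLite, which ships with Python. The table layout, accession identifiers, and lengths are hypothetical and do not reflect the schema of any database cited in this chapter.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# Data definition: declare the structure of the stored records.
conn.execute(
    "CREATE TABLE sequences ("
    "accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)

# Data entry: insert hypothetical sequence records.
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("SEQ001", "Gossypium hirsutum", 1200),
        ("SEQ002", "Homo sapiens", 850),
        ("SEQ003", "Gossypium hirsutum", 430),
    ],
)

# Update: correct one record.
conn.execute("UPDATE sequences SET length = ? WHERE accession = ?", (900, "SEQ002"))

# Retrieval: query by organism through the query language (SQL).
rows = conn.execute(
    "SELECT accession, length FROM sequences "
    "WHERE organism = ? ORDER BY length DESC",
    ("Gossypium hirsutum",),
).fetchall()
```

The query returns the two cotton records, longest first. Public biological databases expose comparable operations through their own interfaces.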

The development of databases therefore depends significantly on bioinformatics tools, advances, research, and applications. A large number of different types of databases are available, covering all aspects of biological data storage and organization. Some of the aforementioned databases, such as GenBank, EMBL, and DDBJ, are primary nucleotide sequence databases. Meta-databases, such as Entrez, mGen, and Metascape, incorporate data compiled from multiple other databases. Others are specialized databases, for example those specific to an organism or topic, such as TAIR, the p53 Knowledgebase (p53), the plant alternative splicing database (PASD), and the plant secretome and subcellular proteome knowledgebase (PlantSecKB) [119]. Databases vary in their data definition, usage, format, and access types. In this book, the chapter by Kadam et al. describes databases and bioinformatics algorithms related to allergen informatics, discussing the concepts of allergen bioinformatics and the key areas for potential development in allergology, whereas Bell and Kramvis highlight public sequence databases for the Hepatitis B virus. Readers can also find a comprehensive discussion of bioinformatics resources, including databases for plant “omics,” written by Rahman et al.

6. Software, analysis tools, services, and workflow

As mentioned above, the astronomical accumulation of genomic, proteomic, and metabolomic data and expression profiles, together with their annotation, storage, organization, systematization, and integration into biological networks and database systems, and their wide use by the research community, required computer programming algorithms, analysis tools, services, and workflow systems from the outset. Therefore, software and analysis tools, and bioinformatics services and workflows, have been the main fields and core targets of bioinformatics since its emergence. Through the contributions of various bioinformatics companies and public institutions, bioinformatics software started out as simple command-line tools and later developed into more complex graphical programs, standalone packages, and web services. Since the development of the first bioinformatics software and analysis tools for molecular sequence evaluation in the early 1980s, many free and open-source software tools have been developed, and they continue to grow and improve with the advances made in the genomic sciences [2, 120].

The main driving forces for the current and future development of bioinformatics software and tools are the past decade's advances in genome decoding technologies, the accumulation of large volumes of biological data and the consequent need for their analysis, and advances in computer technologies, graphics, visualization, and molecular modeling and networking techniques. Moreover, the availability of open-source code, shared object models, and community-supported plug-ins has made it easier to gather innovative ideas from the community and to perform innovative in silico experiments on existing “Big Data.” All of these have created golden opportunities for research groups and bioinformatics companies to experiment and to develop a new generation of bioinformatics software and tools that are user friendly and capable of performing extended and integrated analyses with better visualization and graphical outputs. The range of open-source software packages includes UGENE, EMBOSS, GenGIS, GENtle, MOTHUR, BioPerl, PathVisio, BioJava, GenoCAD, Biopython, GeWorkbench, GenomeSpace, Bioclipse, .NET Bio, Apache Taverna, BioJS, Bioconductor, and BioRuby [121, 122].

The development of sharing models and web access tools is also an important bioinformatics objective: it allows users to access bioinformatics tools over the internet, from their own computer systems, on computing resources hosted on servers in other parts of the world. Simple Object Access Protocol (SOAP) [123] and Representational State Transfer (REST) [124–126] are two approaches used to provide such web services. SOAP is a standards-based web service access protocol, originally developed by Microsoft. REST, which provides very simple web service access, was developed to address the problems with SOAP [127]. Both are typically used over HTTP, and each has its own issues and challenges; they differ in messaging patterns, rules, architectural style, and flexibility. The main advantage of web services is that end users do not have to deal with software and database maintenance overheads [127].

There are several basic bioinformatics services, for example, “Sequence Search Services” (SSS), “Multiple Sequence Alignment” (MSA), and “Biological Sequence Analysis” (BSA) [2, 128]. These web service-based analysis resources comprise standalone or web-based data analysis tools as well as integrative, distributed, and extensible bioinformatics workflow management systems (BWMSs). BWMSs are designed specifically to compose and execute a series of interactive computational or data manipulation steps (i.e., a workflow) in a bioinformatics analysis. Such systems provide interactive analysis of biological data, build specific workflows for the analysis, enable visualization of the outputs in real time, and simplify the sharing and reuse of workflows between scientists. Platforms providing this service include Galaxy, UGENE, and Taverna [2, 121]. Several chapters of this book cover bioinformatics software, web-based analysis tools, and bioinformatics services for membrane analysis (see Leong et al.), plant science and crop genomics (see the chapters by Rahman et al. and Sripathi et al.), and medicine, viral genome analysis, and drug design (see the chapters by Younis et al., Bell and Kramvis, and Filntisi et al.).
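The idea of a workflow as a composed series of steps can be sketched in a few lines. The step functions and the `run_workflow` helper are hypothetical names chosen for this illustration; they do not represent the interface of Galaxy, UGENE, or Taverna.

```python
def clean(seq):
    """Normalise raw input: strip whitespace and use upper case."""
    return seq.strip().upper()

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def gc_content(seq):
    """Return the fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def run_workflow(data, steps):
    """Execute the steps in order, feeding each output into the next step."""
    for step in steps:
        data = step(data)
    return data

# A workflow is just an ordered, reusable list of steps.
workflow = [clean, reverse_complement, gc_content]
result = run_workflow("  atgcgc\n", workflow)
```

Because the workflow is plain data (a list of functions), it can be stored, shared, and rerun on other inputs, which is the property workflow management systems provide at a much larger scale.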

7. Text mining

One objective of bioinformatics research and application is to use computational algorithms and bioinformatics tools to collect, organize, and structure the growing body of biomedical literature, allowing scientists to query, mine, read, and synthesize the literature and published articles relevant to their research interests [2–4, 7, 129, 130]. Biomedical literature and text mining are therefore very important for scientific development, innovation, and the integration and application of discoveries in society, through information extraction (IE) and assessment of the relationships between publications [3, 4]. Analysis of the world literature shows that more than 80% of text data remains unstructured, which makes it challenging to read every paper and results in disjointed sub-fields of research [3]. Biomedical text mining uses a variety of text mining and data mining tools, applying techniques such as data clustering, visualization and navigation, information retrieval and extraction, and text categorization and summarization [3]. The use of IE and “Natural Language Generation and Understanding” (NLG and NLU), which have tokenizing, morphological or lexical, and syntactic analysis components, helps to build structured text and to extract, collect, and organize structured information [129, 130]. Pattern recognition and matching, such as the recognition of biological abbreviations, terms, and interactions, are important methods in text mining [2–4].
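Two of the components mentioned above, tokenizing and the recognition of abbreviations, can be sketched with regular expressions. The patterns are deliberately simple assumptions for illustration and would miss many real-world abbreviation styles.

```python
import re

text = ("Large-scale sequencing generated expressed sequence tags (ESTs) and "
        "serial analysis of gene expression (SAGE) data.")

# Tokenizing: split into word tokens, keeping hyphenated words together.
tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

# Pattern matching: upper-case abbreviations in parentheses,
# optionally followed by a plural "s".
abbreviations = re.findall(r"\(([A-Z]{2,}s?)\)", text)
```

For the sentence above, `abbreviations` holds `ESTs` and `SAGE`. A fuller pipeline would also link each abbreviation to its preceding long form.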

8. Education

Advances in the life sciences and high-throughput biology, in particular the “omics” disciplines, the scale and complexity of “Big Data,” and the growing demand for specialists with multilingual and cross-field expertise who can understand and solve multidisciplinary scientific problems together create a great need for training and education in bioinformatics. Bioinformatics training and education aim to create, collect, deliver, and share educational and training materials and techniques, as well as to develop university degree-program curricula in bioinformatics. The goal is to prepare scientists and specialists who can use modern bioinformatics tools, with their sophisticated operating systems, software and algorithms, and database/networking technologies, to handle, analyze, interpret, and publish complex high-throughput biological data. This is a major bottleneck and a critical need of the current life sciences and bioinformatics field, especially in developing countries, as analyzed, for example, in recent reports for African [82] and Central American [131] countries.

To address this, the bioinformatics research community has made specific efforts to develop local and global platforms for bioinformatics training and education. Examples include the “Bioinformatics Training Network” (BTN) [132] and “The Global Organization for Bioinformatics Learning, Education, and Training” (GOBLET) [133], which provide community educational and training resources for bioinformatics trainers and trainees. As an outcome of a European 7th Framework grant, BTN aimed to develop and share educational materials, short courses, and training delivery methods, as well as to discuss the challenges, issues, and requirements of bioinformatics training [132]. GOBLET continues similar efforts beyond Europe, aiming to coordinate efforts at the global scale with a concentrated strategy and within the frame of a single, dedicated foundation, although this requires much time, focused strategic effort, and modern innovative approaches [133].

“The Swiss Institute of Bioinformatics” training portal [134] also provides online courses, and software platforms designed to teach bioinformatics concepts and methods include Rosalind [135]. Videos and slides from the “Canadian Bioinformatics Workshops” are available on an open-access website [136]. Similarly, many large bioinformatics conferences and seminars contribute to training and education in bioinformatics, such as Intelligent Systems for Molecular Biology (ISMB), the European Conference on Computational Biology (ECCB), Research in Computational Molecular Biology (RECOMB), and the annual Bioinformatics Open Source Conference (BOSC) of the non-profit Open Bioinformatics Foundation [2, 128]. As with public bioinformatics databases, the MediaWiki engine with the WikiOpener extension, extensively referenced in this chapter, also contributes to training and education in bioinformatics by gathering research materials and descriptions of tools that can be accessed and updated by all experts in the field [128].

With the specific objectives of developing bioinformatics research and applications, integrating them into genomics research, and providing training and education, as well as preparing a well-qualified new generation of life scientists, we established a dedicated organization, the Center of Genomics and Bioinformatics, in the developing country of Uzbekistan [137]. As in other developing countries, there are many challenges and limitations in funding and in access to sophisticated bioinformatics tools and computer operating systems, as well as insufficient experience in carrying out bioinformatics research and resource development. Our first-step goal, however, is to integrate genomics and bioinformatics curricula into the higher education system of Uzbekistan, develop training and educational materials, provide basic training and research practice to university students and biology specialists, and establish international collaborations in this direction. The long-term objective is to apply genomics and bioinformatics approaches efficiently and broadly to all areas of the life sciences at national and regional levels, which would contribute to the development of the biological sciences in Central Asia. Efforts are ongoing to establish international collaborations [138] and to provide training and education at both national and regional levels.

9. Conclusions and future perspectives

Bioinformatics has become an essential interdisciplinary field for the life sciences, supporting the “omics” fields and technologies and mainly handling and analyzing “omes” data. The accumulation of high-throughput biological data resulting from technological advances in the “omics” fields has required and prioritized the use of bioinformatics resources, research, and applications for the analysis of complex and ever-growing “Big Data” volumes, which would be impractical and useless without bioinformatics. Therefore, as highlighted herein, there is a critical need to prepare a well-qualified new generation of scientists with integrated knowledge, multilingual ability, and cross-field experience, who are capable of using sophisticated operating systems, software and algorithms, and database/networking technologies to handle, analyze, and interpret high-throughput and increasingly voluminous complex biological data.

Community resources and a globally coordinated foundation of bioinformatics training and education platforms, as well as research conferences, workshops, short online training, and web-based educational courses and materials, are available to work toward this goal. However, there is an urgent need to develop bioinformatics education and training, in particular in developing countries, which requires innovative platforms, training techniques, better funding, web and network access, and high-performance computing systems.

On the research side, bioinformatics tools need to be improved for the analysis of the growing body of high-throughput pangenomics, metagenomics, proteomics, and metabolomics data. “Effective tools” are needed to perform better genome assembly and annotation with high accuracy; this, however, requires improving the quality of sequenced genomes so that they are free of gaps, and sequencing more genome representatives, sub-genomes, polyploid species, genomes of single cells, and specific tissues, which would generate the information needed to develop, modify, and correct bioinformatics algorithms and programming approaches.

The use of third-generation sequencing approaches and platforms, as well as efforts toward whole-genome sequencing of, for example, 1000 or 100,000 human genome representatives [82] or transcriptome/exon sequencing of 1000 distinct plant species (e.g., 1KP) [139], will ultimately improve and advance bioinformatics analysis tools. These efforts also help to improve orthologous gene identification tools, which currently need attention [120]. There is a great need for sampling and handling diverse strains in pangenomic analysis, for integrating prokaryotic genome-organization frameworks (GOFs), and for integrating non-coding RNAs, pseudogenes, and epigenetic elements into bioinformatics annotation and ontology tools and software [120]. Sequenced genome data also need to be made more functional and integrated through the construction of more organized, user-friendly, cell-wide biological networks and metabolic pathways [140] with better visualization and graphical outputs [120], and through knowledge base (KB) construction [141]. This, however, requires the development of real-time imaging systems and high-throughput phenotyping (“phenomics”) tools that would help to efficiently determine biologically meaningful associations between genomic and phenotypic data, advancing the translational sciences, personal genomics, and personalized medicine [7] and/or agriculture [142].

Abstract

BACKGROUND:

The recent flood of data from genome sequences and functional genomics has given rise to a new field, bioinformatics, which combines elements of biology and computer science.

OBJECTIVES:

Here we propose a definition for this new field and review some of the research that is being pursued, particularly in relation to transcriptional regulatory systems.

METHODS:

Our definition is as follows: Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry) and then applying "informatics" techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large-scale.

RESULTS AND CONCLUSIONS:

Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (e.g. expression data). Additional information includes the text of scientific papers and "relationship data" from metabolic pathways, taxonomy trees, and protein-protein interaction networks. Bioinformatics employs a wide range of computational techniques including sequence and structural alignment, database design and data mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and function, gene finding, and expression data clustering. The emphasis is on approaches integrating a variety of computational methods and heterogeneous data sources. Finally, bioinformatics is a practical discipline. We survey some representative applications, such as finding homologues, designing drugs, and performing large-scale censuses. Additional information pertinent to the review is available over the web at http://bioinfo.mbb.yale.edu/what-is-it.

Abstract

Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data-intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution usually involves the following steps: (1) collect statistics from biological data; (2) build a computational model; (3) solve a computational modeling problem; and (4) test and evaluate a computational algorithm. This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence patterns, and evolutionary analyses. Protein structures are three-dimensional data, and the associated problems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding function, and structural alignment. Gene expression data are usually represented as matrices, and analysis of microarray data mostly involves statistical analysis, classification, and clustering approaches. Biological networks such as gene regulatory networks, metabolic pathways, and protein-protein interaction networks are usually modeled as graphs, and graph-theoretic approaches are used to solve associated problems such as the construction and analysis of large-scale networks.

What is bioinformatics?

Put simply, bioinformatics is the science of storing, retrieving and analysing large amounts of biological information. It is a highly interdisciplinary field involving many different types of specialists, including biologists, molecular life scientists, computer scientists and mathematicians.

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper to describe "the study of informatic processes in biotic systems" and it found early use when the first biological sequence data began to be shared. Whilst the initial analysis methods are still fundamental to many large-scale experiments in the molecular life sciences, nowadays bioinformatics is considered to be a much broader discipline, encompassing modelling and image analysis in addition to the classical methods used for comparison of linear sequences or three-dimensional structures (Figure 1).

Figure 1 A broad overview of the different types of data that fall within the scope of bioinformatics. Traditionally, bioinformatics was used to describe the science of storing and analysing biomolecular sequence data, but the term is now used much more broadly, encompassing computational structural biology, chemical biology and systems biology (both data integration and the modelling of systems).

Distinction from medical informatics

Bioinformatics is distinct from medical informatics – the interdisciplinary study of the design, development, adoption and application of IT-based innovations in healthcare services delivery, management and planning. Somewhere in between the two disciplines lies biomedical informatics – the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific enquiry, problem solving and decision making, motivated by efforts to improve human health.

Recently initiated projects, such as the 100,000 Genomes Project, are bridging the gaps between these disciplines. On the whole, however, bioinformatics deals with research data and uses it for research purposes; medical informatics deals with data from individual patients for the purposes of clinical management (diagnosis, treatment, prevention, and so on); and biomedical informatics attempts to bridge these two extremes.

In this article we will discuss: 1. Meaning of Bioinformatics 2. Branches of Bioinformatics 3. Applications of Bioinformatics in Crop Improvement 4. Advantages 5. Limitations.

Meaning of Bioinformatics:

Bioinformatics is the computer-aided study of biology and genetics. In other words, it refers to the computer-based study of genetics and other biological information. The science of bioinformatics is gaining increasing importance in the life sciences, especially in the fields of molecular biology and plant genetic resources.

Main points related to bioinformatics are given below:

(i) It is the interface between computer and biology. In other words, it is the application of information technology in the study of biology.

(ii) It utilizes information science for the study of biology.

(iii) It is used for computer-based analysis of bio-molecular data, especially large-scale data sets derived from genome sequencing.

(iv) It is used for analysis of data related to genomics, proteomics, metabolomics and other biological aspects.

(v) It has wide applications in handling data related to plant genetic resources.

Branches of Bioinformatics:

The science of bioinformatics can be divided into several branches based on the experimental material used for the study. Bioinformatics is broadly divided into two groups, viz., animal bioinformatics and plant bioinformatics.

Various branches of bioinformatics are defined below:

1. Animal Bioinformatics:

It deals with the computer-aided study of genomics, proteomics and metabolomics in various animal species. It includes the study of gene mapping, gene sequencing, animal breeds, animal genetic resources, etc. It can be further divided into the bioinformatics of mammals, reptiles, insects, birds, fishes, etc.

2. Plant Bioinformatics:

It deals with the computer-aided study of plant species. It includes gene mapping, gene sequencing, plant genetic resources, databases, etc.

It can be further divided into following branches:

(i) Agricultural Bioinformatics:

It deals with computer based study of various agricultural crop species. It is also referred to as crop bioinformatics.

(ii) Horticultural Bioinformatics:

It refers to computer aided study of horticultural crops, viz. fruit crops, vegetable crops and flower crops.

(iii) Medicinal Plants Bioinformatics:

It deals with computer based study of various medicinal plant species.

(iv) Forest Plant Bioinformatics:

It deals with computer based study of forest plant species.

Computer Programmes used in Biology:

Computers are electronic devices which can input, store and manipulate data and output information in a desired form. Various types of computers, such as microcomputers, minicomputers, mainframe computers, supercomputers, laptop computers and palmtop computers, are now available and can be used for multiple purposes.

Various computer programmes are used for the study of biological problems. Such programmes include Microsoft Word (MS Word), Microsoft Excel (MS Excel) and Microsoft PowerPoint (MS PowerPoint).

A brief description of these programmes is presented below:

(i) MS Word:

It is a very useful programme for preparation of project reports, annual reports, writing research papers, varietal information system, plant genetic resources data base, etc.

(ii) MS Excel:

It is a useful computer programme for various types of statistical and biometrical analyses. It can also be used for graphical and diagrammatic display of experimental results.

(iii) MS Power Point:

It is widely used for preparation of slides and presentation of results in various scientific meetings.

Applications of Bioinformatics in Crop Improvement:

Bioinformatics has wide practical applications in genetics and plant breeding.

Some important applications of bioinformatics in plant breeding and genetics are listed below:

1. Varietal Information system

2. Plant Genetic Resources Data Base

3. Studies on Genomics

4. Studies on Proteomics

5. Studies on Metabolomics

6. Studies on Plant Modelling

7. Pedigree Analysis

8. Biometrical Analysis

9. Forecasting Models

1. Varietal Information System:

Bioinformatics has useful applications in developing a varietal information system. A variety is a genotype which has been released for commercial cultivation by the State Variety Release Committee or the Central Variety Release Committee and notified by the Government of India. Various types of varieties are used in plant breeding.

All such terms are defined below:

Detailed information about the various types of varieties can be compiled using highly heritable characters.

Such information can be used in various ways as given below:

(i) In DUS testing for varietal identification

(ii) In grouping of varieties on the basis of various highly heritable characters.

(iii) In sorting out of cultivars for use in Pre-breeding and traditional breeding.

The information can be stored in the computer memory and retrieved as and when required.

2. Plant Genetic Resources (PGR) Data Base:

Genetic material of plants which is of value as a resource for present and future generations of people is referred to as plant genetic resources. It is also known as gene pool, genetic stock and germplasm.

The germplasm is evaluated for several characters such as highly heritable morphological and other characters as given below:

(i) Highly heritable morphological traits

(ii) Yield contributing traits

(iii) Quality characters

(iv) Resistance to biotic and abiotic stresses

(v) Characters of agronomic value.

International Plant Genetic Resources Institute (IPGRI), Rome, Italy has developed descriptors and descriptor states for various crop plants. Such descriptors help in uniform recording of observations on germplasm of crop plants throughout the world. Thus huge data is collected on crop germplasm for several years. Bioinformatics plays an important role in systematic management of this huge data.

Bioinformatics is useful in handling of such data in several ways as follows:

(i) It maintains the data of several locations and several years in a systematic way.

(ii) It permits addition, deletion and updating of information.

(iii) It helps in storage and retrieval of data.

(iv) It also helps in classification of PGR based on various criteria.

(v) It helps in retrieval of data belonging to a specific group such as early maturity, late maturity, dwarf types, tall types, resistant to biotic stresses, resistant to abiotic stresses, genotypes with superior quality, genotypes with marker genes, etc.

All such data can be easily managed by computer aided programmes and can be manipulated to get meaningful results.
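
The storage, updating and retrieval operations listed above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table name, column names, accession identifiers and the 100-day cutoff for "early maturity" are hypothetical and are not taken from any IPGRI descriptor list.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory germplasm table holding a few made-up accessions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE germplasm ("
        "accession TEXT PRIMARY KEY, "
        "days_to_maturity INTEGER, "
        "plant_height_cm REAL, "
        "disease_resistant INTEGER)"
    )
    rows = [
        ("ACC-001", 95, 80.0, 1),
        ("ACC-002", 140, 150.0, 0),
        ("ACC-003", 98, 60.0, 0),
    ]
    conn.executemany("INSERT INTO germplasm VALUES (?, ?, ?, ?)", rows)
    return conn


def early_maturing(conn, max_days=100):
    """Retrieve accessions in the (hypothetical) 'early maturity' group."""
    cur = conn.execute(
        "SELECT accession FROM germplasm "
        "WHERE days_to_maturity <= ? ORDER BY accession",
        (max_days,),
    )
    return [row[0] for row in cur.fetchall()]


conn = build_demo_db()
# Updating and deleting records, as in points (ii) and (iii) above.
conn.execute(
    "UPDATE germplasm SET disease_resistant = 1 WHERE accession = ?",
    ("ACC-003",),
)
conn.execute("DELETE FROM germplasm WHERE accession = ?", ("ACC-002",))
print(early_maturing(conn))
```

The same query pattern could be reused for other groups (dwarf types, resistant types and so on) by changing the column and the condition.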

3. Studies on Genome:

Genome refers to the basic set of chromosomes. In a genome, each type of chromosome is represented only once. The study of the structure and function of the entire genome of an organism is referred to as genomics. It is being developed as a sub-discipline of genetics devoted to the mapping, sequencing and functional analysis of genomes. The word genomics was coined by Thomas Roderick in 1986.

The discipline of genomics consists of two groups, viz:

(i) Structural genomics and

(ii) Functional genomics.

These are defined below:

(i) Structural Genomics:

It deals with the study of the structure of the entire genome of an organism. In other words, it deals with the study of the genetic structure of each chromosome of the basic set of chromosomes, i.e. the genome.

(ii) Functional Genomics:

It deals with the study of genome function, that is, with the transcriptome and the proteome. The transcriptome is the complete set of RNAs transcribed from a genome, and the proteome is the complete set of proteins encoded by a genome.

There are three methods of gene mapping, viz:

(i) Recombination mapping,

(ii) Deletion mapping and

(iii) Molecular mapping.

The last method is now the most widely used for gene mapping. It is a computer-aided method which is useful in genome mapping. It has been used for genome mapping in various plants such as Arabidopsis, rice and maize.

It is a rapid and accurate method of gene mapping. Now computer aided genomic mapping, sequencing and functional analysis studies are being carried out with almost all important field crops. Computer aided programmes have made such studies very simple.

4. Studies on Proteomics:

Proteomics refers to the study of structures and functions of all proteins in an individual. In other words, it deals with the study of entire protein expression in an organism.

Proteomics is of two types, viz:

(i) Structural proteomics and

(ii) Functional proteomics.

These are defined below:

(i) Structural Proteomics:

It refers to the study of the structures of all proteins found in a living organism.

(ii) Functional Proteomics:

It deals with functions of all proteins found in a living organism. In fact, proteomics is a new sub-discipline of functional genomics. It is the study of proteomes which refer to complete set of proteins encoded by a genome. A variety of techniques are used for the study of proteomics. Now computer aided programmes are available for the study of proteomics.

5. Studies on Metabolomics:

Metabolomics refers to the study of all metabolic pathways in a living organism. In other words, it is the computer aided information of all metabolic pathways of a living organism.

Main points related to metabolomics are listed below:

(i) It deals with the study of all metabolic pathways in a living organism.

(ii) It is computer based information about metabolic pathways in a living organism.

(iii) It helps in identification and correction of metabolic disorders in an organism.

(iv) It helps in selection of individuals with normal metabolic pathways.

(v) It helps early detection of genetic disorders associated with metabolic pathways.

6. Modelling of Plants:

Bioinformatics plays an important role in modelling of crop plants. Such computer aided studies have already been made in field pea and several other field crops. First the plant model is conceptualized using various plant traits and then efforts are made to develop such model by using appropriate breeding procedures.

For example, in cotton following characters can be used for developing conceptual plant model:

(i) Maturity duration 160 days

(ii) Plant height 150 cm

(iii) Number of monopodia 2

(iv) Length of sympodia 50 cm

(v) Number of sympodia 20

(vi) Boll weight 4g

(vii) Ginning per cent 38

(viii) Fibre length 28 mm

(ix) Leaf : small and thick

(x) Plant surface—hairy

First donor sources for these traits are identified from the available germplasm. Then efforts are made to combine these traits in one genotype particularly in a popular variety. Such computer based studies help in developing plant ideotype suitable for machine picking and used in multiple cropping system.

7. Pedigree Analysis:

Computer aided studies are useful in pedigree analysis of various cultivars and hybrids. Information about the parentage of cultivars and hybrids is entered into the computer memory which can be retrieved any time. The list of parents that are common in the pedigree of various cultivars and hybrids can be sorted out easily.

It helps in the pedigree analysis which in turn can be used in planning plant breeding programmes especially in the selection of parents for use in hybridization programmes. Through study of protein structures, it helps in pedigree analysis.
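
Sorting out the parents that recur across pedigrees can be sketched in a few lines of Python. The cultivar and parent names below are hypothetical placeholders, and the functions are an illustration rather than any established pedigree package.

```python
from collections import Counter

# Hypothetical cultivars and hybrids with their recorded parents.
PEDIGREES = {
    "Cultivar A": {"P1", "P2"},
    "Cultivar B": {"P2", "P3"},
    "Hybrid C": {"P2", "P4"},
    "Hybrid D": {"P3", "P4"},
}


def shared_parents(pedigrees, min_count=2):
    """Return parents that occur in at least min_count pedigrees, with counts."""
    counts = Counter(
        parent for parents in pedigrees.values() for parent in parents
    )
    return {parent: n for parent, n in counts.items() if n >= min_count}


def common_to(pedigrees, first, second):
    """Return the parents that two entries have in common."""
    return pedigrees[first] & pedigrees[second]


print(shared_parents(PEDIGREES))
print(common_to(PEDIGREES, "Cultivar A", "Cultivar B"))
```

A breeder could use such a count to avoid crossing lines that already share most of their ancestry.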

8. Biometrical Analysis:

In plant breeding and genetics, various types of biometrical analyses such as correlation, path coefficient, discriminant function, diallel, partial diallel, triallel, quadriallel, generation means, line x tester, triple test cross, stability parameters, D² statistics, metroglyph analysis, etc. are carried out.

Computer aided programmes are very much useful in carrying out such biometrical analyses. The information obtained from such biometrical analysis is used in better planning of plant breeding programmes for achieving specific goal.
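
As an illustration of the simplest analysis in the list above, the following sketch computes a Pearson correlation coefficient between two traits using only the Python standard library. The function name is chosen here for illustration; the trait values in the usage line are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of products and squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Made-up example: plant height (cm) against yield per plant (g).
print(pearson_r([100, 120, 140, 160], [30, 38, 41, 52]))
```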

9. Forecasting Models:

Computer aided programmes have wide applications in developing various types of forecasting models especially useful for predicting crop production and productivity and in forecasting incidence of insects and diseases in crop plants. Weather parameters are used in making such predictions. Computer aided remote sensing techniques are used for such predictions.

10. Other Applications:

Besides agricultural applications, bioinformatics has several other useful applications.

Such applications include use of bioinformatics in:

(i) Medical science,

(ii) Forensic science,

(iii) Pharmaceutical and biotech industry.

In medical science computer aided studies are useful in detection of genetic diseases at an early stage of life. It can help in cure of genetic diseases in some cases. The pedigree analysis helps in advising future parents to prevent certain genetic diseases.

In forensic science, bioinformatics is useful in settling cases of disputed parentage and in criminal investigations. In the pharmaceutical industry, computer-aided programmes help in identifying the metabolic pathways involved in the production of a medicine. Thus they can help in mass production of such chemicals.

Advantages of Bioinformatics:

Bioinformatics has several practical applications in genetics and plant breeding as discussed above.

Its main advantages in crop improvement are given below:

1. It provides systematic information about genomics, proteomics and metabolomics of living organisms. This information is useful in planning various breeding and genetical programmes.

2. It helps in finding the evolutionary relationship between two species. Studies of nucleotide and protein sequences are used for this purpose. Closely related organisms have similar sequences and distantly related organisms have dissimilar sequences.

The time of divergence between two species can also be estimated from such studies. Thus bioinformatics helps in the study of evolutionary biology. It helps in drawing phylogenetic trees (trees of relatedness).

3. Rapid Method. It is a rapid method of gene mapping and sequencing. Earlier methods of gene mapping were time-consuming and painstaking. Bioinformatics has made this task much simpler. Now gene hunting has become faster, cheaper and more systematic.

4. Identification of similar genes. Computer aided studies help in identification of similar genes in two species. For example, genes similar for biotic and abiotic stresses in two species can be easily detected.

5. High Accuracy. The computer based information has very high level of accuracy and is highly reliable.

6. Bioinformatics has led to advances in understanding basic biological processes which in turn have helped in diagnosis, treatment and prevention of many genetic diseases.

7. It has become possible to reconstruct genes from Expressed Sequence Tags (ESTs). ESTs are short sequenced pieces of genes that are expressed.

8. Computer aided programmes have made it possible to group proteins into families based on their relatedness.

9. Computer aided programmes are useful in designing primers for PCR. Such primers can be designed with little effort. Such primers are used to sequence unknown genes or genes of interest.

10. In life science, computer aided programmes are useful in storing, organizing and indexing huge databases.
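
The primer design mentioned in point 9 of the list above usually starts from a few simple sequence checks. The sketch below computes GC content, a rough melting temperature using the Wallace rule (2 °C per A or T, 4 °C per G or C, a rule of thumb intended for short oligonucleotides) and the reverse complement. The function names and the example primer are assumptions made for illustration; real primer-design software applies many further criteria.

```python
# Translation table used to complement a DNA sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    """Reverse complement, e.g. to derive a reverse primer from the top strand."""
    return seq.upper().translate(COMPLEMENT)[::-1]


primer = "ATGCATGCATGCATGCATGC"  # made-up 20-mer
print(gc_content(primer), wallace_tm(primer), reverse_complement(primer))
```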

Limitations of Bioinformatics:

Computer based programmes have helped in better understanding of various processes of life science.

However, there are some limitations of bioinformatics which are listed below:

1. Bioinformatics requires a sophisticated molecular biology laboratory for in-depth study of biomolecules. Establishment of such laboratories requires a lot of funds.

2. Computer based study of life science requires some training about various computer programmes applicable for the study of different processes of life science. Thus special training is required for handling of computer based biological data.

3. There should be uninterrupted electricity (power) supply for computer aided biological investigations. Interruption of power may sometimes lead to loss of huge data from the computer memory.

4. There should be regular checking of computer viruses because viruses may pose several problems such as deletion of data and corruption of the programmes.

5. The maintenance and upkeep of molecular laboratories involves a lot of expenditure, which sometimes becomes a limiting factor for computer based molecular studies.

Bioinformatics
Bioinformatics is the use of mathematical, statistical and computer methods to analyze biological, biochemical, and biophysical data. Because bioinformatics is a young, rapidly evolving field, however, it also has a number of other credible definitions. It can also be defined as the science and technology of learning, managing, and processing biological information. Bioinformatics is often focused on obtaining biologically oriented data, organizing this information into databases, developing methods to get useful information from such databases, and devising methods to integrate related data from disparate sources. The computer databases and algorithms are developed to speed up and enhance biological research.

Bioinformatics can help answer such questions as whether a newly analyzed gene is similar to any previously known gene, whether a protein's sequence can suggest how the protein functions, and whether the genes turned on in a cancer cell are different from those turned on in a healthy cell.

Databases And Analysis Programs
A good deal of the early work in bioinformatics focused on processing and analyzing gene and protein sequences catalogued in databases such as GenBank, EMBL, and SWISS-PROT. Such databases were developed in academia or by government-sponsored groups and served as repositories where scientists could store and share their sequence data with other researchers. With the start of the Human Genome Project in 1990, efforts in bioinformatics intensified, rising to the challenge of handling the large amounts of DNA sequence data being generated at an unprecedented rate. By the mid- to late-1990s, much of the effort in bioinformatics centered on genomic data, generated by the Human Genome Project and by private companies, and on proteomic data.

Early analysis of sequence information focused on looking for similarities between genes and between proteins. Algorithms were developed to help researchers rapidly identify similar gene or protein sequences. Such tools were extremely useful for determining whether a newly sequenced piece of DNA was at all similar to sequences already entered in a database. To determine how multiple sequences align and to view their similarities, multiple-alignment programs were developed. Such programs helped scientists compare the sequences of closely related genes or compare the sequence of a particular gene or protein as it appears in several species.
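
To make the idea of scoring sequence similarity concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch approach). It is not one of the fast database-search heuristics the passage alludes to, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # identical sequences
print(global_alignment_score("GATTACA", "GATTAA"))   # one residue missing
```

Only two rows of the scoring matrix are kept in memory, which is enough when the score alone is needed; recovering the alignment itself would require storing the full matrix or traceback pointers.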

To better understand the functional roles of new nucleotide and amino acid sequences, researchers developed algorithms to look for particular sequence "domains." Domains are regions where a particular sequence of nucleotides or amino acids is indicative of function in the protein. For example, a protein may have a domain that binds to ATP or GTP, two important protein regulators.

In addition, these algorithms can detect sequences that denote a region involved in particular types of post-translational modifications, such as tyrosine phosphorylation. Tools such as PROSITE, BLOCKS, PRINTS, and Pfam can be used to detect and predict such protein domains in sequence data.
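
A simple form of such pattern detection can be sketched with a regular expression. The example below searches for the ATP/GTP-binding "P-loop" consensus, written in PROSITE notation as [AG]-x(4)-G-K-[ST]. The function name and the test peptide are illustrative assumptions, and a match only suggests a candidate site; it does not demonstrate function.

```python
import re

# [AG]-x(4)-G-K-[ST] translated to regular-expression syntax. The lookahead
# lets overlapping occurrences be reported as well.
P_LOOP = re.compile(r"(?=([AG].{4}GK[ST]))")


def find_p_loops(protein):
    """Return (1-based start position, matched segment) for each candidate site."""
    protein = protein.upper()
    return [(m.start() + 1, m.group(1)) for m in P_LOOP.finditer(protein)]


print(find_p_loops("MTEYKLVVVGAGGVGKSALTIQ"))  # illustrative peptide
```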

Structure is central to protein function, and another set of tools, including SWISS-MODEL, allows researchers to use gene and protein sequence data to predict a protein's three-dimensional structure. Such tools can help predict how mutations in a gene sequence could alter the three-dimensional structure of the corresponding protein. They accomplish such molecular modeling by comparing a novel sequence to the sequences of genes whose protein structures are known.

The majority of tools were developed as academic freeware distributed on the Internet. In the early- to mid-1990s, commercial companies began to develop their own proprietary algorithms and tools, as well as their own proprietary databases. Those databases were then marketed to pharmaceutical and biotech companies as well as to academic research groups. The most commercially viable and profitable businesses focused on the production and sale of proprietary DNA- and gene-sequence databases in the mid- to late-1990s. These databases primarily contained genetic information that was not in the public domain databases, such as GenBank, and they thus offered potential competitive advantages to the drug discovery groups of large pharmaceutical and biotech companies.

Applications Of Bioinformatics To Drug Discovery
The application of bioinformatics to genomics data could be a huge potential boon for the discovery of new drugs. During the 1990s many pharmaceutical companies and biotech companies became convinced that they could speed up their drug-discovery pipelines by taking advantage of the data from the Human Genome Project as well as by funding their own internal genomics programs and by collaborating with third-party genomics companies.

The goal in such practical applications is to use such data as DNA sequence information and gene expression levels to help discover new drug targets. The vast majority of drugs target proteins, but there are a handful of drugs, such as some chemotherapeutic agents, that bind to DNA. In cases where the target is a protein, the drugs themselves are primarily small chemical molecules or, in some cases, small proteins, such as hormones, that bind to a larger protein in the body. Some drugs are therapeutic proteins delivered to the site of the disease.

The extent to which genomics will actually be able to help identify validated drug targets is uncertain. Genomics and bioinformatics are still young areas, and the drug development cycle can take up to ten years. As of 2001 relatively few of the drugs on the market or in the late stages of clinical trials were discovered via genomics or bioinformatics programs.

Specialists
Bioinformatics is applied to at least five major types of activities: data acquisition, database development, data analysis, data integration, and analysis of integrated data.

Data Acquisition.
Data acquisition is primarily concerned with accessing and storing data generated directly off of laboratory instruments. Many of these instruments are either automated or semi-automated high-throughput instruments that generate large volumes of data. The Human Genome Project utilized hundreds of DNA sequencers, producing enormous amounts of data. The data had to be captured in the appropriate format, and it had to be capable of being linked to all the information related to the DNA samples, such as the species, tissue type, and quality parameters used in the experiments. This area of bioinformatics primarily relates to the use of "laboratory information management systems," which are the computer systems used to manage the information needs of a particular laboratory.

Database Development.
Many laboratories generate large volumes of such data as DNA sequences, gene expression information, three-dimensional molecular structure, and high-throughput screening. Consequently, they must develop effective databases for storing and quickly accessing data. For each type of data, it is likely that a different database organization must be used. A database must be designed to allow efficient storage, search, and analysis of the data it contains. Designing a high-quality database is complicated by the fact that there are several formats for many types of data and a wide variety of ways in which scientists may want to use the data. Many of these databases are best built using a relational database architecture, often based on Oracle or Sybase.

A strong background in relational databases is a fundamental requirement for working in database development. Having some background in the molecular biology techniques used to generate the data is also important. Most critical for the bioinformatics specialist is to have a strong working relationship with the researchers who will be using the database and the ability to understand and interpret their needs into functional database capabilities.

Data Analysis.
Being able to analyze data efficiently requires having a good database design, allowing researchers to query the database effectively and letting them quickly obtain the types of information they need to begin their data analysis. If queries cannot be performed, or if performance is tediously slow, the whole system breaks down, since scientists will not be inclined to use the database. Once data is obtained from the database, the user must be able to easily transform it into the format appropriate for the desired analysis tools.

This can be challenging, since researchers often use a combination of publicly available tools, tools developed in-house, and third-party commercial tools. Each tool may have different input and output formats. Starting in the late 1990s, there have been both commercial and in-house efforts at pharmaceutical and biotech companies to reduce the formatting complexities. Such simplification efforts focus on building analysis systems with a number of tools integrated within them such that the transfer of data between tools appears seamless to the end user.
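
A typical reformatting step of this kind is sketched below: sequence records in FASTA format (a ">" header line followed by sequence lines) are converted to tab-separated text that a spreadsheet or another tool could read. The choice of FASTA and of a tab-separated target is an assumption made for illustration; the function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_tsv(records):
    """Render records as one 'header<TAB>sequence' line per record."""
    return "\n".join(f"{header}\t{sequence}" for header, sequence in records)


example = ">seq1\nATG\nCCC\n>seq2\nGGG\n"
print(to_tsv(parse_fasta(example)))
```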

Bioinformatics analysts have a broad range of opportunities. They may write specific algorithms to analyze data, or they may be expert users of analysis tools, helping scientists understand how the tools analyze the data and how to interpret results. A knowledge of various programming languages, such as Java, PERL, C, C++, and Visual Basic, is very useful, if not required, for those working in this area.

Data Integration.
Once information has been analyzed, a researcher often needs to associate or integrate it with related data from other databases. For example, a scientist may run a series of gene expression analysis experiments and observe that a particular set of 100 genes is more highly expressed in cancerous lung tissue than in normal lung tissue. The scientist might wonder which of the genes is most likely to be truly related to the disease. To answer the question, the researcher might try to find out more information about those 100 genes, including any associated gene sequence, protein, enzyme, disease, metabolic pathways, or signal transduction pathway data.

Such information will help the researcher narrow the list down to a smaller set of genes. Finding this information, however, requires connections or links between the different databases and a good way to present and store the information. An understanding of database architectures and the relationship between the various biological concepts in the databases is key to doing effective data integration.
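
The narrowing-down step can be sketched as a lookup across several annotation sources keyed by gene identifier. Everything in the example is hypothetical: the gene names, the two small dictionaries standing in for external databases, and the rule of keeping only genes annotated in both.

```python
# Hypothetical gene list from an expression experiment.
EXPRESSED_GENES = ["GENE1", "GENE2", "GENE3"]

# Hypothetical stand-ins for a pathway database and a disease database.
PATHWAY_DB = {"GENE1": ["cell cycle"], "GENE3": ["apoptosis", "cell cycle"]}
DISEASE_DB = {"GENE3": ["lung cancer"]}


def integrate(genes, pathway_db, disease_db):
    """Combine the annotations available for each gene into one record."""
    return [
        {
            "gene": gene,
            "pathways": pathway_db.get(gene, []),
            "diseases": disease_db.get(gene, []),
        }
        for gene in genes
    ]


def prioritize(records):
    """Keep genes that carry both a pathway and a disease annotation."""
    return [r["gene"] for r in records if r["pathways"] and r["diseases"]]


records = integrate(EXPRESSED_GENES, PATHWAY_DB, DISEASE_DB)
print(prioritize(records))
```

In practice the hard part is agreeing on a shared identifier across databases; the sketch assumes that problem has already been solved.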

Analysis of Integrated Data.
Once various types of data are integrated, users need a good way to present these various pieces of data so they can be interpreted and analyzed. The information should be capable of being stored and retrieved so that, over time, various pieces of information can be combined to form a "knowledge base" that can be extended as more experiments are run and additional data are integrated from other sources. This type of work requires skills related to database design and architecture. It also requires specific programming skills in various computer languages, as well as expertise in developing interfaces between a computer and its user.

see also Combinatorial Chemistry; Computational Biologist; Evolution of Genes; Genomics; Genomics Industry; High-Throughput Screening; Human Genome Project; Pharmacogenetics and Pharmacogenomics; Proteins; Proteomics; Sequencing DNA.

Bioinformatics
Bioinformatics is a new field that centers on the development and application of computational methods to organize, integrate, and analyze gene-related data. The Human Genome Project (HGP) was an international effort to determine the deoxyribonucleic acid (DNA) base sequence of the entire human genome, which includes about thirty thousand protein-encoding genes, their regulatory elements, and many highly repeated noncoding sections. In 1985, a group of visionary scientists led by Charles DeLisi, who was then the director of the office of health and environmental research at the U.S. Department of Energy (DOE), realized that having the entire human genome in hand would provide the foundation for a revolution in biology and medicine. As a result, the 1988 presidential budget submission to U.S. Congress requested funds to start the HGP. Momentum built quickly and by 1990, DOE and the U.S. National Institutes of Health had laid out plans for a fifteen-year project. An international public consortium and a private company announced completion of a rough draft of the human genome sequence on June 26, 2000, with papers describing the data published eight months later. This is the first generation bestowed with the "parts list" of life, as well as the daunting task of making sense out of it.

Data Management
The Human Genome Project and other genome projects have generated massive data on genome sequences, disease-causing gene variants, protein three-dimensional structures and functions, protein-protein interactions, and gene regulation. Bioinformatics is closely tied to two other new fields: genomics (identification and functional characterization of genes in a massively parallel and high-throughput fashion) and proteomics (analysis of the biological functions of proteins and their interactions), which have also resulted from the genome projects. The fruits of the HGP will have major impacts on understanding evolution and developmental biology, and on scientists' ability to diagnose and treat diseases. Areas outside of traditional biology, such as anthropology and forensic medicine, are also embracing genome information.

Knowing the sequence of the billions of bases in the human genome does not tell scientists where the genes are (about 1.5 percent of the human genome encodes protein). Nor does it tell scientists what the genes do, how genes are regulated, how gene products form a cell, how cells form organs, which mutations underlie genetic diseases, why humans age, and how to develop drugs. Bioinformatics, genomics, and proteomics try to answer these questions using technologies that take advantage of as much gene sequence information as possible. In particular, bioinformatics focuses on computational approaches.

Bioinformatics includes development of databases and computational algorithms to store, disseminate, and rapidly retrieve genomic data. Biological data are complex and abundant. For example, the U.S. National Center for Biotechnology Information (NCBI), a division of the National Institutes of Health, houses central databases for gene sequences (GenBank), disease associations (OMIM), and protein structure (MMDB), and publishes biomedical articles (PubMed). The best way to get a feeling for the magnitude and variety of the data is to access the homepage of NCBI via the World Wide Web (http://ncbi.nlm.nih.gov). A bioinformatics team at NCBI works on the design of the databases and the development of efficient algorithms for retrieving data and comparing DNA sequences.

Applications
Bioinformatics also covers the design of genomics and proteomics experiments and subsequent analysis of the results. For instance, disease tissues (such as those from cancer patients) express different sets of proteins than their normal counterparts. Therefore protein abundance can be used to diagnose diseases. Moreover, proteins that are highly (or uniquely) expressed in disease tissues may be potential drug targets.

Genomics and proteomics generate protein abundance data using different approaches. Genomics determines gene abundance (which is a good indicator of protein abundance) using DNA microarrays, also known as DNA chips, which are high-density arrays of short DNA sequences, each recognizing a particular gene. By hybridizing a tissue sample to a DNA chip, one can determine the activities of many genes in a single experiment. The design of DNA chips—that is, which gene fragments to use in order to achieve maximum sensitivity and specificity, as well as how to interpret the results of DNA chip experiments—are difficult problems in bioinformatics.
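
One basic step in interpreting such an experiment is comparing each gene's signal in diseased and normal tissue. The sketch below uses a log2 fold change with a pseudocount to avoid division by zero; the pseudocount of 1, the two-fold threshold and the gene names are assumptions for illustration, and a real analysis would also need normalization and statistical testing.

```python
import math


def log2_fold_change(disease, normal, pseudocount=1.0):
    """log2 ratio of disease to normal signal, stabilized by a pseudocount."""
    return math.log2((disease + pseudocount) / (normal + pseudocount))


def differentially_expressed(disease_levels, normal_levels, threshold=1.0):
    """Return genes whose absolute log2 fold change reaches the threshold."""
    hits = {}
    for gene, disease in disease_levels.items():
        change = log2_fold_change(disease, normal_levels[gene])
        if abs(change) >= threshold:
            hits[gene] = change
    return hits


# Made-up signal intensities for two genes.
print(differentially_expressed({"g1": 7, "g2": 3}, {"g1": 1, "g2": 3}))
```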

Proteomics measures protein abundance directly using mass spectroscopy, which is a way to measure the mass of a protein. Since mass is not unique enough for identifying a protein, one usually cuts the protein with enzymes (that cut at specific places according to the protein sequence) and measures the masses of the resulting fragments using mass spectroscopy. Such "mass distributions" for all proteins with known sequences can be generated using computers and stored. By comparing the mass distribution of an unknown protein sample to those of known proteins, one can identify the sample. Such comparisons require complex computational algorithms, especially when the sample is a mixture of proteins. Although not as efficient as DNA chips, mass spectroscopy can directly measure protein abundance. In fact, spectrometric identification of proteins has been one of the most significant advances in proteomics.
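
The computation of a theoretical "mass distribution" can be sketched as follows. The example assumes trypsin as the enzyme (cutting after lysine or arginine unless the next residue is proline), which the passage does not specify, and uses standard monoisotopic residue masses rounded to five decimals. The function names, the test peptide and the 0.01 Da matching tolerance are illustrative choices.

```python
# Monoisotopic masses of amino acid residues, in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of a peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def fingerprint(protein):
    """Sorted list of theoretical peptide masses for a protein."""
    return sorted(peptide_mass(p) for p in tryptic_digest(protein))


def shared_peaks(observed, theoretical, tolerance=0.01):
    """Count observed masses that match a theoretical mass within tolerance."""
    return sum(
        1 for m in observed if any(abs(m - t) <= tolerance for t in theoretical)
    )


print(tryptic_digest("AKRPGK"), fingerprint("AKRPGK"))
```

An unknown sample would be scored against the stored fingerprint of every candidate protein, and the candidate with the most shared peaks reported; handling mixtures and modified residues is where the real algorithms become complex.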

Bioinformatics can lead to discovery of new proteins. When the cystic fibrosis gene (CF) was first identified in 1989, for example, researchers compared its DNA sequence computationally to all sequences known at that time. The comparison revealed striking homology (sequence similarity) to a large family of proteins involved in active transport across cell membranes. Indeed, the CF gene encodes a membrane-spanning chloride ion channel, called the cystic fibrosis transmembrane regulator, or CFTR. The identification of gene function by searching for sequence homology is a widely used bioinformatics method. When no homology is found, one may still be able to tell if a gene codes for membrane-spanning channels using computational tools. Membranes are bilayers of lipid molecules, which are water insoluble. An ion channel typically has regions outside the membrane (water soluble) and regions inside the membrane (water insoluble) arranged in a certain pattern. Computer algorithms have been developed to capture such patterns in a gene sequence.
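
One classic way to capture such patterns is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle hydropathy scale; the 19-residue window and the 1.6 cutoff are commonly quoted starting values and should be treated as assumptions here, as should the function names and the toy sequences.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Average hydropathy of every window of the given length."""
    scores = [KYTE_DOOLITTLE[residue] for residue in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def has_candidate_tm_segment(protein, window=19, threshold=1.6):
    """True if any window is hydrophobic enough to suggest a membrane span."""
    return any(s >= threshold for s in hydropathy_profile(protein, window))


# Toy sequence: a hydrophobic stretch flanked by charged residues.
print(has_candidate_tm_segment("K" * 10 + "L" * 19 + "K" * 10))
```

A channel protein would be expected to show several such hydrophobic windows separated by hydrophilic loops; counting and spacing those windows is the next step a fuller predictor would take.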

By thinking boldly and by setting ambitious goals, the Human Genome Project has brought about a new era in biological and biomedical research. Many revolutionarily new technologies are being developed, most of which have significant computational components. The avalanche of genomic data also enables model-based reasoning. The bright future of bioinformatics calls for individuals who can think quantitatively and in the meantime love biology—an unusual combination.

bioinformatics The collection, storage, and analysis of DNA- and protein-sequence data using computerized systems. Much of the data generated by genome sequencing projects and protein studies is held in various databanks and made available to researchers throughout the world via the Internet. Many computer programs have been developed to analyse sequence data, enabling the user to identify similarities between newly sequenced material and existing sequences. This allows, for example, predictions about the structure and function of a protein from its amino-acid sequence data or from the nucleotide sequence of its gene. Genome-wide sequence analysis of newly discovered organisms, especially bacteria or protoctists, indicates the array of proteins they are likely to manufacture, and therefore the kind of lifestyle they are likely to lead. Also, comparisons between genomes of different species provide information about their possible evolutionary relationships.
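
A very simple measure used when comparing sequences from different species is the proportion of aligned sites that differ (often called the p-distance). The sketch below assumes the sequences are already aligned and of equal length; the species labels and sequences are made up, and real phylogenetic work uses substitution models that correct for multiple changes at the same site.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Made-up aligned fragments from three hypothetical species.
reference = "ATGCCGTA"
close_relative = "ATGCCGTT"
distant_relative = "ATGAAGCT"
print(p_distance(reference, close_relative))
print(p_distance(reference, distant_relative))
```

The smaller distance for the first pair is what one would expect of more closely related organisms.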

Bioinformatics and computational biology
Bioinformatics, or computational biology, refers to the development of new database methods to store genomic information, of computational software programs and methods to extract, process and evaluate this information, and the refinement of existing techniques to acquire the genomic data. Finding genes and determining their function, predicting the structure of proteins and RNA (ribonucleic acid) sequences from the available DNA (deoxyribonucleic acid) sequence, and determining the evolutionary relationship of proteins and DNA sequences are also part of bioinformatics.
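As a rough illustration of "finding genes" in a DNA sequence, the sketch below scans the forward strand for open reading frames (an ATG start codon followed in the same frame by a stop codon) and translates them with the standard genetic code. It is a deliberately simplified assumption of how gene finding can begin: real gene finders also handle the reverse strand, introns and statistical signals. The function names and the `min_codons` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code in the compact TCAG ordering ('*' = stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna, min_codons=3):
    """Yield (start, end, protein) for ATG...stop frames on the forward
    strand. `end` is exclusive and includes the stop codon."""
    dna = dna.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3, translate(dna[start:i])
                start = None


orfs = list(find_orfs("CCATGGCTTGGAAATAGCC"))
```

Here the toy sequence contains a single reading frame, ATG GCT TGG AAA TAG, which translates to the peptide MAWK.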

The genome sequences of some bacteria, yeast, a nematode, the fruit fly Drosophila, and several plants have been obtained in the recent past, with many more sequences completed or nearing completion. The initial sequencing (a rough draft) of the human genome was completed in 2000, and in April 2003 it was announced that the complete genome sequence had been finished. In May 2006, the sequence of the last chromosome was published in the journal Nature. Although the Human Genome Project has been publicly declared complete, work continues in order to refine the data. As of 2005, the estimated number of genes in the human genome had been revised to 20,000 to 25,000, down from the earlier estimate of 30,000 to 40,000. Experts predict that it will take geneticists several more years before a precise number can be given.

In addition to this accumulation of nucleotide sequence data, elucidation of the three-dimensional structure of the proteins coded for by the genes has been accelerating. The result is a vast, ever-increasing number of databases and amount of genetic information. The efficient and productive use of this information requires specialized computational techniques and software. Bioinformatics has developed and grown from the need to extract and analyze the reams of genomic information, such as nucleotide sequences and protein structures.

Bioinformatics utilizes statistical analysis, step-wise computational analysis and database management tools in order to search databases of DNA or protein sequences, to filter out background from useful data and to enable comparison of data from diverse databases. This sort of analysis is ongoing. The exploding number of databases, and the various experimental methods used to acquire the data, can make comparisons tedious to achieve. However, the benefits can be enormous. The immense size and network of biological databases provide a resource to answer biological questions about mapping, gene expression patterns, molecular modeling and molecular evolution, and to assist in the structure-based design of therapeutic drugs.

Obtaining information is a multi-step process. Databases are examined, or browsed, by posing complex computational questions. Researchers who have derived a DNA or protein sequence can submit the sequence to public repositories of such information to see if there is a match or similarity with their sequence. If so, further analysis may reveal a putative structure for the protein coded for by the sequence as well as a putative function for that protein. Primary databases contain one type of information (only DNA sequence data or only protein sequence data). Four primary databases currently available for these purposes are the European Molecular Biology Laboratory (EMBL) nucleotide sequence database, GenBank, SwissProt and the Protein Information Resource (PIR). Secondary databases contain information derived from other databases. Specialist databases, or knowledge databases, are collections of sequence information, expert commentary, and reference literature. Finally, integrated databases are collections (amalgamations) of primary and secondary databases.
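The idea of submitting a sequence to a repository "to see if there is a match or similarity" can be sketched with a small local alignment search. The example below uses the Smith-Waterman algorithm with a linear gap penalty as a stand-in; public services generally rely on faster heuristics, and the scoring values, the function names and the two-entry database here are illustrative assumptions only.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, database):
    """Rank database entries (name -> sequence) by local alignment score."""
    hits = [
        (local_alignment_score(query, seq), name)
        for name, seq in database.items()
    ]
    return sorted(hits, reverse=True)


# Hypothetical toy database; real repositories hold millions of entries.
toy_database = {"hit": "TTACGTACGTTT", "miss": "CCCCCCCC"}
ranked = search("ACGTACGT", toy_database)
```

The highest-scoring entries are the candidates a researcher would inspect further for putative structure and function.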

The area of bioinformatics concerned with protein sequences makes it possible to predict the three-dimensional structures of protein molecules, by use of computer graphics and by comparison with similar proteins whose structures have been determined from crystals. Knowledge of structure allows the site(s) critical for the function of the protein to be determined. Subsequently, drugs active against the site can be designed, or the protein can be utilized to enhance commercial production processes, such as in pharmaceutical bioinformatics.

Bioinformatics also encompasses the field of comparative genomics. This is the comparison of functionally equivalent genes across species. A yeast protein is likely to have the same function as a worm protein with a similar amino acid sequence. Alternatively, genes having similar sequences may have divergent functions. Such similarities and differences will be revealed by the sequence information. Practically, such knowledge aids in the selection and design of genes to instill a specific function in a product to enhance its commercial appeal.

The most widely known example of a bioinformatics-driven endeavor is the Human Genome Project (HGP), mentioned earlier. Charles DeLisi, who at the time was Director of the Health and Environmental Research Programs under the U.S. Department of Energy (DOE), began the HGP in 1986. The project was formally established in the United States in 1990 as a joint project of the DOE and the U.S. National Institutes of Health. International cooperation occurred among geneticists from the United States, Japan, Germany, France, and the United Kingdom. Work related to the Human Genome Project has allowed dramatic improvements worldwide in molecular biological techniques and improved computational tools for studying genomic function.

See also Chromosome mapping; Deoxyribonucleic acid (DNA); Genetic engineering; Genetic testing; Genome; Molecular biology; Proteomics; Ribonucleic acid (RNA).

Bioinformatics: History, Coverage, Components and Applications


Bioinformatics covers many specialized and advanced areas of biology. These areas include: (1) Functional Genomics, (2) Structural Genomics, (3) Comparative Genomics, (4) DNA Microarrays and (5) Medical Informatics.

Bioinformatics is the combination (or marriage) of biology and information technology. Basically, bioinformatics is a recently developed science that uses information to understand biological phenomena. It broadly involves the computational tools and methods used to manage, analyse and manipulate large volumes of biological data.

Bioinformatics may also be regarded as a part of computational biology. The latter is concerned with the application of quantitative analytical techniques in modeling and solving problems in biological systems. Bioinformatics is an interdisciplinary approach requiring advanced knowledge of computer science, mathematics and statistical methods for the understanding of biological phenomena at the molecular level.

History and Relevance of Bioinformatics:
The term bioinformatics came into widespread use in the 1990s. Originally, it dealt with the management and analysis of data pertaining to DNA, RNA and protein sequences. As biological data are being produced at an unprecedented rate, their management and interpretation invariably require bioinformatics. Thus, bioinformatics now includes many other types of biological data.

Some of the most important ones are listed below:

i. Gene expression profiles

ii. Protein structure

iii. Protein interactions

iv. Microarrays (DNA chips)

v. Functional analysis of biomolecules

vi. Drug designing.

Bioinformatics is largely (but not exclusively) a computer-based discipline. Computers are essential for handling large volumes of biological data and for their storage and retrieval. Yet no computer on earth, however advanced, can store information and perform functions the way a living cell does. Thus a highly complex information technology lies right within the cells of an organism. This primarily includes the organism’s genes and their dictates for the organism’s biological processes and behaviour.

Broad Coverage of Bioinformatics:
Bioinformatics covers many specialized and advanced areas of biology.

Functional genomics:

Identification of genes and their respective functions.

Structural genomics:

Predictions related to functions of proteins.

Comparative genomics:

For understanding the genomes of different species of organisms.

DNA microarrays:

These are designed to measure the levels of gene expression in different tissues, various stages of development and in different diseases.

Medical informatics:

This involves the management of biomedical data with special reference to biomolecules, in vitro assays and clinical trials.
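To make the DNA microarray entry above more concrete, the sketch below compares expression levels between two conditions using a log2 fold change. It is a minimal illustration under stated assumptions: the gene names and intensity values are invented, the pseudocount and the two-fold cutoff are arbitrary choices, and real analyses also require normalization and statistical testing.

```python
import math


def log2_fold_changes(sample, reference, pseudocount=1.0):
    """Return gene -> log2(sample / reference) for genes present in both.
    The pseudocount avoids division by zero for unexpressed genes."""
    return {
        gene: math.log2(
            (sample[gene] + pseudocount) / (reference[gene] + pseudocount)
        )
        for gene in sample.keys() & reference.keys()
    }


def differentially_expressed(changes, min_abs_log2=1.0):
    """Genes whose level changed at least two-fold by default, sorted by name."""
    return sorted(g for g, lfc in changes.items() if abs(lfc) >= min_abs_log2)


# Hypothetical intensities for a diseased and a healthy tissue sample.
diseased = {"geneA": 399, "geneB": 99, "geneC": 24}
healthy = {"geneA": 99, "geneB": 99, "geneC": 99}
changes = log2_fold_changes(diseased, healthy)
changed_genes = differentially_expressed(changes)
```

A positive value means higher expression in the first sample, a negative value lower expression, and a value near zero no change.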

Components of Bioinformatics:
Bioinformatics comprises three components:

1. Creation of databases:

This involves the organization, storage and management of biological data sets. The databases are accessible to researchers, who can consult the existing information and submit new entries, e.g. a protein sequence data bank for molecular structure. Databases are of no use until they are analysed.

2. Development of algorithms and statistics:

This involves the development of tools and resources to determine the relationships among the members of large data sets, e.g. comparison of new protein sequence data with already existing protein sequences.

3. Analysis of data and interpretation:

This involves the appropriate use of components 1 and 2 (given above) to analyse the data and interpret the results in a biologically meaningful manner. The data include DNA, RNA and protein sequences, protein structures, gene expression profiles and biochemical pathways.

Bioinformatics and the Internet:
The internet is an international computer network. A computer network involves a group of computers that can communicate (usually over a telephone system) and exchange data between users. It is the internet protocol (IP) that determines how the packets of information are addressed and routed over the network. To access the internet, a computer must have the correct hardware (modem or network card), appropriate software and permission to access the network. For this purpose, one has to subscribe to an internet service provider (ISP).

World Wide Web (WWW):
The WWW involves the exchange of information over the internet using a programme called a browser. The most widely used browsers are Internet Explorer and Netscape Navigator.

Applications of Bioinformatics:
The advent of bioinformatics has revolutionized biological science, and biotechnology has benefited greatly from it. The best example is the sequencing of the human genome in record time, which would not have been possible without bioinformatics.

A selected list of applications of bioinformatics is given below:

i. Sequence mapping of biomolecules (DNA, RNA, proteins).

ii. Identification of nucleotide sequences of functional genes.

iii. Finding of sites that can be cut by restriction enzymes.

iv. Designing of primer sequence for polymerase chain reaction.

v. Prediction of functional gene products.

vi. Tracing of the evolutionary trees of genes.

vii. Prediction of the 3-dimensional structure of proteins.

viii. Molecular modelling of biomolecules.

ix. Designing of drugs for medical treatment.

x. Handling of vast amounts of biological data, which would otherwise not be possible.

xi. Development of models for the functioning of various cells, tissues and organs.

The above list of applications, however, should be treated as incomplete, since at present there is no field in the biological sciences that does not involve bioinformatics.
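Two of the listed applications, finding restriction enzyme sites and designing PCR primers, lend themselves to a short sketch. The example below assumes the EcoRI recognition sequence GAATTC as the default site and uses the Wallace rule, Tm = 2(A+T) + 4(G+C), as a rough melting temperature estimate for short primers. The function names are hypothetical, and only the given strand is scanned, which suffices for palindromic sites such as EcoRI.

```python
def find_sites(dna, site="GAATTC"):
    """0-based start positions of every occurrence of `site` (overlaps allowed)."""
    dna, site = dna.upper(), site.upper()
    positions = []
    i = dna.find(site)
    while i != -1:
        positions.append(i)
        i = dna.find(site, i + 1)
    return positions


def wallace_tm(primer):
    """Approximate melting temperature in degrees Celsius by the Wallace
    rule; a rough guide for short oligonucleotides only."""
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def gc_fraction(primer):
    """Fraction of G and C bases, a common primer quality check."""
    p = primer.upper()
    return (p.count("G") + p.count("C")) / len(p)


sites = find_sites("AAGAATTCTTGAATTCGG")
tm = wallace_tm("ATGCATGCATGCATGC")
gc = gc_fraction("ATGCATGCATGCATGC")
```

For a non-palindromic recognition sequence, the reverse complement of the site would need to be searched as well.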

Margaret Oakley Dayhoff has been called the “mother and father of bioinformatics” as she was a pioneer of applying mathematics and computational methods to biochemistry.

Educated as a physical chemist, she used computers for the data analysis in her theoretical chemistry work, calculating the resonance energies of several polycyclic organic molecules on punch card machines. She completed this work as part of her doctoral thesis at Columbia University, following this up with post-doctoral research at the Rockefeller Institute and the University of Maryland.

In 1959 she joined the National Biomedical Research Foundation and shortly thereafter began developing tools to aid protein chemists in the determination of amino acid sequences by automatically overlapping the sequences of peptides. Seeing the need for a sequence database, she began collecting protein sequences in the Atlas of Protein Sequence and Structure, publishing the book in 1965; it was followed by several further editions.

The Atlas was organized by gene families, and Dayhoff is considered a pioneer in their recognition. In 1980 she also developed the first on-line database system, a sequence database that could be accessed by remote computers over a telephone line. In an attempt to reduce the size of the data files used in sequencing, she developed a one-letter code for amino acids that was accepted by the International Union of Pure and Applied Chemistry (IUPAC).
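The space saving offered by the one-letter code can be seen in a small conversion sketch. The table below lists the twenty standard amino acids with their accepted one-letter symbols; the function name and the hyphen-separated input format are assumptions made for illustration.

```python
# Three-letter to one-letter symbols for the twenty standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter, sep="-"):
    """Convert e.g. 'Met-Ala-Trp-Lys' to 'MAWK' (case-insensitive input)."""
    return "".join(
        THREE_TO_ONE[residue.capitalize()] for residue in three_letter.split(sep)
    )


peptide = "Met-Ala-Trp-Lys"
compact = to_one_letter(peptide)
```

In this toy case the four-residue peptide shrinks from 15 characters to 4.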

As part of her work with amino acids she originated one of the first substitution matrices, the Point Accepted Mutation (PAM) matrices. A point accepted mutation is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, accepted by the process of natural selection.

As an active member of the Biophysical Society she served as the first female officer of the group, including as president. Because of her work in bioinformatics, as well as her mentoring of women working in scientific areas, the Biophysical Society created an award in her honor that is given to a promising young woman to encourage her to enter a career in scientific research.

Bioinformatics is the application and development of methods from computer science to solve challenges within molecular biology and medicine. Modern molecular biology generates large amounts of data and is therefore highly dependent on advanced computer science. For example, today it is possible to map the genome (the combined genetic material) for a given individual. Increasingly better bioinformatics approaches have to be developed to organize and analyze such data, and to find connections between genes, lifestyle and diseases. Bioinformatics is also central in the study of how proteins function in an organism and in the development of new medicines, for example through the discovery and optimization of new enzymes.

What Is Bioinformatics?
Bioinformatics is the application of information technology to the study of living things, usually at the molecular level. Bioinformatics involves the use of computers to collect, organize and use biological information to answer questions in fields like evolutionary biology.

Bioinformatics Overview
Over the past decades, the quantity and quality of biological information have skyrocketed, largely because of advances in molecular biology and genomic technology. The Bioinformatics Organization reports that bioinformatics is used to develop databases, like those produced by the Human Genome Project, that store, organize and index biological information for analysis.

The value of bioinformatics goes beyond the scientific community. This field allows scientists to create comprehensive databases of biological and health information that can be used to test theories and generate solutions to medical problems that affect us all. The National Center for Biotechnology Information reports that there are three main scientific applications of bioinformatics. These are described below.

Evolutionary Biology
Evolutionary biology looks at the molecules of different organisms and determines whether they share a common evolutionary history. This process has the potential to uncover relationships between life forms never considered before. By using bioinformatics to track this data, evolutionary biologists can gain new insights into the causes of and cures for various diseases.
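One simple way to quantify how closely related two molecules are is to count the differences between aligned sequences and correct for multiple substitutions at the same site. The sketch below assumes pre-aligned nucleotide sequences of equal length and uses the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3), which is only one of several substitution models; the function names and the example sequences are hypothetical.

```python
import math


def p_distance(a, b):
    """Fraction of aligned positions that differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and equal in length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(a, b):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned fragments from two species.
species_a = "ACGTACGTAC"
species_b = "ACGTACGTAT"
raw = p_distance(species_a, species_b)
corrected = jukes_cantor(species_a, species_b)
```

Pairwise distances of this kind are a typical input for building evolutionary trees; the corrected distance is always at least as large as the raw proportion of differences.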

Protein Modeling
Proteins have specific functions in our bodies determined by DNA sequences. Using bioinformatic techniques, scientists can test theories about how various proteins interact. These tests may help scientists understand how diseases develop in living organisms.

Genome Mapping
Genome mapping is another bioinformatic technique used for scientific research. Computerized genomic maps make it easier for scientists to locate genes, and this increased efficiency results in higher productivity and greater scientific advancements. Due to this development in bioinformatics, scientists can spend less time on the painstaking mapping process and more time testing their hypotheses.

How to Study Bioinformatics
Bioinformatics is typically studied at the graduate level. However, some bachelor's degree programs in relevant fields, like bioengineering, computer science, biology and chemistry, offer a specialization in bioinformatics. Master's degree programs in bioinformatics can prepare graduates for applied research or consulting jobs, and PhD graduates can seek a range of research jobs, as well as university-level teaching positions.

Core bioinformatics courses may include molecular biology, probability, statistics, computing and informatics, while advanced courses may cover population genetics, molecular genomic and epigenomic data analysis, biological mathematical modeling, biostatistics, sustainability mathematics and computational neuroscience. Graduate degree programs typically require laboratory work, research, an internship and a thesis or dissertation.

