PERL Programming

Perl:

* Perl is a stable, cross-platform programming language.
* Perl is commonly expanded as "Practical Extraction and Report Language", although the name is not officially an acronym.
* It is used for mission critical projects in the public and private
sectors.
* Perl is Open Source software, licensed under its Artistic
License or the GNU General Public License (GPL).
* Perl was created by Larry Wall.
* Perl 1.0 was released to Usenet's alt.comp.sources in 1987.
* PC Magazine named Perl a finalist for its 1998 Technical
Excellence Award in the Development Tool category.
* Perl is listed in the Oxford English Dictionary.

Supported Operating Systems:

* Unix systems
* Macintosh - (OS 7-9 and X) see The MacPerl Pages.
* Windows - see ActiveState Tools Corp.
* VMS
* And many more...

Best Features of Perl:

* Perl takes the best features from other languages, such as C, awk,
sed, sh, and BASIC, among others.
* Perl's database integration interface (DBI) supports third-party databases including Oracle, Sybase, Postgres, MySQL and others.
* Perl works with HTML, XML, and other mark-up languages.
* Perl supports Unicode.
* Perl is Y2K compliant.
* Perl supports both procedural and object-oriented programming.
* Perl interfaces with external C/C++ libraries through XS or SWIG.
* Perl is extensible. There are thousands of third-party modules available
from the Comprehensive Perl Archive Network (CPAN).
* The Perl interpreter can be embedded into other systems.

PERL and the Web

* Perl was for many years one of the most popular web programming languages
due to its text manipulation capabilities and rapid development cycle.
* Perl is widely known as "the duct-tape of the Internet".
* Perl's CGI.pm module (bundled with Perl's standard distribution until
version 5.22, now available from CPAN) makes handling HTML forms simple.
* Perl can handle encrypted Web data, including e-commerce transactions.
* Perl can be embedded into web servers to speed up processing by as
much as 2000%.
* mod_perl allows the Apache web server to embed a Perl interpreter.
* Perl's DBI package makes web-database integration easy.

Bioinformatics is a tool for solving biological problems based on existing data.

Bioinformatics is a method for predicting biological outcomes based on existing experimental results.

Bioinformatics = Biology + Informatics + Statistics + (Biochemistry + Biophysics).

Bioinformatics gives biologists a way to store all their data.

Bioinformatics makes some lab experiments easier by predicting their outcome.

Sometimes bioinformatics shows how to start a lab experiment from existing results.

Bioinformatics helps researchers get an idea about a lab experiment before they start it.

Gene prediction / gene finding software:

After sequencing the genome of an organism, the next and most important step is to predict the genes in the genome. Homology search (e.g. BLAST) is a simple and straightforward method to predict genes.

GLIMMER - To identify coding regions in microbial DNA.

GENSCAN - To predict complete gene structures, including exons, introns, promoter and poly-adenylation signals, in genomic sequences.

GeneMark - For finding genes in bacterial DNA sequences.

WebGene - Web interface for several coding region recognition programs.

Installing and executing stand-alone BLAST software on Linux:

Stand-alone BLAST is a local installation of the NCBI BLAST suite of programs. NCBI provides binaries for various platforms. The programs are the same as the NCBI BLAST web programs, except that they run on the local machine.

A local installation is useful when you have a large set of sequences to BLAST: it is not affected by Internet speed or traffic, and it can be automated.

Stand-alone BLAST can be downloaded from the NCBI FTP site (the link can be found in the bottom toolbar of the NCBI main page: “FTP Site-> Blast-> executables->Latest”).

Download the file in binary mode. Filenames have the following form:

Program-version-architecture-os.extension

Remember to choose the appropriate architecture (32 bit or 64 bit). Download the file and extract the contents of the gzip'ed tar archive. The ‘.gz’ file extension indicates that the file has been compressed with gzip (a standard Unix compression utility). The ‘.tar’ extension indicates that the file is a tape archive created with tar (a standard Unix archiving tool).

To uncompress the file with gunzip and then extract the files from the archive into the current working directory with tar, follow the commands given below.

jk@jk:~/Desktop$ gunzip blast-2.2.18-ia32-linux.tar.gz #uncompress

jk@jk:~/Desktop$ tar -xpf blast-2.2.18-ia32-linux.tar #extract

For more information on the options, see the manual pages (man tar, man gunzip).
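
The two commands above can also be combined into a single step, because tar can call gzip itself through its -z option. The following is only a minimal sketch, assuming GNU or BSD tar; the function name extract_blast is an illustrative helper and not part of BLAST.

```shell
#!/bin/sh
# Sketch: uncompress and extract a gzip'ed tar archive in one step.
# extract_blast is an illustrative helper name, not part of BLAST.
extract_blast() {
    archive=$1
    # -x extract, -z filter through gzip, -p preserve permissions, -f archive file
    tar -xzpf "$archive"
}

# Usage (archive name as used in this post):
#   extract_blast blast-2.2.18-ia32-linux.tar.gz
```

Unlike gunzip followed by tar, this leaves the original .tar.gz file in place, which is handy if you want to extract it again later.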

When you go into the extracted directory you will see three directories (bin, data, doc). The doc directory contains the README files for each program. The data directory contains the scoring matrices. The bin directory contains all the executables for running the various BLAST searches.

How to execute bl2seq (BLAST two sequence):

Bl2seq performs a comparison between two sequences using either the blastn or blastp algorithm. Both sequences must be either nucleotides or proteins.

The input files to any BLAST program should always be in FASTA format.

e.g.
>gi|229673|pdb|1ALC| Alpha-Lactalbumin
KQFTKCELSQNLYDIDGYGRIALPELICTMFHTSGYDTQAIVENDESTEYGLFQISNALWCKSSQSPQSR
NICDITCDKFLDDDITDDIMCAKKILDIKGIDYWIAHKALCTEKLEQWLCEKE
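
Before passing a file to BLAST it can help to check that it really looks like FASTA, i.e. that each record starts with a header line beginning with '>'. The sketch below shows one possible way to do this with standard Unix tools; both function names are made up for illustration and are not part of the BLAST suite.

```shell
#!/bin/sh
# Sketch: quick sanity checks on a FASTA file before passing it to BLAST.
# Both function names are illustrative helpers, not part of the BLAST suite.

# Print the number of records, i.e. the number of header lines starting with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed only if the first non-blank line is a FASTA header.
looks_like_fasta() {
    first=$(grep -v '^[[:space:]]*$' "$1" | head -n 1)
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Usage (file1 is a placeholder file name):
#   looks_like_fasta file1 && count_fasta_records file1
```

This is only a rough check: it does not validate the sequence characters themselves.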

Syntax:

jk@jk:~/Desktop/blast-2.2.18/bin$ ./bl2seq - # Displays all options

You can choose the required options. The mandatory options are -p, -i and -j. The other options can be set explicitly, or else the program will use the default value.

jk@jk:~/Desktop/blast-2.2.18/bin$ ./bl2seq -p blastp -e 0.01 -i file1 -j file2 # blastp: compares two protein sequences
-i First sequence [File In]
-j Second sequence [File In]
-p Program name: blastp, blastn, blastx, tblastn, tblastx. For blastx the first sequence should be a nucleotide sequence; for tblastn the second sequence should be a nucleotide sequence.
-e E-value (optional)

The two input files (file1, file2) should be in the current working directory (blast-2.2.18/bin) for the above syntax to work. If not, give the appropriate path. If you have multiple FASTA sequences to compare, you can automate the above command using shell scripts.
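
As a rough illustration of such automation, the sketch below compares one reference sequence against every FASTA file in a directory. The BL2SEQ variable, the function name compare_all, the .fasta file extension and the directory layout are all assumptions made for this example; only the -p, -e, -i and -j options come from the description above.

```shell
#!/bin/sh
# Sketch: run bl2seq for one reference sequence against every FASTA file
# in a directory. BL2SEQ, compare_all and the file layout are assumptions
# made for illustration; only -p, -e, -i and -j come from the post.
BL2SEQ=${BL2SEQ:-./bl2seq}

compare_all() {
    reference=$1   # FASTA file holding the first sequence (-i)
    seq_dir=$2     # directory with one FASTA file per sequence (-j)
    out_dir=$3     # directory where the reports are written
    mkdir -p "$out_dir"
    for subject in "$seq_dir"/*.fasta; do
        [ -e "$subject" ] || continue   # skip if no file matched the pattern
        name=$(basename "$subject" .fasta)
        # bl2seq writes its report to standard output; save one file per pair
        "$BL2SEQ" -p blastp -e 0.01 -i "$reference" -j "$subject" \
            > "$out_dir/$name.bl2seq.txt"
    done
}

# Usage (all three names are placeholders):
#   compare_all reference.fasta sequences/ results/
```

Each report ends up in its own file named after the second sequence file, so the results of a long run are easy to inspect one by one.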

How to execute Blastall:

Blastall is the most commonly used tool. It can run all the BLAST programs: blastp, blastn, blastx, tblastn and tblastx. Unlike bl2seq, blastall is used when you have multiple FASTA sequences as input queries to be searched against an appropriate protein or nucleotide database.
You can download a protein or nucleotide database from Swiss-Prot or NCBI. For example, to download the human chr22,

go to NCBI-> FTP site-> RefSeq-> H_sapiens-> H_sapiens ->chr22.

Note:

Plain FASTA-formatted database files cannot be searched directly by the BLAST programs. You need to prepare the FASTA files for BLAST with formatdb. This indexes the entries in the FASTA file and enables BLAST to run much faster.

Uncompress the database. It will look like the one below if it's a protein sequence database. A multiple-sequence input query to blastall will look similar to this.

>gi|86438068|gb|AAI12638.1| HGD protein [Bos taurus]
MTELKYISGFGNECASEDPRCPGALPEGQNNPQVCPYNLYAEQLSGSAFTCPRSTNKRSWLYRILPSVSH
KPFEFIDQGHITHNWD
>gi|116283875|gb|AAH44758.1| Hgd protein [Mus musculus]
MSVLQRILAVQVPCPKDSWLYRILPSVSHKPFESIDQGHVTHNWDEVGPDPNQLRWKPFEIPKASEKKVD
FVSGLYTLCGAGDIKSNNGLAVHIFLCNSSMENRCFYNSDGDFLIVPQKGKLLIYTEFGKMSLQPNEICV
>gi|116283724|gb|AAH24369.1| Hgd protein [Mus musculus]
MSVLQRILAVQVPCPKDSWLYRILPSVSHKPFESIDQGHVTHNWDEVGPDPNQLRWKPFEIPKASEKKVD

1. Formatdb:
jk@jk:~/Desktop/blast-2.2.18/bin$ ./formatdb - # displays all options

jk@jk:~/Desktop/blast-2.2.18/bin$ ./formatdb -i database -o T -p T # "database" stands for your FASTA database file

-i Input file(s) for formatting (this parameter must be set) [File In]
-p Type of file T - protein F - nucleotide (default = T)
-o Parse options T - True: Parse SeqId and create indexes. F - False: Do not parse SeqId. (default = F)

The input database should be in the current working directory (blast-2.2.18/bin) for the above syntax to work. If not, give the appropriate path.

After running formatdb you will see seven index and data files alongside the original input file. All seven files are required for blastall to run. Make sure the original database file and the generated files are kept in the same directory. Check formatdb.log for error messages.
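
A quick way to confirm that all the generated files are present is sketched below. The list of seven extensions is an assumption based on the usual output of formatdb for a protein database formatted with -p T -o T (nucleotide databases use extensions starting with n instead of p), and check_protein_db is a made-up helper name.

```shell
#!/bin/sh
# Sketch: check that the files produced by "formatdb -p T -o T" are present.
# The seven extensions listed here are an assumption based on the usual
# output for a protein database; check_protein_db is an illustrative name.
check_protein_db() {
    db=$1
    missing=0
    for ext in phr pin psq psd psi pnd pni; do
        if [ ! -f "$db.$ext" ]; then
            echo "missing: $db.$ext" >&2
            missing=1
        fi
    done
    # Exit status 0 when every file exists, 1 otherwise
    return "$missing"
}

# Usage ("database" is a placeholder for your FASTA database file):
#   check_protein_db database || echo "run formatdb again"
```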

2. Executing Blastall:
jk@jk:~/Desktop/blast-2.2.18/bin$ ./blastall -i query_file -p blastp -d database -o output_file # the file names are placeholders

-p Program Name [String] Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".
-d Database [String] default = nr The database specified must first be formatted with formatdb.
-i Query File [File In]
-o BLAST report Output File [File Out]

The input database should be in the current working directory (blast-2.2.18/bin) for the above syntax to work. If not, give the appropriate path.

The output file will contain the BLAST output for all the input query sequences.
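
The formatdb and blastall steps can be chained in a small shell function, so that the database is formatted only when it has not been formatted yet. This is just a sketch: the FORMATDB and BLASTALL variables, the function name run_blastp and the check for a ".pin" index file are assumptions made for illustration, while the options passed to the two programs are the ones described above.

```shell
#!/bin/sh
# Sketch: format a protein FASTA database if needed, then search it.
# FORMATDB, BLASTALL, run_blastp and the ".pin" check are illustrative
# assumptions; the program options are the ones described in this post.
FORMATDB=${FORMATDB:-./formatdb}
BLASTALL=${BLASTALL:-./blastall}

run_blastp() {
    query=$1      # FASTA file with one or more query sequences
    database=$2   # protein FASTA file used as the database
    report=$3     # BLAST report output file
    # formatdb only needs to run once per database
    if [ ! -f "$database.pin" ]; then
        "$FORMATDB" -i "$database" -p T -o T || return 1
    fi
    "$BLASTALL" -p blastp -i "$query" -d "$database" -o "$report"
}

# Usage (the three file names are placeholders):
#   run_blastp query_file database output_file
```

Keeping the program paths in variables makes it easy to point the function at a BLAST installation outside the current working directory.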

CPAN offers two command-line utility modules: Perl::Tidy, which beautifies, indents and reformats a messy Perl script, and Perl::Critic, which tests and analyzes Perl scripts.

a. Perl::Tidy
When a Perl script is given as input to perltidy, it creates an indented, structured Perl script and saves it as a separate file with the same name plus a .tdy extension. Perltidy does not change the input script.

Steps to follow,
1. Install Perl::Tidy. It can be run on any system with perl 5.004 or later and used on Unix, Windows, VMS and MacPerl.
2. To execute perltidy,
$ perltidy -[option] test_perl_script.pl
This will create the file test_perl_script.pl.tdy, which contains the well-structured Perl script. There are many options, for example to control indentation or to make a back-up. For more information on installation and execution see http://perltidy.sourceforge.net/tutorial.html

b. Perl::Critic

Perl::Critic criticizes (analyzes) the input Perl script and encourages the user to follow various coding guidelines (or policies). The coding guidelines are based on Damian Conway's book Perl Best Practices. The user can enable, disable, create and customize policies through the Perl::Critic interface.
The user can set the severity level. There are 5 severity levels: severity 5 is the least restrictive level, i.e. Perl::Critic applies only the most basic policies, and severity 1 is the most restrictive. The five levels are gentle (equivalent to 5), stern (equivalent to 4), harsh (equivalent to 3), cruel (equivalent to 2) and brutal (equivalent to 1).
Perl::Critic requires a few modules to be pre-installed for it to execute. See http://search.cpan.org/~elliotjs/Perl-Critic-1.082/lib/Perl/Critic.pm

Steps to follow,
1. Install Perl::Critic.
2. Execute Perl::Critic
$ perlcritic -1 test_perl_script.pl

For more information see, http://search.cpan.org/dist/Perl-Critic/

1. Perl makes string processing easy when working with biological data such as genome or protein sequences.

2. File handling is easy in Perl.

3. Perl regular expressions are very flexible and make it easy to match similar patterns rather than only identical ones. They can be used, for instance, to match a motif or a repeat in a sequence.

4. Unlike many other languages, Perl has few strict rules for writing scripts. That makes it easy for biologists to learn Perl in a short period.

5. Perl scripts can be combined with shell scripts for text processing.

6. Using Perl CGI and HTML one can develop web pages. Perl CGI scripts are very similar to ordinary Perl scripts.

7. CPAN contains hundreds of Perl modules that are specific to sequence analysis.
E.g. FASTAParse, Peptide::Pubmed.

8. Perl can also be used for system administration.

9. The Perl Template Toolkit is another Perl product which can be used for developing advanced web pages.

10. Using Perl DBIx it is easier to pass MySQL data (back end) to the web page (front end).

11. Processing or parsing an HTML file is very easy using CPAN modules.

12. File type conversion is possible in Perl using CPAN modules, e.g. Doc to PDF, HTML to PDF, etc.

13. Using the PerlMagick module we can do image processing.

14. The Perl::Critic module helps you write better Perl code by criticizing your code structure.

What is bioinformatics?

It's a method to predict biological outcomes before anyone goes for full-fledged research. It's a method to compare biological data, e.g. sequence analysis. It's a way to predict or solve protein structures.
It's the only way to PERSONALIZED MEDICINE in this post-genomic era. It's the method for doing comparative genomics and predicting human homolog genes in other species.
It's the method to annotate newly sequenced genomes.

How can biological problems be predicted?

We are living in a world of computers. By analyzing existing biological data using information technology we can predict biological outcomes.

What is the HOTTEST branch of bioinformatics in this post-genomic era?

Personalized medicine is the hottest and fastest-growing field. Personalized medicine can be achieved only through bioinformatics.
