Bioinformatics is the application of informatics techniques to acquire, store and analyze large and complex biological data. With the rise of data science in every discipline nowadays, bioinformatics can be seen as the data science of biology.
Genomes and protein sequences are the most common data in bioinformatics. Many of the open source tools commonly used in data science have extended their capability to include analysis of bioinformatic data such as these.
BLAST (Basic Local Alignment Search Tool) is a set of algorithms and programs for comparing nucleotide or amino acid sequences to sequence databases. It is an excellent tool to study unknown sequences. By aligning the unknown sequences with existing databases, it helps to uncover their functions.
There are two main ways to access BLAST: the web interface and the standalone version. Both are available from the U.S. National Center for Biotechnology Information (NCBI). The web interface is user-friendly. However, if you need to run many sequence alignments or keep your searches confidential, standalone BLAST is the better choice.
To run BLAST locally, both BLAST+ and the reference databases need to be downloaded and installed. BLAST can then be run from the command line.
This example runs a BLAST search of the FASTA sequence file query.fsa against the nt database and saves selected columns of the result to results.csv:
blastn \
  -db nt \
  -query query.fsa \
  -outfmt "10 sseqid qstart qend evalue" \
  -out results.csv
The resulting CSV file can be loaded into other software such as R or Python for any downstream analysis.
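As a rough illustration of that downstream step, the sketch below reads the CSV with Python's standard library. It assumes the four columns requested in the command above (sseqid, qstart, qend, evalue) and relies on the fact that blastn writes no header row with -outfmt 10. The function name read_blast_csv, the sample rows and the E-value cutoff are all made up for the example; they are not part of BLAST.

```python
import csv
import io

# blastn writes no header row with -outfmt 10, so the column names are
# supplied here in the same order as the format string in the command.
COLUMNS = ["sseqid", "qstart", "qend", "evalue"]


def read_blast_csv(handle):
    """Parse header-less BLAST CSV rows into dictionaries with typed values."""
    hits = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        record = dict(zip(COLUMNS, row))
        record["qstart"] = int(record["qstart"])
        record["qend"] = int(record["qend"])
        record["evalue"] = float(record["evalue"])
        hits.append(record)
    return hits


# Made-up rows for illustration only; real rows would come from results.csv.
sample = io.StringIO(
    "subject_A,1,120,3e-50\n"
    "subject_B,15,98,0.002\n"
    "subject_C,40,75,1.5\n"
)
hits = read_blast_csv(sample)

# Keep only hits below an E-value cutoff (1e-5 is an arbitrary example value).
significant = [hit for hit in hits if hit["evalue"] < 1e-5]
print(significant)
```

With a real search result, the same function can be given an open file instead of the in-memory sample, for example open("results.csv", newline="").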
Even with the advent of more advanced bioinformatic techniques, BLAST is still one of the most widely used approaches in biology today. To learn more about running BLAST, you can consult the online manual by NCBI.
The statistical programming language R has long been used in academic environments for data analysis and visualization. It is not surprising that R was picked by the biology community to solve many of their data problems.
Bioconductor is an open source, open development software project that uses R for computational biology and bioinformatics. The project was established in 2001, and most of its packages were initially developed to analyze high throughput genomic data. Since then, the functional scope of the packages has been broadened to include analysis of next generation sequencing (NGS) data, flow cytometry data and 3D protein structures.
To install any package from Bioconductor, BiocManager needs to be installed first:
install.packages("BiocManager")
# Install Bioconductor package(s)
BiocManager::install("Biostrings")
# Use pairwise alignment to look for differences between two amino acid sequences
# seq1: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQA
#  - human coronavirus 3C like protease (partial)
# seq2: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYEDLLIRKSNHSFLVQA
#  - SARS 3CLpro C145A mutant (partial)
library(Biostrings)
aa_1 <- AAString(
  "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQA"
)
aa_2 <- AAString(
  "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYEDLLIRKSNHSFLVQA"
)
align <- pairwiseAlignment(aa_1, aa_2)
mismatchTable(align)
##   PatternId PatternStart PatternEnd PatternSubstring PatternQuality
## 1         1           35         35                V              7
## 2         1           46         46                S              7
## 3         1           65         65                N              7
##   SubjectStart SubjectEnd SubjectSubstring SubjectQuality
## 1           35         35                T              7
## 2           46         46                A              7
## 3           65         65                S              7
The Biostrings package is a one-stop shop for DNA, RNA or amino acid sequence manipulation. The example above is a pairwise alignment between a partial coronavirus 3C-like protease sequence and a mutant of it. From the mismatch table, we can see that the two sequences differ at three positions: 35 (V to T), 46 (S to A) and 65 (N to S).
At the time of writing, there are 1974 software packages in Bioconductor, covering topics from basic biological data manipulation to machine learning frameworks. A complete list of packages can be found on the Bioconductor website.
In recent years, Python has surpassed R to become the most commonly used programming language in data science. As the number of Python programmers increases, there is also a growth in the number of Python tools for bioinformatics.
Biopython is an open source collection of Python tools that includes functions to load and characterize genomic and protein sequence data.
The installation is rather straightforward:
pip install biopython
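# Note: recent Biopython releases deprecate pairwise2 in favor of
# Bio.Align.PairwiseAligner; this example keeps the original module.
from Bio import pairwise2
from Bio.Seq import Seq

seq1 = Seq("SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQA")
seq2 = Seq("SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYEDLLIRKSNHSFLVQA")

# globalxx: global alignment, 1 point per match, no mismatch or gap penalties
alignments = pairwise2.align.globalxx(seq1, seq2)

# A list of equally scored alignments is returned; print only the first one
print(pairwise2.format_alignment(*alignments[0]))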
## SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDV-VYCPRHVICTS-EDMLNPNYEDLLIRKSNHN-FLVQA
## ||||||||||||||||||||||||||||||||||  ||||||||||  ||||||||||||||||||  |||||
## SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDD-TVYCPRHVICT-AEDMLNPNYEDLLIRKSNH-SFLVQA
##   Score=67
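Because globalxx applies no mismatch or gap penalties, the alignment above shows each difference as a pair of gaps rather than as a substitution, which makes the changed positions harder to read than in the R mismatch table. The sketch below is one simple way to get a position-level view in plain Python. It is not a Biopython function: find_mismatches is a hypothetical helper, and it assumes the two sequences have the same length and differ only by substitutions, which happens to hold for this pair.

```python
def find_mismatches(seq_a, seq_b):
    """Return (position, residue_a, residue_b) for every differing position.

    Positions are 1-based. This is a plain position-by-position comparison,
    not an alignment, so it is only meaningful for equal-length sequences
    without insertions or deletions.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


seq1 = "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQA"
seq2 = "SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYEDLLIRKSNHSFLVQA"

mismatches = find_mismatches(seq1, seq2)
for position, a, b in mismatches:
    print(f"{position}\t{a}\t{b}")
```

For these two sequences the helper reports positions 35, 46 and 65, the same three positions listed by mismatchTable in the R example.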
As a single package, Biopython is very useful for sequence data manipulation and basic machine learning techniques such as clustering. Nevertheless, Bioconductor still provides a wider selection of tools for bioinformatics, especially for statistical analysis. That being said, I am not starting another discussion on R vs Python. Each of them has pros and cons. In the end, the choice really depends on your project requirements.
There are also many other open source tools for bioinformatics, including Orange and the Perl programming language. To keep the overview brief, only three were discussed here. With these open source tools at your fingertips, you can start your bioinformatics journey today!