How to Obtain Cropped Nucleotide Sequences and More
In my opinion, there isn't much documentation on the BioPerl site: just vague examples and whatnot.
Say for instance you want to get the entire nucleotide sequence of a particular organism, but it must be split up because you only want to deal with the proteins, and not the entire organism. So less like this:
>NC_002695.1 Escherichia coli O157:H7 str. Sakai, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCACCACCATCACCATTACCATTACCACAGGTAACGGTGCGGGCTGACGCGTAC
AGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGGCTTTTTTTTCGACCAAAGGTAACGAGGTAACA
ACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGGGTTGCCGATA
TTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCA
CCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGT
ATTTTTGCCGAACTTCTGACGGGACTCGCCGCCGCCCAGCCGGGATTCCCGCTGGCGCAATTGAAAACTT
TCGTCGACCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTAGGGCAGTGCCCGGA
TAGCATTAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAA
GCGCGCGGTCACAACGTTACCGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAAT
CTACTGTCGATATTGCAGAGTCCACCCGCCGTATTGCGGCAAGTCGTATTCCGGCTGATCACATGGTGCT
GATGGCAGGTTTCACCGCCGGTAATGAAAAAGGCGAACTGGTGGTACTTGGACGCAACGGTTCCGACTAC
TCCGCGGCGGTGCTGGCTGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTAT
ATACCTGCGACCCGCGTCAGGTGCCCGATGCGAGGTTGTTGAAATCGATGTCCTACCAGGAAGCGATGGA
GCTTTCCTACTTCGGCGCTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCT
TGCCTGATTAAAAATACCGGAAATCCTCAAGCTCCAGGTACGCTCATTGGTGCCAGTCGTGATGAAGACG
...
And more like this:
>SDY_PA01
ATGTCTGAATTAGTTGTGTTTAAAGCAAATGAATTAGCAGTAAGCCGTTATGATCTAACTGAACATGAAA
CCAAGCTAATTCTGTTTTGCGTTGCAAAGTTGAACCCTACAATTGAAAACCCAACAAGGGATGAATTAAC
AGTTAAATTCTCATGTTCTGAATATGCACGCACTATGGGCTTAAGTTATGAAAATGCTTGGGGAAGATTG
AACAGTGCAACTAGTGATTTATTTAAACGCTCTGTTGAATTAATTTATCCTACAGGAGCAGTTTCTAAAC
GAATTTTCAATTGGACGGAATACGCTGAGTTCAATAGAGAAGAACAAACTGTAACATTAGTATTTAGCTC
ATATATTCAACCACTTTTATTTCATTTAAAAAAATTCATTAAATACAACCTTGAGCATGTTAAAGCCTTT
GAAAATAAATACTCAATGCGAATATATGAATGGTTACTAAAGGAGCTTTCACAAAGAAAAACGCATAGGG
GAAACATTGAGATAAGTATCAAGGAATTTAAGTTTATGCTCATGCTTGAAAAAAACTACCCTTTATATGC
AGAATTGAATCGCTGGATCCTAAAACCAGTTACAAATGACTTAAACACTTACAGCAATATGAAGTTGACT
ATTGAAAAACGCGGTCGTCCTGCTGACACACTGATCTTTCAAGTTGAACTGGATAAACAAATTGACCTTG
TGACTGAACTAGCAAAAGATCCCGCATCAAAAAAAGAAGATAAGACAATCCGTTTAACGCCTGAAAATCG
TCTTCATGAGGGGCTAAAAACAACATTGCATGATGCTTTAACTGCAAAAATTCAACTGACTAGTTTTGAA
GCAAAATTCCTGAGCGATATGCAAAGCAAGTACGATCTTAACGGCTCATTTACATGGCTGACTCAAAAGC
AACGAACTACGCTAGAGAAAATTCTGGCTAAATACGGGCGGATATGA
>SDY_PA02
GTGGCTTCTGTTTCTATCAACTGTCCCTCCTGTTCAGCTACTGACGGGGGGGTGCGTAACGGCAAAAGCA
CTGCCGGACATCAGCGCTATCTCTGCTCTCACTGCCGTAAAACATGGCAACTGCAGTTCACTTACACCGC
TTCTCAACCCGGTACGCACCAGAAAATCATTGATATGGCCATGAATGGCGTTGGATGCCGGGCAACTGCA
CGCATTATGGGCGTTAGCCTCAACACGATTTTACGTCACTTAAAAAACTCAGGCCGCAGTCGGTAA
>SDY_PA03
ATGGACGAACAGTGGGGATACGTCGGGGCTAAATCGCGCCAGCGCTGGCTGTTTTACGCGTATGACAGGC
TCCGGAAGACGGTTGTTGCGCACGTATTCGGTGAACGCACTATGGCGACGCTGGGGCGTCTTATGAGCCT
GCTGTCCCCCTTTGACGTGGTGATATGGATGACGGATGGCTGGCCGCTGTATGAATCCCGCCTGAAGGGA
AAGCTGCACGTAATCAGCAAGCGATATACGCAGCGAATTGAGCGGCATAACCTGAATCTGAGGCAGCACC
TGGCACGGCTGGGACGGAAGTCGCTGTCGTTCTCAAAATCGGTGGAGCTGCATGACAAAGTCATCGGGCA
TTATCTGAACATAAAACACTATCAATAA
As you may or may not notice, these are 2 completely different organisms, but you get the drift.
Pre-requisites
BioPerl must be installed, along with Perl.
Disclaimer(s)
I don't do Perl. I don't like it that much, and I'm not very experienced in it. I also don't do Biology.
The Code
Start your file with the basics:
#!/usr/bin/env perl
use strict;
use warnings;
# We'll be getting nucleotides from GenBank database (NCBI)
use Bio::DB::GenBank;
use Bio::DB::Query::GenBank;
To specify the specific sequence you want to get, BioPerl recommends creating a query object.
my $query = "NC_002695 [ACCN] OR NC_002127 [ACCN] OR NC_002128 [ACCN]";
my $query_obj = Bio::DB:Query::GenBank->new(-db => 'nucleotide',
-query => $query);
I will be using accesion numbers to get the specific sequences. Note that the
binary operators OR
(other binary operators include AND
) must be in
uppercase. For a complete list of query fields, see this.
# Start querying
my $gb = Bio::DB:GenBank->new;
my $seq_objs = $gb->get_Stream_by_query($query_obj);
# Iterate through all the sequences gotten
while (my $seq_obj = $seq_objs->next_seq) {
# Iterate through all the features
for my $feat_obj ($seq_obj->get_SeqFeatures) {
# Only print features with primary tag "CDS", "protein_id", and
# "locus_tag" (I'm not sure what "CDS" stands for)
if ($feat_obj->primary_tag eq "CDS" and
$feat_obj->has_tag("protein_id") and
$feat_obj->has_tag("locus_tag")) {
print ">", $feat_obj->get_tag_values("locus_tag"), "\n";
print $feat_obj->spliced_seq->seq, "\n";
}
}
}
To explain the above code a bit more in depth, here is the raw text of a GenBank file, copied straight off of the NCBI database, features only:
FEATURES Location/Qualifiers
source 1..5498450
/organism="Escherichia coli O157:H7 str. Sakai"
/mol_type="genomic DNA"
/strain="Sakai"
/sub_strain="RIMD 0509952"
/serovar="O157:H7"
/db_xref="taxon:386585"
misc_feature 1..5498450
/note="REFSEQ gene predictions performed by GeneMark
2.4/GeneMark.hmm 2.0 with comparison to original submitter
provided annotation."
gene 190..273
/gene="thrL"
/locus_tag="ECs0001"
/db_xref="GeneID:913387"
CDS 190..273
/gene="thrL"
/locus_tag="ECs0001"
/note="involved in threonine biosynthesis; controls the
expression of the thrLABC operon"
/codon_start=1
/transl_table=11
/product="thr operon leader peptide"
/protein_id="NP_308028.1"
/db_xref="GeneID:913387"
gene 354..2816
/gene="thrA"
/locus_tag="ECs0002"
/db_xref="GeneID:913388"
There seems to be only one method of distinguishing genes that are proteins from genes that have nothing to do with the proteins: in the CDS, there is one tag "protein_id" that only proteins have (it corresponds to the protein's protein product). Genes that aren't proteins don't have this tag. Using this "distinguishing" feature, I was able to filter out all the genes that aren't proteins.
Our research heavily used the locus tag of proteins, but not the protein product as much.