BLAST: fastacmd
I’ll list two useful commands for this tool.
1-) Get a brief summary of a BLAST database
Usage:
fastacmd -d database_name -I T
Example:
fastacmd -d nr -I T
Output:
Database: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
13,123,072 sequences; 4,491,037,066 total letters
File names:
nr.00 Date: Feb 16, 2011 6:27 PM Version: 4 Longest sequence: 36,805 res
nr.01 Date: Feb 16, 2011 6:27 PM Version: 4 Longest sequence: 35,213 res
nr.02 Date: Feb 16, 2011 6:27 PM Version: 4 Longest sequence: 33,423 res
nr.03 Date: Feb 16, 2011 6:27 PM Version: 4 Longest sequence: 34,170 res
nr.04 Date: Feb 16, 2011 6:27 PM Version: 4 Longest sequence: 33,452 res
2-) Get the sequence(s) for a given query or list of queries. You can use either a GI or an accession.
Usage:
fastacmd -d database_name -s "seq_name_1,seq_name_2,..."
fastacmd -d database_name -i "file name of the list of the seqs"
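For the -i form, here is a rough Python sketch of how the list file could be prepared. It assumes the file holds one GI or accession per line; the helper names (write_id_list, fastacmd_batch_args) and the file name fastacmd_ids.txt are made up for illustration and are not part of fastacmd. The IDs are the ones that appear in this post.

```python
import os
import shutil
import subprocess
import tempfile


def write_id_list(ids, path):
    """Write one GI or accession per line (the layout assumed here for -i)."""
    with open(path, "w") as handle:
        for seq_id in ids:
            handle.write(str(seq_id).strip() + "\n")
    return path


def fastacmd_batch_args(database, id_file):
    """Build the argument list for: fastacmd -d <database> -i <id_file>."""
    return ["fastacmd", "-d", database, "-i", id_file]


# A GI and an accession taken from the example in this post.
ids = ["90110050", "NP_000266.2"]
list_path = write_id_list(
    ids, os.path.join(tempfile.gettempdir(), "fastacmd_ids.txt")
)
args = fastacmd_batch_args("nr", list_path)
print(" ".join(args))

# Run the command only if fastacmd is actually installed on this machine.
if shutil.which("fastacmd"):
    subprocess.run(args, check=False)
```

Building the command as a list rather than one string avoids shell quoting problems when the database path contains spaces.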
Example:
fastacmd -d nr -s 90110050
Output:
>gi|157266326|ref|NP_000266.2| P protein [Homo sapiens] >gi|90110050|sp|Q04671.2|P_HUMAN RecName: Full=P protein; AltName: Full=Melanocyte-specific transporter protein; AltName: Full=Pink-eyed dilution protein homolog >gi|773328|gb|AAC13784.1| P protein [Homo sapiens] >gi|119578067|gb|EAW57663.1| oculocutaneous albinism II (pink-eye dilution homolog, mouse), isoform CRA_c [Homo sapiens]
MHLEGRDGRRYPGAPAVELLQTSVPSGLAELVAGKRRLPRGAGGADPSHSCPRGAAGQSSWAPAGQEFASFLTKGRSHSS
LPQMSSSRSKDSCFTENTPLLRNSLQEKGSRCIPVYHPEFITAEESWEDSSADWERRYLLSREVSGLSASASSEKGDLLD
SPHIRLRLSKLRRCVQWLKVMGLFAFVVLCSILFSLYPDQGKLWQLLALSPLENYSVNLSSHVDSTLLQVDLAGALVASG
PSRPGREEHIVVELTQADALGSRWRRPQQVTHNWTVYLNPRRSEHSVMSRTFEVLTRETVSISIRASLQQTQAVPLLMAH
QYLRGSVETQVTIATAILAGVYALIIFEIVHRTLAAMLGSLAALAALAVIGDRPSLTHVVEWIDFETLALLFGMMILVAI
FSETGFFDYCAVKAYRLSRGRVWAMIIMLCLIAAVLSAFLDNVTTMLLFTPVTIRLCEVLNLDPRQVLIAEVIFTNIGGA
ATAIGDPPNVIIVSNQELRKMGLDFAGFTAHMFIGICLVLLVCFPLLRLLYWNRKLYNKEPSEIVELKHEIHVWRLTAQR
ISPASREETAVRRLLLGKVLALEHLLARRLHTFHRQISQEDKNWETNIQELQKKHRISDGILLAKCLTVLGFVIFMFFLN
SFVPGIHLDLGWIAILGAIWLLILADIHDFEIILHRVEWATLLFFAALFVLMEALAHLHLIEYVGEQTALLIKMVPEEQR
LIAAIVLVVWVSALASSLIDNIPFTATMIPVLLNLSHDPEVGLPAPPLMYALAFGACLGGNGTLIGASANVVCAGIAEQH
GYGFSFMEFFRLGFPMMVVSCTVGMCYLLVAHVVVGWN
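The header of this record merges several database entries that share the same sequence, each introduced by its own ">". If you need them separately, a minimal Python sketch along these lines might do. The function name split_defline is my own, and the sketch assumes entries are separated by " >" as in the output above, so a description that itself contains " >" would be split incorrectly.

```python
def split_defline(header):
    """Split a merged FASTA header into (seq_id, description) pairs."""
    entries = []
    # Drop the leading ">" and cut at each " >" that starts a new entry.
    for part in header.lstrip(">").split(" >"):
        seq_id, _, description = part.partition(" ")
        entries.append((seq_id, description))
    return entries


# Two of the four entries from the header shown above.
header = (
    ">gi|157266326|ref|NP_000266.2| P protein [Homo sapiens] "
    ">gi|773328|gb|AAC13784.1| P protein [Homo sapiens]"
)

for seq_id, description in split_defline(header):
    print(seq_id, "->", description)
```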
That’s all!
Hi, I have tested retrieving sequences with fastacmd from the NR database, which works perfectly. But why do I get the wrong sequence with a UniProt database?
Hi, can you give me a toy example so I can test what’s going on?
OK, I use a UniProt accession as the argument. The command line is:
fastacmd -d /path/to/uniprot/uniprot_sprot.fasta -s "Q197F8"
This gives back:
>gnl|BL_ORD_ID|2 sp|Q197F5|005L_IIV3 Uncharacterized protein 005L OS=Invertebrate iridescent virus 3 GN=IIV3-005L PE=4 SV=1
MRYTVLIALQGALLLLLLIDDGQGQSPYPYPGMPCNSSRQCGLGTCVHSRCAHCSSDGTLCSPEDPTMVWPCCPESSCQL
VVGLPSLVNHYNCLPNQCTDSSQCPGGFGCMTRRSKCELCKADGEACNSPYLDWRKDKECCSGYCHTEARGLEGVCIDPK
KIFCTPKNPWQLAPYPPSYHQPTTLRPPTSLYDSWLMSGFLVKSTTAPSTQEEEDDY
which is not what I expected: I asked for Q197F8 but got Q197F5.
Thanks for the example. I will check this and let you know if I can figure out what’s going on. By the way, have you tried other accession IDs in UniProt? I am wondering whether this is specific to the above accession ID or a more general problem.