Swiss-Prot (http://www.expasy.org/sprot) is a hand-curated database of protein sequences. In Section 4.2.2, we described how to extract the sequence of a Swiss-Prot record as a SeqRecord
object. Alternatively, you can store the Swiss-Prot record in a Bio.SwissProt.SProt.Record
object, which in fact stores the complete information contained in the Swiss-Prot record. In this Section, we describe how to extract Bio.SwissProt.SProt.Record
objects from a Swiss-Prot file.
To parse a Swiss-Prot record, we first get a handle to a Swiss-Prot record. There are several ways to do so, depending on where and how the Swiss-Prot record is stored:
>>> handle = open("myswissprotfile.dat")
>>> import gzip >>> handle = gzip.open("myswissprotfile.dat.gz")
>>> import urllib >>> handle = urllib.urlopen("http://www.somelocation.org/data/someswissprotfile.dat")
>>> from Bio import ExPASy >>> handle = ExPASy.get_sprot_raw(myaccessionnumber)
The key point is that for the parser, it doesn’t matter how the handle was created, as long as it points to data in the Swiss-Prot format.
We can use Bio.SeqIO as described in Section 4.2.2 to get file format agnostic SeqRecord
objects. Alternatively, we can get Bio.SwissProt.SProt.Record
objects which are a much closer match to the underlying file format, using following code.
To read one Swiss-Prot record from the handle, we use the function read()
:
>>> from Bio import SwissProt >>> record = SwissProt.read(handle)
This function should be used if the handle points to exactly one Swiss-Prot record. It raises a ValueError
if no Swiss-Prot record was found, and also if more than one record was found.
We can now print out some information about this record:
>>> print record.description CHALCONE SYNTHASE 3 (EC 2.3.1.74) (NARINGENIN-CHALCONE SYNTHASE 3). >>> for ref in record.references: ... print "authors:", ref.authors ... print "title:", ref.title ... authors: Liew C.F., Lim S.H., Loh C.S., Goh C.J.; title: "Molecular cloning and sequence analysis of chalcone synthase cDNAs of Bromheadia finlaysoniana."; >>> print record.organism_classification ['Eukaryota', 'Viridiplantae', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Bromheadia']
To parse a file that contains more than one Swiss-Prot record, we use the parse
function instead. This function allows us to iterate over the records in the file. For example, let’s parse the full Swiss-Prot database and collect all the descriptions. The full Swiss-Prot database, downloaded from ExPASy on 4 December 2007, contains 290484 Swiss-Prot records in a single gzipped-file uniprot_sprot.dat.gz
.
>>> import gzip >>> input = gzip.open("uniprot_sprot.dat.gz") >>> from Bio import SwissProt >>> records = SwissProt.parse(input) >>> descriptions = [] >>> for record in records: ... description = record.description ... descriptions.append(description) ... >>> len(descriptions) 290484 >>> descriptions[:3] ['104 kDa microneme/rhoptry antigen precursor (p104).', '104 kDa microneme/rhoptry antigen precursor (p104).', 'Protein 108 precursor.']
It is equally easy to extract any kind of information you’d like from Swiss-Prot records. To see the members of a Swiss-Prot record, use
>>> dir(record) ['__doc__', '__init__', '__module__', 'accessions', 'annotation_update', 'comments', 'created', 'cross_references', 'data_class', 'description', 'entry_name', 'features', 'gene_name', 'host_organism', 'keywords', 'molecule_type', 'organelle', 'organism', 'organism_classification', 'references', 'seqinfo', 'sequence', 'sequence_length', 'sequence_update', 'taxonomy_id']
Swiss-Prot also distributes a file keywlist.txt
, which lists the keywords and categories used in Swiss-Prot. The file contains entries in the following form:
ID 2Fe-2S. AC KW-0001 DE Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron DE atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of DE cysteines from the protein. SY Fe2S2; [2Fe-2S] cluster; [Fe2S2] cluster; Fe2/S2 (inorganic) cluster; SY Di-mu-sulfido-diiron; 2 iron, 2 sulfur cluster binding. GO GO:0051537; 2 iron, 2 sulfur cluster binding HI Ligand: Iron; Iron-sulfur; 2Fe-2S. HI Ligand: Metal-binding; 2Fe-2S. CA Ligand. // ID 3D-structure. AC KW-0002 DE Protein, or part of a protein, whose three-dimensional structure has DE been resolved experimentally (for example by X-ray crystallography or DE NMR spectroscopy) and whose coordinates are available in the PDB DE database. Can also be used for theoretical models. HI Technical term: 3D-structure. CA Technical term. // ID 3Fe-4S. ...
The entries in this file can be parsed by the parse
function in the Bio.SwissProt.KeyWList
module. Each entry is then stored as a Bio.SwissProt.KeyWList.Record
, which is a Python dictionary.
>>> from Bio.SwissProt import KeyWList >>> handle = open("keywlist.txt") >>> records = KeyWList.parse(handle) >>> for record in records: ... print record['ID'] ... print record['DE']
This prints
2Fe-2S. Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of cysteines from the protein. ...
Prosite is a database containing protein domains, protein families, functional sites, as well as the patterns and profiles to recognize them. Prosite was developed in parallel with Swiss-Prot. In Biopython, a Prosite record is represented by the Bio.Prosite.Record
class, whose members correspond to the different fields in a Prosite record.
In general, a Prosite file can contain more than one Prosite records. For example, the full set of Prosite records, which can be downloaded as a single file (prosite.dat
) from ExPASy, contains 2073 records in (version 20.24 released on 4 December 2007). To parse such a file, we again make use of an iterator:
>>> from Bio import Prosite >>> handle = open("myprositefile.dat") >>> records = Prosite.parse(handle)
We can now take the records one at a time and print out some information. For example, using the file containing the complete Prosite database, we’d find
>>> from Bio import Prosite >>> handle = open("prosite.dat") >>> records = Prosite.parse(handle) >>> record = records.next() >>> record.accession 'PS00001' >>> record.name 'ASN_GLYCOSYLATION' >>> record.pdoc 'PDOC00001' >>> record = records.next() >>> record.accession 'PS00004' >>> record.name 'CAMP_PHOSPHO_SITE' >>> record.pdoc 'PDOC00004' >>> record = records.next() >>> record.accession 'PS00005' >>> record.name 'PKC_PHOSPHO_SITE' >>> record.pdoc 'PDOC00005'
and so on. If you’re interested in how many Prosite records there are, you could use
>>> from Bio import Prosite >>> handle = open("prosite.dat") >>> records = Prosite.parse(handle) >>> n = 0 >>> for record in records: n+=1 ... >>> print n 2073
To read exactly one Prosite from the handle, you can use the read
function:
>>> from Bio import Prosite >>> handle = open("mysingleprositerecord.dat") >>> record = Prosite.read(handle)
This function raises a ValueError if no Prosite record is found, and also if more than one Prosite record is found.
In the Prosite example above, the record.pdoc
accession numbers 'PDOC00001'
, 'PDOC00004'
, 'PDOC00005'
and so on refer to Prodoc records, which contain the Prosite Documentation. The Prodoc records are available from ExPASy as individual files, and as one file (prosite.doc
) containing all Prodoc records.
We use the parser in Bio.Prosite.Prodoc
to parse Prodoc records. For example, to create a list of all Prodoc accession numbers, you can use
>>> from Bio.Prosite import Prodoc >>> handle = open("prosite.doc") >>> records = Prodoc.parse(handle) >>> accessions = [record.accession for record in records]
Again a read()
function is provided to read exactly one Prodoc record from the handle.
Swiss-Prot, Prosite, and Prodoc records can be downloaded from the ExPASy web server at http://www.expasy.org. Six kinds of queries are available from ExPASy:
To access this web server from a Python script, we use the Bio.ExPASy
module.
Let’s say we are looking at chalcone synthases for Orchids (see section 2.3 for some justification for looking for interesting things about orchids). Chalcone synthase is involved in flavanoid biosynthesis in plants, and flavanoids make lots of cool things like pigment colors and UV protectants.
If you do a search on Swiss-Prot, you can find three orchid proteins for Chalcone Synthase, id numbers O23729, O23730, O23731. Now, let’s write a script which grabs these, and parses out some interesting information.
First, we grab the records, using the get_sprot_raw()
function of Bio.ExPASy
. This function is very nice since you can feed it an id and get back a handle to a raw text record (no html to mess with!). We can the use Bio.SwissProt.read
to pull out the Swiss-Prot record, or Bio.SeqIO.read
to get a SeqRecord. The following code accomplishes what I just wrote:
>>> from Bio import ExPASy >>> from Bio import SwissProt >>> accessions = ["O23729", "O23730", "O23731"] >>> records = [] >>> for accession in accessions: ... handle = ExPASy.get_sprot_raw(accession) ... record = SwissProt.read(handle) ... records.append(record)
If the accession number you provided to ExPASy.get_sprot_raw
does not exist, then SwissProt.read(handle)
will raise a ValueError
. You can catch ValueException
exceptions to detect invalid accession numbers:
>>> for accession in accessions: ... handle = ExPASy.get_sprot_raw(accession) ... try: ... record = SwissProt.read(handle) ... except ValueException: ... print "WARNING: Accession %s not found" % accession ... records.append(record)
Now, you may remark that I knew the records’ accession numbers
beforehand. Indeed, get_sprot_raw()
needs either the entry name
or an accession number. When you don’t have them handy, you can use
one of the sprot_search_de()
or sprot_search_ful()
functions.
sprot_search_de()
searches in the ID, DE, GN, OS and OG lines;
sprot_search_ful()
searches in (nearly) all the fields. They
are detailed on
http://www.expasy.org/cgi-bin/sprot-search-de and
http://www.expasy.org/cgi-bin/sprot-search-ful
respectively. Note that they don’t search in TrEMBL by default
(argument trembl
). Note also that they return html pages;
however, accession numbers are quite easily extractable:
>>> from Bio import ExPASy >>> import re >>> handle = ExPASy.sprot_search_de("Orchid Chalcone Synthase") >>> # or: >>> # handle = ExPASy.sprot_search_ful("Orchid and {Chalcone Synthase}") >>> html_results = handle.read() >>> if "Number of sequences found" in html_results: ... ids = re.findall(r'HREF="/uniprot/(\w+)"', html_results) ... else: ... ids = re.findall(r'href="/cgi-bin/niceprot\.pl\?(\w+)"', html_results)
Prosite and Prodoc records can be retrieved either in HTML format, or in raw format. To parse Prosite and Prodoc records with Biopython, you should retrieve the records in raw format. For other purposes, however, you may be interested in these records in HTML format.
To retrieve a Prosite or Prodoc record in raw format, use get_prosite_raw()
. Although this function has prosite
in the name, it can be used for Prodoc records as well. For example, to download a Prosite record and print it out in raw text format, use
>>> from Bio import ExPASy >>> handle = ExPASy.get_prosite_raw('PS00001') >>> text = handle.read() >>> print text
To retrieve a Prosite record and parse it into a Bio.Prosite.Record
object, use
>>> from Bio import ExPASy >>> from Bio import Prosite >>> handle = ExPASy.get_prosite_raw('PS00001') >>> record = Prosite.read(handle)
Finally, to retrieve a Prodoc record and parse it into a Bio.Prosite.Prodoc.Record
object, use
>>> from Bio import ExPASy >>> from Bio.Prosite import Prodoc >>> handle = ExPASy.get_prosite_raw('PDOC00001') >>> record = Prodoc.read(handle)
For non-existing accession numbers, ExPASy.get_prosite_raw
returns a handle to an emptry string. When faced with an empty string, Prosite.read
and Prodoc.read
will raise a ValueError. You can catch these exceptions to detect invalid accession numbers.
The functions get_prosite_entry()
and get_prodoc_entry()
are used to download Prosite and Prodoc records in HTML format. To create a web page showing one Prosite record, you can use
>>> from Bio import ExPASy >>> handle = ExPASy.get_prosite_entry('PS00001') >>> html = handle.read() >>> output = open("myprositerecord.html", "w") >>> output.write(html) >>> output.close()
and similarly for a Prodoc record:
>>> from Bio import ExPASy >>> handle = ExPASy.get_prodoc_entry('PDOC00001') >>> html = handle.read() >>> output = open("myprodocrecord.html", "w") >>> output.write(html) >>> output.close()
For these functions, an invalid accession number returns an error message in HTML format.