You read all about the basic Biopython sequence class in Chapter 3, which described how to do many useful things with just the sequence. However, many times sequences have important additional properties associated with them – as you will have seen with the SeqRecord
object in Chapter 4.
This section described how Biopython handles these higher level descriptions of a sequence.
Immediately above the Sequence class is the Sequence Record class, defined in the Bio.SeqRecord
module. This class allows higher level features such as ids and features to be associated with the sequence, and is used thoughout the sequence input/output interface Bio.SeqIO
, described in Chapter 4. The SeqRecord
class itself is very simple, and offers the following information as attributes:
Seq
objectSeqFeature
objects with more structured information about the features on a sequence. The structure of sequence features is described below in Section 10.1.2.
Using a SeqRecord
class is not very complicated, since all of the information is stored as attributes of the class. Initializing the class just involves passing a Seq
object to the SeqRecord
:
>>> from Bio.Seq import Seq >>> simple_seq = Seq("GATC") >>> from Bio.SeqRecord import SeqRecord >>> simple_seq_r = SeqRecord(simple_seq)
Additionally, you can also pass the id, name and description to the initialization function, but if not they will be set as strings indicating they are unknown, and can be modified subsequently:
>>> simple_seq_r.id '<unknown id>' >>> simple_seq_r.id = 'AC12345' >>> simple_seq_r.description = 'My little made up sequence I wish I could write a paper about and submit to GenBank' >>> print simple_seq_r.description My little made up sequence I wish I could write a paper about and submit to GenBank >>> simple_seq_r.seq Seq('GATC', Alphabet())
Adding annotations is almost as easy, and just involves dealing directly with the annotation dictionary:
>>> simple_seq_r.annotations['evidence'] = 'None. I just made it up.' >>> print simple_seq_r.annotations {'evidence': 'None. I just made it up.'}
That’s just about all there is to it! Next, you may want to learn about SeqFeatures, which offer an additional structured way to represent information about a sequence.
Sequence features are an essential part of describing a sequence. Once you get beyond the sequence itself, you need some way to organize and easily get at the more “abstract” information that is known about the sequence. While it is probably impossible to develop a general sequence feature class that will cover everything, the Biopython SeqFeature
class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you’ll probably have an easier time grasping the structure of the Biopython classes.
The first level of dealing with Sequence features is the SeqFeature
class itself. This class has a number of attributes, so first we’ll list them and there general features, and then work through an example to show how this applies to a real life example, a GenBank feature table. The attributes of a SeqFeature are:
SeqFeature
on the sequence that you are dealing with. The locations end-points may be fuzzy – section 10.1.2.2 has a lot more description on how to deal with descriptions.ref
to provide a cross sequence reference. If there is a reference, ref_db
will be set as None if the reference is in the same database, and will be set to the name of the database otherwise.sub_features
underneath it. This allows nesting of features, and helps us to deal with things such as the GenBank/EMBL feature lines in a (we hope) intuitive way.
To show an example of SeqFeatures in action, let’s take a look at the following feature from a GenBank feature table:
mRNA complement(join(<49223..49300,49780..>50208)) /gene="F28B23.12"
To look at the easiest attributes of the SeqFeature first, if you got a SeqFeature object for this it would have it type
of ’mRNA’, a strand
of -1 (due to the ’complement’), and would have None for the ref
and ref_db
since there are no references to external databases. The qualifiers
for this SeqFeature would be a python dictionarary that looked like {'gene' : 'F28B23.12'}
.
Now let’s look at the more tricky part, how the ’join’ in the location
line is handled. First, the location for the top level SeqFeature (the
one we are dealing with right now) is set as going from
'<49223' to '>50208'
(see section 10.1.2.2 for
the nitty gritty on how fuzzy locations like this are handled).
So the location of the top level object is the entire span of the
feature. So, how do you get at the information in the ’join?’
Well, that’s where the sub_features
go in.
The sub_features
attribute will have a list with two SeqFeature
objects in it, and these contain the information in the join. Let’s
look at top_level_feature.sub_features[0]
; the first
sub_feature
). This object is a SeqFeature object with a
type
of ’mRNA_join
,’ a strand
of -1 (inherited
from the parent SeqFeature) and a location going from
'<49223' to '49300'
.
So, the sub_features
allow you to get at the internal information if you want it (i. e. if you were trying to get only the exons out of a genomic sequence), or just to deal with the broad picture (i. e. you just want to know that the coding sequence for a gene lies in a region). Hopefully this structuring makes it easy and intuitive to get at the sometimes complex information that can be contained in a SeqFeature.
In the section on SeqFeatures above, we skipped over one of the more difficult parts of Features, dealing with the locations. The reason this can be difficult is because of fuzziness of the positions in locations. Before we get into all of this, let’s just define the vocabulary we’ll use to talk about this. Basically there are two terms we’ll use:
<100
and
3^5
are all positions.I just mention this because sometimes I get confused between the two.
The complication in dealing with locations comes in the positions themselves. In biology many times things aren’t entirely certain (as much as us wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the start of mRNA transcript starts at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are five types of fuzzy positions, so we have five classes do deal with them:
position
attribute of the object.'<13'
, signifying that
the real position is located somewhere less then 13. To get
the specified upper boundary, look at the position
attribute of the object.BeforePosition
, this
class represents a position that occurs after some specified site.
This is represented in GenBank as '>13'
, and like
BeforePosition
, you get the boundary number by looking
at the position
attribute of the object.position
attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension
attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position
is the lower boundary and object.position + object.extension
is the upper boundary.'2^3'
, which indicates that
the real position happens between position 2 and 3. Getting
this information from the object is very similar to
WithinPosition
, the position
attribute specifies
the lower boundary (2, in this case) and the extension
indicates the range to the higher boundary (1 in this case).
Now that we’ve got all of the types of fuzzy positions we can have taken care of, we are ready to actually specify a location on a sequence. This is handled by the FeatureLocation
class. An object of this type basically just holds the potentially fuzzy start and end positions of a feature. You can create a FeatureLocation
object by creating the positions and passing them in:
>>> from Bio import SeqFeature >>> start_pos = SeqFeature.AfterPosition(5) >>> end_pos = SeqFeature.BetweenPosition(8, 1) >>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
If you print out a FeatureLocation
object, you can get a nice representation of the information:
>>> print my_location [>5:(8^9)]
We can access the fuzzy start and end positions using the start and end attributes of the location:
>>> my_location.start <Bio.SeqFeature.AfterPosition instance at 0x101d7164> >>> print my_location.start >5 >>> print my_location.end (8^9)
If you don’t want to deal with fuzzy positions and just want numbers, you just need to ask for the nofuzzy_start
and nofuzzy_end
attributes of the location:
>>> my_location.nofuzzy_start 5 >>> my_location.nofuzzy_end 8
Notice that this just gives you back the position attributes of the fuzzy locations.
Similary, to make it easy to create a position without worrying about fuzzy positions, you can just pass in numbers to the FeaturePosition
constructors, and you’ll get back out ExactPosition
objects:
>>> exact_location = SeqFeature.FeatureLocation(5, 8) >>> print exact_location [5:8] >>> exact_location.start <Bio.SeqFeature.ExactPosition instance at 0x101dcab4>
That is all of the nitty gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!
Another common annotation related to a sequence is a reference to a journal or other published work dealing with the sequence. We have a fairly simple way of representing a Reference in Biopython – we have a Bio.SeqFeature.Reference
class that stores the relevant information about a reference as attributes of an object.
The attributes include things that you would expect to see in a reference like journal
, title
and authors
. Additionally, it also can hold the medline_id
and pubmed_id
and a comment
about the reference. These are all accessed simply as attributes of the object.
A reference also has a location
object so that it can specify a particular location on the sequence that the reference refers to. For instance, you might have a journal that is dealing with a particular gene located on a BAC, and want to specify that it only refers to this position exactly. The location
is a potentially fuzzy location, as described in section 10.1.2.2.
That’s all there is too it. References are meant to be easy to deal with, and hopefully general enough to cover lots of usage cases.
Biopython has a regression testing framework originally written by Andrew Dalke and ported to PyUnit by Brad Chapman which helps us make sure the code is as bug-free as possible before going out.
Every module that goes into Biopython should have a test (and should also have documentation!). Let’s say you’ve written a new module called Biospam – here is what you should do to make a regression test:
test_Biospam.py
python test_Biospam.py > test_Biospam
which would write the output to the file test_Biospam
.test_Biospam
to make sure the output is correct. When you are sure it is all right and there are no bugs, you need to quickly edit the test_Biospam
file so that the first line is: ‘test_Biospam
’ (no quotes).test_Biospam
file to the directory Tests/outputpython run_tests.py -g test_Biospam.py
. The
regression testing framework is nifty enough that it’ll put
the output in the right place in just the way it likes it. Tests/output/test_Biospam
) and double check the output to make sure it is all correct.python run_tests.py
. This will run all of the tests, and
you should see your test run (and pass!).Many of the older Biopython parsers were built around an event-oriented design that includes Scanner and Consumer objects.
Scanners take input from a data source and analyze it line by line,
sending off an event whenever it recognizes some information in the
data. For example, if the data includes information about an organism
name, the scanner may generate an organism_name
event whenever it
encounters a line containing the name.
Consumers are objects that receive the events generated by Scanners.
Following the previous example, the consumer receives the
organism_name
event, and the processes it in whatever manner
necessary in the current application.
This is a very flexible framework, which is advantageous if you want to
be able to parse a file format into more than one representation. For
example, both the Bio.GenBank
and Bio.SwissProt
modules
use this to read their file formats as both generic SeqRecord
objects
and file-format-specific record objects.
More recently, many of the parsers added for Bio.SeqIO
and
Bio.AlignIO
take a much simpler approach, but only generate a
single object representation (SeqRecord
and Alignment
objects
respectively).
This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data.
Additionally, you may select a matrix from MatrixInfo.py, a collection of established substitution matrices.
class SeqMat(UserDict.UserDict)
self.data
: a dictionary in the form of {(i1,j1):n1, (i1,j2):n2,...,(ik,jk):nk}
where i, j are alphabet letters, and n is a value.self.alphabet
: a class as defined in Bio.Alphabetself.ab_list
: a list of the alphabet’s letters, sorted. Needed mainly for internal purposesself.sum_letters
: a dictionary. {i1: s1, i2: s2,...,in:sn}
where:
__init__(self,data=None,alphabet=None, mat_type=NOTYPE,mat_name='',build_later=0):
data
: can be either a dictionary, or another SeqMat instance.
alphabet
: a Bio.Alphabet instance. If not provided, construct an alphabet from data.mat_type
: type of matrix generated. One of the following:mat_type
is provided automatically by some of SubsMat’s functions.
mat_name
: matrix name, such as "BLOSUM62" or "PAM250"build_later
: default false. If true, user may supply only alphabet and empty dictionary, if intending to build the matrix later. this skips the sanity check of alphabet size vs. matrix size.entropy(self,obs_freq_mat)
obs_freq_mat
: an observed frequency matrix. Returns the matrix’s entropy, based on the frequency in obs_freq_mat
. The matrix instance should be LO or SUBS.
letter_sum(self,letter)
Returns the sum of all values in the matrix, for the provided letter
all_letters_sum(self)
Fills the dictionary attribute self.sum_letters
with the sum of values for each letter in the matrix’s alphabet.
print_mat(self,f,format="%4d",bottomformat="%4s",alphabet=None)
prints the matrix to file handle f. format
is the format field for the matrix values; bottomformat
is the format field for the bottom row, containing matrix letters. Example output for a 3-letter alphabet matrix:
A 23 B 12 34 C 7 22 27 A B C
The alphabet
optional argument is a string of all characters in the alphabet. If supplied, the order of letters along the axes is taken from the string, rather than by alphabetical order.
The following section is layed out in the order by which most people wish to generate a log-odds matrix. Of course, interim matrices can be generated and investigated. Most people just want a log-odds matrix, that’s all.
Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the counted number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:
('A','C'): 10, ('C','A'): 12
as order doesn’t matter, user can already provide only one entry:
('A','C'): 22
A SeqMat instance may be initialized with either a full (first method of counting: 10, 12) or half (the latter method, 22) matrices. A full protein alphabet matrix would be of the size 20x20 = 400. A half matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because same-letter entries don’t change. (The matrix diagonal). Given an alphabet size of N:
The SeqMat constructor automatically generates a half-matrix, if a full matrix is passed. If a half matrix is passed, letters in the key should be provided in alphabetical order: (’A’,’C’) and not (’C’,A’).
At this point, if all you wish to do is generate a log-odds matrix, please go to the section titled Example of Use. The following text describes the nitty-gritty of internal functions, to be used by people who wish to investigate their nucleotide/amino-acid frequency data more thoroughly.
Use:
OFM = SubsMat._build_obs_freq_mat(ARM)
The OFM is generated from the ARM, only instead of replacement counts, it contains replacement frequencies.
Use:
EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
exp_freq_table
: should be a FreqTable instance. See section 10.4.2 for detailed information on FreqTable. Briefly, the expected frequency table has the frequencies of appearance for each member of the alphabet. It is
implemented as a dictionary with the alphabet letters as keys, and each letter’s frequency as a value. Values sum to 1.
The expected frequency table can (and generally should) be generated from the observed frequency matrix. So in most cases you will generate exp_freq_table
using:
>>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM) >>> EFM = SubsMat._build_exp_freq_mat(OFM,exp_freq_table)
But you can supply your own exp_freq_table
, if you wish
Use:
SFM = SubsMat._build_subs_mat(OFM,EFM)
Accepts an OFM, EFM. Provides the division product of the corresponding values.
Use:
LOM=SubsMat._build_log_odds_mat(SFM[,logbase=10,factor=10.0,round_digit=1])
logbase
: base of the logarithm used to generate the log-odds values.factor
: factor used to multiply the log-odds values. Each entry is generated by log(LOM[key])*factor And rounded to the round_digit
place after the decimal point, if required.As most people would want to generate a log-odds matrix, with minimum hassle, SubsMat provides one function which does it all:
make_log_odds_matrix(acc_rep_mat,exp_freq_table=None,logbase=10, factor=10.0,round_digit=0):
acc_rep_mat
: user provided accepted replacements matrix
exp_freq_table
: expected frequencies table. Used if provided, if not, generated from the acc_rep_mat
.
logbase
: base of logarithm for the log-odds matrix. Default base 10.
round_digit
: number after decimal digit to which result should be rounded. Default zero.
FreqTable.FreqTable(UserDict.UserDict)
alphabet
: A Bio.Alphabet instance.
data
: frequency dictionary
count
: count dictionary (in case counts are provided).
read_count(f)
: read a count file from stream f. Then convert to frequencies
read_freq(f)
: read a frequency data file from stream f. Of course, we then don’t have the counts, but it is usually the letter frquencies which are interesting.
A 35 B 65 C 100
And will be read using the FreqTable.read_count(file_handle)
function.
An equivalent frequency file:
A 0.175 B 0.325 C 0.5
Conversely, the residue frequencies or counts can be passed as a dictionary. Example of a count dictionary (3-letter alphabet):
{'A': 35, 'B': 65, 'C': 100}
Which means that an expected data count would give a 0.5 frequency for ’C’, a 0.325 probability of ’B’ and a 0.175 probability of ’A’ out of 200 total, sum of A, B and C)
A frequency dictionary for the same data would be:
{'A': 0.175, 'B': 0.325, 'C': 0.5}
Summing up to 1.
When passing a dictionary as an argument, you should indicate whether it is a count or a frequency dictionary. Therefore the FreqTable class constructor requires two arguments: the dictionary itself, and FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies, respectively.
Read expected counts. readCount will already generate the frequencies Any one of the following may be done to geerate the frequency table (ftab):
>>> from SubsMat import * >>> ftab = FreqTable.FreqTable(my_frequency_dictionary,FreqTable.FREQ) >>> ftab = FreqTable.FreqTable(my_count_dictionary,FreqTable.COUNT) >>> ftab = FreqTable.read_count(open('myCountFile')) >>> ftab = FreqTable.read_frequency(open('myFrequencyFile'))