# genome.py - a custom genome class that wraps Biopython parsing code

import genbank # (1)
from Bio import Seq
from Bio.Alphabet import IUPAC

class Genome(object):
    """Genome - representing a genomic DNA sequence with genes
    
    Genome.genes[i] returns the CDS sequence of the ith gene."""
    
    def __init__(self, accession_number):
        
        genbank.download([accession_number]) # (2)
        self.parsed_genbank = genbank.parse([accession_number])[0]
        
        self.genes = []
        
        self._parse_genes()
        
    
    def _parse_genes(self):
        """Parse out the CDS sequence for each gene."""
        
        for feature in self.parsed_genbank.features: # (3)
            if feature.type == 'CDS':
                
                #Build up a list of (start,end) tuples that will
                #be used to slice the sequence in self.parsed_genbank.seq
                #
                #Biopython locations are zero-based, so they can be used
                #directly to slice the sequence

                locations = []
                if len(feature.sub_features): # (4)
                    # If there are sub_features, then this gene is made up
                    # of multiple parts.  Store the start and end positions
                    # for each part.
                    for sf in feature.sub_features:
                        locations.append((sf.location.start.position,
                                          sf.location.end.position))
                else:
                    # This gene is made up of one part.  Store its start and 
                    # end position.
                    locations.append((feature.location.start.position,
                                      feature.location.end.position))


                # Slice out each part and join the pieces into the
                # full CDS sequence.
                seq = '' # (5)
                for begin,end in locations:
                    seq += self.parsed_genbank.seq[begin:end].tostring()

                # Reverse complement the sequence if the CDS is on
                # the minus strand  
                if feature.strand == -1:  # (6)
                    seq_obj = Seq.Seq(seq, IUPAC.ambiguous_dna)
                    seq = seq_obj.reverse_complement().tostring()

                # append the gene sequence
                self.genes.append(seq) # (7)
                
# (1) Here we import the genbank module outlined in Example 1, along with two more Biopython modules.  The Bio.Seq module provides the DNA sequence objects used later in the code, and the Bio.Alphabet module contains definitions for the types of sequences to be used.  In particular, we use the Bio.Alphabet.IUPAC definitions.

# (2) We use the genbank methods to download and parse the GenBank record for the input accession number.

# (3) The parsed object stores the different parts of the GenBank file as a list of features.  Each feature has a type; here we look for features of type 'CDS', which describe the coding sequence of a gene.
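
# As a minimal sketch of the filtering step in (3), the snippet below uses a
# hypothetical Feature namedtuple as a stand-in for a parsed GenBank feature.
# The field names and example coordinates are assumptions for illustration;
# real Biopython feature objects carry more information.

```python
from collections import namedtuple

# Hypothetical stand-in for a parsed feature (not the Biopython class).
Feature = namedtuple("Feature", ["type", "start", "end"])

features = [
    Feature("source", 0, 30),
    Feature("gene", 2, 11),
    Feature("CDS", 2, 11),
    Feature("CDS", 15, 24),
]

# Keep only the features that describe a coding sequence.
cds_features = [f for f in features if f.type == "CDS"]

for f in cds_features:
    print(f.type, f.start, f.end)
```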

# (4) For many organisms, genes are not contiguous stretches of DNA, but are rather composed of several parts.  For GenBank files, this is indicated by a feature having sub_features.  Here we gather the start and end positions of all sub_features and store them in a list of 2-tuples.  If the gene is a contiguous piece of DNA, this list has only one element.
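
# The branching in (4) can be sketched without Biopython.  The helper name
# part_locations and its arguments are hypothetical: parts stands in for the
# (start, end) pairs of a feature's sub_features, and whole for the span of
# the feature itself.

```python
def part_locations(parts, whole):
    """Return a list of (start, end) tuples describing a gene.

    If the gene has several parts, return one tuple per part;
    otherwise return a single tuple covering the whole feature.
    """
    if parts:
        return [(start, end) for start, end in parts]
    return [whole]


# A gene made of two parts, and a contiguous gene.
split_gene = part_locations([(2, 8), (12, 15)], (2, 15))
simple_gene = part_locations([], (2, 15))
print(split_gene)
print(simple_gene)
```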

# (5) Once the start and end positions of each piece of the gene are obtained, we use them to slice the seq of the parsed_genbank object, and collect the concatenated sequence into a string.
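
# To illustrate the slicing in (5), here is a small sketch on a plain Python
# string.  The sequence and coordinates are made up for the example; because
# the positions are zero-based and end-exclusive, they can be passed straight
# to a slice.

```python
genome_seq = "AAATGGCCTTTTTAAGGG"

# Two parts of one hypothetical gene, as (start, end) tuples.
locations = [(2, 8), (12, 15)]

# Slice out each part and concatenate the pieces in order.
cds = "".join(genome_seq[begin:end] for begin, end in locations)
print(cds)  # ATGGCCTAA
```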

# (6) Since DNA has polarity, there is a difference between a gene encoded on the top (plus) strand and one encoded on the bottom (minus) strand.  The strand on which the gene is encoded is stored in feature.strand.  If it is the minus strand, we need to reverse complement the sequence to get the actual coding sequence of the gene.  To do this, we use the Bio.Seq module to first build a sequence object, then use its reverse_complement() method to return the reverse complement.
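
# To show what reverse complementing does, the sketch below implements it in
# plain Python 3 rather than with Bio.Seq.  The function here is a simplified
# stand-in that only handles the four unambiguous bases, not the full IUPAC
# ambiguous alphabet used in the listing.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


minus_strand_cds = reverse_complement("ATGGCC")
print(minus_strand_cds)  # GGCCAT
```

# Applying the function twice returns the original sequence, which is a quick
# sanity check for any reverse complement routine.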

# (7) We store each gene as an element of the Genome.genes list.  The CDS of the ith gene is then retrievable through Genome.genes[i].