Quick python exercises with DNA sequences

Python is one of the most used coding languages for analyzing biological data. It is often difficult to figure out where to start, so here are some easy and fun exercises to kick off that journey!

Let’s say that we are working with the complete genome of an ssDNA virus, Cyclovirus SL-108277, for simplicity's sake. Its sequence is as follows.

>KJ831064.1 Cyclovirus SL-108277, complete genome
ACCCGTCACTTCGTTTCACTTCGTTCCAAATCCTTTCACTCGACTAGTGCGCCGCGTAGCGGCGCGTCAG
CACACACACACACACACTTGTGTAAGCGAGCGGTAGCGAGCGTTAGCCTCCCCTCCACTGTCTTAGCTTA
GCGACGGTTGGAACATGCTCAGACAGTGACTCACTGGCTGCAATGGCGTACGCACGTAGATACAGGTTCC
GTCGAAGACCGGTTCGCAAGATGCAAAGGCGTAGGCGTCGATTTCGTCGCCGTGCACGCCGTATGGGAAA
TTTGTTGTGCAAGCTTACGAGGGTTGTTACTATTGCTGTTGATCCTTCTAAAGTACAAATCCAAGATTTG
GCAGTTAATCCTCACGACATTCAAGAGTATGTTAACTTGGGAAAGCAATTTGAATTTTGCAAGTTCATCA
GCTGTAGAGTTCGTGTGATTCCTCATCAGAATGTGACAAATAATTCTACATCGTCGTTACCGAACTATTG
TATCTTACCTTGGCATTCCGCTACTCCACCTGGAAGTACTGGTTTTACTACGTATACTTCTCATGATCGG
GCAAAGGTGTATCGGTCCACTCAAAAGGCTCATATGAATTTTGTATGTTCTACGGCCTTTGAAGCTCAAA
TTGAATCGAAAAATGCTTCATCTTTTAATACTATAAAATGGAAACCTGATATTCGCTGGGATGCTGATAA
GACTTTAGCGCCTACGATAAGGACTGGGATTTTGGCTTTTCAAGGTGATGCGGATGCGCCGACGGGAGCA
AAATCAAAGTTTACAATATTTCAAGATTATATTTGTTTATTTAAGAATCAATGTATTTTGGAGACGTTTG
CTCCGCCGCCTACATTAGAGCATATGACACTTTAGTGTAATTCGTCTGTCGAAAGCGGTGTTTGTATAAT
TAATAAATTTGTATAATGTATCTGTGTCAATGTTGCTTGTTATCCATATTCTTTCACTTGTAAATTCCTC
GAATCCCCCTTTTACTTGTACTTTGTATGGGTATCGGTCCATGATTTTTAGCATTTCGTCGTATTTTATC
CATCCGTAGAAGTCGTCGATGATTACGTTTGGCTGTTGGTTGTATCCGTCCCACCATAGTCCTCTGGGTT
TGTAGTATATTGATTGGTTTGTTGCTTTTGCTTCTTCTAAGGCTCGTCTTGACTTTCCACTGCCTGGAGG
TCCCCAAAAGTAATATACAAGAGTTTTATGTTGTCTCTCTTGAATTGGTTTAACCATTCGTAAATATTCT
CGAATTCCTTTGTGGTATCTGATATACGCCGTGGGATGCTTGGTGGCAATATCTTGTAATGTGCTATTGC
TTTGTGCGATGGTTTCGACCACAGCTTGCAGATCTGAGCGATGTCCTTGGCTACAAGGTGTGCCCTTTTC
AAAATATATGCCTGATTTCGAACAGTATTTTTTGTTATCTTCGTCGGATCCATTTGCCTTCTCAAGATGG
ATTGAGTTATCGAGATGCTTTTTGATTTTGTTGAAGCGTATGGGTTTATGTAGATTACAGAATCCCTGAA
GGTGAATTGTCCCAGTATTCGGAGCGATTTCTTCTCCAACGATGCCATATTTGCAATATGTATTGATGAA
ATCTTCGCACTTTTTGTATGCCTCTTCTGTGTAATTATTCCACGTGAAACAGAATCGACGTACGTTTGCG
TTCATCGCAACGAAGTGACGGTATAGTATT

For easier access, we will store the genome sequence in a variable. If the sequence were saved as a file on your computer, you could read it from there instead, but here we will simply paste the entire sequence into a variable.

cyclovirus_seq = "ACCCGTCACTTCGTTTCACTTCGTTCCAAATCCTTTCACTCGACTAGTGCGCCGCGTAGCGGCGCGTCAGCACACACACACACACACTTGTGTAAGCGAGCGGTAGCGAGCGTTAGCCTCCCCTCCACTGTCTTAGCTTAGCGACGGTTGGAACATGCTCAGACAGTGACTCACTGGCTGCAATGGCGTACGCACGTAGATACAGGTTCCGTCGAAGACCGGTTCGCAAGATGCAAAGGCGTAGGCGTCGATTTCGTCGCCGTGCACGCCGTATGGGAAATTTGTTGTGCAAGCTTACGAGGGTTGTTACTATTGCTGTTGATCCTTCTAAAGTACAAATCCAAGATTTGGCAGTTAATCCTCACGACATTCAAGAGTATGTTAACTTGGGAAAGCAATTTGAATTTTGCAAGTTCATCAGCTGTAGAGTTCGTGTGATTCCTCATCAGAATGTGACAAATAATTCTACATCGTCGTTACCGAACTATTGTATCTTACCTTGGCATTCCGCTACTCCACCTGGAAGTACTGGTTTTACTACGTATACTTCTCATGATCGGGCAAAGGTGTATCGGTCCACTCAAAAGGCTCATATGAATTTTGTATGTTCTACGGCCTTTGAAGCTCAAATTGAATCGAAAAATGCTTCATCTTTTAATACTATAAAATGGAAACCTGATATTCGCTGGGATGCTGATAAGACTTTAGCGCCTACGATAAGGACTGGGATTTTGGCTTTTCAAGGTGATGCGGATGCGCCGACGGGAGCAAAATCAAAGTTTACAATATTTCAAGATTATATTTGTTTATTTAAGAATCAATGTATTTTGGAGACGTTTGCTCCGCCGCCTACATTAGAGCATATGACACTTTAGTGTAATTCGTCTGTCGAAAGCGGTGTTTGTATAATTAATAAATTTGTATAATGTATCTGTGTCAATGTTGCTTGTTATCCATATTCTTTCACTTGTAAATTCCTCGAATCCCCCTTTTACTTGTACTTTGTATGGGTATCGGTCCATGATTTTTAGCATTTCGTCGTATTTTATCCATCCGTAGAAGTCGTCGATGATTACGTTTGGCTGTTGGTTGTATCCGTCCCACCATAGTCCTCTGGGTTTGTAGTATATTGATTGGTTTGTTGCTTTTGCTTCTTCTAAGGCTCGTCTTGACTTTCCACTGCCTGGAGGTCCCCAAAAGTAATATACAAGAGTTTTATGTTGTCTCTCTTGAATTGGTTTAACCATTCGTAAATATTCTCGAATTCCTTTGTGGTATCTGATATACGCCGTGGGATGCTTGGTGGCAATATCTTGTAATGTGCTATTGCTTTGTGCGATGGTTTCGACCACAGCTTGCAGATCTGAGCGATGTCCTTGGCTACAAGGTGTGCCCTTTTCAAAATATATGCCTGATTTCGAACAGTATTTTTTGTTATCTTCGTCGGATCCATTTGCCTTCTCAAGATGGATTGAGTTATCGAGATGCTTTTTGATTTTGTTGAAGCGTATGGGTTTATGTAGATTACAGAATCCCTGAAGGTGAATTGTCCCAGTATTCGGAGCGATTTCTTCTCCAACGATGCCATATTTGCAATATGTATTGATGAAATCTTCGCACTTTTTGTATGCCTCTTCTGTGTAATTATTCCACGTGAAACAGAATCGACGTACGTTTGCGTTCATCGCAACGAAGTGACGGTATAGTATT"
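If the sequence lives in a FASTA file instead, a small reader can do the joining for you. This is only a minimal sketch using plain Python: `read_fasta` is a helper name made up for this post, and the file name in the comment is hypothetical. It reads the first record only and glues the 70-character lines back into one string.

```python
import io

def read_fasta(handle):
    """Return (header, sequence) for the first record in a FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts; stop after the first one
            header = line[1:]  # drop the leading ">"
        else:
            chunks.append(line.upper())
    return header, "".join(chunks)

# Demo with an in-memory "file"; for a real file you would pass
# open("cyclovirus.fasta") instead (hypothetical file name).
demo_file = io.StringIO(">demo record\nACCCGT\nCACTTC\n")
demo_header, demo_seq = read_fasta(demo_file)
print(demo_header)
print(demo_seq)
```

For real projects, a library such as Biopython handles FASTA parsing more robustly, but the idea is the same.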

Now, let’s find the open reading frame (ORF) of this genome.

bp = 0  # start at the beginning of the sequence, index 0
while bp < len(cyclovirus_seq) - 2:  # stop early so the last slice is always a full 3-base codon
    codon = cyclovirus_seq[bp:bp + 3]  # take 3 bases at a time
    print(codon)
    bp = bp + 3

The code above is a loop that steps through the genomic sequence 3 bases at a time to imitate codons. Since we cannot be sure that the sequence length is perfectly divisible by 3, the loop stops once fewer than 3 bases remain, so every slice is a complete codon.
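The same stepping can also be written with `range`, which takes a step size as its third argument. Here is a small sketch; `split_codons` is just a name picked for illustration, and the short test sequence is made up so the leftover bases are easy to see.

```python
def split_codons(seq):
    """Split seq into consecutive 3-letter codons, dropping any 1-2 leftover bases."""
    # range(start, stop, step): stopping at len(seq) - 2 guarantees full codons only
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

# 8 bases: two full codons, and the trailing "CA" is dropped
print(split_codons("ACCCGTCA"))
```

Returning a list instead of printing makes it easy to reuse the codons in later steps.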

bp = 0
stop_codons = ["TAG", "TAA", "TGA"]

while bp < len(cyclovirus_seq) - 2:
    codon = cyclovirus_seq[bp:bp + 3]
    if codon in stop_codons:
        print(codon + " -> Stop codon")
    else:
        print(codon)
    bp = bp + 3

Now, we introduce the 3 possible stop codons into the code so whenever the counter hits one of them, there will be an indication that prints “-> Stop codon”.
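If you would rather know where the stop codons are than scroll through the printout, their positions can be collected in a list. This is a sketch under a couple of assumptions: `stop_positions` and its `frame` parameter are names invented here, positions are 0-based string indices, and the test sequence is a toy example rather than the virus genome.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def stop_positions(seq, frame=0):
    """Return 0-based start indices of in-frame stop codons in seq."""
    return [
        i
        for i in range(frame, len(seq) - 2, 3)  # step codon by codon from `frame`
        if seq[i:i + 3] in STOP_CODONS
    ]

# Toy sequence: ATG TAG CCC TGA
print(stop_positions("ATGTAGCCCTGA"))
```

Changing `frame` to 1 or 2 shifts the starting base, which shows how much the result depends on the reading frame.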

To make the output easier to read, we can collect the codons between stop codons and print only the longer stretches, each under a FASTA header.

bp = 0
stop_codons = ["TAG", "TAA", "TGA"]
orf_length = 0  # length of the current stretch, in bases
orf_seq = ""  # codons collected since the last stop codon

while bp < len(cyclovirus_seq) - 2:
    codon = cyclovirus_seq[bp:bp + 3]
    if codon in stop_codons:
        if orf_length > 70:  # only print stretches longer than 70 bases
            print("")  # blank line for easier reading
            print(">KJ831064.1 Cyclovirus SL-108277, complete genome")  # FASTA header
            print(orf_seq)
        orf_length = 0  # reset for the next stretch
        orf_seq = ""
    else:
        orf_seq += codon  # same as orf_seq = orf_seq + codon
        orf_length += 3
    bp += 3

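These additional lines make the printed output neater for the viewer. A stretch is printed only if it is longer than 70 bases. I picked 70 because it is the line width commonly used in FASTA files from the NCBI GenBank database, although here it acts as a minimum-length filter rather than a line width. Each ORF found in the cyclovirus genome is printed under the same FASTA header. Note that this simple version reads only one forward frame and treats any stop-free stretch as an ORF, without looking for a start codon.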
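The loop above can also be wrapped in a function so it is easy to try on other reading frames. The following is a sketch, not a full ORF finder: `stop_free_stretches`, `frame`, and `min_length` are names chosen for this example, it still ignores start codons and the reverse strand, and the demo sequence is a toy string.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}

def stop_free_stretches(seq, frame=0, min_length=70):
    """Return stop-free stretches longer than min_length bases in one reading frame."""
    stretches = []
    current = []  # codons collected since the last stop codon
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            if len(current) * 3 > min_length:
                stretches.append("".join(current))
            current = []  # reset for the next stretch
        else:
            current.append(codon)
    # Also keep a long stretch that runs to the end of the sequence
    # (the while loop only prints when it reaches a stop codon).
    if len(current) * 3 > min_length:
        stretches.append("".join(current))
    return stretches

# Toy example with a small min_length so the short demo produces output
demo = "ATGAAACCCTAGGGGTTTTGACC"
for frame in range(3):
    print(frame, stop_free_stretches(demo, frame=frame, min_length=5))
```

Calling it as `stop_free_stretches(cyclovirus_seq)` should give the same stretches the loop prints, plus any long stretch left over at the very end of the genome.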

Now we have extracted the ORFs from the whole genome, and they are ready to be searched against a protein database with a tool such as BLASTx to check that the code worked properly.
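Before pasting a sequence into a search tool, it can help to format it as a proper FASTA record, with its own header and the sequence wrapped at 70 characters per line. A minimal sketch follows; `format_fasta` and the header label are made up for illustration, and the repeated "ACGT" string is only placeholder data.

```python
def format_fasta(header, seq, width=70):
    """Return a FASTA record with the sequence wrapped at `width` characters per line."""
    lines = [">" + header]
    for i in range(0, len(seq), width):
        lines.append(seq[i:i + width])  # one line of at most `width` bases
    return "\n".join(lines)

# Placeholder sequence of 160 bases: wraps into lines of 70, 70 and 20
print(format_fasta("KJ831064.1 stretch 1 (hypothetical label)", "ACGT" * 40))
```

Giving each stretch a distinct header makes it easier to tell the results apart later.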

Sure enough, running the first ORF from our output through BLASTx returned a few good matches at 100% identity!

From here, the ORF finder could be extended to transcribe the DNA sequences into their mRNA form. But that can be left for another time 🙂
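As a small taste of that next step, here is a sketch of the DNA-to-mRNA conversion. It assumes the string is the coding (sense) strand, so the mRNA is simply the same sequence with T swapped for U; `transcribe` is a name chosen for this example.

```python
def transcribe(dna):
    """Return the mRNA-style sequence for a coding-strand DNA string (T -> U)."""
    return dna.upper().replace("T", "U")

print(transcribe("ATGGCGTAC"))
```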
