Python is one of the most used coding languages for analyzing biological data. It is often difficult to figure out where to start, so here are some easy and fun exercises to kick off that journey!
Let’s say that we are working with the complete genome of an ssDNA virus, Cyclovirus SL-108277, for simplicity’s sake. Its sequence is as follows.
>KJ831064.1 Cyclovirus SL-108277, complete genome ACCCGTCACTTCGTTTCACTTCGTTCCAAATCCTTTCACTCGACTAGTGCGCCGCGTAGCGGCGCGTCAG CACACACACACACACACTTGTGTAAGCGAGCGGTAGCGAGCGTTAGCCTCCCCTCCACTGTCTTAGCTTA GCGACGGTTGGAACATGCTCAGACAGTGACTCACTGGCTGCAATGGCGTACGCACGTAGATACAGGTTCC GTCGAAGACCGGTTCGCAAGATGCAAAGGCGTAGGCGTCGATTTCGTCGCCGTGCACGCCGTATGGGAAA TTTGTTGTGCAAGCTTACGAGGGTTGTTACTATTGCTGTTGATCCTTCTAAAGTACAAATCCAAGATTTG GCAGTTAATCCTCACGACATTCAAGAGTATGTTAACTTGGGAAAGCAATTTGAATTTTGCAAGTTCATCA GCTGTAGAGTTCGTGTGATTCCTCATCAGAATGTGACAAATAATTCTACATCGTCGTTACCGAACTATTG TATCTTACCTTGGCATTCCGCTACTCCACCTGGAAGTACTGGTTTTACTACGTATACTTCTCATGATCGG GCAAAGGTGTATCGGTCCACTCAAAAGGCTCATATGAATTTTGTATGTTCTACGGCCTTTGAAGCTCAAA TTGAATCGAAAAATGCTTCATCTTTTAATACTATAAAATGGAAACCTGATATTCGCTGGGATGCTGATAA GACTTTAGCGCCTACGATAAGGACTGGGATTTTGGCTTTTCAAGGTGATGCGGATGCGCCGACGGGAGCA AAATCAAAGTTTACAATATTTCAAGATTATATTTGTTTATTTAAGAATCAATGTATTTTGGAGACGTTTG CTCCGCCGCCTACATTAGAGCATATGACACTTTAGTGTAATTCGTCTGTCGAAAGCGGTGTTTGTATAAT TAATAAATTTGTATAATGTATCTGTGTCAATGTTGCTTGTTATCCATATTCTTTCACTTGTAAATTCCTC GAATCCCCCTTTTACTTGTACTTTGTATGGGTATCGGTCCATGATTTTTAGCATTTCGTCGTATTTTATC CATCCGTAGAAGTCGTCGATGATTACGTTTGGCTGTTGGTTGTATCCGTCCCACCATAGTCCTCTGGGTT TGTAGTATATTGATTGGTTTGTTGCTTTTGCTTCTTCTAAGGCTCGTCTTGACTTTCCACTGCCTGGAGG TCCCCAAAAGTAATATACAAGAGTTTTATGTTGTCTCTCTTGAATTGGTTTAACCATTCGTAAATATTCT CGAATTCCTTTGTGGTATCTGATATACGCCGTGGGATGCTTGGTGGCAATATCTTGTAATGTGCTATTGC TTTGTGCGATGGTTTCGACCACAGCTTGCAGATCTGAGCGATGTCCTTGGCTACAAGGTGTGCCCTTTTC AAAATATATGCCTGATTTCGAACAGTATTTTTTGTTATCTTCGTCGGATCCATTTGCCTTCTCAAGATGG ATTGAGTTATCGAGATGCTTTTTGATTTTGTTGAAGCGTATGGGTTTATGTAGATTACAGAATCCCTGAA GGTGAATTGTCCCAGTATTCGGAGCGATTTCTTCTCCAACGATGCCATATTTGCAATATGTATTGATGAA ATCTTCGCACTTTTTGTATGCCTCTTCTGTGTAATTATTCCACGTGAAACAGAATCGACGTACGTTTGCG TTCATCGCAACGAAGTGACGGTATAGTATT
For easier access, we should store the genome sequence in a variable, which the code in this post calls cyclovirus_seq. If the sequence were saved on your computer, you could read it from the file instead of pasting it, but here we will just paste the entire sequence into the variable.
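Here is a minimal sketch of that step. The helper name parse_fasta_sequence is my own (hypothetical), and only the first 40 bases of the genome are pasted to keep the example short; in practice you would paste the whole record.

```python
def parse_fasta_sequence(fasta_text):
    """Return the sequence of a single-record FASTA string as one uppercase string."""
    lines = fasta_text.strip().splitlines()
    # Drop the header line (it starts with ">") and join the remaining lines.
    sequence_lines = [line.strip() for line in lines if not line.startswith(">")]
    return "".join(sequence_lines).upper()


# Truncated for illustration: only the first 40 bases of the genome are shown.
fasta_text = """>KJ831064.1 Cyclovirus SL-108277, complete genome
ACCCGTCACTTCGTTTCACT
TCGTTCCAAATCCTTTCACT
"""

cyclovirus_seq = parse_fasta_sequence(fasta_text)
print(len(cyclovirus_seq))
```

Joining the lines matters because a pasted FASTA record contains line breaks, and a codon that straddles two lines would otherwise be split.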
Now, let’s find the open reading frame (ORF) of this genome.
bp = 0  # starting at the beginning of the sequence, at base pair 0
while bp < len(cyclovirus_seq) - 2:  # end earlier so the last codon has all 3 bases
    codon = cyclovirus_seq[bp:bp + 3]  # count by 3 bp at a time
    print(codon)
    bp = bp + 3
The code above is a loop that walks through the genomic sequence 3 base pairs at a time, imitating codons. Since we cannot be sure that the length of the given sequence is divisible by 3, the loop stops 2 bases before the end, so every slice it prints is a complete 3-base codon and any 1 or 2 leftover bases are skipped.
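To see the effect of that "- 2" on a sequence whose length is not a multiple of 3, here is the same loop wrapped in a small function. The function name split_into_codons and the short test strings are my own, made up for illustration.

```python
def split_into_codons(seq):
    """Return the complete 3-base codons of seq, reading from index 0."""
    codons = []
    bp = 0
    while bp < len(seq) - 2:  # same bound as the loop in the post
        codons.append(seq[bp:bp + 3])
        bp += 3
    return codons


# 8 bases: two full codons, and the 2 leftover bases are dropped.
print(split_into_codons("ACCCGTCA"))  # ['ACC', 'CGT']
# 9 bases: exactly three codons, nothing left over.
print(split_into_codons("ACCCGTCAC"))  # ['ACC', 'CGT', 'CAC']
```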
bp = 0
stop_codons = ["TAG", "TAA", "TGA"]
while bp < len(cyclovirus_seq) - 2:
    codon = cyclovirus_seq[bp:bp + 3]
    if codon in stop_codons:
        print(codon + " -> Stop codon")
    else:
        print(codon)
    bp = bp + 3
Now we add the 3 possible stop codons to the code, so whenever the loop reaches one of them, it prints “-> Stop codon” next to it.
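If you would rather get the stop codon positions back than scroll through the printout, the same check can be sketched as a function. The name find_stop_positions and the toy sequences are my own assumptions, not part of the genome above.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_stop_positions(seq):
    """Return the start index of every in-frame stop codon, reading from index 0."""
    return [
        bp
        for bp in range(0, len(seq) - 2, 3)
        if seq[bp:bp + 3] in STOP_CODONS
    ]


# Toy sequence with codons ATG, TAG, CCC, TGA.
print(find_stop_positions("ATGTAGCCCTGA"))  # [3, 9]
```

Note that only codons in the frame starting at index 0 are checked, just like in the loop above; a TAG that starts at index 1 or 2 is not counted.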
To make the printed information easier to read, we can collect the codons between stop codons and print only the longer stretches, each under a FASTA header.
bp = 0
stop_codons = ["TAG", "TAA", "TGA"]
orf_length = 0  # initial length is 0
orf_seq = ""  # variable to store our sequence
while bp < len(cyclovirus_seq) - 2:
    codon = cyclovirus_seq[bp:bp + 3]
    if codon in stop_codons:
        if orf_length > 70:
            print("")  # blank line for easier reading
            print(">KJ831064.1 Cyclovirus SL-108277, complete genome")  # FASTA header
            print(orf_seq)
        orf_length = 0  # reset at every stop codon, ready for the next ORF
        orf_seq = ""
    else:
        orf_seq += codon  # this is the same as orf_seq = orf_seq + codon
        orf_length += 3
    bp += 3
These are just some additional lines to make the printed information look neater for the viewer. I set the minimum length of a printed ORF to 70 bp, a number borrowed from the 70-character line width that is common for FASTA files uploaded to the NCBI GenBank database. Note that the code uses it as a length cutoff, not as a line width. Each ORF found in the cyclovirus genome is printed with the FASTA header above it.
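The same idea can be sketched as two small functions, one that collects the ORFs and one that wraps a sequence at 70 characters per line. The names find_orfs and format_fasta, the min_length and width parameters, and the toy sequence are my own assumptions for illustration.

```python
STOP_CODONS = {"TAG", "TAA", "TGA"}


def find_orfs(seq, min_length=70):
    """Collect stop-free stretches longer than min_length bases, reading from index 0."""
    orfs = []
    orf_seq = ""
    for bp in range(0, len(seq) - 2, 3):
        codon = seq[bp:bp + 3]
        if codon in STOP_CODONS:
            if len(orf_seq) > min_length:
                orfs.append(orf_seq)
            orf_seq = ""  # start over after every stop codon
        else:
            orf_seq += codon
    return orfs


def format_fasta(header, seq, width=70):
    """Return a FASTA record with the sequence wrapped at width characters per line."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines)


# Toy sequence: a 90-base stretch, a stop codon, a 15-base stretch, a stop codon.
toy_seq = "GCC" * 30 + "TAA" + "GCC" * 5 + "TGA"
for orf in find_orfs(toy_seq):
    print(format_fasta("toy_orf", orf))
```

Like the loop in the post, this only reports a stretch once a stop codon closes it, so bases after the last stop codon are not returned.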
Now we have extracted the ORFs from our whole genome, and they are ready to be searched against a protein database with a tool such as BLASTx to validate that the code worked properly.
When running the first ORF our code output through BLASTx, sure enough, it resulted in a few good matches at 100% identity!
From here on, the ORF finder could be further modified to transcribe the DNA sequences into their mRNA form. But that can be left for another time 🙂