Bioinformatics

Released	Due
TBD	TBD

Homework 6

We will be working in the field of Bioinformatics, a field in which people analyze and manipulate biological data.

We will mainly be working with DNA, which you may already be familiar with. The DNA’s double helix structure is made up of 4 nucleotides: adenine (A), thymine (T), guanine (G), cytosine (C). We can represent this pattern of nucleotides as a list of characters!

Nucleotide Counting

Task 1 Write a function called contains_nucleotide. It should take in a list of characters representing the strand of DNA, as well as a nucleotide as a parameter (either a character 'A', 'T', 'G', or 'C'). It should return a boolean representing whether or not the given nucleotide exists in the DNA strand.

Task 2 Write a function called get_nucleotide_count. It should take in a list of characters representing the strand of DNA, as well as a nucleotide (of the same form as task 1). It should return the number of times the nucleotide appears in the strand of DNA.

Task 3 Write a function called most_common_nucleotide, which takes in a list of characters representing the strand of DNA as a parameter, and returns the most commonly occuring nucleotide. There will not be a tie for the most commonly occuring nucleotide, so you won’t have to worry about that test case.

Task 4 Because guanine (G) pairs with cytosine (C), the amount of guanine (G) and cytosine (C) will be roughly equal when looking at DNA. The same applies for thymine (T) and adenine (A). The percentage of guanine (G) and cytosine (C) that appear in the DNA sequence is called the GC content, and is an important number for biologists. You can easily find the percentage of thymine (T) and adenine (A) by subtracting the GC content from 1, so we will focus on the GC content.

Write a function called GC_content. It should take in a list of characters representing the strand of DNA as a parameter, and returns the GC content as a float (decimal value). You may assume that the list is not empty.

RNA Transcription

Task 5 RNA is similar to DNA except thymine (T) is changed to uracil (U). RNA is used as a copy of DNA in the generation of proteins.

Write a function called DNA_to_RNA that takes in a list of characters representing the strand of DNA as a parameter, and returns a list of characters representing the corresponding strand of RNA. For example, DNA_to_RNA(['A','C','T','G']) should return ['A','C','U','G']

Proteins

Proteins are complex molecules and do most of the work in cells. They are polymer chains composed of smaller molecules called amino acids. When RNA is translated into protein, the nucleotides are read in groups of 3, and a group of 3 nucleotides is called a “codon”. Each codon represents an amino acid, but the amino acids being represented are not necessarily unique amongst codons.

Task 6 Write a function called sort_by_protein_length. This function takes in a list of proteins, which are themselves represented by strings, where each character of the string is a nucleotide. This function should sort the proteins in the list by their length. Note that this function does not return anything, but instead modifies the list passed in as a parameter.

Hint: In Python, sorting a list numerically or alphabetically is as easy as writing sorted(theList), where theList is the list you want to sort. If you instead want to provide your own criterion, such as the length of the name you use the following form: sorted(theList, key=keyFun), where theList is the list to sort and keyFun is a function that takes a single item of the list and returns a value (like a number) that gets used for sorting.

Task 7 The building blocks of proteins are called amino acids, which are made up of codons. For example, the string of nucleotides “ACA” makes up the amino acid Threonine, and “CGA” makes up Arginine.

Your task is to write a function protein_contains_amino_acid that takes in a protein (string), and an amino acid (string of length 3), and returns a boolean signifying whether or not the amino acid is contained within the string.

Data Cleaning

We have gotten a list of DNA/RNA strands on which to perform analysis on from our supplier, but we’re not too sure if we can trust the inputs that they gave us. So, we must first check the strands for correctness before doing any further processing on them.

Task 8 The first thing that we want to check is if the DNA strands have valid nucleotides! If we find something like the letter 'H', 'p', or even '@' in our DNA, we can be sure that something’s incorrect.

Write a function invalid_nucleotides that takes in a list of characters representing a strand of DNA, and returns a boolean representing whether or not there is an invalid nucleotide anywhere in the strand (Note: the only valid nucleotides are CAPITAL 'A', 'T', 'G', or 'C').

Task 9 Genes are specific parts of DNA that contain the information required to make proteins. Proteins are encoded in RNA, and are made up of units of information called “codons” - a sequence of 3 nucleotides. Proteins always start with the AUG codon, so we want to make sure that any proteins that we receive start with that codon.

Write a function called starts_with_AUG that takes in a strand of RNA encoded as a list of characters, and returns a boolean representing whether or not it starts with the AUG codon.

Task 10 Many times, data in the real world is messy with invalid or missing data and must be cleaned before used. When thinking about cleaning input, we need to consider what course of action to take, which is often not a clear cut answer. For example, if we are given a list of many DNA strings and encounter a DNA string that is problematic, should we get rid of invalid characters or discard the entire string? If the string is mostly comprised of invalid characters, should we discard the entire string? Are lower case letters just mistypings of upper case letters, or should they be treated as invalid characters? To make a good decision, we need to know how our input is made, how those errors may have come to be, and what our desired outcome is. For this assignment, we will create a simple cleaning input function that removes invalid characters and converts lowercase letters to uppercase.

Write a function clean_inputthat takes in a list of characters representing a strand of DNA, and return a list of characters that represent the cleaned strand of DNA. Remove all characters that aren’t 'A', 'G', 'C', 'T', and make sure that all 'a', 'g', 'c', 't' get converted to uppercase.

Challenge Problem

Task 11 There are many reasons why DNA is double stranded, but the primary argument is that it is useful in replication. The two strands of DNA can be generated from each other. Adenine (A) and thymine (T) bond together, where as cytosine (C) and guanine (G) bond together. When the nucleotides bond together, they are called the complement of each other. By convention, the letters representing a DNA molecule is always written in the direction 5’ to 3’. This means that the complement must be reversed and is called the reverse complement.

Write a function called reverse_complement. This function takes in a list of characters representing the strand of DNA as a parameter, and returns a list of characters representing the reverse complement of the strand of DNA. For example reverse_complement(['A', 'C', 'G', 'T', 'A', 'G', 'C']) should return ['G', 'C', 'T', 'A', 'C', 'G', 'T'].

These are purely optional and for your own interest. This homework assignment barely touches upon the vast and rapidly expanding field of bioinformatics, which still has complex problems that have not been solved yet.

DeepMind solving protein folding: https://www.deepmind.com/research/highlighted-research/alphafold Bioinformatics textbook: https://www.bioinformaticsalgorithms.org/