After a while and several German blog posts about my ROSALIND efforts, I am writing this entry to publish my solutions for the first four ROSALIND problems. As described in the ROSALIND project section on this blog, I will write the upcoming posts about ROSALIND in English, which I think makes more sense than writing them in German. This post describes my solutions for the four problems with the IDs DNA, RNA, REVC and GC.
The first problem I solved was counting DNA nucleotides. This is a very simple problem. The input is a DNA string of a given length, for example 1000 nucleotides. Every DNA string is a word over an alphabet of four symbols, the so-called nucleobases 'A', 'C', 'G' and 'T'.
The task in the DNA problem is to count these different nucleobases. The following data is given and the following result is expected in return.
W X Y Z
Here W, X, Y and Z are the numbers of occurrences of the symbols 'A', 'C', 'G' and 'T'. I solved this problem in the class Dna with the method NucleotidesCount. The result of this method is a Dictionary<char, int>. The code can be found in the solution described on my ROSALIND project page. I won't show it here in full, because it is a simple foreach loop that counts the different symbols.
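A minimal sketch of such a counting loop might look like the following. It is not the code from the solution: the class name DnaCountSketch and the exception for unexpected symbols are assumptions made for illustration, only the method name and the Dictionary<char, int> result come from the description above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, used only for this illustration.
public static class DnaCountSketch
{
    // Counts how often each of the four nucleobases occurs in a DNA string.
    public static Dictionary<char, int> NucleotidesCount(string symbols)
    {
        // Start every nucleobase at zero so that absent symbols are reported as 0.
        var counts = new Dictionary<char, int>
        {
            { 'A', 0 }, { 'C', 0 }, { 'G', 0 }, { 'T', 0 }
        };

        foreach (char symbol in symbols)
        {
            // Assumption of this sketch: anything outside the DNA alphabet is an error.
            if (!counts.ContainsKey(symbol))
            {
                throw new ArgumentException("Unexpected symbol: " + symbol);
            }
            counts[symbol]++;
        }

        return counts;
    }
}
```

Calling DnaCountSketch.NucleotidesCount("AGCTTTTCATTCTGACTGCA") gives a dictionary whose values for 'A', 'C', 'G' and 'T' can then be printed in that order, separated by spaces.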
The second problem is the transcription of DNA into RNA. This is a very simple problem, too. Every DNA string consists of the symbols 'A', 'C', 'G' and 'T'. An RNA string, in contrast, is built from the symbols 'A', 'C', 'G' and 'U'. So every 'T' in the DNA string must be transcribed to 'U' in the corresponding RNA string. The following data is given and the appropriate result must be returned.
This can be achieved with a very simple solution. I created the method TranscribeRna in the Dna class. This method returns a new object of the type Rna to represent an RNA string. With the short call Symbols.Replace('T', 'U'), all 'T' symbols are replaced with 'U'. The Symbols attribute contains the complete DNA string. Because this solution is so simple, the code isn't shown here, but it can be found in the solution.
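To make the idea concrete, here is a small sketch of how the two classes could fit together. The names Dna, Rna, TranscribeRna and Symbols are taken from the text, but the constructors and the read-only property are assumptions of this sketch and may differ from the real solution.

```csharp
using System;

// Minimal stand-in for the Rna class; the constructor is an assumption.
public sealed class Rna
{
    public Rna(string symbols) { Symbols = symbols; }

    // The complete RNA string.
    public string Symbols { get; private set; }
}

// Minimal stand-in for the Dna class; the constructor is an assumption.
public sealed class Dna
{
    public Dna(string symbols) { Symbols = symbols; }

    // The complete DNA string.
    public string Symbols { get; private set; }

    // Transcribes the DNA string by replacing every 'T' with 'U'.
    public Rna TranscribeRna()
    {
        return new Rna(Symbols.Replace('T', 'U'));
    }
}
```

With this in place, new Dna("GATGGAACTTGACTACGTAAATT").TranscribeRna().Symbols yields the RNA string for that input.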
The third problem is the computation of the reverse complement of a DNA strand. The complement of a DNA string is another DNA string in which every symbol is replaced by its complement: 'A' and 'T' are complements of each other, as are 'C' and 'G'. The reverse complement is obtained by first reversing the original string and then taking the complement of each symbol. The following data is the basis for the computation and the shown result is an example of the results needed.
The solution is as simple as the solutions for the first two problems. The method ReverseComplementDna of the Dna class implements the solution for this problem. First I iterate through the original DNA string in reverse order. For every symbol I append the complement to a StringBuilder instance. I used a StringBuilder because the DNA strings can be very long, and the specialized class handles this better than normal string concatenation, which creates a new string on every append. The core of the implementation is therefore a single line that appends the complement of the current symbol to the StringBuilder.
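The original code line is not reproduced in this post. As a rough sketch of the approach just described, and not the actual code from the solution, the loop could look like this. The class name ReverseComplementSketch and the helper Complement are hypothetical, as is the exception for symbols outside the DNA alphabet.

```csharp
using System;
using System.Text;

// Hypothetical helper class, used only for this illustration.
public static class ReverseComplementSketch
{
    // Maps a nucleobase to its complement: A <-> T and C <-> G.
    private static char Complement(char symbol)
    {
        switch (symbol)
        {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default: throw new ArgumentException("Unexpected symbol: " + symbol);
        }
    }

    // Walks the string from its last symbol to its first and appends
    // the complement of each symbol to a StringBuilder.
    public static string ReverseComplementDna(string symbols)
    {
        var builder = new StringBuilder(symbols.Length);
        for (int i = symbols.Length - 1; i >= 0; i--)
        {
            builder.Append(Complement(symbols[i]));
        }
        return builder.ToString();
    }
}
```

Reserving symbols.Length characters up front is a small optional optimization, since the result always has the same length as the input.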
The fourth problem I solved is the computation of the GC content, which offers a quick way to identify unknown DNA. The GC content of a DNA string is given by the percentage of symbols that are either 'C' or 'G'. A subproblem is reading the FASTA format, a specialized format for labeling DNA strings when they are consolidated into a database. The following input data is given and the following result is expected.
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
The decimal is the percentage of 'C' or 'G' symbols in the whole DNA string. To solve this problem, I first have to read the FASTA format, which is very simple. Every line that starts with a > is a label. The label contains a name, in this case always Rosalind, and a four-digit code between 0000 and 9999. The solution must return the label of the DNA string with the highest GC content, so every entry in the FASTA file has to be read and its GC content calculated.
First I built two new classes. The Fasta class handles an input file in the FASTA format. I'm using the regular expression >(?<Label>\w+)(\s*)(?<Sequence>[A-Za-z\s]+) with the option RegexOptions.Multiline to split the complete file into multiple result groups. Through the named groups Label and Sequence it is very easy to access the different substrings. For every FASTA entry in the file I create a new instance of the FastaEntry class. This represents one entry, which contains the label and the DNA string; the DNA string is represented by the class already known from the previous problems.
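In this class I've implemented the method CalculateGcRatio. The implementation is very simple. I'm using the regular expression [G|C] to count the 'C' and 'G' symbols (inside a character class the pipe is a literal character rather than an alternation, so [GC] would be the stricter pattern, but for pure DNA strings both give the same count). After that I simply divide the number of matches by the total number of symbols in the DNA string. That's it. The result is a decimal that represents the GC content of this particular DNA string as a fraction, which becomes the percentage when multiplied by 100.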
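Putting the pieces of this problem together, a sketch of the whole pipeline could look like this. It reuses the FASTA pattern quoted above, but everything else is an assumption: the class GcContentSketch and the methods ParseFasta and HighestGcContent are hypothetical names, plain KeyValuePair values stand in for the FastaEntry class, and the character class is written as [GC] instead of [G|C].

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, used only for this illustration.
public static class GcContentSketch
{
    // The pattern quoted in the post. RegexOptions.Multiline only changes how
    // ^ and $ behave, so it has no effect here; it is kept to mirror the post.
    private static readonly Regex EntryPattern = new Regex(
        @">(?<Label>\w+)(\s*)(?<Sequence>[A-Za-z\s]+)", RegexOptions.Multiline);

    // Splits FASTA text into (label, sequence) pairs. The Sequence group still
    // contains the line breaks of the file, so all whitespace is removed.
    public static List<KeyValuePair<string, string>> ParseFasta(string fastaText)
    {
        var entries = new List<KeyValuePair<string, string>>();
        foreach (Match match in EntryPattern.Matches(fastaText))
        {
            string label = match.Groups["Label"].Value;
            string sequence = Regex.Replace(match.Groups["Sequence"].Value, @"\s+", "");
            entries.Add(new KeyValuePair<string, string>(label, sequence));
        }
        return entries;
    }

    // Fraction (0.0 to 1.0) of symbols that are 'C' or 'G'.
    public static double CalculateGcRatio(string symbols)
    {
        if (symbols.Length == 0)
        {
            return 0.0; // avoid dividing by zero for an empty string
        }
        int gcCount = Regex.Matches(symbols, "[GC]").Count;
        return (double)gcCount / symbols.Length;
    }

    // Returns the label with the highest GC content, together with that
    // content as a percentage. The key is null if the text has no entries.
    public static KeyValuePair<string, double> HighestGcContent(string fastaText)
    {
        var best = new KeyValuePair<string, double>(null, -1.0);
        foreach (KeyValuePair<string, string> entry in ParseFasta(fastaText))
        {
            double percent = CalculateGcRatio(entry.Value) * 100.0;
            if (percent > best.Value)
            {
                best = new KeyValuePair<string, double>(entry.Key, percent);
            }
        }
        return best;
    }
}
```

Passing the content of a FASTA file to GcContentSketch.HighestGcContent returns the winning label as Key and its GC percentage as Value, which can then be formatted for the ROSALIND answer.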
These four problems were not very hard to solve. They were good first steps for building up a Visual Studio 2012 solution and the necessary unit test structure. I think I can solve the upcoming problems faster, because the infrastructure is now in place. But I also think that the ROSALIND team won't make the next problems so easy.
I hope you can learn something from my solutions. All solutions were found and implemented through TDD, so every implementation is covered by unit tests. E-mail me with any suggestions, ideas and criticism. Information about the Visual Studio 2012 solution can be found here.