Say it with DNA – how our genetic code could hold the key to our data storage problems

We’re generating more data than ever before, but where are we going to store it? Some scientists are pinning their hopes on using the building blocks of life to help solve the issue.

Varshit Dusad

19 Jan 2018 — 4 min read

Data has been hailed as the oil of this century. Just as they once relied on oil, entire industries rely on data analytics to develop consumer-centric products, which are nowadays in constant demand. Google Search, Facebook recommendations, and pretty much everything else on the Internet have become part of the new ‘Data Economy’. Every single day a huge amount of data is created and stored: a joint research project by Seagate, one of the world’s leading hard drive manufacturers, and the analyst firm IDC predicts that the total amount of data worldwide will reach 163 zettabytes (ZB) by 2025, a tenfold increase from the 16.1 ZB generated in 2016. To put that into perspective, one ZB is equal to a trillion GB. This is an explosive rate of data generation! Much of this data is managed by a few tech giants like Google, Facebook, Amazon, and Microsoft, who store it in giant data centers. But they keep running out of space, at least when using conventional storage. The solution? Recently there has been interest in replacing magnetic tapes and hard drives with DNA, which could be the next generation of storage.

Surprising as it may sound, using DNA for information storage is not as strange as you may think. Nature has always used DNA to encode life’s genetic information, and recent research has exploited its innate properties to store digital information. In 2012, the Church lab at Harvard University demonstrated the potential of DNA as a storage platform by encoding a 53,426-word book, eleven JPG images, and one JavaScript program using next-generation DNA synthesis and sequencing platforms. Digital information is stored as strings of binary digits (bits), which take only the values zero and one, while genetic information is stored in a sequence of four chemical bases: adenine, guanine, cytosine, and thymine. In both cases, it is the sequence and the rules of interpretation which encode the information. By mapping bits to base sequences, one can transfer digital information to its chemical equivalent. To access the information, the DNA is sequenced, and sequences of eight bases are used to retrieve the digital information.
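
To make the idea of mapping bits to bases concrete, here is a minimal Python sketch. It assumes a simplified rule of one bit per base, with a zero written as A or C and a one written as G or T, so that eight bases read back as one byte. The function names and the trick of alternating between the two options to avoid repeating a base are illustrative assumptions, not necessarily the scheme used in the studies mentioned here.

```python
ZERO_BASES = "AC"  # either base stands for bit 0 (assumed mapping)
ONE_BASES = "GT"   # either base stands for bit 1 (assumed mapping)


def encode(data: bytes) -> str:
    """Turn bytes into a DNA string, one base per bit."""
    bases = []
    for byte in data:
        for shift in range(7, -1, -1):  # most significant bit first
            bit = (byte >> shift) & 1
            options = ONE_BASES if bit else ZERO_BASES
            # Pick the option that differs from the previous base, so the
            # same base never appears twice in a row.
            choice = options[0]
            if bases and bases[-1] == choice:
                choice = options[1]
            bases.append(choice)
    return "".join(bases)


def decode(dna: str) -> bytes:
    """Read the bases back eight at a time, one byte per group."""
    if len(dna) % 8 != 0:
        raise ValueError("expected a whole number of 8-base groups")
    out = bytearray()
    for start in range(0, len(dna), 8):
        byte = 0
        for base in dna[start:start + 8]:
            if base in ZERO_BASES:
                bit = 0
            elif base in ONE_BASES:
                bit = 1
            else:
                raise ValueError(f"unexpected base: {base!r}")
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


message = b"Say it with DNA"
strand = encode(message)
print(len(strand), "bases for", len(message), "bytes")
print(decode(strand) == message)
```

A real system would pack more than one bit into each base and add redundancy, but the round trip from bytes to bases and back is the heart of the approach.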

“Entire industries rely on data analytics to develop consumer-centric products”

What advantages can this bring? Well, one gram of DNA can store 215 petabytes (PB). While conventional hard drives last for an average of five years, DNA is far more resilient. In fact, it is one of the most stable chemicals found on Earth, having been found preserved in remains thousands of years old, despite harsh conditions. Some have even called DNA apocalypse-proof! If humanity suffers a disaster, future generations could access the information stored on DNA memory sticks to recreate our civilization.
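
To get a feel for that density, the short Python calculation below works out how much DNA the 163 ZB projected for 2025 would need. It is a back-of-the-envelope sketch that assumes decimal units (1 PB = 10^15 bytes, 1 ZB = 10^21 bytes) and ignores any overhead for redundancy or addressing.

```python
GB = 10**9   # bytes in a gigabyte (decimal units, an assumption)
PB = 10**15  # bytes in a petabyte
ZB = 10**21  # bytes in a zettabyte

DNA_BYTES_PER_GRAM = 215 * PB  # density quoted in the article
WORLD_DATA_2025 = 163 * ZB     # Seagate/IDC projection quoted in the article


def grams_of_dna(num_bytes: int) -> float:
    """Grams of DNA needed for num_bytes at the quoted density."""
    return num_bytes / DNA_BYTES_PER_GRAM


kilograms = grams_of_dna(WORLD_DATA_2025) / 1000
print(f"{ZB // GB:,} GB in one ZB")    # 1,000,000,000,000 GB in one ZB
print(f"{kilograms:.0f} kg of DNA")    # 758 kg of DNA
```

On these assumptions, all the data the world is expected to hold in 2025 would fit in roughly three quarters of a tonne of DNA.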

So, why is DNA storage not yet mainstream? Because chemically synthesizing DNA and sequencing it are both slow processes, very slow compared to the lightning-fast information storage and retrieval we are used to. Furthermore, the sequencing process is error-prone, with the error rate increasing with the length of the DNA. The other issue is that sequencing is a linear, end-to-end process and, without a robust encoding scheme, it is very challenging to make data randomly accessible. For example, if you wanted to find a key passage in a book written in DNA, you would have to read the whole book and couldn’t simply skip to the passage of interest. Smart primer design might help with this, but it only works well in a minority of cases.

“Why is DNA storage not mainstream? Because both synthesising and sequencing DNA are very slow processes”

The challenges have not deterred Microsoft from investing. In an interview with MIT Technology Review last year, Microsoft unveiled its plan to develop a “proto-commercial system in three years” that can be used at its data centers. Though such a system can’t be used for quick retrieval of information, it can act as a valuable back-up for archived data. In July 2016, Microsoft, in collaboration with the University of Washington, stored 200 megabytes (MB) of data in DNA, including a music video. 200 MB might appear much lower than what is desired, but the real limit is cost, not science. According to estimates by MIT Technology Review, the project would have cost US$800,000 had the supplies been bought on the open market. To reduce costs, Microsoft partnered with Twist Bioscience, a synthetic biology start-up. Using its proprietary synthesis technology, Twist Bioscience provides rapid and cost-effective DNA on demand to Microsoft for its data storage ambitions.
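
The scale of the cost problem is easy to see with a little arithmetic. The Python sketch below takes the US$800,000 estimate and the 200 MB figure at face value and scales the cost per megabyte linearly, which is a deliberate oversimplification: real prices would not scale this neatly, and 1 TB is taken to be a million MB.

```python
PROJECT_COST_USD = 800_000  # open-market estimate quoted in the article
PROJECT_SIZE_MB = 200       # data stored in the 2016 experiment

cost_per_mb = PROJECT_COST_USD / PROJECT_SIZE_MB


def naive_cost_usd(size_mb: float) -> float:
    """Scale the per-megabyte cost linearly (an assumption, not a forecast)."""
    return size_mb * cost_per_mb


print(f"${cost_per_mb:,.0f} per MB")                  # $4,000 per MB
print(f"${naive_cost_usd(1_000_000):,.0f} for 1 TB")  # $4,000,000,000 for 1 TB
```

At that rate, filling the equivalent of a single ordinary hard drive would cost billions of dollars, which is why cheaper synthesis matters so much.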

The nucleus is the powerhouse of data storage // Flickr/ZEISS Microscopy

Though Microsoft may be the only tech giant singing DNA-based storage’s praises, academics remain excited by the possibilities. Dr. Nick Goldman at the European Bioinformatics Institute is driving research to make DNA data storage reliable and competitive. In 2013, a year after Church’s proof of concept, he developed an improved strategy to encode digital data as biological text. His technique included an error-checking procedure to ensure that data can be reliably encoded as well as decoded. Using it, his team was able to store 739 kB of data in DNA and retrieve it with 100% accuracy!
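
One idea often associated with this kind of encoding is to write the data in base 3 and let each digit choose one of the three bases that differ from the previous base, so that no base ever repeats. Runs of the same base are a common source of reading errors, and a repeat in the decoded strand immediately flags a mistake. The Python sketch below illustrates only that idea, under simplifying assumptions (six base-3 digits per byte, a notional starting base of A, hypothetical function names); it is not a reproduction of the published method, which involves further layers of redundancy.

```python
BASES = "ACGT"


def bytes_to_trits(data):
    """Write each byte as six base-3 digits (3**6 = 729 covers 0..255)."""
    trits = []
    for byte in data:
        digits = []
        for _ in range(6):
            digits.append(byte % 3)
            byte //= 3
        trits.extend(reversed(digits))  # most significant digit first
    return trits


def trits_to_bytes(trits):
    """Inverse of bytes_to_trits."""
    if len(trits) % 6 != 0:
        raise ValueError("expected a whole number of 6-digit groups")
    out = bytearray()
    for start in range(0, len(trits), 6):
        value = 0
        for trit in trits[start:start + 6]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def trits_to_dna(trits, start="A"):
    """Each digit picks one of the three bases that differ from the last one."""
    prev, dna = start, []
    for trit in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[trit]
        dna.append(prev)
    return "".join(dna)


def dna_to_trits(dna, start="A"):
    """Recover the digits; a repeated base can only come from an error."""
    prev, trits = start, []
    for base in dna:
        if base == prev:
            raise ValueError("repeated base: the strand contains an error")
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    return trits


def encode_rotating(data):
    return trits_to_dna(bytes_to_trits(data))


def decode_rotating(dna):
    return trits_to_bytes(dna_to_trits(dna))


sonnet = b"Shall I compare thee to a summer's day?"
strand = encode_rotating(sonnet)
print(decode_rotating(strand) == sonnet)
```

This check only catches errors that create a repeated base, so a practical scheme would combine it with stronger safeguards such as storing overlapping copies of every fragment.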

“The latest findings are collapsing down the barriers between the IT revolution and the biotechnology revolution”

Their encoded data included all 154 of Shakespeare’s sonnets, a 26-second audio clip of Martin Luther King’s “I have a dream” speech, and the classic paper on the structure of DNA by Watson and Crick. The latest research, published last year by Columbia University and the New York Genome Center, pushed the envelope even further by encoding an operating system, a movie, and other files totalling 2.14 MB, all of which the researchers were able to retrieve perfectly. The future may well be written in DNA. Nick Goldman has even bet on it! At the 2015 World Economic Forum, he wagered a single bitcoin: the Davos Challenge, as it is called, is to decode the bitcoin encoded in the vials of DNA distributed to the audience. The winner can claim the hidden bitcoin if it is found before the 21st of this month.

The current century is witnessing two major revolutions: the information technology revolution and the biotechnology revolution. Both have changed the world in unimaginable ways. With these latest findings, the boundary between these once-distant fields is breaking down, and the future is bright with remarkable possibilities.