The human genome comprises about 3 billion nucleotides of information that describe how to build and sustain a human. Roughly 2% of those nucleotides code for proteins, which perform the bulk of the work in the cell. The remaining 98% includes instructions for where and when to turn those proteins on and off, acting as the controller or manager of the protein workers. There’s a language written in these nucleotides, but we cannot presently read it. If we could, we’d be far better able to understand how nucleotide differences between individuals relate to our physical differences, including tendencies toward certain diseases. Thus, reading the noncoding genome is a major emphasis of biology research right now.
Fortunately, biologists have designed a variety of clever assays to profile what’s happening to the DNA in a cell. One such assay repurposes DNaseI, a naturally occurring protein that cuts the DNA strand, as a biotechnology tool. To fit 3 billion nucleotides into a cell’s nucleus, the DNA wraps tightly around protein complexes called nucleosomes. DNaseI preferentially cuts DNA in regions where the DNA has loosened from the nucleosomes, usually because interesting proteins are binding in order to turn nearby genes on and off: precisely the events that we’d like to know about! Proteins called “transcription factors” like to bind specific “words” in the DNA, such as TGACTCA for the protein JUN. But the genome is so large that these words occur all over the place, and the data indicate that very few of them are actually bound by proteins. One important question that we tackled here: what determines when and where the proteins will bind?
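To make "these words occur all over the place" concrete, here is a minimal Python sketch. It counts a word on both DNA strands and estimates how often a 7-letter word would appear by chance in 3 billion nucleotides. The function names are hypothetical, and the estimate assumes every base is equally likely and independent, which a real genome is not, so treat the number as a rough order of magnitude only.

```python
# Illustrative sketch only: helper names are hypothetical, and the chance
# estimate assumes uniformly random, independent bases (a simplification).

GENOME_LENGTH = 3_000_000_000  # approximate genome size quoted in the text
MOTIF = "TGACTCA"              # the JUN "word" quoted in the text

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the sequence as read from the opposite DNA strand."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def count_occurrences(sequence, motif):
    """Count windows matching the motif on either strand (overlaps allowed)."""
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(
        1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] in targets
    )


def expected_random_hits(genome_length, motif):
    """Expected number of chance matches under the uniform-base assumption."""
    per_strand = genome_length / 4 ** len(motif)
    # A word identical to its own reverse complement is only counted once.
    strands = 1 if motif == reverse_complement(motif) else 2
    return per_strand * strands


if __name__ == "__main__":
    toy = "AATGACTCAGGTGAGTCATT"  # made-up sequence for demonstration
    print(count_occurrences(toy, MOTIF))
    print(round(expected_random_hits(GENOME_LENGTH, MOTIF)))
```

Under that simplified model, a 7-letter word is expected hundreds of thousands of times across the genome, which is why the presence of the word alone cannot explain where a protein actually binds.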
To answer that question, we enlisted the assistance of machine learning. A team of graduate students would only get so far examining every binding site to assess what makes it special, but a computer can consider millions of sequences without breaking a sweat. Recent work on machine learning algorithms demonstrates that a very flexible family of models called artificial neural networks is a powerful tool for learning a function that maps structured items (like DNA sequences) to their properties (like protein binding), given sufficient examples of that mapping. Since we have many DNaseI experiments (to mark protein binding) and lots of DNA sequence, we have ample training data for the algorithm!
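As a rough illustration of how a DNA sequence can be handed to such a model, the sketch below turns each nucleotide into a vector of four numbers (a "one-hot" encoding) and slides a small filter along the sequence to score every window. This is one common way to present sequences to a neural network, not necessarily the exact setup used in our paper. The function names are hypothetical, and the filter weights are set by hand here purely for demonstration; in a trained network they would be learned from the examples.

```python
# Illustrative sketch only: function names are hypothetical and the filter
# weights are hand-set, whereas a real network would learn them from data.

BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as four 0/1 values in A, C, G, T order.

    Any other character (such as N) becomes all zeros.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def motif_filter(motif):
    """Build a toy filter with weight 1 for each base of the motif, else 0."""
    return one_hot(motif)


def scan(encoded, weights):
    """Slide the filter along the encoded sequence; one score per window."""
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * weights[i][j]
            for i in range(width)
            for j in range(len(BASES))
        )
        scores.append(score)
    return scores


if __name__ == "__main__":
    sequence = "AATGACTCAGG"  # made-up sequence for demonstration
    scores = scan(one_hot(sequence), motif_filter("TGACTCA"))
    best = max(range(len(scores)), key=scores.__getitem__)
    print(f"best window starts at position {best} with score {scores[best]}")
```

With this hand-set filter, the score of a window is simply the number of positions that match the word. A learned filter could instead weight positions unequally, and later layers of a network could combine many such scores, which is one way the surrounding context of a word can enter the prediction.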
In our paper, we benchmarked this approach and found that it worked very well for classifying which sequences would and wouldn’t be bound by proteins in specific types of cells. As in English, DNA words take meaning from their context, and proteins are most likely to bind when the context is most favorable. While these patterns would have been very challenging for a human eye to pick out, a properly tuned algorithm can find them and report back to us with the answer.
Our success in this project energized my colleagues and me to keep pursuing machine learning applications that help humans effectively read the human genome. As biologists continue to generate data describing the functional properties of DNA sequences, we will continue to model those properties until we can accurately predict how genetic variation in human populations affects the way genes are regulated. These are exciting times, as we seek to understand how our genetic material influences who we become and how we can best put that knowledge to use for more effective medicine.