Interview With Computational Biologist and Genomic Scientist Prof. John Quackenbush

John Quackenbush is a professor of biostatistics and computational biology and a professor of cancer biology at the Dana-Farber Cancer Institute, as well as the director of its Center for Cancer Computational Biology. Prof. Quackenbush also holds an appointment as a professor of computational biology and bioinformatics in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health.

“Genomics has transformed biological science not by producing genome sequences and gene catalogs for a range of species, but rather through the development of technologies that allow us to survey, on a global scale, organisms and their gene, protein, and metabolic patterns of expression. The challenge is no longer how to generate these vast bodies of genomic data, but rather in how to best collect, manage, and analyze the data.

As a community, we have a long history of studying biological systems and our best strategy moving forward is to leverage that knowledge so as to best interpret genome scale datasets. Our research group focuses on methods spanning the laboratory to the laptop that are designed to use genomic and computational approaches to reveal the underlying biology. In particular, we have been looking at patterns of gene expression in cancer with the goal of elucidating the networks and pathways that are fundamental in the development and progression of the disease.” – (Source:


The following has been paraphrased from an interview with Prof. John Quackenbush on March 13th, 2018.

(Click here for the full audio version)



How did your background as a physicist prepare you for the study of biology?

One of the things about a PhD in any field is that it should teach you how to think and answer tough questions. When I look at problems in biology, I approach them like a physicist: I take a complex problem, break it down into smaller problems I can solve, and then try to stitch the pieces back together. That is how I approach hard problems in any field.

When I first started working in biology, one of the advantages I had coming from a different discipline was the freedom to ask some basic questions that nobody else in the field was asking. I find that those simple questions that everyone thinks we know the answers to often have incomplete answers, and exploring them can give a lot of insight.

Which field of study is more complex?

There are clearly enormous complexities in physics, but when we break down our understanding of the universe, we have a relatively small number of fundamental particles and forces. As we analyze those we can write down basic rules as to how those particles and forces interact with each other. The challenge is in extrapolating that to understand large scale processes where we have many layers of complex interactions; doing so opens up new areas and shows us that we don’t fully understand how things work.

As an example, we have understood the Ideal Gas Law for a very long time, but applying that to understanding atmospheric dynamics takes us to another level of complexity where we can’t really make accurate predictions.

In the biological sciences we see much of the same thing, but there we don’t have the fundamental understanding of how individual elements interact and what all the fundamental pieces are. But as we start to build up our understanding of how cells work, and ask how an organism functions, we see the same kind of problems in trying to understand complex systems. This phenomenon is often referred to as a system having ‘emergent properties’: even if I understood exactly how a brain cell works and could model it perfectly, I still could not tell you how a brain functions, because the interactions that occur between brain cells give the brain properties that no single cell has on its own.

To quote you, “Every revolution in the history of science has been driven by one and only one thing: access to data.” But it seems this revolution is giving us access to more data than we can make sense of (genomics, proteomics, transcriptomics, etc.). So, will this revolution ultimately be determined less by access and more by figuring out how to interpret all this data?

Access to data is the starting point. If you don’t have it you can’t do anything. But you are right, you can’t do anything with just a pile of data either. One of my favorite quotes is from Henri Poincaré: “Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house.”

That is what we are facing today: we have access to unprecedented quantities of data, but we have to put all of our data together into some kind of coherent framework. That requires both that we understand on a basic level how biological systems work, and that we build thoughtful models that we can test and validate to get a better understanding of what the data are telling us.

And that requires not only data but meta-data. I can sequence the genome of everyone in the world, but if I don’t have information about who goes on to develop diseases, I can’t make sense of what is in that genomic data.

But we also need better models of how cells work and interact, and how the variations we see in gene expression are related to the underlying disease. The data are the raw material: without them we can’t go anywhere, but without the right models we can’t get anywhere either.

How far do you think we are from really being able to understand and apply all that data?

We are using some of it today; the question is how much more effectively we can use it. That will require us to continually refine our models and to generate more meta-data. One of my favorite papers in computer science is a 1997 paper by Wolpert and Macready where they describe ‘no free lunch theorems.’ Essentially what they said is that if you try to develop general-purpose algorithms and then throw data into them expecting something meaningful to come out, nothing generally useful is going to happen. What you need to do is start with a guess about how the system works, and as long as it is a reasonable guess, you can build systems on top of it that lead you to a better understanding.

I think we have to approach biological science the same way we approach anything else: we collect data, we start with a guess, we build theories on top of that, and then we refine and continuously improve them. Even though our initial guess isn’t perfect, hopefully this process will lead to some insight into the diseases we are studying.

People often ask me whether the models we build are right. But rather than getting fixated on that, a better question to ask at each step along the journey is: is the model useful? Does it provide insight into the diseases we are studying?

Genetics by itself is incredibly complex, but many diseases actually seem to be the result of complex interactions between our genetics and the environment. What (if any) tools and methods are being developed that might allow us to integrate all the environmental factors that go into disease with all the genomic data?

To date, a lot of the analysis that has been done has isolated one data type from another to try to build models. But there are statistical models in epidemiology that look at gene-environment interactions, which allow us to start to explore genetic variation and environmental exposure and to estimate the interactions between them, as well as the likelihood that someone will develop a particular disease. Part of the challenge we have is that a lot of the data is incomplete: we don’t know much about a lot of the environmental factors that a person is exposed to. So even there we need to develop better hypotheses so we can test which factors might be relevant.
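The kind of gene-environment interaction model described above can be sketched in a few lines. This is a toy illustration only, not a fitted model from any study: the coefficient values, the binary exposure, and the additive genotype coding are all assumptions made for the example. The key idea is the interaction term, which lets an exposure carry extra risk for people who also carry the genetic variant.

```python
import math

# Hypothetical gene-environment interaction model (illustrative coefficients only).
# Log-odds of disease = b0 + b_g*genotype + b_e*exposure + b_ge*genotype*exposure,
# where genotype counts risk alleles (0, 1, or 2) and exposure is 0 or 1.
COEFFS = {"intercept": -3.0, "genotype": 0.4, "exposure": 0.7, "interaction": 0.9}

def disease_risk(genotype: int, exposure: int) -> float:
    """Return the modeled probability of disease for one person."""
    log_odds = (COEFFS["intercept"]
                + COEFFS["genotype"] * genotype
                + COEFFS["exposure"] * exposure
                + COEFFS["interaction"] * genotype * exposure)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Because of the interaction term, the same exposure raises risk far more
# for someone carrying two risk alleles than for a non-carrier.
```

In a real epidemiological analysis these coefficients would be estimated from cohort data (e.g. by logistic regression), and the hard part, as noted above, is having complete enough exposure data to fit them.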

One of the other interesting things I have been working on is modeling the interactions among genetic variants to determine their cumulative effects on disease phenotypes.

Most of us, when we think about genetics, fall into the trap of thinking about it the way it was presented to us in our first high school biology class. There we were introduced to Mendel and his peas and learned that one gene leads to one trait. In our work, we are seeing that there are networks of interacting genetic variants that together influence phenotype. And that structure matters because it tells us not only which factors work together, but which are most likely to be important. We also find that the networks can change between individuals, and that changes in network structure are informative about the processes that drive disease.

What lessons, if any, does oncology have to teach neurology about how to properly deal with such complex interactions?

First, oncology is easier than neurology. In oncology the starting point is that either you have a tumor or you don’t. In neurological disorders there often isn’t a clear dividing line between disease and not-disease, and so there is often a spectrum of different diseases that get lumped together. That fuzziness makes it harder to search for differences. Another big advantage in oncology is that we have access to disease tissue, because the first line of therapy is surgery. In neurology we don’t have the same access to the primary source of genomic data we want to study.

Nevertheless, what we are learning in oncology is giving us better tools for understanding neurological disorders. One of the great things about modern technology is that it gives us the ability to generate multiple types of data from small samples that we can then use to build complex models. I think we are going to see big advances in our understanding of neurological disorders as our ability to generate data improves. Richer data sets will give us better models and a deeper understanding of these diseases.

Could you elaborate on how individual genes evolved to have a co-dependency with their surrounding genes and what the implications of this are for designing treatments to target gene-polymorphisms associated with diseases?

Within our genome there are about 25,000 genes, and our cells are little machines made from proteins encoded in those genes. So the genes tell us how the cells are functioning, because they determine what proteins are being made. When I look at a brain cell compared to a liver cell, for example, we know that they are carrying out different functions, and we see that because they are activating different genes. Gene expression data lets us find the genes that differ between cell types. When we ask what distinguishes a brain cell from a liver cell, sure enough we find, among other things, that brain cells activate the genes responsible for making neurotransmitters and liver cells don’t. We can then start to build up a catalog of the genes that are activated and use that to tell us about the state of the cell. And we can use exactly the same approach to look at healthy and diseased cells, giving us a window into the biology driving disease.
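The "catalog of activated genes" idea can be sketched very simply: compare expression values between two cell types and keep the genes that stand out. The gene names and expression values below are illustrative only, and the 2x fold-change cutoff is an arbitrary choice for the sketch; real analyses use many samples and statistical tests, not single-value comparisons.

```python
import math

# Toy expression values for a handful of genes in two cell types (made up for
# illustration; units and magnitudes are arbitrary).
brain = {"SNAP25": 950.0, "GAD1": 480.0, "ALB": 2.0, "CYP3A4": 1.0, "ACTB": 600.0}
liver = {"SNAP25": 3.0, "GAD1": 2.0, "ALB": 1200.0, "CYP3A4": 800.0, "ACTB": 650.0}

def log2_fold_change(a: float, b: float, pseudo: float = 1.0) -> float:
    """log2 expression ratio, with a pseudocount so zeros don't blow up."""
    return math.log2((a + pseudo) / (b + pseudo))

# Genes markedly higher in brain than in liver (arbitrary 2-fold cutoff):
# these are the candidate "brain markers" in the catalog.
brain_markers = sorted(g for g in brain if log2_fold_change(brain[g], liver[g]) > 1)
```

Running the same comparison in the other direction would flag the liver genes, and comparing healthy against diseased tissue instead of brain against liver is, as described above, exactly the same computation.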

Another hypothesis that we have is that the genetic variants that we carry are linked to the development of particular diseases. We can look at the intersection between genetics and gene expression, asking whether there are certain genetic variants that are more prevalent in people with disease, and whether those variants alter gene expression.

A few years ago my research group and I started to look at these patterns to see how genes are expressed, which means examining about six million different variants across the 25,000 different genes. We found that certain expression profiles can be linked to the absence or presence of certain genetic variants. In fact, we found hundreds of thousands of regulatory interactions: one variant influencing many genes, and one gene being influenced by many variants. So we asked, how do we model all of these interactions?

Our answer was that we needed to create a graph to represent the associations between the variants and genes. Of course, at first glance, it was a big mess of interactions. But we looked more deeply at the structure of that graph and noticed that it was telling us something about how certain genetic variants worked together to influence a certain gene and vice-versa. The more we looked at the structure of the graph, the more we were able to learn.
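The graph described here is bipartite: variants on one side, genes on the other, with an edge for each association. A minimal sketch of that structure, using made-up variant and gene names, is just an adjacency map kept in both directions, which directly captures the "one variant influences many genes, one gene is influenced by many variants" pattern.

```python
# Toy variant-gene associations (hypothetical identifiers; a real graph would
# hold hundreds of thousands of such edges).
associations = [
    ("rs1", "GENE_A"), ("rs1", "GENE_B"),   # one variant influencing many genes
    ("rs2", "GENE_A"),                      # one gene influenced by many variants
    ("rs3", "GENE_C"),
]

# Build the bipartite adjacency in both directions.
genes_of: dict[str, set[str]] = {}     # variant -> genes it is associated with
variants_of: dict[str, set[str]] = {}  # gene -> variants associated with it
for variant, gene in associations:
    genes_of.setdefault(variant, set()).add(gene)
    variants_of.setdefault(gene, set()).add(variant)
```

Keeping both directions makes the two questions in the text (what does this variant touch, what touches this gene) each a single lookup, which is what lets you start examining the structure of the graph at all.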

One particularly striking thing we noticed is that there seemed to be communities of interaction. A good analogy to this is a mobile phone network. Any phone can call any other phone in the world, but when you examine the patterns you see that certain numbers tend to call certain other numbers far more frequently than you would expect. For example, you call your family members, who call each other, but your family never calls me. In the language of graphs, we would say you have a ‘family community’ that captures your interactions.

When we built our genetic variant-gene graphs, we saw something very similar. Genes and genetic variants group together into communities, and those communities tend to be associated with specific functions within cells. This moves us from an old model in which a single genetic variation might influence a single gene to a model in which a group of genetic variants could not just shift a particular gene, but could actually change how a cell functions. This has allowed us to move from looking at one variant and one trait to examining families of variants that increase the likelihood of a person being tall or developing a disease like Parkinson’s.
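The community idea can be sketched in its crudest form. Real analyses of these graphs use proper community-detection algorithms (for example, modularity-based methods); the sketch below, on made-up data, uses connected components instead, the simplest possible notion of a group, just to show how structure falls out of the edge list.

```python
# Toy variant-gene edges forming two obvious groups (hypothetical names).
edges = [
    ("rs1", "GENE_A"), ("rs2", "GENE_A"), ("rs1", "GENE_B"),  # group 1
    ("rs3", "GENE_C"), ("rs4", "GENE_C"),                     # group 2
]

# Undirected adjacency over both variants and genes.
adj: dict[str, set[str]] = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def components(adj):
    """Return the connected components of the graph via depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps

groups = components(adj)  # two groups: {rs1, rs2, GENE_A, GENE_B} and {rs3, rs4, GENE_C}
```

In the phone-network analogy, a connected component is everyone reachable through some chain of calls; a true community algorithm additionally asks who calls whom *more often than expected*, which is what separates a family from the rest of the network.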


Click here to learn more about the work of Prof. Quackenbush or watch this excellent lecture on how his team is using data analytics to better understand disease.
