One of the biggest challenges with research into the genetic causes of illness used to be the time and cost of genomic sequencing. Today, thanks to significant advances in technology, the problem is no longer collecting the data; it’s mining the huge amount of information that can be pulled from a single patient sample to identify meaningful insights into health and disease. 

Dr. Keegan Korthauer

Dr. Keegan Korthauer is developing new tools and strategies to enable researchers to sift through large volumes of data and draw more accurate conclusions. An expert in biostatistics, she works with scientists across many fields of child health research, including childhood cancer, epigenetics and childhood development, to help them analyze their data more effectively and improve children’s health in BC and beyond.

Dr. Korthauer is a new investigator with the Healthy Starts research theme at BC Children’s Hospital and is an assistant professor in the department of statistics at the University of British Columbia.

We spoke to her about her work and how new statistical methods are able to highlight the results that are most important for children’s health in a process she describes as “separating the signal from the noise.”

Why are new statistical methods needed?

The technology used in health research today is like night and day compared to ten years ago, particularly in genetics.

Earlier studies would examine a single gene in a handful of patients, but now, studies can harness the entire genome of hundreds or even thousands of patients. Today, the challenge is to put all the information together and draw useful and relevant conclusions.

When measuring enormous amounts of data, the chances are much higher that you will find false positives – correlations that seem significant, but actually aren’t related to the disease. So it becomes all the more important that researchers use effective statistical tools to weed out these false positives.
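The scale of the false-positive problem can be illustrated with a small simulation. This is a minimal sketch, not Dr. Korthauer's own method: it assumes a hypothetical study of 10,000 genes in which none is truly related to the disease, so every p-value is just a random number between 0 and 1. It then compares a naive 0.05 cutoff with the Benjamini-Hochberg procedure, a standard correction chosen here only for illustration.

```python
import random


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the sorted indices of tests rejected by Benjamini-Hochberg."""
    m = len(p_values)
    # Rank the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at most (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every test ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])


rng = random.Random(0)  # fixed seed so the run is repeatable
n_tests = 10_000        # hypothetical number of genes tested

# Assumption: no gene has a real effect, so p-values are uniform on [0, 1).
null_p_values = [rng.random() for _ in range(n_tests)]

naive_hits = [i for i, p in enumerate(null_p_values) if p < 0.05]
bh_hits = benjamini_hochberg(null_p_values)

print("naive 0.05 cutoff:", len(naive_hits), "apparent discoveries")
print("Benjamini-Hochberg:", len(bh_hits), "apparent discoveries")
```

With no real effects present, the naive cutoff still flags roughly 5% of the genes, several hundred in total, while the corrected procedure flags far fewer.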

What do you mean by the “signal” and the “noise”?

Large amounts of data can generate two types of information. The “signal” is the real information that we want to track, such as the DNA changes that lead to disease. The “noise” is all the other information that interferes with it, such as false positives and statistically insignificant variations.

An example of this noise in the data is known as a batch effect. Batch effects happen when conditions change between the collection of different pieces of data. If researchers are studying how a person’s genes affect their hormone levels, they might set up a study where patients come into the clinic to provide a blood sample. Patients who visited the clinic on a Monday, when the weather was sunny and warm, may have very different hormone levels than those visiting on a Wednesday, when it’s rainy and cold. It could be difficult for the team analyzing the data to tell whether any observed changes in hormone levels were due to DNA differences or were instead a result of the day of the week, the weather, or a change in the technicians collecting the samples. In other words, the signal – in this case the DNA changes that lead to differences in hormone levels – is complicated by the noise, which is the change in hormone levels caused by environmental factors.

Scientists usually design their experiments to reduce this kind of noise, but when deciphering especially complicated data such as epigenetic changes, more sophisticated statistical methods may be necessary to amplify the signal.
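The clinic example above can be sketched as a short simulation. Everything in it is a labeled assumption made for illustration: the hormone values, the size of the Monday shift, and the uneven split of a hypothetical DNA variant across the two days. The variant is given no real effect on hormone levels, so any difference the naive comparison finds is pure batch effect.

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical setup: (day, shift in hormone level that day,
# fraction of that day's patients who carry the DNA variant).
batches = [("Mon", 5.0, 0.8), ("Wed", 0.0, 0.2)]

samples = []
for day, day_shift, carrier_fraction in batches:
    for _ in range(500):
        has_variant = rng.random() < carrier_fraction
        # The variant has NO effect here: only the day and random
        # measurement noise change the hormone level.
        hormone = 10.0 + day_shift + rng.gauss(0, 1)
        samples.append((day, has_variant, hormone))


def group_difference(rows):
    """Mean hormone level of carriers minus that of non-carriers."""
    carriers = [h for _, v, h in rows if v]
    others = [h for _, v, h in rows if not v]
    return mean(carriers) - mean(others)


# Naive analysis: pool all patients and ignore the collection day.
naive_diff = group_difference(samples)

# Batch-aware analysis: compare carriers and non-carriers within each
# day, then average the two within-day differences.
within_day_diff = mean(
    group_difference([row for row in samples if row[0] == day])
    for day in ("Mon", "Wed")
)

print(f"pooled difference:     {naive_diff:.2f}")
print(f"within-day difference: {within_day_diff:.2f}")
```

The pooled comparison reports a large difference between carriers and non-carriers even though none exists, while comparing within each day brings the estimate back close to zero. Real batch-correction methods are more involved than this within-day comparison.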

How does your work address these challenges?

Our methods act like a filtration system to better separate true results from background noise.

Water out of the tap is fine to drink, but it tastes better when filtered. In the same way, these tools provide a second layer of analysis to purify the data and highlight the findings that are actually relevant and significant.

Earlier in my career, a paper was published that came to some quite startling conclusions about epigenetics. In fact, it directly challenged some foundational theories in the field. Curious about whether the signal was hidden beneath the noise in these data, I tried implementing more rigorous statistical methods. The additional analysis ended up changing the overall results, bringing them more in line with what had been previously suspected and demonstrating the importance of a rigorous analytical approach.

What research are you concentrating on now?

Epigenetics is one of the key focuses of my work here at BC Children’s, which aims to understand how environmental factors and early-life experiences can affect children’s health over the course of their lives.

Environmental exposures can influence whether the genes in DNA are “turned on” or “turned off.” Studying these processes can improve our understanding of the origin of many different types of childhood diseases and health conditions and potentially lead to new ways to diagnose these conditions at an earlier stage when intervention is often more effective.

When we think of DNA, we think of genes in a nice sequential line, with gene A preceding gene B and so on. Unfortunately, with epigenetics it’s not that simple. The chemical markers that alter gene expression can be outside the genes entirely. So when we study a child’s DNA to look for changes in gene expression, we can’t just look at the gene itself; we have to look all around it to identify these epigenetic markers.

Multiple epigenetic changes can also work together to affect whether a gene is turned on or not, and scientists are still trying to understand the extent of their influence and reach. Are five chemical tags right next to a gene more or less influential than a cluster of hundreds of chemical tags further away?

Questions like these can make analyzing epigenetic data a lot less straightforward and much more reliant on careful statistical analysis to make sure that we come to the right answers for the benefit of children’s health.

What brought you to BC Children’s?

Researching how to improve children’s health here at BC Children’s is ideal for me. I always wanted to be in a place where researchers, computational experts and patients are all together. It’s very rare in science to work with people who are interested in similar questions and who are also directly treating children. This both reminds us why we do the work we do and, on a practical level, means that our work can help children directly.