Issue 23 / 24 June 2013

Star Trek: The Next Generation fans will wonder whether the phrase “Big Data” is a descriptor for a new sentient android of epic proportion, a supersized upgrade of Lieutenant Commander Data.

As an addicted Trekkie, sadly, I must quickly disabuse you — even though there are people who consider Big Data to be every bit (byte?) as exciting as the USS Enterprise’s second officer.

Big Data actually refers to immense datasets that are collected in fields as diverse as astronomy and genomics. As Wikipedia tells it, “as of 2012, every day 2.5 quintillion (2.5×10¹⁸) bytes of data were created”, so there are a lot of data about.

The dynamic of Big Data is the search for relationships among these data, teasing out correlations that may not be obvious from any of the constituent datasets alone. Our technical capacity to search immense data repositories means that correlations can be found in a way never before possible.

In their new book Big Data: a revolution that will transform how we live, work and think, Viktor Mayer-Schönberger, an internet governance academic from Oxford, and Kenneth Cukier, the data editor of The Economist, recount an interesting example of how Big Data, collected by Google from the three billion search requests it receives each day, was used to track influenza in the US.

Google took the 50 million most common search terms used by Americans and compared the list with Centers for Disease Control (CDC) data on the spread of seasonal flu between 2003 and 2008.

After stupendous computer activity, they settled on 45 search terms that were strongly correlated with official figures. These included many obvious terms, such as flu, cough and medications for cough, but also others that were not so obviously linked.
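For readers curious about what "strongly correlated" means in practice, the idea can be sketched in a few lines of Python. This is a toy illustration only, not Google's actual method, which was far more elaborate. The function names (`pearson`, `top_correlated_terms`) and all the weekly numbers are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def top_correlated_terms(term_counts, official_cases, k):
    """Rank search terms by correlation with official case counts; keep the top k."""
    scored = [
        (term, pearson(counts, official_cases))
        for term, counts in term_counts.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical weekly figures, invented for illustration only.
official_cases = [10, 20, 40, 80, 40, 20]
term_counts = {
    "flu": [100, 200, 400, 800, 400, 200],
    "cough": [50, 60, 90, 150, 100, 70],
    "holiday deals": [300, 280, 260, 240, 220, 200],
}

best = top_correlated_terms(term_counts, official_cases, k=2)
for term, r in best:
    print(f"{term}: r = {r:.2f}")
```

In this made-up example, "flu" and "cough" rise and fall with the official counts and so rank highest, while "holiday deals" does not. Note that the ranking says nothing about why a term tracks the official figures, which is exactly the correlation-versus-causation question raised later in this article.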

As Mayer-Schönberger and Cukier point out in their book, unlike the CDC, “they could tell it in near real time, not a week or two after the fact”. Although not without its critics and errors, Google Flu Trends is now available for many countries.

The authors concede that there is no universally accepted definition of Big Data, but rather see the term as referring to “things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and government, and more”.

Our capacity to collect, link and analyse data electronically is growing exponentially. Mayer-Schönberger and Cukier draw a parallel between the present and the era that followed the invention of the Gutenberg printing press around 1439.

They quote an estimate that, in the half-century starting in 1453, eight million books were printed, “more than all the scribes of Europe had produced since the founding of Constantinople 1,200 years earlier”.

In 2003, following a decade of effort, the human genome was sequenced — “Now… a single facility can sequence that much DNA in a day”.

And because Big Data includes all the data available, population samples will no longer be needed in the way they are today and the work of statisticians will be redefined.

There are many features of Big Data to ponder for medicine. How will we practise with more information about correlation and less about causation?

If Big Data shows that people who take regular exercise have better cancer survival, what will we advise our patients? Is the correlation sufficient to advise them to exercise, even though the causal pathway is not known?

This will increase our need, and that of our patients, to live with uncertainty.

What meaning do privacy and even confidentiality have in this new age?

We should surely be thinking about and discussing these things now.

 

Professor Stephen Leeder is the editor in chief of the MJA and professor of public health and community medicine at the University of Sydney.

Jane McCredie is on leave.
 

Potential COI: The author’s son Nick is country director of Google France.
 


Poll

Does Big Data have the potential to make major improvements in medicine?
  • Yes - if we can harness it (79%, 49 Votes)
  • Don't know (13%, 8 Votes)
  • Maybe - it's so overwhelming (8%, 5 Votes)

Total Voters: 62


3 thoughts on “Stephen Leeder: The Big Data trek”

  1. Dr George Margelis says:

    As another “trekkie” and someone involved in Health IT in a major way over the last decade or so, I too am excited by the potential of “big data” to improve healthcare. However, it is very important to ensure we are not swept up by the hype of the IT world, which so far has predicted that “big data” will solve all the world’s problems, and even some we don’t know are problems yet.

    Another good book on “big data” is Nate Silver’s “The Signal and the Noise”. Mr Silver is famous for accurately predicting the US election at a very granular level through his use of data. The key message is that we still need people with insight into problems to ensure we don’t get baffled by the sheer volume of data available. The fact is that much of the data will really only be noise. The key is to be able to find the signal that is relevant within that noise.

    As medical doctors we are trained to focus on what is important in the mass of information we receive about our patients. It may well be that medical training is the best training for “big data” scientists. At the very least, in healthcare it is important that we maintain our focus on finding the right signal in the “big data” deluge we are soon to experience, and make sure we use it to deliver better outcomes.

    http://www.georgemargelis.com

  2. Tim Churches says:

    Yes, although Google Flu Trends got it famously wrong for the northern hemisphere flu season this year – see http://www.nature.com/news/when-google-got-flu-wrong-1.12413 and Nick Bilton’s much-cited NYT article about the perils of Big Data sans Context: http://bits.blogs.nytimes.com/2013/02/24/disruptions-google-flu-trends-shows-problems-of-big-data-without-context/

    I agree about the potential – but some human-to-human communication is required to glean the all-important context before letting loose the computer on all that data.

  3. Michael Gliksman says:

    An erudite analysis of the unrealised potential of very large data sets.
