Laplace and the Big Data fallacy
Earlier this week I was in Singapore, attending the MRSS Asia Research Conference, which this year focused on the theme of Big Data. There was an interesting range of papers, including ones linking neuroscience, Behavioural Economics, and ethnography to Big Data.
One reference that was repeated by several of the speakers, including me, was IBM’s four Vs, i.e. Volume, Velocity, Variety, and Veracity. Volume is a given: big data is big. Velocity relates to the speed at which people want to access the information. Variety reminds us that Big Data includes a mass of unstructured information, including photos, videos, and open-ended comments. Veracity relates to whether the information is correct or reliable.
However, as I listened to the presentations, and whilst I heard at least three references to the French mathematician and philosopher René Descartes, my mind turned to another French mathematician, Pierre-Simon Laplace. In 1814, Laplace put forward the view that if someone were (theoretically) to know the precise position and momentum of every atom, it would be possible to calculate their future positions, a philosophical position known as determinism. Laplace was shown to be wrong, first by the laws of thermodynamics, and secondly and more thoroughly by quantum mechanics.
The assumption underlying much of Big Data seems to echo Laplace’s deterministic views, i.e. that if we have enough data we can predict what will happen next. A corollary to this proposition is a further assumption that if we have more data, then the predictions will be even better. However, neither of these is necessarily true.
There are several key factors that limit the potential usefulness of big data:
- Big Data only measures what has happened in a particular context. Mathematics can often use interpolation to produce a reliable view of the detail of what happened. However, extrapolation, i.e. predicting what will happen in a different context (e.g. the future), is often problematic.
- If you add random or irrelevant data to a meaningful signal, then the signal becomes less clear. The only way to process the signal is to remove the random or irrelevant data. If we want to measure shopping behaviour and we collect everything we can, then we can only make sense of the data by removing the elements irrelevant to the behaviour we are trying to measure; bigger isn’t always better.
- If the data we collect are correlated with each other (i.e. they exhibit multicollinearity), then most mathematical techniques will not attribute the contributions of the individual factors correctly, rendering predictions unstable (a short sketch after this list illustrates the effect).
- Some patterns of behaviour are chaotic. Changes in the inputs cause changes in the outputs, but not in ways that are predictable.
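To make the multicollinearity point concrete, here is a minimal sketch in Python (my own toy illustration, not anything presented at the conference). Two predictors that are near copies of each other have to share the true weight between them, and the way they share it jumps around from sample to sample, even though the overall fit stays reasonable.

```python
# Toy example: near-duplicate predictors make individual regression
# coefficients unstable, even though the fitted predictions look fine.
import numpy as np

rng = np.random.default_rng(0)

for trial in range(3):
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)   # almost perfectly correlated with x1
    y = 3 * x1 + rng.normal(scale=0.5, size=n)  # the true signal uses x1 only

    X = np.column_stack([x1, x2])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The two coefficients should "share" the weight of 3, but the split
    # changes from sample to sample.
    print(f"trial {trial}: coef_x1 = {coefs[0]:+.2f}, coef_x2 = {coefs[1]:+.2f}")
```

In a designed survey we can try to avoid collecting two versions of the same variable; in a found Big Data set the overlaps are rarely that obvious.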
One of the most successful organisations in using Big Data has been Tesco. For almost 20 years, the retailer has been giving competitors and suppliers a hard time by utilising the data from its Clubcard loyalty scheme. Scoring Points (the book about Tesco written by Clive Humby and Terry Hunt) shows that one key to Tesco’s success was that they took the four points above into account.
Tesco simplified the data, removed noise, and categorised the shoppers, the baskets, and the times of day. Their techniques are based on interpolation, not extrapolation, and they extend the area of their knowledge by trial and error.
Big Data is going to be increasingly important to marketers and market researchers. But its usefulness will be greater if people do not over-hype it. More data is not necessarily better. Knowing what people did will not necessarily tell you what they will do. And knowing what people did will often not tell you why they did it, or what they might do if the choice is repeated or varied.
Marketers and market researchers seduced by the promise of Big Data should remember Laplace’s demon – and realise that the world is not deterministic.
5 thoughts on “Laplace and the Big Data fallacy”
Aren’t these four points really part of ALL research projects? Garbage in, garbage out. Poor understanding of how statistics work, garbage out. Poor understanding of human behaviour, garbage out. I really don’t see how “big” data is different from any other data.
Hi Annie, I would agree up to a point. But I think there are issues that make Big Data stand out from regular research.
1) Regular research is not necessarily a rear-view mirror; we can present new choices to a respondent and try to find out which they would pick. It is not perfect, or even close, but it is not as ontologically rooted in the past as Big Data, which records one timeline (the real one) and what happened in that timeline.
2) Garbage in Garbage out, yes! But also, good data plus garbage in, often results in garbage out. The chase with Big Data is to collect everything, but collecting the irrelevant risks making the good data hard or impossible to analyse.
3) Yes, in regular research we know that correlated variables cause problems. But in a controlled experiment (e.g. a survey) we can try to reduce multicollinearity. Big Data, by contrast, collects the world as it is, which is massively inter-correlated and subject to feedback loops, making cause and effect very difficult to define and often impossible to measure.
4) The point about chaotic systems relates to the over-hyping of Big Data. Many systems (perhaps most) are not deterministic, so there is a limit to what the data can tell us. If we look at the tides, we can tell with near certainty what time the tide will come in, and we can be pretty sure how far up the beach it will reach, but we have no idea what the shape of the individual waves will be.
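To illustrate that last point (my addition, not part of the original exchange), here is a tiny Python sketch of a deterministic but chaotic system, the logistic map: two starting points a billionth apart end up on completely different trajectories within a few dozen steps.

```python
# The logistic map with r = 4 is fully deterministic, yet trajectories
# that start almost identically diverge rapidly.
def logistic_step(x, r=4.0):
    return r * x * (1.0 - x)

a, b = 0.2, 0.2 + 1e-9   # two starting points a billionth apart
for step in range(1, 41):
    a, b = logistic_step(a), logistic_step(b)
    if step % 10 == 0:
        print(f"step {step:2d}: a = {a:.6f}  b = {b:.6f}  gap = {abs(a - b):.6f}")
```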
A great summary of some of the issues in big data, many of which come down to the difference between data and information.
If you think about the camera in your mobile phone, then each year you are likely to get a new model with an even better camera (more pixels). So each year your phone collects more ‘data’, but each picture, even with its higher resolution, arguably contains only the same amount of useful information.
Likewise, Annie & yourself allude to the problem of the signal-to-noise ratio. We may have more data, but is it useful and does it help provide clearer answers? Many studies (e.g. in the medical profession) have shown that more information often leads to poorer decision making, as even experts tend to be drawn to data which have a marginal impact on the decision, rather than focusing on the few salient data points which explain most of what is happening.
Even with big data, less is probably more. One of the challenges will surely be to filter out as much of the noise as possible. This is becoming more difficult as we collect more data.
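As a rough illustration of that point (my sketch, not taken from the studies mentioned above), padding a small dataset with irrelevant columns makes out-of-sample predictions worse, not better:

```python
# One useful predictor plus a growing number of irrelevant noise columns:
# the more noise we feed an ordinary least-squares fit, the worse it
# predicts new data.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test = 60, 1000
signal_train = rng.normal(size=n_train)
signal_test = rng.normal(size=n_test)
y_train = 2.0 * signal_train + rng.normal(size=n_train)
y_test = 2.0 * signal_test + rng.normal(size=n_test)

for n_noise in (0, 10, 40):
    X_train = np.column_stack([signal_train[:, None], rng.normal(size=(n_train, n_noise))])
    X_test = np.column_stack([signal_test[:, None], rng.normal(size=(n_test, n_noise))])
    coefs, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    mse = np.mean((X_test @ coefs - y_test) ** 2)
    print(f"{n_noise:2d} irrelevant columns -> out-of-sample error {mse:.2f}")
```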
We agree completely on big data being over-hyped. That’s why I see zero difference between big data and all data. All data must be treated with rigor and brains. Without that, any data is just a big dung pile.