Problems in Identifying Causality in Observational Data – BigSurv18

Posted by Ray Poynter, 27 October, 2018

Earlier today I gave a presentation at the BigSurv18 Conference in Barcelona, Spain. The conference was themed “Big Data Meets Survey Science” and comprised one day of short courses and two days of presentations. The majority of attendees and presenters were from academia, which perhaps explains why it included a Saturday. I enjoy attending some conferences outside of my comfort zone, because I feel I can learn different things and help break down the silos between disciplines.

You can download a copy of my presentation “Problems in Identifying Causality in Observational Data” by clicking here.

Although I am a fan of observational data (all the way from Big Data through to ethnographic investigations), we need to be aware that observations based on the real world can produce findings that are difficult to unpick. This is also the topic of a blog post I wrote last week, “Does Running Damage Your Heart? Another example of the problems of using observational data to infer cause and effect”.

Interesting topics that arose at the conference included:

  • Rene Bekkers from the Vrije Universiteit Amsterdam shared results from many studies looking at trust in society. He showed that using a 10-point scale generates higher levels of trust than 4- and 5-point scales.
  • Topic modelling seemed to be the most fashionable approach. It is a bit like cluster analysis applied to concepts in the data: for example, take words, pick the key ones, then cluster them into topics.
  • Survey-assisted models (using surveys to add the missing element to Big Data models) is a nice name for an idea that has become more common over the last few years.
  • Most of the big data was much smaller than we see at other conferences. The size ranged from thousands of observations or cases to millions, but few had more than 10 million – i.e. they were data sets that can be held on a single computer.
  • Twitter was the most common source of social media data, with convenience outweighing rigour.
  • I was impressed to hear about Statistics Netherlands’ innovation portal. The portal, which is in English, showcases lots of innovative experiments and pilot projects they are carrying out, for example from their Center for Big Data Statistics.
  • The average level of statistical ability was much higher than at most other conferences I attend. For example, check out the error formula for non-ignorable non-response error, from a presentation by Michael Bailey from Georgetown University.
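
To make the topic modelling point above a little more concrete, here is a minimal sketch of the "take words, pick the key ones, then cluster them into topics" idea. It is only an illustration under stated assumptions: the four documents, the stopword list, and the function names are all made up for this example, and the clustering is a simple overlap rule rather than any method presented at the conference. Real topic models (LDA, for instance) are probabilistic and considerably more sophisticated.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus, invented for illustration (not conference data).
DOCS = [
    "The survey response rate fell as the survey got longer",
    "Longer survey questions lowered the response rate",
    "Tweets about the election spiked on election night",
    "Election tweets were mostly retweets of news accounts",
]

# A tiny, assumed stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "as", "got", "on", "of", "were", "about", "a", "and"}


def key_words(docs, min_docs=2):
    """Map each non-stopword found in at least `min_docs` documents
    to the set of document indices that contain it."""
    where = defaultdict(set)
    for i, doc in enumerate(docs):
        for word in re.findall(r"[a-z]+", doc.lower()):
            if word not in STOPWORDS:
                where[word].add(i)
    return {w: d for w, d in where.items() if len(d) >= min_docs}


def jaccard(a, b):
    """Share of documents two words have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)


def cluster_words(where, threshold=0.5):
    """Single-link clustering: a word joins (and merges) every existing topic
    that holds at least one word with a similar enough document set."""
    topics = []
    for word in sorted(where):
        linked = [
            t for t in topics
            if any(jaccard(where[word], where[other]) >= threshold for other in t)
        ]
        merged = {word}.union(*linked)
        topics = [t for t in topics if t not in linked] + [merged]
    return sorted(sorted(t) for t in topics)


if __name__ == "__main__":
    for topic in cluster_words(key_words(DOCS)):
        print(topic)
```

On this toy corpus the words that appear in both survey documents end up in one group and the words shared by the two Twitter documents in another, which is the intuition behind calling it cluster analysis applied to concepts.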