Last year, we wrote about Flu Trends, Google's search-engine-based influenza barometer. The takeaway: after calibrating its results against figures from the Centers for Disease Control and Prevention--which are based on emergency-room visits--Google did a pretty good job of predicting a flu outbreak, and did so quickly, without waiting for hospital reports to reach the CDC and be compiled into a weekly summary.
But how did Google's algorithm fare during this year's fierce outbreak? Flu reached epidemic levels in January and--pertinent for the Flu Trends algorithm--was widely covered in the news. Would that coverage prompt healthy people to type influenza keywords into their search bars, thereby skewing Google's results? We updated our graphic from last year to find out. Indeed, Google's estimate appears to have wildly overstated the outbreak's severity, exceeding the CDC's figures by nearly a factor of two. This comparison assumes that sick people visited the hospital at the same rate as in previous years--in other words, that the CDC isn't suddenly under-reporting influenza.
The result provides a cautionary tale for big data: if the data set doesn't cleanly map onto the underlying terrain--that is, if people search for "flu symptoms" because they've heard flu is going around, and not only because they have it--the data won't yield reliable conclusions.
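To make that mechanism concrete, here is a toy sketch in Python. It is not Google's actual model, which the article does not describe; it assumes a simple one-parameter calibration (estimated illness = k times search volume) fitted on seasons in which only sick people search. Every number and name below is hypothetical, and the "curious" search volume was chosen by hand so that the overstatement comes out at a factor of two, mirroring the gap described above rather than reproducing any real data.

```python
def fit_scale(search_volume, reported_illness):
    """Least-squares slope through the origin: illness ~ k * searches."""
    num = sum(s * c for s, c in zip(search_volume, reported_illness))
    den = sum(s * s for s in search_volume)
    return num / den

# Hypothetical calibration seasons: every flu search comes from a sick person.
SEARCHES_PER_SICK = 0.5                      # made-up searches per unit of illness
past_illness = [1.0, 2.0, 3.0, 4.0]          # made-up reported illness levels
past_searches = [SEARCHES_PER_SICK * x for x in past_illness]

k = fit_scale(past_searches, past_illness)   # calibrated scale factor

def estimate(true_illness, curious_searches):
    """Search-based estimate when healthy people also search for flu terms."""
    searches = SEARCHES_PER_SICK * true_illness + curious_searches
    return k * searches

true_level = 4.0
quiet = estimate(true_level, curious_searches=0.0)  # no news-driven searching
hyped = estimate(true_level, curious_searches=2.0)  # heavy news coverage (made up)
overstatement = hyped / true_level

print(f"scale k = {k:.2f}")
print(f"quiet-season estimate = {quiet:.2f} (true level {true_level:.2f})")
print(f"news-heavy estimate = {hyped:.2f}, overstated {overstatement:.1f}x")
```

In this sketch the calibration is perfect for as long as searches track illness. Once searches from healthy but curious people are added, the same scale factor turns them into phantom cases, and nothing in the search data alone reveals the difference.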
Neither Google nor the CDC could be reached for comment.