One of the most important principles of Big Data, says David Brooks, writing in the New York Times (4.16.13), is the move from causality to correlation. We no longer need to find the reason why A causes B; we just have to observe that it does.
The theory of big data is to have no theory, at least about human nature. You just gather huge amounts of information, observe the patterns and estimate probabilities about how people will act in the future.
One of the most interesting events in the Big Data arena was a recent (2011) debate between Noam Chomsky, renowned academic and pioneer in the field of linguistics, and Peter Norvig, Director of Research at Google. Chomsky contended that language is hard-wired in the brain at birth, and that to build an artificial intelligence it would be necessary to understand that circuitry so that it could be reproduced. Nonsense, said Norvig. All we have to do is track and monitor the expressions of language throughout the world, and we can very easily understand how, generically, it is constructed:
Chomsky decried modern AI’s emphasis on statistical models of huge data sets as modeling devoid of understanding. Norvig battled back, emphasizing that statistical modeling can indeed mimic incredibly complex systems like human language. If you have a cache of a trillion words, as Google does, surely there can be statistical models that can accurately reflect language. (Edward Larkin, Mind and Myth, October 2011)
In other words, we do not have to bother with figuring out why we speak; only with how we speak. Brooks writes:
As Viktor Mayer-Schönberger and Kenneth Cukier write in their book, “Big Data,” this movement asks us to move from causation to correlation. People using big data are not like novelists, ministers, psychologists, memoirists or gossips, coming up with intuitive narratives to explain the causal chains of why things are happening. “Contrary to conventional wisdom, such human intuiting of causality does not deepen our understanding of the world,” they write.
The challenge is constructing predictive models from a past for which there is little statistical data. Netflix can create House of Cards, its series built on billions of bits of data mined from its streaming viewers – what music is preferred for what type of scene, how viewers want a love story to end, what the preferred balance is between action and drama – but Pentagon planners still cannot predict what North Korea is going to do. They simply do not have enough statistical information, and the search words ‘dictator’, ‘despot’, ‘nuclear’, ‘family’, ad infinitum will never be enough in whatever combination to predict the future from the past.
Yet the Pentagon does not have to rely solely on ‘intuitive narratives’, for a close reading of history – and the volumes written about every period – can sharpen insight and improve predictability. For example, if a computer scans and records all the works on topics related to North Korea – from the marauding expansionism of Genghis Khan to the isolationism of the Qing Dynasty; through the family disputes and palace coups from Henry II to Henry VIII; from Bismarck to Hitler; etc. – certain heretofore hidden trends are sure to be revealed which can suggest Kim Jong-un’s likely moves.
The predictability becomes even more robust if an algorithm is written which correlates the psycho-social determinants of individual human behavior (ego, pride, selfishness, ambition, etc.) with individual political behavior (‘despot’, ‘tyrant’, ‘lust for power’), and then correlates both with actual historical events.
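To make the idea concrete, here is a purely illustrative sketch, in Python, of the kind of correlation such an algorithm might compute. Every regime name, trait score, and outcome below is a made-up placeholder, not a real historical measurement.

    import numpy as np

    # Hypothetical, made-up scores: each regime rated on three psycho-social
    # traits, plus a crude 1/0 outcome for whether it turned expansionist.
    # Columns: ambition, isolationism, family feuding, expansionist outcome
    regimes = {
        "Regime A": (0.9, 0.1, 0.3, 1),
        "Regime B": (0.2, 0.9, 0.1, 0),
        "Regime C": (0.8, 0.3, 0.8, 1),
        "Regime D": (0.4, 0.7, 0.2, 0),
    }

    data = np.array(list(regimes.values()), dtype=float)
    traits, outcome = data[:, :3], data[:, 3]

    # Correlate each trait with the historical outcome
    for name, column in zip(("ambition", "isolationism", "family feuding"), traits.T):
        r = np.corrcoef(column, outcome)[0, 1]
        print(f"{name:>15}: r = {r:+.2f}")

With real inputs, of course, the trait scores would themselves be extracted from the scanned histories, and the regimes would number in the thousands rather than four.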
Some defenders of intuition and critics of Big Data say that personal insight will always be necessary, for some real, live person has to come up with the hypotheses that are the basis for research algorithms. No, say the Big Data people. Scanning all major works of history of any conceivable relevance (leaving out, say, the history of the English garden) will pick up trends and patterns, some familiar and some surprising, and many very useful. This is no different from Peter Norvig’s approach to language – don’t look for anything; just see how language is used in the roughly 7,000 languages extant in the world, pick up the similarities in grammar, vocabulary, and usage, and derive a general theory.
The same approach can be applied to an analysis of history – don’t look for anything, but determine what activities, personalities, behavioral traits, environmental factors, etc. are common to all kingdoms, dynasties, republics, and dictatorships.
If this sounds like a big task, for now it is. We are in an embryonic period of data collection and analysis. It still takes time to scan and electronically record all the books in the Library of Congress, but eventually the number of new electronic books will far surpass those in print; and those in print will all soon be transcribed.
Brooks, however, misjudges and overemphasizes the importance of the human observer in this process of correlation. He writes:
In my columns, I’m trying to appreciate the big data revolution, but also probe its limits. One limit is that correlations are actually not all that clear. A zillion things can correlate with each other, depending on how you structure the data and what you compare. To discern meaningful correlations from meaningless ones, you often have to rely on some causal hypothesis about what is leading to what. You wind up back in the land of human theorizing.
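Brooks’s worry is easy to see in a toy example: generate a few hundred completely unrelated data series, and some pairs will look strongly correlated purely by chance. A minimal sketch (the numbers of variables and observations are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n_vars, n_obs = 200, 50
    data = rng.normal(size=(n_vars, n_obs))     # 200 completely unrelated series

    corr = np.corrcoef(data)                    # all pairwise correlations
    upper = corr[np.triu_indices(n_vars, k=1)]  # each pair counted once

    print(f"pairs examined: {upper.size}")
    print(f"strongest chance correlation: {np.abs(upper).max():.2f}")
    print(f"pairs with |r| > 0.4: {(np.abs(upper) > 0.4).sum()}")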
As I have argued above, this is not always true. In fact, as Norvig and his Big Data colleagues have repeatedly insisted, the best use of data mining is to keep personal hypotheses from getting in the way. Let the data speak for themselves.
Of course some direction in data collection is necessary. The CDC has famously used social media data to track the spread of the flu. If Facebook records a large number of people in Peoria reporting painful coughing, the CDC knows about the outbreak long before data from doctor visits and hospital admissions come in. An apparel producer anxious to produce avant-garde, hip clothing can trawl the web to find out what early adopters and creative innovators are wearing simply by analyzing what they say.
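A toy sketch of the kind of keyword monitoring described here; the posts, the baseline, and the threshold are all invented for illustration and say nothing about the CDC’s actual methodology.

    # Invented example posts grouped by week; a real system would pull these
    # from a social media feed.
    posts_by_week = {
        "2013-W10": ["nice weather in peoria", "flu shot scheduled", "coughing a bit"],
        "2013-W11": ["painful coughing again", "fever and coughing", "cough syrup run",
                     "coughing kept me up", "whole office is coughing"],
    }

    baseline = 1   # assumed normal number of weekly cough mentions
    for week, posts in posts_by_week.items():
        mentions = sum("cough" in post.lower() for post in posts)
        status = "possible flu spike" if mentions > 3 * baseline else "normal"
        print(f"{week}: {mentions} cough mentions - {status}")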
Big Data has predictive value beyond the narrow confines of causality and correlation. Crowdsourcing is increasingly used as an efficient way to solve problems. Google is famous for using crowdsourcing to come up with new search algorithms. The company figured that it might get one interesting idea in ten million, but these days, who’s counting? The enterprise was successful. The famous gumball experiment, in which a large number of individuals are asked to guess the number of candies in a large glass container, challenged expert opinion. Again and again, the average of a large number of guesses came closer to the actual number than any single guess by a mathematician.
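A small simulation makes the gumball result plausible. The error levels assumed for the crowd and for the single expert are illustrative guesses, not data from any actual experiment; the point is only that averaging many independent guesses cancels out individual error.

    import numpy as np

    rng = np.random.default_rng(1)
    true_count = 850                      # assumed number of gumballs in the jar

    # 1,000 crowd members, each guessing with large individual error
    crowd = rng.normal(loc=true_count, scale=200, size=1000)
    crowd_average = crowd.mean()

    # one expert, guessing with much smaller individual error
    expert = rng.normal(loc=true_count, scale=60)

    print(f"crowd average: {crowd_average:.0f} (off by {abs(crowd_average - true_count):.0f})")
    print(f"expert guess : {expert:.0f} (off by {abs(expert - true_count):.0f})")

Run it a few times with different seeds and the crowd average lands within a few gumballs of the truth far more reliably than the lone expert does.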
The betting markets used to predict political outcomes have also been remarkably accurate. That is, if you ask people to bet on political candidates with their own money, the aggregate of those bets has been shown to be far more accurate than the prediction of any political talking head. The Nate Silver statistical methodology is good, but the wisdom of the crowd is even better.
Brooks notes that ‘people are discontinuous’. What we do one year, we may not do another. Tastes and preferences change dramatically over time, and with this unpredictability, how can data-based algorithms be any good?
Once again, Brooks misunderstands the logic of Big Data. As even a beginning student of statistics will tell you, the larger the sample over the longer period of time, the more reliable the correlation. In other words, of course people change, and these likely and often predictable changes are factored into the research algorithm. Suppose Manolo Bialek wants to come out with a new, hip shoe. He knows from mining vast amounts of data what young, affluent women are saying they would like to buy; and he knows from similar research how long fashion trends tend to last. He can put the two together, do his balance sheet, and determine whether the short-term gains he can realize from a 12” spiky stiletto heel shoe are worth it.
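The statistical point is simple to demonstrate. In the sketch below, the underlying correlation of 0.3 is an arbitrary assumption; the only claim is that estimates drawn from small samples swing far more widely than estimates drawn from large ones.

    import numpy as np

    rng = np.random.default_rng(2)
    true_r = 0.3                              # assumed underlying correlation
    cov = [[1.0, true_r], [true_r, 1.0]]

    for n in (20, 200, 20_000):
        estimates = []
        for _ in range(5):                    # five independent samples of size n
            x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
            estimates.append(np.corrcoef(x, y)[0, 1])
        print(f"n = {n:>6}: estimates range over {max(estimates) - min(estimates):.3f}")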
Brooks concludes with a worry:
Most of the advocates [of Big Data] understand data is a tool, not a worldview. My worries mostly concentrate on the cultural impact of the big data vogue. If you adopt a mind-set that replaces the narrative with the empirical, you have problems thinking about personal responsibility and morality, which are based on causation. You wind up with a demoralized society. But that’s a subject for another day.
Not only is this a subject for another day; it is a subject which does not belong in this discussion. The assumption that Big Data might cause an erosion of personal responsibility and morality is far-fetched indeed.
I agree that the world will never be the same once Big Data takes over, and Brooks is right in signaling the death of intuition. There will be less and less room for political commentators, forecasters, and market analysts; but there still will be plenty of room for political philosophers, writers, and playwrights. Shakespeare was a canny observer of human nature, and he knew exactly how people behave. He understood the repetitive, cyclical nature of history because he knew that we all had an immutable core which would always govern our actions. If his assumptions had been confirmed and laid out end to end by a Big Data analysis, would he have stopped writing, or written less insightful plays? Hardly. Shakespeare knew he was right, and still wove the most remarkable human dramas imaginable around his convictions.
In a less abstract way, political pundits who are now flying by the seat of their pants will soon be on firmer ground. Except instead of PBS talking heads, there will be Nate Silver clones who will discuss revealed truths. A different landscape is all.