David Brooks (New York Times, 2.19.13) has written about what he considers the limitations of big data, arguing that many areas of perception and decision-making should be left to good old common sense.
Many people are alarmed at what they feel is the dehumanization of American life; the decline of individual creativity, innovation, and insight; and the subjugation of the human spirit to dry, passionless data. Recently Netflix made House of Cards on the basis of sophisticated audience research. By analyzing millions of viewers’ preferences, the company decided that it could not go wrong with Kevin Spacey, the director David Fincher, and a political drama. Not only that, given the millions of bits of information collected on streaming viewers’ pauses, rewinds, fast-forwards, and volume adjustments, the producers could craft a show very carefully tailored to its audience. The latest viewing figures show that they were right on.
Oh, but what will happen to the auteur, the genius of Francis Ford Coppola, Eisenstein, or Herzog? Will they be discarded on the rubbish heap of old celluloid? Will movies be made by robots sucking up big data, crafting mega-bits into mass-market films, and discarding subjectivity and that elusive but ‘you-know-it-when-you-see-it’ genius? Recently a group of young MIT and Harvard graduates were having fun over beers seeing who could come up with the wildest, yet scientifically plausible, science fiction idea. The next day they decided to put the same challenge on social media; an exponentially larger network might yield some interesting ideas. Not only did they get some equally wild but scientifically grounded ideas, they also received a draft of a movie script from an unknown wannabe in Iowa. It was brilliant and perfectly Hollywood, and the movie got made. A modern-day Fellini or Antonioni may well make great movies, but it is just as likely that the best scripts will come from the vast unwashed.
Throughout the United States there are hundreds of high-tech labs working to keep a step ahead of the competition in the cyber world. Google Labs epitomizes this kind of center of intellectual excellence, where the smartest of the smart use their youth, computer savvy, frightening logic, and mathematical skills to come up with the latest tweak to the company’s search engine. Yet, despite this concentration of brain power, said to be the best in the business, Google went crowdsourcing. It promoted a worldwide competition, open to all, no credentials required, to generate the best ideas on how to upgrade Google Search. It assumed, rightly in this case, that if ten million brains could work on the problem, some alone, some in mighty collaboration, it would do even better than its high-octane mix in ‘an undisclosed location’ in the Bay Area. They were right.
Other companies have followed suit and go to crowdsourcing for new ideas and solutions to problems in marketing, science, and mathematics. The old adage that an infinite number of monkeys banging away on an infinite number of typewriters will eventually produce all the works of Shakespeare is at work here. No longer do we need to rely on individual genius, individual inspiration, or individual creativity. While it is true that someone out there among the ten million eager challengers for a crowdsourced prize may have found the solution, the enterprise of finding and utilizing that genius is totally, radically different. Rather than arbitrarily and subjectively selecting Joe Geek from Cupertino to work in the Google Labs, it is better to let José Geek or Wang Chen Geek select himself.
Nate Silver has shown that simply by analyzing the numbers, political predictions can be scarily accurate. Anyone who has seen Moneyball will remember the scene between Billy Beane and his old-line talent scout. Billy wants only to know who gets on base. The scout, with 30 years of experience, argues that numbers cannot tell the whole story, that only real, live human beings can correctly assess the whole person, the complete physical, psychological, emotional toolkit. Sorry, says Billy, I don’t care what the guy is like as long as he gets on base.
The Sunday talk shows are famous for their collection of pundits and experts who perorate and opine about the week’s events. We tune in to learn from these professionals who have been trained to discern trends, anticipate national and international reaction, and make considered judgments about policy. We put up with their posturing and windy blather because they supposedly know more than we do.
Not so, for in the era of big data, we the citizens rule:
A classic example is that of ‘Galton’s ox’, a seminal study in which Sir Francis Galton noted down the estimates of 800 or so entrants in a competition to guess the weight of an ox. He found that the average (mean) estimate of the crowd was almost exactly correct. Similar accuracy was reproduced in classic experiments in which students were asked to guess the number of jelly beans in a jar or the weight of a range of objects (Leighton Vaughan Williams, The Cleverness of Crowds).
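To see how little machinery the crowd effect requires, here is a minimal sketch in Python, using simulated numbers rather than Galton's actual entries (the true weight and the error model are assumptions for illustration): many noisy individual guesses whose average lands far closer to the truth than the typical guesser does.

```python
import random

# A minimal sketch of the 'Galton's ox' effect with simulated numbers:
# many noisy individual guesses whose average lands close to the truth.
random.seed(42)

TRUE_WEIGHT = 1198   # pounds; the figure usually cited for Galton's ox
N_GUESSERS = 800

# Each entrant guesses with some random error (an invented error model).
guesses = [TRUE_WEIGHT + random.gauss(0, 75) for _ in range(N_GUESSERS)]

crowd_estimate = sum(guesses) / len(guesses)
avg_individual_error = sum(abs(g - TRUE_WEIGHT) for g in guesses) / len(guesses)

print(f"Crowd average:            {crowd_estimate:.1f} lb")
print(f"Crowd error:              {abs(crowd_estimate - TRUE_WEIGHT):.1f} lb")
print(f"Typical individual error: {avg_individual_error:.1f} lb")
```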
Predictive marketing has shown noteworthy success. Betting on American elections was permitted until relatively recently, and Rhode and Strumpf studied election results from 1868 to 1940, comparing betting predictions with actual outcomes:
The market did a remarkable job forecasting elections in an era before scientific polling. In only one case did the candidate clearly favored in the betting a month before Election Day lose, and even state-specific forecasts were quite accurate. This performance compares favorably with that of the Iowa Electronic Market (currently the only legal venue for election betting in the U.S.). Second, the market was fairly efficient, despite the limited information of participants and attempts to manipulate the odds by political parties and newspapers. (Betting vs. Polls)
Other observers have concluded that the scope for using prediction markets is limitless:
“Prediction markets in general perform exceedingly well compared to individual forecasts. In his article on prediction markets, Philip O'Connor writes: ‘In fact, studies of prediction markets have found that the market price does a better job of predicting future events than all but a tiny percentage of individual guesses.’” (Michael Rozeff on LewRockwell.com)
Not so fast, warns Brooks. There is still plenty of room for subjectivity, and he gives a number of examples:
Data struggles with the social. Your brain is pretty bad at math (quick, what’s the square root of 437), but it’s excellent at social cognition. People are really good at mirroring each other’s emotional states, at detecting uncooperative behavior and at assigning value to things through emotion.
Even if this were true (how many husbands totally misinterpret what is going on in their wives’ heads?), computer technology is already able to read facial expressions, posture, comportment, and gestures, and to register changes in body heat, heart rate, and perspiration. Sophisticated ‘sentiment analysis’ software is already used to interpret meaning from common phrases and words within context. Large service companies, like Holiday Inn or Marriott, are mining big data (with help from Facebook & Co.) to decipher what former guests really think of their stay. It is not difficult to imagine the day when each individual, equipped with mini-sensors, will be able to ‘read’ another’s meaning with exactitude.
Data struggles with context. Human decisions are not discrete events. They are embedded in sequences and contexts. The human brain has evolved to account for this reality. People are really good at telling stories that weave together multiple causes and multiple contexts. Data analysis is pretty bad at narrative and emergent thinking, and it cannot match the explanatory suppleness of even a mediocre novel.
Once again Brooks underestimates the power of big data. In a recent Artificial Intelligence Conference at MIT, Noam Chomsky was shown to be in the arrière-garde of the field for his inflexible insistence that one must understand the brain and its linguistic core in order to create artificial intelligence. No, said Google’s top research scientist. We simply mine the data to find out how people construct and use language. We don’t care how they do it, just what they do. A corner had been turned.
Sentiment analysis relies on context to extract meaning. If a former guest says “The beds were OK”, Marriott knows that this is not a compliment. Even without hearing the speaker give that slight note of indifference, the program understands that the judgment is at the very least tepid, for that is how American English is used. That is, the program has understood the cultural context of the sentence.
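By way of illustration, and emphatically not as a description of any hotel chain's actual software, here is a toy lexicon-based scorer in Python with invented words and weights; even this crude approach classifies faint praise like “OK” as tepid rather than positive.

```python
# A toy lexicon-based sentiment scorer -- invented words and weights, not
# any vendor's real software -- showing how faint praise scores as tepid.
LEXICON = {
    "excellent": 2.0, "great": 1.5, "comfortable": 1.0,
    "ok": 0.1, "okay": 0.1, "fine": 0.2,
    "noisy": -1.0, "dirty": -1.5, "terrible": -2.0,
}

def sentiment(review: str) -> str:
    words = review.lower().replace(",", " ").replace(".", " ").split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    if not scores:
        return "no opinion detected"
    total = sum(scores)
    if total >= 1.0:
        return "positive"
    if total <= -0.5:
        return "negative"
    return "tepid"   # "The beds were OK" lands here

print(sentiment("The beds were OK"))         # tepid
print(sentiment("The room was excellent"))   # positive
print(sentiment("The hallway was noisy"))    # negative
```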
Most of us are amazed that Google knows what we are looking for after only a few keystrokes; but now it is working on algorithms that will know what we are going to ask before we even ask it. Again, by mining big data (our cookies across websites, our email conversations, our Facebook and Twitter posts), Google will know who we are and what we are likely to ask. In other words, it has developed a profile of us, that is, it has understood the context in which we function, and it is able to predict what we might be looking for.
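A heavily simplified sketch of the idea, and not Google's actual algorithm, is a frequency-ranked prefix completion over a hypothetical log of past queries: the more often a query has been asked, the higher it ranks among the suggestions for a matching prefix.

```python
from collections import Counter

# A minimal sketch of prefix-based query completion over a made-up query
# log -- an illustration of the principle, not Google's real system.
QUERY_LOG = [
    "weather boston", "weather boston", "weather berlin",
    "big data definition", "big data jobs", "big data definition",
]

counts = Counter(QUERY_LOG)

def suggest(prefix: str, k: int = 3) -> list:
    """Return up to k past queries matching the prefix, most frequent first."""
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda item: -item[1])
    return [q for q, _ in matches[:k]]

print(suggest("weather "))   # ['weather boston', 'weather berlin']
print(suggest("big data"))   # ['big data definition', 'big data jobs']
```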
Now when Brooks raises the issue of context, he is really talking about Faulkner. How on earth can a computer make sense of that complex, multi-layered, and multi-dimensional prose when most human beings have to parse, re-read, study, and pore over each page? The answer goes back to the Chomsky-Google debate. There is no mystery to language, the big data advocates say; and eventually a computer will be able to read, understand, summarize, and describe every page of Absalom, Absalom within the context of the entire work. A computer can already read every word of Faulkner in a millisecond, can analyze his mode of writing, his references and cross-references, his metaphors, his verbal twists, his inventions to such a degree that soon it will be able to accurately predict the next page.
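For a crude sense of what “predicting the next page” means statistically, here is a toy bigram model in Python trained on a short made-up snippet of text (not Faulkner); real language models are vastly more sophisticated, but the underlying principle of counting what tends to follow what is the same.

```python
from collections import Counter, defaultdict

# A toy bigram "next word" predictor over a tiny made-up text sample --
# an illustration of statistical text prediction, not a Faulkner reader.
TEXT = ("the house was old and the house was dark and the road ran past "
        "the house toward the old dark fields").split()

following = defaultdict(Counter)
for a, b in zip(TEXT, TEXT[1:]):
    following[a][b] += 1

def predict_next(word: str) -> str:
    """Return the most frequent word observed after `word`."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))   # 'house' -- the most frequent continuation
print(predict_next("old"))   # 'and' (tie with 'dark', first occurrence wins)
```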
Data creates bigger haystacks. This is a point Nassim Taleb, the author of “Antifragile,” has made. As we acquire more data, we have the ability to find many, many more statistically significant correlations. Most of these correlations are spurious and deceive us when we’re trying to understand a situation. Falsity grows exponentially the more data we collect. The haystack gets bigger, but the needle we are looking for is still buried deep inside.
This is misleading. Of course we can create a multi-dimensional spider-web of correlations and associations which grows exponentially with the amount of data acquired; but must we? In the example of Moneyball, a very simple question was asked: “What factor is most correlated with runs scored?” This allowed millions of bits of data to be collected, but within a disciplined, well-organized context. The big data analysts know quite well the pitfalls of big haystacks.
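The discipline amounts to asking the question first and then computing a single statistic. Here is a minimal sketch, using made-up team-season numbers rather than real MLB data, of how one would measure the correlation between on-base percentage and runs scored.

```python
import statistics

# A minimal sketch of the Moneyball question -- "what correlates with runs
# scored?" -- using invented team-season figures, not real MLB data.
obp  = [0.310, 0.322, 0.335, 0.348, 0.356, 0.361]   # on-base percentage
runs = [640,   665,   702,   748,   771,   790]     # runs scored

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(f"Correlation of OBP with runs: {pearson(obp, runs):.3f}")
```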
Big data has trouble with big problems. We’ve had huge debates over the best economic stimulus, with mountains of data, and as far as I know not a single major player in this debate has been persuaded by data to switch sides.
This is a specious argument. First of all, people do not always change their minds because of data, big or small. There are hundreds of thousands of conspiracy theorists out there who refuse to believe that President Obama was born in the United States, and nothing is going to change their minds. More to the point, however, data are only as good as the technology used to collect and analyze them. Forecasters can now predict the weather five days out with a high degree of accuracy and are getting much better at ten-day forecasts. Weather systems are described as ‘complex’ because so many factors contribute to wind speed, temperature, and precipitation that it is very hard to model them. Yet we are getting there, and few of us doubt that before long we will be able to plan vacations far in advance. Economics is called the dismal science for good reason. There are so many factors determining share prices, exchange rates, interest rates, consumer confidence, and the rest that predicting economic activity is even harder than predicting the weather. At least the meteorologist does not have to deal with human psychology when crafting his models. It is normal, then, that economic predictions are still waffly.
Data favors memes over masterpieces. Data analysis can detect when large numbers of people take an instant liking to some cultural product. But many important (and profitable) products are hated initially because they are unfamiliar.
As I mentioned earlier in my discussion of Netflix, there will be fewer and fewer marketing experiments. Brooks assumes that the old way will prevail and that someone will invent a product, try it out on the public, and hang in there to see if initial negative reaction can be overcome. Marketing will not work this way, but will first find out with exacting precision what consumers want and produce products according to those demands.
Data obscures values. I recently saw an academic book with the excellent title, “ ‘Raw Data’ Is an Oxymoron.” One of the points was that data is never raw; it’s always structured according to somebody’s predispositions and values. The end result looks disinterested, but, in reality, there are value choices all the way through, from construction to interpretation.
Of course. Although some scientific discoveries have been made by accident (e.g. penicillin), many more have been made through the process of hypothesis, test, and theory. In other words, someone had to ask the question, “I wonder what would happen if…”
In conclusion, I do not believe that Brooks has made a convincing argument against big data. On the contrary, all the points he raises confirm the robustness of the new discipline. Objective, value-neutral analysis through the use of information technology and algorithms is the future. No doubt about it.