Well, not “everything”. But every measure on which decisions of value depend (e.g., book purchases, dating opportunities, or tenure) can and will be manipulated.
And if the measure depends on user-contributed content distributed on an open platform, the manipulation often will be easy and low cost, and thus we should expect to see it happen a lot. This is a big problem for “big data” applications.
This point has been the theme of many posts I’ve made here. Today, a new example: citations of scholarly work. One of the standard, often highly valued (as in, makes a real difference to tenure decisions, salary increases and outside job offers) measures of the impact of a scholar’s work is how often it is cited in the published work of other scholars. ISI Thomson has been providing citation indices for many years. ISI is not so easy to manipulate because — though it depends on user-contributed content (articles by one scholar that cite the work of another) — that content is distributed on closed platforms: ISI only indexes citations from a set of published journals whose editorial boards protect their reputation and brand by screening what they publish.
But over the past several years, scholars have increasingly relied on Google Scholar (and sometimes Microsoft Academic) to count citations. Google Scholar indexes citations from pretty much anything that appears to be a scholarly article and is reachable by the Google spiders crawling the open web. So, for example, it includes citations in self-published articles, or e-prints of articles published elsewhere. Thus, Google Scholar citation counts depend on user-contributed content distributed on an open platform (the open web).
And, lo and behold, it’s relatively easy to manipulate such citation counts, as demonstrated by a recent scholarly paper that did so: Delgado López-Cózar, Emilio; Robinson-García, Nicolás; Torres-Salinas, Daniel (2012). “Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting.” EC3 Working Papers 6: 29 May 2012, available at http://arxiv.org/abs/1212.0638v2.
Their method was simple: they created some fake papers that cited other papers, and published the fake papers on the Web. Google’s spider dutifully found them and increased the citation counts for the real papers that these fake papers “cited”.
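The difference between the two indexing models is easy to see in a toy sketch. This is not Google’s or ISI’s actual pipeline; the page data, hostnames, and paper titles below are all invented for illustration:

```python
# Toy contrast between a "closed" citation index (only screened journals
# count) and an "open" one (any crawlable page counts). All data invented.

SCREENED_JOURNALS = {"journal.example"}

def open_count(pages, title):
    """Open-web index: every crawled page that cites the paper counts."""
    return sum(title in p["references"] for p in pages)

def closed_count(pages, title):
    """Closed index: only citations from screened journals count."""
    return sum(title in p["references"]
               for p in pages if p["host"] in SCREENED_JOURNALS)

pages = [
    {"host": "journal.example", "references": ["Paper X"]},     # real article
] + [
    {"host": "self-hosted.example", "references": ["Paper X"]}  # fake papers
    for _ in range(6)
]

print(open_count(pages, "Paper X"))    # 7: the fakes inflate the open index
print(closed_count(pages, "Paper X"))  # 1: editorial screening filters them out
```

The open counter has no way to distinguish a real citation from one planted on a self-hosted page, which is exactly the vulnerability the paper exploited.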
The lesson is simple: for every measure that depends on user-contributed content on an open platform, if valuable decisions depend on it, we should assume that it is vulnerable to manipulation. This is a sad and ugly fact about a lot of new opportunities for measurement (“big data”), and one that we must start to address. The economics are unavoidable: the cost of manipulation is low, so if there is much value to doing so, it will be manipulated. We have to think about ways to increase the cost of manipulating, if we don’t want to lose the value of the data.
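The economic logic reduces to a one-line decision rule. A minimal sketch, with purely illustrative numbers:

```python
# A rational manipulator games a measure when the value of the outcome
# exceeds the cost of manipulating. The numbers below are made up.

def will_manipulate(value_of_outcome, cost_of_manipulation):
    return value_of_outcome > cost_of_manipulation

# Open platform: posting fake content is nearly free.
print(will_manipulate(value_of_outcome=1000, cost_of_manipulation=5))     # True

# Raising the cost (editorial screening, identity checks) flips the decision.
print(will_manipulate(value_of_outcome=1000, cost_of_manipulation=5000))  # False
```

The point of the sketch: we usually cannot reduce the value of the outcome (tenure, sales), so the only lever left is raising the cost of manipulation.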
Here is a recent article about high school students manipulating their Facebook presence to fool college admissions officers. Not terribly surprising: the content is (largely) created and controlled by the target of the background searches (by admissions officers, prospective employers, prospective dating partners, etc.), so it’s easy to manipulate. We’ve been seeing this sort of manipulation since the early days of user-contributed content.
People mining user-contributed content should be giving careful thought to this. Social scientists like it when they can observe behavior, because it often reveals something more authentic than simply asking someone a question (about what they like, or what they would have done in a hypothetical situation, etc.). Economists, for example, are thrilled when they get to observe “revealed preference”: the choices people make when faced with a true resource allocation problem. It could be that I purchased A instead of B to fool an observer, but there is a cost to my doing so (I bought and paid for a product that I didn’t want), and as long as the costs are sufficiently salient, it is more likely that we are observing preferences untainted by manipulation.
There are costs to manipulating user-contributed content, like Facebook profiles, of course: some amount of time, at the least, and probably some reduced value from the service (for example, students say that during college application season they hide their “regular” Facebook profile, and create a dummy in which they talk about all of the community service they are doing, and how they love bunnies and want to solve world hunger: all fine, but they are giving up the other uses of Facebook that they normally prefer). But costs of manipulating user-contributed content often may be low, and thus we shouldn’t be surprised if there is substantial manipulation in the data, especially if the users have reason to think they are being observed in a way that will affect an outcome they care about (like college admissions).
Put another way, the way people portray themselves online is behavior and so reveals something, but it may not reveal what the data miner thinks it does.
Curiouser and curiouser.
Last week Jonathan Tasini, a freelance writer, filed a lawsuit on behalf of himself and other bloggers who contributed — well, maybe not contributed — their work to the Huffington Post site. His complaint is that the Huffington Post sold itself to AOL for $315 million and did not share any of the gain with the volunteer — well, maybe not volunteer — writers.
The lawsuit complaint makes fun reading, as these things go.
The main gripe (other than class warfare: it’s unfair!) seems to be that HuffPo “lured” (paragraph 2) writers to contribute their work not for payment but for “exposure (visibility, promotion and distribution)”, yet did not provide “a real and accurate measure of exposure” (paragraph 103). However, as far as I can see, there is no claim that HuffPo ever told its writers that HuffPo would not be earning revenue, nor a promise that it would provide any page view or other web analytic data.
How deceived was Tasini? He’s no innocent. In fact, he volunteers (oops! there’s that word again) in the complaint that he runs his own web site, that he posts articles to it written by volunteers, and that he earned revenue from the web site (paragraph 15). And he was the lead plaintiff in the famous (successful) lawsuit against the New York Times when it tried to resell freelance writer content to digital outlets (not authorized in its original contracts with the writers). And, gosh, though he was “lured” into writing for the HuffPo, and was “deceived” into thinking it was a “free forum for ideas”, he didn’t notice that they sold ads and were making money during the several years in which he contributed 216 articles to the site. That’s a pretty powerful fog of deception! Maybe Arianna Huffington should work for the CIA.
The Peer-to-Patent system created by Beth Noveck’s group at New York Law School and being piloted by the U.S. Patent Office has gotten a fair bit of attention. The basic idea is to gather user-contributed content from experts who can help patent examiners figure out whether a proposed invention is novel (no prior art). Anyone can submit comments on the posted patent proposals, and in particular can cite evidence of prior art (which generally leads, if valid, to denial of the patent application). The purpose is to speed up patent reviews, and in particular to help prevent the granting of invalid patents, because it is often costly, time-consuming and chilling to later innovation to fight and prove a granted patent is invalid.
Andy Oram wrote an editorial in the Feb 2008 Communications of the ACM urging computer scientists to participate (viewing article may require subscription). He explained the system, and why it would be good for innovation for experts to donate their time to read and comment on patent applications.
Why would experts — whose time is somewhat valuable — want to do this? Andy argues that the primary reason is public service: donate to create a public good (a better software patent system) for all. There are lots of ideas of things that would be “good for all” that require volunteer donations of time, effort, money. It’s actually not a given that such public goods are a good idea: the value of a public good does not always or automatically exceed the cost of the time or other resources donated by the people who created it. The experts whom Andy seeks to contribute to Peer-to-Patent are highly trained people whose time is generally valued quite highly. In any case, if P-to-P depends on volunteer contributions by experts, how likely is it to succeed? These are people who already feel deluged by requests to volunteer their time to referee conference and journal articles, advise students on projects, advise government, serve on department and university committees, serve on professional organization committees, edit journals, etc., etc. I know few serious, successful academics who work less than 50 or 60 hours a week already.
Andy also suggests another reason to volunteer time for Peer-to-Patent: the bad patent you block may save your startup company! Now we’re talking… a monetary incentive to “volunteer” time. But this is a bit problematic too: it points out a strategic concern with P-to-P. Potential competitors, or entrepreneurs who at least want to use the disclosed invention, have an interest in trying to block patent applications, and may try to do so even if the invention is legitimate. They can flood the Patent Office with all sorts of “prior art”, which may not be valid, but now the patent examiners will have more work to do. And just as patent examiners may conclude incorrectly that a patent application is valid, so may they conclude incorrectly that one is invalid. It’s not prima facie obvious, especially given that those most motivated to “donate” time and effort are those who themselves have a financial stake in the outcome, that user-contributed content in this setting will be a good thing, on balance.
Google has made available a striking set of new features for search, which it calls SearchWiki. If you are logged in to a Google account, when you search you will have the ability to delete or re-order the results you get (changes that reappear if you run the same search again), add results, and post comments (which can be viewed by others).
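One way to picture the feature set is as per-user overrides on a shared result list, plus a comment store that everyone can see. This is a speculative sketch of the behavior described above, not Google’s actual design; all names and data are invented:

```python
# Speculative model of SearchWiki state: a user's re-rankings and deletions
# are private to that user, but comments on a query are visible to everyone.

class SearchWiki:
    def __init__(self):
        self.overrides = {}  # (user, query) -> that user's preferred result list
        self.comments = {}   # query -> list of (user, text), visible to all

    def personalize(self, user, query, reordered_results):
        self.overrides[(user, query)] = reordered_results

    def results_for(self, user, query, default_results):
        return self.overrides.get((user, query), default_results)

    def add_comment(self, user, query, text):
        self.comments.setdefault(query, []).append((user, text))

sw = SearchWiki()
sw.personalize("alice", "pizza", ["b.example", "a.example"])
sw.add_comment("alice", "pizza", "b.example is best")

# Alice sees her own ordering; Bob still sees the default...
print(sw.results_for("alice", "pizza", ["a.example", "b.example"]))
print(sw.results_for("bob", "pizza", ["a.example", "b.example"]))
# ...but Bob sees Alice's comment: the one channel that reaches other users.
print(sw.comments["pizza"])
```

The structure makes the asymmetry plain: re-ranking only changes your own view, while comments publish to everyone, which is why comments are the pollution channel.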
But the comments are user-contributed content: this is a relatively open publishing platform. If others search on the same keyword(s) and select “view comments” they will see what you entered. Which might be advertising, political speech, whatever. As Lauren Weinstein points out, this is an obvious opportunity for pollution, and (to a lesser extent, in my humble opinion, because there is no straightforward way to affect the behavior of other users) manipulation. In fact, he finds that comment wars and nastiness started within hours of SearchWiki’s availability:
It seems inevitable that popular search results in particular will quickly become laden with all manner of “dueling comments” which can quickly descend into nastiness and even potentially libel. In fact, a quick survey of some obvious search queries shows that in the few hours that SearchWiki has been generally available, this pattern is *already* beginning to become established. It doesn’t take a lot of imagination to visualize the scale of what could happen with the search results for anybody or anything who is the least bit …
Weinstein even suggests that lawsuits are likely by site owners whose links in Google become polluted, presumably claiming they have some sort of property right in clean display of their beachfront URL.
When I explain to people the fundamental incentive-centered design (ICD) problem of motivating users to contribute content to a user-contributed content information resource, I often use Wikipedia as a familiar example: “Why do so many people voluntarily donate so much time and effort to research, write content, and copy edit and correct the content of others? That’s a lot of unpaid work!”
Some people ask what the problem is, and why this needs academic research: “Wikipedia is doing great! They don’t need to come up with clever incentives to motivate contribution.” My reply: “Yes (maybe), but the point is, how do we create the next Wikipedia?” That is, how do we build another fabulously successful and valuable information resource dependent on all that volunteer labor? What is the special sauce? Is it replicable?
Simson Garfinkel has an article in the current Technology Review that, indirectly, makes the point nicely. Yes, Wikipedia is fabulously successful…in some ways. But certainly not everyone thinks Wikipedia is the final word in online reference, such that we don’t need to create any other reference resources. Simson focuses on “Wikipedia and the Meaning of Truth”. Wikipedia’s primary rule for admissible content is not that it be verifiably true (which would be difficult to enforce, to say the least!), but that it be verifiably published somewhere “reliable”.
That not everything in Wikipedia is correct is well known, and not surprising. There are enthusiastic debates about whether it is as accurate as traditional encyclopedias, like Britannica. And so forth. The point is: many people want other types of reference resources, as an alternative or at least a complement to Wikipedia. And thus the question: to build such a resource with user-contributed content, we need to figure out how to motivate the users to contribute.
Some are trying to create more accurate, reliable alternatives, and they are not nearly as successful in getting contribution as Wikipedia has been. One of the interesting examples is Google’s Knol, which is trying to establish greater reliability by having each topic “owned” by its original author (who may then permit and seek contributions from other users).
Do you think Wikipedia is the final word, forever, in online reference? If not, perhaps you should be wondering how to motivate users to contribute to other resources, and thinking about whether motivation is trivial now that Wikipedia has “figured it out”.
As I’ve given talks and written the past couple of years about the motivation mysteries surrounding user-contributed content sites, I generally mention Amazon book reviews as a prominent example. It is not uncommon for over 100 people to review a popular book. And the top 10 reviewers (as of today) have each written more than 1600 reviews (leader Harriet Klausner is about to pass 17,500!).
Why? Not only is that a lot of time (allegedly) reading, but it’s a lot of time writing…for the economic benefit of Amazon. What do reviewers get out of it?
One explanation for open source software contributions is that new programmers get professional experience on a team software engineering project, and their contributions are publicly documented so they can show them to potential employers. That might explain some reviewers on Amazon: they can show their reviews to an employer, and users rate them so they can show their scores too. But how many jobs are out there for book reviewers (and what about those with massive output who remain “amateur”)?
Slate published Garth Hallberg’s article in January 2008 (yes, I’m a bit behind in posting things to this blog!) that suggests the amateur reviewers may be motivated the old-fashioned way: through extrinsic, direct benefits. For example, apparently publishers send free copies of their books to prolific reviewers, so people who do want to read a lot get a lot of in-kind compensation. Grady Harp (#6) said he is “inundated”. Amazon has extended this form of compensation by creating its Vine program, in which it selects successful reviewers and gives them free products from across its line of goods (electronics, appliances, etc.), as long as they write reviews (Amazon says it does not influence opinions or modify or edit reviews).
(Thanks to Rick Wash for pointing me to the Hallberg article.)
A fun ad from IBM that makes the point… (Thanks to Mark McCabe)
(This is not really an incentive design entry, just information economics more broadly. But too interesting to pass up.)
Yahoo! Music store announced yesterday it would be closing this fall. All that music you bought (well, not many people actually bought from Yahoo! Music, but still)? They are taking down the DRM servers in September, and your computer will not be able to “phone home” to get the key. The only solution: burn to CD (which of course, made DRM pretty ineffective in the first place). Apparently the same problem occurred when Microsoft and Sony announced the shuttering of their online music stores.
Conventional notions of “owning” property generally involve control over the use of that property in perpetuity (including transfer of ownership). When there are significant use restrictions and rights retained by the provider, it’s licensing, not buying. This has been drummed into us over the years with software licenses (you can’t take a copy of Windows off your old machine and install it on your new machine, for example). With music, I think the general sense is that we are buying it, not licensing it, however. Be that as it may, DRM imposes licensing-like restrictions, and apparently one of them is “you may not be able to listen to this music if we decide to shut down our service in the future.”
Note to self: Finish burning backup CD copies of all of my iTunes music!
One of my students alerted me to a recent dramatic episode. Author and psychologist Cooper Lawrence appeared on a Fox News segment and made some apparently false statements about the Xbox game “Mass Effect”, which she admitted she had never seen or played. Irate gamers shortly thereafter started posting (to Amazon) one-star (lowest possible score) reviews of her recent book that she was plugging on Fox News. Within a day or so, there were about 400 one-star reviews, and only a handful any better.
Some of the reviewers acknowledged they had not read or even looked at the book (arguing they shouldn’t have to since she reviewed a game without looking at it). Many explicitly criticized her for what she said about the game, without actually saying anything about her book.
When alerted, Amazon apparently deleted most of the reviews. Its strategy apparently was to delete reviews that mentioned the name of the game, or video games at all (the book has nothing to do with video games). With this somewhat conservative strategy, the reviews remaining (68 at the moment) are still lopsidedly negative (57 one-star, 8 two-star, 3 five-star), more than I’ve ever noticed for any somewhat serious book, though there’s no obvious way to rule these out as legitimate reviews. (I read several and they do seem to address the content of the book, at least superficially.)
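For a sense of just how lopsided that is, a quick check on the numbers quoted above (57 one-star, 8 two-star, 3 five-star, 68 total) shows the average rating barely clears the minimum possible score:

```python
# Star counts as reported in the text; they sum exactly to the 68 reviews
# remaining after Amazon's deletions.
counts = {1: 57, 2: 8, 5: 3}

total = sum(counts.values())
average = sum(stars * n for stars, n in counts.items()) / total

print(total)              # 68
print(round(average, 2))  # 1.29 out of 5
```

An average of roughly 1.3 stars across dozens of reviews is far below what even a genuinely weak book normally receives, which is what makes the residue of the revenge campaign so visible.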
Aside from being a striking, and different, example of book review pollution (past examples I’ve noted have been about favorable reviews written by friends and authors themselves), I think this story highlights troubling issues. The gamers have, quite possibly, intentionally damaged Lawrence’s business prospects: her sales likely will be lower (I know that I pay attention to review scores when I’m choosing books to buy). Of course, she arguably damaged the sales of “Mass Effect”, too. Arguably, her harm was unintentional and careless (negligent rather than malicious). But she presumably is earning money by promoting herself and her writing by appearing on TV shows: is disciplining her for negligence a reasonable social response? (And the reviewers who have more or less written “she speaks about things she doesn’t know; don’t trust her as an author” may have a reasonable point: so-called “public intellectuals” probably should be guarding their credibility in every public venue if they want people to pay them for their ideas.)
I also find it disturbing, as a consumer of book reviews, but not video games, that reviews might be revenge-polluted. Though this may discipline authors in a way that benefits gamers, is it right for them to disadvantage book readers?
I wonder how long it will be (if it hasn’t already happened) before an author or publisher sues Amazon for providing a nearly open-access platform for detractors to attack a book (or CD, etc.). I don’t know the law in this area well enough to judge whether Amazon is liable (after all, arguably she could sue the individual reviewers for some sort of tortious interference with her business prospects), but given the frequency of contributory infringement and similar secondary-liability claims in other domains (such as against Napster and Grokster for facilitating the downloading of copyrighted materials), it seems like some lawyer will try to make the case one of these days. After all, Amazon provides the opportunity for readers to post reviews in order to advance its own business interests.
Some significant risk of contributory liability could be hugely important for the problem of screening pollution in user-contributed content. If you read some of the reviews still on Amazon’s site in this example, you’ll see that it would not be easy to decide which of them were “illegitimate” and delete all of those. And what kind of credibility would the review service have if publishers made a habit of deciding (behind closed doors) which too-negative reviews to delete, particularly en masse? I think Amazon has done a great job of making it clear that it permits both positive and negative reviews and doesn’t over-select the positive ones to display, which was certainly a concern I had when it first started posting reviews. But if authors and publishers can hold it liable for letting “revenge” reviews appear, I suspect it (and similar sites) will have to shut down reviewing altogether.
(Thanks to Sarvagya Kochak.)