There’s a truism that bothers many (except economists): if there is a good or service that has value to some and can be produced at a cost below that value by someone else, there will be a market. This is disturbing to many because it is as true for areas of dubious morality (such as sexual transactions) and of clear immorality (human trafficking and slavery) as it is for lawn mowing and automobiles.
Likewise for online activities, as I’ve documented many times here. You can buy Twitter followers, Yelp reviews, likes on Facebook, votes on Reddit. And, of course, Wikipedia, where you can buy pages or edits, or even (shades of The Sopranos), “protection”.
Here is an article that reports at some length on large scale, commercialized Wikipedia editing and page management services. Surprised? Just another PR service, like social media management services provided by every advertising / marketing / image management service today.
Well, not “everything”. But every measure on which decisions of value depend (e.g., book purchases, dating opportunities, or tenure) can and will be manipulated.
And if the measure depends on user-contributed content distributed on an open platform, the manipulation often will be easy and low cost, and thus we should expect to see it happen a lot. This is a big problem for “big data” applications.
This point has been the theme of many posts I’ve made here. Today, a new example: citations of scholarly work. One of the standard, often highly-valued (as in, makes a real difference to tenure decisions, salary increases and outside job offers) measures of the impact of a scholar’s work is how often it is cited in the published work of other scholars. Thomson ISI has been providing citation indices for many years. ISI is not so easy to manipulate because, though it depends on user-contributed content (articles by one scholar that cite the work of another), that content is distributed on closed platforms (ISI only indexes citations from a set of published journals that have editorial boards which protect their reputation and brand by screening what they publish).
But over the past several years, scholars have increasingly relied on Google Scholar (and sometimes Microsoft Academic) to count citations. Google Scholar indexes citations from pretty much anything that appears to be a scholarly article that is reachable by the Google spiders crawling the open web. So, for example, it includes citations in self-published articles, or e-prints of articles published elsewhere. Thus, Google Scholar citation counts depend on user-contributed content distributed on an open platform (the open web).
And, lo and behold, it’s relatively easy to manipulate such citation counts, as demonstrated by a recent scholarly paper that did so: Delgado Lopez-Cozar, Emilio; Robinson-Garcia, Nicolas; Torres Salinas, Daniel (2012). Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting. EC3 Working Papers 6: 29 May, 2012, available as http://arxiv.org/abs/1212.0638v2.
Their method was simple: they created some fake papers that cited other papers, and published the fake papers on the Web. Google’s spider dutifully found them and increased the citation counts for the real papers that these fake papers “cited”.
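To make the open-versus-closed distinction concrete, here is a toy sketch in Python. It is not how Google Scholar or ISI actually build their indices; the function name, the venue names and all the data are made up for illustration. It only shows why counting citations from anything reachable inflates easily, while counting only from screened venues does not.

```python
from collections import Counter

def count_citations(documents, trusted_venues=None):
    """Count how many indexed documents cite each paper.

    documents: list of dicts with a 'venue' and a list of cited paper ids.
    trusted_venues: if given, only documents from these venues are indexed
    (a closed platform); if None, every document is indexed (the open web).
    """
    counts = Counter()
    for doc in documents:
        if trusted_venues is not None and doc["venue"] not in trusted_venues:
            continue  # a closed index ignores unscreened sources
        # set() so that one document citing a paper twice counts once
        counts.update(set(doc["cites"]))
    return counts

# Hypothetical data: two real articles plus six fake "papers" on a web page.
real_articles = [
    {"venue": "Journal A", "cites": ["paper-1"]},
    {"venue": "Journal B", "cites": ["paper-1", "paper-2"]},
]
fake_papers = [{"venue": "personal web page", "cites": ["paper-2"]}
               for _ in range(6)]
everything = real_articles + fake_papers

open_counts = count_citations(everything)
closed_counts = count_citations(everything,
                                trusted_venues={"Journal A", "Journal B"})
print(open_counts["paper-2"], closed_counts["paper-2"])  # prints: 7 1
```

The screening step is where the cost of manipulation comes from: to move the closed count, the manipulator has to get past an editorial board, not just upload a file.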
The lesson is simple: for every measure that depends on user-contributed content on an open platform, if valuable decisions depend on it, we should assume that it is vulnerable to manipulation. This is a sad and ugly fact about a lot of new opportunities for measurement (“big data”), and one that we must start to address. The economics are unavoidable: the cost of manipulation is low, so if there is much value to doing so, it will be manipulated. We have to think about ways to increase the cost of manipulating, if we don’t want to lose the value of the data.
Here is a recent article about high school students manipulating their Facebook presence to fool college admissions officers. Not terribly surprising: the content is (largely) created and controlled by the target of the background searches (by admissions, prospective employers, prospective dating partners etc) so it’s easy to manipulate. We’ve been seeing this sort of manipulation since the early days of user-contributed content.
People mining user-contributed content should be giving careful thought to this. Social scientists like it when they can observe behavior, because it often reveals something more authentic than simply asking someone a question (about what they like, or what they would have done in a hypothetical situation, etc). Economists, for example, are thrilled when they get to observe “revealed preference”: the choices people make when faced with a true resource allocation problem. It could be that I purchased A instead of B to fool an observer, but there is a cost to my doing so (I bought and paid for a product that I didn’t want), and as long as the costs are sufficiently salient, it is more likely that we are observing preferences untainted by manipulation.
There are costs to manipulating user-contributed content, like Facebook profiles, of course: some amount of time, at the least, and probably some reduced value from the service (for example, students say that during college application season they hide their “regular” Facebook profile, and create a dummy in which they talk about all of the community service they are doing, and how they love bunnies and want to solve world hunger: all fine, but they are giving up the other uses of Facebook that they normally prefer). But costs of manipulating user-contributed content often may be low, and thus we shouldn’t be surprised if there is substantial manipulation in the data, especially if the users have reason to think they are being observed in a way that will affect an outcome they care about (like college admissions).
Put another way, the way people portray themselves online is behavior and so reveals something, but it may not reveal what the data miner thinks it does.
Actually, I didn’t see this coming, but I wish I had: scholarly authors who see themselves coming by suggesting themselves (via “sybils”) as their own article reviewers (referees)! Lovely case of online information manipulation in response to (fairly intense) incentives to increase one’s publication count.
How could an editor be dumb enough to send an article back to the author for review? The trick is simple (though also it shouldn’t be that hard for editors to see through it, and apparently checking is becoming more commonplace: so what will be the next clever idea as this particular arms race escalates?). Submit to a journal that asks authors to suggest potential reviewers. (Many journals do this; one hopes the editor selects some reviewers from an independent list, not just from the author’s suggestions!) Then submit a name and university and a false email address, one that goes to a mailbox you control. Then, bingo, if the editor selects that reviewer, you get to write the review.
To reduce your chances of getting caught, you can suggest a real and appropriate reviewer, just providing an innocuous but false email address (some variant on his or her name @gmail, for example).
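What might the editor’s checking look like? One cheap possibility, sketched below in Python, is to compare the domain of the suggested email address with the domain of the reviewer’s stated institution. This is only an illustration: the function names and addresses are hypothetical, and a mismatch is not proof of fraud (plenty of real academics use Gmail). It just marks suggestions that deserve an independent lookup before the manuscript goes out.

```python
def email_domain(email):
    """Return the lower-cased part of the address after the last '@'."""
    return email.rsplit("@", 1)[-1].lower()

def needs_verification(suggested_email, institution_domain):
    """True if the address is not at the institution's domain or a subdomain.

    institution_domain is assumed to come from the editor's own lookup of
    the reviewer's university, not from the author's submission.
    """
    domain = email_domain(suggested_email)
    inst = institution_domain.lower()
    return not (domain == inst or domain.endswith("." + inst))

# Hypothetical suggestions for a reviewer at a university whose domain
# is example.edu.
print(needs_verification("j.smith@cs.example.edu", "example.edu"))
print(needs_verification("j.smith.reviews@gmail.com", "example.edu"))
```

The first call returns False (a departmental subdomain is fine) and the second returns True. The check raises the cost of the trick a little, which is the point: the author now has to fake an institutional address, not just open a free mailbox.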
Via The Chronicle of Higher Education.
This is not a profound observation, but one that is useful to keep in mind: if there’s a transaction that is of some value to more than a handful of people, there’s likely to be a market for it. This is more true than ever in the Internet age, because the costs of finding potential traders, and of executing trades, are much lower than they were just 10 or 15 years ago. The lower the costs of trading, the lower the value of things that will find markets.
One well-known example: text and banner ads that may sell for only a few cents a click.
But there are always new and fun examples. Today’s: a New York Times article about the market for buying Twitter followers. Want to be as popular as Ashton Kutcher?
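The trading-cost point can be written as a one-line rule: a market appears when the gain from trade (value to the buyer minus cost to the producer) exceeds the cost of finding a partner and executing the trade. A small Python sketch follows; every number in it is invented purely for illustration and none is an actual price.

```python
def viable_markets(goods, transaction_cost):
    """Return the goods whose gain from trade exceeds the transaction cost.

    goods: list of (name, value_to_buyer, cost_to_producer) tuples.
    """
    return [name for name, value, cost in goods
            if value - cost > transaction_cost]

# Invented numbers, in dollars, chosen only to span several magnitudes.
goods = [
    ("automobile", 20000.00, 15000.00),
    ("lawn mowing", 40.00, 25.00),
    ("ad click", 0.10, 0.02),
    ("social media follower", 0.05, 0.01),
]

print(viable_markets(goods, transaction_cost=5.00))  # pre-Internet search costs
print(viable_markets(goods, transaction_cost=0.01))  # near-zero online costs
```

With a five-dollar transaction cost only the automobile and the lawn mowing find a market; at one cent per trade, all four do.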
“How do I get people to do X”, or “more of X”? That question is pretty much the motivation for the notes I write myself here.
We economists are pretty expert at answers of the form “pay them the right amount, at the right time, as the right function of observables”. But on the interwebs, the question often is how to get people to work harder or contribute more for free. For one thing, a lot of ventures don’t bother with things like, well, revenues (at least initially). And often more important, the transaction costs of identifying, contracting with, and setting up and executing payments to a large number of micro-contributors exceed the benefits of paying them.
So, there is much attention to intrinsic motivation: making people feel good enough about what they’re doing that they want to do it without something messy, like being paid. A lot of sites have been developing and refining tools like leaderboards and badges to give people a sense of accomplishment, some recognition, perhaps a reputation.
More recently, especially following the explosive success of lo-fi casual social gaming (can you spell F-A-R-M-V-I-L-L-E?), folks are trying to combine gaming with intrinsic motivators, in what is called “gamification”. Foursquare does this with its badges; so does Scvngr. A recent article in Incentives Magazine (of course) provides a pretty detailed overview of the emerging gamification industry. A number of firms now sell tools, widgets and platforms allowing folks to gamify any web site.
Games have been used for a while to induce socially useful work: these are usually called “games with a purpose” (GWAP), and their early growth and success are due largely to the work of Luis von Ahn and Laura Dabbish. The idea behind GWAP is to design a game that is intrinsically fun to play, but the playing of which directly produces useful work. One well-known example is the ESP game: two people anonymously matched over the web are shown the same images, and they type in labels. The more times they type in the same labels, the more points they score. Meanwhile, labels that are popular are saved as tags for the image. Google uses this system now in its image labeler.
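The mechanics of the ESP game described above fit in a few lines. The Python sketch below is a simplification, not the actual implementation: the point value, the agreement threshold and the sample labels are all assumptions made for the example.

```python
from collections import Counter

def matched_labels(labels_a, labels_b):
    """Labels that both players typed for the same image (case-insensitive)."""
    norm_a = {label.strip().lower() for label in labels_a}
    norm_b = {label.strip().lower() for label in labels_b}
    return norm_a & norm_b

def run_game(rounds, points_per_match=10, min_agreements=2):
    """Play several rounds on one image.

    rounds: list of (labels_from_player_a, labels_from_player_b) pairs,
    one pair per anonymous match. Returns the total points awarded and
    the labels agreed on in at least min_agreements rounds, which are
    kept as tags for the image. Both parameters are illustrative.
    """
    agreements = Counter()
    points = 0
    for labels_a, labels_b in rounds:
        matches = matched_labels(labels_a, labels_b)
        points += points_per_match * len(matches)  # the fun part, for players
        agreements.update(matches)                 # the useful work, for us
    tags = sorted(label for label, n in agreements.items()
                  if n >= min_agreements)
    return points, tags

# Hypothetical labels typed by three pairs of players for one photo.
rounds = [
    (["Dog", "grass", "ball"], ["dog", "park", "ball"]),
    (["puppy", "dog"], ["dog", "cute"]),
    (["ball", "toy"], ["frisbee", "toy"]),
]
points, tags = run_game(rounds)
print(points, tags)  # prints: 40 ['dog']
```

Requiring agreement between strangers who cannot communicate is what makes the labels trustworthy: a player can only score by typing what someone else would also type.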
The gamification business generalizes this. The games themselves need not produce useful work: rather, the fun of being able to play them motivates the user to do something (not necessarily the game) that the provider values. For example, customers might stay on a site longer, or engage more so that they remember the site (or develop loyalty) and return later.
I particularly like the following observation from the article, because it touches on the critical importance of storytelling in effective (and persuasive) communication. Barry Kirk, solution vice president of consumer loyalty at Maritz Loyalty & Motivation, said, “before slapping badges on everything, make sure your ‘game story’ is well thought out”. He added, “If this were a game, would it be interactive, playful, and engaging? All good games are special experiences, and how to apply gamification is just getting started.”
Curiouser and curiouser.
Last week Jonathan Tasini, a free-lance writer, filed a lawsuit on behalf of himself and other bloggers who contributed — well maybe not contributed — their work to the Huffington Post site. His complaint is that Huffington Post sold itself to AOL for $315 million and did not share any of the gain with the volunteer — well maybe not volunteer — writers.
The lawsuit complaint makes fun reading, as these things go.
The main gripe (other than class warfare: it’s unfair!) seems to be that HuffPo “lured” (paragraph 2) writers to contribute their work not for payment but for “exposure (visibility, promotion and distribution)”, yet did not provide “a real and accurate measure of exposure” (paragraph 103). However, as far as I can see, there is no claim that HuffPo ever told its writers that HuffPo would not be earning revenue, nor a promise that it would provide any page view or other web analytic data.
How deceived was Tasini? He’s no innocent. In fact, he volunteers (oops! there’s that word again) in the complaint that he runs his own web site, that he posts articles to it written by volunteers, and that he earned revenue from the web site (paragraph 15). And he was the lead plaintiff in the famous (successful) lawsuit against the New York Times when it tried to resell freelance writer content to digital outlets (not authorized in its original contracts with the writers). And, gosh, though he was “lured” into writing for the HuffPo, and was “deceived” into thinking it was a “free forum for ideas”, he didn’t notice that they sold ads and were making money during the several years in which he contributed 216 articles to the site. That’s a pretty powerful fog of deception! Maybe Arianna Huffington should work for the CIA.
There is an article in the new issue of CACM on “Crowdsourcing systems on the World-Wide Web”, by Anhai Doan, Raghu Ramakrishnan, and Alon Y. Halevy. In it they offer a definition of crowdsourcing systems, characterize them along nine dimensions, and discuss some of the dimensions as challenges.
It’s a useful review article, with many examples and a good bibliography. The characterization in nine dimensions is clear and I think mostly useful.
I’m particularly pleased to see that they have given prominent attention to the incentive-centered design issues on which I (and this blog) have focused for years. Indeed, they define crowdsourcing systems in terms of four incentive problems that must be solved (distinguishing them from, say, crowd management systems that only address three of the questions). They define crowdsourcing as “A system that enlists humans to help solve a problem defined by the system owners, if in doing so it addresses four fundamental challenges:
- How to recruit and retain users?
- What contributions can users make?
- How to combine user contributions to solve the target problems?
- How to evaluate users and their contributions?
The first and second are the “getting stuff in” (contribution) problem about which I write. How to get people to make effort to contribute something to the good of others? The fourth is the quality incentive problem, which I usually separate into “getting good stuff in” (positive quality), and “keeping bad stuff out”.
Not a new story, but the New York Times reports some interesting details (including prices) of human farms hired by robots (well, not really) to solve CAPTCHAs.
Macduff Hughes, at Google, captures the main point I’ve been making for years: screening out unwanted intruders is an economic problem, and CAPTCHAs are an economic (signaling) mechanism, trying to raise the price sufficiently for bad guys to keep them out.