Category Archives: Pollution

Everything (of value) is for sale

There’s a truism that bothers many (except economists): if a good or service has value to some and can be produced at a cost below that value by someone else, there will be a market. This is disturbing to many because it holds as true for areas of dubious morality (sexual transactions) and clear immorality (human trafficking and slavery) as it does for lawn mowing and automobiles.
Likewise for online activities, as I’ve documented many times here. You can buy Twitter followers, Yelp reviews, likes on Facebook, votes on Reddit. And, of course, Wikipedia, where you can buy pages or edits, or even (shades of The Sopranos) “protection”.
Here is an article that reports at some length on large-scale, commercialized Wikipedia editing and page-management services. Surprised? It’s just another PR service, like the social media management services provided by every advertising / marketing / image management firm today.

Everything can — and will — be manipulated

Well, not “everything”. But every measure on which decisions of value depend (e.g., book purchases, dating opportunities, or tenure) can and will be manipulated.
And if the measure depends on user-contributed content distributed on an open platform, the manipulation often will be easy and low cost, and thus we should expect to see it happen a lot. This is a big problem for “big data” applications.
This point has been the theme of many posts I’ve made here. Today, a new example: citations of scholarly work. One of the standard, often highly valued measures of the impact of a scholar’s work (as in, it makes a real difference to tenure decisions, salary increases, and outside job offers) is how often that work is cited in the published work of other scholars. ISI Thomson has been providing citation indices for many years. ISI is not so easy to manipulate because, though it depends on user-contributed content (articles by one scholar that cite the work of another), that content is distributed on closed platforms: ISI only indexes citations from a set of published journals whose editorial boards protect their reputation and brand by screening what they publish.
But over the past several years, scholars have increasingly relied on Google Scholar (and sometimes Microsoft Academic) to count citations. Google Scholar indexes citations from pretty much anything that looks like a scholarly article and is reachable by the Google spiders crawling the open web. So, for example, it includes citations in self-published articles, or e-prints of articles published elsewhere. Thus, Google Scholar citation counts depend on user-contributed content distributed on an open platform (the open web).
And, lo and behold, it’s relatively easy to manipulate such citation counts, as demonstrated by a recent scholarly paper that did exactly that: Delgado Lopez-Cozar, Emilio; Robinson-Garcia, Nicolas; Torres Salinas, Daniel (2012). “Manipulating Google Scholar Citations and Google Scholar Metrics: simple, easy and tempting”. EC3 Working Papers 6, 29 May 2012, available at http://arxiv.org/abs/1212.0638v2.
Their method was simple: they created some fake papers that cited other papers, and published the fake papers on the Web. Google’s spider dutifully found them and increased the citation counts for the real papers that these fake papers “cited”.
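To make the mechanism concrete, here is a minimal sketch (in Python; my own illustration, not the authors’ code) of the difference between a closed index and an open crawl. The open-web counter tallies citations from every reachable document, with no screen on the citing document’s provenance, so fake papers count just like real ones:

from collections import Counter

# Hypothetical corpus: each "document" is simply the list of works it cites.
legitimate_docs = [
    ["Smith 2008", "Jones 2010"],
    ["Jones 2010"],
]

# An attacker posts six fake papers on the open web, each citing the paper
# whose count is to be inflated ("Doe 2011" is a made-up target).
fake_docs = [["Doe 2011"] for _ in range(6)]

def citation_counts(docs):
    """Tally citations the way an unscreened crawler does: one count per
    citing document, regardless of where that document came from."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts

print(citation_counts(legitimate_docs))              # closed-platform view
print(citation_counts(legitimate_docs + fake_docs))  # open-web view: "Doe 2011" jumps to 6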
The lesson is simple: for every measure that depends on user-contributed content on an open platform, if valuable decisions depend on it, we should assume it is vulnerable to manipulation. This is a sad and ugly fact about a lot of new opportunities for measurement (“big data”), and one that we must start to address. The economics are unavoidable: the cost of manipulation is low, so if there is much value in manipulating, manipulation will happen. We have to think about ways to increase the cost of manipulation if we don’t want to lose the value of the data.
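The underlying economics fit in a few lines. This toy decision model (all numbers are illustrative assumptions, not estimates) shows why defenses work by raising the manipulator’s cost rather than by changing the value at stake:

def will_manipulate(value_of_higher_metric, cost_per_fake_item, items_needed):
    """Manipulator's participation constraint: net payoff must be positive."""
    return value_of_higher_metric > cost_per_fake_item * items_needed

# Cheap fake citations on an open platform: manipulation pays.
print(will_manipulate(value_of_higher_metric=5000, cost_per_fake_item=10, items_needed=6))    # True

# A screen (editorial review, identity checks) that raises the per-item cost
# flips the decision, even though the prize is unchanged.
print(will_manipulate(value_of_higher_metric=5000, cost_per_fake_item=2000, items_needed=6))  # False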

Even academics pollute Amazon reviews (updated)

[Oops. Turns out that Orlando Figes himself was the poison pen reviewer, and that he simply compounded his dishonesty by blaming his wife. That’s got to put a bit of strain on the home life.]
That people use pseudonyms to write not-arm’s-length book reviews on Amazon is no longer news.
But I couldn’t resist pointing out this new case, if nothing else as an especially fun example to use in teaching. Dr. Stephanie Palmer, a senior law (!) lecturer at Cambridge University (UK), was outed by her husband, Prof. Orlando Figes, for writing pseudonymous reviews that savaged the works of his rivals, while praising a book by her husband as a “beautiful and necessary” account written by an author with “superb story-telling skills.” Story-telling, indeed.
A closing comment by the editor of the Times Literary Supplement, which broke the story: “What is new and is regrettable is when historians use the law to stifle debate and to put something in the paper which is untrue….[Figes’s] whole business is replacing a mountain of lies with a few truths”.
Via The Guardian.

What is pollution, what is manipulation?

For some time, I’ve referred to a variety of user-contributed content activities as “pollution”. Spam, for example. The typical user doesn’t want to receive it. It pollutes the inbox.
But some behaviors that reduce the value of user-contributed content are commonly called “manipulation”. For example, stuffing the ballot box in an online rating system such as Netflix’s, which might be done by, say, the producer of a movie. My colleagues Resnick and Sami have been publishing work on “manipulation-resistant” systems [1][2].
Is there a difference? In both cases, a user with a product to sell is submitting content that most users would agree (if they knew) has negative value. Why not call them both pollution? Both manipulation?
I think there is a difference, but it’s more a matter of degree than absolute. The defining feature of pollution is that the polluter does not benefit from the pollution itself: the cost imposed on users is inadvertent, and they are victims of a side-effect. This is also known as an externality problem: the producer creates and benefits from creating X; X imposes a cost on others; but the producer’s benefit is not directly related to the cost imposed on others (the producer is not generating pollution because she gets satisfaction from making others suffer).
Manipulation costs are not externalities: the benefit to the producer is directly related to the cost experienced by the others. In the Netflix example, the cost to users is that they pay for and watch movies less suited to their tastes than the ones they would otherwise have chosen. But that is precisely the outcome the manipulative content producer wanted to achieve. The manipulator intends to get others to do or experience something they would rather not.
I said this was a matter of degree: in the spam example, some of the benefit (sometimes, perhaps most) to the producer is that she convinces consumers to purchase, even though ex ante those consumers might have said they would rather not receive the spam advertisements (they consider them pollution). Thus, part of the cost of spam might be manipulation cost. The producer doesn’t care about the many users who ignore the spam but suffer from it, but she does care about the effect she is having on those she manipulates into purchasing.
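The distinction can be made concrete with a small payoff model (my own sketch, not drawn from the cited papers): both producers impose a cost on users, but only the manipulator’s payoff is a function of that cost.

def polluter_payoff(private_benefit, cost_to_users):
    """Pollution as externality: the users' cost does not enter the
    producer's payoff at all; it is a pure side-effect."""
    return private_benefit  # cost_to_users is irrelevant to the producer

def manipulator_payoff(transfer_rate, cost_to_users):
    """Manipulation: the producer's gain comes from the harm itself, e.g.
    each mis-targeted purchase is revenue for the manipulator."""
    return transfer_rate * cost_to_users

def spammer_payoff(private_benefit, transfer_rate, cost_to_manipulated):
    """Spam mixes both terms: nuisance to the many who ignore it, plus
    manipulation of the few induced to purchase -- 'a matter of degree'."""
    return private_benefit + transfer_rate * cost_to_manipulated

# The polluter's payoff is flat in the harm done...
print(polluter_payoff(private_benefit=100, cost_to_users=10))      # 100
print(polluter_payoff(private_benefit=100, cost_to_users=10000))   # 100
# ...while the manipulator's payoff scales with it.
print(manipulator_payoff(transfer_rate=0.5, cost_to_users=10000))  # 5000.0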
Why does it matter what we call them? It may not matter much, but labels are useful for guiding our understanding and our design efforts. If I recognize something as having the features of a pollution problem, I immediately know that I can refer to the literature on pollution problems to help me characterize it, and to the literature on solving pollution problems to help generate good designs. Labels are shorthand for abstract models, compactly representing the essential features of the problem and its context.
[1] Eric Friedman, Paul Resnick, and Rahul Sami (2007). “Manipulation-Resistant Reputation Systems”, Ch. 27 in Algorithmic Game Theory (N. Nisan, T. Roughgarden, E. Tardos, and V. Vazirani, eds.), Cambridge University Press.
[2] Paul Resnick and Rahul Sami (2007). “The Influence-Limiter: Provably Manipulation-Resistant Recommender Systems”, Proceedings of the ACM Recommender Systems Conference.

The fine line between spam and foie gras

The New York Times (following others) reported today on a large number of detailed, informed, and essentially all flattering edits to Sarah Palin’s Wikipedia page made — hmmm — in the 24 hours before her selection as the Republican vice presidential nominee was made public. The edits were made anonymously, and the editor has not yet been identified, though he acknowledges that he is a McCain campaign volunteer.
Good or bad content? The potential conflict of interest is clear. But that doesn’t mean the content is bad. Most of the facts were supported with citations. But were they written in overly flattering language? And was the selection of facts unbiased? Much of the material has been revised, toned down, or removed in the few days since, which is not surprising regardless of the quality of this anonymous editor’s contributions, given the attention that Ms. Palin has been receiving.

Pollution as revenge

One of my students alerted me to a recent dramatic episode. Author and psychologist Cooper Lawrence appeared on a Fox News segment and made some apparently false statements about the Xbox game “Mass Effect”, which she admitted she had never seen or played. Irate gamers shortly thereafter started posting (to Amazon) one-star (lowest possible score) reviews of her recent book that she was plugging on Fox News. Within a day or so, there were about 400 one-star reviews, and only a handful any better.

Some of the reviewers acknowledged they had not read or even looked at the book (arguing they shouldn’t have to since she reviewed a game without looking at it). Many explicitly criticized her for what she said about the game, without actually saying anything about her book.

When alerted, Amazon apparently deleted most of the reviews. Its strategy apparently was to delete reviews that mentioned the name of the game, or video games at all (the book has nothing to do with video games). With this somewhat conservative strategy, the reviews remaining (68 at the moment) are still lopsidedly negative (57 one-star, 8 two-star, 3 five-star), more than I’ve ever noticed for any somewhat serious book, though there’s no obvious way to rule these out as legitimate reviews. (I read several and they do seem to address the content of the book, at least superficially.)
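For concreteness, here is a sketch of the kind of conservative keyword screen Amazon appears to have used; the rule is inferred from the reporting, and the term list is my guess:

# Drop any review that mentions the game, or video games at all; keep the rest.
SUSPECT_TERMS = ("mass effect", "video game", "videogame", "xbox")

def keep_review(text):
    lowered = text.lower()
    return not any(term in lowered for term in SUSPECT_TERMS)

reviews = [
    "One star. She reviewed Mass Effect without ever playing it!",
    "One star. The book's argument is thin and poorly sourced.",
]
print([keep_review(r) for r in reviews])  # [False, True]

Note that the second review survives even if it, too, is revenge, which is consistent with the lopsidedly negative reviews that remained.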
Aside from being a striking and different example of book review pollution (past examples I’ve noted involved favorable reviews written by friends or by the authors themselves), I think this story highlights troubling issues. The gamers have, quite possibly, intentionally damaged Lawrence’s business prospects: her sales likely will be lower (I know that I pay attention to review scores when I’m choosing books to buy). Of course, she arguably damaged the sales of “Mass Effect”, too. Arguably, her harm was unintentional and careless (negligent rather than malicious). But she presumably is earning money by promoting herself and her writing by appearing on TV shows: is it a reasonable social response to discipline her for negligence? (And the reviewers who have more or less written “she speaks about things she doesn’t know; don’t trust her as an author” may have a reasonable point: so-called “public intellectuals” probably should be guarding their credibility in every public venue if they want people to pay them for their ideas.)
I also find it disturbing, as a consumer of book reviews, but not video games, that reviews might be revenge-polluted. Though this may discipline authors in a way that benefits gamers, is it right for them to disadvantage book readers?
I wonder how long it will be (if it hasn’t already happened) before an author or publisher sues Amazon for providing a nearly open-access platform for detractors to attack a book (or CD, etc.). I don’t know the law in this area well enough to judge whether Amazon is liable (after all, she could arguably sue the individual reviewers for some sort of tortious interference with her business prospects), but given the frequency of contributory infringement and similar secondary-liability claims in other domains (such as Napster and Grokster facilitating the downloading of copyrighted materials), it seems likely that some lawyer will try to make the case one of these days. After all, Amazon provides the opportunity for readers to post reviews in order to advance its own business interests.
Some significant risk of contributory liability could be hugely important for the problem of screening pollution in user-contributed content. If you read some of the reviews still on Amazon’s site in this example, you’ll see that it would not be easy to decide which of them were “illegitimate” and delete all of those. And what kind of credibility would the review service have if site operators made a habit of deciding (behind closed doors) which too-negative reviews to delete, particularly en masse? I think Amazon has done a great job of making it clear that it permits both positive and negative reviews and doesn’t over-select the positive ones to display, which was certainly a concern I had when it first started posting reviews. But if authors and publishers can hold it liable when it lets “revenge” reviews appear, I suspect it (and similar sites) will have to shut down reviewing altogether.
(Thanks to Sarvagya Kochak.)

Keeping the good stuff out at Yahoo! Answers

This is, I think, an amusing and instructive tale. I’m a bit sorry to be telling it, because I have a lot of friends at Yahoo! (especially in the Research division), and I respect the organization. The point is not to criticize Yahoo! Answers, however: keeping pollution out is a hard problem for user-contributed content information services, and that their system is imperfect is a matter for sympathy, not scorn.
While preparing for my recent presentation at Yahoo! Research, I wondered whether Yahoo! Mail was still using the Goodmail spam-reduction system (which is based on monetary incentives). I couldn’t find the answer with a quick Google search, nor by searching the Goodmail and Yahoo! corporate web sites (Goodmail claims Yahoo! as a current client, but there was no information about whether Yahoo! is actually using the service, or what impact it is having).

So, I thought, this is a great chance to give Yahoo! Answers a try. I realize the question answerers are not generally Yahoo! employees, but I figured some knowledgeable people might notice the question. Here is my question, in full:

Is Yahoo! Mail actually using Goodmail’s Certified Email? In 2005 Yahoo!, AOL and Goodmail announced that the former two had adopted Goodmail’s “Certified Email” system to allow large senders to buy “stamps” to certify their mail (see e.g., http://tinyurl.com/2atncr). The Goodmail home page currently states that this system is available at Yahoo!. Yet I can find nothing about it searching Yahoo! Mail Help, etc. My question: Is the system actually being used at Yahoo! Mail? Bonus: Any articles, reports, etc. about its success or impacts on user email experience?

A day later I received the following “Violation Notice” from Yahoo! Answers:

You have posted content to Yahoo! Answers in violation of our Community Guidelines or Terms of Service. As a result, your content has been deleted. Community Guidelines help to keep Yahoo! Answers a safe and useful community, so we appreciate your consideration of its rules.

So, what is objectionable about my question? It is not profane or a rant. It is precisely stated (though compound), and I provided background context to aid answerers (and so they knew what I already knew).
I dutifully went and read the Community Guidelines (CG) and the Terms of Service (TOS), and I could not figure out what I had violated. I had heard elsewhere that some people do not like TinyURLs because it is not clear where you are being redirected, and thus they might be used to maliciously direct traffic. But I saw nothing in the CG or TOS that prohibited URLs in general, or TinyURLs specifically.
So I used the contact link they provided to appeal the deletion. A few days later I received a reply that cut-and-pasted the information from the Yahoo! Answers help page explaining why content is deleted. This merely repeated what I had been told in the first message (since none of the other categories applied): my content was in violation of the CG or TOS. But no information was provided (for the second time) on how the content violated these rules.
Another address was provided to appeal the decision, so I wrote a detailed message to that address, explaining my question, and my efforts to figure out what I was violating. A few days later, I got my third email from Yahoo! Answers:

We have reviewed your appeal request. Upon review we found that your content was indeed in violation of the Yahoo! Answers Community Guidelines, Yahoo! Community Guidelines or the Yahoo! Terms of Service. As a result, your content will remain removed from Yahoo! Answers.

Well… Apparently it’s clear to others that my message violates the CG or the TOS, but no one wants to tell me what the violation actually is. Three answers, all three with no specific explanation. Starting to feel like I’m a character in a Kafka novel.
At this point, I laughed and gave up (it was time for me to travel to Yahoo! to give my — apparently dangerous and community-guideline-violating — presentation anyway).
I have to believe that there is something about the use of a URL, a TinyURL, or the content to which I pointed that is a violation. I’ve looked, and found many answers that post URLs (not surprisingly) to provide people with further information. Perhaps the problem is that I was linking to a Goodmail press release on their web site, and they have a copyright notice on that page? But does Yahoo! really think providing a URL is “otherwise make available any Content that infringes any patent, trademark, trade secret, copyright” (from the TOS)? Isn’t that what Yahoo’s search engine does all the time?
End of story.
Moral? Yahoo! Answers is a user-contributed content platform. Like most, that means it is fundamentally an open-access publishing platform. There will be people who want to publish content that is outside the host’s desired content scope. How to keep out the pollution? Yahoo! uses a well-understood, expensive method to screen: labor. People read the posted questions and make determinations about acceptability. But, as with any screen, there are Type I errors (false positives: good content rejected) and Type II errors (false negatives: pollution let through). Screening polluting content is hard.
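A toy calculation (illustrative numbers only) shows the price of a strict screen:

def screen_outcomes(n_good, n_bad, false_positive_rate, false_negative_rate):
    """Return (good posts wrongly deleted, polluting posts wrongly kept)."""
    return n_good * false_positive_rate, n_bad * false_negative_rate

# This screen keeps out 99 of 100 polluting posts, but at the price of
# deleting 50 good questions -- like the Goodmail one -- per 1,000 submitted.
print(screen_outcomes(n_good=1000, n_bad=100,
                      false_positive_rate=0.05, false_negative_rate=0.01))
# (50.0, 1.0)

Tightening the screen to catch the last few polluting posts typically drives the false-positive count higher still; that trade-off is what stories like this one illustrate.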
(My question probably does violate something, but surely the spirit of my question does not. I had a standard, factual, reference question, ironically, to learn a fact that I wanted to use in a presentation to Yahoo! Research. A bit more clarity about what I was violating and I would have contributed desirable content to Yahoo! Answers. Instead, a “good” contributor was kept out.)