Scrape unto others as they would scrape unto you

Well, ain’t it ironic that Google is stopping others from scraping its “content” when its “content” is nothing but that which Google scraped from others.

That is, a guy started to scrape GoogleNews to create RSS feeds and Google is blocking him — even though Google makes GoogleNews by scraping the headlines of news sites. What if those news sites did what Google did? Ah, you say, but they wouldn’t because they want Google links and traffic. And that puts Google in the seat of power. And now Google is flexing its power against the little guy.

Sounds evil to me.

  • Dean

    Google has agreements with those news sites to syndicate the content, and pays for it. The RSS feed guy doesn’t.

  • No, Dean, that’s wrong. I run sites that are scraped and I’m glad they do; it sends traffic our way; that’s why I don’t stop them from scraping. There are no syndication deals whatsoever.

  • anne.elk

    Google adds a lot of value in their aggregation — they supply R&D, search algorithms, automated ways to evaluate what they have collected and to present it, disk space, sysadmin, power, etc. They pay for all of this through ads. They are going to IPO soon and are in a very sensitive period. It would be damaging to their investors to let that added value be scraped away from them and offered by someone else in a google ad free manner.
    Any third party site can block google with a simple robots.txt. There are other search engines, yahoo for one, and Microsoft intends to launch their search service soon. That Microsoft has not yet launched and will soon launch demonstrates how competitive an environment it still is, and how reasonable it is for google to feel vulnerable.
    In your post above, you mention how basically sucky you feel google’s ads are in comparison to manually purchased blogads. That google does as well as they do is by paying dozens of engineers lots of money to provide very sophisticated algorithms. And yet all of those efforts pale in comparison to one or two folks hand placing ads.
    Julian’s effort, while of value to many, does not approach any of the scale of google’s added value.
    Google’s SOAP API does show that Google is experimenting with collaborative web protocols and trying to determine how that affects their bottom line.
    Google revolutionized net searches to the extent that many people erroneously think Google is more of a web public utility and ought to be regulated and forced into certain actions.
    They also pay for their search results through ad presentation.
    It is not at all clear from this example that google is evil — far from it in fact. They scrape, they add lots of value, they present it, they present their terms of service.
    Julian’s efforts could conceivably destroy all that.
    Jeff, Jeff, Jeff, no matter how much you wish for it, the world is not black and white.
    Without google, how much harder it would be to accurately and quickly refute Glenn Reynolds and to correct you all between compiling programs!

  • sol

    Hey, a company’s being successful! Quick, let’s call them evil!

  • “Julian’s efforts could conceivably destroy all that.”
    Many a news organization could say the same of Google News. Jeff is right here; if everyone had used the same rationale you’re currently deploying when Google was starting its own news service, it would never have gotten off the ground.
    To say that their hypocritical behavior is alright because it adds more value is simply to defend the status quo; how are we ever going to know if Julian Bond’s service might not end up adding far greater value if it isn’t even allowed to take off?

  • anne.elk

    You are an aggregator too, aren’t you? People come to your blog to see your content and follow the links you provide to others’ content. You too have a copyright notice; it says, “COPYRIGHT NOTICE:
    It’s mine, I tell you, mine! All mine! You can’t have it because it’s mine! You can read it (please); you can quote it (thanks); but I still own it because it’s mine! I own it and you don’t. Nya-nya-nya. So there.
    COPYRIGHT 2001-2003-20?? by Jeff Jarvis”
    How does your copyright notice differentiate you from Google News? Are you more or less evil than Google?
    If I took your RSS feed and reposted it from my servers how would that be different from Julian’s efforts? If Julian took Technorati’s search results and made RSS feeds for those, and Technorati stopped him, would Technorati be evil?
    Notes to Abiola: Google has always respected robots.txt and provides other mechanisms that allow your site to opt out of googling. I am not sure why you would prevent them from asking the same of others.
    Google was started on a shoestring (with two brilliant grad students and some very good advisors). Google started against entrenched industry stalwarts Yahoo, Altavista, and Hotbot (remember the last two?). If Julian is offering value, presumably the market is immature enough that he can get a couple of engineers and VCs behind him and start a competing service.

  • Greg G
    A beautiful value-add to the google news.

  • Sigivald

    I remember AltaVista … *moment of silence*.
    I also don’t see the evil here, especially since, er, if all the feeds Google aggregates are “free” (Google isn’t paying for them, according to Mr. Jarvis), why can’t the Little Guy just scrape them himself, if it’s not a matter of taking Google’s “added value”?
    (Contra Abiola, if a news organisation thinks Google will “hurt” them by aggregating their feed, why, like everyone’s said, they can easily stop Google from doing so. They have the same ability to not be aggregated that Google is being slammed for desiring… I just don’t see what the problem is. It’s not like Google is suing the guy or something.)

  • “Notes to Abiola: Google has always respected robots.txt and provides other mechanisms that allow your site to opt out of googling. I am not sure why you would prevent them from asking the same of others.”
    Thanks for telling me what I already know, but the point isn’t about what’s technically possible and what isn’t, but about what is morally proper. If everyone followed Google’s maxim, there’d be no Google News to scrape in the first place. No amount of “brilliance” and shoe-string engineering does anything to alter the validity of this argument; or are you trying to say that we ought to have different ethical principles for different people, based merely on the impressiveness of their resumes?
    “I also don’t see the evil here, especially since, er, if all the feeds Google aggregates are “free” (Google isn’t paying for them, according to Mr. Jarvis), why can’t the Little Guy just scrape them himself, if it’s not a matter of taking Google’s “added value”?”
    This is a silly argument. Why can’t Google do its own news collection, if it’s not a matter of taking others’ “added value?”
    “It’s not like Google is suing the guy or something.”
    The point isn’t that Google is taking him to court (yet). The point is that it is hypocritical of Google to be threatening people for doing to them what they themselves have built a business on doing to others. It may be legal, but it still is hypocritical. The existence of “robots.txt” is completely irrelevant to the charge of hypocrisy.

  • If I might add something here since everyone is talking about me…
    In Oct 2002, I saw Google News Beta and being an RSS junkie thought “I want some of that”. I emailed news-feedback asking for RSS Output and got no reaction. So I hacked, cut and pasted some scraping code and produced gnews2rss.php and started using it to feed my personal aggregator. I made the source public domain and encouraged people to use it, host it themselves and hack it on further. I got really burnt on bandwidth costs by doing the same thing for blogger before they had RSS, so twice a month I insert a dummy item telling people to host it themselves and to email Google asking for RSS from News. About 6 months ago I started including gnews2rss feeds in Ecademy and waited for the Google complaint. It’s finally arrived so the feeds have gone from Ecademy. Google have never asked me to remove gnews2rss, only to stop republishing the data on the web.
    Now, Google is heading into an IPO and needs to be seen to enforce their T&Cs. I have no problem with that. I was playing fast and loose with their terms and now I’ve had my knuckles rapped.
    But the underlying problems remain.
    – No Ads on Google News. So what’s the problem?
    – Google News still in beta. Why?
    – Google API is unchanged since launch and still only covers the main search. Where’s the API for images, news, groups, froogle?
    – No metadata or XML/RDF output from any of their systems apart from Blogger. And when the Blogger people finally do syndication (now that Google has removed the bandwidth objection) the Blogger people choose Atom.
    Why? Is it just lack of programming resource? Perhaps the people who might spend their 20% private programming time on this aren’t interested in metadata?
    Meanwhile Yahoo has the MyYahoo aggregator and RSS from news search. So I’ve just switched to Yahoo! news for the websites. Right now though, Google news search is still superior so for my own personal use I continue to use gnews2rss.
    So instead of (or as well as) arguing here, can I ask you all to email Google news feedback and ask for RSS/Atom output? The API was a great start and won them a lot of geekie kudos. But they’ve dropped the ball.
    Yahoo! has pretty much everything now that Google has and in some cases more. It’s just that the quality is not quite as good. That’s a pretty slim margin for Google to base themselves on.
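
For readers who have not met the robots.txt opt-out that several commenters lean on, here is a minimal sketch of how it works, using Python's standard urllib.robotparser module. The file contents, the example.com URL, the "SomeOtherBot" name, and the can_fetch helper are all made up for illustration; this is not any news site's actual configuration, only a demonstration that a site can turn one named crawler away while leaving the door open to everyone else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a news site might publish to opt out of one crawler.
# "Googlebot" is the user-agent name Google's crawler announces; the rest of
# this file is an illustrative assumption.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "http://example.com/news/story.html"  # placeholder URL
    print("Googlebot allowed:", can_fetch(ROBOTS_TXT, "Googlebot", story))
    print("SomeOtherBot allowed:", can_fetch(ROBOTS_TXT, "SomeOtherBot", story))
```

Note that robots.txt is only a request: it works because well-behaved crawlers choose to honor it, which is why the thread's argument is about what is proper rather than what is technically enforceable.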