Link Spam Detection Based on Mass Estimation


Monday, November 28, 2005

Not sure how many people believe TrustRank is in effect in the current Google algorithm, but I would be willing to bet it is. Recently another link quality research paper came out by the name of Link Spam Detection Based on Mass Estimation [PDF].

It was authored by Zoltan Gyongyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen.

The proposed method for determining Spam Mass works to detect spam, so it compliments nicely with TrustRank (TrustRank is primarily aimed to detect quality pages and demote spam).

The paper starts off by defining what spam mass is.

Spam Mass - an estimate of how much PageRank a page accumulates by being linked to from spam pages.

I covered a bunch of the how it works in theory stuff in the extended area of this post, but the general takehome tips from the article are

  • .edu and .gov love is the real deal, and then some
  • Don't be scared of getting a few spammy links (everyone has some).
  • TrustRank may deweight the effects of some spammy links. Since most spammy links have a low authority score they do not comprise a high percentage of your PageRank weighted link popularity if you have some good quality links. A few bad inbound links are not going to put your site over the edge to where it is algorithmically tagged as spam unless you were already near the limit prior to picking them up.
  • If you can get a few well known trusted links you can get away with having a large number of spammy links.
  • These types of algorithms work on a relative basis. If you can get more traditional media coverage than the competition you can get away with having a bunch more junk links as well.
  • Following up on that last point, some sites may be doing well in spite of some of the things they are doing. If you aim to replicate the linkage profile of a competitor make sure you spend some time building up some serious quality linkage data before going after too many spammy or semi spammy links.
  • Human review is here to stay in search algorithms. Humans are only going to get more important. Inside workers, remote quality raters, and user feedback and tagging gives search engines another layer to build upon beyond link analysis.
  • Only a few quality links are needed to rank in Google in many fields.
  • If you can get the right resources to be interested in linking your way (directly or indirectly) a quality on topic high PageRank .edu link can be worth some serious cash.
  • Sometimes the cheapest way to get those kinds of links will be creating causes or linkbait, which may be external to your main site.

On to the review...

  • To determine the effect of spam mass they computate PageRank twice. Once normally and then again with more weight on known trusted sites that would be deemed to have a low spam mass.
  • Spammers either use a large number of low PageRank links, a few hard to get high PageRank links, or some combination of the two.
  • While the quality authoritative links to spam sites are more rare, they are often obtained through the following
    • blog / comment / forum / guestbook spam
    • honey pots (creating something useful to gather link popularity to send to spam)
    • buying recently expired domain names
  • if the majority of inlinks are from spam nodes it is assumed that the host is spam, otherwise it is labeled good. Rather than looking at the raw link count this can further be biased by looking at percent of total PageRank which comes from spam nodes
  • to further determine the percent of PageRank due to spam nodes you can also look at link structure of in-direct nodes and how they pass PageRank toward the end node
  • the presumption of knowing weather something is good or bad is not feasible, so it must be estimated from a subset of the index
  • for this to be practical search engines must have white lists and / or black lists to compare other nodes to. this can be automated or manual compiled
  • it is easier to assemble a good core since it is fairly reliable and does not change as often as spam techniques and spam sites (Aaron speculation: perhaps this is part of the reason some uber spammy older sites are getting away with murder...having many links from the good core from back when links were easier to obtain)
  • since the small reviewed core will be much smaller of a sample than the number of good pages on the web you must also review a small random uniform sample of the web to determine the approximate percent of the web that is spam to normalize the estimated spam mass
  • due to sampling methods some nodes may have a negative spam mass, and are likely to be nodes that were either assumed to be good in advance or nodes which are linked closely and heavily to other good nodes
  • it was too hard to manually create a large human reviewed set, so
    • they placed all sites listed in a small directory they considered to be virtually void of spam in the good core (they chose not to disclose the URL...anyone want to guess which one it was?). this group consisted of 16,776 hosts.
    • .gov and .edu hosts (and a few international organizations) also got placed in the good core
    • those sources gave them 504,150 unique trusted hosts
  • of the 73.3 million hosts in their test set 91.1% have a PageRank less than 2 (less than double the minimum PageRank value)
  • only about 64,000 hosts had a PageRank 100 times the minimum or more
  • they selected an arbitrary limit for minimum PageRank for reviewing the final results (since you are only concerned about the higher PageRank results that would appear atop search results) of this group of 883,328 sites and they hand reviewed 892 hosts
    • 564 (63.2%) were quality
    • 229 (25.7%) were spam
    • 54 (6.1%) uncertain (like beauty, spam is in the eye of the beholder)
    • 45 (5%) hosts down
  • ALL high spam mass anomalies on good sites were categorized into the following three groups
    • some Alibia sites (Chinese was far from the core group),
    • Blogger.com.br (relatively isolated from core group),
    • .pl URLs (there were only 12 polish educational institusions in the core group)
  • Calculating relative mass is better than absolute mass (which is only logical if you wanted the system to scale, so I don't know why they put it in the paper). Example of why absolute spam mass does not work:
    • Adobe had lowest absolute spam mass (Aaron speculation: those taking the time to create a PDF are probably more concerned with content quality than the average website)
    • Macromedia had third highest absolute spam mass (Aaron speculation: lots of adult and casino type sites have links to Flash)

    [update: Orion also mentioned something useful about the paper on SEW forums.

    "A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, 17 search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages. Similarly, Bencz´ur et al. [Bencz´ur et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x. These methods are powerful at detecting large, automatically generated link spam structures with “unnatural” link patterns. However, they fail to recognize more sophisticated forms of spam, when spammers mimic reputable web content. "
    So if you are using an off the shelf spam generator script you bought from a hyped up sales letter and a few thousand other people are using it that might set some flags off, as search engines look at the various systematic footprints most spam generators leave to remove the bulk of them from the index.]

    Link from Gary

    Search The Search Engines for "Link Spam Detection Based on Mass Estimation": Google | Yahoo! Search | Teoma | MSN Search Beta | Wisenut | Gigablast
    Category(s) seo tips
    Posted by Aaron Wall (of SEO Book.com) at November 10, 2005 07:47 AM
Comments

From the above: "Spammers either use a large number of low PageRank links, a few hard to get high PageRank links, or some combination of the two."

Seems pretty broad and could almost cover the incoming links for, er, possibly 80% of sites on the net?

Posted by: bock at November 14, 2005 04:17 PM

>Seems pretty broad and could almost cover the incoming links for, er, possibly 80% of sites on the net?

Well I think if you look at the power laws bit mentioned by Orion that you can think of it a bit more as mathematical patterns.

Most automated spam systems have patterns associated with them.

Algorithms and systems like these are not about catching all spam, but are about reducing the volume of spam to a manageable level.

Posted by: aaron wall at November 15, 2005 03:04 AM

They are looking for spikes in the high PageRank Web sites (and their scale for PageRank does not match the classic PageRank definition -- this is a modified PageRank algorithm).

According to their test, about 1 in 4 "high PageRank" sites were found to be spam. Given how few of the sites were tested (less than 1 million out of a database of 70+ million), these sites would probably correspond to Toolbar PR 8 and above sites, maybe only PR 9 and above.

A lot of layering is implied in this model. That is, they suggest that the spammers create(d) low-quality sites that were used to boost a few other sites, which were used to boost ultimate target sites to high PR. My feeling is that the practice is a bit more sophisticated than that.

The paper also identifies "cliques" on the Web, isolated regions where good content pretty much links in a closed group. The MegaSpam sites apparently don't fall into the clique categories. They have stepped past the boundaries of self-inclusion and actually participate in natural Web community linking.

But the paper argues that, despite this apparent blending with the natural Web community, bad sites with high PR still stick out as bad sites.

Since PR doesn't affect search rankings very much, the paper in itself is not significant, but it may signal future trends in spam detection that could feed new filters.

0 Comments:

Post a Comment

<< Home