Google's Patent: Information Retrieval Based on Historical Data


Wednesday, November 30, 2005

This report has been prepared to help SEOs understand the concepts and practical applications contained in Google's US Patent Application #20050071741 - Information Retrieval Based on Historical Data. My own advice and interpretation are offered throughout this paper - please conduct your own research before acting on the recommendations.


Overview of the 5 Most Critical Concepts from this Paper

These 5 concepts are what I believe to be the most ground-breaking and important for search engine optimization professionals to understand in order to best conduct their work.

1. Google's Concept of "Document Inception"

The date of "document inception", which can refer to either a website as a whole or a single page, is used in many different areas by Google. This data can come from the registration info, the date Google first found a link to the site/page or the site/page itself. Google will be using this data to rank documents and establish credibility and relevance.

2. How Changing Content can Affect Rankings

Changing content over time has a huge impact in Google's measures according to this patent. They use changes to determine "freshness" or "staleness" of websites and pages and how that data impacts the value of the links on the page as well as its rankings. They'll also measure large, "real", content changes vs. superfluous changes and rank based on that data.

Google also says that for some types of queries, particular results are more valuable - stale results may be desirable for information that doesn't need updating, fresh content is good for results that require it, seasonal results may pop up or down in the rankings based on the time of month/year, etc.

3. Spam Detection & Punishment

Google is employing many new systems of spam detection and prevention according to the patent. These include:

  • Watching for sites that rise in the rankings too quickly
  • Watching for registration information, IP addresses, name servers, hosts, etc. that are on their "bad list"
  • Growth of off-topic links
  • Speed of link gain
  • Percentage of similar anchor text
  • Topic/Subject shifts or additions

4. What Google is Attempting to Measure

Google wants to measure or is attempting to actively measure each of the following:

  • Domain information
    • Registration date
    • Length of renewal (10 years, 5 years, 1 year, etc)
    • Addresses and Names of admin & technical contacts
    • DNS Records
    • Address of Name Servers
    • Hosting Location & Company
    • Stability of this data
  • Information on User Behavior Online
    • CTR (Click-Through Rate) of individual results in the SERPs
    • Length of time spent on a given site/page
  • Data contained on your computer
    • Favorites/Bookmarks List
    • Cache & Temp Files
    • Frequency of visits to particular sites/pages (history)

5. The Impact of this Patent

I believe that this patent will help to verify most of the theories surrounding Google's rankings. There has been speculation over the past 18-24 months on nearly every subject covered in this patent at the major SEO forums, but this will serve as verification.

Although it is long, I urge every SEO/Webmaster to read this page completely. I have attempted to make the information legible and readable, and only pulled out parts that are important to the active practice of SEO (which was almost 2/3 of the document, surprisingly). If you have any questions or corrections on this summary, please send me an email.


Analysis & Interpretation of the 63 Patent Components

History Data

1. Documents may be scored in Google's rankings based on "one or more types of history data".

Inception Date

2. The "inception date" read - registration date - may be considered as a scoring factor (I assume that older will be considered better, but this is not spelled out).

3. Google may determine how old each of the pages on a given website is and then determine the average age of pages on the website as a whole. The difference between a specific page's age and the average age of all documents on the site will be used in the ranking score.

4. The score for a website may include the amount of time since "document inception" - i.e. how old the website is.

5. One methodology of discovering site age might include when Google first "discovered" - read spiders the site, when Google first finds a link to the site, and when the site contains a "predetermined number of pages". I interpret this to mean that Google has some kind of threshold for site size (number of pages) that when reached, triggers a scoring effect (probably positive).

Frequency of Document Changes over Time

6. Google's scoring will (according to the patent) be based on "determining a frequency at which the content changes over time".

7. The "frequency at which the content changes" will be determined by the average time between changes, the number of changes over a particular time period, and the rate of change of one time period vs. the rate of change for another time period. So, if you are updating your website every day, then switch to updating once a week, your scoring in the historical measurements at Google will shift.

8. Scoring will also include how much of the site has changed over a given time period (new pages, changes, etc.).

9. The scoring based on changes (described in #8) will be determined by the number of new pages within a time period, the ratio of new pages vs. old pages and the total "percentage of the content of the document that has changed during a timed period."

10. The scoring of changes (from #8) will be based on the "perceived importance of the portions" that have been changed. The score will also take into account the changes as compared to the weighting(s) of each of the different pages of the site - i.e. if important pages change, it will have a different impact than if unimportant pages changed. My guess is that importance is mostly determined by links (both internal and external) that point to a given page. So if your contact page changes, it's not a big deal, but if your home page changes, that's a bigger deal.

11. The scoring for a "plurality of documents" - many pages in a given website - includes determining the last date of change for each page, determining the average date of change, and scoring the documents based on, "at least in part", the difference between a specific page's change and the average document's change. So, if one page had new information added, it would be scored differently than the other pages, while if all the pages changed together (maybe a new date, or new link or copyright in the footer, etc.), they would all be equal (since their date of change compared to the average is the same).
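
The three measures in #7 (average time between changes, number of changes in a period, and one period's rate vs. another's) can be sketched in a few lines of Python. This is only an illustration: the patent does not say how change dates are stored, so the list of dates, the 30-day window and the function names below are my own assumptions.

```python
from datetime import date, timedelta


def average_days_between_changes(change_dates):
    """Mean gap in days between consecutive changes (None with fewer than 2 dates)."""
    ordered = sorted(change_dates)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def changes_in_window(change_dates, end, window_days=30):
    """Count changes in the window_days up to and including `end`."""
    start = end - timedelta(days=window_days)
    return sum(1 for d in change_dates if start < d <= end)


def change_rate_ratio(change_dates, end, window_days=30):
    """Changes in the latest window divided by changes in the window before it."""
    recent = changes_in_window(change_dates, end, window_days)
    previous = changes_in_window(change_dates, end - timedelta(days=window_days), window_days)
    return None if previous == 0 else recent / previous


# Hypothetical site: updated daily in March 2005, then weekly in April 2005.
daily = [date(2005, 3, 1) + timedelta(days=i) for i in range(31)]
weekly = [date(2005, 4, 7) + timedelta(days=7 * i) for i in range(4)]
history = daily + weekly

ratio = change_rate_ratio(history, end=date(2005, 4, 30))
print(ratio)  # well below 1.0, i.e. the update rate has slowed
```

A ratio well below 1.0 is the "switch from daily to weekly updates" case described in #7; what Google would do with such a number is not stated in the patent.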

Amount of Changes over Time

12. Google's score may also include a measure of the amount of content which changes over time on the given website.

13. The "amount of content changes" from #12 will be determined by the ratio of new pages vs. the total number of pages on the site, and the percentage of content change over a given time period.

14. The "changes over a given time" from #13 will be scored based on "weighting different portions of the content differently based on a perceived importance" - once again, I read this as internal and external links to a page - the more links, the more "perceived importance".

Click-Through Rate Data

15. The "history data" from #1 could include information on "how often the document is selected when the document is included in a set of search results". This is literally tracking clickthroughs and rewarding those sites with higher CTR - just like AdSense does. Google will be scoring based on the "extent to which the document is selected over time... when included in a set of search results". We always assumed this to be true, but this is the first hard evidence I've seen directly from the horse's mouth.

16. Google may assign a "higher score" when the document is selected more often. No-brainer.

Document Association to Search Terms

17. Google might be scoring based on "determining whether a document (that has been showing up in the search results) is associated with the search terms".

Queries that Remain the Same but have New Meanings over Time

18. Google (according to the patent) calculates whether the "information relating to queries" remains the same or changes and scores documents based on this. For example, prior to September 11, the phrase 9-11 would not have been associated with terrorism; afterwards, it would be. Google will score documents based on the changes in the results for a given query to keep up with the times.

Staleness of Documents

19. The "staleness of documents" might be calculated as part of Google's scoring.

20. Google may also determine whether "stale documents" are preferable for certain types of queries (those that don't change over time, or for which a specific, single answer is what's necessary).

21. The "favorability" of stale documents may be determined by how often they are clicked on in the search results (over other documents). I relate this to a Wikipedia article on the nature of volcanoes - it doesn't need too much updating and will be a good relevant source for a long time for the query - "nature of volcanoes".

Link Behavior

22. History data scores might also consider the "behavior of links over time".

23. The appearance and disappearance of links figure into the scoring for link behavior (from #22).

24. The appearance/disappearance of links are dated by Google and used in the scoring.

25. The link appearances/disappearances are monitored and Google measures "how many links... appear or disappear during a time period, and whether there is a trend" toward more links or fewer links. The temporal (time-based) nature of groups of links will be scored by Google.

Freshness of Links

26. Google may use the "freshness of links" and assign weights to links based on freshness.

27. The "freshness" of a link (from #26) is calculated by the date of appearance of that link, the date of any changes in the link or anchor text, the date which the page and site that the link is from appeared and the date of the links to that linking page. So, if you have a new blog entry that points to a new site, the freshness will be super-fresh, since the page is new, the link to the page is new, the blog page that links to it is new, and the link to your blog entry on your own site is new (that's a lot of new, hence it's super-fresh).

28. The weight of a link also takes into account how "trusted" the site is, how authoritative the page with the link on it is and how "fresh" the page & site containing the link are.

29. The scoring also takes into account the "age distribution associated with the links based on the ages of the links". Google will take into account the age of the links to your page, and the time periods over which you got the links, i.e. lots of new links, a wide distribution over time, most links from a long time ago, etc.

Anchor Text Changes over Time

30. Google may also calculate changes in anchor text over time and use this data to score. My guess is that anchor text doesn't change very often, but they're certainly free to measure it.

Content Changes in a Document compared to Linking Anchor Text

31. Google might also measure if the content of a document changes, but the anchor text remains the same, or vice-versa. They're trying to protect against the anchor text "bait and switch" that makes a document look relevant to the anchor text, then replaces it with something else.

Freshness of Anchor Text

32. Freshness of anchor text can be considered.

33. Freshness of anchor text is calculated by "date of appearance", "date of change", and the dates of change and appearance of the page the link is on.

Traffic Characteristics of Site/Page

34. Traffic characteristics associated with a page/site may be taken into account in scoring.

35. The traffic pattern will have associated analysis that might feed into Google's score. So Google must be measuring traffic to a site/page and determining if, over time, it increases, decreases, etc. - they're seeking trends on which to base scoring.

User Behavior

36. User behavior regarding a particular page/site may figure into the scoring.

37. Google says that user behavior (from #36) is basically just the percentage of the time users click on a site/page when it is listed in the search results pages, along with the amount of time that users spend "accessing the document". I guess we all need to keep up the amount of time people spend on our sites.

Domain Related Information

38. The scoring might also include the sites associated with a given site and the "domain-related" information. This is defined in greater detail below.

39. Associated sites (from #38) are measured in terms of "legitimacy", which I interpret to mean non-spam, different owner, etc. Google says, specifically "scoring the document based... on whether the domain associated with the document is legitimate."

40. The "expiration date of the domain", the "domain name server record" and the "name server associated with the domain" are all parts of how Google will establish the legitimacy of an "associated" site.

Prior Rankings Data

41. History data scores could also take into account "information relating to a prior ranking". This means Google will be storing information about previous rankings for a site and using them to base scores on.

42. Google may also calculate where in the previous rankings the site was and how it moved around as pieces to figure into the scoring data.

43. In reference to #41, Google is using seasonal, "burstiness" and changes in scores over time as metrics to calculate the prior rankings scoring. So if a site is particularly relevant for "gifts for girlfriend" around Valentine's Day, but not as much for the same query at Christmas, Google will record this information and rank accordingly.

44. Google could also, with regard to #41, record "spikes in the rank" of site/pages in the search results.

User Maintained Data

45. "User maintained data" may also be recorded and monitored for the rankings scores.

46. "User maintained data" includes favorites lists, bookmarks, temp files and cache files of monitored users. I'm not sure how they could obtain this data without installing "Google Spyware" - perhaps in the form of desktop search or the Google toolbar.

47. Monitoring the rate at which a site/page "is added to or removed from user generated data" may be used in the scoring.

Growth Profiles of Anchor Text

48. Scores might include "growth profiles of anchor text" - Google could monitor the use of anchor text in large groups and where/when they point to different sites & pages.

Linkage of Independent Peers

49. Information "relating to linkage of independent peers" might be added to scoring by "determining the growth in a number of independent peers that include the document". Google will basically be monitoring sites that are not in your subject category and how they link to you (I assumed they meant non-related subject peers, but they actually mean off-topic sites; see - Linkage of Independent Peers, below).

Document Topics

50. "Document topics" may be included in the scoring; this includes using "topic extraction". I assume this is determined by Google's text mining and analysis of the actual words on the page.

Identifying Relevant Documents

51. Relevance of documents to a given search query may be part of the scoring system. This is just Google's way of saying that documents about "pink dogs" will be part of those analyzed by the ranking algorithm when a user queries "pink dogs".

Plurality of History Data

52. Google might also use "means for obtaining a plurality of types of history data associated with the document" to score sites/pages. This just means that they will use a methodology that groups all of the bits of historical information into the rankings together to determine scoring.

History Component

53. "History data" can be measured by Google and used in the rankings. I'm not sure what they're referring to here - the entire quote is: "A system for scoring a document, comprising: a history component configured to obtain one or more types of history data associated with a document; and a ranking component configured to: generate a score for the document based, at least in part, on the one or more types of history data."

Ranking of Linked Documents

54. Google may be measuring the documents you link to and scoring based "on a decaying function of the age of the linkage data". So, fresher links vs. stale links will be taken into account (although whether there is a positive or negative effect associated with this is unknown).

55. For #54, Google says the "linkage data includes at least one link." So, they won't be measuring linkage data for pages with no links.

56. For #54, Google may include the anchor text in the linkage data.

57. For #54, Google says the "linkage data includes a rank based... on links and anchor text provided by one or more linking documents." Google is simply saying that linkage data includes the anchor text and other info about the links coming to a page.

58. Google can use the "longevity of the linkage data" and determine from that an adjustment of the rankings based on the changes, stability & age of the linkage data. They explain below how they score this.

59. Google will be "penalizing the ranking if the longevity indicates a short life for the linkage data and boosting the ranking if the longevity indicates a long life for the linkage data." Google is, in effect, explaining a little of what we call "sandboxing" - they're saying that the older a link is, the more value it has, while new links have relatively lower value. This doesn't completely explain the effect, as many sites rank well quickly, etc. - but, it is an explanation for the phenomenon.

60. Google can adjust scoring by penalizing for linking documents they consider "stale" over a period of time and boost scoring if the content is frequently updated. So, it's better to be linked to on a page that frequently updates its content.

61. "Link churn" may be measured (explained in #62) and scoring adjusted based on this.

62. "Link churn" is "computed as a function of an extent to which one or more links provided by the document changes over time". Once again, Google is referring to the changes in where links point, their anchor text, etc. on a given page. More changes means more "link churn".

63. "Link churn" might create a penalization if it is above a certain threshold. So, if your links are changing all the time, the link will not be as valuable. This would shut down the methods used by the popular "Traffic Power/1p" spam company.
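
The patent only says that link churn is "a function of" how much a page's outgoing links change, so any concrete formula is guesswork. One plausible sketch, assuming we have two snapshots of the (URL, anchor text) pairs on a page, is the share of links that differ between them (a Jaccard distance). The snapshot format, the function name, the example URLs and the 0.5 threshold are all hypothetical.

```python
def link_churn(old_links, new_links):
    """Share of links that differ between two snapshots.

    0.0 means the snapshots are identical, 1.0 means nothing was kept.
    Each link is a (url, anchor_text) pair, so an anchor text change
    counts as churn too.
    """
    old_links, new_links = set(old_links), set(new_links)
    union = old_links | new_links
    if not union:
        return 0.0
    return len(old_links ^ new_links) / len(union)


last_month = {("http://example.com/a", "widgets"), ("http://example.com/b", "gadgets")}
this_month = {("http://example.com/a", "widgets"), ("http://example.org/c", "cheap pills")}

CHURN_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the patent
churn = link_churn(last_month, this_month)
print(round(churn, 3), churn > CHURN_THRESHOLD)
```

Here one link was kept, one dropped and one added, so two of the three distinct links differ and the page would cross the made-up threshold.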


Patent Description:

Background of the Invention:

This is designed for IR (Information Retrieval) Systems and specifically to the methods used to generate search results.

Description of Related Art:

This information is largely irrelevant, but one important quote is: "There are several factors that may affect the quality of the results generated by a search engine. For example, some web site producers use spamming techniques to artificially inflate their rank. Also, "stale" documents (i.e., those documents that have not been updated for a period of time and, thus, contain stale data) may be ranked higher than "fresher" documents (i.e., those documents that have been more recently updated and, thus, contain more recent data). In some particular contexts, the higher ranking stale documents degrade the search results. Thus, there remains a need to improve the quality of results generated by search engines."

Summary of the Invention:

Google says "history data associated with the documents" may be used to score them in the search results. The invention provides a "method for scoring a document" and it "may include determining the age of linkage data associated with a linked document and ranking the linked document based on a decaying function of the age of the linkage data."
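
The patent does not name the decaying function. As one possible reading, the sketch below uses exponential decay with a half-life. The choice of exponential decay, the 180-day half-life and the function names are my assumptions, not anything stated by Google.

```python
def decayed_weight(age_days, half_life_days=180.0):
    """Weight of a single link: 1.0 when new, halving every half_life_days."""
    if age_days < 0:
        raise ValueError("age_days cannot be negative")
    return 0.5 ** (age_days / half_life_days)


def decayed_link_score(link_ages_days, half_life_days=180.0):
    """Sum of decayed weights over all links pointing at a document."""
    return sum(decayed_weight(age, half_life_days) for age in link_ages_days)


# Three hypothetical links: brand new, six months old and a year old.
print(decayed_link_score([0, 180, 360]))
```

Note that this particular choice favors fresh links, while item 59 above describes a boost for long-lived linkage data; the patent leaves room for both kinds of adjustment.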

Brief Description of the Drawings:

The drawings are all exceptionally simple charts showing the process for examination. A PDF with the charts at the bottom is available at http://files.bighosting.net/tr19070.pdf

Exemplary History Data:

This is the canonical and expository section of the patent description. It contains examples and explanations of many of the most important parts of this study, including detailed descriptions for many of the 63 components.

Document Inception Date

Google notes that the "date" label is used broadly and may include many time & date measurements. Google describes several of the techniques used to obtain an "inception date" and mentions that some techniques are "biased" because they can be influenced by a 3rd party.

The first technique used is when Google learns of or indexes the document - either by finding a link to the site/page, or following it. A second technique uses the registration date of the URL or the first time it was referenced in a "news article, newsgroup, mailing list" or combination of these types of documents.

The patent mentions that Google assumes that a "fairly recent inception date will not have a significant number of links from other documents." However, they say that the document's rankings can be adjusted accordingly based on how well it is doing in terms of links with consideration for its age.

Google is also wary of spam. They use the following example (which is already being quoted around the web):

"Consider the example of a document with an inception date of yesterday that is referenced by 10 back links. This document may be scored higher by (Google) than a document with an inception date of 10 years ago that is referenced by 100 back links because the rate of link growth for the former is relatively higher than the latter. While a spiky rate of growth in the number of back links may be a factor used by (Google) to score documents, it may also signal an attempt to spam search engine 125. Accordingly, in this situation, (Google) may actually lower the score of a document(s) to reduce the effect of spamming."

Google might also use the date of inception as a method for measuring the "rate at which links to the document are created". They say that "this rate can then be used to score the document, for example, giving more weight to documents to which links are generated more often."

The patent goes so far as to provide a formula for link-based score modification:

H=L/log(F+2),

H = history-adjusted link score
L = link score given to the document, which can be derived using any known link scoring technique that assigns a score to a document based on links to/from the document
F = elapsed time measured from the inception date associated with the document (or a window within this period).

The result of this formula (taking the log as base 10) would be that on the day of inception (F = 0), L will be divided by 0.301 - the equivalent of multiplying L by about 3.32. After 10 days (or any other unit of time), the formula will divide L by 1.079, and H keeps getting smaller as time goes on.
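
To check the arithmetic, here is a minimal Python sketch of the formula. It assumes the logarithm is base 10 (which is what produces the 0.301 figure) and that F is counted in days; the patent text quoted here fixes neither, and the function name is my own.

```python
import math


def history_adjusted_link_score(link_score, elapsed):
    """H = L / log10(F + 2). The base-10 logarithm is an assumption."""
    if elapsed < 0:
        raise ValueError("elapsed time cannot be negative")
    return link_score / math.log10(elapsed + 2)


# With L fixed at 1.0, H shrinks as the document ages.
for elapsed_days in (0, 10, 100, 1000):
    print(elapsed_days, round(history_adjusted_link_score(1.0, elapsed_days), 3))
```

With L held constant, the same link score is worth less and less as F grows, which is the aging effect the formula is meant to capture.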

The patent then suggests that "for some queries, older documents may be more favorable than newer ones" and that, as a result, Google may "adjust the score of a document based on the difference (in age) from the average age of the result set". This would push certain pages up or down in the rankings depending on their age and the age of their competition.
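
The patent gives no formula for this adjustment. A minimal sketch, assuming a simple linear adjustment whose sign says whether the query favors older or newer documents, could look like the following. The weight parameter, the linear form and the function name are all hypothetical.

```python
def age_adjusted_scores(results, weight=0.1):
    """Adjust scores by each document's age relative to the result-set average.

    results: list of (score, age_days) pairs.
    weight > 0 favors older-than-average documents, weight < 0 favors newer ones.
    """
    if not results:
        return []
    mean_age = sum(age for _, age in results) / len(results)
    if mean_age == 0:
        return [score for score, _ in results]
    return [
        score * (1 + weight * (age - mean_age) / mean_age)
        for score, age in results
    ]


# Two hypothetical results with equal base scores but different ages.
print(age_adjusted_scores([(1.0, 100), (1.0, 300)], weight=0.1))
```

In this toy case the 300-day-old page ends up slightly ahead of the 100-day-old one; flipping the sign of the weight reverses the order.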

Content Updates/Changes

Google says that a "document's content changes over time may be used to generate/alter a score associated with that document." They again offer a formula for calculating this:

U=f(UF, UA)

f = a function, such as a sum or weighted sum
UF = update frequency score that represents how often a document (or page) is updated
UA = update amount score that represents how much the document (or page) has changed over time

Google notes that UA can also be determined as:

  • The number of "new" or unique pages associated with a document over a period of time
  • The ratio of the number of new or unique pages associated with a document over a period of time versus the total number of pages associated with that document
  • The amount that the document is updated over one or more periods of time (e.g., n % of a document's visible content may change over a period t (e.g., last m months)), which might be an average value
  • The amount that the document (or page) has changed in one or more periods of time (e.g., within the last x days)
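
As a rough illustration, here is U computed as a weighted sum, with UA taken from the second bullet above (the ratio of new pages to total pages). The equal weights and the 0-to-1 scaling of UF are my assumptions; the patent only says f may be "a sum or weighted sum".

```python
def update_amount(new_pages, total_pages):
    """UA as the ratio of new pages to all pages associated with a document."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return new_pages / total_pages


def update_score(uf, ua, w_freq=0.5, w_amount=0.5):
    """U = f(UF, UA), with f chosen here as a weighted sum (weights are hypothetical)."""
    return w_freq * uf + w_amount * ua


# Hypothetical site: 20 new pages out of 200, with an update frequency score of 0.8.
ua = update_amount(new_pages=20, total_pages=200)
u = update_score(uf=0.8, ua=ua)
print(round(ua, 2), round(u, 2))
```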

UA could also have different pieces of the content weighted differently, helping to eliminate changes that are cosmetic or insubstantial. Google mentions:

  • JavaScript
  • Comments
  • Advertisements
  • Navigational elements
  • Boilerplate material
  • Date/time tags

They also identify some important areas where content changes might necessitate greater weight:

  • Title
  • Anchor text of forward links
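
The idea of discounting cosmetic changes can be sketched as a weighted average over page regions. The region names loosely follow the two lists above ("body" is an extra region I added), and the numeric weights and per-region change fractions are invented purely for illustration.

```python
# Hypothetical importance weights per page region (not from the patent).
REGION_WEIGHTS = {
    "title": 3.0,
    "forward_anchor_text": 2.0,
    "body": 1.0,
    "navigation": 0.1,
    "advertisements": 0.0,
    "date_time_tags": 0.0,
}


def weighted_update_amount(changed_fraction):
    """Weighted average of how much each region changed.

    changed_fraction maps a region name to the fraction of it that changed (0..1).
    Regions missing from REGION_WEIGHTS default to a weight of 1.0.
    """
    total_weight = sum(REGION_WEIGHTS.get(region, 1.0) for region in changed_fraction)
    if total_weight == 0:
        return 0.0
    weighted = sum(
        REGION_WEIGHTS.get(region, 1.0) * fraction
        for region, fraction in changed_fraction.items()
    )
    return weighted / total_weight


# A footer date bump vs. a rewritten title plus some new body copy.
cosmetic = weighted_update_amount({"date_time_tags": 1.0, "body": 0.0, "title": 0.0})
substantial = weighted_update_amount({"date_time_tags": 1.0, "body": 0.2, "title": 1.0})
print(cosmetic, round(substantial, 2))
```

The date-only change contributes nothing to the score, while the title rewrite dominates it, which is the behavior the patent appears to be after.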

Google also mentions the use of trend analysis in the changes of a site/page by comparing an acceleration or deceleration of the rate of change (amount of new content, etc.). Google notes that maintaining all of this information may be too intensive for practical data storage and proposes measuring only large changes and storing "term vectors" only or "a small portion" of a page "determined to be important".

The patent notes that Google may, on occasion, prefer stale documents for certain types of queries. They may also create an average age of change and adjust the scoring for documents based on their relation to the average (if more stale or more fresh content is desired).

Query Analysis

This technique describes several phenomena that can influence rankings:

  • Clicks on a site/page in the SERPs can be used to rank it higher or lower - those clicked more often move higher in the rankings (so make sure your title & description are good)
  • If a particular search term is increasingly associated with particular subjects, the pages on those subjects would rank higher for that query. For example, the meaning of the word "soap" was increasingly associated with Simple Object Access Protocol, rather than a cleansing agent, so pages on those subjects rose in the results.
  • The number of search results for a particular term is measured to check for "hot topics" or "breaking news" to help Google follow or become aware of trends. An example might be the recent Tsunami in East Asia, where thousands of pages popped up overnight on the subject.
  • Google also measures search queries whose answers or relevance change over time. They use the example of "World Series Champion", which would be different after each baseball season.
  • "Staleness" can be a deciding factor in the rankings. Google will use user clicks and traffic to decide if "stale" results are relevant for a particular query or not and rank accordingly. Google says it measures "staleness" by:
    • Creation Date
    • Anchor Growth
    • Traffic
    • Content Changes
    • Forward/Back link growth

Link Based Criteria

Google can measure various linking based factors including:

  • The dates new links appear to a site/page
  • Dates that links, or pages linking to a site/page, disappeared
  • The time-varying behavior of links to a page and any possible "trends" that are indicated by this, i.e. is the site gaining links overall or losing them? A downward trend might indicate "staleness", while an upward trend would indicate "freshness".
  • Google may check the number of new links to a document over a given time period compared to the new links the document has received since it was first found. They'll also use the "oldest age of the most recent y% of links compared to the age of the first link found."
  • Google gives an example in the patent of two websites that were both found 100 days ago:
    • Site #1 - 10% of the links were found less than 10 days ago
    • Site #2 - 0% of the links were found less than 10 days ago
    • This data might be used to "predict if a particular distribution signifies a particular type of site (e.g., a site that is no longer updated, increasing or decreasing in popularity, superceded, etc.)"
  • Freshness weights assigned to a link can also be used to rank sites/pages. Several factors can influence link freshness:
    • Date of appearance
    • Date of change of anchor text
    • Date of change of the page the link is on
    • Date of appearance of page the link is on
    • Google says they theorize that a page that is updated (significantly) while the link remains the same is a good indicator of a "relevant and good" link.
  • Other weights for links include:
    • How trusted the links are (they specifically mention government documents as being assigned higher trust)
    • How authoritative the websites and pages linking to the page are
    • Freshness of the page/site - they mention the Yahoo! homepage as one where links frequently appear and disappear.
    • The "sum of the weight of the links" pointing to a page/site may be used to raise or lower the scoring in the rankings. Google will measure the freshness of the page based on the freshness of the links to it and the freshness of the pages which the links are on.
    • Age distribution over time will also be measured, i.e. a site/page will be compared against all of its links over time and when it received them.
  • Google may use link date appearance to "detect spam", "where owners of documents or their colleagues create links to their own document for the purpose of boosting the score assigned by a search engine". Google says that legitimate sites/pages "attract back links slowly" and that a "large spike in the quantity of back links" may signal either a "topical phenomenon" or "attempts to spam a search engine."
    • Google gives the example of the CDC website after the outbreak of SARS as an example of a "topical phenomenon".
    • Google gives 3 examples of link spam techniques - "exchanging links", "purchasing links" or "gaining links from documents without editorial discretion on making links".
    • Google also gives examples of "documents that give links without editorial discretion" - including guest books, referrer logs and "free for all pages that let anyone add a link to a document."
  • A decrease over time in the number of links a document has can be used to indicate irrelevance, and Google notes that it will discount the links from these "stale" documents.
  • The "dynamic-ness" of links will also be measured and scored, based on how consistently links are given to a particular page. They use the example of "featured link" of the day and note that they'll use a page score based on the pages that link to the page, "for all versions of the documents within a window of time."
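
The Site #1 / Site #2 comparison above boils down to asking what share of a site's known links is recent. A minimal sketch, assuming we know the age in days of each link at the time of measurement (the function name and the sample ages are hypothetical):

```python
def recent_link_share(link_ages_days, recent_days=10):
    """Fraction of links first seen less than recent_days ago."""
    if not link_ages_days:
        return 0.0
    recent = sum(1 for age in link_ages_days if age < recent_days)
    return recent / len(link_ages_days)


# Two hypothetical sites, both first found 100 days ago.
site_1 = [95, 80, 70, 60, 50, 40, 30, 20, 15, 5]   # one of ten links is under 10 days old
site_2 = [98, 90, 85, 75, 60, 55, 45, 30, 25, 12]  # no link is under 10 days old

print(recent_link_share(site_1), recent_link_share(site_2))
```

A share of zero over a long stretch would fit the "no longer updated" profile, while a very large share on a brand new site is the kind of spike the spam bullet describes. Where Google draws those lines is not disclosed.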

Anchor Text

Google can use anchor text measurements to determine ranking scores:

  • Anchor text changes over time might be used to indicate "an update or change of focus" on a site/page.
  • Anchor text that is no longer relevant or on-topic with the site/page it links to may be tracked and discounted if necessary. Large document changes will result in Google checking the anchor text to see if the subject matter is still the same as the anchor text.
  • Freshness of anchor text can be calculated. It can be determined by:
    • Date of appearance/change of the anchor text
    • Date of appearance/change of the linked to page
    • Date of appearance/change of the page with the link on it
    • Google notes that the date of appearance/change of the page with the link on it makes the link and anchor text more "relevant and good"

Traffic

Google can measure traffic levels to a page/site as part of their ranking scores.

  • A "large reduction in traffic may indicate that a document may be stale"
  • Google may compare the average traffic for a page/site over the past "j days" (as an example j=30) to the average traffic over the last year to see if the page/site is still as relevant for the query.
  • Google might also use seasonality to help determine if a particular site is more/less relevant for a query during specific times of the year.
  • Google is going to measure "advertising traffic" for websites:
    • "The extent to and rate at which advertisements are presented or updated by a given document over time"
    • The "quality of the advertisers". They note that referrers like Amazon.com will be given more trust and weight than a "pornographic site's" advertisements.
    • The "click-through rate" of the traffic referrals from the pages the ads are on.
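
To make the "j days" comparison concrete, here is a minimal sketch of my own. The patent application gives no formula for this, so the function names, the 0.5 threshold and the sample numbers are all assumptions for illustration only:

```python
from statistics import mean

def traffic_ratio(daily_visits, j=30):
    """Mean visits over the last j days divided by the mean over the whole series."""
    if j <= 0 or len(daily_visits) < j:
        raise ValueError("need at least j days of data")
    recent = mean(daily_visits[-j:])
    baseline = mean(daily_visits)
    return recent / baseline if baseline else 0.0

def looks_stale(daily_visits, j=30, threshold=0.5):
    """Flag a page whose recent traffic is well below its long-run average.

    The 0.5 threshold is a made-up value, not something the patent specifies.
    """
    return traffic_ratio(daily_visits, j) < threshold

# Hypothetical data: steady traffic for most of a year, then a sharp drop.
history = [1000] * 335 + [200] * 30
print(looks_stale(history))
```

A seasonal site would need its own baseline (for example, the same month last year) rather than the flat yearly average used here.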

User Behavior

Google may be measuring "aggregate user behavior". This can include:

  • The "number of times that a document is selected from a set of search results"
  • The "amount of time one or more users spend accessing the document"
  • The relative "amount of time" compared to an average that users spend on a particular site/page
    • Google uses an example of a swimming schedule page that users typically spent 30 seconds accessing, but have recently spent "a few seconds" accessing.
    • Google says this can be an indication for them that the page "contains an outdated swimming schedule" and they will push down its rank.
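
The swimming schedule example can be sketched as a simple comparison of recent time-on-page against the typical value. This is only my illustration: the function names and the 50% cutoff are assumptions, and I have taken "a few seconds" to mean 3 seconds.

```python
def dwell_time_drop(typical_seconds, recent_seconds):
    """Fractional drop in time spent on a page versus its typical value (0.0 = no drop)."""
    if typical_seconds <= 0:
        raise ValueError("typical_seconds must be positive")
    return max(0.0, (typical_seconds - recent_seconds) / typical_seconds)

def may_be_outdated(typical_seconds, recent_seconds, max_drop=0.5):
    """True when visitors now leave much sooner than they used to.

    max_drop is a hypothetical cutoff; the patent does not give one.
    """
    return dwell_time_drop(typical_seconds, recent_seconds) > max_drop

# Swimming schedule page: 30 seconds typical, roughly 3 seconds recently.
print(may_be_outdated(30, 3))
```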

Domain-Related Information

Information associated with a domain can be used by Google to score sites in the rankings. They mention specific types of "information relating to how a document is hosted within a computer network (e.g., the Internet, an intranet, etc.)" including:

  • Doorway and "throwaway" domains - Google says they will use "information regarding the legitimacy of the domains"
  • Valuable domains, according to Google, "are often paid for several years in advance", while the throwaway domains "rarely are used for more than a year."
  • The DNS records will also be checked to determine legitimacy:
    • Who registered the domain
    • Admin & technical addresses and contacts
    • Address of name servers
    • Stability of data (and host company) vs. high number of changes
  • Google claims they will use "a list of known-bad contact information, name servers, and/or IP addresses" to predict whether a spammer is running the domain.
  • Google will also use information regarding a specific name server in similar ways:
    • "A "good" name server may have a mix of different domains from different registrars and have a history of hosting those domains, while a "bad" name server might host mainly pornography or doorway domains, domains with commercial words (a common indicator of spam), or primarily bulk domains from a single registrar, or might be brand new"
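
As a rough sketch of how these domain signals might be combined, the snippet below checks a registration record against a "known-bad" list and a minimum registration length. Everything here is hypothetical: the record layout, the flag names, the one-year cutoff and the example hostnames are my own assumptions, not anything stated in the patent.

```python
from datetime import date

def domain_flags(record, today, bad_name_servers, bad_contacts, min_years_ahead=1.0):
    """Return a list of warning flags for a (hypothetical) domain registration record."""
    flags = []
    # Throwaway domains are rarely paid for far in advance.
    years_ahead = (record["expires"] - today).days / 365.25
    if years_ahead < min_years_ahead:
        flags.append("short_registration")
    # Any overlap with the known-bad name server list.
    if set(record["name_servers"]) & bad_name_servers:
        flags.append("known_bad_name_server")
    if record["contact"] in bad_contacts:
        flags.append("known_bad_contact")
    return flags

# Made-up example data.
today = date(2005, 11, 30)
bad_ns = {"ns1.bulk.example"}
bad_contacts = {"spam@throwaway.example"}
suspect = {"expires": date(2006, 6, 1),
           "name_servers": ["ns1.bulk.example"],
           "contact": "owner@site.example"}
print(domain_flags(suspect, today, bad_ns, bad_contacts))
```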

Ranking History

Google can measure the history of where a site ranked over time and data associated with this. Some specifics include:

  • A site that "jumps in rankings across many queries might be a topical document or it could signal an attempt to spam search engine"
  • The "quantity or rate that a document moves in rankings over a period of time might be used to influence future scores"
  • Sites can be weighted according to their position in the results, where the top result receives a higher score and the lower sites receive progressively lower scores. Google uses the equation:
    • [((N+1)-SLOT)/N]
    • Where N=the number of search results measured and SLOT equals the ranking position of the measured site
    • In this equation, the 1st result receives a score of 1.0 and the last result receives a score of 1/N, which gets close to 0 as N grows.
  • Google could check "commercial queries" specifically and documents that gained X% in the rankings "may be flagged or the percentage growth in ranking may be used" to determine if the "likelihood of spam is higher".
  • Google may also monitor:
    • "The rate at which (a site/page) is selected as a search result over time"
    • Seasonality - fluctuations based on the time of month or year
    • Burstiness - Sudden gains or losses in clicks
    • Other patterns in CTR
  • The rate of change in scores can be measured over time to see if a search term is getting more/less competitive and additional attention is needed.
  • Google "may monitor the ranks of documents over time to detect sudden spikes in the ranks". This could indicate, according to the patent, "either a topical phenomenon (e.g., a hot topic) or an attempt to spam search engine"
  • Google may use preventative measures against spam by:
    • "Employing hysteresis to allow a rank to grow at a certain rate" - hysteresis in this instance probably means a pull that results in the growth rate falling. The term has dozens of unique definitions.
    • Limiting the "maximum threshold of growth over a predefined window of time" for a given site/page.
    • Google will also "consider mentions of the document in news articles, discussion groups, etc. on the theory that spam documents will not be mentioned"
  • Certain types of sites/pages (Google specifically mentions "government documents, web directories (e.g., Yahoo), and documents that have shown a relatively steady and high rank over time") may be immune to the "spike" tracking and penalization
  • Google may also "consider significant drops in ranks of documents as an indication that these documents are "out of favor" or outdated"
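
The position-weighting equation [((N+1)-SLOT)/N] is easy to work through in code. The first function below follows that equation directly. The second is only my guess at what a "maximum threshold of growth" could look like in practice: its name and the 25% default are assumptions, not something the patent spells out.

```python
def slot_weight(slot, n):
    """Weight for a result at position `slot` out of `n` results: ((N+1)-SLOT)/N."""
    if n < 1 or not 1 <= slot <= n:
        raise ValueError("slot must be between 1 and n")
    return ((n + 1) - slot) / n

def capped_score(previous, proposed, max_growth=0.25):
    """Limit how much a score may rise within one time window.

    Hypothetical illustration of a growth cap; drops are passed through unchanged.
    """
    ceiling = previous * (1 + max_growth)
    return min(proposed, ceiling)

# Weights for a 10-result page: the top slot scores 1.0, the last scores 1/10.
weights = [slot_weight(slot, 10) for slot in range(1, 11)]
print(weights[0], weights[-1])
```

With only 10 results the last slot still scores 0.1, so the "close to 0" description really applies when N is large.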

User Maintained/Generated Data

Google wants to measure many different types of aggregate data that users keep on their computers about their web visits and experiences, including:

  • Bookmarks & Favorites lists in the browser
    • They want to obtain this data either via a "browse assistant" - like the toolbar or desktop search, or
    • Directly via the browser itself - I predict they are developing their own Google Browser.
    • Google will use this data over time to predict how valuable a particular site or page is
  • Google also wants to document additions and removals from favorites & bookmarks over time to help predict the value of a site/page
  • Google will also measure how often users access the site/page from their browser to see if it is still relevant, or just a leftover ("outdated" or "unpopular")
  • The "temp or cache files associated with users could be monitored" by Google to identify their visiting patterns on the web and determine whether there is "an upward or downward trend in interest" in a given site/page.

Unique Words, Bigrams, Phrases in Anchor Text

Google intends to measure the profile of how anchor text appears over time to a particular site/page to watch for spam. They note that "naturally developed web graphs typically involve independent decisions. Synthetically generated web graphs, which are usually indicative of an intent to spam, are based on coordinated decisions". The difference in patterns can be measured and put to use to block spam.

Google notes that the "spikiness" of "anchor words/bigrams/phrases" is a prime measurement. They note that spam typically shows "the addition of a large number of identical anchors from many documents".
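
One simple way to picture "a large number of identical anchors" is to look at how much of a batch of new links shares the same anchor text. The sketch below is my own illustration. The function names, the 20-link minimum and the 80% share cutoff are all assumed values, and a real system would presumably look at bigrams and phrases as well as whole anchors.

```python
from collections import Counter

def top_anchor_share(anchors):
    """Return (most common anchor text, its share of all anchors), case-insensitive."""
    if not anchors:
        return None, 0.0
    normalized = [a.strip().lower() for a in anchors]
    text, count = Counter(normalized).most_common(1)[0]
    return text, count / len(normalized)

def looks_coordinated(anchors, min_links=20, max_share=0.8):
    """Flag a batch of new links where one anchor text dominates.

    min_links and max_share are hypothetical thresholds.
    """
    _, share = top_anchor_share(anchors)
    return len(anchors) >= min_links and share >= max_share

# Made-up examples: varied, independently chosen anchors vs. one repeated phrase.
organic = ["Acme", "acme widgets", "click here", "www.acme.example", "Acme Inc"] * 5
synthetic = ["cheap widgets"] * 45 + ["Acme"] * 5
print(looks_coordinated(organic), looks_coordinated(synthetic))
```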

Linkage of Independent Peers

Google can also use link data from "independent peers (e.g., unrelated documents)" to check for spam. They say that a "sudden growth in the number of independent peers... with a large number of links... may indicate a potentially synthetic web graph, which is an indicator of an attempt to spam." Google notes that this "indication may be strengthened if the growth corresponds to anchor text that is unusually coherent or discordant" and that they can discount the value of these links either by a "fixed amount" or a "multiplicative factor" - this would give an additional penalty just for having these links.
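
The difference between a "fixed amount" and a "multiplicative factor" discount is easiest to see with numbers. The two helpers below are a minimal sketch. The names and sample values are mine, and the patent does not say how large either discount would be.

```python
def discount_fixed(link_score, amount):
    """Subtract a fixed amount from a link's score, never going below zero."""
    return max(0.0, link_score - amount)

def discount_multiplicative(link_score, factor):
    """Scale a link's score by a factor between 0 and 1."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be between 0 and 1")
    return link_score * factor

# Hypothetical scores: a fixed discount wipes out weak links entirely,
# while a multiplicative one shrinks every link in proportion.
for score in (1.0, 0.2):
    print(score, discount_fixed(score, 0.3), discount_multiplicative(score, 0.5))
```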

Document Topics

Topic extraction can be performed by Google through the following methods:

  • Categorization
  • URL analysis
  • Content analysis
  • Clustering
  • Summarization
  • A set of unique low frequency words

The goal is to "monitor the topic(s) of a document over time and use this information for scoring purposes."

Google notes that "a spike in the number of topics could indicate spam" or that significant document topic changes may indicate that the website "has changed owners and previous document indicators, such as score, anchor text, etc., are no longer reliable." Google says that "if one or more of these situations are detected, (they) may reduce the relative score of such documents and/or the links, anchor text, or other data" from the website.
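
A significant topic change could be measured, for example, by comparing the set of topics extracted from a document at two points in time. The sketch below uses Jaccard overlap, which is my choice of measure and not one named in the patent. The function names and the 0.2 cutoff are likewise assumptions.

```python
def topic_overlap(before, after):
    """Jaccard overlap between two topic sets: 1.0 = identical, 0.0 = nothing shared."""
    before, after = set(before), set(after)
    if not before and not after:
        return 1.0
    return len(before & after) / len(before | after)

def topics_changed(before, after, min_overlap=0.2):
    """True when too few topics survive between snapshots (hypothetical cutoff)."""
    return topic_overlap(before, after) < min_overlap

# Made-up snapshots of the same site a year apart.
print(topics_changed({"swimming", "pools"}, {"swimming", "pools", "lessons"}))
print(topics_changed({"swimming", "pools"}, {"casino", "poker"}))
```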


List of Additional Coverage & Resources

  1. The patent from US Patent and Trademark Office - US Patent #20050071741 - Information retrieval based on historical data
  2. From SEOChat Forums - Information Retrieval Based on Historical Data - Sandbox Explanation, Aging Delay?
  3. From Threadwatch - Google's War on SEO - Documented
  4. From SearchEngineWatch Forums - Does New Google Patent Validate Sandbox Theory?
  5. From HighRankings Forum - New Google Patent, Must Read
  6. From SERoundtable - Sandbox Explained by Google? "Information retrieval based on historical data"



Courtesy of SEOMoz
