Advanced Pay-Per-Click Strategies


Wednesday, November 30, 2005

Pay-per-click has grown substantially more competitive and expensive over the past 24 months. Whereas competitive bids once commanded $0.35-$0.60 per click, they now range from $1.00-$5.00. These increases are due in part to the recognition by companies big and small of the ROI on this form of exceptionally targeted advertising. No other paid method of inclusion provides such a well-chosen audience, and measuring conversion & traffic down to the exact point of profit requires only a few steps. In order to compete in this market, search advertisers must use intelligent strategies to beat their competition. Attracting eyeballs and clicks to your ads while armed with the best that tracking technology has to offer is the key to using PPC successfully.

Writing Great PPC Ads

There is no 'secret' to writing great PPC ads. It is often trial and error that creates the best ad. However, common sense and a deep understanding of your customer base can help to start off on the right foot. Before writing your PPC ads, you should be able to cogently answer each of the following:

  • What are the demographics of your target market - geography, income, gender, age, etc.?

    The demographics will tell you what messages your audience will be receptive to - e.g. don't use the same writing style for urban males aged 20-30 that you would for rural women aged 40+.
  • Are your buyers new to your product, or are you selling familiar goods/services?

    If your products are unfamiliar, you will have to do a little explaining in your ad - tell them what problem your product fixes or why they need it. If your products are well known and are being directly searched for, use more nuance and sizzle to generate click-throughs.
  • Is your company well established? Do you have a reputation?

    Depending on the degree of branding you want to achieve and your current standing, it may benefit you to reinforce why consumers should buy from your firm, or simply push the product - i.e. Amazon doesn't need to say anything about themselves; consumers already know who they are.
  • Are you selling on price, quality or intangible values?

    When selling on price, it's critical to mention free shipping if you have it, or sell the idea that your prices are the lowest. When using other qualities, don't forget to use words that appeal to the consumer who is willing to pay more - luxury, quality, guarantee, etc.

The best PPC ads also use something that catches the eyes of consumers - interesting or unexpected lead-ins, text that requires a second glance, etc. The key to writing this type of text is to be creative - try new words, titles and phrases that are surprising and unique. The click-through rates will quickly tell you whether your ads are working.

Using Tools & Tracking

Overture, AdWords & several of the minor PPC engines offer a high degree of accountability through their vast arrays of tracking tools. Setting up the tools properly, and knowing what to track and how it should impact the decisions you make, is critical to running the most successful PPC campaign possible. The following are critical measurements you should know in order to make the most of your PPC dollars.

  • Click-Through Rate

    Overture & AdWords both track this automatically - it tells you what percentage of the people who were exposed to your ad clicked through to your site. This statistic can be misleading, however, because it does not account for people who may have visited the page your ad was on but never seen it because it was "below the fold" - meaning below the area initially visible on the screen without scrolling. The PPC companies are not yet able to calculate how many people actually saw your ad - so skepticism is wise, particularly when your ad is not in the first 3-6 positions.
  • Conversion Rate

    Conversion Rate measures how many visitors to your site made a purchase. The advanced tracking systems from Google and Overture enable you to add very specific conversion data to your website that gets reported back in the PPC Tool admin center. However, adding your own tracking system that measures all keyword phrases and referral traffic can be even better as you can use the ideas and terms to optimize your PPC use.

Google AdWords

Google's AdWords service serves two types of ads - search match and content match. Search match ads are listed at the top and on the right-hand side of a search engine results page. Content match ads are served on third-party (read - not Google) websites, either vertically or horizontally on the page. According to Google, content match is determined by the words/phrases, etc. that appear on the page; however, many inconsistencies have been reported. If you decide to use content match from AdWords, make doubly sure that your ROI and conversion rates perform adequately.

An additional point of interest about Google's AdWords program is how to get an ad into the top positions above the search results, rather than simply on the right-hand side of the search results pages. According to a Google AdWords rep at the SearchEngineWatch Forums, the top position is achieved by earning a particularly high click-through rate, rather than by paying a particularly dominating price per click. Manual approval by an AdWords staff member is also required, although they will (supposedly) review your ad if it is performing particularly well.

Overture

It's important to remember the power Overture has - although Google has the lion's share of the search market, Overture provides results to both Yahoo! & MSN as well as many other smaller sites, like AltaVista & AlltheWeb. Google provides results at AskJeeves & AOL too - giving Overture only a slightly lower number of total eyeballs each day. But Overture's click-through rates, conversion rates & satisfaction rates have always been superior to Google's - this could be due to the intrusiveness of the ads, the difficulty in distinguishing them from unpaid results, or the better, more usable system Overture has built for its advertisers - being able to write individual ads for keywords, bid by the keyword, etc. In any case, a PPC campaign may be better off starting with Overture and adding Google, rather than the other way around.

Additional PPC Information

It's important for webmasters and PPC buyers to think of the service in very different terms than normal (organic) search engine optimization. Pay-per-click must provide a direct return on investment, and ROI is the most important factor to measure. If you're spending $1000 per day on PPC, your profits (not sales, but profits - after ALL expenses) must be greater than $1000 to make any return at all. If your production, shipping or other per-piece costs are high, it's even more important to have a good grasp of the numbers to see if PPC is paying off for your business.

On the other hand, it's also important not to discount a minimal ROI from PPC - by spending more money, you may be able to make slim margins, but compensate through quantity. PPC is a great tool for advertisers and web-based businesses, but using it properly is extremely important.

For more information on who provides what results, see Bruce Clay's sublime relationship chart.

Advanced Link Building Tactics


Link building requires dedication and consistent effort. It is the most difficult, time-consuming and important task an SEO performs to get a website ranked well. Certain strategies are well known to SEOs, while others are in the realm of the experts or forward-thinkers who puzzle them out. I have attempted to list here many of the link building tactics used by the SEO community.

  1. Directories

    Although it may seem an obvious and tired way to build links, directory use is still valuable and valid. I have compiled an exhaustive list of directories, including PR, pages that have been spidered, and submission requirements - see the List of SEO Directories. When submitting to directories, make sure to vary anchor text and try to use your keywords in the description and title fields as naturally as possible. A good SEO should be able to obtain between 75 and 150 unique backlinks from directories alone.
  2. Top Competition

    This tactic can help you generate the most visitors to your site. Search Yahoo!, Google, MSN and the other search engines for each of your important keyword phrases and make a point of attempting to get a link from each page. Oftentimes a direct phone call to site owners or a very tactful email can go a long way. I've even used handwritten letters to attempt to cajole site owners into giving me a link. Use whatever resources you have to help persuade - links from other sites you own, money (if you have it), free SEO services, contributions of articles or content, etc.

    When I use this technique, I'll typically make a spreadsheet listing all of the sites I have a shot at getting a link from (i.e. excluding direct, obvious competitors). I then get as much contact information as I can about each one and run down the list. This can be very time-consuming, but also exceptionally worthwhile. Even if your site doesn't get ranked much higher, each link from a site that ranks in the top 20 will bring direct, targeted traffic.
  3. Link Searches

    This old technique has been honed to a science by many SEOs, but there is always room to improve. The goal is to use Google/Yahoo! to search for pages that let you add a link (reciprocal, paid, etc.) and that already have your keyword or some closely related content on them. Often these searches turn up small-time site operators with reciprocal link directories managed by 3rd-party software. Don't overlook these, and make sure you submit. But the real prize is pages that give free links or offer paid links to sites they feel are important - many of the highest PR links can be obtained through this method, especially if you conduct the search at SEOChat's PR Search Tool.

    The following searches are worthy of being tried, but feel free to experiment:
    1. intitle:add+url "keyword phrase"
    2. intitle:submit+site "keyword phrase"
    3. intitle:submit+url "keyword phrase"
    4. intitle:add+site "keyword phrase"
    5. intitle:add+your+site "keyword phrase"
    6. intitle:directory "keyword phrase"
    7. intitle:list "keyword phrase"
    8. intitle:sites "keyword phrase"

    All of the above searches can be used without the intitle: operator to give even more results. As with other link building processes, these are tedious and time-consuming, but it is critical to stay focused on finding quality, relevant links. When using this method, make sure you note which other sites the page links to and what text content is listed. If it fits well with your content, don't worry if the page is PR0.
  4. Usurping Competitors' Links

    Although this may sound unseemly, it is one of the most effective ways to build links, and is commonly practiced. The goal is to find as many domains and pages as possible that link to your competitors' sites and get those sites to link to you as well. This is often easier than it sounds, because the site is already linking out to people in your industry. Money, services, links or simple requests are all used to get the job done.

    The most effective ways to find your competitors' links are listed below:
    1. At Yahoo! type linkdomain:url.com -site:url.com
    2. At MSN's techpreview type link:url.com -site:url.com
    3. At Yahoo! type link:http://www.url.com
    4. At Google type "url.com" -site:url.com

    These are typically the most reliable ways to find the sites and pages that link to your competitors and, incidentally, those which link to you. Remember to be as courteous, direct and friendly as possible. Don't forget how effective non-electronic methods of communication can be - phone, letters, etc.

  5. Article Writing & Submission

    Writing unique, valuable content about your subject matter for your own site is the most critical piece of on-site SEO. But it can also be a fantastic way to generate quality links from other sites. Many existing websites are designed around hosting the content of others and providing links back to them. I have listed some of the biggest 3rd-party article hosting sites below.

    General Articles - These cover a wide range of subjects and will accept submissions on many topics:

    These are just some of the thousands of sites that offer the ability to spread your content around the web. Additional sites on your specific topics can be found using the search engines. Especially prevalent are sites offering content in website marketing & technology - there are hundreds of sites that deal in content on these issues.

  6. Forums & Online Communities

    Generating links through forums is often considered spam, but forums are valuable tools for link builders as long as the proper methods are employed. The forum signature of a member is certainly a fair place to promote your site, and while doing so may not provide a worthy backlink in the search engines' eyes, it does offer visitors on the forum a chance to see your site.

    The primary goal of joining a forum should be to contribute and learn from the community, but promotion certainly isn't out of place, as long as it remains on-topic. In SEO forums, people constantly plug their new directories, tools, articles, etc. These are all appropriate as long as they are subtle, relevant and not overdone.

    When a specific page, tool or link would provide a great resource or answer to a question in a community, don't hesitate to provide it. Much of the information used to write this article and the SEO Survey tool was provided in forums through links. Just remember to carefully read the terms of use for the forums you post in - many are harsh if they feel someone is there to spam or advertise.
  7. Blogging & Comments

    Blogging is a very effective way to get natural inbound links, and by having a personal blog at your site, you can more easily make commentary at other sites that link back to you. DO NOT spam blog comments with advertisements; your goal is to create an online reputation you can be proud of. Blog owners and visitors will not take kindly to blog spam. The key to effective blogging for links is to make sure you comment relevantly & intelligently about the specific subject in the blog. A well thought-out response to a relevant blog or article can bring not only search engines, but more importantly other readers of the blog to your site through the link. If your content proves interesting, you could build much more than just a single link through your comments.

    Another important thing to remember is that your link in blog comments may not count at all in the search engines. Many blogs prevent access to these comments or don't provide direct, spiderable links for blog visitors. This, combined with the fact that these comments are buried at least 2-3 links deep from the main page means that you should keep your focus on the visitors you might generate, rather than the search engine optimization you might get.

    However, you should still be as SE-friendly as possible. If the subject matter deals directly with your keyword phrases, you're in fine shape, but if it is on a related subject, it can't hurt to use your keyword phrase(s) once or twice in a longer comment/entry. You should also try to make your 'name' - which is typically the anchor text for your link - contain at least one of your keyword phrases, but remember to keep it varied and never let it overtake the importance of getting visitors, i.e. if your site is about Orange County T-Shirts, having a 'name' of OC-Tshirt-Man can't hurt you, but Orange-County-T-Shirts is probably not as wise.
  8. Renting Pages from Authority Sites

    This tactic is not often mentioned, but can be extremely valuable. The premise is to rent, for a monthly fee, a page hosted on a larger or more authoritative site in your industry. You then add content & links on the page to your own site(s) with the anchor text you choose and in the format you select. These are almost like 'advertorials' in magazines, where you provide content that the site links to (hopefully not more than 2 links from the homepage) and then link back to your own site.

    The fundamental difference between this tactic and simple article hosting is that you approach a site or company that doesn't normally host other people's content about hosting your information. Although rejection is often likely, the benefits can be phenomenal, as the page itself may rank well for your search terms very quickly.
  9. Purchasing Online Advertising

    Link building through advertising is exceptionally expensive, but it can bring in good returns if properly managed. Make sure as you buy links that you get the click-through rate from the seller for the past 6 months. You should get a lot of other statistics to review as well - make sure to avoid advertising in places where the seller doesn't have good information on click-through rates, visitors, etc. You'll also have to work around re-direct scripts, links sent through ad networks (like DoubleClick), etc. As long as you are up front and friendly, you'll be able to have some success in this arena as long as you have the funding.
  10. Text Link Brokers

    Although initial thoughts about link brokers can give a bad impression, this new industry can actually have a positive impact on your link building efforts. I recommend against purchasing sitewide links unless your only goal is to build PR (i.e. directories or SEO-related sites, etc.). Sitewide links have come to be associated by the search engines with "manipulation" of the SERPs and should be largely avoided. However, purchasing text links on specific, related sites, surrounded by the content of your choice, can be beneficial, if difficult to find.

    The key to a successful text link is that all the components of natural SEO are there:
    • Is the page closely related to your site's topic?
    • Is the link placed in a strategic position - i.e. not in an obvious advertising space or footer?
    • Is the site related to the topic of your site and the topic of the page?
    • Is the site spidered frequently by the search engines and do the pages have visible PR?

    If the link meets these criteria, it could be very worthwhile. However, be careful to avoid paying high prices for a particular page or site simply because of its PR. Multiple low-PR (3, 4 & 5) pages are likely worth more to you than a single PR7 or 8.

    Several lists of text link sellers are available online:

    Text Link Brokers - Discussion and some links at SearchEngineWatch Forums

    Be wary, and always do some research and ask around at the major SEO forums prior to buying from text link brokers. The industry is, sadly, rife with scammers & companies that don't deliver as promised.

  11. Unique Tools & Services

    By offering specific, relevant free Internet tools/services at your site, you can generate naturally built links. My recommendation is to find a tool or service that isn't currently offered or that you believe you can improve upon. For some industries, this will be much harder than others, but creativity is your best tool. Imagine what automated web service, list of resources or submission-type tool would best benefit you and then build it and offer it on your site.

    Some excellent examples are listed below (these come straight from my imagination):
    • For a site about animals, a tool that visually diagrams the phylum/genus/species relationships
    • For a site about travel, a tool that creates a pre-planned itinerary based on a selection of interests
    • For a site about real estate, a tool that estimates property value or gives a % chance of selling based on time, region & price.

    The possibilities are endless, and there is little doubt that the web can automate and make easier hundreds of tasks in any industry. The key, after you have built your tool, is to promote it with industry leaders - send it to bloggers who discuss your subject matter, ask people to use and evaluate it on forums, etc. You can even try to promote it by writing an article about your findings, development or the simple existence of the tool and submitting it across the newswires.

  12. Automated Link Building Programs

    Many SEOs feel that automated link building is a surefire way to get banned/penalized by the search engines. If you are caught practicing this tactic, I would certainly agree. However, for those willing to take risks, the rewards can be substantial. Well-built and well hidden link networks often bring higher rankings in a short amount of time. The key is to select a provider who will offer the following:
    1. A vastly distributed network that is not interlinked and doesn't share any C-Block IP addresses
    2. Linking pages that contain some relevant content that has been well-linked through a sitemap system
    3. Pages that are as close to naturally-designed and built as possible
    4. A company that can speak at an expert level about SEO and the tactics they use to avoid detection - call them and talk to them first; their knowledge and competence should be instantly apparent.
    5. Someone you trust and have researched thoroughly - I DO NOT recommend posting on forums about these companies, as you could damage your chances for success. Just ask for recommendations from trusted members of the SEO community.
    6. Pages that link not only to your site(s) but to other relevant, related sites as well - even if they are your composition. That way, if you get penalized, at least your competition is suffering as well.

    This is one of the riskiest strategies that can be pursued in link building and I recommend it only with extreme caution. You are breaking the search engines' published rules by "intentionally manipulating the search results" and you should have a contingency plan in the event of disaster.

  13. Donations & Charity

    Many high PR websites are run by charity organizations, open-source technology groups & the like. Contributions of time, money or resources in exchange for links are commonplace and very worthwhile. For this type of link building, relevancy is still important, but you may not be able to find any link partners in this segment if you focus solely on topic-sensitive sites & pages. It may be very worthwhile to donate regardless and see if the links from PR7 & 8 pages can help to boost your importance.

    In order to find sites like these, I use the PR Search tool at SEOChat to search for terms/phrases like donate, contribute, give, etc. Appending your broadest search terms to these types of words can help you find the sites most likely to pass you a good link. Remember that even if non-profits, charities, etc. don't have a link program in place, they may be more than happy to provide a special link for you if you request it when contributing. If nothing else, these links and the time/money you spend getting them are a great way to sleep better at night.
  14. Press Releases

    Press releases written by an SEO firm can contain a carefully worded title, well placed keywords and a link to the site that is the subject of the article. SEOs have been using press release services for some time, and as with many other things, the more you spend, the better the distribution. Press releases can also rank well on their own, particularly in the first few days/weeks of release as they receive the 'fresh content' boost in the SERPs. A particularly interesting or well-written press release may even garner the attention of some larger players in the news market, and if your article is picked up by one or several good sources, it can serve as a great boon to traffic and exposure.

    Success relies on the careful crafting of the article itself - providing excellent, factual information in a way that is interesting to read and relevant to the industry. Hiring a journalism student for $100-$200 an article or a more experienced professional for more may well be worth the money. In any case, the press release needs to be sent through a distribution site. The following are popular in the SEO industry because they offer links, serve to a large variety of sites and provide enough flexibility to let an SEO perform well.

    The top two sites listed are the largest, with the greatest distribution networks, but each site offers its own unique benefits and some are less expensive than others. It is worthwhile, especially if you have the money and are in a competitive arena, to opt for the maximum spend at PRWeb - somewhere around $250 per article. The distribution that is achieved will determine the value of the links and the article. It's also important to keep in mind that duplicate content penalties may hurt you - DO NOT re-publish the press release on your own site.

  15. Natural Link Building

    Although undoubtedly a difficult and time consuming method, as well as one of the more unreliable ones, natural link building is what powers most of the top sites on the Internet. This method involves developing the most useful, relevant content that provides the best resources, tools, services, prices, etc. in the industry. By offering the web community the best possible site, you can gain natural links through the power of having others on the web link to you.

    Sadly, this tactic is somewhat undermined by the search engines, as newer sites often fare exceptionally poorly in the search results, especially for popular terms. In order to build natural links, people must be exposed to your site. At one time natural link building was the very best method to get traffic and links, but in the current state of search and traffic, it is more of a boost that you use to help convert some of the webmasters & bloggers who might visit your site into link builders.

    Building valuable content is no easy task, and building an industry-leading resource is even more challenging. However, the long-term benefits can be spectacular, assuming you have the money and time to maintain your resource without proportional income. Remember to ask yourself what someone who comes to your site might want and try to provide the absolute best service/information in your industry in the cleanest, most usable site possible.

Guide to Applying 301 Redirects in Apache


seomoz.org was hosted under www.socengine.com/seo/ rather than as its own domain. We were moving seomoz.org to its own dedicated server and wanted it to be accessed as its own domain rather than as a subdirectory of socengine.com. We needed visitors accessing anything in www.socengine.com/seo/ to be redirected to www.seomoz.org. The redirection had to accommodate several file and folder name changes and had to be done with 301 redirects in order to be search engine friendly and to ensure compatibility across all web browsers. We also needed to forward http://seomoz.org to http://www.seomoz.org for aesthetic purposes and also to avoid a 301 sabotage.

Solution:

The simplest approach would have been to add 301 redirects to the PHP code that powers SEOmoz.org using PHP's header function. By utilizing the pattern-matching power of Apache (mod_alias's RedirectMatch directive and the mod_rewrite module), however, we could match specific patterns for entire folders and redirect them to their new URLs without having to go through every PHP script. Also, several of our pages were static HTML, and it was not practical to use JavaScript or META tags for redirection.

Installation:

If your web server does not have mod_rewrite installed, I suggest reading over the Apache documentation for installing modules. It will usually require you to recompile Apache with the option --enable-module=rewrite or --enable-module=most.

If your hosting service does not support mod_rewrite, I would urge you to ask your systems administrator to have it installed. Most Apache installations will have mod_rewrite installed by default. Our server is running FreeBSD, and mod_rewrite was included by default when installing from the ports collection. Once it is installed, you can verify that it is working by adding this line to your Apache configuration file or your .htaccess file:

RewriteEngine On

Context

The mod_rewrite module operates in per-server context or in per-directory context.

The per-server context requires you to edit the Apache configuration file, httpd.conf, while the per-directory context uses .htaccess files that exist in each folder you want to configure. If you do not have access to httpd.conf, you will have to use .htaccess files.
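
For example, enabling the rewrite engine in per-directory context is simply a matter of creating an .htaccess file in the folder you want to affect (the path below is illustrative):

# /www/socengine/seo/.htaccess (illustrative path)
RewriteEngine On
# RewriteCond and RewriteRule directives for this folder and its subfolders go below this line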

Regular Expressions (aka Regexes)

From wikipedia.org:

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns.

We will be using regular expressions to match patterns in the client URL and redirect them accordingly. Regular expressions are an invaluable skill to learn, both as a programmer and as a systems administrator. To redirect URLs according to the examples in this document, you will only have to understand the basics of using regexes. This is a list of the characters and operators you will use in the regexes described in this document (a short combined example follows the list):

  • . Period - matches any single character.
  • * Asterisk - matches zero or more of the preceding character
  • + Plus sign - matches one or more of the preceding character
  • ( ) Parentheses - enclosing a value in parentheses stores what was matched in a variable for later use. This is also referred to as a back-reference.
  • (value1|value2) - Enclosing two or more values in parentheses and separating them with a pipe character is the equivalent of saying: "match value1 OR value2."
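
As a quick combined illustration (the folder names here are invented for the example), the following pattern matches either of two folders and captures both the folder name and the rest of the path as back-references:

RedirectMatch 301 /seo/(articles|guides)/(.*) http://www.seomoz.org/$1/$2

A request for http://www.socengine.com/seo/guides/somefile.php would thus be redirected to http://www.seomoz.org/guides/somefile.php.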

Redirecting specific files and folders from one domain to another.

We needed redirection from the old server to the new one with the filenames preserved.

Example

Redirect: http://www.socengine.com/seo/somefile.php
To: http://www.seomoz.org/somefile.php

Solution

Add the following directive:
RedirectMatch 301 /seo/(.*) http://www.seomoz.org/$1

Explanation:

The regular expression /seo/(.*) tells Apache to match the seo folder followed by zero or more of any character. Surrounding the .* in parentheses tells Apache to save the matched string as a back-reference. This back-reference is then placed at the end of the URL we are redirecting to - in our case, $1.

Redirecting without preserving the filename

Several files that existed on the old server were no longer present on the new server. Instead of preserving the file names in the redirection (which would result in a 404 Not Found error on the new server), the old files just needed to be redirected to the root URL of the new domain.

Redirect: http://www.socengine.com/seo/someoldfile.php
To: http://www.seomoz.org/

Solution:

Add the following directive:
RedirectMatch 301 /seo/someoldfile.php http://www.seomoz.org

Explanation:

Because the pattern contains no parentheses (and therefore no back-reference), all requests for /seo/someoldfile.php redirect to the root URL, http://www.seomoz.org/.
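
The same approach works for retiring an entire folder. With no parentheses in the pattern, everything under the folder (the folder name below is invented for the example) is sent to the new root:

RedirectMatch 301 /seo/oldtools/.* http://www.seomoz.org/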

Redirecting the GET string

Some of our PHP scripts had different names but the GET string stayed the same. We needed to redirect the visitors to the new PHP scripts while preserving these GET strings. The GET string is the set of characters that come after a filename in the URL and are used to pass data to a web page. An example of a GET string in the URL http://www.seomoz.org/myfile.php?this=that&foo=bar would be “?this=that&foo=bar.”

Redirect: http://www.socengine.com/seo/categorydetail.php?CAT_ID=12345
To: http://www.seomoz.org/artcat.php?CAT_ID=12345

Solution:

Add the following directive:
RedirectMatch 301 /seo/categorydetail.php(.*) http://www.seomoz.org/artcat.php$1

Explanation:

Once again, the regular expression (.*) tells Apache to match zero or more of any character and save it as the back-reference $1. Since $1 is appended to /artcat.php in the target, the GET string is carried over to the new PHP file.

Redirecting while changing file extensions

We had a folder of files on the old server that were mixed HTML and PHP. On the new server these were all PHP and we needed the old HTML files to change to this new extension.

Redirect: http://www.socengine.com/seo/guide/anyfile.html
To: http://www.seomoz.org/articles/anyfile.php

Redirect: http://www.socengine.com/seo/guide/anyfile2.php
To: http://www.seomoz.org/articles/anyfile2.php

Solution:

Add the following directive:
RedirectMatch 301 /seo/guide/(.*)\.(php|html) http://www.seomoz.org/articles/$1.php

Explanation:

(.*) matches zero or more of any character and saves it as the back-reference $1. \.(php|html) tells Apache to match a period followed by either "php" or "html" and saves that as the back-reference $2 (although we won't be using it in this example). Notice we had to escape the period with a backslash. This ensures Apache does not interpret the period as meaning "any character" but rather as an actual period. Enclosing "php" and "html" in parentheses and separating them with a pipe "|" character means to match either one of the values. So if the pattern were (php|html|css|js|jpg|gif), the regex would match any file with the extension php, html, css, js, jpg, or gif.

Also, if for some reason we needed to preserve the extension we matched, it would be available as the back-reference $2. Back-references are numbered according to how many sets of parentheses appear in the regular expression.
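
For instance, if we had wanted to keep each file's original extension rather than forcing .php, the second back-reference could be reused in the target (a variation we did not need in our case):

RedirectMatch 301 /seo/guide/(.*)\.(php|html) http://www.seomoz.org/articles/$1.$2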

Redirecting canonical hostnames

We needed to redirect any requests that do not start with www.seomoz.org to make sure they include the www. We did this not only because it looks better, but to avoid the now common 301 sabotage.

Redirect: http://seomoz.org/
To: http://www.seomoz.org/

Redirect: http://mail.seomoz.org/
To: http://www.seomoz.org

Redirect: http://seomoz.org/somefile.php
To: http://www.seomoz.org/somefile.php

Solution:

Add the following directive:
RewriteCond %{HTTP_HOST} !^www\.seomoz\.org
RewriteRule ^/(.*) http://www.seomoz.org/$1 [R=301,L]

Explanation:

This directive tells Apache to examine the host the visitor is accessing (in this case: seomoz.org) and, if it does not equal www.seomoz.org, redirect them to www.seomoz.org. The exclamation point (!) in front of www.seomoz.org negates the comparison, saying "if the host IS NOT www.seomoz.org, then perform the RewriteRule." In our case, the RewriteRule redirects them to www.seomoz.org while preserving the exact file they were accessing in a back-reference.
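
One caveat: if you place these directives in an .htaccess file (per-directory context) rather than in httpd.conf, mod_rewrite strips the leading slash before matching, so the pattern is normally written without it:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.seomoz\.org
RewriteRule ^(.*)$ http://www.seomoz.org/$1 [R=301,L]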

Conclusion:

By harnessing the power of mod_rewrite and a little regular expression magic, we can develop a set of simple rules for redirecting web pages. By using 301 redirects we maintain browser compatibility and stay search engine friendly. If you are interested in learning more, I recommend reading a few of the many regular expression tutorials found on the Internet. O'Reilly also has a great book, "Mastering Regular Expressions", that I highly recommend. I read the book in its entirety many years ago and what I learned from it has proved invaluable. I would also read up on mod_rewrite in the URL rewriting guide and the mod_rewrite reference documentation found at the Apache Software Foundation website.

2005 Analysis of Google's Sandbox


Google's infamous and arguably mis-titled "Sandbox Effect" has been an observed phenomenon since early 2004. Although many continue to argue and debate the causes and effects of this unusual algorithmic element, there is virtually no debate about its existence. At one time, the best explanation of the sandbox was:

"The penalty or devaluation in the Google SERPs of sites with SEO efforts begun after March of 2004."

However, with time, observation and refinement, a new, more detailed and accurate definition can be made:

"The observed phenomenon of a site whose rankings in the Google SERPs are vastly, negatively disparate from its rank in other search engines (including Yahoo!, MSN & Teoma) and in Google's own allin: results for the same queries."

This penalization system is known to be unpredictable and particularly difficult to analyze or understand due to the ways in which it behaves. This article attempts to sum up the experience of many SEOs and websites in the field that have fallen under the effect of the sandbox. I have had the particular privilege of analyzing many dozens of sites affected by the filter thanks (primarily) to e-mail contacts with users ranking particularly high on the sandbox detection tool here at SEOmoz. Although I cannot reveal most of these sources by URL or name, the observed effects should be familiar to many in the SEO business who have started optimization of new websites since March of 2004.

List of Observed "Sandbox" Phenomena

The Sandbox Effect has been noted to affect many unique aspects of rankings in the SERPs. This list is a collection of the most commonly mentioned and obvious factors weighing into the observations.

The Sandbox is known to affect...

  1. ...entire top-level domains, rather than simply web pages, directories or sub-domains.
  2. ...a higher percentage of newly registered (post-2003) websites than those registered prior to that year. There are, however, several examples that show exceptions to this rule.
  3. ...most commonly, those websites which have had search optimization tactics performed on them, specifically on-page optimization of text, titles, meta data, etc. and external link building efforts. There are exceptions to this rule as well, most often from sites which have received a high number of external links over a short period of time.
  4. ...websites primarily in the English language. While reports exist of sandbox-like factors affecting some other languages, it is noticeably absent particularly from Italian & Dutch language websites targeting searches at Google.it and Google.nl.
  5. ...rankings for all levels of difficulty. Despite rumors that the sandbox only targets highly competitive keyword phrases, the most heavily "boxed" sites I've reviewed could not rank successfully for even the most non-competitive terms. Several sites even had unique terms found on virtually no other sites prior to the existence of the domain, and yet were outranked by pages at other sites mentioning or linking to them.
  6. ...rankings only at the Google search engine. Sites that are most heavily penalized will often be ranking in the top 3 results at the other major search engines (Yahoo!, MSN, AskJeeves), yet will be ranked in the 100's or worse at Google.
  7. ...rankings only in standard SERPs. A search using allinanchor:, allintext:, allintitle:, or allinurl: will return the site in its 'normal' position. This effect is also perceived when using the "-asdf trick", where the search phrase is followed by 16-20 negative modifiers such as -asdf. See an example here.
  8. ...low quality, affiliate, spam and sites carrying AdSense more often than those without these features. This could very well be the intended result of the system, and would therefore be only natural. However, these sites certainly are not universally affected, nor are they alone in being affected - the most prominent examples of sandboxed sites are often purely "white-hat", natural, organic sites.
  9. ...commercial and private sector sites only. There has never been a reported case of a .gov, .mil, .edu or other official use TLD affected by the sandbox.
  10. ...rankings for anywhere from 1 month to more than 1 year. Examples have been shown to me of sites that seem to 'never' escape the sandbox, though these sites are often of the "low quality" variety described above.

The sandbox has also been observed to typically release sites into "normal" rankings en masse, which is to say that there have been virtually no examples of a single site "escaping" by itself. It appears that certain updates in Google's search engine release many sites all at once. Speculation about this centers around Google wishing to avoid the appearance of manually reviewing sites one by one, although other reasons have been proposed as well.

Technological Explanations for the Sandbox

Several theories have evolved over time to explain how Google flags websites to be put into the sandbox, and why the effect is not universal. The following are either extremely popular, or have stood up to most evidence and appear to be logical explanations:

Over-Optimization Flagging

Many suspect that Google initially identifies websites to be "sandboxed" by analyzing commonly optimized components like the backlink structure, on-page stuffing of keyword terms or phrases, and the rate at which inbound links come to the site. There can be little doubt that through careful analysis, Google has an excellent idea of what natural text and link structures look like, a very good idea of what they look like for spam sites, and can thus distinguish between the two. Many unique criteria have been mentioned with regard to what can trigger these situations:

  1. Rate of Inbound Links

    As mentioned in the recent Google patent, the rate at which new links to a website or page are found can be measured and compared against historical data to determine whether the page/site has become particularly relevant or whether this is an effort to spam. The key to using this data would be comparing how a popular page that is picked up by the blogosphere or news websites differs from a page that simply has purchased or created thousands of links in order to manipulate the search engines. There are examples that appear to have fallen afoul of this portion of the sandbox filter despite using purely natural techniques.
  2. Over-Optimized Anchor Text

    Similar to the rate of inbound links above, Google also has a very good idea of what constitutes natural anchor text across the structure of many dozens, hundreds or thousands of links. When these structures appear over-optimized, which is to say particularly focused on specific commercial phrases or terms, it is suspected that this can trigger "sandboxing".
  3. On-Page Over Optimization

    Keyword Stuffing or over-targeting of particular terms on the page or across the site has been named as a possible culprit for being sandboxed. This particular rationale is often used to explain why so many 'SEOd' sites get filtered, while many non-optimized sites do not (although there are plenty of exceptions on both sides).

Commercial Keyword Targeting

There have been suggestions, although these have largely lacked solid evidence, that by targeting specific, commercial search terms, your website may be more likely to fall afoul of the filter. There have been so many examples of sites that have been filtered despite targeting non-commercial phrases and largely non-competitive terms that my personal opinion is the sandbox does not discriminate based on the targeted terms/phrases.

Natural Text Analysis

Many patents and white papers have been written by the major search engines on the subject of analyzing and differentiating naturally, human written text from computer-aided or generated text, commonly used by spam websites. This has motivated many to believe that Google is conducting this analysis with an eye towards "sandboxing" non-natural text. Luckily, for those SEOs who write their own content, this should be a relatively easy problem to work around, as false positives when conducting automated text analysis are highly unlikely. Sadly, experience has shown that many sites whose text is entirely human-written, never duplicated and of generally high quality still experience the sandbox phenomenon. My personal opinion on this issue is that text analysis of any kind is not to blame for the sandbox, though low quality text could pre-dispose a site to penalties or make it more difficult to "escape" the sandbox.

Manual Review

Thanks to Henk Van Ess' Search Bistro and his exposé therein of Eval.Google, the theory has skyrocketed that Google manually reviews new websites which gain any significant number of links, are sent a certain quantity of visitors, or trigger some other set of parameters. This belief may not be unfounded: a Google representative at SES NYC, Craig Manning, pointed out that Google will indeed review sites like ChristopherReeve.org or Tsunami.Blogspot.com to check whether the large number of links and high rankings they have achieved are indeed warranted. Craig noted that this was a way to keep low quality sites from ranking well via link-bombing techniques.

Many in the industry, however, reject the manual review idea behind the sandbox as being too convenient and not efficient enough for Google. They point to Google's overarching theme of their technology as being completely fair and automated, which would certainly preclude human judgments from affecting the search landscape as completely as the sandbox phenomenon does.

Manual review is certainly an explanation that fits all the facts: it accounts for the inconsistencies in the application of the sandbox, the widely varying escape times, and even the proclivity of sites exhibiting common SEO traits (optimized pages, links, etc.) to be more susceptible to penalization.

Major Myths, Red Herrings & Exceptions

For every rule, there is an exception and with Google's sandbox, where few hard rules exist, this is even more true. However, it's important to point out certain exemptions and factors that are outside the normal understanding of the phenomenon. A short list follows:

  • Escaping the Sandbox in only a few weeks - I have never directly observed this phenomenon, but it has been reported to me once, and appeared in the forums another time. For the site I have knowledge of, I can say that it was of very high quality in terms of design, content and usability. I have no way of knowing whether these factors influenced its quick "escape".
  • Extensions - It had been reported that .org websites were less vulnerable to sandboxing than .com, .net or other domains (.info, .tv, etc.). However, based on my experience, and the direct experience of SEOmoz.org, I can say with near certainty that this is not an accurate representation of reality. The only domains never observed to be affected are .mil (military), .gov (government) and .edu (educational).
  • "Trusted Links" to Escape - Rumors have abounded that specific links from places like DMOZ, .gov or .edu websites, or even major news sites like CNN, Reuters or the AP can "release" a site from the sandbox. While these types of links can correlate with a high quality site (occasionally an indicator that the sandbox stint will be shorter), they do not dictate an immediate release. I've personally seen several sites obtain links like these and remain in the "box" for several months afterwards.
  • Have an "in" at Google - Myths have been circulating the SEO forums that suggest having a relationship with "someone" at Google can get you un-sandboxed. I believe this to be largely false, with the singular exception of a colleague who showed his website to Matt Cutts specifically after a session in NYC and had it promptly un-"boxed" about 2 weeks after the show. Whether this is a direct result of the conversation or simple coincidence, I cannot be sure. I do admit to eavesdropping while waiting in line to speak with Matt (I couldn't help it!). My guess is that Google wrongly penalized the site and rectified the mistake. I'm guessing the $1500 ticket to attend an SES conference was a bargain for the site owner.

Possible Solutions & Suggestions

Although many suggestions have emerged as to how to prevent entry of a new site into Google's sandbox, very few have panned out. Several, such as the recommendation to use a subdomain of an existing domain, have met with mixed success, while others, including the "don't get links" advice, are self-defeating. The best pieces of advice I've seen, combined with lessons from the new sites that have dodged the sandbox, are listed below:

  1. Target "Topical Phenomena" & a Non-Commercial Audience - If you know that you're building a great website that's going to earn a lot of links, the #1 piece of advice I can give is to target your site towards non-profit/educational at first. I'd also highly recommend targeting something newsworthy and interesting to a massive audience. For example, if you're creating a site on real estate in Boston, start with a news/blog site that tracks trends and information about the real estate market, rather than pushing a single service. Offer a tool like Google Maps integrated with the MLS or Craigslist real estate listings or another great piece of information that would be likely to earn lots of natural incoming links.

    The idea behind this strategy is to legitimize the link gain you'll experience at the beginning of the site and, with some luck, avoid the sandbox by being a "topical phenomenon". This strategy is difficult and takes not only hard work and dedication, but out-of-the-box thinking. However, if you're shooting for high traffic and high links quickly, this is the best way to dodge the sandbox.
  2. Build Natural Links & Avoid Getting Blogrolled - One of the most common elements suspected of sandboxing completely "natural" sites is their addition to blogrolls. These links are sitewides on URLs that frequently have many thousands of pages in Google's index and it appears on the surface that they can cause the link problems that lead to sandboxing. The best way to avoid this is to watch your logs for referring URLs and request to be removed from any blogrolls that are sent to you. With some luck, the sympathetic blogger will understand and remove you. It seems ridiculous to have to go to these lengths to avoid sandboxing, but in the commercial reality of the web, it may, in fact, help you in both the short and long run. Naturally, if you aren't running a blog on your site, it's much easier to "stay off the rolls", but you also miss inclusion in great blog directories and traffic sources that can earn you high quality links (i.e. Technorati, Blogwise, etc).
  3. Get Noticed in the News - Although this is exceptionally difficult, being picked up by major news services and syndicated to online news portals and newspapers is a great way to avoid the sandbox. It seems that the legitimacy of these link sources can actually help to actively prevent sandboxing. This effect has been noted to me on 2 occasions, and fits with the "topical phenomena" theory.
  4. Build Exceptional Quality Sites - This tip seems highly suspicious, but it also fits the facts. The sites that have been "escaping" Google's filter are often those that are outstanding sources of information for their target group, offer top-notch usability and information architecture, and employ the highest level of professional design. If your site looks like a Fortune 500 website, you're on the right track. It could be simple coincidence that these types of sites have not been "boxed", or it could be the manual review system in play. In either case, this method of site building isn't just good for the sandbox, it also will build links quickly, achieve consumer trust and be a more profitable venture overall. There's simply no reason not to attempt this.
  5. Don't Rely on Google's Traffic - If you know that your site will likely be sandboxed, you can always opt to dodge Google entirely and instead shoot for traffic from other sources. The best way to do this is to target highly or moderately competitive terms at Yahoo!, MSN & Ask that get thousands of searches each day at those engines. Although your link building and site building will require great strength to compete, it's much faster and easier to target these engines with a good site, than to go after Google. You can also look for alternative traffic sources like ads on the top SERPs at Google for the terms, traffic from alternative search sources like Wikipedia, Technorati or topical communities (blogs & forums). If you opt for this methodology, get creative, but don't get sloppy. You still want to build smart and naturally - after all, the Google sandbox is finite in length, and eventually, you will want to "escape".

Predictions for the Future and Conclusions

The Google Sandbox's continued existence and impact on the SERPs is difficult for those outside of Google to gauge. There are those who would argue that Google has gotten both more and less spammy in its SERPs, and equivalent camps on both sides of the increased/decreased relevancy debate. What has emerged from the last 18 months of study, observation and testing is that the Sandbox is almost certainly designed to help reduce the amount of spam and the manipulation of links to boost rankings in Google's search engine.

As such, it would appear that the best way to avoid the penalty is to avoid techniques common to spam and manipulation. Sadly, since Google is erring on the side of less spam, many legitimate websites are being wiped out of the SERPs for long periods. It's important both to avoid over-diagnosis of the Sandbox and to be aware of the filter's qualities to make identification easier. Although many in the search community advise "sitting back and waiting", I personally do not approve of this approach. While it is important not to obsess over Google's rankings during a sandbox period, it's also important to experiment, grow and push your site and brand. "Sitting back and waiting" is never good advice in the web promotion space.

The future likely holds more of the same from Google and the sandbox. Despite webmasters' frustration, my opinion is that Google's engineers and quality raters are pleased with the success of the filter and are not planning to remove it in the near future. For the long term, however, I predict that much more sophisticated spam filters and link analysis techniques will emerge to replace the sandbox. The current sloppiness of the filter means that many websites that Google would like to have in their results are being caught improperly, and filtration is on a constant evolution at the search engines.

Several people have also predicted that Yahoo! or MSN may adopt similar techniques to help stop spam. This could seriously undermine new SEO/SEM campaigns, but it is a possibility. My recommendation is not to discount this possibility, and to launch projects - or at least holding sites and their promotional efforts - ASAP. The web environment right now is still relatively friendly to new sites, but it will certainly become more competitive and unforgiving with time, no matter what search engine filters exist.

Additional Resources & Tools:

Courtesy of SEOMoz

Google's Patent: Information Retrieval Based on Historical Data


This report has been prepared to help SEOs understand the concepts and practical applications contained in Google's US Patent Application #20050071741 - Information Retrieval Based on Historical Data. My own advice and interpretation is offered throughout this paper - please conduct your own research before acting on the recommendations.


Overview of the 5 Most Critical Concepts from this Paper

These 5 concepts are what I believe to be the most ground-breaking and important for search engine optimization professionals to understand in order to best conduct their work.

1. Google's Concept of "Document Inception"

The date of "document inception", which can refer to either a website as a whole or a single page is used in many different areas by Google. This data can come from the registration info, the date Google first found a link to the site/page or the site/page itself. Google will be using this data to rank documents and establish credibility and relevance.

2. How Changing Content can Affect Rankings

Changing content over time has a huge impact on Google's measures, according to this patent. Google uses changes to determine the "freshness" or "staleness" of websites and pages, and how that data impacts the value of the links on the page as well as its rankings. They'll also measure large, "real" content changes vs. superfluous changes and rank based on that data.

Google also says that for some types of queries, particular results are more valuable - stale results may be desirable for information that doesn't need updating, fresh content is good for results that require it, seasonal results may pop up or down in the rankings based on the time of month/year, etc.

3. Spam Detection & Punishment

Google is employing many new systems of spam detection and prevention according to the patent. These include:

  • Watching for sites that rise in the rankings too quickly
  • Watching for registration information, IP addresses, name servers, hosts, etc. that are on their "bad list"
  • Growth of off-topic links
  • Speed of link gain
  • Percentage of similar anchor text
  • Topic/Subject shifts or additions
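
As a toy illustration of one of these signals, the Python sketch below computes the share of inbound links that reuse the same anchor text. The function name, the normalization, and the idea of flagging above a threshold are my own assumptions, not anything specified in the patent:

    from collections import Counter

    def similar_anchor_text_pct(anchor_texts):
        # Share of inbound links that use the most common anchor text.
        # A very high percentage is one of the spam signals listed above;
        # Google gives no thresholds, so any cutoff is purely illustrative.
        if not anchor_texts:
            return 0.0
        counts = Counter(a.lower().strip() for a in anchor_texts)
        return counts.most_common(1)[0][1] / len(anchor_texts)

    # Example: similar_anchor_text_pct(["cheap widgets"] * 95 + ["Acme Inc"] * 5) -> 0.95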

4. What Google is Attempting to Measure

Google wants to measure, or is actively attempting to measure, each of the following:

  • Domain information
    • Registration date
    • Length of renewal (10 years, 5 years, 1 year, etc)
    • Addresses and Names of admin & technical contacts
    • DNS Records
    • Address of Name Servers
    • Hosting Location & Company
    • Stability of this data
  • Information on User Behavior Online
    • CTR (Click-Through Rate) of individual results in the SERPs
    • Length of time spent on a given site/page
  • Data contained on your computer
    • Favorites/Bookmarks List
    • Cache & Temp Files
    • Frequency of visits to particular sites/pages (history)

5. The Impact of this Patent

I believe that this patent will help to verify most of the theories surrounding Google's rankings. There has been speculation at the major SEO forums over the past 18-24 months on nearly every subject covered in this patent, and this document serves as verification.

Although it is long, I urge every SEO/webmaster to read this page completely. I have attempted to make the information legible and readable, pulling out only the parts that are important to the active practice of SEO (which was, surprisingly, almost 2/3 of the document). If you have any questions or corrections on this summary, please send me an email.


Analysis & Interpretation of the 63 Patent Components

History Data

1. Documents may be scored in Google's rankings based on "one or more types of history data".

Inception Date

2. The "inception date" read - registration date - may be considered as a scoring factor (I assume that older will be considered better, but this is not spelled out).

3. Google may determine how old each of the pages on a given website is and then determine the average age of pages on the website as a whole. The difference between a specific page's age and the average age of all documents on the site will be used in the ranking score.

4. The score for a website may include the amount of time since "document inception" - i.e. how old the website is.

5. One methodology of discovering site age might include when Google first "discovered" (read: spidered) the site, when Google first finds a link to the site, and when the site contains a "predetermined number of pages". I interpret this to mean that Google has some kind of threshold for site size (number of pages) that, when reached, triggers a scoring effect (probably positive).

Frequency of Document Changes over Time

6. Google's scoring will (according to the patent) be based on "determining a frequency at which the content changes over time".

7. The "frequency at which the content changes" will be determined by the average time between changes, the number of changes over a particular time period, and the rate of change of one time period vs. the rate of change for another time period. So, if you are updating your website every day, then switch to updating once a week, your scoring in the historical measurements at Google will shift.

8. Scoring will also include how much of the site has changed over a given time period (new pages, changes, etc.).

9. The scoring based on changes (described in #8) will be determined by the number of new pages within a time period, the ratio of new pages vs. old pages and the total "percentage of the content of the document that has changed during a timed period."

10. The scoring of changes (from #8) will be based on the "perceived importance of the portions" that have been changed. The score will also take into account the changes as compared to the weighting(s) of each of the different pages of the site - i.e. if important pages change, it will have a different impact than if unimportant pages changed. My guess is that importance is mostly determined by links (both internal and external) that point to a given page. So if your contact page changes, it's not a big deal, but if your home page changes, that's a bigger deal.

11. The scoring for a "plurality of documents" - many pages in a given website - includes determining the last date of change for each page, determining the average date of change, and scoring the documents based on, "at least in part", the difference between a specific page's change and the average document's change. So, if one page had new information added, it would be scored differently than the other pages, while if all the pages changed together (maybe a new date, or new link or copyright in the footer, etc.), they would all be equal (since their date of change compared to the average is the same).

Amount of Changes over Time

12. Google's score may also include a measure of the amount of content which changes over time on the given website.

13. The "amount of content changes" from #11 will be determined by the ratio of new pages vs. the total number of pages on the site, and the percentage of content change over a given time period.

14. The "changes over a given time" from #12 will be scored based on "weighting different portions of the content differently based on a perceived importance" - once again, I read this as internal and external links to a page - the more links, the more "perceived importance".

Click-Through Rate Data

15. The "history data" from #1 could include information on "how often the document is selected when the document is included in a set of search results". This is literally tracking clickthroughs and rewarding those sites with higher CTR - just like AdSense does. Google will be scoring based on the "extent to which the document is selected over time... when included in a set of search results". We always assumed this to be true, but this is the first hard evidence I've seen directly from the horse's mouth.

16. Google may assign a "higher score" when the document is selected more often. No-brainer.

Document Association to Search Terms

17. Google might be scoring based on "determining whether a document (that has been showing up in the search results) is associated with the search terms".

Queries that Remain the Same but have New Meanings over Time

18. Google (according to the patent) calculates whether the "information relating to queries" remains the same or changes, and scores documents based on this. For example, prior to September 11, the phrase "9-11" would not have been associated with terrorism; afterwards, it would be. Google will score documents based on changes in the results for a given query to keep up with the times.

Staleness of Documents

19. The "staleness of documents" might be calculated as part of Google's scoring.

20. Google may also determine whether "stale documents" are preferable for certain types of queries (those that don't change over time, or for which a specific, single answer is what's necessary).

21. The "favorability" of stale documents may be determined by how often they are clicked on in the search results (over other documents). I relate this to a Wikipedia article on the nature of volcanoes - it doesn't need too much updating and will be a good relevant source for a long time for the query - "nature of volcanoes".

Link Behavior

22. History data scores might also consider the "behavior of links over time".

23. The appearance and disappearance of links figure into the scoring for link behavior (from #22).

24. The appearance/disappearance of links is dated by Google and used in the scoring.

25. The link appearances/disappearances are monitored and Google measures "how many links... appear or disappear during a time period, and whether there is a trend" toward more links or fewer links. The temporal (time-based) nature of groups of links will be scored by Google.

Freshness of Links

26. Google may use the "freshness of links" and assign weights to links based on freshness.

27. The "freshness" of a link (from #26) is calculated by the date of appearance of that link, the date of any changes in the link or anchor text, the date which the page and site that the link is from appeared and the date of the links to that linking page. So, if you have a new blog entry that points to a new site, the freshness will be super-fresh, since the page is new, the link to the page is new, the blog page that links to it is new, and the link to your blog entry on your own site is new (that's a lot of new, hence it's super-fresh).

28. The weight of a link also takes into account how "trusted" the site is, how authoritative the page with the link on it is and how "fresh" the page & site containing the link are.

29. The scoring also takes into account the "age distribution associated with the links based on the ages of the links". Google will take into account the age of the links to your page, and the time periods over which you got the links, i.e. lots of new links, a wide distribution over time, most links from a long time ago, etc.

Anchor Text Changes over Time

30. Google may also calculate changes in anchor text over time and use this data to score. My guess is that anchor text doesn't change very often, but they're certainly free to measure it.

Content Changes in a Document compared to Linking Anchor Text

31. Google might also measure if the content of a document changes, but the anchor text remains the same, or vice-versa. They're trying to protect against the anchor text "bait and switch" that makes a document look relevant to the anchor text, then replaces it with something else.

Freshness of Anchor Text

32. Freshness of anchor text can be considered.

33. Freshness of anchor text is calculated by "date of appearance", "date of change", and the dates of change and appearance of the page the link is on.

Traffic Characteristics of Site/Page

34. Traffic characteristics associated with a page/site may be taken into account in scoring.

35. The traffic pattern will have associated analysis that might feed into Google's score. So Google must be measuring traffic to a site/page and determining if, over time, it increases, decreases, etc. - they're seeking trends on which to base scoring.

User Behavior

36. User behavior regarding a particular page/site may figure into the scoring.

37. Google says that user behavior (from #36) is basically just the percentage of the time users click on a site/page when it is listed in the search results pages, along with the amount of time that users spend "accessing the document". I guess we all need to keep up the amount of time people spend on our sites.

Domain Related Information

38. The scoring might also include the sites associated with a given site and the "domain-related" information. This is defined in greater detail below.

39. Associated sites (from #38) are measured in terms of "legitimacy", which I interpret to mean non-spam, different owner, etc. Google says, specifically "scoring the document based... on whether the domain associated with the document is legitimate."

40. The "expiration date of the domain", the "domain name server record" and the "name server associated with the domain" are all parts of how Google will establish the legitimacy of an "associated" site.

Prior Rankings Data

41. History data scores could also take into account "information relating to a prior ranking". This means Google will be storing information about previous rankings for a site and using them to base scores on.

42. Google may also calculate where the site sat in previous rankings and how it moved around, using these as pieces of the scoring data.

43. In reference to #41, Google is using seasonal, "burstiness" and changes in scores over time as metrics to calculate the prior rankings scoring. So if a site is particularly relevant for "gifts for girlfriend" around Valentine's Day, but not as much for the same query at Christmas, Google will record this information and rank accordingly.

44. Google could also, with regard to #41, record "spikes in the rank" of site/pages in the search results.

User Maintained Data

45. "User maintained data" may also be recorded and monitored for the rankings scores.

46. "User maintained data" includes; favorites lists, bookmarks, temp files and cache files of monitored users. I'm not sure how they could obtain this data without installing "Google Spyware" - perhaps in the form of desktop search or the Google toolbar.

47. Monitoring the rate at which a site/page "is added to or removed from user generated data" may be used in the scoring.

Growth Profiles of Anchor Text

48. Scores might include "growth profiles of anchor text" - Google could monitor the use of anchor text in large groups and where/when they point to different sites & pages.

Linkage of Independent Peers

49. Information "relating to linkage of independent peers" might be added to scoring by "determining the growth in a number of independent peers that include the document". Google will basically be monitoring sites that are not in your subject category and how they link to you (I assumed they meant non-related subject peers, but they actually mean off-topic sites; see - Linkage of Independent Peers, below).

Document Topics

50. "Document topics" may be included in the scoring, this includes using "topic extraction". I assume this is determined by Google's text mining and analysis of the actual words on the page.

Identifying Relevant Documents

51. Relevance of documents to a given search query may be part of the scoring system. This is just Google's way of saying that documents about "pink dogs" will be part of those analyzed by the ranking algorithm when a user queries "pink dogs".

Plurality of History Data

52. Google might also use "means for obtaining a plurality of types of history data associated with the document" to score sites/pages. This just means that they will use a methodology that groups all of the bits of historical information into the rankings together to determine scoring.

History Component

53. "History data" can be measured by Google and used in the rankings. I'm not sure to what they're referring here - the entire quote is; "A system for scoring a document, comprising: a history component configured to obtain one or more types of history data associated with a document; and a ranking component configured to: generate a score for the document based, at least in part, on the one or more types of history data."

Ranking of Linked Documents

54. Google may be measuring the documents you link to and scoring based "on a decaying function of the age of the linkage data". So, fresher links vs. stale links will be taken into account (although whether there is a positive or negative effect associated with this is unknown).

55. For #54, Google says the "linkage data includes at least one link." So, they won't be measuring linkage data for pages with no links.

56. For #54, Google may include the anchor text in the linkage data.

57. For #54, Google says the "linkage data includes a rank based... on links and anchor text provided by one or more linking documents." Google is simply saying that linkage data includes the anchor text and other info about the links coming to a page.

58. Google can use the "longevity of the linkage data" and determine from that an adjustment of the rankings based on the changes, stability & age of the linkage data. They explain below how they score this.

59. Google will be "penalizing the ranking if the longevity indicates a short life for the linkage data and boosting the ranking if the longevity indicates a long life for the linkage data." Google is, in effect, explaining a little of what we call "sandboxing" - they're saying that the older a link is, the more value it has, while new links have relatively lower value. This doesn't completely explain the effect, as many sites rank well quickly, etc. - but, it is an explanation for the phenomenon.

60. Google can adjust scoring by penalizing for linking documents they consider "stale" over a period of time and boost scoring if the content is frequently updated. So, it's better to be linked to on a page that frequently updates its content.

61. "Link churn" may be measured (explained in #62) and scoring adjusted based on this.

62. "Link churn" is "computed as a function of an extent to which one or more links provided by the document changes over time". Once again, Google is referring to the changes in where links point, their anchor text, etc. on a given page. More changes means more "link churn".

63. "Link churn" might create a penalization if it is above a certain threshold. So, if your links are changing all the time, the link will not be as valuable. This would shut down the methods used by the popular "Traffic Power/1p" spam company.


Patent Description:

Background of the Invention:

The patent is designed for IR (Information Retrieval) systems, and specifically for the methods used to generate search results.

Description of Related Art:

This information is largely irrelevant, but one important quote is: "There are several factors that may affect the quality of the results generated by a search engine. For example, some web site producers use spamming techniques to artificially inflate their rank. Also, "stale" documents (i.e., those documents that have not been updated for a period of time and, thus, contain stale data) may be ranked higher than "fresher" documents (i.e., those documents that have been more recently updated and, thus, contain more recent data). In some particular contexts, the higher ranking stale documents degrade the search results. Thus, there remains a need to improve the quality of results generated by search engines."

Summary of the Invention:

Google says "history data associated with the documents" may be used to score them in the search results. The invention provides a "method for scoring a document" and it "may include determining the age of linkage data associated with a linked document and ranking the linked document based on a decaying function of the age of the linkage data."

Brief Description of the Drawings:

The drawings are all exceptionally simple charts showing the process for examination. A PDF with the charts at the bottom is available at http://files.bighosting.net/tr19070.pdf

Exemplary History Data:

This is the canonical and expository section of the patent description. It contains examples and explanations of many of the most important parts of this study, including detailed descriptions for many of the 63 components.

Document Inception Date

Google notes that the "date" label is used broadly and may include many time & date measurements. Google describes several of the techniques used to obtain an "inception date" and mentions that some techniques are "biased" because they can be influenced by a 3rd party.

The first technique used is when Google learns of or indexes the document - either by finding a link to the site/page, or following it. A second technique uses the registration date of the URL or the first time it was referenced in a "news article, newsgroup, mailing list" or combination of these types of documents.

The patent mentions that Google assumes that a "fairly recent inception date will not have a significant number of links from other documents." However, they say that the document's rankings can be adjusted accordingly based on how well it is doing in terms of links with consideration for its age.

Google is also wary of spam; they use the following example (which is already being quoted around the web):

"Consider the example of a document with an inception date of yesterday that is referenced by 10 back links. This document may be scored higher by (Google) than a document with an inception date of 10 years ago that is referenced by 100 back links because the rate of link growth for the former is relatively higher than the latter. While a spiky rate of growth in the number of back links may be a factor used by (Google) to score documents, it may also signal an attempt to spam search engine 125. Accordingly, in this situation, (Google) may actually lower the score of a document(s) to reduce the effect of spamming."

Google might also use the date of inception as a method for measuring the "rate at which links to the document are created". They say that "this rate can then be used to score the document, for example, giving more weight to documents to which links are generated more often."

The patent goes so far as to provide a formula for link-based score modification:

H=L/log(F+2),

H = history-adjusted link score
L = link score given to the document, which can be derived using any known link scoring technique that assigns a score to a document based on links to/from the document
F = elapsed time measured from the inception date associated with the document (or a window within this period).

The result of this formula is that on the day of inception, L will be divided by log(2), or roughly 0.301 - the equivalent of multiplying L by about 3.32. After 10 days (or 10 of whatever time unit is used), the formula will divide L by roughly 1.079, making H smaller and smaller as time goes on.
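
To see the decay in action, here is a minimal Python sketch of the formula; the base-10 logarithm (which the numbers above assume) and measuring F in days are my assumptions, since the patent specifies neither:

    import math

    def history_adjusted_link_score(link_score, days_since_inception):
        # H = L / log(F + 2), assuming a base-10 log and F measured in days.
        return link_score / math.log10(days_since_inception + 2)

    # The adjustment shrinks as the document ages:
    for days in (0, 10, 100, 365):
        print(days, round(history_adjusted_link_score(100.0, days), 1))
    # 0 -> 332.2, 10 -> 92.7, 100 -> 49.8, 365 -> 39.0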

The patent then suggests that "for some queries, older documents may be more favorable than newer ones" and that, as a result, Google may "adjust the score of a document based on the difference (in age) from the average age of the result set". This would push certain pages up or down in the rankings depending on their age and the age of their competition.

Content Updates/Changes

Google says that a "document's content changes over time may be used to generate/alter a score associated with that document." They again offer a formula for calculating this:

U=f(UF, UA)

f = a function, such as a sum or weighted sum
UF = update frequency score that represents how often a document (or page) is updated
UA = update amount score that represents how much the document (or page) has changed over time
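
A minimal sketch of one way f could be implemented as a weighted sum; the 0.5/0.5 weights are purely illustrative, since the patent only says f may be a sum or a weighted sum:

    def update_score(update_frequency, update_amount, w_freq=0.5, w_amount=0.5):
        # U = f(UF, UA), here with f as a weighted sum of the two components.
        # UF and UA are assumed to be normalized scores; the weights are invented.
        return w_freq * update_frequency + w_amount * update_amount

    # Example: a page updated often (UF = 0.8) but only slightly (UA = 0.3)
    # would get U = 0.55 with these weights.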

Google notes that UA can also be determined as:

  • The number of "new" or unique pages associated with a document over a period of time
  • The ratio of the number of new or unique pages associated with a document over a period of time versus the total number of pages associated with that document
  • The amount that the document is updated over one or more periods of time (e.g., n % of a document's visible content may change over a period t (e.g., last m months)), which might be an average value
  • The amount that the document (or page) has changed in one or more periods of time (e.g., within the last x days)

UA could also weight different pieces of the content differently, helping to eliminate changes that are cosmetic or insubstantial. Google mentions:

  • JavaScript
  • Comments
  • Advertisements
  • Navigational elements
  • Boilerplate material
  • Date/time tags

They also identify some important areas where content changes might necessitate greater weight:

  • Title
  • Anchor text of forward links

Google also mentions the use of trend analysis on the changes of a site/page, comparing acceleration or deceleration of the rate of change (amount of new content, etc.). Google notes that maintaining all of this information may be too intensive for practical data storage and proposes measuring only large changes and storing only "term vectors" or "a small portion" of a page "determined to be important".
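
To make the weighting concrete, here is a hypothetical sketch that scores a page's changed regions while discounting the cosmetic elements listed above and up-weighting the title and forward-link anchor text; every weight value is my own invention:

    # Illustrative weights only - the patent names the categories, not the values.
    REGION_WEIGHTS = {
        "title": 3.0,
        "forward_link_anchor": 2.0,
        "body": 1.0,
        "javascript": 0.0,
        "comments": 0.0,
        "advertisements": 0.0,
        "navigation": 0.0,
        "boilerplate": 0.0,
        "date_time_tags": 0.0,
    }

    def weighted_update_amount(changed_regions):
        # changed_regions: list of (region_type, fraction_changed) tuples.
        # Returns a UA-style score in which cosmetic changes count for nothing.
        return sum(REGION_WEIGHTS.get(region, 1.0) * fraction
                   for region, fraction in changed_regions)

    # Example: weighted_update_amount([("title", 1.0), ("advertisements", 0.9)]) -> 3.0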

The patent notes that Google may, on occasion, prefer stale documents for certain types of queries. They may also create an average age of change and adjust the scoring for documents based on their relation to that average (if more stale or more fresh content is desired).

Query Analysis

This technique describes several phenomena that can influence rankings:

  • Clicks on a site/page in the SERPs can be used to rank it higher or lower - those clicked more often move higher in the rankings (so make sure your title & description are good); a sketch of this adjustment follows the list.
  • If a particular search term is increasingly associated with particular subjects, the pages on those subjects would rank higher for that query. For example, the meaning of the word "soap" was increasingly associated with Simple Object Access Protocol, rather than a cleansing agent, so pages on those subjects rose in the results.
  • The number of search results for a particular term is measured to check for "hot topics" or "breaking news" to help Google follow or become aware of trends. An example might be the recent Tsunami in East Asia, where thousands of pages popped up overnight on the subject.
  • Google also measures search queries whose answers or relevance changes over time. They use the example of "World Series Champion" which would be different after each Baseball season.
  • "Staleness" can be a deciding factor in the rankings. Google will use user clicks and traffic to decide if "stale" results are relevant for a particular query or not and rank accordingly. Google says it measures "staleness" by:
    • Creation Date
    • Anchor Growth
    • Traffic
    • Content Changes
    • Forward/Back link growth
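
A hypothetical sketch of the first idea above - nudging a result up or down based on how its observed click-through rate compares with what would be expected for its position. The expected-CTR parameter and the scaling factor are assumptions of mine, not values from the patent:

    def ctr_adjusted_score(base_score, impressions, clicks,
                           expected_ctr=0.10, strength=0.5):
        # Results clicked more often than expected get a boost; results clicked
        # less often get demoted. All numbers here are illustrative.
        if impressions == 0:
            return base_score
        observed_ctr = clicks / impressions
        return base_score * (1.0 + strength * (observed_ctr - expected_ctr) / expected_ctr)

    # Example: a result with a 15% CTR where 10% was expected gets a 25% boost.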

Link Based Criteria

Google can measure various linking based factors including:

  • The dates new links appear to a site/page
  • Dates that link or pages linking to a site/page disappeared
  • The time-varying behavior of links to a page and any possible "trends" that are indicated by this, i.e. is the site gaining links overall or losing them? A downward trend might indicate "staleness", while an upward trend would indicate "freshness".
  • Google may check the number of new links to a document over a given time period compared to the new links the document has received since it was first found. They'll also use the "oldest age of the most recent y% of links compared to the age of the first link found" (a sketch of this kind of measurement follows the list).
  • Google gives an example in the patent of two websites that were both found 100 days ago:
    • Site #1 - 10% of the links were found less than 10 days ago
    • Site #2 - 0% of the links were found less than 10 days ago
    • This data might be used to "predict if a particular distribution signifies a particular type of site (e.g., a site that is no longer updated, increasing or decreasing in popularity, superceded, etc.)"
  • Freshness weights assigned to a link can also be used to rank sites/pages. Several factors can influence link freshness:
    • Date of appearance
    • Date of change of anchor text
    • Date of change of the page the link is on
    • Date of appearance of page the link is on
    • Google says they theorize that a page that is updated (significantly) while the link remains the same is a good indicator of a "relevant and good" link.
  • Other weights for links include:
    • How trusted the links are (they specifically mention government documents as being assigned higher trust)
    • How authoritative the websites and pages linking to the page are
    • Freshness of the page/site - they mention the Yahoo! homepage as one where links frequently appear and disappear.
    • The "sum of the weight of the links" pointing to a page/site may be used to raise or lower the scoring in the rankings. Google will measure the freshness of the page based on the freshness of the links to it and the freshness of the pages which the links are on.
    • Age distribution over time will also be measured, i.e. a site/page will be compared against all of its links over time and when it received them.
  • Google may use link date appearance to "detect spam", "where owners of documents or their colleagues create links to their own document for the purpose of boosting the score assigned by a search engine". Google says that legitimate sites/pages "attract back links slowly" and that a "large spike in the quantity of back links" may signal either a "topical phenomenon" or "attempts to spam a search engine."
    • Google gives the example of the CDC website after the outbreak of SARS as an example of a "topical phenomenon".
    • Google gives 3 examples of link spam techniques - "exchanging links", "purchasing links" or "gaining links from documents without editorial discretion on making links".
    • Google also gives examples of "documents that give links without editorial discretion" - including guest books, referrer logs and "free for all pages that let anyone add a link to a document."
  • A decrease over time in the number of links a document has can be used to indicate irrelevance, and Google notes that it will discount the links from these "stale" documents.
  • The "dynamic-ness" of links will also be measured and scored, based on how consistently links are given to a particular page. They use the example of "featured link" of the day and note that they'll use a page score based on the pages that link to the page, "for all versions of the documents within a window of time."

Anchor Text

Google can use anchor text measurements to determine ranking scores:

  • Anchor text changes over time might be used to indicate "an update or change of focus" on a site/page.
  • Anchor text that is no longer relevant or on-topic with the site/page it links to may be tracked and discounted if necessary. Large document changes will result in Google checking the anchor text to see if the subject matter is still the same as the anchor text.
  • Freshness of anchor text can be calculated. It can be determined by:
    • Date of appearance/change of the anchor text
    • Date of appearance/change of the linked to page
    • Date of appearance/change of the page with the link on it
    • Google notes that the date of appearance/change of the page with the link on it makes the link and anchor text more "relevant and good"

Traffic

Google can measure traffic levels to a page/site as part of their ranking scores.

  • A "large reduction in traffic may indicate that a document may be stale"
  • Google may compare the average traffic for a page/site over the past "j days" (as an example, j=30) to the average traffic over the last year to see if the page/site is still as relevant for the query (a sketch of this comparison follows the list).
  • Google might also use seasonality to help determine if a particular site is more/less relevant for a query during specific times of the year.
  • Google is going to measure "advertising traffic" for websites:
    • "The extent to and rate at which advertisements are presented or updated by a given document over time"
    • The "quality of the advertisers". They note that referrers like Amazon.com will be given more trust and weight than a "pornographic site's" advertisements.
    • The "click-through rate" of the traffic referrals from the pages the ads are on.

User Behavior

Google may be measuring "aggregate user behavior". This can include:

  • The "number of times that a document is selected from a set of search results"
  • The "amount of time one or more users spend accessing the document"
  • The relative "amount of time" compared to an average that users spend on a particular site/page
    • Google uses an example of a swimming schedule page that users typically spent 30 seconds accessing, but have recently spent "a few seconds" accessing.
    • Google says this can be an indication for them that the page "contains an outdated swimming schedule" and they will push down its rank (a small sketch of this comparison follows).
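
A toy version of that comparison; the helper name and the ratio form are mine, not the patent's:

    def dwell_time_signal(recent_visit_seconds, historical_visit_seconds):
        # Ratio of current average time-on-page to the historical average.
        # A sharp drop (well below 1.0) mirrors the swimming-schedule example:
        # users now leave quickly, so the content may be outdated.
        if not recent_visit_seconds or not historical_visit_seconds:
            return 1.0
        recent_avg = sum(recent_visit_seconds) / len(recent_visit_seconds)
        historical_avg = sum(historical_visit_seconds) / len(historical_visit_seconds)
        return recent_avg / historical_avg

    # Example: dwell_time_signal([3, 4, 2], [30, 28, 33]) -> roughly 0.1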

Domain-Related Information

Information associated with a domain can be used by Google to score sites in the rankings. They mention specific types of "information relating to how a document is hosted within a computer network (e.g., the Internet, an intranet, etc.)" including:

  • Doorway and "throwaway" domains - Google says they will use "information regarding the legitimacy of the domains"
  • Valuable domains, according to Google, "are often paid for several years in advance", while the throwaway domains "rarely are used for more than a year."
  • The DNS records will also be checked to determine legitimacy:
    • Who registered the domain
    • Admin & technical addresses and contacts
    • Address of name servers
    • Stability of data (and host company) vs. high number of changes
  • Google claims they will use "a list of known-bad contact information, name servers, and/or IP addresses" to predict whether a spammer is running the domain (a toy version of this check follows the list).
  • Google will also use information regarding a specific name server in similar ways -
    • "A "good" name server may have a mix of different domains from different registrars and have a history of hosting those domains, while a "bad" name server might host mainly pornography or doorway domains, domains with commercial words (a common indicator of spam), or primarily bulk domains from a single registrar, or might be brand new"

Ranking History

Google can measure the history of where a site ranked over time and data associated with this. Some specifics include:

  • A site that "jumps in rankings across many queries might be a topical document or it could signal an attempt to spam search engine"
  • The "quantity or rate that a document moves in rankings over a period of time might be used to influence future scores"
  • Sites can be weighted according to their position in the results, where the top result receives a higher score and the lower sites receive progressively lower scores. Google uses the equation:
    • [((N+1)-SLOT)/N]
    • Where N=the number of search results measured and SLOT equals the ranking position of the measured site
    • In this equation, the 1st result receives a score of 1.0 and the last result receives a score close to 0 (this weighting is restated in code after the list).
  • Google could check "commercial queries" specifically, and documents that gained X% in the rankings "may be flagged or the percentage growth in ranking may be used" to determine if the "likelihood of spam is higher".
  • Google may also monitor:
    • "The rate at which (a site/page) is selected as a search result over time"
    • Seasonality - fluctuations based on the time of month or year
    • Burstiness - Sudden gains or losses in clicks
    • Other patterns in CTR
  • The rate of change in scores can be measured over time to see if a search term is getting more/less competitive and additional attention is needed.
  • Google "may monitor the ranks of documents over time to detect sudden spikes in the ranks". This could indicate, according to the patent, "either a topical phenomenon (e.g., a hot topic) or an attempt to spam search engine"
  • Google may use preventative measures against spam by:
    • "Employing hysteresis to allow a rank to grow at a certain rate" - hysteresis in this instance probably means a pull that results in the growth rate falling. The terms has dozens of unique definitions.
    • Limiting the "maximum threshold of growth over a predefined window of time" for a given site/page.
    • Google will also "consider mentions of the document in news articles, discussion groups, etc. on the theory that spam documents will not be mentioned"
  • Certain types of sites/pages (Google specifically mentions "government documents, web directories (e.g., Yahoo), and documents that have shown a relatively steady and high rank over time") may be immune to the "spike" tracking and penalization
  • Google may also "consider significant drops in ranks of documents as an indication that these documents are "out of favor" or outdated"

User Maintained/Generated Data

Google wants to measure many different types of aggregate data that users keep on their computers about their web visits and experiences, including:

  • Bookmarks & Favorites lists in the browser
    • They want to obtain this data either via a "browse assistant" - like the toolbar or desktop search - or directly via the browser itself; I predict they are developing their own Google Browser.
    • Google will use this data over time to predict how valuable a particular site or page is
  • Google also wants to document additions and removals from favorites & bookmarks over time to help predict the value of a site/page
  • Google will also measure how often users access the site/page from their browser to see if it is still relevant, or just a leftover ("outdated" or "unpopular")
  • The "temp or cache files associated with users could be monitored" by Google to identify their visiting patterns on the web and determine whether there is "an upward or downward trend in interest" in a given site/page.

Unique Word, Bigrams, Phrases in Anchor Text

Google intends to measure the profile of how anchor text appears over time to a particular site/page to watch for spam. They note that "naturally developed web graphs typically involve independent decisions. Synthetically generated web graphs, which are usually indicative of an intent to spam, are based on coordinated decisions". The difference in patterns can be measured and put to use to block spam.

Google notes that the "spikiness" of "anchor words/bigrams/phrases" is a prime measurement. They note that spam typical shows "the addition of a large number of identical anchors from many documents".

Linkage of Independent Peers

Google can also use link data from "independent peers (e.g., unrelated documents)" to check for spam. They say that a "sudden growth in the number of independent peers... with a large number of links... may indicate a potentially synthetic web graph, which is an indicator of an attempt to spam." Google notes that this "indication may be strengthened if the growth corresponds to anchor text that is unusually coherent or discordant" and that they can discount the value of these links either by a "fixed amount" or a "multiplicative factor" - this would give an additional penalty just for having these links.

Document Topics

Topic extraction can be performed by Google through the following methods:

  • Categorization
  • URL analysis
  • Content analysis
  • Clustering
  • Summarization
  • A set of unique low frequency words

The goal is to "monitor the topic(s) of a document over time and use this information for scoring purposes."

Google notes that "a spike in the number of topics could indicate spam" or that significant document topic changes may indicate that the website "has changed owners and previous document indicators, such as score, anchor text, etc., are no longer reliable." Google says that "if one or more of these situations are detected, (they) may reduce the relative score of such documents and/or the links, anchor text, or other data" from the website.


List of Additional Coverage & Resources

  1. The patent from US Patent and Trademark Office - US Patent #20050071741 - Information retrieval based on historical data
  2. From SEOChat Forums - Information Retrieval Based on Historical Data - Sandbox Explanation, Aging Delay?
  3. From Threadwatch - Google's War on SEO - Documented
  4. From SearchEngineWatch Forums - Does New Google Patent Validate Sandbox Theory?
  5. From HighRankings Forum - New Google Patent, Must Read
  6. From SERoundtable - Sandbox Explained by Google? "Information retrieval based on historical data"



Courtesy of SEOMoz