High placement in a search engine is critical for the success of any online business. Pages that appear higher in the search results for queries relevant to a site's business receive more targeted traffic. To gain this competitive advantage, Internet companies employ various SEO techniques to optimize the factors that search engines use to rank results.
In the best case, SEO specialists create relevant, well-structured, keyword-rich pages that not only please a search engine crawler but also have value for the human visitor. Unfortunately, this strategic approach takes months to produce tangible results, so many search engine optimizers resort to so-called "black-hat" SEO.
'Black Hat' SEO and Search Engine Spam
The oldest and simplest "black-hat" SEO strategy is to add a variety of popular keywords to web pages so that they rank high for popular queries. This behavior is easily detected, since such pages generally include unrelated keywords and lack topical focus. With the introduction of term vector analysis, search engines became immune to this sort of manipulation. However, "black-hat" SEO went one step further and created so-called "doorway" pages: tightly focused pages consisting of a bunch of keywords relevant to a single topic. In terms of keyword density such pages are able to rank high in search results, but human visitors never see them because they are redirected to the page intended to receive the traffic.
Another trend is the abuse of link-popularity-based ranking algorithms, such as PageRank, with the help of dynamically generated pages. Each such page receives the minimum guaranteed PageRank, and the small endorsements from thousands of these pages add up to a sizeable PageRank for the target page. Search engines constantly improve their algorithms to minimize the effect of "black-hat" SEO techniques, but SEOs persistently respond with new, more sophisticated and technically advanced tricks, so the process resembles an arms race.
"Black-hat" SEO is responsible for the immense amount of search engine spam: pages and links created solely to mislead search engines and boost rankings for client web sites. To weed out web spam, search engines can use statistical methods that compute distributions for a variety of page properties. The outlier values in these distributions can be associated with web spam. The ability to identify web spam is extremely valuable to search engines, not just because it allows them to exclude spam pages from their indices but also because those pages can be used to train more sophisticated machine learning algorithms capable of battling web spam with higher precision.
Using Statistics to Detect Search Engine Spam
An example of applying statistical methods to detect web spam is presented in the paper "Spam, Damn Spam, and Statistics" by Dennis Fetterly, Mark Manasse and Marc Najork of Microsoft [1]. They used two sets of pages downloaded from the Internet. The first set was crawled repeatedly from November 2002 to February 2003 and consisted of 150 million URLs. For each page the researchers recorded the HTTP status, time of download, document length, number of non-markup words, and a vector indicating the changes in page content between downloads. A sample of this set (751 pages) was inspected manually and 61 spam pages were discovered, or 8.1% of the sample, with a margin of error of 1.95% at 95% confidence.
The second set was crawled between July and September 2002 and comprised 429 million pages and 38 million HTTP redirects. For this set the following properties were recorded: the URL and the URLs of outgoing links and, for the HTTP redirects, the source and the target URL. Of this set, 535 pages were manually inspected and 37 of them were identified as spam (6.9%).
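The sample figures above can be checked with a few lines of arithmetic. The sketch below assumes the usual normal approximation for a proportion with z = 1.96; the article does not say which method the researchers used, and the function name is made up for illustration.

```python
import math

def spam_share_with_margin(spam_pages, sample_size, z=1.96):
    """Return (proportion, margin of error) using the normal approximation.

    z = 1.96 corresponds to 95% confidence; this choice is an assumption.
    """
    p = spam_pages / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

# Counts taken from the two manually inspected samples described above.
p1, m1 = spam_share_with_margin(61, 751)
p2, m2 = spam_share_with_margin(37, 535)

print(f"Set 1 sample: {p1:.1%} +/- {m1:.2%}")
print(f"Set 2 sample: {p2:.1%} +/- {m2:.2%}")
```

Under this assumption the margin for the first sample comes out at roughly 1.95%, which agrees with the figure quoted for that sample.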
The research concentrates on studying the following properties of web pages:
- URL properties, including length and percentage of non-alphabetical characters (dashes, digits, dots etc.).
- Host name resolutions.
- Linkage properties.
- Content properties.
- Content evolution properties.
- Clustering properties.
URL Properties
Search engine optimizers often use numerous automatically generated pages to funnel their low PageRank to a single target page. Since the pages are machine generated, we can expect their URLs to look different from those created by humans. The assumption is that these URLs are longer and include more non-alphabetical characters such as dashes, slashes or digits. When searching for spam pages we should consider the host component only, not the entire URL down to the page name.
Manual inspection of the 100 longest host names revealed that 80 of them belong to adult sites and 11 refer to financial and credit-related sites. Therefore, to produce a usable spam identification rule, the length property has to be combined with the percentage of non-alphabetical characters. In the given set 0.173% of URLs are at least 45 characters long and contain at least 6 dots, 5 dashes or 10 digits, and the vast majority of these pages appear to be spam. By changing the threshold values we can change the number of pages flagged as spam and the number of false positives.
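A rule of this kind is easy to express in code. The following is a minimal sketch, not the researchers' implementation: it applies the thresholds quoted above to the host component only (as the text recommends), and the function name and example URLs are invented for illustration.

```python
from urllib.parse import urlsplit

def looks_machine_generated(url, min_length=45, min_dots=6,
                            min_dashes=5, min_digits=10):
    """Flag a URL whose host name is long and heavy on dots, dashes or digits.

    The default thresholds are the ones quoted in the text; applying them
    to the host name rather than the full URL is an assumption.
    """
    host = urlsplit(url).hostname or ""
    if len(host) < min_length:
        return False
    dots = host.count(".")
    dashes = host.count("-")
    digits = sum(ch.isdigit() for ch in host)
    return dots >= min_dots or dashes >= min_dashes or digits >= min_digits

# Hypothetical URLs: a long dash-laden host name versus a short host
# name with a long path (the path is ignored).
suspicious = "http://cheap-loans-fast-cash-credit-cards-online-now.example.com/index.html"
ordinary = "http://www.example.com/a-very-long-path-with-many-dashes-1234567890"

print(looks_machine_generated(suspicious))
print(looks_machine_generated(ordinary))
```

Raising or lowering the keyword arguments corresponds to the trade-off described above between the number of flagged pages and the number of false positives.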
Host Name Resolutions
One can notice that Google, given a query q, tends to rank a page higher if the host component of the page's URL contains keywords from q. To exploit this, search engine optimizers stuff pages with URLs containing popular keywords and key phrases and set up DNS servers to resolve these host names to a single IP address. Generally, SEOs generate a large number of host names in order to rank for a wide variety of popular queries.
This behavior can also be detected relatively easily by observing the number of host names that resolve to a single IP. In our set 1,864,807 IP addresses are mapped to only one host name, and 599,632 IPs to 2 host names. There are also some extreme cases with hundreds of thousands of host names mapped to a single IP, and the record-breaking IP is referred to by 8,967,154 host names.
To flag pages as spam, a threshold of 10,000 name resolutions was chosen. About 3.46% of the pages in Set 2 are served from IP addresses referred to by 10,000 or more host names, and manual inspection of a sample of these pages showed that, with very few exceptions, they were spam. A lower threshold (1,000 name resolutions, or 7.08% of the pages in the set) produces an unacceptable number of false positives.
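Counting host names per IP address is a simple grouping operation. The sketch below assumes the crawl data is available as (host name, IP) pairs, which is an illustrative input format rather than the one used in the study; the function name and sample addresses are hypothetical.

```python
from collections import defaultdict

def ips_over_threshold(resolutions, threshold=10_000):
    """Return {ip: distinct host name count} for IPs at or above the threshold.

    `resolutions` is an iterable of (host_name, ip) pairs; host names are
    compared case-insensitively.
    """
    hosts_by_ip = defaultdict(set)
    for host, ip in resolutions:
        hosts_by_ip[ip].add(host.lower())
    return {ip: len(hosts) for ip, hosts in hosts_by_ip.items()
            if len(hosts) >= threshold}

# Tiny made-up data set; a threshold of 3 stands in for the 10,000 used above.
resolutions = [
    ("cheap-loans.example", "203.0.113.7"),
    ("fast-loans.example", "203.0.113.7"),
    ("best-loans.example", "203.0.113.7"),
    ("Best-Loans.example", "203.0.113.7"),  # duplicate after lowercasing
    ("blog.example", "198.51.100.2"),
]
flagged = ips_over_threshold(resolutions, threshold=3)
print(flagged)
```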
Linkage Properties
The Web, consisting of interlinked pages, has the structure of a graph. In graph terminology the number of outgoing links of a page is its out-degree, while the in-degree equals the number of links pointing to the page. By analyzing out-degree and in-degree values it is also possible to detect spam pages, which appear as outliers in the corresponding distributions.
In our set, for example, there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected. Overall, 0.05% of the pages in Set 2 have out-degrees that are at least three times more common than the Zipfian distribution would suggest, and according to manual inspection of a cross-section, almost all of them are spam.
The distribution of in-degrees is calculated similarly. For example, 369,457 pages have an in-degree of 1001, while according to the trend only 2,000 such pages are expected. Overall, 0.19% of the pages in Set 2 have in-degrees that are at least three times more common than the Zipfian distribution would suggest, and the majority of them are spam.
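The "three times more common than expected" test can be sketched as follows. The code assumes the trend is a power law fitted by ordinary least squares on log-log axes; the article does not say how the Zipfian trend was estimated, so the fitting method, the function names and the synthetic histogram are all assumptions.

```python
import math

def fit_power_law(histogram):
    """Least-squares fit of log(count) = intercept + slope * log(degree)."""
    points = [(math.log(d), math.log(c))
              for d, c in histogram.items() if d > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def overrepresented_degrees(histogram, factor=3.0):
    """Return degrees whose page count is >= factor times the fitted trend."""
    intercept, slope = fit_power_law(histogram)
    flagged = []
    for degree, count in sorted(histogram.items()):
        if degree <= 0:
            continue
        expected = math.exp(intercept + slope * math.log(degree))
        if count >= factor * expected:
            flagged.append(degree)
    return flagged

# Synthetic histogram {degree: number of pages} following count ~ 1/degree^2,
# with an artificial spike at degree 30 standing in for a link farm.
histogram = {d: round(1_000_000 / d ** 2) for d in range(1, 51)}
histogram[30] = 50_000
flagged = overrepresented_degrees(histogram)
print(flagged)
```

Note that the outliers themselves pull the fitted line slightly, so a more careful version might fit the trend on a robust subset of the data first.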
Content Properties
Despite the recent measures taken by search engines to diminish the effect of keyword stuffing, this technique is still used by some SEOs who generate pages filled with meaningless keywords to promote their AdSense pages. Quite often such pages are based on a single template and even have the same number of words, which makes them especially easy to detect using statistical methods.
For Set 1 the number of non-markup words in each page was recorded, so we can plot the variance of the word count across pages downloaded from a given host name. The variance is plotted on the x-axis and the word count on the y-axis; both axes use a logarithmic scale. Points on the left side of the graph, marked in blue, represent cases where at least 10 pages from a given host have the same word count. There are 944 such hosts (0.21% of the pages in Set 1). A random sample of 200 of these pages was examined manually: 35% were spam, 3.5% contained no text and 41.5% were soft errors (pages with a message indicating that the resource is not currently available, despite the HTTP status code 200 "OK").
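The zero-variance case is straightforward to detect. The sketch below assumes the input is a list of (host name, word count) pairs and interprets the criterion as "at least 10 pages, all with the same word count"; both the input format and the function name are illustrative.

```python
from collections import defaultdict
from statistics import pvariance

def template_like_hosts(pages, min_pages=10):
    """Return hosts with at least `min_pages` pages and zero word-count variance."""
    counts_by_host = defaultdict(list)
    for host, word_count in pages:
        counts_by_host[host].append(word_count)
    return sorted(host for host, counts in counts_by_host.items()
                  if len(counts) >= min_pages and pvariance(counts) == 0)

# Made-up data: one templated host, one ordinary host, one host with too few pages.
pages = (
    [("spam.example", 250)] * 12
    + [("blog.example", n) for n in (120, 340, 95, 410, 230, 180, 510, 75, 300, 260)]
    + [("tiny.example", 40)] * 3
)
suspects = template_like_hosts(pages)
print(suspects)
```

As the manual inspection figures above show, such hosts are only candidates: soft-error pages and empty pages also have identical word counts.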
Content Evolution
The natural evolution of content on the Web is slow. Over a period of a week 65% of all pages will not change at all, while only 0.8% will change completely. In contrast, many spam pages are generated in response to an HTTP request independently of the requested URL and change completely on every download. Therefore, by looking into extreme cases of content mutation, search engines are able to detect web spam.
The outliers represent IPs serving pages that change completely every week. Set 1 contains 367 such servers with 1,409,353 pages. Manual examination of a sample of 106 pages showed that 103 (97.2%) were spam, 2 were soft errors and 1 adult page was counted as a false positive.
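A minimal sketch of this check, assuming each observation is an (IP, changed fraction) pair where the fraction measures how much of a page differed between two consecutive weekly downloads. How that fraction is computed is not covered in the article, so it is treated here as a given input; the function name and addresses are hypothetical.

```python
from collections import defaultdict

def ips_with_total_churn(observations, min_pages=2, complete=1.0):
    """Return IPs where every observed page changed completely.

    `observations` is an iterable of (ip, changed_fraction) pairs with
    changed_fraction in [0, 1]; 1.0 means no content survived the week.
    """
    changes_by_ip = defaultdict(list)
    for ip, changed_fraction in observations:
        changes_by_ip[ip].append(changed_fraction)
    return sorted(ip for ip, changes in changes_by_ip.items()
                  if len(changes) >= min_pages
                  and all(c >= complete for c in changes))

# Made-up observations: the first server regenerates everything each week,
# the second mostly serves stable pages.
observations = [
    ("203.0.113.7", 1.0), ("203.0.113.7", 1.0), ("203.0.113.7", 1.0),
    ("198.51.100.2", 0.0), ("198.51.100.2", 0.3), ("198.51.100.2", 1.0),
]
churners = ips_with_total_churn(observations)
print(churners)
```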
Clustering Properties
Automatically generated spam pages tend to look very similar. In fact, as noted above, most of them are based on the same model and have only minor differences (such as varying keywords inserted into a template). Pages with such properties can be detected by applying cluster analysis to our samples.
To form clusters of similar pages, the "shingling" algorithm described by Broder et al. [2] is used. Figure 7 shows the distribution of cluster sizes for near-duplicate pages in Set 1. The horizontal axis shows the size of the cluster (the number of pages in the near-equivalence class), and the vertical axis shows how many such clusters Set 1 contains.
The outliers can be put into two groups. The first group did not contain any spam pages; pages in this group are related to the duplicated content issue instead. The second group, by contrast, is populated predominantly by spam documents: 15 of the 20 largest clusters were spam, containing 2,080,112 pages (1.38% of all pages in Set 1).
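The idea behind shingling can be illustrated with a small sketch: each page is reduced to its set of overlapping word sequences ("shingles"), and two pages are treated as near-duplicates when the overlap of their shingle sets is large. The code below compares all pairs directly, which only works for small inputs; the shingle width, the 0.5 threshold, the function names and the sample pages are assumptions chosen for illustration, not values from the paper.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles of a text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b):
    """Jaccard coefficient of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster_near_duplicates(pages, w=3, threshold=0.5):
    """Group page names whose resemblance is >= threshold (single-link)."""
    names = list(pages)
    sets = {name: shingles(pages[name], w) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if resemblance(sets[a], sets[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted((sorted(c) for c in clusters.values()), key=len, reverse=True)

template = "buy cheap {} online today with free shipping and a money back guarantee"
pages = {
    "spam-1": template.format("widgets"),
    "spam-2": template.format("gadgets"),
    "spam-3": template.format("watches"),
    "report": "the quarterly report covers revenue growth across all regional offices this year",
}
clusters = cluster_near_duplicates(pages)
print(clusters)
```

In this toy example the three templated pages differ by one keyword and land in a single cluster, while the unrelated page stays on its own.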
To Sum Up
The methods described above are examples of a fairly simple statistical approach to spam detection. Real-life algorithms are much more sophisticated and are based on machine learning technologies, which allow search engines to detect and battle spam with relatively high efficiency at an acceptable rate of false positives. Applying spam detection techniques enables search engines to produce more relevant results and ensures fairer competition based on the quality of web resources and not on technical tricks.
References:
1. Dennis Fetterly, Mark Manasse, Marc Najork. "Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages" (2004). Microsoft Research.
2. A. Broder, S. Glassman, M. Manasse, and G. Zweig. "Syntactic Clustering of the Web". In 6th International World Wide Web Conference, April 1997.