Search Engine Land is reporting on a couple of new patents that have been granted to Google. What is interesting is that one of the patents deals with estimating similarity between web pages and documents, which may help to filter duplicate content. The patent was originally filed just over five years ago, on December 31, 2001.
The main features of the patent's duplicate content process include:
- ideas to reduce the amount of redundant or nearly redundant documents/pages crawled and returned in response to a user's search query
- helping search engine spider programs become more efficient by avoiding crawling sites determined to be substantial duplicates
- similarity profiles for pages based upon the lists of hyperlinks in those pages
- filtering search results that exceed a similarity threshold
- time comparisons of duplicate and near duplicate pages
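The patent itself does not spell out its algorithm in this post, but the general idea behind link-based similarity profiles and a similarity threshold can be sketched roughly. Here is a hypothetical illustration (the function names, the Jaccard measure, and the 0.9 threshold are all assumptions for demonstration, not the patent's actual method):

```python
# Hypothetical sketch: compare two pages by their outgoing hyperlinks
# using Jaccard similarity, and flag pairs that exceed a threshold.
# Illustrative only; not the method described in the patent.

def link_similarity(links_a, links_b):
    """Jaccard similarity between two collections of hyperlink URLs."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 1.0  # two pages with no links are trivially identical
    return len(a & b) / len(a | b)

def is_near_duplicate(links_a, links_b, threshold=0.9):
    """Flag the pair as near-duplicate if similarity exceeds the threshold."""
    return link_similarity(links_a, links_b) > threshold

page1 = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
page2 = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
page3 = ["http://example.com/x"]

print(is_near_duplicate(page1, page2))  # True  (identical link lists)
print(is_near_duplicate(page1, page3))  # False (no links in common)
```

A spider using something like this could skip re-crawling a page whose link profile matches one it has already fetched, which is the efficiency gain the patent points at.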
Google has other patents dealing with duplicate content as well. "Detecting duplicate and near-duplicate files" and "Detecting query-specific duplicate documents" also discuss methods to identify and sort duplicate and near-duplicate documents/pages.
Is duplicate content still an issue? Of course it is. The engines, especially Google and Ask, are focused on providing relevant results, and although Vanessa Fox of Google mentions that your site may not be penalized for duplicate content per se, it is most definitely still an issue.