Table of Contents





Abstract

This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first-search, the truth is that there are many challenges ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges and describes the state-of-the-art models and solutions. It also highlights avenues for future work.

Introduction

A web crawler (also known as a robot or a spider) is a system for the bulk downloading of web pages. Web crawlers are used for a variety of purposes. Most prominently, they are one of the main components of web search engines, systems that assemble a corpus of web pages, index them, and allow users to issue queries against the index and find the web pages that match the queries. A related use is web archiving (a service provided by e.g., the Internet archive [77]), where large sets of web pages are periodically collected and archived for posterity. A third use is web data mining, where web pages are analyzed for statistical properties, or where data analytics is performed on them (an example would be Attributor [7], a company that monitors the web for copyright and trademark infringements). Finally, web monitoring services allow their clients to submit standing queries or triggers, and they continuously crawl the web and notify clients of pages that match those queries (an example would be GigaAlert [64]).

The raison d’ˆetre for web crawlers lies in the fact that the web is not a centrally managed repository of information, but rather consists of hundreds of millions of independent web content providers, each one providing their own services, and many competing with one another. In other words, the web can be viewed as a federated information repository, held together by a set of agreed-upon protocols and data formats, such as the Transmission Control Protocol (TCP), the Domain Name Service (DNS), the Hypertext Transfer Protocol (HTTP), the Hypertext Markup Language (HTML) and the robots exclusion protocol. So, content aggregators (such as search engines or web data miners) have two choices: They can either adopt a pull model where they will proactively scour the web for new or updated information, or they could try to establish a convention and a set of protocols enabling content providers to push the content of interest to the aggregators. Indeed, the Harvest system [24], one of the earliest search services, adopted such a push model. However, this approach did not succeed, and virtually all content aggregators adopted the pull approach, with a few provisos to allow content providers to exclude all or part of their content from being crawled (the robots exclusion protocol) and to provide hints about their content, its importance and its rate of change (the Sitemaps protocol [110]).

There are several reasons why the push model did not become the primary means of acquiring content for search engines and other content aggregators: The fact that web servers are highly autonomous means that the barrier of entry to becoming a content provider is quite low, and the fact that the web protocols were at least initially extremely simple lowered the barrier even further — in fact, this simplicity is viewed by many as the reason why the web succeeded where earlier hypertext systems had failed. Adding push protocols would have complicated the set of web protocols and thus raised the barrier of entry for content providers, while the pull model does not require any extra protocols. By the same token, the pull model lowers the barrier of entry for content aggregators as well: Launching a crawler does not require any a priori buy-in from content providers, and indeed there are over 1,500 operating crawlers [47], extending far beyond the systems employed by the big search engines. Finally, the push model requires a trust relationship between the content provider and content aggregator, something that is not given on the web at large — indeed, the relationship between content providers and search engines are characterized by both mutual dependence and adversarial dynamics (see Section 6).

Challenges

The basic web-crawling algorithm is simple: Given a set of seed Uniform Resource Locators (URLs), a crawler downloads all the web pages addressed by the URLs, extracts the hyperlinks contained in the pages, and iteratively downloads the web pages addressed by these hyperlinks. Despite the apparent simplicity of this basic algorithm, web crawling has many inherent challenges:

• Scale. The web is very large and continually evolving. Crawlers that seek broad coverage and good freshness must achieve extremely high throughput, which poses many difficult engineering problems. Modern search engine companies employ thousands of computers and dozens of high-speed network links.

• Content selection tradeoffs. Even the highest-throughput crawlers do not purport to crawl the whole web or keep up with all the changes. Instead, crawling is performed selectively and in a carefully controlled order. The goals are to acquire high-value content quickly, ensure eventual coverage of all reasonable content, and bypass low-quality, irrelevant, redundant, and malicious content. The crawler must balance competing objectives such as coverage and freshness while obeying constraints such as per-site rate limitations. A balance must also be struck between exploration of potentially useful content and exploitation of content already known to be useful.

• Social obligations. Crawlers should be “good citizens” of the web, i.e., not impose too much of a burden on the web sites they crawl. In fact, without the right safety mechanisms, a high-throughput crawler can inadvertently carry out a denial-of-service attack.

• Adversaries. Some content providers seek to inject useless or misleading content into the corpus assembled by the crawler. Such behavior is often motivated by financial incentives, for example (mis)directing traffic to commercial web sites.

Outline

Web crawling is a many-faceted topic, and as with most interesting topics it cannot be split into fully orthogonal subtopics. Bearing that in mind, we structure the survey according to five relatively distinct lines of work that occur in the literature:

• Building an efficient, robust, and scalable crawler (Section 2).

• Selecting a traversal order of the web graph, assuming the content is well-behaved and is interconnected via HTML hyperlinks (Section 4).

• Scheduling revisitation of previously crawled content (Section 5).

• Avoiding problematic and undesirable content (Section 6).

• Crawling so-called “deep web” content, which must be accessed via HTML forms rather than hyperlinks (Section 7). Section 3 introduces the theoretical crawl ordering problem studied in Sections 4 and 5 and describes the structural and evolutionary properties of the web that influence crawl ordering. Section 8 gives a list of open problems.

Future Directions

As this survey indicates, crawling is a well-studied problem. However, there are at least as many open questions as there are resolved ones. Even in the material we have covered, the reader has likely noticed many open issues, including:

• Parameter tuning. Many of the crawl ordering policies rely on carefully tuned parameters, with little insight or science into how best to choose the parameters. For example, what is the optimal level of greediness for a scoped crawler (Section 4.2)?

• Retiring unwanted pages. Given finite storage capacity, in practice crawlers discard or retire low-quality and spam pages from their collections, to make room for superior pages. However, we are not aware of any literature that explicitly studies retirement policies. There is also the issue of how much metadata to retain about retired pages, to avoid accidentally rediscovering them, and to help assess the quality of related pages (e.g., pages on the same site, or pages linking to or linked from the retired page).

• Holistic crawl ordering. Whereas much attention has been paid to various crawl ordering sub-problems (e.g., prioritizing the crawl order of new pages, refreshing content from old pages, revisiting pages to discover new links), there is little work on how to integrate the disparate approaches into a unified strategy.

• Integration of theory and systems work. There is also little work that reconciles the theoretical work on crawl ordering (Sections 3–5) with the pragmatic work on largescale crawling architectures (Section 2). For example, diskbased URL queues make it infeasible to re-order URLs frequently, which may impede exact implementation of some of the crawl ordering policies. While we have hypothesized scalable implementations of some of the policies (Sections 4.3 and 5.3), a more comprehensive and empirical study of this topic is needed.

• Deep web. Clearly, the science and practice of deep web crawling (Section 7) are in its infancy.

There are also several nearly untouched directions:

• Vertical crawling. Some sites may be considered important enough to merit crawler specialization, e.g., eBay’s auction listings or Amazon’s product listings. Also, as the web matures, certain content dissemination structures become relatively standardized, e.g., news, blogs, tutorials, and discussion forums. In these settings the unit of content is not always a web page; instead, multiple content units are sometimes appended to the same page, or a single content unit may span multiple pages (connected via “next page” and ”previous page” links). Also, many links lead to redundant content (e.g., shortcut to featured story, or items for sale by a particular user). A crawler that understands these formats can crawl them more efficiently and effectively than a general-purpose crawler. There has been some work on crawling discussion forums [119], but little else. Vertical crawling is not the same as scoped crawling (Section 4.2): Scoped crawling focuses on collecting semantically coherent content from many sites, whereas vertical crawling exploits syntactic patterns on particular sites.

• Crawling scripts. Increasingly, web sites employ scripts (e.g., JavaScript, AJAX) to generate content and links on the fly. Almost no attention has been paid to whether or how to crawl these sites. We are aware of only one piece of preliminary work, by Duda et al. [55].

• Personalized content. Web sites often customize their content to individual users, e.g., Amazon gives personalized recommendations based on a user’s browsing and purchasing patterns. It is not clear how crawlers should treat such sites, e.g., emulating a generic user versus attempting to specialize the crawl based on different user profiles. A search engine that aims to personalize search results may wish to push some degree of personalization into the crawler.

• Collaboration between content providers and crawlers. As discussed in Section 1, crawling is a pull mechanism for discovering and acquiring content. Many sites now publish Sitemaps and RSS feeds, which enable a push-oriented approach. Modern commercial crawlers employ a hybrid of push and pull, but there is the little academic study of this practice and the issues involved. Schonfeld and Shivakumar [110] examined tradeoffs between reliance on Sitemaps for content discovery and the classical pull approach.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization to provide research / educational services in December 2013. The members of the community had formed a virtual group on the Viber social network. The core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific / research association with an emphasis on education.

KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public or private knowledge-based companies, students, researchers, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services Based on Education for all Spectrum people in the world. We want to make an integration between researches and educations. We believe education is the main right of Human beings. So our services should be concentrated on inclusive education.

The KSRA team partners with local under-served communities around the world to improve the access to and quality of knowledge based on education, amplify and augment learning programs where they exist, and create new opportunities for e-learning where traditional education systems are lacking or non-existent.

FULL Paper PDF file:

Web Crawling

Bibliography

author

Christopher Olston and Marc Najork

Year

2010

Title

Web Crawling

Publish in

Foundations and Trends in Information Retrieval

DOI

10.1561/1500000017

PDF reference and original file: Click here

Website | + posts

Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.

Website | + posts

Professor Siavosh Kaviani was born in 1961 in Tehran. He had a professorship. He holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.

+ posts

Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.