ABSTRACT

The World Wide Web is the largest collection of data today, and it continues to grow day by day. A web crawler is a program that downloads web pages from the World Wide Web, a process called web crawling. A search engine uses a web crawler to collect web pages from the WWW. Because of limitations in network bandwidth, time, and hardware, a web crawler cannot download all the pages; it is therefore important to select the most important pages as early as possible during the crawling process and to avoid downloading and visiting many irrelevant pages. This paper reviews research on the web crawling methods used for searching.

KEYWORDS

Web crawler, Web Crawling Algorithms, Search Engine

INTRODUCTION

A web crawler, or spider, is a computer program that browses the WWW in a sequential, automated manner. A crawler, sometimes also referred to as a bot or agent, is software whose purpose is to perform web crawling. The basic architecture of a web crawler is given below (Figure 1). More than 13% of the traffic to a website is generated by web search [1]. Today the web contains thousands of millions of web pages, and it is growing exponentially; the main problem this poses for a search engine is coping with the sheer size of the web. This large size leads to low coverage: search engine indexes cover no more than one-third of the publicly available web [12].
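As a minimal sketch of that architecture (our illustration, not the paper's implementation), the following Python program keeps a frontier queue of URLs, downloads each page, and feeds newly discovered links back into the frontier. The seed URL, the page limit, and the use of Python's standard urllib and html.parser modules are illustrative assumptions.

# Minimal breadth-first crawler sketch: frontier queue -> fetch -> parse -> enqueue.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute targets of <a href="..."> tags on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seed_url, max_pages=100):
    frontier = deque([seed_url])   # URLs waiting to be downloaded
    seen = {seed_url}              # avoids revisiting the same page
    downloaded = 0
    while frontier and downloaded < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue               # skip unreachable pages
        downloaded += 1
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
        yield url, html            # hand the downloaded page to the indexer

# Hypothetical usage: for url, html in crawl("https://example.com"): ...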

By analyzing the log files of various websites, researchers found that most web requests are generated by web crawlers, on average about 50% [15]. Crawling the web is not merely a programming task but an algorithm-design and system-design challenge, because the web's content is so large. At present, only Google claims to have indexed over 3 billion web pages. The web has doubled in size every 9-12 months, and its rate of change is very high [1, 2, 3]. About 40% of web pages change weekly [5] when even slight changes are counted; if we count only pages that change by a third or more, the weekly change rate is about 7% [7].

Researchers are developing new scheduling policies for downloading pages from the World Wide Web which guarantee that, even if we do not download all the web pages, we still download the most important ones (from the user's point of view). As the amount of Internet data grows, it becomes vital to download the important pages first, since it is impossible to download all of them.
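One common way to realize such a policy, sketched below under our own assumptions, is to replace the crawler's FIFO frontier with a priority queue keyed by an importance estimate, so the best-scored URL is always fetched next. Here the count of discovered in-links stands in for importance; the algorithms the paper surveys use various other scores, such as breadth-first order or relevance to a topic.

# Sketch of a priority-driven crawl frontier (best-first crawling).
# The in-link count used as the importance score is an assumption;
# real crawlers may use partial PageRank, topical relevance, etc.
import heapq
import itertools

class PriorityFrontier:
    def __init__(self):
        self._heap = []                     # entries: (-score, tie_breaker, url)
        self._scores = {}                   # current score per queued URL
        self._counter = itertools.count()   # stable tie-breaking for equal scores

    def add_or_promote(self, url):
        """Record one more in-link to url and reorder the frontier."""
        self._scores[url] = self._scores.get(url, 0) + 1
        # Lazy update: push a fresh entry; stale entries are skipped on pop.
        heapq.heappush(self._heap, (-self._scores[url], next(self._counter), url))

    def pop_best(self):
        """Return the queued URL with the highest importance score."""
        while self._heap:
            neg_score, _, url = heapq.heappop(self._heap)
            if self._scores.get(url) == -neg_score:   # entry still current?
                del self._scores[url]
                return url
        return None

frontier = PriorityFrontier()
for link in ["http://a.example", "http://b.example", "http://a.example"]:
    frontier.add_or_promote(link)
print(frontier.pop_best())   # http://a.example: two in-links, so fetched first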

The rest of the paper is organized as follows. Section 2 explains the fundamentals of web crawling. Section 3 details web crawler strategies, with diagrams. Section 4 presents a critical analysis, with tables. Section 5 discusses the research scope. The conclusion and references come last.

CONCLUSION

The paper surveys several crawling methods and algorithms used for downloading web pages from the World Wide Web. We believe that all of the algorithms discussed in this paper are effective and perform well for web search, reducing network traffic and crawling costs; on balance, however, the advantages favor the approach that uses HTTP GET requests, handles dynamic web pages, downloads only updated web pages, and uses a filter to produce relevant results.
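To make that combination concrete, the sketch below (our illustration, not the surveyed authors' code) issues a conditional HTTP GET with an If-Modified-Since header, so an unchanged page costs only a small 304 response instead of a full download, and applies a toy keyword filter standing in for whatever relevance filter a particular crawler uses.

# Sketch: re-download a page only if it changed, then filter for relevance.
# The Last-Modified / If-Modified-Since mechanism is standard HTTP; the
# keyword filter is a placeholder for a real relevance test.
import urllib.request
import urllib.error

def fetch_if_updated(url, last_modified=None):
    """Return (html, new_last_modified), or (None, last_modified) if unchanged."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return (response.read().decode("utf-8", errors="replace"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:              # Not Modified: skip the download
            return None, last_modified
        raise

def is_relevant(html, keywords=("crawler", "search")):
    """Toy relevance filter: keep pages mentioning any keyword."""
    text = html.lower()
    return any(word in text for word in keywords)

In practice, the crawler would store the returned Last-Modified value alongside each URL and pass it back on the next visit, so unchanged pages are never downloaded twice.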

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. Its members originally formed a virtual group on the Viber social network, and the core of the association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

The KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for all people across the world. We want to integrate research and education. We believe education is a fundamental human right, so our services concentrate on inclusive education.

The KSRA team partners with underserved local communities around the world to improve access to and the quality of knowledge-based education, to amplify and augment learning programs where they exist, and to create new opportunities for e-learning where traditional education systems are lacking or non-existent.

Full paper PDF file:

SURVEY OF WEB CRAWLING ALGORITHMS

Bibliography

Author: Rahul Kumar, Anurag Jain, and Chetan Agrawal
Year: 2016
Title: SURVEY OF WEB CRAWLING ALGORITHMS
Published in: Advances in Vision Computing: An International Journal
DOI: 10.5121/avc.2016.3301



Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship and holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology, as well as an honorary Ph.D. from the University of Chelsea.


Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.