AN EFFECTIVE IMPLEMENTATION OF WEB CRAWLING TECHNOLOGY TO RETRIEVE DATA FROM THE WORLD WIDE WEB (WWW)

Abstract

The Internet (or simply the web) is an enormous, rich, readily accessible, and timely source of information, and its number of users is growing rapidly. To retrieve information from the web, search engines are used, which access pages according to the requirements of the users. The web is extremely large and contains structured, semi-structured, and unstructured data. Most of the data present on the web is unmanaged, so it is impossible to access the whole World Wide Web (WWW) at once in a single attempt; for this reason, search engines use web crawlers. A web crawler is a fundamental component of a search engine. Information Retrieval deals with searching for and retrieving information within documents, and it also searches online databases and the web. This paper discusses, develops, and programs a web crawler to fetch information from the Internet and filter it into a usable, graphical form for users.

Keywords

Web Crawling, Web Technology, Data, Python, Data Extraction, Algorithm

INTRODUCTION

The World Wide Web (WWW) is a web client-server architecture. It is a powerful framework, based on complete autonomy of the server, for serving the information available on the web. The information is organized as a large, distributed, non-linear text system known as the Hypertext Document system. These systems designate parts of a document as hypertext: pieces of text or images that are linked to other documents via anchor tags. HTTP and HTML provide a standard method for retrieving and presenting the hyperlinked documents. Web browsers use web crawlers to explore the servers for the required pages of information.
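
As a small illustration of this retrieval model, the Python snippet below fetches one hypertext document over HTTP. It is not taken from the paper, and the URL is a placeholder:

    from urllib.request import urlopen

    # Retrieve a single hypertext document over HTTP (placeholder URL).
    with urlopen("http://example.com/", timeout=10) as response:
        html = response.read().decode("utf-8", "replace")

    # The anchor tags (<a href="...">) inside the HTML are what link
    # this document to other documents, possibly on other servers.
    print(html[:200])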

The pages sent by the servers are processed on the client side. Nowadays, using the Internet to access information from the World Wide Web (WWW) has become a significant part of human life. The present population of the world is about 7.049 billion, of which 2.40 billion people (34.3%) use the Internet [1] (see Figure 1). From 0.36 billion in 2000, the number of Internet users increased to 2.40 billion in 2012, i.e., an increase of 566.4% from 2000 to 2012. In Asia, out of 3.92 billion people, 1.076 billion (i.e., 27.5%) use the Internet, while in India, out of 1.2 billion people, 0.137 billion (11.4%) use the Internet. The same growth rate is expected in the future as well, and the day is not far when one will start thinking that life is incomplete without the Internet. Figure 1 illustrates Internet users in the world by geographic region.
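
The quoted growth figure follows directly from the two user counts. With the rounded values above, the computation gives approximately 567%; the cited 566.4% presumably reflects the unrounded counts:

    # Worked check of the 2000-to-2012 growth figure cited above.
    users_2000 = 0.36e9   # Internet users in 2000 (rounded)
    users_2012 = 2.40e9   # Internet users in 2012 (rounded)
    growth = (users_2012 - users_2000) / users_2000 * 100
    print(f"growth = {growth:.1f}%")   # ~566.7% with these rounded inputs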

Since 1990, the World Wide Web (WWW) has grown exponentially in size. As of today, it is estimated to contain around 55 billion publicly indexable web documents [2] spread across the world on millions of servers. It is difficult to search for information in such an immense collection of web documents. A web crawler is an important technique for collecting data on, and keeping up with, the rapidly expanding Internet. Web crawling can also be framed as a graph-search problem, where the web is viewed as a huge graph in which the nodes are the pages and the edges are the hyperlinks. Web crawlers can be used in various areas, the most prominent of which is to index a large set of pages and allow other people to search this index.
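
To make the graph-search framing concrete, here is a small breadth-first traversal over a toy link graph in Python; the page names and links are invented purely for illustration.

    from collections import deque

    # A toy web graph: keys are pages (nodes), values are the pages
    # they hyperlink to (edges). Entirely made up for illustration.
    links = {
        "A": ["B", "C"],
        "B": ["D"],
        "C": ["D", "E"],
        "D": [],
        "E": ["A"],   # cycles are normal on the web
    }

    def bfs(start):
        """Visit every page reachable from `start`, each exactly once."""
        order = []
        seen = {start}
        queue = deque([start])
        while queue:
            page = queue.popleft()
            order.append(page)
            for target in links[page]:
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return order

    print(bfs("A"))   # ['A', 'B', 'C', 'D', 'E']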


A web crawler does not actually move around the computers connected to the Internet, as viruses or intelligent agents do; rather, it simply sends requests for documents to web servers from a set of already known locations. The general procedure that a crawler follows is given in the list below (a minimal code sketch appears after it):

- It checks for the next page to download – the system keeps track of the pages to be downloaded in a queue.

- It checks whether the page is allowed to be downloaded – this is done by checking a robots exclusion file and also reading the header of the page for any exclusion instructions. Some people do not want their pages indexed by search engines.

- It downloads the whole page.

- It extracts all links from the page (additional site and page addresses) and adds them to the queue mentioned above, to be downloaded later.

- It extracts all words and saves them to a database associated with this page, preserving the order of the words so that people can search for phrases, not just keywords.

- Optionally, it filters for things like adult content, the language of the page, and so on.

- It saves a summary of the page and updates the last-processed date for the page, so that the system knows when it should re-check the page at a later stage.
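
The sketch below strings these steps together in Python, the language named in the paper's keywords. It is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the standard library, the seed URL is a hypothetical placeholder, words are indexed as raw whitespace-separated tokens (tag stripping and page summaries are omitted), and robots.txt is consulted only for the seed host.

    import urllib.robotparser
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of anchor tags on one page."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        # Resolve relative links against the page address.
                        self.links.append(urljoin(self.base_url, value))

    def crawl(seed_url, max_pages=20):
        queue = deque([seed_url])   # step 1: pages waiting to be downloaded
        seen = {seed_url}           # avoid queueing the same address twice
        index = {}                  # URL -> words in document order

        # Step 2: honour the robots exclusion file of the seed host.
        robots = urllib.robotparser.RobotFileParser()
        root = "{0.scheme}://{0.netloc}".format(urlparse(seed_url))
        robots.set_url(root + "/robots.txt")
        robots.read()

        while queue and len(index) < max_pages:
            url = queue.popleft()
            if not robots.can_fetch("*", url):
                continue            # the site owner opted out of indexing
            try:
                # Step 3: download the whole page.
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue            # unreachable pages are simply skipped

            # Step 4: extract all links and queue them for later download.
            parser = LinkExtractor(url)
            parser.feed(html)
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)

            # Step 5: save the words in order so phrase search is possible.
            index[url] = html.split()

        return index

    index = crawl("http://example.com/")   # hypothetical seed URL
    print(len(index), "pages indexed")

In a production crawler, the in-memory queue and index would be persistent, disk-backed structures; keeping that balance between RAM and disk access is exactly the difficulty the conclusion below points out.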

CONCLUSION

Web crawlers are a significant component of search engines, and high-performance web crawling processes are essential components of various web services. It is not a trivial matter to set up such systems: the data handled by these crawlers covers a wide area, and it is important to preserve a good balance between random-access memory and disk accesses. A web crawler is a way for search engines and other users to regularly ensure that their databases are up to date. Because web crawlers are a central part of search engines, details of their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms. Modification and extension of web crawling techniques should be the next topics of research in this area.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization founded in December 2013 to provide research and educational services. The members of the community had formed a virtual group on the Viber social network, and the core of the Kavian Scientific Association was formed with these members as founders. These individuals, led by Professor Siavosh Kaviani, decided to launch a scientific and research association with an emphasis on education.

The KSRA research association, as a non-profit research firm, is committed to providing research services in the field of knowledge. The main beneficiaries of this association are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for people across the whole spectrum of society worldwide. We want to create an integration between research and education. We believe education is a fundamental human right, so our services are concentrated on inclusive education.

The KSRA team partners with under-served local communities around the world to improve access to, and the quality of, knowledge-based education, to amplify and augment learning programs where they exist, and to create new opportunities for e-learning where traditional education systems are lacking or non-existent.

FULL Paper PDF file:

AN EFFECTIVE IMPLEMENTATION OF WEB CRAWLING TECHNOLOGY TO RETRIEVE DATA FROM THE WORLD WIDE WEB (WWW)

Bibliography

Author: F. M. Javed Mehedi Shamrat, Zarrin Tasnim, A.K.M Sazzadur Rahman, Naimul Islam Nobel, Syed Akhter Hossain

Year: 2020

Title: AN EFFECTIVE IMPLEMENTATION OF WEB CRAWLING TECHNOLOGY TO RETRIEVE DATA FROM THE WORLD WIDE WEB (WWW)

Published in: INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH, VOLUME 9, ISSUE 01, JANUARY 2020

PDF reference and original file: Click here


Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.


Professor Siavosh Kaviani was born in 1961 in Tehran. He held a professorship, and he holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.


Somayeh Nosrati was born in 1982 in Tehran. She holds a Master's degree in artificial intelligence from Khatam University of Tehran.