Search engine: how does it work?

Before we can get our websites listed among the highest positions, we must first understand how search engines work. A search engine consists of four modules, each assigned its own task.

Figure 2. Elements of a Search Engine [1]

Crawler module
The crawler module consists of software that collects and categorizes relevant objects from web documents [1][2]. This module creates programs called spiders that crawl the web-pages of the World Wide Web [3] and return to the search engine with the information they collect, which is then stored in the page repository [1]. Popular pages that users query frequently may remain in the page repository for an indefinite amount of time [1]. It is estimated that, of roughly 20 billion existing web-pages, search engines have crawled 8-10 billion [3].
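
The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a small in-memory dictionary (TOY_WEB) instead of real HTTP requests, and the names LinkExtractor, crawl and fetch are hypothetical, not taken from any real search engine.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from a seed URL.

    fetch is any callable that returns the HTML of a URL, or None if
    the page cannot be retrieved (an assumption of this sketch).
    Returns the page repository as a dict of URL -> HTML.
    """
    repository = {}
    frontier = deque([seed])
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in repository:
            continue  # already collected
        html = fetch(url)
        if html is None:
            continue  # unreachable page
        repository[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in repository:
                frontier.append(link)
    return repository


# A toy web: d.example exists but nothing links to it.
TOY_WEB = {
    "a.example": '<title>A</title><a href="b.example">to b</a><a href="c.example">to c</a>',
    "b.example": '<title>B</title><a href="c.example">to c</a>',
    "c.example": '<title>C</title><a href="a.example">back to a</a>',
    "d.example": "<title>D</title>",
}

repo = crawl("a.example", TOY_WEB.get)
print(sorted(repo))
```

Because the spider only follows hyperlinks, a page that no crawled page links to (d.example here) never reaches the repository, which is one reason a search engine covers only part of the web.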

Indexing module
The indexing module retrieves the pages stored in the page repository and extracts only their ‘vital descriptors’ [1]. The results of this extraction are compressed and stored in three types of index that differ in the information they keep. The first type, the content index, keeps content-based information such as the keywords (meta tags), title and anchor text used in a web-page [1]. The second type, the structure index, stores valuable information about the hyperlink structure of a web-page [1], such as the number and sources of the in-links pointing to it. Finally, the special-purpose index stores information extracted from certain file types such as PDF and image files [1].
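
As a rough illustration of the first two index types, the sketch below builds a content index (term -> pages that use it) and a structure index (page -> pages that link to it) from a toy page repository. The names DescriptorParser and build_indexes are hypothetical, compression is left out, and only title and anchor words are treated as descriptors, which is a simplification made for this example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class DescriptorParser(HTMLParser):
    """Collects title words, anchor words and outgoing links of a page."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self.out_links = []
        self._capture = False

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self._capture = True
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.out_links.append(href)

    def handle_endtag(self, tag):
        if tag in ("title", "a"):
            self._capture = False

    def handle_data(self, data):
        if self._capture:
            self.terms.extend(re.findall(r"[a-z0-9]+", data.lower()))


def build_indexes(repository):
    """Return (content_index, structure_index) for a URL -> HTML dict."""
    content_index = defaultdict(set)    # term -> pages using the term
    structure_index = defaultdict(set)  # page -> pages linking to it
    for url, html in repository.items():
        parser = DescriptorParser()
        parser.feed(html)
        for term in parser.terms:
            content_index[term].add(url)
        for target in parser.out_links:
            structure_index[target].add(url)
    return content_index, structure_index


repository = {
    "a.example": '<title>Cheap flights</title><a href="b.example">hotel deals</a>',
    "b.example": '<title>Hotel deals</title><a href="a.example">flights</a>',
    "c.example": '<title>Travel blog</title><a href="a.example">cheap flights</a>',
}

content_index, structure_index = build_indexes(repository)
print(sorted(content_index["cheap"]))
print(sorted(structure_index["a.example"]))
```

Storing in-links rather than out-links in the structure index matches the description above: the number and sources of in-links for any page can be read off directly.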

Query module
The query module processes all queries made by users [1] by retrieving web information stored in the indexes. Some popular web-pages whose information is not stored in the indexes may be retrieved directly from the page repository. The results displayed on the user’s computer screen are first filtered by the ranking module described in the following subsection.
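
A minimal sketch of this lookup, assuming a content index shaped as a dictionary of term -> set of pages: the query is split into terms, the page sets of all terms are intersected, and pages that sit in the page repository without being indexed are checked directly. The function run_query and the sample data are hypothetical, and the repository check uses crude substring matching purely for illustration.

```python
import re


def run_query(query, content_index, page_repository):
    """Return the set of pages that match every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    # Pages listed in the index under every query term.
    hits = set.intersection(*(content_index.get(t, set()) for t in terms))
    # Pages held in the repository that the index does not cover.
    indexed = set().union(*content_index.values())
    for url, text in page_repository.items():
        if url not in indexed and all(t in text.lower() for t in terms):
            hits.add(url)
    return hits


content_index = {
    "cheap": {"a.example", "c.example"},
    "flights": {"a.example", "b.example", "c.example"},
    "hotel": {"a.example", "b.example"},
}
# d.example is in the repository but has not been indexed.
page_repository = {"d.example": "Cheap flights to Lisbon this week"}

flight_hits = run_query("cheap flights", content_index, page_repository)
hotel_hits = run_query("hotel flights", content_index, page_repository)
print(sorted(flight_hits))
print(sorted(hotel_hits))
```

The result is an unordered set of candidate pages; putting them in order is the job of the ranking module.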

Ranking module
The ranking module takes the relevant web-pages obtained from both the query module and the structure index [1] and ranks them using the mathematical algorithm of the search engine [3]. The result is a set of web-pages ordered by relevance. In theory, therefore, the pages at the top of the list are the ones users should find most desirable.
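
Search engines do not publish their exact ranking formulas, so the sketch below uses a simplified PageRank-style power iteration, the family of methods discussed in [1], as a stand-in. The link graph, the damping value of 0.85 and the function names pagerank and rank_results are assumptions made for this example, not the method of any particular engine.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    out_links maps every page to the list of pages it links to;
    every link target must itself be a key of out_links.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in out_links[p]:
                new_rank[target] += damping * rank[p] / len(out_links[p])
        rank = new_rank
    return rank


def rank_results(relevant_pages, scores):
    """Order the pages returned by the query module, best score first."""
    return sorted(relevant_pages, key=lambda p: scores[p], reverse=True)


# Toy link graph: c.example receives the most in-links.
out_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

scores = pagerank(out_links)
ordered = rank_results({"a.example", "b.example", "d.example"}, scores)
print(ordered)
```

The scores depend only on the link structure, so they can be computed before any query arrives; only the final sort of the relevant pages happens per query.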

The search engine modules shown in Figure 2 fall into two categories based on whether they depend on user queries. Crawling and indexing run continuously on the web [1]; these modules are not triggered by user queries and are therefore grouped under the query-independent category. The query and ranking processes, on the other hand, are triggered by queries made by users [1] and are therefore grouped under the query-dependent category.


[1] Langville, A.N. and Meyer, C.D. (2006), Google’s PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, New Jersey
[2] Baeza-Yates, R., Castillo, C., Junqueira, F., Plachouras, V. and Silvestri, F. (2007), Challenges on Web Retrieval, Data Engineering, p.6-20, http://ieeexplore.ieee.org/iel5/4221634/4221635/04221649.pdf?tp=&isnumber=4221635&arnumber=4221649&punumber=4221634, Date accessed: April 28, 2008
[3] Fishkin, R., Beginner’s Guide to SEO, http://www.seomoz.org/articles/beginners-1-php#0%230, Date accessed: April 28, 2008
