webbase
--- The Detailed Node Listing ---
Introduction
webbase
What is webbase ?
What webbase is not ?
Overview of the concepts used by webbase
The crawler that mirrors the web
Meta database
Home Pages database - table start
C language interface to the crawler
A more in-depth view of the crawler
How the crawler works
Where the work is done
webbase
What is webbase ?
webbase is a crawler for the Internet. It has two main functions : crawl the WEB to get documents and build a full text database from these documents.
The crawler part visits the documents and stores interesting information about them locally. It visits the documents on a regular basis to make sure that they are still there and updates them if they change.
The full text database uses the local copies of the documents to build a searchable index. The full text indexing functions are not included in webbase.
What webbase is not ?
webbase is not a full text database. It uses a full text database to search the content of the URLs retrieved.
The home site of webbase is Senga http://www.senga.org/webbase/html/. It contains the software, online documentation, formatted documentation and related software for various platforms.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
webbase
The crawler (or robot or spider) works according to specifications found in the Home Pages database.
The Home Pages database is a selection of starting points for the crawler, including specifications that drive its actions for each starting point.
The crawler is in charge of maintaining an up-to-date image of the WEB on the local disk. The set of URLs concerned is defined by the Home Pages database.
Using this local copy of the WEB, the full text database will build a searchable index to allow full text retrieval by the user.
The Home Pages database is a list of starting points. The webbase crawler is not designed to explore the entire WEB. It is best suited to build specialized search engines on a particular topic. The best way to start a webbase crawler is to put a bookmark file in the Home Pages database.
The crawler works on a set of URLs defined by the Home Pages database. It loads each page listed in the Home Pages database and extracts hypertext links from them. Then it explores these links recursively until there are no more pages or the maximum number of documents has been loaded.
The full text databases are designed to work with local files, not with URLs. This is why the crawler has to keep a copy of the URLs found on the local disk. In fact, a full text database able to handle URLs is called an Internet search engine :-)
The hooks library in webbase
is designed to provide a bridge between the crawler and a full text
indexing library. It contains a stub that does nothing (the
hooks.c
file) and an interface to the
mifluz full text indexing library (see
http://www.senga.org/mifluz/html/ to download it).
When crawling a document, it is possible to retrieve the language of the document and to store this information in the url table along with the other url information. To do this, you must use the langrec module. The language recognition module recognizes five languages : French, English, Italian, Spanish and German.
The job of the crawler is to maintain a file system copy of the WEB. It is, therefore, necessary to compute a unique file name for each URL. A FURL is simply a transformation of a URL into a file system PATH (hence FURL = File URL).
The crawler uses an MD5 key calculated from the URL as a path name. For example http://www.foo.com/ is transformed into the MD5 key 33024cec6160eafbd2717e394b5bc201 and the corresponding FURL is 33/02/4c/ec6160eafbd2717e394b5bc201. This makes it possible to store a large number of files even on file systems that do not support many entries in the same directory. The drawback is that it is hard to guess which file contains which URL.
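As an illustration, the following sketch shows how such a path could be derived from a URL. It is not part of the webbase API: the furl_from_url helper is hypothetical and the MD5 implementation is assumed to come from OpenSSL.

/* Sketch only: derive an MD5-based FURL path for a URL.
   The furl_from_url helper is illustrative, not a webbase function;
   MD5() is assumed to be provided by OpenSSL (openssl/md5.h). */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

static void furl_from_url(const char* url, char* furl, size_t size)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    char hex[2 * MD5_DIGEST_LENGTH + 1];
    int i;

    MD5((const unsigned char*)url, strlen(url), digest);
    for(i = 0; i < MD5_DIGEST_LENGTH; i++)
        sprintf(hex + 2 * i, "%02x", digest[i]);
    /* Split the 32 character key as xx/xx/xx/remainder. */
    snprintf(furl, size, "%.2s/%.2s/%.2s/%s", hex, hex + 2, hex + 4, hex + 6);
}

int main(void)
{
    char furl[64];
    furl_from_url("http://www.foo.com/", furl, sizeof(furl));
    /* Per the example above: 33/02/4c/ec6160eafbd2717e394b5bc201 */
    printf("%s\n", furl);
    return 0;
}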
An alternative encoding of the FURL is available through the uri library. It is much more readable and can conveniently be used if the number of URLs crawled is low (less than 50 000). The following figure shows how the URL is mapped to a PATH.
An URL is converted to a FURL in the following way :

http://www.ina.fr:700/imagina/index.html#queau

The host name and port are copied (the default port is added if not present), the path is copied and the fragment (#queau) is lost. The result is :

http:/www.ina.fr:700/imagina/index.html
A problem with URLs is that two URLs can lead to the same document and not be the same string. A well-known example of that is something%2Felse which is strictly identical to something/else. To cope with this problem a canonical form has been defined; it obeys complicated rules that lead to intuitive results.
By default the mapping uses the md5 encoding.

When starting the exploration of a WEB, the crawler must answer this question : is that link part of the server I'm exploring or not ? It is not enough to state that the URL is absolute or relative. The method used in webbase is simple : if the URL is in the same directory of the same server, then it belongs to the same WEB.
When surfing the WEB one can reach a large number of documents but not all the documents available on the Internet.
A very common example of this situation arises when someone adds new documents in an existing WEB server. The author will gradually write and polish the new documents, testing them and showing them to friends for suggestions. For a few weeks the documents will exist on the Internet but will not be public: if you don't know the precise URL of the document you can surf forever without reaching it. In other words, the document is served but no links point to it.
This situation is even more common when someone moves a server from one location to another. For one or two months the old location of the server will still answer to requests but links pointing to it will gradually point to the new one.
webbase is constantly working to keep the base up-to-date. It follows a schedule that you can describe in a crontab.

The administrator of webbase must dispatch the actions so that the Internet link and the machine are not overloaded one day and idle the next. For instance, it is a good idea to rebuild the database during the weekend and to crawl every weekday.
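For example, a crontab implementing this policy might look like the following sketch. The base name mybase and the exact times are assumptions; the options used (-home_pages, -rebuild) are described later in this manual.

# Crawl the Home Pages every weekday at 01:00.
0 1 * * 1-5 crawler -base mybase -home_pages
# Rebuild the full text database on Saturday at 02:00 (mifluz module required).
0 2 * * 6 crawler -base mybase -rebuild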
Various flags enable verbose levels in the crawler (see the
manual page). They are usually quite verbose and only useful to
know if the process is running or not. Error messages are very
explicit, at least for someone who knows the internals of
webbase
.
The WEB meta information database and the full text database are the main space eaters. The documents stored in the WLROOT cache can be very big if they are not expired.
webbase tends to be very memory hungry. An average crawler takes 4Mb of memory. For instance, running five simultaneous mirroring operations will need 20Mb of memory.
The WEB can be viewed as a huge disk with a lot of bad blocks and really slow access time. The WEB is structured in a hierarchical way, very similar to file systems found on traditional disks.
Crawling part of this huge disk (the WEB) for specific purposes implies the need for specific tools to deal with these particularities. Most of the tools that we already have to analyse and process data work with traditional file systems. In particular the mifluz database is able to efficiently build a full text database from a given set of files. The webbase crawler's main job is to map the WEB on the disk of a machine so that ordinary tools can work on it.
To run the crawler you should use the crawler(1) command.
crawler -base mybase -- http://www.ecila.fr/
will make a local copy of the http://www.ecila.fr/
URL.
When given a set of URLs, the crawler tries to load them all. It registers them as starting points for later recursion. These URLs will, therefore, be treated specially. The directory part of the URL will be used as a filter to prevent loading documents that do not belong to the same tree during recursion.
For each starting point the crawler will consider all the links contained in the document. Relative links will be converted to absolute links. Only the links that start with the same string as the starting point will be retained. All these documents will be loaded if they satisfy the recursion conditions (see below).
Any URLs contained in the pages that cannot be put in a canonical form will be silently rejected.
When all the documents found in the starting points are explored, they go through the same process. The recursion keeps exploring the servers until either it reaches the recursion limit or there are no more documents.
Exploration of a web site can be stopped using the robot
exclusion protocol. When the crawler finds a new host, it tries to
load the robots.txt
file. If it does not find one it
assumes that it is allowed to explore the entire site. If it does
find one, the content of the document is parsed to find directives
that may restrict the set of searchable directories.
A typical robots.txt
file looks like
User-agent: *
Disallow: /cgi-bin
Disallow: /secrets
In addition to the robots.txt
file, the robot
exclusion protocol forces the robot to not try to load a file from
the same server more than once each minute.
In order to keep a large number of URLs up-to-date locally, webbase has to apply some heuristics to prevent overloading the Internet link. Those heuristics are based on the interpretation of error messages and on delay definitions.
Error conditions are divided in two classes : transient errors
and fatal errors. Transient errors are supposed to go away after a
while and fatal errors mean that the document is lost for ever. One
error condition (Not Found
) is between the two : it is
defined to be transient but is most often fatal.
When transient errors have been detected, a few days are required before the crawler will try to load the document again (typically, 3 days).
When fatal errors have been detected, the crawler will never try to load the document again. The document will go away after a while, however. It is important to remember that a document stays associated with a fatal error for a few weeks, mainly because there is a good chance that many pages and catalogs still have a link to it. After a month or two, however, we can assume that every catalog and every search engine has been updated and does not contain any reference to the bad document.
If the document was loaded successfully, the crawler will not
try to reload it before a week or two (two weeks typically). Unless
someone is working very hard, bimonthly updates are quite rare.
When the crawler tries to load the document again, two weeks after
the first try, it is often informed that the document was not
changed (Not Modified
condition). In this case the
crawler will wait even longer before the next try (four weeks
typically).
The Not Found error condition is supposed to be transient. But since it is so often fatal, the document will be reloaded only after four weeks. The fatal error condition that is supposed to match the transient Not Found condition is Gone, but it is almost never used.
When starting to explore a starting point, the crawler uses a simple recursive algorithm as described above. It is possible, however, to control this exploration.
A filter can be specified to select eligible URLs. The filter is an emacs regular expression. If the expression returns true, the URL is explored; if it returns false, the URL is skipped.
Here is an example of a filter:
!/ecila\.com/ && ;\/french\/;
It matches the URLs not contained in the ecila.com
host that have the /french/
directory somewhere in
their path.
By default the crawler only accepts documents whose MIME type is text/*,magnus-internal/*. You can change this behaviour by setting the value of the -accept option of the crawler. The values listed are comma separated and can be either a fully qualified MIME type or the beginning of a MIME type followed by a star like in text/*. For instance to crawl PostScript documents in addition to HTML, the following option can be used:
-accept 'application/postscript,text/html'
An attempt is made to detect MIME types at an early stage. A table mapping the common file extensions to their MIME types allows the crawler to select the file names that are likely to contain such MIME types. This is not a 100% reliable method since only the server that provides the document is able to tell the crawler what type of document it is. But the standard on file name extensions is so widely spread and this method saves so many connections that it is worth the risk.
Many tests are performed to prevent the crawler from crashing in the middle of a mirror operation. Exceptions are trapped, malformed URLs are rejected with a message, etc. Two tests are configurable because they are sometimes inadequate for some servers.
If a URL is too long, it is rejected. Sometimes, cgi-bin scripts behave in such a way that they produce a call to themselves, adding a new parameter to make it slightly different. If the crawler dives into this page it will call the cgi-bin again and again, and the URL will keep growing. When the URL grows over a given size (typically, 160 bytes), it is rejected and a message is issued. It is possible to call the crawler with a parameter that changes the maximum size of a URL.
When mirroring large amounts of sites, you sometimes find really huge files. Log files from WEB servers of 50 Mb for instance. By default the crawler limits the size of data loaded from a document to 100K. If the document is larger, the data will be truncated and a message will be issued. This threshold can be changed if necessary.
A cookie is a name/value pair assigned to the visitor of a WEB by the server. This pair is sent to the WEB client when it connects for the first time. The client is expected to keep track of this pair and resend it with further requests to the server.
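For illustration, this is how the exchange looks at the HTTP level (the name/value pair is of course just an example):

server answer to the first request:

HTTP/1.0 200 OK
Set-Cookie: session=abc123; path=/

subsequent client requests:

GET /next.html HTTP/1.0
Cookie: session=abc123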
If a robot fails to handle this protocol the WEB server usually builds a special URL containing a string that identifies the client. Since all the WEB navigators build relative links from the current URL seen, the forged URL is used throughout the session and the server is still able to recognize the user.
webbase honors the cookie protocol transparently. This behaviour can be deactivated if it produces undesirable results. This may happen in the case of servers configured to deal with a restricted set of clients (only Netscape Navigator and MSIE for instance).
A proxy is a gateway to the external world. In the most common case a single proxy handles all the requests that imply a connection to a site outside the local network.
To specify a proxy for a given protocol you should set an environment variable. This variable will be read by the crawler and the specified proxy will be used.
export http_proxy=http://proxy.provider.com:8080/
export ftp_proxy=http://proxy.provider.com:8080/
If you don't want to use the proxy for a particular domain, for instance a server located on your local network, use the no_proxy variable. It can contain a list of domains separated by commas.
export no_proxy="mydom.com,otherdom.com"
To specify a proxy that will be used by all the commands that call the crawler, add the http_proxy, ftp_proxy and no_proxy variables in the <user>_env file located in /usr/local/bin or in the home directory of <user>. To change the values for a specific domain you just have to locally set the corresponding variables to a different value.
export http_proxy=http://proxy.iway.fr:8080/ ; crawler http://www.ecila.fr/
Using a proxy may have perverse effects on the accuracy of the crawl. Since the crawler implements heuristics to minimize the documents loaded, its functionalities are partially redundant with those of the proxy. If the proxy returns a Not Modified condition for a given document, it is probably because the proxy cache still considers it as Not Modified even though it may have changed on the reference server.
When the crawler successfully retrieves an URL, it submits it immediately to the full text database, if any. If you've downloaded the mifluz library, you should compile and link it to webbase.
The crawler calls full text indexing hooks whenever the status of a document changes. If the document is not found, the delete hook is called and the document is removed from the full text index. If a new document is found the insert hook is called to add it in the full text index.
The Meta database is a MySQL
database that contains
all the information needed by the crawler to crawl the WEB. Each
exploration starting point is described in the start
table.
The following command retrieves all the URLs known in the
test
Meta database.
$ mysql -e "select url from url" test
url
http://www.senat.fr/
http://www.senat.fr/robots.txt
http://www.senat.fr/somm.html
http://www.senat.fr/senju98/
http://www.senat.fr/nouveau.html
http://www.senat.fr/sidt.html
http://www.senat.fr/plan.html
http://www.senat.fr/requete2.html
http://www.senat.fr/_vti_bin/shtml.dll/index.html/map
...
The fields of the url table are :

url
url_md5
code
mtime
mtime_error
tags
content_type
content_length
complete_rowid : rowid field of an entry in the url_complete table. The url_complete table is filled with information that can be very big such as the hypertext links contained in an HTML document.
crawl
hookid : if mifluz is used it is the value of the rowid field. If 0 it means that the document was not indexed.
extract
title
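As a quick illustration of how this table can be queried directly (assuming, as above, a Meta database named test and that the code field holds the numeric status of the last crawl):

$ mysql -e "select url, code, content_type from url where code >= 400" test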
This table complements the url table. It contains information that may be very big so that the url table does not grow too much. An entry is created in the url_complete table for a corresponding url only if there is a need to store some data in its fields.
The URLs stored in the relative and absolute fields have been canonicalized. That means that they are syntactically valid URLs that can be string compared.

The fields of the url_complete table are :

keywords : content of the meta keywords HTML tag, if any.
description : content of the meta description HTML tag, if any.
cookie
base_url : content of the <base> HTML tag, if any.
relative
absolute
location
The mime2ext table associates all known mime types to file name extensions. Adding an entry in this table such as
insert into mime2ext values ('application/dsptype', 'tsp,');
will effectively prevent loading the URLs that end with the .tsp extension. Note that if you want to add a new MIME type so that it is recognized by the crawler and loaded, you should also update the list of MIME types listed in the set associated with the content_type field of the url table.
The fields of the mime2ext table are :

mime
ext : MUST be terminated by a comma.

The mime_restrict table is a cache for the crawler. If the mime2ext table is modified, the mime_restrict table should be cleaned with the following command:
mysql -e 'delete from mime_restrict' <base>
This deletion may be safely performed even if the crawler is running.
The url table is indexed on the rowid
and on the
url_md5
.
The url_complete table is indexed on the rowid
only.
The mime2ext table is indexed on the mime
and
ext
fields.
The Home Pages database is a collection of URLs that will be
used by the crawler as starting points for exploration. Universal
robots like AltaVista
do not need such lists because
they explore everything. But specialized robots like
webbase
have to define a set of URLs to work with.
Since the first goal of webbase
was to build a
search engine gathering French resources, the definition and the
choice of the attributes associated with each Home Page is somewhat
directed to this goal. In particular there is no special provision
to help building a catalog: no categories, no easy way to submit a
lot of URLs at once, no support to quickly test a catalog page.
The Home Pages database is stored in the start table of the meta information database. The fields of the table are divided in three main classes. Some of the fields must not be changed and are for internal use by the crawler.

Here is a short list of the fields and their meaning. Some of them are explained in greater detail in the following sections.
url
url_md5
info
depth (default 2)
level (default 1000)
timeout (default 60)
loaded_delay (default 7)
modified_delay (default 14) : used when the document was Not Modified at last crawl.
not_found_delay (default 30) : used when the document was Not Found at last crawl.
timeout_delay (default 3)
robot_delay (default 60)
auth
accept
filter
created
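The delay fields can be tuned per starting point directly in the start table. For example, assuming a starting point already registered for http://www.ecila.fr/, a base named mybase and delays expressed in days, one could recrawl successfully loaded documents every two weeks instead of every week:

$ mysql -e "update start set loaded_delay = 14 where url = 'http://www.ecila.fr/'" mybase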
The info field contains information set or read by the crawler when exploring the Home Page. This field may contain a combination of these values, separated by a comma. The meaning of the values is as follows:

unescape : the URL is converted to unescaped form before querying the server. Some servers do not handle the %xx notation to specify a character.
sticky
sleepy
nocookie
virgin
exploring
explored
updating
in_core : used by the crawler command.

webbase provides a C interface to the crawler. The following is a guide to the usage of this interface.
The initialization of the crawler is made by arguments, in the
same way the main()
function of a program is
initialized. Here is a short example:
{
  crawl_params_t* crawl;
  char* argv[] = {
    "myprog",
    "-base", "mybase",
    0
  };
  int argc = 3;

  crawl = crawl_init(argc, argv);
}
If the crawl_init function fails, it returns a null pointer. The crawl variable now holds an object that uses mybase to store the crawl information. The -base option is mandatory.
If you want to crawl a Home Page, use the
hp_load_in_core
function. Here is an example:
hp_load_in_core(crawl, "http://www.thisweb.com/");
This function recursively explores the Home Page given in argument. If the URL of the Home Page is not found in the start table, it will be added. The hp_load_in_core function does not return anything. Error conditions, if any, are stored in the entries describing each URL in the url table.
If you want to retrieve a specific URL that has already been
crawled use the crawl_touch
function (this function
must NOT be used to crawl a new URL). It will return a
webbase_url_t
object describing the URL. In addition,
if the content of the document is not in the WLROOT
cache, it will be crawled. Here is an example:
webbase_url_t* webbase_url = crawl_touch(crawl, "http://www.thisweb.com/agent.html");
If you want to access the document found at this URL, you can
get the full pathname of the temporary file that contains it in the
WLROOT
cache using the url_furl_string
function. Here is an example:
char* path = url_furl_string(webbase_url->w_url, strlen(webbase_url->w_url), URL_FURL_REAL_PATH);
The function returns a null pointer if an error occurs.
When you are finished with the crawler, you should free the
crawl
object with the crawl_free
function. Here is an example:
crawl_free(crawl);
When an error occurs, all those functions issue error messages on the stderr channel.
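Putting these calls together, a minimal program could look like the sketch below. The URLs and the base name are just examples; error handling is kept to a minimum.

#include <stdio.h>
#include <string.h>
#include "crawl.h"   /* also pulls in webbase_url.h */

int main(void)
{
  char* argv[] = { "myprog", "-base", "mybase", 0 };
  crawl_params_t* crawl = crawl_init(3, argv);
  webbase_url_t* webbase_url;
  char* path;

  if(!crawl) return 1;

  /* Recursively explore a Home Page. */
  hp_load_in_core(crawl, "http://www.thisweb.com/");

  /* Retrieve one of the URLs already crawled. */
  webbase_url = crawl_touch(crawl, "http://www.thisweb.com/agent.html");
  if(webbase_url) {
    /* Full pathname of the local copy in the WLROOT cache. */
    path = url_furl_string(webbase_url->w_url, strlen(webbase_url->w_url),
                           URL_FURL_REAL_PATH);
    if(path)
      printf("local copy: %s\n", path);
  }

  crawl_free(crawl);
  return 0;
}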
You will find the webbase_url_t structure in the webbase_url.h header file, which is automatically included by the crawl.h header file.
The real structure of webbase_url_t is made of included structures; however, macros hide these details. You should access all the fields using the w_<field> macro, such as:
char* location = webbase_url->w_location;
The fields of the webbase_url_t structure are :

int w_rowid : the mysql rowid.
char w_url[]
char w_url_md5[]
unsigned short w_code
time_t w_mtime
time_t w_mtime_error
unsigned short w_tags
char w_content_type[]
unsigned int w_content_length
int w_complete_rowid : rowid field of an entry in the url_complete table. The url_complete table is filled with information that can be very big such as the hypertext links contained in an HTML document.
time_t w_crawl
int w_hookid : if mifluz is used it is the value of the rowid field. If 0 it means that the document was not indexed.
char w_extract[]
char w_title[]
char w_keywords[] : content of the meta keywords HTML tag, if any.
char w_description[] : content of the meta description HTML tag, if any.
char w_cookie[]
char w_base_url[]
char w_relative[]
char w_absolute[]
char w_location
The webbase_url_t
structure holds all the
information describing an URL, including hypertext references.
However, it does not contain the content of the document.
The w_info field is a bit field. The allowed values are listed in the WEBBASE_URL_INFO_* defines. It is especially important to understand that flags must be tested prior to accessing some fields (w_cookie, w_base, w_home, w_relative, w_absolute, w_location). Here is an example:
if(webbase_url->w_info & WEBBASE_URL_INFO_LOCATION) {
  char* location = webbase_url->w_location;
  ...
}
If the corresponding flag is not set, the value of the field is undefined. All the strings are null terminated. You must assume that all the strings can be of arbitrary length.
The flags are :

FRAME
COMPLETE : the w_complete_rowid field is not null.
COOKIE : the w_cookie field contains a value.
BASE : the w_base field contains a value.
RELATIVE : the w_relative field contains a value.
ABSOLUTE : the w_absolute field contains a value.
LOCATION : the w_location field contains a value.
TIMEOUT : the timeout condition may be a server refusing connection as well as a slow server.
NOT_MODIFIED : the server answered with a Not Modified code. The true value of the code can be found in the w_code field.
NOT_FOUND : the server answered with a Not Found code.
OK
ERROR : Not Found or any other fatal error. The real code of the error can be found in the w_code field.
HTTP : the protocol of the URL is HTTP.
FTP : the protocol of the URL is FTP.
NEWS : the protocol of the URL is NEWS.
EXTRACT : the w_extract field contains a value.
TITLE : the w_title field contains a value.
KEYWORDS : the w_keywords field contains a value.
DESCRIPTION : the w_description field contains a value.
READING
TRUNCATED
FTP_DIR : the document is an ftp directory listing.

The values of the w_code field are defined by macros that start with WEBBASE_URL_CODE_*. Some artificial error conditions have been built and are not part of the standard. Their values are between 600 and 610.
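Assuming the flag names follow the WEBBASE_URL_INFO_<FLAG> pattern used in the LOCATION example above (this is an assumption, check webbase_url.h for the exact names), testing them looks like this:

/* Check how the last crawl of this URL went; the flag names below
   are assumed to follow the WEBBASE_URL_INFO_<FLAG> pattern. */
if(webbase_url->w_info & WEBBASE_URL_INFO_NOT_FOUND) {
  printf("%s : not found (code %d)\n", webbase_url->w_url, webbase_url->w_code);
} else if(webbase_url->w_info & WEBBASE_URL_INFO_TRUNCATED) {
  printf("%s : document truncated at %u bytes\n",
         webbase_url->w_url, webbase_url->w_content_length);
}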
A simple program that uses the crawler functions should include
the crawl.h
header file. When compiling it should
search for includes in /usr/local/include and
/usr/local/include/mysql. The libraries will be found in
/usr/local/lib and /usr/local/lib/mysql. Here is an example:
$ cc -c -I/usr/local/include \
     -I/usr/local/include/mysql myapp.c
$ cc -o myapp -L/usr/local/lib -lwebbase -lhooks -lctools \
     -L/usr/local/lib/mysql -lmysql myapp.o
$
The libraries webbase
, hooks
,
ctools
and mysql
are all mandatory. If
using mifluz you'll have to add other libraries,
as specified in the mifluz documentation.
The algorithms presented in this chapter are centered on functions and functionalities. Arrows between rectangle boxes represent the function calls. When a function successively calls different functions or integrates different functionalities, the arrows are numbered to show the order of execution. When a function is called, its name and a short comment are displayed in the rectangle box :

save_cookie : save cookie in DB

A rectangle box with rounded corners represents calls inside the module. A "normal" rectangle box represents calls outside the module.

The following figures show the algorithms implemented by the crawler. In all the cases, the crawler module is the central part of the application, which is why the following figures are centered on it.
The following figure presents what is done when a new URL is crawled.
The following figure presents a crawl from a list of URLs.
The following figure presents the rebuilding of an existing URL.
It is the main module of the application. It centralizes all the processing. It is first used for crawl parameter initialization, then manages the crawling. Some of the algorithms used in this module are presented in the previous section.
This module handles cookie related operations :
The cookie_match function is called by the crawler when building outgoing requests to send to HTTP servers. If it finds a matching cookie, cookie_match returns the cookie that must be used.
This module handles robots.txt and user defined Allow/Disallow clauses.
The dirsel module has two main functions : one to record the Allow/Disallow clauses (the dirsel_allow function) and one to check whether a URL is allowed (the dirsel_allowed function).
function)The http
module is used by webtools
.
When it reads an HTTP header or body, the webtools
module calls http_header
or http_body
functions (as callback functions) to manage information contained
in them. Information extracted from header or body of pages are
stored in a webbase_url_t
struct. The
html_content_begin
initializes an
html_content_t
structure then calls
html_content_parse
function to parse the body.
The robots module is in charge of the Robot Exclusion Protocol.
The robots_load function is used to create Allow/Disallow strings from the information contained in robots.txt files. This function first checks whether the information is contained in the current uri object. If not, it tries to find the information in the database. Finally, if it has not found it, it crawls the robots.txt file located at the current URL (and if this file does not exist, no Allow/Disallow clauses are created).
The webbase module is an interface to manipulate three of the most important tables of the webbase database :

The webbase_t structure mainly contains parameters used to connect to the database. The other modules that need access to the database use a webbase_t to store database parameters (robots, cookies, ...).
The webbase_url module mainly manages the webbase_url_*_t structures. These structures are described in the next paragraph and in webbase_url_t.
The webtools
module manages the connection between
the local machine and the remote HTTP server. That is to say :
Some examples on how to use the crawler on the command line : how to use the options, which options need other options, which options are not compatible, ...

Crawl a single URL :

crawler -base base_name -- http://url.to.crawl

rebuild option : rebuild URLs (remove all the records from the full text database and resubmit all the URLs for indexing, mifluz module mandatory).

crawler -base base_name -rebuild

rebuild a set of URLs (the where_url option is only used when associated with the rebuild option) :

crawler -base base_name -rebuild -where_url "url regex 'regex_clause'"

unload option : remove the starting point and all the URLs linked to it.

crawler -base base_name -unload -- http://url.to.unload

unload_keep_start option : same as unload except that the starting point is left in the DB.

crawler -base base_name -unload_keep_start -- http://url.to.unload

home_pages option : load all the URLs listed in the start table.

crawler -base base_name -home_pages

schema option : obtain a schema of the database.

crawler -base base_name -schema
webbase