webbase
--- The Detailed Node Listing ---
Introduction
webbase
What is webbase ?
What webbase is not ?
Overview of the concepts used by webbase
The crawler that mirrors the web
Meta database
Home Pages database - table start
C language interface to the crawler
A more in-depth view of the crawler
How the crawler works
Where the work is done
webbase
What is webbase ?
webbase is a crawler for the Internet. It has two main functions : crawl the WEB to get documents and build a full text database from these documents.
The crawler part visits the documents and stores interesting information about them locally. It visits the documents on a regular basis to make sure that they are still there and updates them if they change.
The full text database uses the local copies of the documents to build a searchable index. The full text indexing functions are not included in webbase.
What webbase is not ?
webbase is not a full text database. It uses a full text database to search the content of the URLs retrieved.
The home site of webbase is Senga http://www.senga.org/webbase/html/. It contains the software, online documentation, formatted documentation and related software for various platforms.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
webbase
The crawler (or robot or spider) works according to specifications found in the Home Pages database.
The Home Pages database is a selection of starting points for the crawler, including specifications that drive its actions for each starting point.
The crawler is in charge of maintaining an up-to-date image of the WEB on the local disk. The set of URLs concerned is defined by the Home Pages database.
Using this local copy of the WEB, the full text database will build a searchable index to allow full text retrieval by the user.
The Home Pages database is a list of starting points. The webbase crawler is not designed to explore the entire WEB. It is best suited to build specialized search engines on a particular topic. The best way to start a webbase crawler is to put a bookmark file in the Home Pages database.
The crawler works on a set of URLs defined by the Home Pages database. It loads each page listed in the Home Pages database and extracts hypertext links from them. Then it explores these links recursively until there are no more pages or the maximum number of documents has been loaded.
The full text databases are designed to work with local files, not with URLs. This is why the crawler has to keep a copy of the URLs found on the local disk. In fact, a full text database able to handle URLs is called an Internet search engine :-)
The hooks library in webbase
is designed to provide a bridge between the crawler and a full text
indexing library. It contains a stub that does nothing (the
hooks.c
file) and an interface to the
mifluz full text indexing library (see
http://www.senga.org/mifluz/html/ to download it).
When crawling a document, it is possible to retrieve the language of the document and to store this information in the url table along with the other url information. To do this, you must use the langrec module. The language recognition module recognizes five languages : French, English, Italian, Spanish and German.
The job of the crawler is to maintain a file system copy of the WEB. It is, therefore, necessary to compute a unique file name for each URL. A FURL is simply a transformation of a URL into a file system PATH (hence FURL = File URL).
The crawler uses an MD5 key calculated from the URL as a path name. For example http://www.foo.com/ is transformed into the MD5 key 33024cec6160eafbd2717e394b5bc201 and the corresponding FURL is 33/02/4c/ec6160eafbd2717e394b5bc201. This makes it possible to store a large number of files even on file systems that do not support many entries in the same directory. The drawback is that it is hard to guess which file contains which URL.
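As an illustration, the following sketch shows how such a path could be derived from a URL. It is not part of the webbase API: the furl_from_url helper is hypothetical and the MD5 implementation is assumed to come from OpenSSL.

/* Sketch only: derive an MD5-based FURL path for a URL.
   The furl_from_url helper is illustrative, not a webbase function;
   MD5() is assumed to be provided by OpenSSL (openssl/md5.h). */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

static void furl_from_url(const char* url, char* furl, size_t size)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    char hex[2 * MD5_DIGEST_LENGTH + 1];
    int i;

    MD5((const unsigned char*)url, strlen(url), digest);
    for(i = 0; i < MD5_DIGEST_LENGTH; i++)
        sprintf(hex + 2 * i, "%02x", digest[i]);
    /* Split the 32 character key as xx/xx/xx/remainder. */
    snprintf(furl, size, "%.2s/%.2s/%.2s/%s", hex, hex + 2, hex + 4, hex + 6);
}

int main(void)
{
    char furl[64];
    furl_from_url("http://www.foo.com/", furl, sizeof(furl));
    /* Per the example above: 33/02/4c/ec6160eafbd2717e394b5bc201 */
    printf("%s\n", furl);
    return 0;
}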
An alternative encoding of the FURL is available through the uri library. It is much more readable and can conveniently be used if the number of URLs crawled is low (less than 50 000). The following figure shows how the URL is mapped to a PATH.
An URL is converted to a FURL in the following way :

http://www.ina.fr:700/imagina/index.html#queau

The host name and port are copied (the default port is added if not present), the path is copied and the fragment (#queau) is lost. The result is :

http:/www.ina.fr:700/imagina/index.html
A problem with URLs is that two URLs can lead to the same document and not be the same string. A well-known example of that is something%2Felse which is strictly identical to something/else. To cope with this problem a canonical form has been defined; it obeys complicated rules that lead to intuitive results.
By default the mapping uses the md5 encoding.

When starting the exploration of a WEB, the crawler must answer this question : is that link part of the server I'm exploring or not ? It is not enough to state that the URL is absolute or relative. The method used in webbase is simple : if the URL is in the same directory of the same server, then it belongs to the same WEB.
When surfing the WEB one can reach a large number of documents but not all the documents available on the Internet.
A very common example of this situation arises when someone adds new documents in an existing WEB server. The author will gradually write and polish the new documents, testing them and showing them to friends for suggestions. For a few weeks the documents will exist on the Internet but will not be public: if you don't know the precise URL of the document you can surf forever without reaching it. In other words, the document is served but no links point to it.
This situation is even more common when someone moves a server from one location to another. For one or two months the old location of the server will still answer to requests but links pointing to it will gradually point to the new one.
webbase is constantly working to keep the base up-to-date. It follows a schedule that you can describe in a crontab.

The administrator of webbase must dispatch the actions so that the Internet link and the machine are not overloaded one day and idle the next. For instance, it is a good idea to rebuild the database during the weekend and to crawl every weekday.
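For example, a crontab implementing this policy might look like the following sketch. The base name mybase and the exact times are assumptions; the options used (-home_pages, -rebuild) are described later in this manual.

# Crawl the Home Pages every weekday at 01:00.
0 1 * * 1-5 crawler -base mybase -home_pages
# Rebuild the full text database on Saturday at 02:00 (mifluz module required).
0 2 * * 6 crawler -base mybase -rebuild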
Various flags enable verbose levels in the crawler (see the
manual page). They are usually quite verbose and only useful to
know if the process is running or not. Error messages are very
explicit, at least for someone who knows the internals of
webbase
.
The WEB meta information database and the full text database are the main space eaters. The documents stored in the WLROOT cache can be very big if they are not expired.
webbase tends to be very memory hungry. An average crawler takes 4Mb of memory. For instance, running five simultaneous mirroring operations will need 20Mb of memory.
The WEB can be viewed as a huge disk with a lot of bad blocks and really slow access time. The WEB is structured in a hierarchical way, very similar to file systems found on traditional disks.
Crawling part of this huge disk (the WEB) for specific purposes implies the need for specific tools to deal with these particularities. Most of the tools that we already have to analyse and process data work with traditional file systems. In particular the mifluz database is able to efficiently build a full text database from a given set of files. The webbase crawler's main job is to map the WEB on the disk of a machine so that ordinary tools can work on it.
To run the crawler you should use the crawler(1) command.
crawler -base mybase -- http://www.ecila.fr/
will make a local copy of the http://www.ecila.fr/
URL.
When given a set of URLs, the crawler tries to load them all. It registers them as starting points for later recursion. These URLs will, therefore, be treated specially. The directory part of the URL will be used as a filter to prevent loading documents that do not belong to the same tree during recursion.
For each starting point the crawler will consider all the links contained in the document. Relative links will be converted to absolute links. Only the links that start with the same string as the starting point will be retained. All these documents will be loaded if they satisfy the recursion conditions (see below).
Any URLs contained in the pages that cannot be put in a canonical form will be silently rejected.
When all the documents found in the starting points are explored, they go through the same process. The recursion keeps exploring the servers until either it reaches the recursion limit or there are no more documents.
Exploration of a web site can be stopped using the robot
exclusion protocol. When the crawler finds a new host, it tries to
load the robots.txt
file. If it does not find one it
assumes that it is allowed to explore the entire site. If it does
find one, the content of the document is parsed to find directives
that may restrict the set of searchable directories.
A typical robots.txt
file looks like
User-agent: *
Disallow: /cgi-bin
Disallow: /secrets
In addition to the robots.txt
file, the robot
exclusion protocol forces the robot to not try to load a file from
the same server more than once each minute.
In order to keep a large number of URLs up-to-date locally, webbase has to apply some heuristics to prevent overloading the Internet link. Those heuristics are based on the interpretation of error messages and on delay definitions.
Error conditions are divided in two classes : transient errors
and fatal errors. Transient errors are supposed to go away after a
while and fatal errors mean that the document is lost for ever. One
error condition (Not Found
) is between the two : it is
defined to be transient but is most often fatal.
When transient errors have been detected, a few days are required before the crawler will try to load the document again (typically, 3 days).
When fatal errors have been detected, the crawler will never try to load the document again. The document will go away after a while, however. It is important to remember that a document stays associated with a fatal error for a few weeks, mainly because there is a good chance that many pages and catalogs still have a link to it. After a month or two, however, we can assume that every catalog and every search engine has been updated and does not contain any reference to the bad document.
If the document was loaded successfully, the crawler will not
try to reload it before a week or two (two weeks typically). Unless
someone is working very hard, bimonthly updates are quite rare.
When the crawler tries to load the document again, two weeks after
the first try, it is often informed that the document was not
changed (Not Modified
condition). In this case the
crawler will wait even longer before the next try (four weeks
typically).
The Not Found error condition is supposed to be transient. But since it is so often fatal, the document will be reloaded only after four weeks. The fatal error condition that is supposed to match the transient Not Found condition is Gone, but it is almost never used.
When starting to explore a starting point, the crawler uses a simple recursive algorithm as described above. It is possible, however, to control this exploration.
A filter can be specified to select eligible URLs. The filter is an emacs regular expression. If the expression returns true, the URL is explored; if it returns false, the URL is skipped.
Here is an example of a filter:
!/ecila\.com/ && ;\/french\/;
It matches the URLs not contained in the ecila.com
host that have the /french/
directory somewhere in
their path.
By default the crawler only accepts documents whose MIME type is text/*,magnus-internal/*. You can change this behaviour by setting the value of the -accept option of the crawler. The values listed are comma separated and can be either a fully qualified MIME type or the beginning of a MIME type followed by a star like in text/*. For instance to crawl PostScript documents in addition to HTML, the following option can be used:
-accept 'application/postscript,text/html'
An attempt is made to detect MIME types at an early stage. A table mapping the common file extensions to their MIME types allows the crawler to select the file names that are likely to contain such MIME types. This is not a 100% reliable method since only the server that provides the document is able to tell the crawler what type of document it is. But the standard on file name extensions is so widely spread and this method saves so many connections that it is worth the risk.
Many tests are performed to prevent the crawler from crashing in the middle of a mirror operation. Exceptions are trapped, malformed URLs are rejected with a message, etc. Two tests are configurable because they are sometimes inadequate for some servers.
If a URL is too long, it is rejected. Sometimes, cgi-bin scripts behave in such a way that they produce a call to themselves, adding a new parameter to make it slightly different. If the crawler dives into this page it will call the cgi-bin again and again, and the URL will keep growing. When the URL grows over a given size (typically, 160 bytes), it is rejected and a message is issued. It is possible to call the crawler with a parameter that changes the maximum size of a URL.
When mirroring large amounts of sites, you sometimes find really huge files. Log files from WEB servers of 50 Mb for instance. By default the crawler limits the size of data loaded from a document to 100K. If the document is larger, the data will be truncated and a message will be issued. This threshold can be changed if necessary.
A cookie is a name/value pair assigned to the visitor of a WEB by the server. This pair is sent to the WEB client when it connects for the first time. The client is expected to keep track of this pair and resend it with further requests to the server.
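For illustration, this is how the exchange looks at the HTTP level (the name/value pair is of course just an example):

server answer to the first request:

HTTP/1.0 200 OK
Set-Cookie: session=abc123; path=/

subsequent client requests:

GET /next.html HTTP/1.0
Cookie: session=abc123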
If a robot fails to handle this protocol the WEB server usually builds a special URL containing a string that identifies the client. Since all the WEB navigators build relative links from the current URL seen, the forged URL is used throughout the session and the server is still able to recognize the user.
webbase honors the cookie protocol transparently. This behaviour can be deactivated if it produces undesirable results. This may happen in the case of servers configured to deal with a restricted set of clients (only Netscape Navigator and MSIE for instance).
A proxy is a gateway to the external world. In the most common case a single proxy handles all the requests that imply a connection to a site outside the local network.
To specify a proxy for a given protocol you should set an environment variable. This variable will be read by the crawler and the specified proxy will be used.
export http_proxy=http://proxy.provider.com:8080/
export ftp_proxy=http://proxy.provider.com:8080/
If you don't want to use the proxy for a particular domain, for instance a server located on your local network, use the no_proxy variable. It can contain a list of domains separated by commas.
export no_proxy="mydom.com,otherdom.com"
To specify a proxy that will be used by all the commands that call the crawler, add the http_proxy, ftp_proxy and no_proxy variables in the <user>_env file located in /usr/local/bin or in the home directory of <user>. To change the values for a specific domain you just have to locally set the corresponding variables to a different value.
export http_proxy=http://proxy.iway.fr:8080/ ; crawler http://www.ecila.fr/
Using a proxy may have perverse effects on the accuracy of the crawl. Since the crawler implements heuristics to minimize the documents loaded, its functionalities are partially redundant with those of the proxy. If the proxy returns a Not Modified condition for a given document, it is probably because the proxy cache still considers it as Not Modified even though it may have changed on the reference server.
When the crawler successfully retrieves an URL, it submits it immediately to the full text database, if any. If you've downloaded the mifluz library, you should compile and link it to webbase.
The crawler calls full text indexing hooks whenever the status of a document changes. If the document is not found, the delete hook is called and the document is removed from the full text index. If a new document is found the insert hook is called to add it in the full text index.
The Meta database is a MySQL
database that contains
all the information needed by the crawler to crawl the WEB. Each
exploration starting point is described in the start
table.
The following command retrieves all the URLs known in the
test
Meta database.
$ mysql -e "select url from url" test
url
http://www.senat.fr/
http://www.senat.fr/robots.txt
http://www.senat.fr/somm.html
http://www.senat.fr/senju98/
http://www.senat.fr/nouveau.html
http://www.senat.fr/sidt.html
http://www.senat.fr/plan.html
http://www.senat.fr/requete2.html
http://www.senat.fr/_vti_bin/shtml.dll/index.html/map
...
The fields of the url table are :

url
url_md5
code
mtime
mtime_error
tags
content_type
content_length
complete_rowid : rowid field of an entry in the url_complete table. The url_complete table is filled with information that can be very big such as the hypertext links contained in an HTML document.
crawl
hookid : if mifluz is used it is the value of the rowid field. If 0 it means that the document was not indexed.
extract
title
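As a quick illustration of how this table can be queried directly (assuming, as above, a Meta database named test and that the code field holds the numeric status of the last crawl):

$ mysql -e "select url, code, content_type from url where code >= 400" test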
This table complements the url table. It contains information that may be very big so that the url table does not grow too much. An entry is created in the url_complete table for a corresponding url only if there is a need to store some data in its fields.
The URLs stored in the relative and absolute fields have been canonicalized. That means that they are syntactically valid URLs that can be string compared.

The fields of the url_complete table are :

keywords : content of the meta keywords HTML tag, if any.
description : content of the meta description HTML tag, if any.
cookie
base_url : content of the <base> HTML tag, if any.
relative
absolute
location
The mime2ext table associates all known mime types to file name extensions. Adding an entry in this table such as
insert into mime2ext values ('application/dsptype', 'tsp,');
will effectively prevent loading the URLs that end with the .tsp extension. Note that if you want to add a new MIME type so that it is recognized by the crawler and loaded, you should also update the list of MIME types listed in the set associated with the content_type field of the url table.
The fields of the mime2ext table are :

mime
ext : MUST be terminated by a comma.

The mime_restrict table is a cache for the crawler. If the mime2ext table is modified, the mime_restrict table should be cleaned with the following command:
mysql -e 'delete from mime_restrict' <base>
This deletion may be safely performed even if the crawler is running.
The url table is indexed on the rowid
and on the
url_md5
.
The url_complete table is indexed on the rowid
only.
The mime2ext table is indexed on the mime
and
ext
fields.
The Home Pages database is a collection of URLs that will be
used by the crawler as starting points for exploration. Universal
robots like AltaVista
do not need such lists because
they explore everything. But specialized robots like
webbase
have to define a set of URLs to work with.
Since the first goal of webbase
was to build a
search engine gathering French resources, the definition and the
choice of the attributes associated with each Home Page is somewhat
directed to this goal. In particular there is no special provision
to help building a catalog: no categories, no easy way to submit a
lot of URLs at once, no support to quickly test a catalog page.
The Home Pages database is stored in the start table of the meta information database. The fields of the table are divided in three main classes. Some of the fields must not be changed and are for internal use by the crawler.

Here is a short list of the fields and their meaning. Some of them are explained in greater detail in the following sections.
url
url_md5
info
depth (default 2)
level (default 1000)
timeout (default 60)
loaded_delay (default 7)
modified_delay (default 14) : used when the document was Not Modified at last crawl.
not_found_delay (default 30) : used when the document was Not Found at last crawl.
timeout_delay (default 3)
robot_delay (default 60)
auth
accept
filter
created
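The delay fields can be tuned per starting point directly in the start table. For example, assuming a starting point already registered for http://www.ecila.fr/, a base named mybase and delays expressed in days, one could recrawl successfully loaded documents every two weeks instead of every week:

$ mysql -e "update start set loaded_delay = 14 where url = 'http://www.ecila.fr/'" mybase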
The info field contains information set or read by the crawler when exploring the Home Page. This field may contain a combination of these values, separated by a comma. The meaning of the values is as follows:

unescape : the URL is converted to unescaped form before querying the server. Some servers do not handle the %xx notation to specify a character.
sticky
sleepy
nocookie
virgin
exploring
explored
updating
in_core : used by the crawler command.

webbase provides a C interface to the crawler. The following is a guide to the usage of this interface.
The initialization of the crawler is made by arguments, in the
same way the main()
function of a program is
initialized. Here is a short example:
{
  crawl_params_t* crawl;
  char* argv[] = {
    "myprog",
    "-base", "mybase",
    0
  };
  int argc = 3;

  crawl = crawl_init(argc, argv);
}
If the crawl_init function fails, it returns a null pointer. The crawl variable now holds an object that uses mybase to store the crawl information. The -base option is mandatory.
If you want to crawl a Home Page, use the
hp_load_in_core
function. Here is an example:
hp_load_in_core(crawl, "http://www.thisweb.com/");
This function recursively explores the Home Page given in argument. If the URL of the Home Page is not found in the start table, it will be added. The hp_load_in_core function does not return anything. Error conditions, if any, are stored in the entries describing each URL in the url table.
If you want to retrieve a specific URL that has already been
crawled use the crawl_touch
function (this function
must NOT be used to crawl a new URL). It will return a
webbase_url_t
object describing the URL. In addition,
if the content of the document is not in the WLROOT
cache, it will be crawled. Here is an example:
webbase_url_t* webbase_url = crawl_touch(crawl, "http://www.thisweb.com/agent.html");
If you want to access the document found at this URL, you can
get the full pathname of the temporary file that contains it in the
WLROOT
cache using the url_furl_string
function. Here is an example:
char* path = url_furl_string(webbase_url->w_url, strlen(webbase_url->w_url), URL_FURL_REAL_PATH);
The function returns a null pointer if an error occurs.
When you are finished with the crawler, you should free the
crawl
object with the crawl_free
function. Here is an example:
crawl_free(crawl);
When an error occurs, all those functions issue error messages on the stderr channel.
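Putting these calls together, a minimal program could look like the sketch below. The URLs and the base name are just examples; error handling is kept to a minimum.

#include <stdio.h>
#include <string.h>
#include "crawl.h"   /* also pulls in webbase_url.h */

int main(void)
{
  char* argv[] = { "myprog", "-base", "mybase", 0 };
  crawl_params_t* crawl = crawl_init(3, argv);
  webbase_url_t* webbase_url;
  char* path;

  if(!crawl) return 1;

  /* Recursively explore a Home Page. */
  hp_load_in_core(crawl, "http://www.thisweb.com/");

  /* Retrieve one of the URLs already crawled. */
  webbase_url = crawl_touch(crawl, "http://www.thisweb.com/agent.html");
  if(webbase_url) {
    /* Full pathname of the local copy in the WLROOT cache. */
    path = url_furl_string(webbase_url->w_url, strlen(webbase_url->w_url),
                           URL_FURL_REAL_PATH);
    if(path)
      printf("local copy: %s\n", path);
  }

  crawl_free(crawl);
  return 0;
}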
You will find the webbase_url_t structure in the webbase_url.h header file, which is automatically included by the crawl.h header file.
The real structure of webbase_url_t is made of included structures; however, macros hide these details. You should access all the fields using the w_<field> macro, such as:
char* location = webbase_url->w_location;
The fields of the webbase_url_t structure are :

int w_rowid : the mysql rowid.
char w_url[]
char w_url_md5[]
unsigned short w_code
time_t w_mtime
time_t w_mtime_error
unsigned short w_tags
char w_content_type[]
unsigned int w_content_length
int w_complete_rowid : rowid field of an entry in the url_complete table. The url_complete table is filled with information that can be very big such as the hypertext links contained in an HTML document.
time_t w_crawl
int w_hookid : if mifluz is used it is the value of the rowid field. If 0 it means that the document was not indexed.
char w_extract[]
char w_title[]
char w_keywords[] : content of the meta keywords HTML tag, if any.
char w_description[] : content of the meta description HTML tag, if any.
char w_cookie[]
char w_base_url[]
char w_relative[]
char w_absolute[]
char w_location
The webbase_url_t
structure holds all the
information describing an URL, including hypertext references.
However, it does not contain the content of the document.
The w_info field is a bit field. The allowed values are listed in the WEBBASE_URL_INFO_* defines. It is especially important to understand that flags must be tested prior to accessing some fields (w_cookie, w_base, w_home, w_relative, w_absolute, w_location). Here is an example:
if(webbase_url->w_info & WEBBASE_URL_INFO_LOCATION) {
  char* location = webbase_url->w_location;
  ...
}
If the corresponding flag is not set, the value of the field is undefined. All the strings are null terminated. You must assume that all the strings can be of arbitrary length.
The flags are :

FRAME
COMPLETE : the w_complete_rowid field is not null.
COOKIE : the w_cookie field contains a value.
BASE : the w_base field contains a value.
RELATIVE : the w_relative field contains a value.
ABSOLUTE : the w_absolute field contains a value.
LOCATION : the w_location field contains a value.
TIMEOUT : the timeout condition may be a server refusing connection as well as a slow server.
NOT_MODIFIED : the server answered with a Not Modified code. The true value of the code can be found in the w_code field.
NOT_FOUND : the server answered with a Not Found code.
OK
ERROR : Not Found or any other fatal error. The real code of the error can be found in the w_code field.
HTTP : the protocol of the URL is HTTP.
FTP : the protocol of the URL is FTP.
NEWS : the protocol of the URL is NEWS.
EXTRACT : the w_extract field contains a value.
TITLE : the w_title field contains a value.
KEYWORDS : the w_keywords field contains a value.
DESCRIPTION : the w_description field contains a value.
READING
TRUNCATED
FTP_DIR : the document is an ftp directory listing.

The values of the w_code field are defined by macros that start with WEBBASE_URL_CODE_*. Some artificial error conditions have been built and are not part of the standard. Their values are between 600 and 610.
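Assuming the flag names follow the WEBBASE_URL_INFO_<FLAG> pattern used in the LOCATION example above (this is an assumption, check webbase_url.h for the exact names), testing them looks like this:

/* Check how the last crawl of this URL went; the flag names below
   are assumed to follow the WEBBASE_URL_INFO_<FLAG> pattern. */
if(webbase_url->w_info & WEBBASE_URL_INFO_NOT_FOUND) {
  printf("%s : not found (code %d)\n", webbase_url->w_url, webbase_url->w_code);
} else if(webbase_url->w_info & WEBBASE_URL_INFO_TRUNCATED) {
  printf("%s : document truncated at %u bytes\n",
         webbase_url->w_url, webbase_url->w_content_length);
}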
A simple program that uses the crawler functions should include
the crawl.h
header file. When compiling it should
search for includes in /usr/local/include and
/usr/local/include/mysql. The libraries will be found in
/usr/local/lib and /usr/local/lib/mysql. Here is an example:
$ cc -c -I/usr/local/include \
     -I/usr/local/include/mysql myapp.c
$ cc -o myapp -L/usr/local/lib -lwebbase -lhooks -lctools \
     -L/usr/local/lib/mysql -lmysql myapp.o
$
The libraries webbase
, hooks
,
ctools
and mysql
are all mandatory. If
using mifluz you'll have to add other libraries,
as specified in the mifluz documentation.
The algorithms presented in this chapter are centered on functions and functionalities. Arrows between rectangle boxes represent the function calls. When a function successively calls different functions or integrates different functionalities, the arrows are numbered to show the order of execution. When a function is called, its name and a short comment are displayed in the rectangle box :

save_cookie : save cookie in DB

A rectangle box with rounded corners represents calls inside the module. A "normal" rectangle box represents calls outside the module.

The following figures show the algorithms implemented by the crawler. In all the cases, the crawler module is the central part of the application, which is why the following figures are centered on it.
The following figure presents what is done when a new URL is crawled.
The following figure presents a crawl from a list of URLs.
The following figure presents the rebuilding of an existing URL.
It is the main module of the application. It centralizes all the processing. It is first used for crawl parameter initialization, then manages the crawling. Some of the algorithms used in this module are presented in the previous section.
This module handles cookie related operations :
The cookie_match function is called by the crawler when building outgoing requests to send to HTTP servers. If it finds a matching cookie, cookie_match returns the cookie that must be used.
This module handles robots.txt and user defined Allow/Disallow clauses.
The dirsel module has two main functions : one to record the Allow/Disallow clauses (the dirsel_allow function) and one to check whether a URL is allowed (the dirsel_allowed function).
function)The http
module is used by webtools
.
When it reads an HTTP header or body, the webtools
module calls http_header
or http_body
functions (as callback functions) to manage information contained
in them. Information extracted from header or body of pages are
stored in a webbase_url_t
struct. The
html_content_begin
initializes an
html_content_t
structure then calls
html_content_parse
function to parse the body.
The robots module is in charge of the Robot Exclusion Protocol.
The robots_load function is used to create Allow/Disallow strings from the information contained in robots.txt files. This function first checks whether the information is contained in the current uri object. If not, it tries to find the information in the database. Finally, if it has not found it, it crawls the robots.txt file located at the current URL (and if this file does not exist, no Allow/Disallow clauses are created).
The webbase module is an interface to manipulate three of the most important tables of the webbase database :

The webbase_t structure mainly contains parameters used to connect to the database. The other modules that need access to the database use a webbase_t to store database parameters (robots, cookies, ...).
The webbase_url module mainly manages the webbase_url_*_t structures. These structures are described in the next paragraph and in webbase_url_t.
The webtools
module manages the connection between
the local machine and the remote HTTP server. That is to say :
Some examples on how to use the crawler on the command line : how to use the options, which options need other options, which options are not compatible, ...

Crawl a single URL :

crawler -base base_name -- http://url.to.crawl

rebuild option : rebuild URLs (remove all the records from the full text database and resubmit all the URLs for indexing, mifluz module mandatory).

crawler -base base_name -rebuild

rebuild a set of URLs (the where_url option is only used when associated with the rebuild option) :

crawler -base base_name -rebuild -where_url "url regex 'regex_clause'"

unload option : remove the starting point and all the URLs linked to it.

crawler -base base_name -unload -- http://url.to.unload

unload_keep_start option : same as unload except that the starting point is left in the DB.

crawler -base base_name -unload_keep_start -- http://url.to.unload

home_pages option : load all the URLs listed in the start table.

crawler -base base_name -home_pages

schema option : obtain a schema of the database.

crawler -base base_name -schema
webbase