A section starts with its name in square brackets on a separate line. Every other line in the section specifies the value of one parameter, unless it starts with '#', in which case it is a comment. Every parameter has a name and a type: int, double, string or string list. Values of type int can be written not only as numbers but also as the words yes (1) and no (0). Strings are delimited by double quotes and cannot contain the character with code 0. The characters double quote, newline, tab and backslash must be escaped with a backslash (as in C strings). A string list is a sequence of strings separated by spaces.
Most parameters have implicit values, which are used when the parameter is not specified in the options file. Parameters without implicit values must be specified by the user. Some of the things you can tweak with them are:
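The syntax rules above can be sketched as a small Python reader. This is only an illustration, not the parser locus uses: the `name = value` line shape is taken from the options-file example further down in this document, the function names are made up, and keeping parameters that precede any section header under the empty section name is a choice of this sketch. Since the reader does not know declared parameter types, every quoted value comes back as a list of strings (a plain string is a one-element list).

```python
import re

# One double-quoted string; a backslash escapes the character that follows it.
_STRING = re.compile(r'"((?:[^"\\]|\\.)*)"')
_ESCAPES = {"n": "\n", "t": "\t"}  # \" and \\ fall through to the character itself


def _unescape(body):
    """Resolve backslash escapes inside one quoted string."""
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(1)), body)


def parse_value(text):
    """Return a list of strings for quoted values, otherwise an int or a float."""
    text = text.strip()
    if text.startswith('"'):
        return [_unescape(m.group(1)) for m in _STRING.finditer(text)]
    if text in ("yes", "no"):
        return 1 if text == "yes" else 0
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_options(source):
    """Map section name -> {parameter name: value}."""
    sections = {}
    current = sections.setdefault("", {})  # sketch choice: parameters before any header
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        name, _, value = line.partition("=")
        current[name.strip()] = parse_value(value)
    return sections


SAMPLE = r'''
# split mail headers into parts
doc_parts = "from" "subject"
[from]
push_rx = "^From:"
pop_rx = "\n"
push_rx_is_case_sensitive = yes
'''

options = parse_options(SAMPLE)
print(options[""]["doc_parts"])
print(options["from"])
```

Here `pop_rx` ends up as a real newline character, because `\n` inside the quotes is an escape, and `yes` is read as the int 1.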
name | type | implicit value | description |
---|---|---|---|
base_dir | string | none | Directory where cruncher creates index files and grazer looks for them. |
name | type | implicit value | description |
---|---|---|---|
source_dir_list | string list | none | List of directories where cruncher looks for files to index; their subdirectories are searched as well (depth-first). No shell expansion is performed on them, so use absolute paths. |
source_mask_list | string list | none | Masks of document files. |
To generate a stoplist, index part of your database (say 10%, though your mileage may vary). From that, generate a word list (by specifying an appropriate word_list_file and running grazer) and use its first few lines as a stoplist: edit the generated word list file, assign its path to stop_list_file and run cruncher.
cruncher maintains its stoplist in an internal format, so you can reset stop_list_file after a cruncher run and the stoplist remains in effect (if you specify a different stoplist, it is added to the old one). When you add a word to the stoplist, its occurrences are removed from the database.
name | type | implicit value | description |
---|---|---|---|
stop_list_file | string | "" | A non-empty value is the name of a text file from which cruncher adds words to the stoplist (before it starts indexing). |
name | type | implicit value | description |
---|---|---|---|
word_list_file | string | "" | If not empty, the name of a file into which grazer writes (before it starts searching) all indexed words, sorted by the number of documents in which they occur, then by number of occurrences, then alphabetically. |
name | type | implicit value | description |
---|---|---|---|
doc_start_rx | string | "" | Regular expression matching documents that start in the middle of a file. One file can contain multiple documents, which are considered logically isolated: grazer searches them separately and identifies them by file name and the position where the document begins. When empty, the whole file is treated as one document. |
basic_separators | string | " \n\r\t" | Word delimiters. Changing this option is not recommended. |
additional_separators | string | "" | Word delimiters to add. For example, when indexing HTML it is useful to add "<>", so that tags are recognized as separate words (and can be dropped). |
basic_punctuation | string | "\r\t.,;:\'\"()!?\0x1a" | Basic string of characters stripped off the beginning and end of every word before it is processed further. Changing this option is not recommended. |
additional_punctuation | string | "[]{}@#$%^&*-_=\|\/" | Additional string of characters stripped off the beginning and end of every word before it is processed. This one can be modified. |
early_case_conversion | int | 2 | Case conversions performed by cruncher and grazer. Possible values: |
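How the separator and punctuation strings work together can be sketched in Python. This is a guess at the behaviour described in the table, not cruncher's actual code: `split_words` is a made-up name, `\0x1a` is read as the single character 0x1A, the trailing `\/` is read as a backslash plus a slash, and dropping words that become empty after stripping is an assumption.

```python
import re

# Implicit values from the table above, written as Python literals.
BASIC_SEPARATORS = " \n\r\t"
BASIC_PUNCTUATION = "\r\t.,;:'\"()!?\x1a"       # assumes "\0x1a" means character 0x1A
ADDITIONAL_PUNCTUATION = "[]{}@#$%^&*-_=|\\/"   # assumes "\/" means backslash and slash


def split_words(text, separators, punctuation):
    """Split text on any separator character, then trim punctuation from both ends."""
    pattern = "[" + re.escape(separators) + "]+"
    trimmed = (chunk.strip(punctuation) for chunk in re.split(pattern, text))
    return [word for word in trimmed if word]


words = split_words(
    "<b>Hello, world!</b> (test)",
    BASIC_SEPARATORS + "<>",                     # additional_separators = "<>"
    BASIC_PUNCTUATION + ADDITIONAL_PUNCTUATION,
)
print(words)
```

With "<>" added to the separators, the tag name `b` comes out as a word of its own, which is what makes it possible to drop tags later.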
name | type | implicit value | description |
---|---|---|---|
doc_parts | string list | empty | Names of explicit document parts corresponding to sections in named_objects_file. The word "plain" is reserved for the default part and cannot be in this list. |
name | type | implicit value | description |
---|---|---|---|
title_part | string | "" | Document part containing the document's title. If it is not a valid document part (for example, if you leave it empty), the returned title will always be empty, too. |
name | type | implicit value | description |
---|---|---|---|
push_rx | string | "" | Regular expression matching the beginning of a document part (an empty string never matches). |
pop_rx | string | "" | Regular expression matching the end of a document part (an empty string never matches). |
push_rx_is_case_sensitive | int | yes | Whether push_rx is matched case-sensitively. |
pop_rx_is_case_sensitive | int | yes | Whether pop_rx is matched case-sensitively. |
    doc_parts = "from" "subject"
    [from]
    push_rx = "^From:"
    pop_rx = "\n"
    [subject]
    push_rx = "^Subject:"
    pop_rx = "\n"

Consider a message like this one:

    From: preedy@NSWC-WO.ARPA
    Subject: digest format
    To: VIRUS-L@LEHIIBM1.BITNET
    Keywords: edited

    Ken,

    It sounds like a good idea to me to start getting the "digests". I am tired of getting duplicates and ascii trash. Hope this isn't too much of a bother for you and you can continue to do it. Thanks for the effort. It is appreciated. You are saving a lot of people precious time.

    Pat Reedy
    preedy@NSWC-WO.ARPA

    [Ed. Thanks for the input! The effort involved is actually quite minimal thanks to a set of GNU EMACS digestifying routines by David Steiner. It takes me about as much time to create a digest as it would to read my own mail.]
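One plausible reading of how the from and subject rules above divide such a message can be sketched in Python. It rests on several assumptions that this document does not confirm: `^` is taken to match at the start of any line, the text matched by push_rx and pop_rx is left out of the part, parts do not nest (the names push and pop may hint otherwise), and everything outside an explicit part goes to the default "plain" part. `split_parts` is a made-up helper, not part of locus.

```python
import re

# The two parts from the example above; "plain" is the reserved default part.
PARTS = {
    "from": {"push_rx": r"^From:", "pop_rx": r"\n"},
    "subject": {"push_rx": r"^Subject:", "pop_rx": r"\n"},
}


def split_parts(text, parts):
    """Return (part name, text) pairs in document order."""
    spans = []
    for name, rx in parts.items():
        push = re.compile(rx["push_rx"], re.MULTILINE)
        pop = re.compile(rx["pop_rx"], re.MULTILINE)
        for opened in push.finditer(text):
            closed = pop.search(text, opened.end())
            stop = closed.start() if closed else len(text)
            spans.append((opened.start(), opened.end(), stop, name))

    segments, cursor = [], 0
    for start, body, stop, name in sorted(spans):
        if start < cursor:
            continue  # ignore a marker that falls inside a part already taken
        segments.append(("plain", text[cursor:start]))
        segments.append((name, text[body:stop]))
        cursor = stop
    segments.append(("plain", text[cursor:]))
    # Trim whitespace and drop segments that are empty afterwards.
    return [(name, chunk.strip()) for name, chunk in segments if chunk.strip()]


message = (
    "From: preedy@NSWC-WO.ARPA\n"
    "Subject: digest format\n"
    "To: VIRUS-L@LEHIIBM1.BITNET\n"
)
for name, chunk in split_parts(message, PARTS):
    print(f"{name}: {chunk}")
```

Under these assumptions the address lands in the from part, "digest format" in the subject part, and the rest of the message in plain.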
name | type | implicit value | description |
---|---|---|---|
compressed_ext | string | "" | Compressed file extension (e.g. ".gz"). |
decompress_dest_dir | string | "" | Directory in which locus programs expect to find the unpacked file after running decompressor_command_line. |
decompressor_command_line | string | "" | Name and parameters of the external unpacking program. Before running it, locus programs replace the substrings source_file_subst and dest_dir_subst in it (if they are present). |
source_file_subst | string | "%src" | Substring of decompressor_command_line to be replaced by the name of the compressed file. |
dest_dir_subst | string | "%dest" | Substring of decompressor_command_line to be replaced by the target directory. |
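The substitution step described in the table can be pictured with a few lines of Python. This is a sketch only: `build_decompress_command` is a made-up name, `unpack` is a hypothetical program, the paths are invented, and the sketch does not show how locus quotes or runs the resulting command.

```python
def build_decompress_command(command_line, source_file, dest_dir,
                             source_file_subst="%src", dest_dir_subst="%dest"):
    """Fill the two placeholders of decompressor_command_line."""
    command = command_line.replace(source_file_subst, source_file)
    return command.replace(dest_dir_subst, dest_dir)


# "unpack" is a hypothetical program name used only for illustration.
command = build_decompress_command(
    "unpack %src %dest", "/data/mail.txt.gz", "/tmp/unpacked"
)
print(command)
```

Making the placeholder strings configurable matters when the default "%src" or "%dest" would clash with something the unpacker itself expects on its command line.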
name | type | implicit value |
---|---|---|
find_counter | int | 10 |
read_counter | int | 1000 |
index_counter | int | 400 |

These numbers control the frequency of cruncher progress messages (like "reading 1. file (0%)"): a message is printed once every counter iterations. If your machine is much faster than mine (which is quite probable), you may want to increase them.
name | type | implicit value | description |
---|---|---|---|
max_search_time | int | 0 | When positive, a search (the search itself, without initialization) that lasts more than max_search_time seconds fails. This is primarily for grazer started automatically (e.g. from CGI). |
temp_dir | string | "" | Explicit temp directory; if empty, /tmp is used. |
late_case_conversion | int | 2 | Used when early_case_conversion is 0. Specifies the case conversions grazer performs on every word of the query to get the variants it actually searches for. |