Parameters of locus are specified in the file locus.opt, divided into sections. The three basic sections are locus (used by all programs), cruncher, and grazer; user-defined objects (document parts and operators) have their own sections.

A section starts with the name of the section in square brackets on a separate line. Every line in a section specifies the value of one parameter (unless it starts with '#', in which case it is a comment). Every parameter has a name and a type: int, double, string, or string list. Values of type int can be written not only as numbers but also as the words yes (1) and no (0). Strings are delimited by double quotes and cannot contain the character 0. The characters double quote, newline, tab, and backslash must be quoted with a backslash (as in C strings). A string list is a sequence of strings separated by spaces.
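A hypothetical options file illustrating this syntax (the path is made up; base_dir and relevance_threshold are real parameters described below):

[locus]
# this is a comment
base_dir = "/home/joe/locus-index"

[grazer]
relevance_threshold = 0.4
stop_is_in_interval = yes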

Most parameters have implicit values, which are used when the parameter is not specified in the options file. Parameters without implicit values must be specified by the user. The things you can tweak with them fall into the areas covered below: basic operations, search parameters, the stoplist, interpretation of document text, document structure, compressed documents, and miscellaneous settings.

There are also a lot of undocumented parameters; these are considered internal and/or experimental (and are even less guaranteed not to change or disappear than the documented ones). Use The Source.

Basic operations

section locus

base_dir (string; no implicit value)
Directory where cruncher creates index files and grazer looks for them.

section cruncher

source_dir_list (string list; no implicit value)
List of directories where cruncher looks for files to index; their subdirectories are searched as well (depth-first). No shell expansion is performed on them, so use absolute paths.

source_mask_list (string list; no implicit value)
Masks of document files.
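A minimal configuration for indexing might look like this (the paths are hypothetical, and the masks assume shell-style wildcards):

[locus]
base_dir = "/home/joe/locus-index"

[cruncher]
source_dir_list = "/home/joe/docs" "/home/joe/mail"
source_mask_list = "*.txt" "*.html"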

Search parameters

section grazer

focus_weight (double; implicit value 0.2)
correlation_weight (double; implicit value 0.2)
prominence_weight (double; implicit value 0.2)
locality_weight (double; implicit value 0.2)
order_weight (double; implicit value 0.2)
relevance_threshold (double; implicit value 0.5)
Implicit operator parameters.

soft_operators (string list; implicit value empty)
Names of soft operators corresponding to sections in named_objects_file.

named_objects_file (string; implicit value "locus.opt")
User-defined objects file.

query_expression_file (string; implicit value "")
If not empty, names the query file.

interval_start_file (string; implicit value "")
If not empty, interpreted as the name of the indexed file from which searching should begin.

interval_start_offset (int; implicit value 0)
If interval_start_file isn't empty, interval_start_offset must be the offset of some document in that file (0 is always a valid offset).

start_is_in_interval (int; implicit value yes)
Flag determining whether the low bound belongs to the searched interval.

interval_stop_file (string; implicit value "")
If not empty, interpreted as the name of the indexed file at which searching should end.

interval_stop_offset (int; implicit value 0)
If interval_stop_file isn't empty, interval_stop_offset must be the offset of some document in that file (0 is always a valid offset).

stop_is_in_interval (int; implicit value yes)
Flag determining whether the high bound belongs to the searched interval.
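Put together, the interval parameters might restrict a search to part of the database like this (the file names are hypothetical):

[grazer]
interval_start_file = "/home/joe/mail/jan"
start_is_in_interval = yes
interval_stop_file = "/home/joe/mail/jun"
stop_is_in_interval = no

With both offsets left at 0, searching begins at the first document of the start file and ends just before the first document of the stop file.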

parse_parameter (int; implicit value 0)
This parameter allows passing complicated queries as command-line arguments (normally from some other program, although users can type one string in single quotes as a grazer argument if it pleases them). If it is set and grazer is called with exactly one argument, this argument is parsed as a query file.

Stoplist

A stoplist is used to make index files smaller and indexing faster. It's based on the observation that words which occur in every document do not distinguish those documents, so it's not useful to search for them, and so it's not necessary to index them (this is not quite true, but if you can see why, you don't need my explanation). A stoplist is specified by a text file; every line of that file contains one word (taken from the first column; the rest of the line after the first whitespace character is ignored). Words in documents are compared (case insensitively) with words in the stoplist, and matching ones are dropped. An example English stoplist is included, but it's better to generate your own.

To generate a stoplist, index part of your database (let's say 10%, but of course, your mileage may vary). From that, generate a word list (by specifying an appropriate word_list_file and running grazer) and use its first few lines as a stoplist (that is, edit the generated word list file, assign its path to stop_list_file, and run cruncher).
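The two runs above might use options like these (the path is hypothetical):

[grazer]
# first run: dump the word list generated from the sample index
word_list_file = "/home/joe/words.txt"

[cruncher]
# second run: after trimming words.txt down to its first few lines
stop_list_file = "/home/joe/words.txt"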

cruncher maintains its stoplist in an internal format, so you can reset stop_list_file after a cruncher run and the stoplist remains in effect (if you specify a different stoplist, it will be added to the old one). When you add a word to the stoplist, its occurrences are removed from the database.

section cruncher

stop_list_file (string; implicit value "")
A non-empty value is the name of a text file from which cruncher adds words to the stop list (before starting indexation).

section grazer

word_list_file (string; implicit value "")
If not empty, interpreted as the name of a file into which grazer writes (before it starts searching) all indexed words, sorted by the number of documents in which they occur, by number of occurrences, and alphabetically.

Interpretation of document text

I don't recommend changing these options for an existing database.

section locus

doc_start_rx (string; implicit value "")
Regular expression matching documents starting in the middle of a file. One file can contain multiple documents, considered logically isolated: grazer searches them separately and identifies them by file name and the position where the document begins. When empty, the whole file is treated as one document.
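For example, for mailbox files in which every message begins with a "From " line, something like the following might work (the exact expression is an assumption; check the regular expression syntax your locus build accepts):

[locus]
doc_start_rx = "^From "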

basic_separators (string; implicit value " \n\r\t")
Word delimiters. Changing this option is not recommended.

additional_separators (string; implicit value "")
Word delimiters to add. For example, when indexing HTML it's useful to add "<>", so that tags are recognized as separate words (and can be dropped).

basic_punctuation (string; implicit value "\r\t.,;:\'\"()!?\0x1a")
Basic string of characters stripped off the beginning and end of every word before it is processed further. Changing this option is not recommended.

additional_punctuation (string; implicit value "[]{}@#$%^&*-_=|\/")
Additional string of characters stripped off the beginning and end of every word before it is processed. This one can be modified.

early_case_conversion (int; implicit value 2)
Case conversions performed by cruncher and grazer. Possible values:
  • 0 - no conversions in cruncher, grazer looks for variants of every word depending on late_case_conversion
  • 2 (for forward compatibility - all non-zero values behave the same) - both cruncher and grazer convert words to lowercase

Document structure

If your documents (e-mail messages, HTML, etc.) have a common structure, you can instruct cruncher to store it so that grazer can use it when searching. You do this by defining document parts, delimited by regular expressions. At the beginning of a document, the "plain" part begins. Other parts begin when the document text matches some push_rx expression and end at the pop_rx expression of the active document part. When one part ends, the previous one becomes active again.

section locus

doc_parts (string list; implicit value empty)
Names of explicit document parts corresponding to sections in named_objects_file. The word "plain" is reserved for the default part and cannot be in this list.

section grazer

title_parts (string list; implicit value empty)
Document part(s) containing the document's title. If it's not a valid document part (for example, if you leave it empty), the returned title will always be empty, too.

document part section

push_rx (string; implicit value "")
Regular expression matching the beginning of the document part (an empty string never matches).

pop_rx (string; implicit value "")
Regular expression matching the end of the document part (an empty string never matches).

push_rx_is_case_sensitive (int; implicit value yes)
pop_rx_is_case_sensitive (int; implicit value yes)
Hopefully self-describing...

Example: mailing list

With these options:
[locus]
doc_parts = "from" "subject"
[from]
push_rx = "^From:"
pop_rx = "\n"
[subject]
push_rx = "^Subject:"
pop_rx = "\n"
A message like this one:
From:         preedy@NSWC-WO.ARPA
Subject:      digest format
To:           VIRUS-L@LEHIIBM1.BITNET
Keywords: edited

Ken,
     It sounds like a good idea to me to start getting the "digests".
I am tired of getting duplicates and ascii trash.  Hope this isn't too
much of a bother for you and you can continue to do it.  Thanks for
the effort.  It is appreciated.  You are saving a lot of people
precious time.


         Pat Reedy
         preedy@NSWC-WO.ARPA


[Ed. Thanks for the input!  The effort involved is actually quite
minimal thanks to a set of GNU EMACS digestifying routines by David
Steiner.  It takes me about as much time to create a digest as it
would to read my own mail.]

would be partitioned into the from part:
From:         preedy@NSWC-WO.ARPA

the subject part:
Subject:      digest format

and the plain part:
To:           VIRUS-L@LEHIIBM1.BITNET
Keywords: edited

Ken,
     It sounds like a good idea to me to start getting the "digests".
I am tired of getting duplicates and ascii trash.  Hope this isn't too
much of a bother for you and you can continue to do it.  Thanks for
the effort.  It is appreciated.  You are saving a lot of people
precious time.


         Pat Reedy
         preedy@NSWC-WO.ARPA


[Ed. Thanks for the input!  The effort involved is actually quite
minimal thanks to a set of GNU EMACS digestifying routines by David
Steiner.  It takes me about as much time to create a digest as it
would to read my own mail.]

Of course you can be fancier than that - for example, it would be possible to define a part for editor's comments. If you devise some examples you're proud of, let me know.

Compressed documents

locus can access data in compressed files - with some important restrictions:
  1. You can use at most one extension for such files and one decompressing program.
  2. Every archive must contain exactly one file, having the same name as the archive (with any extension).
If you want to index compressed data, you must set the options compressed_ext, decompress_dest_dir and decompressor_command_line. N.B.: locus does not expect the unpacking program to destroy the original archive, as gunzip does. If you use gunzip, set decompressor_command_line to some script wrapping gunzip.
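Such a wrapper might look like the sketch below, written as a shell function for illustration (the name locus_gunzip and the ".gz" handling are assumptions, not part of locus). Saved as a script, it would be named in decompressor_command_line, receiving the archive and the destination directory as its two arguments.

```shell
#!/bin/sh
# Hypothetical gunzip wrapper: decompress the archive given as $1 into the
# directory given as $2, leaving the original archive in place.
# Plain gunzip deletes its input; gunzip -c writes the unpacked data to
# stdout and keeps the source file untouched.
locus_gunzip() {
    src="$1"
    dest="$2"
    # The unpacked file keeps the archive's name minus the ".gz" extension.
    base=$(basename "$src" .gz)
    gunzip -c "$src" > "$dest/$base"
}
```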

section locus

compressed_ext (string; implicit value "")
Compressed file extension (e.g. ".gz").

decompress_dest_dir (string; implicit value "")
Directory in which locus programs expect to find the unpacked file after running decompressor_command_line.

decompressor_command_line (string; implicit value "")
Name and parameters of the external unpacking program. Before running it, locus programs substitute its substrings source_file_subst and dest_dir_subst (if present).

source_file_subst (string; implicit value "%src")
Substring of decompressor_command_line to be replaced by the name of the compressed file.

dest_dir_subst (string; implicit value "%dest")
Substring of decompressor_command_line to be replaced by the target directory.
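A complete set of options for gzipped archives might then look like this (the directory and script path are hypothetical):

[locus]
compressed_ext = ".gz"
decompress_dest_dir = "/var/tmp/locus"
decompressor_command_line = "/usr/local/bin/locus-gunzip %src %dest"

With the implicit source_file_subst and dest_dir_subst values, %src is replaced by the archive name and %dest by the target directory.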

Miscellaneous

section cruncher

occ_col_size (int; implicit value 60000)
Maximum number of occurrences (of one word) cached in memory before writing them to disk.

occ_block_size (int; implicit value 40000)
Maximum number of words cached in memory before writing them to disk.

find_counter (int; implicit value 10)
read_counter (int; implicit value 1500)
index_counter (int; implicit value 400)
These numbers control the frequency of cruncher messages (like "reading 1. file (0%)"): the message is printed every counter iterations. If you have a much faster machine than I do (which is quite probable), you may want to increase them.

section grazer

max_search_time (int; implicit value 0)
When positive, and the search (itself, without initialization) lasts more than max_search_time seconds, it fails. This is primarily for grazer started automatically (e.g. from CGI).

temp_dir (string; implicit value "")
Explicit temporary directory; if empty, /tmp is used.

late_case_conversion (int; implicit value 2)
Used when early_case_conversion is 0. Specifies the case conversions grazer performs on every word from the query to get the variants it actually searches for.
  • 0 - as specified, no other variations
  • 1 - as specified and all uppercase
  • 2 - as specified, all lowercase, capitalized (first uppercase, the rest lowercase), all uppercase