Basics
The point of locus is to select from a database the list of documents you're
searching for. The result depends on several inputs:
- The words you're looking up. This simple criterion is perfectly
adequate as long as relatively few documents satisfy it - that
is, if you know some rare words to look up (e.g. the name of the product
you want). Such simple queries can be specified as command-line
parameters for grazer.
- If you don't know any rare words, you must search for a
pattern of common ones. That can be as simple as searching for a whole
name (a query for "Buffalo Bill" becomes much more specific when you
specify that both words must be present and that the first must immediately
precede the second) or as complicated as extracting characteristic words and
their patterns from known documents of the kind you're interested in and
searching for these (truth be told, the performance of these complicated
approaches is not what I would like it to be, but I'm working on it :-) ).
Conformance to patterns is quantified by a hardcoded set of metrics and
customized by assigning relative weights to these metrics (e.g. when
searching for "Buffalo Bill", you would assign positive weights to the
existence of all words in a document, their order and their locality, and
zero to everything else). Weights should be numbers from the interval [0, 1]
whose sum is 1. A set of weights (with one additional parameter) defines a
soft operator, which maps a list of words to a list of documents containing
them, sorted by relevance. The default soft operator is unnamed and defined
in locus.opt (if it is not defined there, grazer uses default values for its
attributes); it is the one used for queries from the command line. You can
define additional, named soft operators in a user-defined objects file.
These operators can be used in the query file.
- Many documents stored in databases have a general structure - for
example, e-mail messages contain the author's name, subject, date etc. If you
describe this structure before indexing them, you can restrict your
queries to document parts (e.g. search for
messages whose subject contains "engineering").
- Of course, you may want to search for more than one pattern in
one query (for example, not only for "Buffalo Bill" but for
"William Cody" as well). You can do that by composing soft operators
in the query file with relational operators - the standard & and |.
The composition of relevance follows the rules of fuzzy arithmetic.
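
To make the composition of relevance more concrete, here is a minimal Python
sketch. It assumes the common min/max reading of fuzzy AND and OR; the text
above does not say which fuzzy rules grazer actually applies, so treat this
as an illustration only. The function names and the dict representation of a
result list are hypothetical.

```python
# Hypothetical model: the result of a soft operator is a dict that maps
# a document id to its relevance, a number in [0, 1].
# ASSUMPTION: '&' is the fuzzy minimum and '|' is the fuzzy maximum.
# grazer's real composition rules may differ.

def fuzzy_and(left, right):
    """Combine two results with '&': keep documents present in both."""
    return {doc: min(left[doc], right[doc]) for doc in left.keys() & right.keys()}

def fuzzy_or(left, right):
    """Combine two results with '|': a missing document counts as 0."""
    docs = left.keys() | right.keys()
    return {doc: max(left.get(doc, 0.0), right.get(doc, 0.0)) for doc in docs}

def ranked(result):
    """Sort documents by descending relevance, ties broken by document id."""
    return sorted(result.items(), key=lambda item: (-item[1], item[0]))

# Made-up relevance values for the two patterns from the example above.
buffalo_bill = {"doc1": 0.9, "doc2": 0.4}
william_cody = {"doc2": 0.7, "doc3": 0.5}

either = ranked(fuzzy_or(buffalo_bill, william_cody))
both = ranked(fuzzy_and(buffalo_bill, william_cody))
```

Under these assumptions, a document matching only one of the two names still
appears in the "|" result, while the "&" result keeps only documents matched
by both patterns.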
Soft operators
A soft operator is applied to a sorted list of mutually distinct words (if they
are not distinct, grazer displays a warning and discards the duplicates).
Each word is first converted to lowercase (with the default value of
early_case_conversion) and stemmed (if stemming is enabled - by default it is
not).
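
The word preparation described above can be sketched as follows. This is a
hypothetical illustration, not grazer's code: the function name and the
lowercase flag (standing in for the default early_case_conversion behaviour)
are assumptions, stemming is left out, and the sketch checks for duplicates
after case conversion, which the text does not specify.

```python
import warnings

def normalize_query_words(words, lowercase=True):
    """Lowercase the words and drop duplicates, keeping the first occurrence.

    ASSUMPTION: duplicates are detected after case conversion.
    Stemming is disabled by default and is not modeled here.
    """
    seen = set()
    result = []
    for word in words:
        if lowercase:
            word = word.lower()
        if word in seen:
            # Mirror the described behaviour: warn, then discard the duplicate.
            warnings.warn(f"duplicate query word dropped: {word!r}")
            continue
        seen.add(word)
        result.append(word)
    return result

words = normalize_query_words(["Buffalo", "Bill"])
```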
grazer gets all the documents (in a specified interval) containing at least
one word from the query. The documents found are sorted by their relevance,
and those with relevance lower than a threshold (the value of the operator's
relevance_threshold parameter) are dropped. A document's relevance is
computed as follows:
- First, primary metrics are computed:
  - number of occurrences of the sought words in the document
  - number of different sought words in the document
  - position of the first occurrence found
  - minimal length (in words) of a part of the document containing
    all distinct words found in that document
  - maximal length of a sequence of words that appear in the document
    in the same order as in the query
- Then, normalized metrics are derived from these (numbers in the
  interval [0, 1]; the bigger the number, the better the correlation):
  - focus - number of occurrences
  - correlation - ratio of found words to sought ones
  - prominence of the first occurrence found
  - locality - how near the found words are to each other
  - order - how well sorted they are
- Finally, the normalized metrics are weighted, i.e. multiplied by the
  weights defining the operator. Relevance is the sum of the weighted
  normalized metrics.
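
The steps above can be sketched in Python. This is a rough illustration under
stated assumptions, not grazer's implementation: all function names and
dictionary keys are hypothetical, only four of the five primary metrics are
computed (the "order" metric is left out because its exact definition is not
given), and the normalization formulas are not described here, so the
normalized metrics are simply taken as inputs.

```python
METRICS = ("focus", "correlation", "prominence", "locality", "order")

def min_span(doc_words, found):
    """Length (in words) of the shortest window containing every word in found."""
    if not found:
        return 0
    best, left, counts = len(doc_words), 0, {}
    for right, word in enumerate(doc_words):
        if word in found:
            counts[word] = counts.get(word, 0) + 1
        # Shrink the window from the left while it still covers all words.
        while len(counts) == len(found):
            best = min(best, right - left + 1)
            dropped = doc_words[left]
            if dropped in counts:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    del counts[dropped]
            left += 1
    return best

def primary_metrics(doc_words, query_words):
    """Four of the five primary metrics for one document (hypothetical keys)."""
    sought = set(query_words)
    positions = [i for i, word in enumerate(doc_words) if word in sought]
    found = {doc_words[i] for i in positions}
    return {
        "occurrences": len(positions),
        "distinct": len(found),
        "first_position": positions[0] if positions else None,
        "min_span": min_span(doc_words, found),
    }

def relevance(normalized, weights):
    """Weighted sum of normalized metrics; weights lie in [0, 1] and sum to 1."""
    if any(not 0.0 <= weights[name] <= 1.0 for name in METRICS):
        raise ValueError("each weight must lie in [0, 1]")
    if abs(sum(weights[name] for name in METRICS) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * normalized[name] for name in METRICS)

def keep_relevant(scored, threshold):
    """Drop documents below the threshold and sort the rest by relevance."""
    kept = [(doc, score) for doc, score in scored.items() if score >= threshold]
    return sorted(kept, key=lambda item: -item[1])

# "Buffalo Bill"-style weights: presence of all words, order and locality.
weights = {"focus": 0.0, "correlation": 0.5, "prominence": 0.0,
           "locality": 0.25, "order": 0.25}
# Made-up normalized metrics for one document.
normalized = {"focus": 0.3, "correlation": 1.0, "prominence": 0.8,
              "locality": 1.0, "order": 0.5}
score = relevance(normalized, weights)
```

With these made-up numbers, only correlation, locality and order contribute
to the score, because the other two weights are zero.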
Query file
The query file contains an expression composed of terms joined by the
operators '&' and '|' and grouped with parentheses. The terms of this
expression are soft operators, either named or unnamed. An unnamed operator
is just a list of words - for example
Buffalo Bill | William Cody
is legal and equivalent to
(Buffalo Bill) | (William Cody)
Named operators must have parameters in parentheses - for example
name(Buffalo Bill)
Named operators can be restricted to some document parts by listing these parts
in square brackets after the operator's name - for example
name[from](Linus)
The word "plain" refers to the default part - for example
name[title plain](ISO)
Operator parameters containing characters other than letters must be in
double quotes - for example
name("O'Brien")
A quoted operator parameter containing whitespace characters is split into
multiple parameters - for example
"o la la"
is equivalent to
o la
Note that only relational operators can be applied recursively -
soft operators can't.
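
As an illustration of the term syntax, here is a small Python sketch that
parses a single named-operator term. It is not grazer's parser: the regular
expression, the function names and the returned tuple are assumptions, it
does not handle a closing parenthesis inside quotes, and it does not parse
whole '&' / '|' expressions. Duplicate parameters are dropped, assuming the
rule described for soft operators also explains why "o la la" is equivalent
to o la.

```python
import re

# Hypothetical grammar for one named-operator term: name[parts](params).
TERM = re.compile(r"""
    (?P<name>[A-Za-z_]\w*)        # operator name
    (?:\[(?P<parts>[^\]]*)\])?    # optional document parts in brackets
    \((?P<params>[^)]*)\)         # parameters in parentheses
""", re.VERBOSE)

def split_params(text):
    """Split parameters; a quoted parameter with whitespace becomes several."""
    words = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', text):
        words.extend((quoted or bare).split())
    return words

def parse_term(text):
    """Return (operator name, document parts, distinct parameter words)."""
    match = TERM.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a named-operator term: {text!r}")
    parts = (match.group("parts") or "").split()
    # ASSUMPTION: duplicates are dropped, keeping the first occurrence.
    words = list(dict.fromkeys(split_params(match.group("params"))))
    return match.group("name"), parts, words

term = parse_term("name[title plain](ISO)")
```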