Searching for Text (Part I)

Do you deal a lot with reading or writing text? Do you often use search tools? Do you have a pile of data sitting on your web and file server(s)? Many of us do. How do you organise your collection of text data? Do you use a directory, an index, or a database? In case you haven't decided yet, let me suggest a few options.

Documents and Dealing with Text

I will focus on organising, indexing, and searching text data. This is sufficient, since a lot of search queries can be transformed to text. In addition, processing text is harder than it seems, so it's good to have a focus. You may note that I make a distinction between documents and text data; the reason is the sheer volume of different document formats. Some of them are well-defined, some aren't. Some have open specifications readily available to developers. Proprietary document formats are always a barrier for data processing. Unfortunately, these formats cannot be avoided.

The first thing you have to do is to organise your data in some way. It doesn't matter if you populate a file server with a directory structure and start copying data or if you keep a list of bookmarks in your browser. The most important aspect is to have a kind of unique identifier or reference to every single document. Uniform Resource Locators (URLs) work well; a path to a file along with its name will also be perfect. It's best if you manage to group your documents by a list of categories. The next thing you have to consider is the document formats. Most indexing and search tools can only handle text, so if your document format allows for conversions, then it is useful for processing. Here are some examples for conversions done in shell scripts.

PDF: pdftotext -q -eol unix -enc UTF-8 $IN - > $OUT
Postscript: pstotext $IN | iconv -f ISO-8859-1 -t UTF-8 -o $OUT -
MS Word: antiword $IN > $OUT
HTML: html2text -nobs -o $OUT $IN
RTF: unrtf --nopict --text $IN > $OUT
MS Excel: py_xls2txt $IN > $OUT
any OpenOffice document: ooo_as_text $IN > $OUT

The variable $IN denotes the source document and $OUT is the name and location of the converted content in plain text. In order to capture all possible character encodings, it is always useful to convert to a suitable Unicode encoding. I usually use UTF-8 for this purpose. Converting to UTF-8 from any other encoding works well; converting from UTF-8 to an encoding having fewer representations of characters is "lossy" and is usually not precise enough to be useful.

Keep in mind that although some converters can deal with MS Office documents, it is not the best format for storing information. The format is still proprietary and you may not use Microsoft's "free" document specification for any purpose (commercial use is explicitly excluded, therefore the specs are not free to use). Storing information in these formats will cause a lot of trouble - especially if the vendor disables old versions of the format by software updates (this has already happened). That's a clear and obvious warning, and if you have any word in how to organise document collections you can avoid a lot of trouble at the beginning.

Having thought about organising the data, we can now consider how to best index it. This doesn't mean that you are done with thinking about the organisation of the data - it really is the most important step.

MySQL Natural Language Full-Text Searches

MySQL offers the creation of full text indices; this is described in the manual in the "Natural Language Full-Text Searches" section. It is an easy way of indexing text data. Let's say you have the following table:

CREATE DATABASE textsearch;
USE textsearch;
CREATE TABLE documents (
    id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    filename VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
    path VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
    type VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
    mtime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    content TEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
    FULLTEXT (filename),
    FULLTEXT (content)
);

We'll store the filename, the path information, the file type, and its converted content in a database table. The VARCHAR data type might be too small if you have big directory trees, but it's more than enough for a simple example. Every document has a unique ID consisting of the field id. The option FULLTEXT() advises MySQL to create a full text search index over the columns filename and content. You can add more columns if you like, but you need to be careful not to index everything. Adding the type column might also be a reasonable option.

Now we need some content - so let's insert a few records for testing.

INSERT INTO documents ( filename, path, type, content )
   VALUES ( 'gpl.txt', '/home/pfeiffer', 'Text', 'This program is free software; 
   you can redistribute it and/or modify it under the terms of the GNU General 
   Public License as published by the Free Software Foundation;' );
INSERT INTO documents ( filename, path, type, content )
   VALUES ( 'fortune.txt', '/home/pfeiffer', 'Text', 'It was all so different 
   before everything changed.' );
INSERT INTO documents ( filename, path, type, content )
   VALUES ( 'lorem.txt', '/home/pfeiffer', 'Text', 'Lorem ipsum dolor sit amet,
   consectetuer adipiscin...' );

Now you can do a full text search.

mysql> SELECT id,filename FROM documents WHERE MATCH(content) AGAINST('lorem');
+----+----------+
| id | filename |
+----+----------+
|  6 | test.txt |
+----+----------+
1 row in set (0.01 sec)

mysql>

The construct MATCH() AGAINST() does the full text search for you. MySQL uses a number to indicate the relevance of the table record. You can show all these rankings by querying for MATCH() AGAINST().

mysql> SELECT id, filename, MATCH(content) AGAINST('lorem') FROM documents;
+----+-------------+---------------------------------+
| id | filename    | MATCH(content) AGAINST('lorem') |
+----+-------------+---------------------------------+
|  1 | gpl.txt     |                               0 |
|  2 | fortune.txt |                               0 |
|  3 | s3.txt      |                               0 |
|  4 | s4.txt      |                               0 |
|  5 | miranda.txt |                               0 |
|  6 | test.txt    |                0.75862580537796 |
+----+-------------+---------------------------------+
6 rows in set (0.00 sec)

mysql>

Obviously, I added a few more rows than described originally. The right column displays the ranking. Only record 6 has a number greater than 0 because all the other texts lack the word lorem. Now you can add more texts and see what their rating is like. Note that MySQL uses a specific strategy when performing full text indexing:

It does not index words with less than 4 characters.
It also does not index any word that appears in more than 50% of the text lines.
Hyphenated words count as two words.
Any partial words are ignored.
The database engine also maintains a list of common English words (stop words); they are ignored, too.

Be careful - if your search query consists solely of stop words, you'll never get any results. If you need a full text search in languages other than English you can provide your own set of stop words. The documentation will tell you how to do this.

It is also possible to search for more than one word. You can add multiple words separated by commas.

SELECT id, filename, MATCH(content) AGAINST('lorem,ipsum') FROM documents;

Full Text Search with PostgreSQL

Of course PostgreSQL can also deal with full text searches - a plugin called Tsearch2 is available for PostgreSQL database servers prior to version 8.3.0 (it's integrated into 8.3.0). Just like for the MySQL functions, you can fine-tune these according to the language your texts are written in. The content has to be transformed into tokens, and PostgreSQL offers new database objects that deal with these operations. The Tsearch2 engine provides text parsers for tokenisation, dictionaries for normalisation of tokens (and lists of stop words), templates for switching between parsers or dictionaries, and configurations to use whatever language you need to. Creating new database objects requires knowledge of C programming.

Let's recreate the example table in PostgreSQL (I use version 8.3.0; if you have an older version, please install Tsearch2):

CREATE TABLE documents (
 id_documents serial,
 filename character varying(254),
 path character varying(254),
 type character varying(254),
 mtime timestamp with time zone,
 content text );
CREATE INDEX documents_idx ON documents USING gin(to_tsvector('english',content));

First we create the table, then we create the text GIN (Generalized Inverted Index); this type of index consists of distinct lexemes. The function to_tsvector() converts the text stored in the content column into these normalised words. It uses the English parser and dictionary. Search queries look like this:

lynx=> SELECT filename,mtime FROM documents WHERE to_tsvector(content) @@ to_tsquery('lorem');
 filename  |            mtime             
-----------+------------------------------
 lorem.txt | 2008-02-26 12:15:16.34584+01
(1 row)

lynx=>

You'd use a normal SELECT and the @@ text match operator. This operator compares arguments and the search string converted to lexemes by use of to_tsvector() and to_tsquery() functions. The results are returned by the SELECT statement. You can also use ranking in order to sort the results.

lynx=> SELECT filename,mtime,ts_rank(to_tsvector(content),to_tsquery('lorem'))
          FROM documents WHERE to_tsvector(content) @@ to_tsquery('lorem');
 filename  |            mtime             |  ts_rank  
-----------+------------------------------+-----------
 lorem.txt | 2008-02-26 12:15:16.34584+01 | 0.0607927
(1 row)

lynx=>

The tokenisation is one of the crucial parts of a text search, and it's important to understand the algorithms that Postgres uses to decompose a string. Consider the following example:

lynx=> SELECT alias, description, token FROM ts_debug('copy a complete database');
   alias   |   description   |  token   
-----------+-----------------+----------
 asciiword | Word, all ASCII | copy
 blank     | Space symbols   |  
 asciiword | Word, all ASCII | a
 blank     | Space symbols   |  
 asciiword | Word, all ASCII | complete
 blank     | Space symbols   |  
 asciiword | Word, all ASCII | database
(7 rows)

lynx=>

The example uses the ts_debug() function and shows every token with its classification. The text search module understands most of the common text constructs; it can also decode URLs.

lynx=> SELECT alias, description, token FROM ts_debug('http://linuxgazette.net/145/lg_tips.html');
  alias   |  description  |               token               
----------+---------------+-----------------------------------
 protocol | Protocol head | http://
 url      | URL           | linuxgazette.net/145/lg_tips.html
 host     | Host          | linuxgazette.net
 url_path | URL path      | /145/lg_tips.html
(4 rows)

lynx=>

Now the parser displays the tokens as part of the URL and identifies them. The tokens allow for better search query processing, and this is the reason why you have to filter your query string. The text search compares tokens and not the strings themselves.

Conclusion

I presented only two ways to index text data. This is really only the tip of the iceberg - there's a lot more to learn about full text searches. Both MySQL and PostgreSQL have convenient algorithms ready for use that facilitate finding documents. You can use a simple Perl script with either one of these database engines, feed them your browser bookmarks and build an index with the content of the web pages, ready for search queries. There are many other tools available, and I'll present another way of indexing in the next part of this series. If you use something different or interesting to accomplish these tasks, please write and let us know about it!

Useful resources

antiword - a free MS Word document reader
Lorem Ipsum generator
MySQL Natural Language Full-Text Searches
OOoPy: Modify OpenOffice.org documents in Python
PostgreSQL documentation - Full Text Search
pyexcelerator - MS Excel reader written in Python
Tsearch2 - full text extension for PostgreSQL
UnRTF
Using Fulltext Indexes in MySQL
Why are the Microsoft Office file formats so complicated?
wvWare - Microsoft® Word document converter
Xpdf tool for PDF viewing and conversion

Talkback: Discuss this article with The Answer Gang

Bio picture

René was born in the year of Atari's founding and the release of the game Pong. Since his early youth he started taking things apart to see how they work. He couldn't even pass construction sites without looking for electrical wires that might seem interesting. The interest in computing began when his grandfather bought him a 4-bit microcontroller with 256 byte RAM and a 4096 byte operating system, forcing him to learn assembler before any other language.

After finishing school he went to university in order to study physics. He then collected experiences with a C64, a C128, two Amigas, DEC's Ultrix, OpenVMS and finally GNU/Linux on a PC in 1997. He is using Linux since this day and still likes to take things apart und put them together again. Freedom of tinkering brought him close to the Free Software movement, where he puts some effort into the right to understand how things work. He is also involved with civil liberty groups focusing on digital rights.

Since 1999 he is offering his skills as a freelancer. His main activities include system/network administration, scripting and consulting. In 2001 he started to give lectures on computer security at the Technikum Wien. Apart from staring into computer monitors, inspecting hardware and talking to network equipment he is fond of scuba diving, writing, or photographing with his digital camera. He would like to have a go at storytelling and roleplaying again as soon as he finds some more spare time on his backup devices.

Copyright © 2008, René Pfeiffer. Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 149 of Linux Gazette, April 2008

<-- prev | next -->

Home Main Site FAQ Site Map Mirrors Translations Search Archives Authors Mailing Lists Join Us! Contact Us
The Free International Online Linux Monthly	ISSN: 1934-371X	Main site: http://linuxgazette.net