Full-Text Indexing and Retrieval Tools

By Shlomo Perets, MicroType
 

You are editing a huge compendium, where each item is stored as a separate file. At a fairly advanced stage, after receiving feedback from reviewers, you wish to inspect all occurrences of certain words when they appear in the context of procedures, but there are thousands of pages...

Or, a menu entry in your product has been changed and you need to revise all occurrences in your documents, department-wide.

How do you go about time-consuming and labor-intensive tasks like these without losing your mind? Full-text indexing and retrieval tools help you locate the information you need, which may otherwise be buried in numerous places in your documents.

When you use such tools, the first step is to initiate the creation of an index; this index will contain location information for each and every word in all of your documents. The creation of this index is external to your files and does not affect them in any way. The different full-text indexing tools support different file formats (such as Word, WordPerfect, Excel). Indexed documents are typically specified according to directory and extension. You can build one index for all of your files, or you might choose to build several separate indexes, each for a different project.
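To make the idea of "location information for each and every word" concrete, here is a minimal sketch of such an index in Python. It is an illustration only, not how any particular product works: the function name build_index, the sample file names and the word pattern are all assumptions, and real tools read formats such as Word or Excel rather than plain strings.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters, digits or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")

def build_index(documents):
    """Map each lowercased word to a list of (document name, word position)."""
    index = defaultdict(list)
    for name, text in documents.items():
        for position, match in enumerate(WORD.finditer(text)):
            index[match.group().lower()].append((name, position))
    return dict(index)

# Hypothetical sample documents, held in memory for the sake of the example.
documents = {
    "install.txt": "Open the File menu and choose Print.",
    "safety.txt": "Danger: do not open the cover.",
}

index = build_index(documents)
print(index["open"])   # [('install.txt', 0), ('safety.txt', 3)]
```

The index is a separate structure built from the text; the documents themselves are only read, never modified.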

Keep in mind that the indexes have to be updated when new documents are created or existing documents are changed. Updating is usually done automatically, while you work. Alternatively, you can schedule the index update to run at times convenient for you (during your lunch break, for example). Given today’s computing power, both building and updating the index are relatively fast processes.
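One plausible way for a tool to decide what needs updating is to compare each file's modification time with the time recorded when it was last indexed. The sketch below assumes exactly that; the function name files_needing_update and the indexed_mtimes table are hypothetical, and commercial tools may use other mechanisms.

```python
from pathlib import Path

def files_needing_update(directory, extensions, indexed_mtimes):
    """Return files that are new or modified since they were last indexed.

    indexed_mtimes is a hypothetical bookkeeping table mapping a path
    string to the modification time recorded at the last indexing run.
    """
    stale = []
    for path in sorted(Path(directory).rglob("*")):
        # Documents are selected by directory and extension.
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        recorded = indexed_mtimes.get(str(path))
        if recorded is None or path.stat().st_mtime > recorded:
            stale.append(path)
    return stale
```

A scheduled update would call this function, re-index only the returned files, and then store their new modification times.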

Once you have an index, you can locate, view, edit and/or print information. Using the indexes created, the search function of the full-text indexing tool can locate the required information in your documents. Results are displayed instantly, and the requested words are highlighted within their context.

A search can consist of a single term (word or phrase) or multiple terms. You can define relationships between search terms or impose limits upon them using a wide variety of search techniques, including:

  • wildcard characters - which enable you to look for all word sequences that begin with a certain string and/or end with another (for example, searching for "manag* bonus" will yield terms like management bonus, managerial bonus, and so on)
  • logical operators - OR, AND, XOR, NOT, EXCEPT
  • proximity - which finds terms that occur near each other (for example, when the terms "danger" and "explosive" occur within the same paragraph)
  • file dates, or other file parameters - which serve to filter the list of files displayed
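The first three techniques can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the query syntax of any product: wildcards are applied to single words using the standard fnmatch module, logical operators are modeled as set operations on document names, and a paragraph is taken to be text separated by blank lines. All function and file names are hypothetical.

```python
import re
from fnmatch import fnmatchcase

def words(text):
    """Lowercased words of a text (assumed word pattern)."""
    return [w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)]

def docs_matching(pattern, documents):
    """Names of documents containing a word that matches a wildcard pattern."""
    return {name for name, text in documents.items()
            if any(fnmatchcase(w, pattern) for w in words(text))}

def same_paragraph(term_a, term_b, text):
    """True if both terms occur in one blank-line-delimited paragraph."""
    return any(term_a in words(p) and term_b in words(p)
               for p in re.split(r"\n\s*\n", text))

documents = {
    "hr.txt": "The management bonus is paid yearly.\n\nManagers approve leave.",
    "lab.txt": "Danger: explosive materials.\n\nStore solvents separately.",
    "faq.txt": "Danger signs are posted.\n\nNo explosive items are stored here.",
}

manag = docs_matching("manag*", documents)        # wildcard
danger = docs_matching("danger", documents)
explosive = docs_matching("explosive", documents)

both = danger & explosive        # AND
either = manag | danger          # OR
excluded = danger - manag        # NOT / EXCEPT
one_only = danger ^ explosive    # XOR

near = [name for name, text in documents.items()
        if same_paragraph("danger", "explosive", text)]
print(sorted(both))   # ['faq.txt', 'lab.txt']
print(near)           # ['lab.txt']
```

Note how proximity narrows the result: both faq.txt and lab.txt contain the two terms, but only lab.txt has them in the same paragraph.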

You can also use a "search within a search", whereby you restrict subsequent searches to the result set of a previous search. For example, an initial search might have yielded a relatively long list of occurrences; rather than start a new search, you can refine the parameters and run the search on the results of the previous search.
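A search within a search amounts to using one result list as the candidate set for the next query. The sketch below assumes plain, case-insensitive substring matching; the search function and its within parameter are hypothetical names chosen for the example.

```python
def search(term, documents, within=None):
    """Return names of documents containing term (case-insensitive).

    If within is given, only those documents are searched, which is
    the "search within a search" described above.
    """
    names = documents.keys() if within is None else within
    return [name for name in names if term.lower() in documents[name].lower()]

documents = {
    "ch1.txt": "To print, open the File menu.",
    "ch2.txt": "The File menu also contains Save.",
    "ch3.txt": "Printing requires a driver.",
}

first = search("file menu", documents)
refined = search("print", documents, within=first)
print(first)     # ['ch1.txt', 'ch2.txt']
print(refined)   # ['ch1.txt']
```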

In addition to helping you find the required information in your documents, text retrieval tools usually include various utilities that can help you detect mistakes. For example, you can generate a list of all words in your indexed documents, sorted alphabetically or by frequency of appearance. When you scan an alphabetical list, you can sometimes spot problem words that appear in slightly different forms (such as aluminum and aluminium). In the list of words sorted by frequency of appearance, words that appear only once or twice in your entire document collection are often suspicious, and could be misspellings. You may be surprised at the number of misspellings you find in documents that have been checked with a spelling checker and proofread!
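Such word lists are straightforward to produce once the text is available. The following is a minimal sketch using Python's collections.Counter; the sample documents, the word pattern and the "appears once" threshold are assumptions made for the example.

```python
import re
from collections import Counter

def word_counts(documents):
    """Count how often each lowercased word appears across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(w.lower() for w in re.findall(r"[A-Za-z']+", text))
    return counts

documents = {
    "a.txt": "The aluminum frame supports the panel.",
    "b.txt": "Attach the aluminium frame to the pannel.",
}

counts = word_counts(documents)
alphabetical = sorted(counts)                        # variant spellings end up adjacent
by_frequency = counts.most_common()                  # most frequent words first
rare = sorted(w for w, n in counts.items() if n <= 1)

print(alphabetical[:2])   # ['aluminium', 'aluminum']
print(rare)
```

In a sample this small most words are rare, so the list is noisy; across a large collection, the rare-word list is far shorter relative to the text and more likely to expose misspellings such as "pannel".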

Some of the leading players in the category of full-text indexing and retrieval tools are ISYS (Odyssey Development) and ZYindex (ZyLAB Corporation).

The functionality, features and supported file formats in the different packages vary, as do prices. If possible, use a trial version before purchasing.
 
Originally published in i-Contact, publication of the Society for Technical Communication, Israel Chapter, December 1997.

