Text Conversion Programs

By Shlomo Perets, MicroType
 

This column addresses groups of programs that can be of great use to the technical writer: they save hours of work, enhance the final product, and sometimes locate and resolve problems that are difficult to trace.

In this issue, I would like to discuss text conversion programs.

When we incorporate raw files from outside sources in our work – parts lists, database reports, or text files created on other computing platforms – we frequently encounter some or all of the following problems:

  • Special characters and symbols, accented letters, and fractions that are converted incorrectly
  • Misspellings and/or inconsistent spelling inherent in the input file
  • Format codes/tags from the original program appear in the file
  • Redundant spaces and tabs
  • Control characters which can corrupt the file
  • Each line ends with a paragraph mark
  • All text is uppercase

In all of these cases, replacing existing character strings with new strings will solve the problem, provided you know exactly what is wrong and what it needs to be changed to. Text editors and word processors offer a "find and replace" function for this, but using it for many replacements is generally impractical and the degree of success is limited. A conversion program will do the job much more successfully and efficiently.
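As a rough illustration of how a few of the problems listed above reduce to string replacement, here is a minimal Python sketch. It is not taken from any of the programs discussed in this column; the function name clean_raw_text and the assumption that a blank line marks a real paragraph break are hypothetical choices made for the example.

```python
import re

# Control characters other than tab, line feed and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_raw_text(text: str) -> str:
    """Hypothetical clean-up pass for a raw text file (illustrative only)."""
    # Normalise DOS and old Mac line endings to a single line feed.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters that could corrupt the file.
    text = CONTROL_CHARS.sub("", text)
    # Collapse redundant spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove spaces left hanging on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Assumption: a blank line is a real paragraph break; any other
    # line break is a hard wrap and is joined with a space.
    paragraphs = re.split(r"\n{2,}", text.strip())
    return "\n\n".join(p.replace("\n", " ") for p in paragraphs)
```

For example, clean_raw_text("PART  LIST\x07\r\nfor the\tunit\r\n\r\nNext   paragraph\r\n") yields two paragraphs, "PART LIST for the unit" and "Next paragraph", separated by a blank line.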

Conversion programs can run on many types of files. They can also be used to process FrameMaker or Word files, but you must operate on the textual representation of the file (eg MIF or RTF) and not on the native binary format.

A conversion program enables execution of a series of custom-defined search and replace actions (known as rules), but the success of the conversion depends on you.

First, you need to know what exactly in the input file you want to change. You can analyze the input file using programs such as List (by Vernon D. Buerg), which function like an X-ray of the file, showing the contents both as regular text (where possible) and as hexadecimal codes.
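If no such viewer is at hand, the same kind of "X-ray" view can be approximated in a few lines of Python. This is a sketch only; the function name hex_dump and the layout (offset, hexadecimal codes, printable text) are assumptions for the example and do not reproduce the display of List.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Show bytes as hexadecimal codes next to their printable text."""
    pad = width * 3 - 1  # room for a full row of "XX " groups
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Non-printable bytes are shown as a dot in the text column.
        text_part = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)
```

To inspect a file, read it in binary mode and print the result, for example print(hex_dump(open("input.txt", "rb").read())), where input.txt stands for your own file. Stray control characters and unexpected line-ending codes (0D, 0A) then show up in the hexadecimal column.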

You then formulate your search and replace rules, and run the conversion program. Invariably, this is a trial and error process, and each time you run the program you will discover nuances you have not yet addressed. If necessary, update the rules you formulated in order to improve the handling of the different variations encountered in the input files.

Although this may seem a tedious process while you are doing it, once the rules are finalized the conversion program can be operated on any number of files of the same type with minimum intervention – thus automating the processing and manipulation of massive quantities of text.

Simple rules might be in a find and replace format such as XFMR -> transformer. Even at this level, using a conversion program to run batches of many such replacements is much more efficient than using the Find and Replace function in your word processor.

However, far more powerful and flexible rules can be formulated using the advanced features of the conversion program. For example:

  • defining wildcards and character groups – such as punctuation marks – which can then be used when formulating the rules
  • applying masking to prevent or allow repetitive conversion
  • defining repetition (eg changing multiple spaces to a single space)
  • using logical variables that might change their values depending on what is encountered in the input, thus enabling different replacements for the same string based on the context
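
To make the idea of a rule set more concrete, here is a minimal sketch in Python using regular expressions. It illustrates a literal rule, a character group, and a repetition rule; it does not model masking or logical variables. The names PUNCT, RULES and apply_rules are hypothetical, and the rules are applied as successive passes over the whole text, which is an assumption of this sketch and may differ from how a particular conversion program applies its rules.

```python
import re

# Character group: the punctuation marks used by the rules below.
PUNCT = r"[.,;:!?]"

# Ordered (pattern, replacement) pairs; illustrative rules only.
RULES = [
    (re.compile(r"\bXFMR\b"), "transformer"),   # literal replacement, whole word
    (re.compile(r" {2,}"), " "),                # repetition: 2+ spaces become 1
    (re.compile(rf"\s+({PUNCT})"), r"\1"),      # drop space before punctuation
]


def apply_rules(text: str, rules=RULES) -> str:
    """Apply each rule in turn to the output of the previous one."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

With these rules, apply_rules("Replace the XFMR   , then test .") gives "Replace the transformer, then test." Because the rules live in one table, adding a replacement means adding one line, and the same table can be run over any number of files.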

As you progress, your rules can get complex. To make them more readable you can use abbreviations and include annotations.

Examples of commercial conversion programs are Filtrix by Blueberry Software (which comes with predefined rules for various formats) and Reform(at) by Palron (originally distributed as an add-on with typesetting programs but also available separately).

There are also several public domain or shareware conversion programs available.

The functionality and abilities of these programs vary widely. In any case, I recommend that you search the internet for the most appropriate program for your specific needs.
 
 
 
Figure 1. A partial listing of a sample Reform(at) script

The following Reform(at) script removes commands (which start and end with angle brackets) and changes multiple spaces into a single space. The DELETE logical variable is initially set to false. When a left angle bracket is encountered, the DELETE variable is set to true. While DELETE is true, characters in the ALMSTANY category are skipped. A right angle bracket sets the DELETE variable back to false. Scripts can get complicated, and sometimes you need multiple scripts, with the output of one script piped as input to the next.

\def DELETE false

\def ALMSTANY { 20 .. 3B, 3D, 3F..FF }

\def SPACEBAND \2{" "} ; \n{x} means "n or more occurrences of x"
SPACEBAND --> " "

"</P>" --> 0d 0a 0d 0a              ; end of paragraph

?~DELETE : DELETE  "<" -->
?DELETE           ALMSTANY --> ; delete commands
?DELETE : ~DELETE ">" -->
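
For readers who do not have Reform(at), the logic described for this script can be approximated in Python. This is a rough analogue based on the description above, not a translation of Reform(at) semantics: the function name strip_commands is hypothetical, the text is scanned in a single pass, and (as a simplification) every character inside a command is skipped rather than only those in the ALMSTANY category.

```python
def strip_commands(text: str) -> str:
    """Remove <...> commands and collapse runs of spaces (single pass)."""
    out = []
    deleting = False  # plays the role of the DELETE logical variable
    i = 0
    while i < len(text):
        ch = text[i]
        if not deleting and text.startswith("</P>", i):
            out.append("\r\n\r\n")   # end of paragraph
            i += 4
        elif not deleting and ch == "<":
            deleting = True          # a command starts; emit nothing
            i += 1
        elif deleting and ch == ">":
            deleting = False         # the command ends; emit nothing
            i += 1
        elif deleting:
            i += 1                   # simplification: skip everything inside
        elif ch == " ":
            # A run of one or more spaces is written out as one space.
            while i < len(text) and text[i] == " ":
                i += 1
            out.append(" ")
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

In this sketch, spaces on both sides of a removed command survive as two adjacent spaces: strip_commands("a <I>  b") returns "a  b". Feeding the output through the function a second time collapses them, which is a small instance of piping the output of one pass into another.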
 

Originally published in i-Contact, publication of the Society for Technical Communication, Israel Chapter, October 1997.
