RegEx for the Writer

As an IT professional, I use regular expressions every day. Regular expressions (RegEx for short) are a syntax employed by modern programming and web tools to provide sophisticated pattern-matching capabilities. They scare me a little, because I maintain that all non-trivial regular expressions are what John Dykstra used to call “miracle programs”: programs that are wrong and only appear to work because they have not yet met the input data that will cause them to stumble, embarrassingly and disastrously, into ruin.

Still, they are handy, and let us go way beyond the simple wildcard matches of yesteryear. So it is not surprising that OpenOffice/LibreOffice, the open source replacements for Microsoft Office written by a global community of uber-geeks, support RegEx. As an author, I use this capability quite a lot. When writing a novel, it is not uncommon to realize (or worry) that you have been systematically making some grammatical or mechanical mistake, as happens to the best of us, or simply to decide to make some global change. Simple search-and-replace is a boon, but RegEx takes us further. For example, “^And” will find paragraphs beginning with a conjunction, and “ to[\.\!\?]” will find sentences ending with (one particular) preposition.
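For anyone who would rather try these two searches outside the word processor, here is a rough command-line sketch using grep. It assumes the manuscript has been saved as plain text with one paragraph per line. The function names are made up for illustration, and the second pattern is a simplified form that looks for “ to” directly before closing punctuation.

```shell
#!/bin/sh
# Sketch only: the function names below are illustrative, not part of any tool.

# Print, with line numbers, every line of file "$1" that begins with "And".
find_leading_and() {
    grep -n -e '^And' "$1"
}

# Print, with line numbers, every line of file "$1" that contains " to"
# directly before a period, exclamation mark, or question mark.
find_trailing_to() {
    grep -n -e ' to[.!?]' "$1"
}
```

Calling find_leading_and "manuscript.txt" (a hypothetical file name) lists the offending paragraphs by number, which is usually enough to track them down in the word processor.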

I have also used RegEx when preparing text for online submission, where in-line text needs to be readable on a wide variety of clients. I use an online tool (http://www.formatit.com/) to insert enough linefeeds to format my pasted text to the proper width for submission, then paste it back into LibreOffice and use a global replace to transform the end of each line (“$”) into a pair of linefeeds (“\n\n”), producing text that remains double spaced even when divorced from the text styles of the word processor.
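The same wrap-then-double-space result can be approximated on the command line. This is only a sketch of a substitute workflow, not the online tool or the LibreOffice replace itself: it assumes the standard fold and sed utilities are available, and wrap_double_spaced is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: wrap_double_spaced is a hypothetical helper name.

# Wrap the text in file "$1" at "$2" columns, breaking at spaces where
# possible, then add an empty line after every line of output.
wrap_double_spaced() {
    fold -s -w "$2" "$1" | sed G
}
```

For example, wrap_double_spaced "chapter1.txt" 72 > "submission.txt" (both file names are placeholders) writes a wrapped, double-spaced copy. One caveat: on some systems fold may count a non-ASCII character such as a curly quote as more than one column, so a few lines can come out slightly short.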

Recently, I noticed a particular sentence in which I had used three “em” dashes. I wanted to come back to it later, but had forgotten where it was. Rather than search through all 300 dashes in my manuscript, I saved the file as text and used the following command line to find my quarry:

grep -o -e "[^\.\!?]*—[^\.\!?]*—[^\.\!?]*—[^\.\!?]*[\.\!?\"]" "Doomsday's Wake.txt"

This searches for any run of characters that contains at least three dashes and ends with a sentence-ending punctuation mark. (If you know RegEx, you know that a repetition operator can simplify this, but for some reason the version of grep I am running won’t accept it.)
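For comparison, here is what the repetition-operator form might look like. One possible explanation for the rejection, offered only as a guess since the exact attempt is not shown, is that grep's default "basic" syntax requires braces and parentheses to be backslash-escaped, while the -E option switches to "extended" syntax, where they work bare. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: three_dash_sentences is a hypothetical helper name.

# Print each sentence-like match in file "$1" that contains at least
# three em dashes. -E enables extended syntax so that ( ) and {3} work
# without backslashes; -o prints only the matched text.
three_dash_sentences() {
    grep -o -E '([^.!?]*—){3}[^.!?]*[.!?"]' "$1"
}
```

Changing the 3 to another number adjusts how many dashes a sentence needs before it is reported, with no further retyping of the pattern.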

That solved, I used this to count the total number of sentences in my document:

grep -o -e "[^\.\!?]*[\.\!?\"]" "Doomsday's Wake.txt" | wc -l

and this to display all those using a pair of dashes for review:

grep -o -n -e "[^\.\!?]*—[^\.\!?]*—[^\.\!?]*[\.\!?\"]" "Doomsday's Wake.txt" | more

Power tools: they’re not just for motor-heads.