Friday, October 10, 2014

More RegEx Recipes for Text Processing

Yesterday I wrote a piece on using some elementary Find-and-Replace stuff using Notepad++ for handling an issue that occasionally arises in transcript handling.

On this post I thought I'd add some real RegEx recipes for more advanced text handling.

(I will be using two programs, the first is the previously mentioned Notepad++.  It is a free, as in beer, software that I use extensively.  The other is TextPad by Helios Software Solutions.  It is only about $30 for a license, but if you need the advanced features, it is WELL worth the price.  I will indicate which recipes must use TextPad, otherwise the default is using Notepad++.)

(I will refer to these RegEx processes as recipes, meaning there is a Find portion and a Replace portion.  I will use quotes to make reading easier, but the start and end quotes are not meant to be used in the actual recipe.  I will indicate a space as <space>, since those tend to be difficult to read on line.  For example; Find: "\r\n<space>\r\n".  Also I will use "F:" and "R:" instead of using all the extra letters of "Find:" and "Replace:"  And obviously, unless noted, the quotes are NOT part of the recipes and are included solely to make it easier to read.)

Alright, first an explanation of the recipe for replacing the erroneous EOL characters from the other blog.

That recipe was F:  "\r\r\n" R:  "\r\n\r\n".
     Explanation:  Find the lonely Carriage Return followed by the correct EOL sequence (\r\r\n) and replace it with the correct EOL sequence - twice (\r\n\r\n).
     Summary:  Clean up files with this odd combination of EOLs.

Another version of this issue is a file that ONLY has Line Feed characters at the end of EVERY line.
This recipe would be F:  "\n"  R:  "\r\n"
(This explanation and summary is left as an exercise for the reader.)

The recipe to remove double-spacing is F: "\r\n\r\n" R: "\r\n".
     Explanation:  Find two ASCII EOL markers in a sequence (\r\n\r\n) and replace with a single EOL (\r\n).
     Summary:  Remove all empty lines, such as found in a double-spaced file.

Trim trailing whitespace F: "[<space>\t]+$" R:  ""
     Explanation:  Find and select any and all "+" spaces "<space>" or tabs "\t" that occur just before the end of the line "$" and replace them all "[]" with nothing.
     Summary:  Trim the end of a line of extraneous whitespace.

Now the follow on to trim leading whitespace F: "^[<space>\t]+" R:  ""
     Explanation, same as above, except "^" means the start of a line.

To find most problem non-ANSI characters use F: "[^\xd,\xa,\x20-\x7a]"
     ANSI characters are the printing characters and lower non-printing ones. The "\x20-\x7a" catches the upper problem characters.


Thanks and have a great day.

chuck

No comments:

Post a Comment