
Regex of the day: Optional HTML tags

November 11th, 2005 in Perl. Weblog

It’s one of those regex-laden days, and I’m really starting to write more complicated expressions:

^(?:\s+|)<(\w+)(?:\s+|)((?:.*?|))>(.*?)(?:<\/(.*?)>|)(?:\s+|)$

This expression parses a line that contains tags based on the following logic, expecting that:

  1. There will be a start tag near the beginning of the line, possibly padded on the left with spaces that are ignored
  2. The opening tag may contain some HTML parameters
  3. There may be a closing tag on the line
  4. There may be spaces on the right of the closing tag that will be ignored

The expression will parse the following example into 4 parts:

<h1 id="test">This is a test</h1>
  1. h1
  2. id="test"
  3. This is a test
  4. h1
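As a quick check, the expression can be run against the example line. The sketch below uses Python's `re` module rather than Perl, on the assumption that this particular pattern behaves the same in both engines. The names `TAG_LINE` and `parse_tag_line` are made up for illustration.

```python
import re

# The expression from the post, unchanged. Assumption: Python's re module
# treats this particular pattern the same way the Perl engine does.
TAG_LINE = re.compile(
    r'^(?:\s+|)<(\w+)(?:\s+|)((?:.*?|))>(.*?)(?:<\/(.*?)>|)(?:\s+|)$'
)


def parse_tag_line(line):
    """Return (tag, attributes, content, closing_tag), or None if no match."""
    match = TAG_LINE.match(line)
    return match.groups() if match else None


print(parse_tag_line('<h1 id="test">This is a test</h1>'))
# ('h1', 'id="test"', 'This is a test', 'h1')
```

When the closing tag is missing, the fourth group does not take part in the match, and Python reports it as `None`.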

Learning regex to the point of being able to write complex expressions has taken a couple of years, but has been well worth the effort. To define the same parsing logic in C or C++ (using standard mechanisms) would take 20-30 minutes, and would occupy a page of code. You just have to remember that a regex is a small script, and that it should be tested (and documented) like one.
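One way to treat a regex like a small script is to write it in a commented, multi-line form and keep a few test cases beside it. The sketch below does this with Python's `re.VERBOSE` flag (Perl offers the same facility through the `/x` modifier). The names `TAG_LINE_VERBOSE`, `CASES` and `check_cases` are hypothetical, and the test lines other than the h1 example are invented for illustration.

```python
import re

# The pattern from the post, split across lines so each piece can carry a
# comment. With re.VERBOSE, whitespace and "#" comments inside the pattern
# are ignored by the engine.
TAG_LINE_VERBOSE = re.compile(r'''
    ^(?:\s+|)          # optional leading whitespace, ignored
    <(\w+)             # group 1: name of the opening tag
    (?:\s+|)           # optional whitespace after the tag name
    ((?:.*?|))         # group 2: attributes, possibly empty
    >                  # end of the opening tag
    (.*?)              # group 3: content after the opening tag
    (?:<\/(.*?)>|)     # group 4: name of the closing tag, if present
    (?:\s+|)$          # optional trailing whitespace, ignored
''', re.VERBOSE)

# Test cases kept next to the pattern, as for any other script.
CASES = {
    '<h1 id="test">This is a test</h1>': ('h1', 'id="test"', 'This is a test', 'h1'),
    '  <p>No closing tag  ': ('p', '', 'No closing tag', None),
    '<b>bold</b>': ('b', '', 'bold', 'b'),
}


def check_cases():
    """Assert that every line in CASES parses as expected; return the count."""
    for line, expected in CASES.items():
        match = TAG_LINE_VERBOSE.match(line)
        assert match is not None, line
        assert match.groups() == expected, (line, match.groups())
    return len(CASES)


check_cases()
```

The comments double as documentation, and a failing case points straight at the line that no longer parses.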

Regex is like a lot of little languages too (like SQL, bash, m4). It’s terribly useful, succinct, and worth having in your toolkit. It’s not something to hide in layers of abstraction either, rather it’s something that deserves use alongside your ‘real’ tools. I find that developers are in the habit of hiding (or hiding from) little languages, something that results in the too-many-elbows syndrome: insulating yourself from the real power of your tools, making things more complicated in the process.

Simple, in the end, is in the knowledge of the beholder. If you understand regex, code that contains it can be simpler.

2 Responses to “Regex of the day: Optional HTML tags”

  1. Steven Fisher says:
    December 2nd, 2005 at 3:41 pm

    I love regular expressions. I used to think I was good with them, but the regex you quoted… well, I have some studying to do. It would have taken me many hours to build that.

    One problem I frequently run into is that regular expression engines are not created equal. Learning the quirks of the one you’re using can be quite a pain.

  2. mx says:
    December 2nd, 2005 at 4:30 pm

    And I suck at regex. I’ve had some code reviewed by a brave soul at , who reminded me how much more there is to learn. This particular regex may well be incorrect, buggy, or otherwise retarded: I’m still learning.

    There are at least a few flavours of regex, but they really only differ on the edges. The PHP preg engine is similar enough to the Perl engine (and most of the GNU-tool ones) that I don’t often get lost. It’s a bit like C, in that most compilers support a reasonable subset.

 
