Showing posts from August, 2011

Regular expressions in lexing and parsing

Comments extracted from a code review. I've been asked to disseminate them more widely.

I should say something about regular expressions in lexing andparsing. Regular expressions are hard to write, hard to write well,and can be expensive relative to other technologies. (Even when theyare implemented correctly in N*M time, they have significantoverheads, especially if they must capture the output.)Lexers, on the other hand, are fairly easy to write correctly (if notas compactly), and very easy to test. Consider finding alphanumericidentifiers. It's not too hard to write the regexp (something like"[a-ZA-Z_][a-ZA-Z_0-9]*"), but really not much harder to write as asimple loop. The performance of the loop, though, will be much higherand will involve much less code under the covers. A regular expressionlibrary is a big thing. Using one to parse identifiers is like using aMack truck to go to the store for milk. And when we want to adjust our lexer to admit other character t…