Regex-based tokenizer

January 2022

This post is not mine, but I found it so useful that, after I lost the URL, I decided to save the content to my blog. There is also a guide in the official Python documentation, but it looks a bit more complicated to me.

import re

SCANNER = re.compile(r'''
  (\s+) |                      # whitespace
  (//[^\n]*) |                 # comments (capture the whole comment)
  0[xX]([0-9A-Fa-f]+) |        # hexadecimal integer literals
  (\d+) |                      # integer literals
  (<<|>>) |                    # multi-char punctuation
  ([][(){}<>=,;:*+/-]) |       # punctuation ('-' last, so it is literal)
  ([A-Za-z_][A-Za-z0-9_]*) |   # identifiers
  """(.*?)""" |                # multi-line string literal
  "((?:[^"\n\\]|\\.)*)" |      # regular string literal
  (.)                          # an error!
''', re.DOTALL | re.VERBOSE)
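Order matters in the alternation: Python's regex engine tries the branches left to right at each position, so the multi-char `<<`/`>>` branch must appear before the single-char punctuation class, or `<<` would be scanned as two `<` tokens. A minimal check of this behavior (the pattern names here are my own, not from the original post):

```python
import re

# '<<' listed before the single-char class, so the longer token wins.
pat = re.compile(r'(<<|>>)|([<>=])')
assert pat.match('<<').group(1) == '<<'

# With only the single-char class, '<<' scans as two '<' tokens.
single = re.compile(r'([<>=])')
assert [m.group() for m in single.finditer('<<')] == ['<', '<']
```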

Combine this with a re.finditer() call on your source string, like this:

for match in re.finditer(SCANNER, data):
    (space, comment, hexint, integer, mpunct,
     punct, word, mstringlit, stringlit, badchar) = match.groups()
    if space: ...
    if comment: ...
    # ...
    if badchar: raise FooException(badchar)  # FooException: your own error type
https://deplinenoise.wordpress.com/2012/01/04/python-tip-regex-based-tokenizer/
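Putting it together, here is one way to turn the matches into (kind, text) tokens. This is my own sketch, not part of the original post: `tokenize`, `TOKEN_KINDS`, and raising `SyntaxError` on a bad character are assumptions, and the pattern repeats the scanner above with `-` moved to the end of the punctuation class so it matches literally. `match.lastindex` tells us which alternative fired, since exactly one group matches per token.

```python
import re

# One name per capture group, in pattern order.
TOKEN_KINDS = ('space', 'comment', 'hexint', 'integer', 'mpunct',
               'punct', 'word', 'mstringlit', 'stringlit', 'badchar')

SCANNER = re.compile(r'''
  (\s+) |                      # whitespace
  (//[^\n]*) |                 # comments
  0[xX]([0-9A-Fa-f]+) |        # hexadecimal integer literals
  (\d+) |                      # integer literals
  (<<|>>) |                    # multi-char punctuation
  ([][(){}<>=,;:*+/-]) |       # punctuation
  ([A-Za-z_][A-Za-z0-9_]*) |   # identifiers
  """(.*?)""" |                # multi-line string literal
  "((?:[^"\n\\]|\\.)*)" |      # regular string literal
  (.)                          # an error!
''', re.DOTALL | re.VERBOSE)

def tokenize(text):
    """Yield (kind, text) pairs, skipping whitespace and comments."""
    for match in SCANNER.finditer(text):
        kind = TOKEN_KINDS[match.lastindex - 1]
        if kind in ('space', 'comment'):
            continue
        if kind == 'badchar':
            raise SyntaxError('bad character %r at offset %d'
                              % (match.group(), match.start()))
        # group(lastindex) is the captured text: hex digits without
        # the '0x' prefix, string contents without the quotes, etc.
        yield kind, match.group(match.lastindex)
```

Note that finditer never skips input: every character lands in some group, so the `(.)` fallback guarantees that anything the other branches reject is reported as an error instead of being silently dropped.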