Regex-based tokenizer
January 2022
This post is not mine, but I found it so useful that, after losing the URL once, I decided to save its content on my blog. There is also a tokenizer guide in the official Python documentation, but it looks a bit more complicated to me.
import re
SCANNER = re.compile(r'''
    (\s+) | # whitespace
    (//)[^\n]* | # comments
    0[xX]([0-9A-Fa-f]+) | # hexadecimal integer literals
    (\d+) | # integer literals
    (<<|>>) | # multi-char punctuation
    ([][(){}<>=,;:*+/-]) | # punctuation (hyphen last, so it is a literal and not a range)
    ([A-Za-z_][A-Za-z0-9_]*) | # identifiers
    """(.*?)""" | # multi-line string literal
    "((?:[^"\n\\]|\\.)*)" | # regular string literal
    (.) # an error!
''', re.DOTALL | re.VERBOSE)
Combine this with a re.finditer() call on your source string and check which group matched:
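for match in re.finditer(SCANNER, data):
    space, comment, hexint, integer, mpunct, \
        punct, word, mstringlit, stringlit, badchar = match.groups()
    # Compare against None: an empty string literal ("") captures '', which is falsy.
    if space is not None: ...
    if comment is not None: ...
    # ...
    if badchar is not None: raise FooException(badchar)  # FooException is a placeholder; define your own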
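For completeness, here is a minimal sketch of how the two snippets might fit together into a working generator. The names KINDS, TokenizeError and tokenize are my own assumptions and do not come from the original post. The punctuation class is written with the hyphen last so that it is a literal hyphen, and match.lastindex is used to find the group that matched instead of unpacking match.groups(). That works here because every alternative contains exactly one capturing group.

```python
import re

SCANNER = re.compile(r'''
    (\s+)                     | # whitespace
    (//)[^\n]*                | # comments
    0[xX]([0-9A-Fa-f]+)       | # hexadecimal integer literals
    (\d+)                     | # integer literals
    (<<|>>)                   | # multi-char punctuation
    ([][(){}<>=,;:*+/-])      | # punctuation (hyphen last, so it is a literal)
    ([A-Za-z_][A-Za-z0-9_]*)  | # identifiers
    """(.*?)"""               | # multi-line string literal
    "((?:[^"\n\\]|\\.)*)"     | # regular string literal
    (.)                         # an error!
''', re.DOTALL | re.VERBOSE)

# One illustrative name per capturing group, in pattern order.
KINDS = ('space', 'comment', 'hexint', 'integer', 'mpunct',
         'punct', 'word', 'mstringlit', 'stringlit', 'badchar')


class TokenizeError(Exception):
    """Hypothetical stand-in for the FooException placeholder."""


def tokenize(data):
    """Yield (kind, value) pairs, skipping whitespace and comments."""
    for match in SCANNER.finditer(data):
        # Exactly one group participates in each match, so lastindex
        # identifies it even when the captured text is empty.
        kind = KINDS[match.lastindex - 1]
        text = match.group(match.lastindex)
        if kind in ('space', 'comment'):
            continue
        if kind == 'badchar':
            raise TokenizeError(
                f'unexpected character {text!r} at offset {match.start()}')
        if kind == 'hexint':
            yield 'integer', int(text, 16)
        elif kind == 'integer':
            yield 'integer', int(text)
        else:
            yield kind, text


if __name__ == '__main__':
    for token in tokenize('x = 0x1F + 42; // a comment\ny <<= "s"'):
        print(token)
```

Folding both integer forms into a single 'integer' kind is just one possible design choice; a parser that needs to remember the original spelling could keep them separate.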
Original post: https://deplinenoise.wordpress.com/2012/01/04/python-tip-regex-based-tokenizer/