The little tokenizer… (1)

Who is this for?

For people interested in building tokenizers/parsers for non-trivial languages and discovering the unexpected difficulties. You should be comfortable in programming and I hope you strive for elegancy as I do.

Do you know the “little schemer”? (or the little lisper?)

yes

What is a lexer / tokenizer?

A tokenizer is a piece of code that will convert the stream of characters into a stream of tokens. For example:

area = pi * radius^2

could be transformed into a data structure like:

[area, =, pi, *, radius, ^, 2]

Isn’t this trivial?

…concerning the previous example, definitely yes. Yet, this is far from all and this article is all about.

What about all the lexer tools and such? Can’t they easely deal with it?

Not once your language becomes a little more complicated or that you want enough customization in the produced tokens.

Can tokenizer really be complicated?

Just a few numbers:

  • Python official tokenizer: 1600+ lines (in C)
  • Haskell GHC tokenizer: 1300+ lines (in Haskell)

…yes, these numbers are only for the tokenizer, not the parser.

Really?

Yes

So what’s the source of complexity?

The complexity comes from multiple reasons:

  • In indentation sensitive programming languages, it is often easier to keep track of the indentation in the tokenizer than leaving it up for the parser. Indeed since this cannot be expressed by a context-free grammar.
  • Keeping track of the location, error reporting, etc
  • Various subtelities in many areas…
  • Quoted strings can have escaped quotes inside of them, as well as other escaped symbols but not all.
  • For most things: ignore white spaces. For “object members”: white spaces around the “.” token is an error.
  • Token type may depend on the context
  • Tricky combinations: “1.2” may be a single token while “listOfLists.1.2” may be valid 5 tokens
  • In other words, what you tokenize may depend on the context.
  • And many other subtileties and additional sources of complexity

…As you see, there are lots of “small” difficulties making the initial tokenizer an ugly bizare thing.

Shall I take a pause to read that again and think a bit about it first?

Yes, definitely! Make a walk outside and think how you would handle those things!

Are you really ready?

Not sure …give me a little more time to make a walk outside.

Ok …by now I guess you are ready. So where do we start?

By asking the right question.

Can these orthogonal issues be untangled and be nicely isolated?

Tricky question. It is not easy to answer. Maybe …or not.

Ok, let’s start somewhere, what about indentation?

If you are not interested in indentation sensitive languages, you may want to skip this whole section. Otherwise, enjoy going on. 🙂

Suppose we want to tokenize:

a
   b
      c
         d
   e

There are several ways to do this. One way is to keep track of the indentation while tokenizing and insert special indent and dedent tokens in the stream as we go along. Perhaps we could insert the usual {‘s and }’s standing for starting an indented block and the ending of it. That way, the previous example becomes:

[a, {, b, {, c, {, d, }, }, e, }]

Another way to do this (under some circumstances) is to put your lines directly in a nested list structure that already reflects the indentation. I.e.

[a, [b, [c, [d]],e]]

These are the main idea but there are surely other ways to do it.

…Well, that wasn’t very freaky. Is that all about indentation?

Of course not.
Suppose that, like in python, blank lines are ignored, comments start with # and multiline strings start/end with “””. Then, you should take care that:

a
   b

   c
      #my comment
   d

Becomes:

 [a, {, b, c, d, }]    [a, [b, c, d]] 

and not:

 [a, {, b, }, {, c, }, {, {, }, }, {, d, }]    [a, [b], [c], [[]], [d]] 

Or should it? …depends on you.

In the latter case, you still have the possibility to suppress empty { } couples or aggregate the lists correspondingly. If you do it in one move or two is up to you.

And what about the “””?

Yeah, the dear multiline strings. Well, what should be done is obvious: depending if you’re inside code or inside multiline string, you’ll handle tabs differently. …how to do it is another story. Suffice to say that you can do it during the tokenizing or in a second pass.

…so, do you have an advice which one of the two ways I should choose?

Yes, I do! 🙂 Basically, if you are using a highly stateful language with cumbersome list handling facilities, like in C or Java, then I advise you to use the { } insertion method. On the opposite, if you’re more in a stateless language with powerful list handling features, like Haskell or Scheme, then I advise you the nested lists method. If you’re using python, pick the one you prefer. 🙂

Why this segregation?

About the { } insertion method, remember that we don’t know yet what will be tokenized on the lines themeselves. I.e, when tokenizing:

a b c d e
   f

We must *remember* that the previous number of tabs was 0 while tokenizing the line into a, b, c, d and e before lastly using the tabs information to compute that we must insert a {. Advantages: we put everything in the same list. Disadvantages: we must keep track of the previous tab …sometimes for extended periods of time in case of multiline indent insensitive parts. On the opposite, the nested list method requires easy ways to manipulate lists but no state. Just put every line in the lists and each line will then be tokenized individually. A special case will be done to handle the multiline string elegantly, don’t worry. 😉

Are we going to see some code as well?

Sure. Theory without practice is dull.

So, can we see the example?

Not yet. In fact it is a wonderful exercise for you. Try to write a little program that takes an input like presented above and produces a corresponding internal representation. Using either one of the internal structure presented. This may seem obvious but is not THAT trivial and needs a bit thinking …definitely a good exercise. On the next page, you will get a solution in Haskell producing nested lists.

Advertisements

Tags: , , ,

2 Responses to “The little tokenizer… (1)”

  1. AlexM Says:

    Your blog is interesting!

    Keep up the good work!

  2. sidewords Says:

    thanks 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: