The little tokenizer… (3)

What’s today’s subject? Shall we finally produce our tokens?!

Yes!

Ah, it was about time!

And this time, for a change, let me start with the questions. 😉

…So here I go with my questions.
What do you think about the process of transforming a string into tokens?

Well, that shouldn’t be so difficult. …In fact, it even looks pretty easy!
Simply express your tokens as regular expressions and match the stream against them.
For instance:
string -> \".*?\"
number -> -?\d+(\.\d+)?
parenth -> [\(\)]
dot -> \.
symbol -> [a-zA-Z_]\w*

With these descriptions, everything should be unambiguous and our tokens get matched! 🙂
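For instance, here is a quick Haskell sketch of that idea (with hand-rolled matchers standing in for a real regex library; all names are mine and purely illustrative):

import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

data Token = TString String | TNumber String | TParenth Char
           | TDot | TSymbol String
           deriving (Eq, Show)

-- Naive tokenizer: match each token class at the head of the input
-- (negative numbers left out for brevity)
tokenize :: String -> [Token]
tokenize [] = []
tokenize (c:cs)
    | isSpace c            = tokenize cs   -- simply ignore white spaces
    | c == '(' || c == ')' = TParenth c : tokenize cs
    | c == '.'             = TDot : tokenize cs
    | c == '"'             = let (str, rest) = break (== '"') cs
                             in TString str : tokenize (drop 1 rest)
    | isDigit c            = let (num, rest) = span (\x -> isDigit x || x == '.') (c:cs)
                             in TNumber num : tokenize rest
    | isAlpha c            = let (sym, rest) = span isAlphaNum (c:cs)
                             in TSymbol sym : tokenize rest
    | otherwise            = error ("unexpected character: " ++ [c])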
Wasn’t that easy?

Is that really sufficient?

…Yeah… What is bugging you?

What about white spaces?

Well, simply ignore them of course.

Ok, let’s ignore them.
What if "myObject. myMember" is not allowed because of that space in the middle, for example?

Oh, I didn’t think of that.
…We cannot purely and simply ignore white spaces. 😦

What should we do then? Tokenize them as well and produce “white space” tokens?

Nah, it would be ugly to make useless tokens.

What should we do then?

…hmmm
Perhaps ensure that after the dot comes another symbol?
…and before it as well, in fact. There must be symbols on both sides.
Oh man, I hope this is not going to be too complicated.

…I can’t get any further. Can I hand the questions back to you?

…Thanks for letting me ask the questions again. I am stuck. Are there other annoyingly tricky things like the dot, or is it an exceptional case we can handle “specially”?

Well, that’s far from the only case. There are several other tokenizing rules, for example:
– it should be forbidden to place a dot right after a number, as in "123.45.67"
– a number must be followed by some kind of separator (not by a variable name, as in 123.45myVariable)

Ay ay ay. So what should we do? What we tokenize should depend on what was tokenized just before, shouldn’t it?

Perhaps…

So it would go like this:
if I tokenized a symbol, I must afterwards tokenize either a white space, a parenthesis or a dot
if I tokenized a dot, I must afterwards tokenize a symbol
if …
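Spelled out as a successor predicate, my idea would look something like this sketch (the token kinds are hypothetical):

data Kind = KSymbol | KDot | KSpace | KParenth
    deriving (Eq, Show)

-- Which token kind may follow which, as per the rules above
mayFollow :: Kind -> Kind -> Bool
mayFollow KSymbol next = next `elem` [KSpace, KParenth, KDot]
mayFollow KDot    next = next == KSymbol
mayFollow _       _    = True   -- the remaining rules are elided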
But isn’t this going to be a parser instead of a tokenizer?!

The frontier between a tokenizer and a parser is sometimes very thin… Whether it even exists is arguable.

So we shall implement a parser to tokenize our string, right?

You can try it!

Let’s say you have to tokenize a small subset of a language like Java or C: symbols, numbers, dots and commas.
However, there is a twist: no whitespace is allowed between symbols and dots, whereas the whitespace surrounding commas can safely be ignored and skipped.
Here are a few examples:
myObject1,myObject2  =>  3 tokens: "myObject1", "," and "myObject2"
myObject1   ,   myObject2  =>  3 tokens: "myObject1", "," and "myObject2"
myObject.myMember  =>  3 tokens: "myObject", "." and "myMember"
myObject.   myMember  =>  error
myObject   .myMember  =>  error

So how do we do it?

Let’s formalize this a little bit.
Let’s say our regular expression recognizer understands the following character classes:
\l : a letter
\d : a digit
\a : an alphanumeric character, letter or digit

Then a symbol is matched by:
"\l\a*"
And a dot by:
"\."
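In Haskell, those classes are just character predicates, and a hand-written matcher for a symbol could look like this (a sketch; the helper names are mine):

import Data.Char (isAlpha, isAlphaNum, isDigit)

-- \l, \d and \a as predicates
classL, classD, classA :: Char -> Bool
classL = isAlpha     -- \l : a letter
classD = isDigit     -- \d : a digit
classA = isAlphaNum  -- \a : a letter or a digit

-- "\l\a*": one letter followed by any number of alphanumerics;
-- returns the matched symbol and the remaining input
matchSymbol :: String -> Maybe (String, String)
matchSymbol (c:cs) | classL c = let (as, rest) = span classA cs
                               in Just (c:as, rest)
matchSymbol _                 = Nothing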

Thus, what we must express goes as follows…

parse any:
    if char == ','
        parse comma
        parse any
    else if char is a letter
        parse identifier
    else if char is a digit
        parse number
        parse spacesOrComma
    else
        error!

parse comma:
    OK

parse identifier:
    consume the letters and digits
    if next char == '.'
        parse dot
        parse identifier
    else
        parse spacesOrComma

parse spacesOrComma:
    skip the ' ' chars
    if char == ','
        parse comma
        parse any
    else
        error!

(And this still glosses over the end of input and the spaces right after a comma!)

Phew… that’s pretty complicated just for a few tokens 😦

Indeed.

Isn’t there a simpler way?!

Sure! …Do you remember what you asked at the beginning? Are you sure we shouldn’t tokenize the white spaces after all, as we first thought?

What? But we would keep the same issues!
We would have all those useless tokens in the produced token stream!
Moreover, the rules you previously set wouldn’t be enforced!

What if we checked the rules afterwards and only then filtered out the white space tokens?

Duh. …Not bad at all. This looks simpler than the previous method!

Indeed! Moreover, it has a lot of advantages besides being simpler:
– it is more homogeneous
– it is more modular (making tokens, checking integrity rules and filtering white spaces are completely separate concerns)
– it becomes easy to add further constraints on the token sequence afterwards
– keeping track of positions and explaining errors is more straightforward
– everything is easier!

Cool, let’s do it!

Yup, and since we will soon start coding it, let’s formalize this a little bit to have a practical example.

Ok, can you give me the global picture to be sure we are on the same wavelength?

Sure, the tokenizing process will happen in three steps:

  • Make tokens of everything, including whitespaces
  • Ensure basic constraints (no space surrounding dots…)
  • Filter out all dummy tokens
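In code, the pipeline is then just a composition of three functions. Here is a stripped-down sketch with a placeholder token type (the real token kinds come right below):

import Data.Char (isSpace)

-- Placeholder token type, only to show the shape of the pipeline
data Token = WhiteSpace String | Word String
    deriving (Eq, Show)

-- Step 1: make tokens of everything, including whitespaces
makeTokens :: String -> [Token]
makeTokens [] = []
makeTokens s@(c:_)
    | isSpace c = let (sp, rest) = span isSpace s in WhiteSpace sp : makeTokens rest
    | otherwise = let (w, rest)  = break isSpace s in Word w : makeTokens rest

-- Step 2: ensure the basic constraints (a stub here; see ensureIntegrity below)
checkTokens :: [Token] -> [Token]
checkTokens = id

-- Step 3: filter out all the dummy tokens
isDummy :: Token -> Bool
isDummy (WhiteSpace _) = True
isDummy _              = False

tokenizer :: String -> [Token]
tokenizer = filter (not . isDummy) . checkTokens . makeTokens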

Ok, got it. And what kind of tokens are we going to tokenize?

Well, why not tokenize LEO’s basic datatypes? Here they are, with examples:

  • Delimiter – . , ; ( ) { } [ ]
  • Char – 'a' … 'z' '\t' '\n' …
  • String – "hello world!"
  • Integer – 123  -987  1234567890123456789012345678901234567890
  • Decimal – 123.456  -0.0001
  • Date – 2008/08/25  2008-08-25
  • Time – 12:34:56
  • Keyword – declare action function -> if else …
  • Symbol – something  this-is-also-a-valid-id  ^%@*!

Besides these, there are comments. They start with a # and continue until the end of the line.
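As a starting point, those kinds could be represented by a datatype along these lines (a sketch of mine; the exact representations are up to you):

data Token
    = Delimiter Char        -- . , ; ( ) { } [ ]
    | CharTok Char          -- 'a' … '\n'
    | StringTok String      -- "hello world!"
    | IntegerTok Integer    -- arbitrary precision: 123, -987, …
    | DecimalTok Rational   -- 123.456, -0.0001
    | DateTok Int Int Int   -- year, month, day
    | TimeTok Int Int Int   -- hours, minutes, seconds
    | Keyword String        -- declare, action, function, ->, if, else, …
    | Symbol String         -- something, this-is-also-a-valid-id, ^%@*!
    | WhiteSpace String     -- kept around until the filtering step
    | Comment String        -- # … up to the end of the line
    deriving (Eq, Show)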

Fine for me. Let’s write a tokenizer for those!

Sure. That will be your job though! One lesson, one exercise. 😉
If you are interested, I can leave you a skeleton of code in which the aims are clearly stated.
…But otherwise I will leave you alone on this one, since I assume you have become good enough to make it on your own.
One tip though: try to abstract the common patterns as much as possible and break the work down into nice functional blocks.

Ok, I’ll try. Is there anything more you can tell me before I write that little tokenizer?

Yes, sure! Let me take the opportunity to speak about the verification of the integrity constraints. Since we are in Haskell, with pattern matching this can be done very nicely. For example, let us ensure that:
– after any kind of token (except spaces and delimiters), there must come either a white space or a delimiter
– except for the . delimiter, which must be surrounded by symbols

ensureIntegrity :: [Token] -> [Token]
ensureIntegrity tokens = case tokens of
    -- a dot with symbols on both sides: fine, keep checking from the right symbol on
    Symbol a : Delimiter '.' : Symbol b : remaining ->
        Symbol a : Delimiter '.' : ensureIntegrity (Symbol b : remaining)
    -- any other dot is an error
    _ : Delimiter '.' : _ : _ ->
        error "the dot is not surrounded by symbols"
    -- general rule: a non-space, non-delimiter token must be followed by a space or a delimiter
    a : b : remaining
        | not (isDelimOrSpace a) && not (isDelimOrSpace b) ->
            error (string a ++ " must be followed by a space or a delimiter")
        | otherwise -> a : ensureIntegrity (b : remaining)
    -- zero or one token left: nothing more to check
    short -> short
-- (isDelimOrSpace and string are small helpers left for you to write)
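Notice how adding a new constraint later is just one more pattern at the top of the case. And on our earlier counter-example (using the Token datatype sketched above), the check fires as expected:

ensureIntegrity [Symbol "myObject", Delimiter '.', WhiteSpace " ", Symbol "myMember"]
-- ==> error: the dot is not surrounded by symbols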
