Archive for the ‘The little…’ Category

The little tokenizer… (3)

September 11, 2008

What’s the subject today? Shall we finally produce our tokens?!

Yes!

Ah, it was about time!

And this time, to change a little, let me start with the questions. 😉

…So here I go with my questions.
What do you think about the process of transforming a string into tokens?

Well, that shouldn’t be so difficult. …In fact, it even looks pretty easy!
Simply express your tokens as regular expressions and let the stream be matched against those.
I.e.:
string -> \".*\"
number -> -?\d+(\.\d+)?
parenth -> [\(\)]
dot -> \.
symbol -> \w[\w\d]+

Using this description, it should be unambiguous and our tokens are matched! 🙂
Wasn’t that easy?
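Here is a minimal sketch of that idea in Haskell. I hand-roll the matching with character predicates instead of using a real regex engine, and all the names (Tok, nextToken, …) are made up for illustration:

import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

-- Token kinds matching the patterns above (illustrative names only).
data Tok = TStr String | TNum String | TParen Char | TDot | TSym String
    deriving Show

-- Try to match one token at the head of the character stream.
nextToken :: String -> Maybe (Tok, String)
nextToken input = case input of
    []                       -> Nothing
    c : rest | isSpace c     -> nextToken rest            -- "simply ignore" spaces
    '"' : rest ->                                         -- assumes the closing quote exists
        let (body, rest') = break (== '"') rest
        in  Just (TStr body, drop 1 rest')
    c : rest | c `elem` "()" -> Just (TParen c, rest)
    '.' : rest               -> Just (TDot, rest)
    c : _ | isDigit c ->
        let (num, rest') = span (\x -> isDigit x || x == '.') input
        in  Just (TNum num, rest')
    c : _ | isAlpha c ->
        let (sym, rest') = span isAlphaNum input
        in  Just (TSym sym, rest')
    _                        -> Nothing

-- Repeat until nothing matches any more.
tokens :: String -> [Tok]
tokens s = maybe [] (\(t, rest) -> t : tokens rest) (nextToken s)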

Is that really sufficient?

…Yeah. …What is bugging you?

What about white spaces?

Well, simply ignore them of course.

Ok, let’s ignore them.
What if “myObject. myMember” is not allowed because of that space in the middle for example?

Oh, I didn’t think of that.
…We cannot purely and simply ignore white spaces. 😦

What should we do then? Tokenize them as well and produce “white space” tokens?

Nah, it would be ugly to make useless tokens.

What should we do then?

…hmmm
Perhaps ensure that after the dot there is another symbol?
…and before as well in fact. There must be symbols on both sides.
Oh man, I hope this is not going to be too complicated.

…I don’t know how to go further. Can I ask the questions back again?

…Thanks for letting me ask questions again. I am stuck. Are there other annoyingly tricky things like the dot, or is it an exceptional case we can handle “specially”?

Well, that’s far from the only case. There are several other examples of tokenizing rules:
– it should be forbidden to place a dot right after a number, as in “123.45.67”
– a number must be followed by some kind of separator (not a symbol, as in “123.45myVariable”)

Ay ay ay. So what should we do? Should what we tokenize depend on what was tokenized just before?

Perhaps.

So it should be like:
if I tokenized a symbol, I must afterwards tokenize either a white space, a parenthesis or a dot
if I tokenized a dot, I must afterwards tokenize a symbol
if …
But isn’t this going to be a parser instead of a tokenizer?!

The frontier between a tokenizer and a parser is sometimes very thin… Whether it even exists is arguable.

So we shall implement a parser to tokenize our string, right?

You can try it!

Let’s say that, as in Java or C, you have to tokenize a small subset of the language: symbols, dots and commas.
However, there is a twist: no whitespaces are allowed between symbols and dots, whereas for commas the surrounding whitespaces can safely be ignored and skipped.
Here are a few examples:
myObject1,myObject2  =>  3 tokens: “myObject1”, “,” and “myObject2”
myObject1   ,   myObject2  =>  3 tokens: “myObject1”, “,” and “myObject2”
myObject.myMember  =>  3 tokens: “myObject”, “.” and “myMember”
myObject.   myMember  =>  error
myObject   .myMember  =>  error

So how do we do it?

Let’s formalize this a little bit.
Let’s say our regular expression recognizer understands the following character classes:
\l : a letter
\d : a digit
\a : an alphanumeric character, letter or digit

Then a symbol is matched by:
“\l\a+”
And a dot by:
“\.”

Thus, what we must express goes as follows…
parse any:
    if char == '(' or ')'
        parse parenth
        parse any
    if char is letter
        parse identifier
    if char is digit
        parse number
        parse spacesOrParenth
    else
        error!

parse parenth:
    OK

parse identifier:
    if next char == '.'
        parse dot
        parse identifier
    else
        parse spacesOrParenth

parse spacesOrParenth:
    if char == ' '
        skip char
    if char == '(' or ')'
        parse parenth
    else
        error!

Pfiouuu …that’s pretty complicated just for a few tokens 😦

Indeed.

Isn’t there a simpler way?!

Sure! …Do you remember what we asked in the beginning? Are you sure we shouldn’t tokenize the white spaces after all, as we first thought?

What? But we would keep the same issues!
We would have all the useless tokens in the token stream produced!
Moreover, the rules you previously set wouldn’t be enforced!

What if we checked the rules afterwards and then filtered out the white space tokens?

Duh. …Really not bad. This looks simpler than the previous method!

Indeed! Moreover, it has a lot of advantages besides being simpler:
– it is more homogeneous
– it is more modular (making tokens, checking integrity rules and filtering white spaces are completely separate concerns)
– it becomes easy to add additional constraints on the token sequence afterwards
– keeping track of the position and explaining errors is more straightforward
– everything is easier!

Cool, let’s do it!

Yup, and since we will soon start coding it, let’s formalize this a little bit to have a practical example.

Ok, can you give me the global picture to be sure we are on the same wavelength?

Sure, the tokenizing process will happen in three steps:

  • Make tokens of everything, including whitespaces
  • Ensure basic constraints (no space surrounding dots…)
  • Filter out all dummy tokens
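
In Haskell, the whole process could then be little more than the composition of three functions. Here is a rough sketch (the names are placeholders; the Token type and ensureIntegrity are sketched further down, and makeRawTokens is the exercise):

-- step 3 (filter) . step 2 (check) . step 1 (make all tokens, spaces included)
tokenize :: String -> [Token]
tokenize = filter (not . isSpaceToken) . ensureIntegrity . makeRawTokens
    where
        isSpaceToken (Space _) = True
        isSpaceToken _         = False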

Ok, got it. And what kind of tokens are we going to tokenize?

Well, why not tokenize LEO’s basic datatypes? Here they are, with examples:

  • Delimiter – . , ; ( ) { } [ ]
  • Char – ‘a’ … ‘z’ ‘\t’ ‘\n’ …
  • String – “hello world!”
  • Integer – 123  -987  1234567890123456789012345678901234567890
  • Decimal – 123.456  -0.0001
  • Date – 2008/08/25  2008-08-25
  • Time – 12:34:56
  • Keyword – declare action function -> if else …
  • Symbol – something  this-is-also-a-valid-id  ^%@*!

Besides these, there are comments. They start with a # and continue until the end of the line.
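
In Haskell, one possible representation of those token kinds could look like the following (the constructor names are only a suggestion; the real skeleton is left to the exercise):

-- One way to represent LEO's token kinds.
data Token
    = Delimiter  Char       -- . , ; ( ) { } [ ]
    | CharTok    Char       -- 'a' ... '\n'
    | StringTok  String     -- "hello world!"
    | IntegerTok Integer    -- arbitrary precision, so very long literals are fine
    | DecimalTok String     -- 123.456, -0.0001 (kept as text in this sketch)
    | DateTok    String     -- 2008/08/25, 2008-08-25
    | TimeTok    String     -- 12:34:56
    | Keyword    String     -- declare, action, function, ->, if, else ...
    | Symbol     String     -- something, this-is-also-a-valid-id, ^%@*!
    | Comment    String     -- # until the end of the line
    | Space      String     -- white space, kept around until the filtering step
    deriving (Show, Eq)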

Fine for me. Let’s write a tokenizer for those!

Sure. That will be your job though! One lesson, one exercise. 😉
If you are interested, I can leave you a skeleton of code. The aims are clearly stated.
…Otherwise I will leave you alone on this one, since I assume you are good enough by now to make it on your own.
One tip: try to abstract common patterns as much as possible and break the work down into nice functional blocks; a tiny example of what I mean follows.
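
Here is one possible (completely optional) building block; the name and signature are my own, using the Token type sketched above:

-- Grab the longest run of characters satisfying a predicate and wrap it into a token.
-- Returns Nothing when the run is empty, so the caller can try the next pattern.
munch :: (Char -> Bool) -> (String -> Token) -> String -> Maybe (Token, String)
munch p wrap s = case span p s of
    ([],  _)    -> Nothing
    (run, rest) -> Just (wrap run, rest)

-- e.g. munch (== ' ') Space, munch isDigit (IntegerTok . read), munch isAlphaNum Symbol, ...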

Ok, I’ll try. Is there anything more you can tell me before I write that little tokenizer?

Yes, sure! Let me take the opportunity to speak about the verification of the integrity constraints. Since we are in Haskell with pattern matching, this can be done very nicely. For example, let us ensure that:
– after any kind of token (except spaces and delimiters), there must be either a white space or a delimiter
– except for the . delimiter, which must be surrounded by symbols

-- Walk the token stream, re-emitting it and failing on any rule violation.
ensureIntegrity :: [Token] -> [Token]
ensureIntegrity tokens = case tokens of
    -- a dot surrounded by symbols is fine; keep checking from the second symbol
    Symbol a : Delimiter '.' : Symbol b : remaining ->
        Symbol a : Delimiter '.' : ensureIntegrity (Symbol b : remaining)
    _ : Delimiter '.' : _ : _ ->
        error "the dot is not surrounded by symbols"
    a : b : remaining
        | not (isDelimOrSpace a) && not (isDelimOrSpace b) ->
            error (show a ++ " must be followed by a space or a delimiter")
        | otherwise -> a : ensureIntegrity (b : remaining)
    shortEnough -> shortEnough
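
The isDelimOrSpace helper is not spelled out here; with the Token type sketched above it could simply be:

-- Spaces and delimiters are the only tokens allowed to separate the others.
isDelimOrSpace :: Token -> Bool
isDelimOrSpace (Delimiter _) = True
isDelimOrSpace (Space _)     = True
isDelimOrSpace _             = False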

The little tokenizer… (2)

August 3, 2008

Here is an answer to the previous exercise:

This is written in Haskell, but sadly there is no syntax highlighting for it on WordPress.

text = [
	"a",
	"b",
	"c",
	"\td",
	"\te",
	"\t\t\tf",
	"\t\tg",
	"\th"
	]
	
	
data Block
    = Block [Block]
    | Line String
    deriving Show

-- Group a list of lines into blocks according to their leading tabs.
indentationToList :: [String] -> Int -> [Block]
indentationToList lines tabs
    | lines == []                    = []
    | countTabs (head lines) == tabs = Line (head lines) : indentationToList (tail lines) tabs
    | countTabs (head lines) >  tabs = Block (indentationToList indented (tabs + 1)) : indentationToList remaining tabs
    | otherwise                      = error "The value of the tabs parameter is greater than the number of tabs in the text!"
    where
        (indented, remaining) = span (\l -> countTabs l > tabs) lines
        countTabs line = length (takeWhile (== '\t') line)

main = print (indentationToList text 0)

Yay …this is a terse piece of code!

Yeah, let’s see what the ‘indentationToList’ function does exactly. It gets as input a list of lines (a part of the text) and a number of tabs. As output it gives a list of blocks.

A list of blocks?

Exactly. And each block in the list can either be a line or a list of sub-blocks itself. Since a plain Haskell list cannot mix single lines with arbitrarily nested sub-lists, we wrap them in the Block type instead!

So, what does ‘indentationToList’ do?

Well, if no lines are provided then it returns an empty list. Quite straightforward. Else what happens depends on the number of tabs of the first line and the tabs argument.

What does the tabs argument mean?

It means: “I am constructing the list of the lines having tabs tabulations”.
If the current line has exactly tabs tabulations, then the result is a list whose first element is the line, followed by the result of ‘indentationToList’ applied to the remaining lines.
If the current line has more than tabs tabulations, then the result is a list whose first element is a sublist obtained by applying ‘indentationToList’ to all the following lines having more than tabs tabulations, followed by the result of ‘indentationToList’ applied to the remaining lines.
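
To make that concrete, running the code above on the sample text should give something like the following (pretty-printed by hand; note that each Line keeps its leading tabs, since countTabs only counts them without stripping them):

-- indentationToList text 0
[ Line "a"
, Line "b"
, Line "c"
, Block [ Line "\td"
        , Line "\te"
        , Block [ Block [Line "\t\t\tf"]
                , Line "\t\tg" ]
        , Line "\th" ] ]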

Is that clear?

Yeah, nearly. I just need to read the example again to be sure it’s completely clear.

Is this finished concerning indentation?

Of course not.

Ok, what’s next?

We will handle unfinished lines and multiline strings.

What about unfinished lines?

Suppose that lines you have……..

Is that all?

…Far from it!!! …but I’ll finish it another day, I’m tired.

The little tokenizer… (1)

August 3, 2008

Who is this for?

For people interested in building tokenizers/parsers for non-trivial languages and discovering the unexpected difficulties. You should be comfortable with programming, and I hope you strive for elegance as I do.

Do you know the “little schemer”? (or the little lisper?)

yes

What is a lexer / tokenizer?

A tokenizer is a piece of code that will convert the stream of characters into a stream of tokens. For example:

area = pi * radius^2

could be transformed into a data structure like:

[area, =, pi, *, radius, ^, 2]
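
For a toy example like this one, a few lines of Haskell are already enough. This little sketch is only an illustration (certainly not LEO’s tokenizer): it pads the single-character operators with spaces and lets ‘words’ do the splitting.

-- Pad single-character operators with spaces, then split on white space.
naiveTokenize :: String -> [String]
naiveTokenize = words . concatMap pad
    where
        pad c | c `elem` "=*^+-/()" = [' ', c, ' ']
              | otherwise           = [c]

-- naiveTokenize "area = pi * radius^2"  ==  ["area","=","pi","*","radius","^","2"]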

Isn’t this trivial?

…concerning the previous example, definitely yes. Yet this is far from the whole story, and that is what this article is all about.

What about all the lexer tools and such? Can’t they easily deal with it?

Not once your language becomes a little more complicated or you want enough customization in the produced tokens.

Can a tokenizer really be complicated?

Just a few numbers:

  • Python official tokenizer: 1600+ lines (in C)
  • Haskell GHC tokenizer: 1300+ lines (in Haskell)

…yes, these numbers are only for the tokenizer, not the parser.

Really?

Yes

So what’s the source of complexity?

The complexity comes from multiple reasons:

  • In indentation-sensitive programming languages, it is often easier to keep track of the indentation in the tokenizer than to leave it to the parser, since indentation cannot be expressed by a context-free grammar.
  • Keeping track of the location, error reporting, etc.
  • Various subtleties in many areas…
  • Quoted strings can have escaped quotes inside of them, as well as some (but not all) other escaped symbols.
  • For most things you ignore white spaces, but for “object members” white spaces around the “.” token are an error.
  • The token type may depend on the context.
  • Tricky combinations: “1.2” may be a single token, while “listOfLists.1.2” may be 5 valid tokens.
  • In other words, what you tokenize may depend on the context.
  • And many other subtleties and additional sources of complexity.

…As you can see, there are lots of “small” difficulties that make the initial tokenizer an ugly, bizarre thing.

Shall I take a pause to read that again and think a bit about it first?

Yes, definitely! Make a walk outside and think how you would handle those things!
