Mixing various syntaxes – 2. Tokenizing/parsing generalities

You surely know the usual way a compiler works: you get a stream of characters, transform it into tokens, and then parse those according to some grammar. Of course, real life is usually not that simple, and various deviations or hacks have to be applied; however, this won’t be discussed here. The first question should be:

So are we going to do it the same way? Tokenize the whole stream at once and then parse all tokens?

First, different languages are tokenized in different ways. For example, depending on the language being tokenized, “<<” might be tokenized as a single token “<<”, as two tokens “<” and “<”, or even rejected as an error because it is not allowed at all.
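
To make this concrete, here is a minimal sketch (Python, with made-up token names) of two toy tokenizers disagreeing about the very same two characters:

    def tokenize_lang_a(src):
        # Language A knows "<<" as a single shift operator.
        tokens, i = [], 0
        while i < len(src):
            if src.startswith("<<", i):
                tokens.append("SHIFT_LEFT")
                i += 2
            elif src[i] == "<":
                tokens.append("LESS_THAN")
                i += 1
            else:
                i += 1  # anything else is ignored in this toy example
        return tokens

    def tokenize_lang_b(src):
        # Language B has no "<<" token: each "<" stands alone.
        return ["LESS_THAN" for c in src if c == "<"]

    print(tokenize_lang_a("a << b"))  # ['SHIFT_LEFT']
    print(tokenize_lang_b("a << b"))  # ['LESS_THAN', 'LESS_THAN']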

Different languages, different tokenization processes and tokens. And this is of course only the tip of the iceberg: a naive universal tokenizer/parser handling all languages would definitely explode in complexity and be unmaintainable.

…But before making something complicated, isn’t there a simple way?

Indeed, let us first observe the situation. We have the core language, and embedded in it are blocks written in one or more sub-languages. These blocks are evaluated/run at some point. This raises several questions.

-> Do we really need to tokenize/parse it at all?

One could think it sufficient to store the HTML page, the shell script, or whatever the embedded sub-language block is, as one big string in a variable. But that is not what we are after. We want syntactic and semantic checks performed on it: the compiler should confirm that our HTML structure is correct, that every opening tag has a matching closing tag, and so on, instead of letting it break at runtime.

-> Why not simply take the big text block and give it to an independent compiler/checker/interpreter?

Although this sounds nice and simple, it has one big issue: sub-languages can typically include core language expressions. The external tool would then need to call back into the core language compiler/interpreter, and it would also need information about the context. If we pass the “big string HTML page” around in the program and at some point try to display it, then ${f(42)} must be evaluated. But f might not be available in the context of the current execution point of the program, or might even refer to a different function called “f”.
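
Here is a hypothetical Python sketch of that scoping problem, with eval standing in for “evaluate the embedded expression later”:

    def make_page():
        def f(x):                  # the f the author meant
            return x + 1
        return "<p>${f(42)}</p>"   # the block is kept as an opaque string

    def render(page):
        def f(x):                  # a different f at the evaluation site
            return x * 1000
        expr = page[page.index("${") + 2 : page.index("}")]
        return eval(expr)          # evaluated against the wrong context

    print(render(make_page()))     # 42000, not the intended 43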

Hence, either we find a way to parse the whole thing at once, or we build independent compilers/interpreters with a way of passing context around and calling back into each other.

So how do we do it?

Let us use our imagination to try to simplify things. First, it is reasonable to assume that the delimitations of sub-languages are easy to detect: for example, an indented block, something in brackets, and so on. There should be no need for advanced parsing or semantic analysis to detect the limits. Let us call what lies between two such limits a “block”. It is indeed advisable that blocks be easily identified, for humans as well.

So perhaps we should build a kind of tokenizer which outputs either tokens or blocks; we will call this a b-tokenizer. Blocks have a given type and themselves contain blocks and tokens. And each block type is associated with another b-tokenizer in charge of splitting its stream.
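
A minimal sketch of these notions, assuming Python and made-up names (the two placeholder b-tokenizers return their output together with a remaining stream, as discussed below):

    from dataclasses import dataclass, field
    from typing import List, Union

    @dataclass
    class Token:
        kind: str
        text: str

    @dataclass
    class Block:
        btype: str   # e.g. "core", "html"
        body: List[Union["Token", "Block"]] = field(default_factory=list)

    # Placeholder b-tokenizers; each returns (body, remaining stream).
    def tokenize_core(src):
        return [Token("word", w) for w in src.split()], ""

    def tokenize_html(src):
        return [Token("text", src)], ""

    # Each block type is associated with the b-tokenizer for its content.
    B_TOKENIZERS = {"core": tokenize_core, "html": tokenize_html}

    body, remaining = B_TOKENIZERS["core"]("let x = f 42")
    print(Block("core", body))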

In fact, by doing this, we blur the line between tokenizing and parsing. Indeed, we could for example say that “(” is a symbol opening a new block whose inside should be parsed as the core language. Parenthesis grouping can thus be done in both places: whether it happens in the enhanced tokenizer or in the parser, the complexity is just shifted from one side to the other. Which option is preferable is an open question.

There are two kinds of b-tokenizers. Let us look again at the initial example.

Concerning the embedded HTML part, we know clearly where the block starts and ends based on the indentation, even if the content is garbage. Hence, the whole block stream can be passed to the HTML b-tokenizer as input, while the stream after it can independently continue to be b-tokenized (by the core language).
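
A sketch of this first, parallel kind of process, assuming (as in our example) that indentation marks the block; the names are made up:

    def split_indented_block(lines, start, indent):
        """Return (block_lines, index_after_block) without parsing the content."""
        end = start
        while end < len(lines) and (not lines[end].strip()
                                    or lines[end].startswith(" " * indent)):
            end += 1
        return lines[start:end], end

    source = [
        "page =",
        "    <html>",
        "    <body>garbage or not, we do not care yet</body>",
        "    </html>",
        "print(page)",
    ]
    block, after = split_indented_block(source, 1, 4)
    print(block)           # the whole HTML block, content unexamined
    print(source[after:])  # what the core b-tokenizer keeps working on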

Conversely, inside the HTML block, we have a core language block. The beginning is indicated by “${”, but where it ends is not known yet: there could be a “bla bla :-}” in the content where the bracket is part of a string expression and not the end of the block. Hence, we must b-tokenize the content, which yields two outputs: a block of tokens AND the remaining stream, whose parsing will be continued by the HTML b-tokenizer.
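
A sketch of this second, sequential kind of process (the string-literal rule here is deliberately simplistic, and the function name is hypothetical):

    def tokenize_core_until_close(src):
        # Tokenize until the "}" that really closes the block.
        tokens, i = [], 0
        while i < len(src):
            c = src[i]
            if c == "}":                  # the real end of the block
                return tokens, src[i + 1:]
            if c == '"':                  # a "}" inside a string is just data
                j = src.index('"', i + 1)
                tokens.append(("string", src[i:j + 1]))
                i = j + 1
            elif c.isspace():
                i += 1
            else:
                j = i
                while j < len(src) and not src[j].isspace() and src[j] not in '"}':
                    j += 1
                tokens.append(("word", src[i:j]))
                i = j
        raise SyntaxError("unterminated ${...} block")

    tokens, rest = tokenize_core_until_close('"bla bla :-}" } still html')
    print(tokens)  # [('string', '"bla bla :-}"')]
    print(rest)    # ' still html', handed back to the HTML b-tokenizer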

The first kind of b-tokenizer can be reduced to the second, with an empty remaining stream. Why not use this simplification?

The first thing to understand is that the shape of the b-tokenizer’s output, which can be made equivalent in both cases, is unimportant. What matters is the process of the master b-tokenizer. Either it is a parallel process, where the sub b-tokenizer handles a sub-stream which is afterwards concatenated with the rest; or it is a sequential process, where the master b-tokenizer has to wait for the sub b-tokenizer to finish in order to know which remaining stream it should operate on.

So, if we want to take advantage of parallelization, we should keep these two distinct kinds of processes.

Moreover, it has another advantage: error recovery …but we will see that later.

Can I have a summary please?

a block == type + list of tokens and blocks

There exists one b-tokenizer per type of block:

  • Input: a stream
  • Output:
    • either an error or a list of (token or block)
    • the remaining stream

Two kinds of processes for the master b-tokenizer when it meets a “sub-block opener” (see the sketch after this list):

  • call the adequate sub b-tokenizer on the stream, then continue the master on the remaining stream
  • call the adequate sub b-tokenizer on a subset of the stream, and concatenate its output with that of the master applied to what comes after
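
Putting the summary into a Python sketch (hypothetical names again; a b-tokenizer is any function from a stream to a (body, remaining) pair):

    def sequential_master(sub_tokenizer, stream):
        # Second kind: wait for the sub b-tokenizer to learn where the
        # block ends, then continue the master on the remaining stream.
        body, remaining = sub_tokenizer(stream)
        return body, remaining

    def parallel_master(sub_tokenizer, master_tokenizer, block_slice, rest_slice):
        # First kind: the block's extent is already known, so both
        # b-tokenizers can run on their own slices independently (even
        # in parallel); the outputs are concatenated afterwards.
        sub_body, _ = sub_tokenizer(block_slice)        # remaining is empty
        rest_body, remaining = master_tokenizer(rest_slice)
        return sub_body + rest_body, remaining

    # Tiny demo with a trivial word-level b-tokenizer:
    words = lambda s: ([("word", w) for w in s.split()], "")
    print(parallel_master(words, words, "inside the block", "after the block"))
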
Cool, shall we finally get to the implementation?!

Slow down, not yet. There are a few more things to take into account for a complete implementation. …Coming soon.

8 Responses to “Mixing various syntaxes – 2. Tokenizing/parsing generalities”

  1. Winheim Raulsh Says:

    Embedding text expressions in text expressions sounds like a great idea, at first. Then you realize that a GUI would be more useful for the average Joe, and less work for you, than representing the data via a serialized text expression interface, which requires more work to produce and, to top it off, only programmers or other technically sophisticated people would find comprehensible.

    It’s a rare kind of pipe-dream, in that it’s achievable, but the pipe-dream doesn’t consider the pragmatic, practical problems, trade-offs, or diminishing returns associated with the maintenance and adoption of it, as well as the limited scope of viability it offers.

    I mean, sure, it would be great to have it, if we didn’t have all the downsides that render the costs competitive with the rewards, in many situations.

    But maybe I’m wrong, if you find a way to solve all these drawbacks; but that’s essentially what you will have to do, solve these drawbacks, if you want it to be a viable “pipe dream”. I’m still listening and keeping a curious eye open. Don’t let me stop you.

  2. sidewords Says:

    Just to be sure …have you started by reading part 1?

    In fact, I don’t really agree with most of what you said. What I do agree with is that average Joes usually love GUIs; everybody does. Yet, although GUIs are cool and easy for the user, they are also made for a particular task and extremely limited to it. You can do what the GUI is made for, end of line. Can you “program” with GUIs? I don’t think so. Yeah, you can perhaps make SQL by means of complex diagrams and fancy things to annotate your relationships and entities …but let’s face it, writing a line of SQL is easier.

    …In any case, we are not talking about the same thing. Mixing syntaxes and GUIs are distinct and independent matters. For example, I don’t even see how you could do the equivalent of the first small example in the introduction. What you can achieve by mixing syntaxes reaches far beyond what you could ever do with a GUI. I also think it becomes easier to use than a GUI once you must do non-trivial things.

    Moreover, could you please be more specific about the problems you stated above?
    In particular:
    – consider the pragmatic, practical problems, trade-offs, or diminishing returns associated with the maintenance and adoption of it, as well as the limited scope of viability it offers.
    – solve all these drawbacks
    …Which drawbacks are we talking about?

    After all, the programmer gets more expressive, easier to understand code. So easier to make, less maintenance…

    I don’t know why you call it a pipe-dream either since, as stated in the introduction, there are a priori no real obstacles. The complexity increases, sure, and it’s a huge job, but writing any programming language always is!

    …oh, perhaps that’s it! …in fact, this is part of a bigger whole. What is globally meant to be done is a new programming language, which among other things supports the mixing of several syntaxes …but this is only a small part of it. (See the articles referring to LEO.)

    The language as a whole is probably a pipe dream indeed! 😛 …but who knows, I am convinced it would be such a big advance in the programming language world that I can’t abandon it …Even if I know it will fail and nobody will ever care.

  3. Winheim Raulsh Says:

    let’s face it, writing a line of SQL is easier

    Yeah, for programmers, not for users, domain experts, and others. Average people find foreign languages incomprehensible, both formal and natural ones.

    I suppose I’m simply saying that DSLs and layerable syntaxes are not a fix-all. A few, highly utilized language-integrated systems can be good in augmenting the programming language with a significant amount of useful functionality, but you must draw the line, or else you’ll end up creating DSLs for every single library you create, for every type of object you define, simply to make object declarations (data definition) look nice. DSLs are essentially text GUIs when that happens, and the GUIs are more productive for data definition. You know you’ve gone too far with the DSL concept when your DSL only serves the purpose of text formatting.

    How limited the GUI is depends on the GUI. The idea is that it’s easier to churn out GUIs than to create parsers. In fact, GUIs can be auto-generated from data schemas, whereas the correlate for parsers is more complicated. Besides, if you do it right, you can get a GUI to look like syntax-highlighted code, while maintaining structure, allowing automated code navigation, restructuring and other things.

    Think about this: is it worth creating a DSL just to define a special date syntax? The syntax doesn’t add functionality, and probably doesn’t integrate with functions you would invoke to compute parts of the date, which will require you to go back to not using syntax. In this case, since you can’t use functions with the date syntax anyway, it would have been better to have a date GUI to pick the date. Writing a parser, just for a date-literal expression?

    After all, the programmer gets more expressive, easier to understand code. So easier to make, less maintenance…

    Less maintenance for programs, not for the array of parsers, interpreters, etc.

    What you can achieve by mixing syntaxes reaches far beyond what you could ever do with a GUI.

    The opposite is so clear to me: that GUIs can do so much more than text. Witness JetBrains’ MPS, Intentsoft’s projectional editor, and Subtext. These aren’t your average data-entry GUIs, of course, but clearly we can see from these that GUIs have more to offer when they’re tailored for the task. They simply have a wider range of expression than text, even formatted text, does.

    Having DSLs wouldn’t be so bad if we could make them up as quickly as we can on paper, but the fact is that we have to write parsers and interpreters for them, so the value and leverage of the DSL must exceed the costs (in time and effort). Not all things deserve to have a DSL.

    From what I’ve seen, Ruby takes a pragmatic approach: instead of having to completely write your own DSLs, the syntax is so flexible that you simply write some text-formatting snippets of code to achieve the look you’re going for. This approach tries to minimize cost and maximize flexibility. Yet, text still cannot out-do GUIs.

  4. sidewords Says:

    I agree, it’s just a question of where you focus. Of course GUIs are easier for users (they only use GUIs! :P). Yet, the target audience I had in mind was programmers. After all, a programming language is for programmers, and those are the ones who write the programs. Users don’t care about this stuff.

    GUIs can, in some sense, be seen as the result of the programming. It’s the “end user” product …while also being used as tools by the programmers themselves.

    but you must draw the line, or else you’ll end up creating DSLs for every single library you create, for every type of object you define, simply to make object declarations (data definition) look nice.

    In fact, I would find it terribly interesting, just out of curiosity, to see whether this would enhance or degrade productivity, if only for the sake of a research experiment.

    I have no idea myself whether having sub-syntaxes would turn out to be extremely useful or just a crappy feature no one cares about. It’s more that it’s feasible, that the complexity is reasonable, so why not include it in the language. (Yeah, of course that’s work for the PL developer!)

    The opposite is so clear to me: that GUIs can do so much more than text …you can get a GUI to look like syntax-highlighted code, while maintaining structure, allowing automated code navigation…

    Well, the very funny thing is that in order to have syntax highlighting, code navigation, etc., you have to parse the text! 😛 That is exactly what IDEs do! They parse the text into some internal representation in order to highlight it, report errors, show tooltips, navigate, give an overview, etc.

    If a GUI wouldn’t parse, it could only show plain raw text in a box, which is fairly limited.

    GUIs can be powerful indeed, but if you ask me, they are also very complex to build. I think any GUI is more complex to build than the corresponding console tool doing the same job. In any case.

    Again, of course, GUIs are highly needed because of their ease of use and convenience for the user.

  5. sidewords Says:

    As a last sentence, I would like to mention again that these are orthogonal features. GUIs and extended syntax have definitely different purposes.

  6. Winheim Raulsh Says:

    It’s true: we would need research done on the productivity of niche-DSLs to determine conclusively when to avoid DSLs.

    Well, the very funny thing is that in order to have syntax highlighting, code navigation, etc., you have to parse the text! That is exactly what IDEs do! They parse the text into some internal representation in order to highlight it, report errors, show tooltips, navigate, give an overview, etc.

    Read this. I must repeat my comment earlier:

    Witness JetBrains’ MPS, Intentsoft’s projectional editor (here and here), and Subtext (here). Read this also. These aren’t your average data-entry GUIs, of course, but clearly we can see from these that GUIs have more to offer when they’re tailored for the task. They simply have a wider range of expression than text, even formatted text, does.

    GUIs can be powerful indeed, but if you ask me, they are also very complex to build. I think any GUI is more complex to build than the corresponding console tool doing the same job. In any case.

    You and I have different backgrounds. I come from Delphi, where GUIs are a walk in the park (it works a lot like VB forms). Whichever one is harder depends on what tools are available.

    As a last sentence, I would like to mention again that these are orthogonal features. GUIs and extended syntax have definitely different purposes.

    But what is text, but an interface? User input in the form of typing, with display in the form of drawing text. Text is simply one possibility for a code development interface. Text IS a GUI. Think about it. When you write code, you are entering data using the keyboard, just like in a GUI. A GUI is a more graphical language, and text is a more restricted subset of the possibilities for graphical languages. Text is for the data entry of code. GUIs are for the data entry of data, but isn’t code data? Sure, a more complicated kind of data, but data nonetheless.

  7. Winheim Raulsh Says:

    STUPID BROWSER DELETED MY COMMENT! AAAAAAAAAUUAUUUUUGGGHGHHH!

    Here are some links you can digest:

    Language Workbenches: The Killer-App for Domain Specific Languages?

    Language Oriented Programming

    JetBrains’ MPS

    Subtext
    Subtext v2 Demo

    A story of four benches – Intentional software blog

  8. bluestorm Says:

    What you describe is already available in several different forms. Macros in Lisp languages have long been used to provide similar benefits. You could also have a look at Camlp4, a preprocessor for the OCaml language that allows precisely this (your example would be easily ported to Camlp4, except one would use <:html> instead of syntax:html …):
    http://brion.inria.fr/gallium/index.php/Camlp4
    http://brion.inria.fr/gallium/index.php/Quotation
