Archive for the ‘LEO’ Category

Mixing various syntaxes – 2. Tokenizing/parsing generalities

February 22, 2009

You surely know the usual way a compiler works: you get a stream of characters, transform it into tokens, and then parse those according to some grammar. Of course, real life is usually not that simple, and various deviations or hacks have to be applied; however, this won’t be discussed here. The first question should be:

So are we going to do it the same way? Tokenize the whole stream at once and then parse all tokens?

First, the languages are tokenized in different ways. For example, “<<” might be tokenized as a single token “<<”, as two tokens “<” and “<”, or even rejected as an error, depending on the language being tokenized.
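To make this concrete, here is a tiny sketch in Python (purely hypothetical tokenizers, not part of any real implementation) showing how the same input yields different tokens depending on the language’s rules:

def tokenize_cpp_like(src):
    """Greedy rule: "<<" forms a single shift-operator token."""
    tokens, i = [], 0
    while i < len(src):
        if src.startswith("<<", i):
            tokens.append("<<")
            i += 2
        else:
            tokens.append(src[i])
            i += 1
    return tokens

def tokenize_single_angle(src):
    """Here "<" is always its own token, so "<<" becomes two tokens."""
    return list(src)

print(tokenize_cpp_like("a<<b"))      # ['a', '<<', 'b']
print(tokenize_single_angle("a<<b"))  # ['a', '<', '<', 'b']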

Different languages, different tokenization processes and tokens. And this is of course only the tip of the iceberg. A naive universal tokenizer/parser handling all languages would definitely explode in complexity and become unmaintainable.

…But before making something complicated, isn’t there a simple way?

Indeed, let us first observe the situation. We have the core language, and embedded in it are blocks written in one or more sub-languages. These blocks are at some moment evaluated/run. This raises several questions.

-> Is it really needed to tokenize/parse it at all?

One could think it would be sufficient to store the html-page-block or the shell-script-block or whatever embedded sub-language block as is, in one big string in a variable. Of course, this is not what we mean to do. We want syntactic & semantic checks performed on it. We want the compiler to confirm that our HTML structure is correct, that every opening tag has a matching closing tag, and so on, rather than having it break at runtime.

-> Why not simply take the big text block and hand it to an independent compiler/checker/interpreter?

Although this sounds nice and simple, it has one big issue. Sub-languages can typically include core language expressions. This means calling back into the core language compiler/interpreter, which in turn needs information about the context. If we pass the “big string html page” around in the program and at some point try to display it, then ${f(42)} must be evaluated. But f might not be available in the context of the current execution point of the program, or might even refer to a different function called “f”.
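As a toy illustration of this (hypothetical Python code, standing in for the core language), here is what goes wrong when the page is kept as a plain string and the embedded expression is only evaluated at display time:

def make_page():
    def f(x):
        return x * 2 + 3              # the f the author meant
    return "The answer is ${f(42)}!"  # just a string: f is not captured

def display(page):
    def f(x):
        return -1                     # a different f in display's scope
    # Naive late evaluation picks up the wrong f:
    return page.replace("${f(42)}", str(f(42)))

print(display(make_page()))  # prints "The answer is -1!" instead of 87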

Hence, either we find a way to parse the whole thing at once, or we build independent compilers/interpreters with a way of passing context and calling each other back.

So, how do we do it?

Let us use our imagination to try to simplify things. First, it is reasonable to assume that the boundaries of sub-language blocks are easy to detect. For example, it could be an indented block, something in brackets, and so on. There should be no need for advanced parsing & semantics to detect the limits. Let’s call what lies between two such limits a “block”. It is indeed advisable that blocks are easily identified, for humans as well.

So perhaps we should make a kind of tokenizer which outputs either tokens or blocks; we will call this a b-tokenizer. Blocks have a given type and themselves contain blocks and tokens. And to each block type is associated another b-tokenizer, in charge of splitting its stream.
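Here is a minimal sketch, in Python with hypothetical names, of what the b-tokenizer’s output could look like: plain tokens and typed blocks, where blocks recursively contain tokens and blocks of their own, and each block type maps to its own b-tokenizer:

from dataclasses import dataclass, field

@dataclass
class Token:
    text: str

@dataclass
class Block:
    type: str                                  # e.g. "core", "html"
    items: list = field(default_factory=list)  # of Token or Block

# One b-tokenizer per block type, each in charge of splitting
# the stream handed to it; entries are registered per language.
B_TOKENIZERS = {}   # maps a block type to its b-tokenizer function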

In fact, by doing this, we blur the line between tokenizing and parsing. Indeed, we could for example say that “(” is a symbol opening a new block whose inside should be parsed as the core language. Parenthesis grouping can be done in both; whether it is done in the enhanced tokenizer or in the parser, the complexity is just shifted from one side to the other. Which option is preferable is an open question.

There are two kinds of b-tokenizers. Let us look again at the initial example.

Concerning the embedded HTML part, we know clearly where the block starts and ends based on the indentation, even if the content is garbage. Hence, the whole block stream can be passed to the HTML b-tokenizer as input, while the stream after it can independently continue to be b-tokenized (by the core language).

Conversely, inside the HTML block, we have a core language block. The beginning is indicated by “${”, but where it ends is not known yet… indeed, there could be a “bla bla :-}” in the content, where the bracket is part of a string expression and not the end of the block. Hence, we must b-tokenize the content, which results in two outputs: a block of tokens AND the remaining stream, which the HTML b-tokenizer then continues to parse.
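A sketch of this second kind, again in hypothetical Python: the tokenizer consumes the stream after “${” until the matching “}” that is not inside a string literal, and returns both its tokens and the remaining stream:

def b_tokenize_core(stream):
    """Returns (tokens, remaining stream), or (None, stream) on error."""
    tokens, i, in_string = [], 0, False
    while i < len(stream):
        c = stream[i]
        if c == '"':
            in_string = not in_string
        elif c == "}" and not in_string:
            return tokens, stream[i + 1:]  # hand the rest back
        tokens.append(c)  # real tokenization omitted: one char = one token
        i += 1
    return None, stream                    # error: unterminated block

tokens, rest = b_tokenize_core('f(42)} </body>')
print(tokens)   # ['f', '(', '4', '2', ')']
print(rest)     # " </body>"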

The first kind of b-tokenizer can be reduced to the second kind by returning an empty remaining stream. Why not use this simplification?

The first thing to understand is that the output shape of the b-tokenizer, which can be made equivalent, is unimportant. What matters is the process of the master b-tokenizer. Either it is a parallel process, where the sub b-tokenizer handles a sub-stream which is afterwards concatenated with the rest; or it is a sequential process, where the master b-tokenizer has to wait for the sub b-tokenizer to finish in order to know which remaining stream it should operate on.

So, if we want to take advantage of parallelization, we should keep these two distinct kinds of processes.

Moreover, keeping them distinct has another advantage: error recovery… but we will see that later.

Can I have a summary please?

a block == type + list of tokens and blocks

There exists one b-tokenizer per type of block:

  • Input: a stream
  • Output:
    • either an error or a list of (token or block)
    • the remaining stream

Two kinds of processes for the master b-tokenizer when meeting a “sub-block opener” (both sketched below):

  • call the adequate sub b-tokenizer on the stream, then continue the master on the remaining stream
  • call the adequate sub b-tokenizer on a slice of the stream, and concatenate its result with the master applied to what comes after
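To pin the summary down, here is a sketch (hypothetical Python code) of the two master processes, using the signature stream -> (items or error, remaining stream):

def master_sequential(stream, sub_tokenizer):
    # Second kind: the master must WAIT for the sub b-tokenizer
    # to learn which remaining stream it should continue on.
    items, remaining = sub_tokenizer(stream)
    return items, remaining

def master_parallel(stream, end, sub_tokenizer, master_tokenizer):
    # First kind: the block's extent is known up front (e.g. from
    # indentation), so the sub-stream and the rest can be tokenized
    # independently, even in parallel, and concatenated afterwards.
    sub_items, _ = sub_tokenizer(stream[:end])
    rest_items, remaining = master_tokenizer(stream[end:])
    return sub_items + rest_items, remaining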
Cool, shall we finally move on to the implementation?!

Slow down, not yet. There are a few more things to take into account for a complete implementation. …Coming soon.

Mixing various syntaxes – 1. Introduction

February 19, 2009

What is this post about?

Two years ago, I participated in an introductory company course, with among other things a little refresher class on C++. There, an obviously innocent girl asked me this stupid question:

“Why can’t you write an SQL statement directly into your [C++] code?”

So, my immediate, knee-jerk answer was:

“Because you must write C++ code! You can’t put an SQL statement like this in it! You have to use C++ language constructs!”

But, as the good old saying goes, “there are no dumb questions”. And perhaps a more fitting answer would have been:

“Because C++ does not support this.”

Obvious, isn’t it? It’s a matter of point of view. Like in most other programming languages, there are no mechanisms to extend the syntax or mix different kinds of syntax.

And this is what this article is about: developing a programming language which supports mixing different syntaxes, and where you can add your own. For example, we could imagine a core programming language able to mix within itself blocks of XML, shell scripts, free text, etc., and these would actually be parsed and checked independently, as if they were inherently part of the language. They would be sub-languages embedded in the core language. But this could also go the other way round, where a piece of core-language code would be placed inside the XML, shell script, free text or whatever, leading to a complete and sound system.

This article will go thoroughly from the basic concepts down to a practical implementation of such a sample programming language, able to mix its own core constructs with XML and free text. The sub-languages chosen are relatively simple in order to keep the examples and the final implementation understandable. Of course, the approach can be extended to languages of any complexity.

Isn’t it too complicated to mix languages?!

On one hand, parsers & interpreters for nearly every language already exist individually. This naturally leads to the thought: “Can’t we simply bring them together and jump each time to the right parser depending on the language we are currently parsing?”. Although this is indeed the basic idea, we would like to point out two catches.

First, and most importantly, it must make sense to mix the languages. Since language statements have some meaning, what they express should make sense at the place where they are used. Placing an SQL statement in the middle of a class declaration, for example, would be nonsense. Instead, it should only be allowed where an expression fits, for instance as the right-hand side of an equality.

Secondly, mixing the parsers is not that simple. The context must be tracked, the scope defined, ambiguities avoided, and so on. But you will discover all this in detail throughout this article. It leads to new challenges, and the whole process of “parsing” must be rethought carefully.

Ok, but before going on …Is it really worth it?

YES! Definitely!

Since this sounds abstract, let us illustrate with a few practical examples what can be done with it. To keep things simple, let us imagine a core language able to mix itself with HTML. We will refer to HTML as a “sub-language”, since it is embedded in the core language.

A practical example would be:

function f(x) = x * 2 + 3
myWebPage = syntax:html
    <html>
        <body>
            The answer is ${f(42)}!
        </body>
    </html>

This very short example illustrates the cohabitation of two different languages. The syntax:html tag indicates that the indented block below is an HTML block and should be interpreted as such. Strictly speaking, it is not pure HTML: we added one feature to it, “${…}”. It replaces what’s inside the brackets by the result of the expression, evaluated within the current scope. It’s a piece of core language inside the HTML.
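As a rough illustration of the “${…}” feature, here is a hypothetical render function in Python that substitutes each expression with its value, evaluated in the given scope (eval stands in for the core language’s evaluator; this is a sketch, not the real mechanism):

import re

def render(template, scope):
    """Replace every ${...} by the value of its expression in scope."""
    def splice(match):
        return str(eval(match.group(1), scope))  # illustration only
    return re.sub(r"\$\{(.+?)\}", splice, template)

scope = {"f": lambda x: x * 2 + 3}
print(render("The answer is ${f(42)}!", scope))  # The answer is 87!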

The advantages are obvious. Being able to mix different languages raises expressivity to a new order of magnitude. Moreover, it can make interfacing with outside systems/libraries easier in some cases. Using this, you can use DSLs, various data formats, shell commands, all within one homogeneous world. To illustrate the power a little, let us show some more of the interactions you can have between our imaginary core language & HTML. Keep in mind, however, that the same applies to all other sub-languages.

Here are some examples of things you could do:

  • The core language correctness is verified at compile time
  • The HTML correctness is verified at compile time
  • You can insert core language expressions inside HTML blocks
  • Variables holding HTML blocks are like any other variable: you can pass them around, concatenate them with others, and so on…
  • You can write core language functions that return pieces of HTML, to:
    • Transform a piece of data (like a book or an employee) into an HTML representation
    • Create an HTML form based on some parameters
  • You can write core language functions that transform pieces of HTML (a sketch follows this list), to:
    • Add a fancy frame around an HTML block
    • Replace all instances of “day” by “night”
  • And much much more…
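To make that last group concrete, here is a sketch in Python (all names hypothetical) of a transformation over an HTML block held as core language data, rewriting its text nodes:

from dataclasses import dataclass

@dataclass
class Element:
    tag: str
    children: list   # of Element or str (text nodes)

def day_to_night(node):
    """Walk an HTML block held as data and rewrite its text nodes."""
    if isinstance(node, str):
        return node.replace("day", "night")
    return Element(node.tag, [day_to_night(c) for c in node.children])

page = Element("body", ["What a beautiful day!"])
print(day_to_night(page))
# Element(tag='body', children=['What a beautiful night!'])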

If you are not convinced yet, just take a quarter of an hour to think of all the things you can do by mixing in all the other sub-languages as well!

Basically, you have a universal language, with the ability to let all these languages interact in extremely powerful ways.

And how are you going to represent all this stuff internally?

Good question. Basically, you can handle the HTML block, or whatever sub-language block, like any other data. As such, you can pass it around, modify it, create new ones, and play with it as you like. Looking closer at:

myWebPage = syntax:html
    ...

What it says is in fact: “interpret the indented block below as HTML”. In other words, the block below will be parsed, checked, verified, etc. according to the HTML specification. The result of all this parsing, however, is like any other core language data, so that it “fits” inside the variable. In other words, the “syntax:html” parser takes a text block as input and outputs a corresponding core language data value. Hence, under the hood, there is a bijective mapping between HTML text and its core language representation: parsing one way, rendering back the other. One must deeply understand that HTML is just a representation of data, nothing more. As such, it can be handled like any other data. The only thing we provide is a way to “write” the data in a customized representation/syntax directly in the language.
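Here is a minimal sketch of such a parse/render pair (hypothetical names; Python’s html.parser stands in for the real “syntax:html” machinery, and attributes are ignored for brevity). The two functions invert each other on well-formed input:

from html.parser import HTMLParser

class BlockBuilder(HTMLParser):
    """Collects a nested (tag, children) structure from HTML text."""
    def __init__(self):
        super().__init__()
        self.stack = [("root", [])]
    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        self.stack.pop()
    def handle_data(self, data):
        self.stack[-1][1].append(data)

def html_parse(text):
    builder = BlockBuilder()
    builder.feed(text)
    return builder.stack[0][1]   # children of the implicit root

def html_render(nodes):
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(node)
        else:
            tag, children = node
            out.append("<" + tag + ">" + html_render(children) + "</" + tag + ">")
    return "".join(out)

src = "<body>The answer is 87!</body>"
assert html_render(html_parse(src)) == src   # round trip preserves the text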

Of course, not every sub-language represents data. For example, a shell script clearly executes commands, another DSL could describe some transformation process, etc. Hence the need for the core language to support basic constructs to store data, functions, actions, or combinations of those. Indeed, anything can be mapped onto one of these: shell commands would be transformed into an action, mathematical transformations into a function, and so on. What counts is that each referenced sub-language has a “syntax recognizer” which reads the raw text and transforms it into the adequate core language construct.
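A sketch of such a registry of syntax recognizers (all names hypothetical): each sub-language maps to a reader that turns raw text into an adequate core construct, here data for HTML and a deferred action for shell scripts:

import subprocess

def recognize_html(text):
    return ("data", text)   # would be parsed into an HTML data value

def recognize_shell(text):
    # The script becomes an action: a thunk to be run on demand.
    return ("action", lambda: subprocess.run(text, shell=True))

RECOGNIZERS = {"html": recognize_html, "shell": recognize_shell}

def read_block(language, text):
    return RECOGNIZERS[language](text)

kind, value = read_block("shell", "echo hello")
# kind == "action"; calling value() would execute the script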

Wanna know more about implementing this stuff?

Bump me so that I write the next section. (There should be around four sections in total.)

LEO’s influences

August 17, 2008

LEO is a new programming language I’m currently working on, as several articles in this blog show. Of course, it does not come from nothing but is influenced by several languages. “Visiting” other programming languages is like visiting different countries. They have a different flavour, a different way of living, and most of the time you’ll notice remarkable things that you like and would carry with you. The same holds for programming languages. Thus, I will state here the main influences from other programming languages, in alphabetical order to avoid jealousy.

Erlang

In Erlang, which is very complete in itself, I especially like:

  • the generators and filters
  • that words are values by themselves
  • …other things I have yet to explore, since this is the last language I picked up and am still learning.

Haskell

In a few words, Haskell is elegant, sharp and sometimes a bit too terse.

I especially like:

  • full laziness (thus, the order of your statements does not matter anymore; the code becomes more descriptive) …what is funny, though, is that everyone tends to write in a monadic style, which is kind of an emulation of stateful sequential programming. I find it a bit odd to what extent these are used.
  • where clauses. This seems like a mere detail, but putting the details inside a where clause helps you (in my view) to better structure your code. In mainstream programming languages, you write your code bottom-up. Here you do it the other way round: the main task is outside, and it is broken down inside the where clause.
  • equational reasoning. The fact that you can replace the left-hand side of an equality by its right-hand side (and vice versa) in the code is cool. It improves understandability.
  • Type classes!

What I’m not fond of:

  • Sometimes a bit terse and not always easy to understand
  • I am not yet convinced by partial function application
  • The fact that lists can only contain elements of the same type is both a blessing and a curse

Python

Pros:

  • It’s easy
  • It’s intuitive
  • It’s readable

Cons:

  • Look at what you can do in other languages and you’ll see there is a lot of ground left to cover
  • Too slow for computation intensive stuff

Scheme

Ah, this funny last one is pretty amusing. So minimalistic, yet you can do absolutely everything with it. When I’m programming in it, I feel like a small boy playing around.

What I like a lot:

  • Program is data and data is program. Quoting things and evaluating quoted things.
  • Any identifier is valid 🙂
  • Incredibly simple, yet so extensible.
  • It’s simply neat.

What bugs me:

  • I don’t feel as productive as with others
  • I don’t find it as readable.

Smalltalk

Experience the IDE and you won’t want any other. Here you don’t have source code files; you have a program “image”. And the IDE understands it as a whole and lets you navigate and interact with it in an unmatched way.

Also, the structure of the program is like assembling Lego blocks: everything is combined from small chunks and is part of bigger chunks. In the IDE, it is like playing with a looking glass where you zoom in and out on different objects in the program.

Conclusion

So what would my ideal language be? …Well, no one cares anyway.

LEO – Main page

June 20, 2008

This is the temporary name for “my new programming language”.

Presentations

A short appetizer & introduction

The paradigms unification

Reference manual

– Overview

– Equality & assignment

– Functions

– Actions

– Types and data structures

– Objects

– Classes

– Agents

– Behaviors

– Meta-programming

Implementation drafts

– Overview

– Choice of implementation language

– Reading Leo source code

– Tokenizer
Leo – Scheme

– Parser
Leo – Scheme

Interpreter for:

– Phase 1 – arithmetic expressions

– Phase 2 – equalities, assignment and printing

– Phase 3 – functions

– Phase 4 – …we will see