Mixing various syntaxes – 1. Introduction

What is this post about?

Two years ago, I took part in an introductory company course which included, among other things, a little refresher class on C++. There, an obviously innocent girl asked me what seemed like a stupid question:

“Why can’t you write an SQL statement directly into your [C++] code?”

So my immediate and obviously blinkered answer was:

“Because you must write C++ code! You can’t put an SQL statement like this in it! You have to use C++ language constructs!”

But, as the good old saying goes, “there are no dumb questions”. Perhaps a more fitting answer would have been:

“Because C++ does not support this.”

Obvious, isn’t it? It’s a matter of point of view. C++, like most other programming languages, has no mechanism to extend its syntax or to mix different kinds of syntax.

And this is what this article is about: developing a programming language which supports mixing different syntaxes, and to which you can add your own. For example, we could imagine a core programming language able to mix within itself blocks of XML, shell scripts, free text, etc. These blocks would actually be parsed and checked independently, as if they were inherently part of the language. They would be sub-languages embedded in the core language. But this could also go the other way round, where a piece of core language code would be placed inside the XML, shell script, free text or whatever, leading to a complete and sound system.

This article will go thoroughly from the basic concepts down to a practical implementation of a sample programming language able to mix its own core constructs with XML and free text. The sub-languages chosen are relatively simple in order to keep the examples and the final implementation understandable. Of course, this approach can be extended to languages of any complexity.

Isn’t it too complicated to mix languages?!

On the one hand, parsers and interpreters exist individually for nearly all languages. This naturally leads to the thought: “Can’t we simply bring them together and jump each time to the right parser depending on the language we are currently parsing?” Although this is indeed the basic idea, we would like to point out two catches.

First, and most importantly, it must make sense to mix the languages. Since language statements carry meaning, what they express should make sense at the place where they are used. Placing an SQL statement in the middle of a class declaration, for example, would be nonsense. Instead, it should only be allowed where an expression fits, for instance as the right-hand side of an assignment.

Secondly, mixing the parsers is not that simple. The context must be tracked, the scope defined, ambiguities avoided, and so on. This leads to new challenges, and the whole process of “parsing” must be carefully rethought. You will discover all this in detail over the course of this article.

OK, but before going on… is it really worth it?

YES! Definitely!

Since this sounds abstract, let us illustrate with a few practical examples what can be done with it. To keep things simple, let us imagine a core language able to mix itself with HTML. We will refer to HTML as a “sub-language”, since it is embedded in the core language.

A practical example would be:

function f(x) = x * 2 + 3
myWebPage = syntax:html
    <html>
    <body>
        The answer is ${f(42)}!
    </body>
    </html>

This very short example illustrates the cohabitation of two different languages. The syntax:html tag is used to indicate that the indented block below is an HTML block and should be interpreted as such. Strictly speaking, it is not pure HTML: we added the “${…}” feature to it. It replaces what is inside the braces with the result of evaluating that expression within the current scope. It’s a piece of core language inside the HTML.
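To make the “${…}” idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions: the name interpolate is made up, the HTML block is a plain string, and Python's eval stands in for the core language's expression evaluator.

```python
import re

# Matches ${...} and captures the expression between the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]*)\}")

def interpolate(block, scope):
    """Replace each ${expr} in block by the value of expr evaluated in scope.

    Hypothetical helper: eval is only a stand-in for a real core language
    evaluator, and builtins are disabled to keep the sketch contained.
    """
    def evaluate(match):
        return str(eval(match.group(1), {"__builtins__": {}}, scope))
    return PLACEHOLDER.sub(evaluate, block)

def f(x):
    return x * 2 + 3

# The scope plays the role of "the current scope" of the core language.
page = interpolate("<body>The answer is ${f(42)}!</body>", {"f": f})
print(page)  # <body>The answer is 87!</body>
```

A real implementation would parse the expression with the core language parser instead of handing it to eval, so that errors inside the braces are caught at compile time too.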

The advantages are obvious. Being able to mix different languages makes a language far more expressive. Moreover, it can make interfacing with outside systems and libraries easier in some cases. With it, you can use DSLs, various data formats and shell commands, all within one homogeneous world. To give a better idea of its power, let us look a little more at the interactions you can have between our imaginary core language and HTML. Keep in mind, however, that the same applies to all other sub-languages.

Here are some examples of things you could do:

  • The core language correctness is verified at compile time
  • The HTML correctness is verified at compile time
  • You can insert core language expressions inside HTML blocks
  • Variables holding HTML blocks are like any other variables: you can pass them around, concatenate them with others, and so on…
  • You can make core language functions that return pieces of HTML to:
    • Transform a piece of data (like a book or an employee) into an HTML representation
    • Create an HTML form based on some parameters
  • You can make core language functions that transform pieces of HTML to:
    • Add a fancy frame around an HTML block
    • Replace all instances of “day” by “night”
  • And much much more…
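A few of the items above can be sketched in Python. This is a rough illustration only: plain strings stand in for HTML values here, whereas in the language described in this article they would be structured, checked data. All function names are made up for the example.

```python
from html import escape

def book_to_html(book):
    """Transform a piece of data (a book) into an HTML representation."""
    return "<li><b>%s</b> by %s</li>" % (escape(book["title"]), escape(book["author"]))

def add_frame(fragment):
    """Add a (not so fancy) frame around an HTML block."""
    return '<div class="frame">%s</div>' % fragment

def day_to_night(fragment):
    """Replace all instances of "day" by "night".

    Naive: on a plain string this would also touch tag or attribute names
    containing "day"; a structured HTML value would avoid that.
    """
    return fragment.replace("day", "night")

item = book_to_html({"title": "Dune", "author": "F. Herbert"})
framed = add_frame(item)
greeting = day_to_night("<p>Have a nice day</p>")
print(framed)
print(greeting)
```

The limitation noted in day_to_night is precisely why it pays off to treat the HTML as parsed data rather than as text.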

If you are not convinced yet, just take a quarter of an hour to think of all the things you could do by mixing it with all the other sub-languages as well!

Basically, you get a universal language, in which the sub-languages can interact in extremely powerful ways.

And how are you going to represent all this stuff internally?

Good question. Basically, you can handle an HTML block, or a block of any other sub-language, like any other data. As such, you can pass it around, modify it, create new ones and play with it as you like. Look more closely at:

myWebPage = syntax:html
    ...

What it says is in fact “interpret what is in the indented block below as HTML”. In other words, the block below will be parsed, checked, verified, etc. according to the HTML specification. The result of all this parsing, however, is just like any other core language data, so that it “fits” inside the variable. Put differently, the “syntax:html” parser takes a text block as input and outputs a corresponding core language data type. Hence, under the hood, there is a bijective mapping between HTML and core language data. One must really understand that HTML is just one representation of data, nothing more. As such, it can be handled like any other data. The only difference is that we provide a way to “write” the data in a customized representation/syntax directly in the language.
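As a rough sketch of this “text block in, data out” idea, here is how it could look in Python using the standard html.parser module. The assumptions are mine: nested dicts and strings stand in for the core language data type, attributes are ignored, void tags such as an unclosed br are not handled, and the names parse_html and to_html are made up.

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Builds nested dicts of the form {"tag": name, "children": [...]}."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "root", "children": []}
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}  # attributes ignored in this sketch
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # The "checking" part: a closing tag must match the innermost open tag.
        if len(self.stack) == 1 or self.stack[-1]["tag"] != tag:
            raise ValueError("unexpected closing tag </%s>" % tag)
        self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data.strip())

def parse_html(text):
    """Text block in, plain data out."""
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    if len(builder.stack) != 1:
        raise ValueError("unclosed tag <%s>" % builder.stack[-1]["tag"])
    return builder.root["children"]

def to_html(nodes):
    """The way back: plain data in, HTML text out."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(node)
        else:
            parts.append("<%s>%s</%s>" % (node["tag"], to_html(node["children"]), node["tag"]))
    return "".join(parts)

tree = parse_html("<body><p>Hello</p></body>")
print(tree)
print(to_html(tree))
```

Once the block is in this form, it really is just data: you can walk it, change it or build new trees by hand, and write it back out as HTML.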

Of course, not all sub-languages represent data. For example, a shell script clearly executes commands, another DSL might describe some transformation process, etc. Hence the need for the core language to support basic constructs to store data, functions, actions or combinations of those. Indeed, anything can be transformed into one of those. Shell commands would be transformed into an action, mathematical transformations into a function, and so on. What counts is that the referenced sub-language has a “syntax recognizer” which reads the raw text and transforms it into the appropriate core language construct.
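Here is one possible way to picture such syntax recognizers, sketched in Python. Everything in it is an assumption made for illustration: the registry, the embed function and the three sub-language names ("json" for data, "arith" for a function of x, "shell" for an action) are made up, and Python values, lambdas and eval stand in for core language constructs.

```python
import json
import shlex
import subprocess

RECOGNIZERS = {}  # sub-language name -> function(raw text) -> core construct

def recognizer(name):
    """Register a syntax recognizer under a sub-language name."""
    def register(fn):
        RECOGNIZERS[name] = fn
        return fn
    return register

@recognizer("json")
def recognize_json(text):
    # Data: the text becomes an ordinary value.
    return json.loads(text)

@recognizer("arith")
def recognize_arith(text):
    # Function: the text becomes something you can apply to an argument x.
    code = compile(text.strip(), "<arith>", "eval")  # syntax is checked here
    return lambda x: eval(code, {"__builtins__": {}}, {"x": x})

@recognizer("shell")
def recognize_shell(text):
    # Action: nothing runs until the returned callable is invoked.
    argv = shlex.split(text)
    return lambda: subprocess.run(argv, capture_output=True, text=True).stdout

def embed(syntax, text):
    """What a 'syntax:NAME' block could boil down to."""
    return RECOGNIZERS[syntax](text)

book = embed("json", '{"title": "Dune"}')
double_plus_three = embed("arith", "x * 2 + 3")
list_files = embed("shell", "ls -l")  # an action, deliberately not executed here

print(book["title"], double_plus_three(42))
```

Note that a malformed block is rejected when embed is called, not when the resulting function or action is used, which is the Python approximation of checking the sub-language at compile time.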

Wanna know more about implementing this stuff?

Bump me so that I write the next section. (There should be around 4 sections in total.)
