Archive for the ‘Programming languages’ Category

PL sucks…

September 7, 2009

C++: …ah …the good old gdb debugger 🙂 …oh, wait, what’s that strange unknown error …I still don’t understand why it crashes there, WTF! :/

Java: …erh …no …I can’t pass a function as a parameter …silly me, I have to repeat “Java’s way” to myself again: “objects …only objects exist in the world …forget everything else …use objects even if it sucks in that case”

Haskell: …eerrm …eh …huh …what does that brilliant code that no one understands actually mean again? …let me focus on these 20 lines for 3 hours to figure it out

PL sucks

August 30, 2009

C++: .h & .cpp? …yeah, right, it would be too simple to define a class in a single file.

Java: I can’t have an int as value or key in a hash table?! …meh …let’s create a container object for the millions of integers I have to use!

Haskell: instead of world.countries["Antarctica"].weatherStats.averageTemperature you have:
averageTemperature (weatherStats (lookup "Antarctica" (countries world)))
…well, I don’t like it
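For what it’s worth, the chain can at least be written with composition. A self-contained sketch (the types are invented to match the example, and the lookup’s Maybe shows up in the result); whether it reads better is a matter of taste:

import qualified Data.Map as Map

data World        = World { countries :: Map.Map String Country }
data Country      = Country { weatherStats :: WeatherStats }
data WeatherStats = WeatherStats { averageTemperature :: Double }

-- the Maybe comes from the lookup possibly failing
avgTemp :: World -> Maybe Double
avgTemp w = averageTemperature . weatherStats <$> Map.lookup "Antarctica" (countries w)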

Rant of the day…

August 28, 2009

…it looks like all programming languages suck! …a lot!!!

Today’s examples:

C++: you have a strange crash and no message/stack trace? welcome to C++!

Java: what, I can’t put a static method in an interface? grrr

Haskell: small exercise: given a text, build a dictionary counting the number of occurrences of each word …pretty ugly compared to a classic for loop, huh?
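For the record, here is one compact attempt using Data.Map (just a sketch among many …judge the ugliness for yourself):

import qualified Data.Map as Map

-- count the occurrences of each word in a text
wordCount :: String -> Map.Map String Int
wordCount text = Map.fromListWith (+) [(w, 1) | w <- words text]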

Python: …this is where the “agile” programming techniques come from: “write a bit, hope, test a lot!” …it’s a bit too loose for my taste

…I feel like I could publish annoyances like these every day

Mixing various syntaxes – 2. Tokenizing/parsing generalities

February 22, 2009

You surely know the usual way a compiler works: you get a stream of characters, transform it into tokens, and then parse these according to some grammar. Of course, real life is usually not that simple and various deviations or hacks have to be applied. However, this won’t be discussed here. The first question should be:

So are we going to do it the same way? Tokenize the whole stream at once and then parse all tokens?

First, the languages are tokenized in different ways. For example, "<<" might be tokenized as a single token "<<", as two tokens "<", "<", or even as an error, depending on the language being tokenized.
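As a toy illustration, here is a sketch of one tokenizing step in C++ style (token names invented):

-- in C++ "<<" is one shift token; a lone "<" is a comparison;
-- another language's tokenizer might reject both
data CppToken = ShiftLeft | LessThan deriving Show

nextCppToken :: String -> Maybe (CppToken, String)
nextCppToken ('<':'<':rest) = Just (ShiftLeft, rest)
nextCppToken ('<':rest)     = Just (LessThan, rest)
nextCppToken _              = Nothing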

Different languages, different tokenization processes, different tokens. And this is of course only the tip of the iceberg. A naive universal tokenizer/parser handling all languages would definitely explode in complexity and be unmaintainable.

…But before making something complicated, isn’t there a simple way?

Indeed, let us first observe the situation. We have the core language, and embedded in it are blocks written in other sub-languages. These blocks are at some moment evaluated/run. This raises several questions.

-> Do we really need to tokenize/parse the blocks at all?

One could think it would be sufficient to store the HTML-page block or the shell-script block (or whatever sub-language block) as one big string in a variable. Of course, this is not what we mean to do. We want syntactic & semantic checks performed on it: that the compiler confirms our HTML structure is correct, that every opening tag has a closing tag, etc… Not that it breaks at runtime.

-> Why not simply take the big text block and hand it to an independent compiler/checker/interpreter?

Although this sounds nice and simple, it has one big issue. Sub-languages can typically include core language expressions, which means calling back the core language compiler/interpreter, and also needing information about the context. If we pass the “big string HTML page” around in the program and at some point try to display it, then ${f(42)} must be evaluated. But “f” might not be available in the context of the current execution point of the program, or might even point to a different function called “f”.

Hence, either we find a way to parse the whole at once, or we build independent compilers/interpreters with a way of passing context and calling back each other.

So how do we do it?

Let us use our imagination to try to simplify things. First, it is reasonable to assume that the delimitations of the sub-languages are easy to detect. For example, a delimiter could be an indented block, something in brackets, and so on. There should be no need for advanced parsing & semantics to detect the limits. Let’s call what’s inside two such limits a “block”. It is indeed advisable that blocks be easily identified, for humans as well.

So perhaps we should make a kind of tokenizer which outputs either tokens or blocks; we will call this a b-tokenizer. Blocks have a given type and themselves contain blocks and tokens. And each block type is associated with another b-tokenizer in charge of splitting its stream.

In fact, by doing this we blur the line between tokenizing and parsing. We could for example say that “(” is a symbol opening a new block inside which the core language should be parsed. Parenthesis grouping can be done in both: whether it is done in the enhanced tokenizer or in the parser, the complexity is just shifted from one side to the other. Which option is preferable is an open question.

There are 2 kinds of b-tokenizers. Let us look again at the initial example.

Concerning the embedded HTML part, we know exactly where the block starts and ends based on the indentation, even if the content is crap. Hence the whole block stream can be passed to the HTML b-tokenizer as input, while the stream after it can independently continue to be b-tokenized (by the core language).

Conversely, inside the HTML block we have a core language block. The beginning is indicated by “${” but where it ends is not yet known …indeed, there could be a “bla bla :-}” in the content where the bracket is part of a string expression and not the end of the block. Hence we must b-tokenize the content, which results in two outputs: a list of tokens AND the remaining stream, which the HTML b-tokenizer then continues to parse.

The first kind of b-tokenizer can be reduced to the second kind by returning an empty remaining stream. So why not use this simplification?

The first thing to understand is that the shape of the b-tokenizer’s output, which can be made equivalent in both cases, is unimportant. What matters is the process of the master b-tokenizer. Either it is a parallel process, where the sub b-tokenizer handles a sub-stream which is afterwards concatenated with the rest; or it is a sequential process, where the master b-tokenizer has to wait for the sub b-tokenizer to finish in order to know which remaining stream it should operate on.

So, if we want to take advantage of parallelization, we should keep these two kinds of processes distinct.

Moreover, it has another advantage: error recovery …but we will see that later.

Can I have a summary please?

a block == type + list of tokens and blocks

There exists one b-tokenizer per type of block (a type sketch in Haskell follows below):

  • Input: a stream
  • Output:
    • either error or list of (token or block)
    • remaining stream
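In Haskell-like notation, the summary could be sketched as follows (all names illustrative):

type Token = String   -- placeholder
type Error = String   -- placeholder
data BlockType = CoreLang | Html | Shell   -- whatever sub-languages exist

-- a block = type + list of tokens and blocks
data Block = Block BlockType [Item]
data Item  = Tok Token | Blk Block

-- one b-tokenizer per block type: it consumes a stream and returns either
-- an error or the recognized items, plus the remaining stream for the caller
type BTokenizer = String -> (Either Error [Item], String)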

Two kinds of processes for the master b-tokenizer when meeting a “sub-block opener”:

  • call the adequate sub b-tokenizer on the stream, then continue the master b-tokenizer on the remaining stream
  • call the adequate sub b-tokenizer on a delimited subset of the stream and concatenate its result with the master b-tokenizer applied to what comes after
Cool, shall we finally get to the implementation?!

Slow down, not yet. There are a few more things to take into account for a complete implementation. …Coming soon.

Mixing various syntaxes – 1. Introduction

February 19, 2009

What is this post about?

Two years ago, I participated in an introductory company course, which included among other things a little refresher class on C++. There, an obviously innocent girl asked me this stupid question:

“Why can’t you write an SQL statement directly into your [C++] code?”

So, my obviously shuttered and immediate answer was:

“Because you must write C++ code! You can’t put an SQL statement like this in it! You have to use C++ language constructs!”

But, as the good old saying goes: “there are no dumb questions”. And perhaps a more fitting answer would have been:

“Because C++ does not support this.”

Obvious, isn’t it? It’s a matter of point of view. As in most other programming languages, there are no mechanisms to extend the syntax or to mix different kinds of syntaxes.

And this is what this article is about: developing a programming language which supports mixing different syntaxes and where you can add your own. For example, we could imagine a core programming language able to mix within itself blocks of XML, shell scripts, free text, etc. These would actually be parsed and checked independently, as if they were inherently part of the language. They would be sub-languages embedded in the core language. But this could also go the other way round, where a piece of code of the core language is placed inside the XML, shell script, free text or whatever, leading to a complete and sound system.

This article will go thoroughly from the basic concepts down to a practical implementation of such a sample programming language, able to mix its own core constructs with XML and free text. The sub-languages chosen are relatively simple in order to keep the examples and the final implementation understandable. Of course, the approach can be extended to languages of any complexity.

Isn’t it too complicated to mix languages?!

On one hand, parsers & interpreters for nearly every language exist individually. This naturally leads to the thought: “Can’t we simply bring them together and jump each time to the right parser depending on the language we are currently parsing?”. While this is indeed the basic idea, we would like to point out two catches.

First, and most importantly, it must make sense to mix the languages. Since language statements have some meaning, what they express should make sense at the place where they are used. Placing an SQL statement in the middle of a class declaration, for example, would be nonsense. Instead, it should only be allowed where an expression fits, for instance as the right-hand side of an equality.

Secondly, mixing the parsers is not that simple. The context must be tracked, scopes defined, ambiguities avoided, and so on. But you will discover all this in detail over the course of this article. It leads to new challenges, and the whole process of “parsing” must be rethought carefully.

Ok, but before going on …Is it really worth it?

YES! Definitely!

Since this sounds abstract, let us illustrate with a few practical examples what can be done with it. To keep things simple, let us imagine a core language able to mix itself with HTML. We will refer to HTML as a “sub-language” since it is embedded in the core language.

A practical example would be:

function f(x) = x * 2 + 3
myWebPage = syntax:html
    <html>
        <body>
            The answer is ${f(42)}!
        </body>
    </html>

This very short example illustrates the cohabitation of two different languages. The syntax:html tag is used to indicate that the indented block below is an HTML block and should be interpreted as such. Strictly speaking, it is not pure HTML: we added one feature, “${…}”, to it. It replaces what’s inside the brackets by the result of the expression, evaluated in the current scope. It’s a piece of core language inside the HTML.

The advantages are obvious. Being able to mix different languages raises expressivity to a new order of magnitude. Moreover, it can make interfacing with outside systems/libraries easier in some cases. Using this, you can use DSLs, various data formats, shell commands, all within one homogeneous world. To illustrate the power a little, let us show some more of the interaction you can have between our imaginary core language & HTML. Keep in mind that the same applies to all other sub-languages.

Here are some examples of things you could do:

  • The core language correctness is verified at compile time
  • The HTML correctness is verified at compile time
  • You can insert core language expressions inside HTML blocks
  • Variables holding HTML blocks are like any other variable, you can pass them around, concatenate it with others, and so on…
  • You can make core language functions that return pieces of HTML to:
    • Transform a piece of data (like a book or an employee) into a HTML representation
    • Create an HTML form based on some parameters
  • You can make core language functions that transform pieces of HTML to:
    • Add a fancy frame around a HTML block
    • Replace all instances of “day” by “night”
  • And much much more…

If you are not convinced yet, just take a quarter of an hour to think of all the things you could do by mixing in all the other sub-languages as well!

Basically, you have a universal language! With the ability to let all these languages interact in extremely powerful ways.

And how are you going to represent all this stuff internally?

Good question. Basically, you can handle the HTML block, or whatever sub-language block, like any other data. As such, you can pass it around, modify it, create new ones, play with it as you like. Looking closer at:

myWebPage = syntax:html
    ...

What it says is in fact “interpret the indented block below as HTML”. In other words, the block below will be parsed, checked, verified etc. according to the HTML specification. …The result of all this parsing, however, is like any other core language data, so that it “fits” inside the variable. In other words, the syntax:html parser takes a text block as input and outputs a corresponding core language data value. Hence there is, under the hood, a bijective function between HTML and the core language. One must deeply understand that HTML is just some representation of data, nothing more. As such, it can be handled like any other data. The only thing is that we provide a way to “write” the data in a customized representation/syntax directly in the language.
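As a rough sketch, with all types invented for illustration, the core language value produced by the syntax:html recognizer could look like this:

data CoreExpr   -- the core language's own expression type, left abstract here

-- what the syntax:html recognizer might hand back to the core language
data Html
  = Element String [(String, String)] [Html]   -- tag name, attributes, children
  | Text String                                -- plain character data
  | Splice CoreExpr                            -- a ${...} escape back into the core language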

Of course, not every sub-language represents data. For example, a shell script clearly executes commands, another DSL could perhaps describe some transformation process, etc. Hence the need for the core language to support basic constructs to store: data, functions, actions, or combinations of those. Indeed, anything can be transformed into one of those. Shell commands would be transformed into an action, mathematical transformations into a function, and so on. What counts is that the referenced sub-language has a “syntax recognizer” which reads the raw text and transforms it into the adequate core language construct.

Wanna know more about implementing this stuff?

Bump me so that I write the next section. (There should be around 4 sections in total.)

OO -> Actors (?)

November 15, 2008

Since the web became established, networking and concurrency have become central in most applications, just as parallelization is going to be with multicore technology. However, OO seems not that well suited for either of them.

Why is OO not good for parallelization?

Let’s say you have:
a := f()
b := g()
c := a + b
Simply put: how can you know whether you can run f() and g() in parallel? You cannot tell directly. You have to know whether f() or g(), including all resulting function calls, might at some moment modify a variable that the other one uses or modifies as well. In that case, parallelization would lead to unpredictability and thus cannot be performed. Moreover, analyzing such dependencies is extremely difficult and breaks the principle of encapsulation.
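For contrast, in a pure functional language the answer is immediate: f and g cannot touch shared state, so evaluating them in parallel is always safe. A small Haskell sketch (assuming the parallel package; f and g are dummy stand-ins for expensive pure computations):

import Control.Parallel (par, pseq)

f, g :: Integer
f = sum [1 .. 10000000]
g = sum [1 .. 20000000]

-- sparking f in parallel while evaluating g is safe here, precisely
-- because neither computation can modify anything the other reads
c :: Integer
c = f `par` (g `pseq` (f + g))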

And is OO good for networked applications?

Yes and no …At first sight it looks like yes, but after some time it turns out it is not ideal. One issue is concurrency. When two requests come in at the same time, care must be taken that something being used by one is not modified by the other at the same moment. This leads to hard-to-investigate bugs and increased complexity due to synchronization. Another issue is that objects are not that well suited to model transactions.

Ok, so what now? Do you have anything better?

Object orientation, combining data and the methods acting upon them, is “today’s paradigm”. Now everything is an object, everyone “thinks” in terms of objects, and OOP is “the way to go”. …But we should keep in mind it is by no means the only way! And perhaps there are even better ways.

Two of the (compatible) models seeing some resurgence are functional programming and the actor model. I’ll skip the first one and go directly on to say a little about the actor model.

What is an actor?

At first sight, objects and actors may look similar. However, they have profound differences. Conceptually, objects are just “dead” data with methods. Actors, on the contrary, are “alive” independent processes with transactional memory.

An actor is basically like a server. It receives and sends messages, and can also start/stop other actors. By “transactional state”, we mean that state is associated with a “conversation”, not with the actor itself, as opposed to objects having a globally shared state. As a consequence, actors are free of concurrency issues.
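To make this concrete, here is a minimal sketch of the idea in Haskell: a thread owning its state privately, reachable only through its mailbox. (Plain channels rather than a real actor library, and for simplicity the state lives in the actor itself rather than per conversation.)

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

data Msg = Increment | Get (Chan Int)   -- the actor's message protocol

-- the actor's loop: the only way to touch its state is to send a message
counter :: Chan Msg -> Int -> IO ()
counter mailbox n = do
  msg <- readChan mailbox
  case msg of
    Increment   -> counter mailbox (n + 1)
    Get replyTo -> writeChan replyTo n >> counter mailbox n

main :: IO ()
main = do
  mailbox <- newChan
  _ <- forkIO (counter mailbox 0)
  writeChan mailbox Increment
  reply <- newChan
  writeChan mailbox (Get reply)
  readChan reply >>= print   -- prints 1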

So?

So transactions with asynchronous messages are naturally expressed and we have no more concurrency issues!

What can we model with actors?

Everything that follows the client/server architecture: I/O, networking, the X window system…
In fact, anything interacting with the environment can nicely be modelled with actors, as well as intelligent agents, jobs running in parallel, etc.

And this is even more natural! Indeed, when reading a file, a DB or whatever: why sit there waiting for the answer? When these are seen as asynchronous requests, computation can go on until the response arrives.

…Sounds nice. Is there anything further?

Yes, this goes wonderfully hand in hand with functional programming. …But more on this later.

Data structures in functional programming …big issues (?!)

November 1, 2008

Welcome to this short presentation where I show something simple turning out to have dramatic consequences. Data, and the relationships between data, play a vital role in any paradigm …let’s have a closer look at how these can be manipulated in a functional paradigm where data is immutable.

Let us start with something simple. Imagine we want a navigable tree. That is, given a tree node, you can access both its children and its parent.

Ok, no problem, let us just define some data type like this for instance:

data Tree a = Tree
  { parent   :: Maybe (Tree a)
  , content  :: a
  , children :: Forest a
  }

type Forest a = [Tree a]

This way, one would build data trees like this:

root = Tree
  { parent   = Nothing
  , content  = "root"
  , children = [child]
  }

child = Tree
  { parent   = Just root
  , content  = "child"
  , children = []
  }

So far so good, you might think …Well, not really: here we already have two issues:

1. Changing a node == reconstruct the whole tree

Assume some operation takes a tree as input and outputs the same tree in which a single node was modified, added or deleted.
In any of these 3 cases, the consequences are the same:
– if node A has been modified/added/deleted, its children and parent have to be replaced with ones holding updated references to A
– the same happens for their own children and parents
– …
– the whole tree ends up being a replica of the original one

2. Coherence is not enforced

Nothing prevents you from creating bullshit where children do not reference their proper parents and vice versa. Moreover, creating helper functions cannot really help, since they cannot act locally. Indeed, because of issue 1, the whole tree must be built at once; it cannot be assembled from small pieces.

…And the tree was just an example. This applies to ANY navigable data structure!!!
When anything is modified, everything linked to it has to be replicated!
I guess there is no need to draw a picture when there is lots of data with (bidirectional) relationships between them.

One way around this would be to use dictionaries to store the relationships, like:
parent_child = [(root, chA), (root, chB), …]
and to use those for all operations. But this is both very inconvenient and inefficient.
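For concreteness, a rough sketch of that approach (all names invented): nodes are identified by keys and the links are kept in maps, so an update only touches the affected entries.

import qualified Data.Map as Map

type NodeId = Int

data TreeDB a = TreeDB
  { contents :: Map.Map NodeId a        -- node id -> payload
  , parentOf :: Map.Map NodeId NodeId   -- child id -> parent id
  }

-- changing one node's content no longer forces rebuilding every other node
setContent :: NodeId -> a -> TreeDB a -> TreeDB a
setContent nid x db = db { contents = Map.insert nid x (contents db) }

The price is that every navigation step now goes through a lookup, which is exactly the inconvenience and inefficiency mentioned above.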

Another way would be to use monads and mutable state …which boils down to simulating the imperative object-oriented paradigm (in a much uglier and more complex manner).

…I’m curious what you think about this, because the fact that something as simple as a tree has such dramatic consequences looks pretty catastrophic to me.

LEO’s influences

August 17, 2008

LEO is a new programming language I’m currently working on, as several articles in this blog show. Of course, it does not come from nothing but is influenced by several languages. “Visiting” other programming languages is like visiting different countries. They have a different flavour, a different way of living, and most of the time you’ll notice remarkable things that you like and would carry with you. The same holds for programming languages. Thus, I will state here the main influences from other programming languages, in alphabetical order to avoid jealousy.

Erlang

In Erlang, which is very complete in itself, I especially like:

  • the generators and filters
  • that words (atoms) are values by themselves
  • …other things I still have to explore, since this is the last language I picked up and am still learning.

Haskell

In a few words, Haskell is elegant, sharp and sometimes a bit too terse.

I especially like:

  • full laziness (thus the order of your statements does not matter anymore; the code becomes more descriptive) …what is funny though is that everyone tends to write in a monadic style, which is kind of an emulation of stateful sequential programming. I find it a bit odd to what extent these are used.
  • where clauses. This seems like a mere detail, but putting the details inside a where clause helps you (in my view) to better structure your code. In mainstream programming languages, you write your code bottom-up. Here you do it the other way round: the main task is stated outside and it is broken down inside the where clause. (See the small example after this list.)
  • equational reasoning. The fact that you can replace the left-hand side of an equality by its right-hand side (and vice versa) in the code is cool. It improves understandability.
  • Type classes!
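A tiny illustration of that top-down style (my own toy example):

-- the main statement comes first; the details are broken down below it
prettySurface :: Double -> Double -> String
prettySurface width height = label ++ show area
  where
    label = "surface = "
    area  = width * height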

What I’m not fond of:

  • Sometimes a bit terse and not always easy to understand
  • I am not yet convinced by partial function application
  • The fact that lists can only contain elements of the same type is both a blessing and a curse

Python

Pros:

  • It’s easy
  • It’s intuitive
  • It’s readable

Cons:

  • See what you can do in other languages and you’ll see there is a lot of ground left to cover
  • Too slow for computation-intensive stuff

Scheme

Ah, this funny one is pretty amusing. So minimalistic, yet you can do absolutely everything with it. When I’m programming in it, I feel like a small boy playing around.

What I like a lot:

  • Program is data and data is program. Quoting things and evaluating quoted things.
  • Any identifier is valid 🙂
  • Incredibly simple, yet so extensible.
  • It’s simply neat.

What bugs me:

  • I don’t feel as productive as with others
  • I don’t find it as readable.

Smalltalk

Experience the IDE and you won’t want any other. Here you don’t have source code files; you have a program “image”. And the IDE understands it as a whole and lets you navigate and interact with it in an unmatched way.

Also, the structure of the program is like assembling Lego blocks. Everything is combined from small chunks and is part of bigger chunks. In the IDE, it is like playing with a magnifying glass, zooming in and out on the different objects in the program.

Conclusion

So what would my ideal language be? …Well, no one cares anyway.

About mixing imperative and functional…

June 20, 2008

Making a custom programming language is a tricky thing… Here are a few issues with the constructs as I had designed them.

It was initially planned to support both functional and imperative programming by providing two constructs, one for true equality and another for assignment. Moreover, this seemed to provide nice ways for the two to interact, with variables holding data and equalities giving a formatted view of that data. …Sounds weird? Ok, here is a small example:


declare width, height
surface = width * height
do
width := 2
height := 3
print surface # prints 6
width := 4
height := 7
print surface # prints 28

In the previous example, we declare the surface to be equal to the width times the height. This has the same effect as making a function out of surface, but is simply more expressive.
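The behaviour can be mimicked in Haskell with mutable cells and an equation that is re-read on every use (a rough sketch only; LEO would build this in):

import Data.IORef

main :: IO ()
main = do
  width  <- newIORef (2 :: Int)
  height <- newIORef 3
  -- "surface = width * height": an equation, re-evaluated on every use
  let surface = (*) <$> readIORef width <*> readIORef height
  surface >>= print   -- prints 6
  writeIORef width 4
  writeIORef height 7
  surface >>= print   -- prints 28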

However …it comes with its own issues as well. For example, the initially planned lack of references runs into problems. In Java, assignment of “primitive” values is done by copy, while assignment of “compound” values (i.e. objects) is done by reference. I wanted to avoid the distinction between what is “primitive” and what is not, while at the same time avoiding pointers.

In LEO, equality is “referential equality”. Thus, when we write:

a = b

The identifier ‘a’ becomes equal to ‘b’ itself, not to its content. We do not even evaluate ‘b’ at this moment. By contrast, an assignment like this:

a := b

takes the content of b at this moment in time and copies that content into a. …The thousand-dollar question is now: does it copy the object “in depth”, or does it merely copy a handle to the content of ‘b’?

Let us take a few examples to illustrate this. Suppose we have a group of people, say the pupils of a class, each one having a best friend. If we do:

alice.bestFriend := zoe
bob.bestFriend := zoe

Then we copy the content of zoe into both variables …here we would wish both zoes to be the same, even if they are modified afterwards. Thus a referential copy. What about:

a := 1
b := 1

In this case, there is nothing to do on the ones. A referential copy makes no sense; just copy the value. …But what if the integer were 60 pages long? And what about this:

date = 07/05/2008
a := date

The issue doesn’t change …is it arbitrarily copied by reference or by value?
…In any case, the issue is the arbitrary frontier between what is primitive and what is not, as well as how to decide whether an assignment copies references or values.

The adopted solution is to always store a pointer in ‘a’! …even in the case of “a := 1”, ‘a’ will point to an anonymous “node” which itself contains the value 1.

Thus the internal representation for:

a = b
b := 2

is:

a –> b –> anon_int : 2

So each number would be represented by a small little data object …nice 🙂
Notice also that = always makes identifiers point to symbols, while := always makes them point to data objects.

By the way, == would:
– first compare references; if they are the same, reply true
– else, compare the types and contents.
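In pseudo-Haskell (all types invented, with nodeId standing in for the reference):

data Node = Node { nodeId :: Int, nodeType :: String, contentOf :: String }

-- the proposed ==: cheap reference check first, structural check as fallback
nodeEq :: Node -> Node -> Bool
nodeEq x y = nodeId x == nodeId y
          || (nodeType x == nodeType y && contentOf x == contentOf y)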

Let us move on to another issue, with a more practical example. Assume we have some graph and, based on some computation, want to modify some properties of the node having the highest score. There should be a function that returns the node with the highest score so that we can modify it accordingly.


graphA = ...
getBestNode = function (graph) -> (node)

do
graphB := graphA

nodeA := getBestNode(graphA)
nodeB := getBestNode(graphB)

nodeA.property := … # error!
nodeB.property := … # ok

In the last line, modifying nodeB is ok …it means that we do not operate on the same data structure as graphA, but on its copy graphB. Modifying nodeA, however, we cannot do!!! …it makes no sense …graphA cannot be equal to something that we modify afterwards.
The strongest argument of functional programming is that it is stateless and deterministic. In the above case, that does not hold anymore. Indeed, we retrieve an element of a graph in order to modify it afterwards. …so does this still fit in functional programming? …barely. I am no theorist anyway. The other issue is:

let us be friends…

alice.friend := zoe
bob.friend := zoe

what if

zoe = buildZoe()

Then zoe is either not modifiable or zoe is a copy.

LEO – Main page

June 20, 2008

This is the temporary name for “my new programming language”.

Presentations

A short appetizer & introduction

The paradigms unification

Reference manual

– Overview

Equality & assignation

Functions

– Actions

– Types and data structures

– Objects

– Classes

– Agents

– Behaviors

– Meta-programming

Implementation drafts

– Overview

– Choice of implementation language

– Reading Leo source code

– Tokenizer
Leo – Scheme

– Parser
Leo – Scheme

Interpreter for:

– Phase 1 – arithmetic expressions

– Phase 2 – equalities, assignation and printing

– Phase 3 – functions

– Phase 4 – …we will see