
If I would design a new OS from scratch… No user rights!

November 13, 2015

If I would design a new OS from scratch… here is stuff I’d do differently.

Make program based rights, not user based ones

Back in the old days, you had one big expensive mainframe, shared by many experts. So was born the Unix permission system: “read/write/execute” for “user/group/other”. At the time, it made perfect sense.

However, things are very different now. It’s not one big mainframe for everyone, it’s several small devices per person. Moreover, users nowadays are more often novices than experts; they just want things to work. Me too.

Why are user-based rights bad?

Because, basically, if you install a piece of software, it can automatically access all your files! Yay! And so viruses, malware, trojans, etc. were born! Wouldn’t it make more sense for a program to access only its own files? Or for you to have to authorize it to access other files? What about the network? What about the browser or system settings? …you know, that annoying spam bar which got installed because you didn’t pay attention. Currently, if you install something, you can’t “limit” it. It has access to everything.

So what should we do?

The solution is “program rights”. Once you install something, it should ask:

  • can I read/write on disk (in my own directory)
  • can I read/write some other files?
  • can I access the internet?
  • can I launch myself as a background service at startup?

This should be built into the OS. This way, we’d have a tight leash on all the viruses, malware, trojans, and so on. If some shady piece of software wants to read your private stuff, modify your system and talk to the web, you’d have to authorize it beforehand.
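To make the idea a bit more concrete, here is a minimal sketch in Python of what such a per-program grant could look like. Everything in it (the ProgramRights class, its fields, the example paths) is made up for illustration; it is not how any existing OS does it, and a real implementation would have to live in the kernel and deal with things like symlinks and “..” in paths.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ProgramRights:
    """Hypothetical record of what the user granted a program at install time."""
    home: PurePosixPath                               # the program's own directory
    extra_paths: set = field(default_factory=set)     # other locations the user authorized
    network: bool = False                             # may it talk to the internet?
    autostart: bool = False                           # may it run as a background service?

    def may_access(self, path: str) -> bool:
        # A path is allowed if it is, or lies below, one of the granted roots.
        p = PurePosixPath(path)
        roots = [self.home, *self.extra_paths]
        return any(p == root or root in p.parents for root in roots)


editor = ProgramRights(
    home=PurePosixPath("/apps/editor"),
    extra_paths={PurePosixPath("/home/me/documents")},
)

print(editor.may_access("/apps/editor/settings.ini"))   # True: its own directory
print(editor.may_access("/home/me/documents/a.txt"))    # True: explicitly authorized
print(editor.may_access("/home/me/.ssh/id_rsa"))        # False: never granted
print(editor.network)                                   # False: no internet by default
```

The point of the sketch is the default: a freshly installed program gets its own directory and nothing else until the user says otherwise.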

Isn’t that what Android does?

Yes and no.

Android asks for authorization for all kinds of features, indeed. This is exactly what is meant by “program rights”.

…but … all of this is built “on top” of the operating system, in a rather contrived and unexpected way.

If you don’t know it yet, Android is based on the Linux kernel. It’s a big layer on top of it. So if the kernel is based on user rights, how do they get their apps to run “isolated”? The trick is simple: install and run every app as a different user! (you can read about it here: https://source.android.com/security/overview/kernel-security.html) …fancy, right? They moved from “user rights” to “program rights” just by treating programs as users, so that they can’t interfere with one another. It’s smart, it’s ugly, it’s basically a giant hack …but it works.

Nevertheless, I’d prefer to see it built in rather than achieved by abusing an unused feature. I guess it also has its weaknesses. For example, you couldn’t even prevent an app from filling your disk space. It can be argued that Android is not secure per se, but that it’s secure because of the highly curated app store. This monopoly enables Google not only to make loads of cash but also to ensure all apps play nice.


Hi Intellego!

December 18, 2014

What is achievable?

Translation quality like Google

Nope, not by a long shot. They have several key things Mozilla doesn’t have:

  • Data: they parse every page on the web, every book ever printed and much more. This gargantuan amount of data boosts translation quality and isn’t available to us.
  • Engine: they have their proprietary Google Translate, which they have been tweaking for ten years. They’re probably ahead of the open-source engines out there.
  • Infrastructure: putting a translation model in 100 GB of RAM? Peanuts for Google. They can easily parallelize huge tasks and have huge computing power at their fingertips.
  • Manpower: lots of smart PhD people who have worked for years to sharpen their expertise in this field.
  • Money: the glue to fuel it all.

Better translations for minority languages

One of the slogans of Intellego was “translation is not a commodity”, in the sense that major languages had better translation systems than relatively “exotic” languages. This was attributed to a lack of research on such languages, driven by low ROI. This may in part be true, but the major reason is probably something else: the lack of linguistic resources.

The more data you feed into a translation system, the better it’ll perform. With millions (or billions) of parallel English-French sentences to build the models from, it goes without saying that they are pretty good. With only a few thousand Javanese/Pakistanese translations, the models built from them will result in pretty poor (crappy) translation quality.

Modest translations integrated into Firefox / as web service

Yes, this is actually achievable, provided that you pay some expert to do it. At least if you want something decent.

Do we really need to hire someone? Well, at least I think so. You need a lot of knowledge and know-how to set up a system and you are probably not going to achieve it with a bunch of volunteers. It’s not something you can hack together in a week of coding. I think the fact that nothing is available after nearly a year of Intellego kind of proves the point.

It will also require the proper infrastructure to work, not a small server in a basement. Here too, a budget is involved.

On the brighter side, if Mozilla can pull it off, they could reap the benefits too. User retention mainly.

 

Planning

Infrastructure

1. Where is the infrastructure?

Machine translation engines are hungry beasts. Ideally, you’ll need:

  • one or more servers per language pair, with enough RAM to hold all the models in memory.
  • machines to compute the models used by the translation engines (a step also called “training”) and to run experiments.

In order to have satisfying or efficient systems, the servers should be equally efficient. It’s no place for low-balling with 1Gig RAM servers because they’ll quickly run out of memory or lag like crazy.

While the first group of servers, those running the translation service, should be dedicated machines, the others, used for training and experimentation, may very well take advantage of cloud infrastructure, as the need for computing power may vary greatly over time depending on the training runs and experiments performed.

So, who takes care of that? Where are the servers? …it goes without saying that this comes with a budget.

 

Resources

1. Which corpora?

Gathering all the bilingual resources you can find is a task in itself. This is a good place where the community could help, as it doesn’t require any prerequisite knowledge and it taps into a large pool of individuals who perhaps know of this or that bilingual text.

Who is in charge of this? Where is the website listing corpora resources?

Not only bilingual resources but also monolingual resources are used to build models, language models in particular. As usual, the more data you have, the better the translation will be. This can amount to a relatively big volume of text. Just to give you a rough idea:

http://googleresearch.blogspot.de/2006/08/all-our-n-gram-are-belong-to-you.html

Remember, this is 2006 …a long long time ago. Their English sample covered 1,024,908,267,229 words and they extracted n-grams of up to 5 words from it, resulting in 24 GB of compressed data. Now, more than 8 years later, it’s likely to be much more. This is just to give you a rough idea of the scale of the data we’re dealing with.
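If “n-gram” doesn’t ring a bell: it is simply a run of n consecutive words, and a language model is built from counts of how often each run occurs. Here is a tiny Python sketch of the counting step; the count_ngrams helper and the toy sentence are mine, and real toolkits do this far more efficiently on billions of words.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens in the list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = "the cat sat on the mat and the cat slept".split()

bigrams = count_ngrams(tokens, 2)
print(bigrams[("the", "cat")])        # 2: "the cat" occurs twice
print(len(count_ngrams(tokens, 5)))   # 6 distinct 5-grams in a 10-word text
```

Notice that even this 10-word text yields 6 different 5-grams; this is why the tables explode in size on a real corpus.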

Gathering data blindly is not particularly useful either. If you scrape all forums, you are probably going to end up with translations like: “u can translate 4 fun butt it suckz”. On the other hand, pull only legal sites and the translation engine will speak like an attorney …not particularly helpful either. Final words: quality matters.

 

2. Processing corpora

How you process your raw data is hugely important. Poor processing will break the best translation engine and make it look like crap.

What’s processing corpora? Filtering bullshit out of bilingual corpora (badly aligned sentences, non-UTF-8 symbols), homogenizing symbols like quotes and accents… up to tokenization itself, which is a black art. Since it’s the beginning of the pipeline, it will affect the quality of everything up to the final translation. There are a lot of nitpicky details in there, and they can have a surprising impact on the final quality of the system.

Lastly, it goes without saying that all resources should be tokenized the same way. While using different corpora processed in different ways is possible, it is not optimal.
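To give a feel for the kind of steps involved, here is a deliberately naive Python sketch: normalize quotes, tokenize, and drop sentence pairs that are obviously misaligned. The function names, the regular expression and the length-ratio threshold of 3 are arbitrary choices of mine, not what any particular toolkit does.

```python
import re
import unicodedata

# Map typographic quotes to plain ASCII ones so every corpus looks the same.
QUOTES = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"', "«": '"', "»": '"'})


def normalize(line):
    """Unify Unicode composition, quotes and whitespace."""
    line = unicodedata.normalize("NFC", line).translate(QUOTES)
    return " ".join(line.split())


def tokenize(line):
    """Naive tokenizer: words (keeping inner apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", line)


def keep_pair(src_tokens, tgt_tokens, max_ratio=3.0):
    """Reject empty sides and pairs whose lengths differ suspiciously."""
    if not src_tokens or not tgt_tokens:
        return False
    ratio = len(src_tokens) / len(tgt_tokens)
    return 1 / max_ratio <= ratio <= max_ratio


pairs = [
    ("Don’t   panic, it’s fine.", "Pas de panique, tout va bien."),
    ("Yes.", "Ceci est une phrase beaucoup trop longue pour être la bonne."),
]
for src, tgt in pairs:
    s, t = tokenize(normalize(src)), tokenize(normalize(tgt))
    print(keep_pair(s, t), s)
```

The first pair is kept, the second one is thrown away because a 2-token sentence is unlikely to translate into a 12-token one. Whatever rules you pick, the important part is to run the exact same normalize/tokenize code over every corpus.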

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

 

3. Sentence alignment

Do you want to do that? Or only use pre-aligned bilingual corpora? If not, who does that? What tools are you using for it? Who verifies the output and ensures the quality?

 

Derivative Resources

1. N-Grams

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

2. Word alignment

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

 

Models

1. Language models

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

2. Phrase tables

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

3. Other models

Who does that? What tools are you using for it? Who verifies the output and ensures the quality?

 

Translation engines

…it’s not “download, install, run” here …you will have to learn a lot of stuff to get it running well

Discovering HaXe!

January 27, 2012

JavaScript is the de facto standard when it comes to client-side web programming. Every browser supports it, it requires no plugin, and it can manipulate the web page directly. It has been around since the mid-1990s, almost since the early days of browsers, but, surprisingly, it only recently started to be heavily used for “rich clients”, which took quite some time to appear given that the technology already existed.

JavaScript is all good and nice, it was well designed by its creators; however, it might not always be the most practical choice as the code grows bigger and bigger. Errors are only caught when testing, it may be hard to debug, and it lacks features like packages, interfaces, visibility, proper classes, etc.

This is where HaXe comes in! It is a full-featured OO programming language that can be compiled into JavaScript, among others (and also for server-side targets). I experimented with it today, and I must say I’m quite pleased! It has all the features of a modern OO language and a little more. It is statically typed, enabling you to catch errors at compile time. You have code autocompletion, etc. You are basically more productive and produce code which is both better structured and safer.

Moreover, it also enables you to use it on the server side and compile it to Neko, PHP or C++. I personally find the template engine not the fanciest but it’s OK. The libraries and documentation available for it are currently a bit limited, but should already cover the basics you need. I highly recommend it, it is definitely a gem!

Why my last job was a total failure…

November 29, 2011

Two years ago, I started a research assistant position. I was full of motivation and was looking for more of an intellectual challenge, where I could push my skills to the maximum. And I did. However, little did I know that it would end so badly. In the end, everything I did was flushed down the toilet. Great software, years of hard labor, some heart put into it as well …and all in vain, thrown away forever. This was a very harsh experience, and still is.

So, how did it come to all this? What did I do? The short story is that, after some time, I started to develop a new system from scratch. I did it out of the conviction that I could do better than the existing systems, which had become bloated over the years, extremely overcomplicated and exhausting to work with. The boss, on the other hand, was unconvinced. From his managerial point of view, all he saw was a big black box and my proposal to replace it with a supposedly better one. He was not fond of it. What started as a mini-prototype in my evening hours began to grow. This prototype was so much better than the old system that I couldn’t let it go.

This brings me to my big mistake. I poured my energy into producing a technical masterpiece, hoping that the excellence of the work would enlighten and convince the others by itself. This failed. Actually, the boss never looked at any code, nor ever used either piece of software. All I could do was use subjective adjectives, “it’s simpler, more flexible, better, etc…” …but they fell on deaf ears. He requested proof yet wasn’t willing to look inside the box. For him, it was still just a black box, and there was no point in producing another one. For me, continuing to work on the old system was equally pointless. It was ugly, overly time-consuming, annoying, inferior …and I knew I had a near-perfect, though still incomplete, piece of software on the side. Colleagues had no time to test it either, or just glanced at it, avoiding anything that would go against the boss’s will.

In the end of this year and a half, I achieved two things:

  • Created from scratch a new translation system which competed successfully among the best state-of-the-art systems worldwide
  • Got fired and had my work buried

I would have liked to have the system open sourced …but obviously the boss just didn’t care, probably because our relationship didn’t go well. By the time I proved how well the system performed, I was already fired. This left me with the highest recognition from my former colleagues, disdain from my boss, one of the best systems ever gathering dust on some hard drive, and myself empty-handed.

Cheers!

PL sucks…

September 17, 2009

C++: …two dozen “include”s, three “using packages *”, the standard “#ifndef myheader”, my class declaration… …woohoo, now I’m finally ready to write the 5 important lines that actually do something 🙂

Java: You want to use operator overloading for your freshly made “Polynomial” class? Heh, bad luck, we’re in java after all, don’t ask for too much freedom!

Haskell: …well, I can derive from Show …fine …why can’t I derive from something else I want?!

Leo -> Arplan, its own website!

April 4, 2009

Hello,

Since I spent quite some time on describing my “ideal” language, I decided to switch to a dedicated website for it:

http://www.sidewords.jimdo.com

The reason is to have a more structured site, easier to navigate and where information is more accessible. Posts tend to become a mess. By the way, the name of the language has changed from Leo to Arplan because Leo was too common.

LEO – Equality & assignation

June 20, 2008

The core is based on the difference between equality (=) and assignment (:=), which I call “affectation” throughout this post. Equality is used in the true sense, meaning that the variable will always be equal to the expression on the right-hand side.

In the code:
area = length*width

The area will always be the length times the width, whatever the latter ones carry as value. It is declarative and immutable.

By contrast, affectation means that you put the value of the right-hand side expression, at this current moment of execution, into the variable. It is what we are used to in most mainstream imperative languages like C++ and Java.

So, let us start with a few examples:

—=== Equality vs Affectation ===—

x = 1
x = 2 // ERROR: x is already equal to 1

y := 1
y := 2 // OK, the value of y is now 2

Once equality has been set, it is immutable. We declare that x is equal to 1. It is a declarative statement and x cannot be equal to two different things. We also say x is bound to 1. On the opposite, y is just a “classic” variable that can hold any value assigned to it.

—=== Equality ===—

declare b
a = b
print a // prints undefined

b := 1
print a // prints 1

b := 2
print a // prints 2

Here, you can see that a is equal to the expression b. When you want to print a, it tries to evaluate the value of a. Since a is equal to b and the value of the latter is 2, it prints 2.

—=== Affectation ===—

b := 1
a := b
print a // prints 1

b := 2
print a // prints 1

When we write a := b, we evaluate b and put the result into a. Thus, at this point of execution, the value of a will be 1. When we change b afterwards, a is not affected since it is associated with the value 1 and not with b. Before going further, it is important to clearly understand the subtle but essential difference between affectation and equality.

By equality, we bind the variable to the right hand side expression itself. On the opposite, by affectation, we evaluate the right hand side and put it into the variable.

—=== Having fun with both ===—

declare b,c
a = 100 + b // OK, b has been declared previously
b = 10 * c // OK, c has been declared previously

// now, we have a == 100 + 10*c

print a // prints undefined

c := 3
print a // prints 130

c := 7
print a // prints 170

c := a // c := 170
print a // prints 1800

This shows how we can have constructs which are both declarative and highly expressive. There is no need to propagate the values of affected variables nor to update things upon a change. We simply “know” what a is equal to and that it depends on c.
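For readers who prefer something they can run, the behaviour above can be imitated in Python. This is only a rough emulation under my own assumptions: the Env class and its method names are invented for the occasion, “=” is modelled by storing a function that is re-evaluated on every read, and the “undefined” value is not modelled at all (reading a name before it has a value simply raises KeyError).

```python
class Env:
    """Toy environment: '=' binds a name to an expression, ':=' stores a value."""

    def __init__(self):
        self.bindings = {}  # name -> zero-argument function (set by '=')
        self.values = {}    # name -> plain value (set by ':=')

    def equal(self, name, expr):
        # '=' may be used only once per name, and not on an assigned variable.
        if name in self.bindings or name in self.values:
            raise ValueError(f"{name} is already defined")
        self.bindings[name] = expr

    def assign(self, name, value):
        # ':=' overwrites freely, but never a name bound by '='.
        if name in self.bindings:
            raise ValueError(f"{name} is bound by equality")
        self.values[name] = value

    def get(self, name):
        # A bound name is re-evaluated on every read; nothing is cached.
        if name in self.bindings:
            return self.bindings[name]()
        return self.values[name]


env = Env()
env.equal("a", lambda: 100 + env.get("b"))   # a = 100 + b
env.equal("b", lambda: 10 * env.get("c"))    # b = 10 * c

env.assign("c", 3)
print(env.get("a"))             # 130
env.assign("c", 7)
print(env.get("a"))             # 170
env.assign("c", env.get("a"))   # c := a, i.e. c := 170
print(env.get("a"))             # 1800
```

Nothing is pushed around when c changes; a is simply recomputed from its definition whenever somebody asks for it.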

—=== Unethical maths ===—

declare b,c
a = b
b = 2*c
c = a // ERROR: cyclic dependency

Luckily, this can be checked at compile time.
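How could a compiler check this? One possible way (an assumption on my side, nothing is implemented yet) is to record, for each name bound by equality, which names its expression mentions, and then look for a cycle in that graph with a depth-first search. A Python sketch, with a made-up find_cycle function:

```python
def find_cycle(deps):
    """deps maps each name to the names its defining expression mentions.

    Returns one cycle as a list of names (first name repeated at the end),
    or None if the definitions are free of cycles.
    """
    in_progress, done = set(), set()
    path = []

    def visit(name):
        in_progress.add(name)
        path.append(name)
        for used in deps.get(name, ()):
            if used in in_progress:                 # back to a name still being expanded
                return path[path.index(used):] + [used]
            if used not in done:
                cycle = visit(used)
                if cycle:
                    return cycle
        path.pop()
        in_progress.discard(name)
        done.add(name)
        return None

    for name in deps:
        if name not in done:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


# a = b, b = 2*c, c = a
print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))   # ['a', 'b', 'c', 'a']
# a = 100 + b, b = 10*c, c assigned with :=
print(find_cycle({"a": ["b"], "b": ["c"], "c": []}))      # None
```

Only equalities need to go into the graph: an affectation stores a value, not an expression, so it cannot close a loop.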

———————————————————————————————————–

The other main part of the language is functions. As in functional programming, functions are treated as first-class citizens. Variables are linked (equal or affected) to either values, expressions or functions. Functions can be passed as parameters to other functions, functions can be returned as results, and so on.

There are three important facts about these functions:
-functions are deterministic/stateless
-functions cannot produce side effects
-functions can be nested

Declaring a function looks like this:

function <name>: (<input>) -> (<output>)
<body>
end function

The “end function” is optional; for readability, I will omit it when the body is just a single line.

So, for example, we could have:

function f : (x,y) -> (z)
z = x^2 + y^2

There is another way to declare exactly the same thing: by setting f equal to the equivalent anonymous function. It looks like this:

f = function : (x,y) -> (z)
z = x^2 + y^2

Both are exactly the same. In both examples, the identifier f is bound to the function defined above. In fact, the first example can be seen as syntactic sugar for the second one. We will mostly use the second notation in the examples below.

—=== Functions are stateless ===—

a = 2
b := 3
c = b

f = function: (x) -> (y)
y = a*x // OK, a is stateless

g = function: (x) -> (y)
y = b*x // ERROR, b is stateful

h = function: (x) -> (y)
y = c*x // ERROR, c is (indirectly) stateful

—=== Functions are stateless, bis ===—

f = function: (x) -> (y)
//…
g := function: (x) -> (y)
//…

hf = function: (x,y) -> (z)
z = f(x) + f(y) // OK, the function is still stateless

hg = function: (x,y) -> (z)
z = g(x) + g(y) // ERROR, because g is stateful

hh = function: (x,y) -> (z)
if x == 1 or y == 1 then
z = 1
else
z = hh(x-1,y-1) + x*y // OK, stateless recursive calls

—=== Functions without side-effects ===—

declare a,b
f = function: (x,y) -> (z)
a = 1 // ERROR, functions cannot modify variables outside their scope
b := 2 // ERROR, functions cannot modify variables outside their scope

a = 1
b := 2
g = function: (x,y) -> (z)
declare a,b // redeclaring variables with local scope
a = 3 // OK, local variables
b := 4 // OK, local variables

print a // prints 1
print b // prints 2

—=== Function calls ===—

f = function: (x,y) -> (z)
z = x^2 + y^2

print f(1,2) // prints 5

declare a,b
c = f(a,b)
print c // prints undefined

a = 2
b := 3
print c // prints 13

f = function: (x,y) -> (z)
z = x^2 + y^2

g = function: (x,y) -> (z)
z = x^2 - y^2

declare h as function

ans = h(1,2)
print ans // prints undefined

h := function f
print ans // prints 5

h := function g
print ans // prints -3

h := function: (x,y) -> (z)
z = f(x,y) + g(x,y)

print ans // prints 2
print h(10,20) //prints 200

—=== Nested functions ===—

binomial = function: (i,n) -> (ans)
fact = function: (n) -> (ans)
if (n > 1)
ans = n * fact(n-1)
else
ans = 1
ans = fact(n)/(fact(i)*fact(n-i))

—=== Returning functions ===—

norm = function: (i) -> (function ans)
ans = function: (x,y) -> (dist)
dist = (x^i + y^i)^(1/i)

taxicabLength = function norm(1)
euclideanLength = function norm(2)
norm5 = function norm(5)

dist := taxicabLength(3,4)
print dist // prints 7
dist := euclideanLength(3,4)
print dist // prints 5
dist := norm5(3,4)
print dist // prints 4.174…
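For comparison, the same “function returning a function” idea written as a Python closure. This is just an equivalent in an existing language to check the numbers, not a proposal for how LEO would implement it.

```python
def norm(i):
    """Return the distance function of order i for 2D coordinates."""
    def dist(x, y):
        return (x ** i + y ** i) ** (1 / i)
    return dist


taxicab_length = norm(1)
euclidean_length = norm(2)
norm5 = norm(5)

print(taxicab_length(3, 4))      # 7.0
print(euclidean_length(3, 4))    # 5.0
print(round(norm5(3, 4), 3))     # 4.174
```

Each returned function remembers the i it was created with, which is exactly the kind of captured value that the next example worries about.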

That was the nice part. Now comes the trouble… 😉
…are functions always stateless if they satisfy the above examples?
Consider the following case:

f = function: (c,d) -> (function g)
g = function: (x) -> (y)
y = c*x + d // Authorized or not? …can lead to state indirectly encapsulated in the function

declare a,b
h = function f(a,b) //What happens? …OK if a is stateless, ERROR if a is stateful?
// h is a function depending indirectly on a and b

a = 2 // still OK
b := 5 // ERROR, introduces state in h

…this last example is causing me trouble because I would like to ensure that functions are always stateless, but the fact that functions can be returned and that the input arguments are bound by equality causes issues:
-either indirect state can creep into functions
-or it becomes hard to ensure that functions are always stateless, and inconvenient for the user, since “forgotten” variables cannot be assigned if they are part of a function

…I’m curious to hear any feedback.
cheers

Does functional programming matter?

May 7, 2008

Yes, but it is just a part of the story. To me, it seems like functional programming is “advertised” by its fervent proponents as the holy grail. …Well, well, well… However, there is a big issue rooted in a very simple, down-to-earth concept: functional programming is about “computing” something, just like mathematical functions, and avoiding state at all costs. The issue is that the systems we usually want to model are quite the opposite: they are usually extremely stateful. We want to store data and modify it over time; it is “information” technology, meant to store and handle “information”. This leads to the widespread use of databases and SQL. They store the state and are thus the core of the business. By contrast, functional programming doesn’t lend itself well to playing around with data structures. Sure it can, but not in as straightforward and convenient a way as imperative stateful languages.

What I want to say with this is that they have complementary uses. Imperative constructs are perfect for handling everything about the state: updating data and handling how information is structured and modified. By contrast, functional programming is the perfect candidate for getting data into a certain format, visualizing it, extracting relevant information from it or computing new results from it. Imperative and functional are like the controller and the view. Perhaps stressing more clearly which parts of the program are functions that visualize/compute things based on the data, and which parts are procedures that modify the data, would help software development. By making the boundary clearer, we would help produce clearer code with fewer side effects, code that is more …may I say “adequate”.
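A small Python illustration of the split I have in mind. The account data and the function names are invented for the example; the only point is that the “view” side never touches the state and the “controller” side is the only place where it changes.

```python
accounts = {"alice": 120, "bob": 30}   # the state: the core of the "business"


# Functional side: pure functions, they only read and compute.
def total(balances):
    return sum(balances.values())


def report(balances):
    return [f"{name}: {amount}" for name, amount in sorted(balances.items())]


# Imperative side: a procedure whose whole purpose is to modify the state.
def transfer(balances, src, dst, amount):
    if balances[src] < amount:
        raise ValueError("insufficient funds")
    balances[src] -= amount
    balances[dst] += amount


transfer(accounts, "alice", "bob", 20)
print(report(accounts))   # ['alice: 100', 'bob: 50']
print(total(accounts))    # 150
```

In Python nothing enforces this boundary, it is pure discipline; a language could make it explicit by distinguishing functions from procedures.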

Programming languages: readability vs extendability / means of abstractions

May 1, 2008

Programming languages …a subject wider than the oceans.

This small post is a summarized draft discussing readability vs extendability / means of abstractions.

– mainstream languages are limited: in their features, their expressivity and their extensibility.
– special-purpose languages are useful in particular cases but limited as well, since they rarely combine well with other paradigms.
– Lisp/Scheme are special cases since they are very extensible / stretchable and thus able to provide higher levels of abstraction/combination. This is due to the fact that, in some sense, you program the abstract syntax tree directly: a Lisp/Scheme program is data and vice versa. The drawback is that it is not as readable as the usual languages, the code can be harder to understand and the learning curve is longer.
– languages like Dylan try to bridge the gap, with infix syntax for convenience and a behind-the-scenes conversion to Lisp-like syntax to enable powerful means of extending the language. …But well, who knows Dylan?
– languages like xlr / xmf try to be perfectly extensible languages so that you could express everything with them: the constructs you wish with the syntax you wish. …but this has a drawback. Two of the strengths of a language are the completeness of its libraries and the size of its community, and we lose both in this case …even though the language itself would be ideal.

Python showed us that an intuitive and clear syntax is utterly important. Indeed:
– it shortens the learning curve
– it increases readability
– it increases productivity
…It is indeed a straightforward language with enough freedom so that anyone can just pick-up the basics relatively quickly.

Yet, it is not necessarily the language we dream of. If all your experience is limited to such mainstream languages, it might seem it is the ideal candidate having everything you want. If you have already played with completely other paradigms you might miss those and find python a little dry.

Python is the culmination of the OO procedural paradigms, nice, straightforward and sweet.
SQL is the de facto standard for handling data.
Haskell is the most beautiful functional language ever.
Prolog for logic.
Erlang for distributed computing.
Scheme is so minimalistically fantastic. It combines like Lego blocks, like no other language ever.
Java helps you to structure and organize things.
C/C++ have the largest codebase ever and constitute a huge amount of libraries which we often have to interface with.

One day, a girl sitting next to me naively asked me “Why can’t you use SQL statements in your C++ code?”. Obviously, because C++ does not support it. And why? Because there is no way to extend the syntax/semantics. In the same way, it is not really possible to add stack traces to exceptions, nor aspect-oriented programming, nor anything else that C++ was not designed for. Just make nice procedures and objects, that’s all you can do and you have to deal with it.

For instance, C# 3.0 (shipped with .NET 3.5) now offers “Language Integrated Query” (LINQ). Basically, this means that “now” you can write SQL-like queries directly in the code, whether they run on persistent data stored in a database or on a normal array in your program.
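To show the gap that such a feature closes, here is the same question asked twice in Python (not C#, and not LINQ): once on a normal in-memory list using the language’s own syntax, and once on a database, where the query is just a string handed to the sqlite3 module. The table and the data are invented for the example.

```python
import sqlite3

people = [("Ann", 34), ("Bob", 19), ("Cid", 52)]

# In-memory data: the query is ordinary language syntax, checked by the parser.
adults_in_memory = sorted(name for name, age in people if age >= 21)

# Database data: the query is an opaque string as far as the language is concerned.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (name TEXT, age INTEGER)")
db.executemany("INSERT INTO people VALUES (?, ?)", people)
rows = db.execute("SELECT name FROM people WHERE age >= 21 ORDER BY name")
adults_in_db = [row[0] for row in rows]

print(adults_in_memory)   # ['Ann', 'Cid']
print(adults_in_db)       # ['Ann', 'Cid']
```

Same result, two unrelated notations. An integrated query syntax lets you write one notation for both cases.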

…However, we are still tied by the non-extensibility of languages.
Say, I want to add aspect orientation to add constructs like:

before set var from MyObject.*
print “Variable ” + getName(var) + ” : ” + getValue(var) + ” -> ” + value

…it is simply and plainly not possible in most mainstream and specific languages.
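The closest you get in a dynamic language is to hook into attribute writes by hand. Here is a rough Python approximation using the __setattr__ hook; the Traced mixin and the log list are made up for the example. Note what is missing: the behaviour is there, but the “before set … from …” syntax itself stays out of reach, and the class has to opt in explicitly.

```python
log = []


class Traced:
    """Hypothetical mixin: records every attribute write before it happens."""

    def __setattr__(self, name, value):
        old = getattr(self, name, None)          # None if the attribute is new
        log.append(f"Variable {name} : {old} -> {value}")
        super().__setattr__(name, value)         # then perform the actual write


class MyObject(Traced):
    def __init__(self):
        self.size = 1


obj = MyObject()
obj.size = 5
print(log)   # ['Variable size : None -> 1', 'Variable size : 1 -> 5']
```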

So, we have four choices:
– wait until the next mainstream language incorporates the feature
– take a Lisp-like language which we can extend more or less to what we want
– pick an experimental/academic language that has many features, among which the one we are interested in
– make our own language

And the consequences are:
1. …you may wait forever
2. …good option. But you lose readability, face a slightly longer learning curve, and it may still not be as expressive/practical as you had wished.
3. …Missing libraries? Missing support?
4. …I guess you don’t assess correctly the size of the task

An idea, for a new language, would be to have “pluggable” constructs. But this is certainly not new though. If you want to use objects, just
import core.objects
If you want to use SQL just
import core.SQL
if you want to use custom constructs, just do so
import myModule.myConstructs

It is important that the constructs are cross-compatible and combine well together. Moreover, it should be as easy as possible for an outsider to add constructs and extend the language. This often means interacting with the core of the language, its internal representation. We propose to do this in two steps:
– one part is to map syntax to the internal representation
– the other part is to interact with the internal representation according to the construct
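A toy Python sketch of these two steps, just to fix ideas. Everything here is an assumption of mine: constructs are recognized with regular expressions, the internal representation is a plain tuple whose first element is a tag, and the names add_construct and run are invented. A real language would obviously need a proper parser.

```python
import re

PARSERS = []      # step 1: syntax -> internal representation
EVALUATORS = {}   # step 2: internal representation -> behaviour


def add_construct(pattern, to_ir, tag, evaluate):
    """Plug a new construct into the language."""
    PARSERS.append((re.compile(pattern), to_ir))
    EVALUATORS[tag] = evaluate


def run(line, env):
    """Find the first construct that understands the line and execute it."""
    for pattern, to_ir in PARSERS:
        match = pattern.fullmatch(line.strip())
        if match:
            node = to_ir(match)                    # e.g. ("assign", "x", 1)
            return EVALUATORS[node[0]](node, env)
    raise SyntaxError(f"no construct understands: {line!r}")


# A "core" construct: x := 1
def eval_assign(node, env):
    env[node[1]] = node[2]

add_construct(r"(\w+) := (\d+)", lambda m: ("assign", m[1], int(m[2])),
              "assign", eval_assign)


# A construct added later by a third party: swap x y
def eval_swap(node, env):
    env[node[1]], env[node[2]] = env[node[2]], env[node[1]]

add_construct(r"swap (\w+) (\w+)", lambda m: ("swap", m[1], m[2]),
              "swap", eval_swap)

env = {}
for line in ["x := 1", "y := 2", "swap x y"]:
    run(line, env)
print(env)   # {'x': 2, 'y': 1}
```

The “swap” construct was added without touching run or the assignment construct, which is the property we are after. The hard part, which this sketch ignores, is making independently written constructs combine well.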

The farther apart the syntax is from the internal representation, the more expressivity we gain, but the more difficult it becomes to extend the language, since we abstract away from the core.