Computer Laboratory

OCaml Labs

Recent Posts

Mon, 09 Dec 2013 Mirage 1.0: not just a hallucination! MirageOS
Tue, 26 Nov 2013 Switching from Bootstrap to Zurb Foundation Amir Chaudhry
Wed, 20 Nov 2013 Announcing the new ocaml.org Amir Chaudhry
Wed, 06 Nov 2013 Migration plan for the redesign Amir Chaudhry
Wed, 30 Oct 2013 Third OCaml compiler hacking session Compiler Hacking
Mon, 28 Oct 2013 Review of the OCaml FPDays tutorial Amir Chaudhry
Tue, 22 Oct 2013 FP Days OCaml Session Amir Chaudhry
Mon, 07 Oct 2013 FPDays 2013 Real World OCaml tutorial in Cambridge Anil Madhavapeddy
Mon, 07 Oct 2013 High-power LEDs for super-bright light Philippe Wang
Sat, 05 Oct 2013 Using Travis for secure deployments with SSH Anil Madhavapeddy
Fri, 04 Oct 2013 (Example) A use case of MPP in an OCaml program Philippe Wang
Thu, 03 Oct 2013 OPAMaging MPP Philippe Wang
Thu, 03 Oct 2013 Using MPP in two different ways Philippe Wang
Wed, 02 Oct 2013 Intellisense for OCaml with Vim and Merlin Anil Madhavapeddy
Wed, 02 Oct 2013 Liveblogging the first Human Data Interaction workshop SRG Syslog
Sun, 29 Sep 2013 Test your OCaml packages in minutes using Travis CI Anil Madhavapeddy
Thu, 26 Sep 2013 (Exercise) Odd or even? Philippe Wang
Wed, 25 Sep 2013 (Exercise) Are there more 1s or 0s? Philippe Wang
Tue, 24 Sep 2013 Liveblogging OCaml Workshop 2013 SRG Syslog
Tue, 24 Sep 2013 On the Implementation of OPAMDOC Philippe Wang
Tue, 24 Sep 2013 Feedback requested on the redesign Amir Chaudhry
Tue, 24 Sep 2013 OPAM-DOC with multiple OCaml packages Philippe Wang
Mon, 23 Sep 2013 Liveblogging CUFP 2013 SRG Syslog
Thu, 19 Sep 2013 OPAM 1.1 beta available, with pretty colours Anil Madhavapeddy
Tue, 17 Sep 2013 OCaml Monthly Meeting – Live Blog Heidi Howard
Tue, 17 Sep 2013 Inaugural compiler hackers meeting Compiler Hacking
Sun, 15 Sep 2013 Camlpdf, the first good command-line PDF tool I've found Anil Madhavapeddy
Mon, 09 Sep 2013 OCaml Development in Vim Heidi Howard
Sun, 08 Sep 2013 OCamlot--exploring the edges of OPAM packages Anil Madhavapeddy
Thu, 05 Sep 2013 OMD: a Markdown parser in OCaml Philippe Wang
Wed, 28 Aug 2013 ICFP, CUFP & OCaml2013 Heidi Howard
Fri, 23 Aug 2013 Introducing vchan MirageOS
Mon, 19 Aug 2013 Real World OCaml beta3 release Heidi Howard
Thu, 08 Aug 2013 Mirage travels to OSCON'13: a trip report MirageOS
Mon, 05 Aug 2013 Final Real World OCaml beta; the good, the bad and the ugly Anil Madhavapeddy
Thu, 01 Aug 2013 OCaml Lecture Notes Heidi Howard
Thu, 18 Jul 2013 Creating Xen block devices with Mirage MirageOS
Wed, 10 Jul 2013 Profiling OCaml – Getting Started Guide Heidi Howard
Wed, 03 Jul 2013 Learning Async Heidi Howard
Sun, 16 Jun 2013 Phew, Real World OCaml beta now available. Anil Madhavapeddy

Mirage 1.0: not just a hallucination! (MirageOS)

First: read the overview and technical background behind the project.

When we started hacking on Mirage back in 2009, it looked like a conventional OS, except written in OCaml. The monolithic repository contained all the libraries and boot code, and exposed a big OS module for applications to use. We used this to run several fun tutorials at conferences such as ICFP/CUFP and to get early feedback.

As development continued, though, we came to understand what it was we were building: a "library operating system". As the number of libraries grew, putting everything into one repository just wasn't scaling, and it made it hard to work with third-party code. We spent some time developing tools to make Mirage fit into the broader OCaml ecosystem.

Three key things have emerged from this effort:

  • OPAM, a source-based package manager for OCaml. It supports multiple simultaneous compiler installations, flexible package constraints, and a Git-friendly development workflow. Since the 1.0 release in March 2013 and 1.1 in October, the community has leapt in to contribute over 1800 packages. All of the Mirage libraries are now tracked using it, including the Xen libraries.

  • The build system for embedded programming (such as the Xen target) is a difficult one to get right. After several experiments, Mirage provides a single command-line tool that combines configuration directives (also written in OCaml) with OPAM to make building Xen unikernels as easy as Unix binaries.

  • All of the Mirage-compatible libraries satisfy a set of module type signatures collected in a single file. This is where Mirage lives up to its name: we've gone from the early monolithic repository to a single, standalone file that describes the interfaces. Of course, we also have libraries to go along with these signatures, and they all live in the Mirage GitHub organization.
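To give a flavour of the module-types approach, here is a hedged sketch; CONSOLE, Unix_console and Main are illustrative names, not the actual signatures from the Mirage interface file.

```ocaml
(* The library-OS idea in miniature: applications are functorised over
   small device signatures, and each target supplies an implementation. *)
module type CONSOLE = sig
  type t
  val log : t -> string -> unit
end

(* A Unix implementation; a Xen target would provide its own module
   satisfying the same signature. *)
module Unix_console : CONSOLE = struct
  type t = unit
  let log () msg = print_endline msg
end

(* The application only ever sees the signature. *)
module Main (C : CONSOLE) = struct
  let start console = C.log console "hello from a unikernel"
end

module M = Main (Unix_console)
let () = M.start ()
```

The same Main functor can then be linked against a Unix console or a Xen console without changing the application code.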

With these components, I'm excited to announce that Mirage 1.0 is finally ready to see the light of day! Since it consists of so many libraries, we've decided not to have a "big bang" release where we dump fifty complex libraries on the open-source community. Instead, we're going to spend the month of December writing a series of blog posts that explain how the core components work, leading up to several use cases:

  • The development team have all decided to shift our personal homepages to Mirage kernels running on Xen as a little Christmas present to ourselves, so we'll work through, step by step, how to build a dedicated unikernel and then maintain and deploy it (spoiler: see this repo). This will culminate in a webservice that our colleagues at Horizon have been building using Android apps and an HTTP backend.

  • The XenServer crew at Citrix are using Mirage to build custom middlebox VMs such as block device caches.

  • For teaching purposes, the Cambridge Computer Lab team want a JavaScript backend, so we'll explain how to port Mirage to this target (which is rather different from either Unix or Xen, and serves to illustrate the portability of our approach).

How to get involved

Bear with us while we update all the documentation and start the blog posts off today (the final libraries for the 1.0 release are all being merged into OPAM while I write this, and the usually excellent Travis continuous integration system is down due to a bug on their side). I'll edit this post to contain links to the future posts as they happen.

Since we're now also a proud Xen and Linux Foundation incubator project, our mailing list is shifting over to the Xen project's infrastructure, and we very much welcome comments and feedback on our efforts over there. The #mirage channel on Freenode IRC is also growing increasingly popular, as is simply reporting issues on the main Mirage GitHub repository.

Several people have also commented that they want to learn OCaml properly to start using Mirage. I've just co-published an O'Reilly book called Real World OCaml that's available for free online and also as hardcopy/ebook. Our Cambridge colleague John Whittington has also written an excellent introductory text, and you can generally find more resources online. Feel free to ask beginner OCaml questions on our mailing lists and we'll help as best we can!

by Anil Madhavapeddy at Mon, 09 Dec 2013

Switching from Bootstrap to Zurb Foundation (Amir Chaudhry)

I've just updated my site's HTML/CSS and moved from Twitter Bootstrap to Zurb Foundation. This post captures my subjective notes on the migration.

My use of Bootstrap

When I originally set this site up, I didn't know what frameworks existed or anything more than the basics of dealing with HTML (and barely any CSS). I came across Twitter Bootstrap and immediately decided it would Solve All My Problems. It really did. Since then, I've gone through one 'upgrade' with Bootstrap (from 1.x to 2.x), after which I dutifully ignored all the fixes and improvements (note that Bootstrap was up to v2.3.2 while I was still using v2.0.2).

Responsive Design

For the most part this was fine with me, but for a while now I've been meaning to make this site 'responsive' (read: not look like crap on a mobile). Bootstrap v3 purports to be mobile-first, so upgrading would likely give me what I'm after, but v3 is not backwards compatible, meaning I'd have to rewrite parts of the HTML. Since this step was unavoidable, it led me to have another look at front-end frameworks, just to see if I was missing anything. This was especially relevant since we'd just released the new ocaml.org website, itself built with Bootstrap v2.3.1 (we'd done the design/templating work long before v3 was released). It would be useful to know what else is out there for any future work.

Around this time I discovered Zurb Foundation, and also the numerous comparisons between the two frameworks (note: Foundation seems to come out ahead in most of those). A few days ago the folks at Zurb released version 5, so I decided that now was the time to kick the tires. For the last few days I've been playing with the framework, and in the end I decided to migrate my site over completely.

Foundation's Yeti

Swapping out one framework for another

Over time, I've become moderately experienced with HTML/CSS and I can usually wrangle things to look the way I want, but my solutions aren't necessarily elegant. I was initially concerned that I'd already munged things so much that changing anything would be a pain. When I first put the styles for this site together, I had to spend quite a bit of time overwriting Bootstrap's defaults so I was prepared for the same when using Foundation. Turns out that I was fine. I currently use Jekyll (and Jekyll Bootstrap) so I only had three template files and a couple of HTML pages to edit and because I'd kept most of my custom CSS in a separate file, it was literally a case of swapping out one framework for another and bug-fixing from there onwards. There's definitely a lesson here in using automation as much as possible.

Customising the styles was another area of concern but I was pleasantly surprised to find I needed less customisation than with Bootstrap. This is likely because I didn't have to override as many defaults (and probably because I've learned more about CSS since then). The one thing I seemed to be missing was a way to deal with code sections, so I just took what Bootstrap had and copied it in. At some point I should revisit this.

It did take me a while to get my head around Foundation's grid but it was worth it in the end. The idea is that you should design for small screens first and then adjust things for larger screens as necessary. There are several different default sizes which inherit their properties from the size below, unless you explicitly override them. I initially screwed this up by explicitly defining the grid using the small-# classes, which obviously looks ridiculous on small screens. I fixed it by swapping out small-# for medium-# everywhere in the HTML, after which everything looked reasonable. Items flowed sensibly into a default column for the small screens and looked acceptable for larger screens and perfectly fine on desktops. I could do more styling of the mobile view but I'd already achieved most of what I was after.
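The fix described above looked roughly like this; the class names are Foundation's grid classes, but the 8/4 column split is just an example, not my actual layout.

```html
<!-- before: explicit small-# classes force the two-column split
     even on phone-sized screens -->
<div class="row">
  <div class="small-8 columns">main content</div>
  <div class="small-4 columns">sidebar</div>
</div>

<!-- after: medium-# applies the split from medium screens upwards,
     so small screens fall back to full-width stacked columns -->
<div class="row">
  <div class="medium-8 columns">main content</div>
  <div class="medium-4 columns">sidebar</div>
</div>
```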

Fixing image galleries and embedded content

The only additional thing I used from Bootstrap was the Carousel. I'd written some custom helper scripts that would take some images and thumbnails from a specified folder and produce clickable thumbnails with a slider underneath. Foundation provides Orbit, so I had to spend time rewriting my script to produce the necessary HTML. This actually resulted in cleaner HTML and one of the features I wanted (the ability to link to a specific image) was available by default in Orbit. At this point I also tried to make the output look better for the case where JavaScript is disabled (in essence, the images are just displayed as a list). Below is an example of an image gallery, taken from a previous post, when I joined the Computer Lab.

Foundation also provides a component called Flex Video, which allows the browser to scale videos to the appropriate size. This fix was as simple as going back through old posts and wrapping anything that was <iframe> in a <div class="flex-video">. It really was that simple and all the Vimeo and YouTube items scaled perfectly. Here's an example of a video from an earlier post, where I gave a walkthrough of the site. Try changing the width of your browser window to see it scale.
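The change was just the wrapping described above; here's a sketch with a placeholder video URL.

```html
<!-- before: a fixed-size embed that doesn't scale with the page -->
<iframe src="https://player.vimeo.com/video/12345"></iframe>

<!-- after: Foundation's flex-video wrapper makes the iframe scale
     with the width of its container -->
<div class="flex-video">
  <iframe src="https://player.vimeo.com/video/12345"></iframe>
</div>
```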

Video demo

Framework differences

Another of the main differences between the two frameworks is that Bootstrap uses LESS to manage its CSS whereas Foundation uses SASS. Frankly, I've no experience with either of them, so it makes little difference to me, but it's worth bearing in mind for anyone whose workflow does involve pre-processing. Also, Bootstrap is available under the Apache 2 License, while Foundation is released under the MIT license.


Overall, the transition was pretty painless and most of the time was spent getting familiar with the grid, hunting for docs/examples and trying to make the image gallery work the way I wanted. I do think Bootstrap's docs are better but Foundation's aren't bad.

Although this isn't meant to be a comparison, I much prefer Foundation to Bootstrap. If you're not sure which to use then I think the secret is in the names of the frameworks.

  • Bootstrap (for me) was a great way to 'bootstrap' a site quickly with lots of acceptable defaults -- it was quick to get started but took some work to alter.
  • Foundation seems to provide a great 'foundation' on which to create more customised sites -- it's more flexible but needs more upfront thought.

That's pretty much how I'd recommend them to people now.

by Amir Chaudhry at Tue, 26 Nov 2013

Announcing the new ocaml.org (Amir Chaudhry)

As some of you may have noticed, the new site is now live!

The DNS may still be propagating, so the new site may take a little while to appear for you. This post is in two parts: the first is the announcement and the second is a call for content.

New website design!

The new site represents a major milestone in the continuing growth of the OCaml ecosystem. It's the culmination of a lot of volunteer work over the last several months and I'd specifically like to thank Christophe, Ashish and Philippe for their dedication (the commit logs speak volumes).

Wireframes

We began this journey just over 8 months ago with paper, pencils and a lot of ideas. This led to a comprehensive set of wireframes and walk-throughs of the site, which then developed into a collection of Photoshop mockups. In turn, these formed the basis for the html templates and style sheets, which we've adapted to fit our needs across the site.

Alongside the design process, we also considered the kind of structure and workflow we aspired to, both as maintainers and contributors. This led us to develop completely new tools for Markdown and templating in OCaml, which are now available in OPAM for the benefit of all.

Working on all these things in parallel definitely had its challenges (which I'll write about separately) but the result has been worth the effort.

The journey is ongoing and we still have many more improvements we hope to make. The site you see today primarily improves upon the design, structure and workflows but in time, we also intend to incorporate more information on packages and documentation. With the new tooling, moving the website forward will become much easier and I hope that more members of the community become involved in the generation and curation of content. This brings me to the second part of this post.

Call for content

We have lots of great content on the website but there are parts that could do with a refresh and gaps that could be filled. As a community driven site, we need ongoing contributions to ensure that the site best reflects its members.

For example, if you do commercial work on OCaml then maybe you'd like to add yourself to the support page? Perhaps there are tutorials you can help to complete, like 99 problems? If you're not sure where to begin, there are already a number of content issues you could contribute to.

Although we've gone through a bug-hunt already, feedback on the site is still very welcome. You can either create an issue on the tracker (preferred), or email the infrastructure list.

It's fantastic how far we've come and I look forward to the next phase!

by Amir Chaudhry at Wed, 20 Nov 2013

Migration plan for the redesign (Amir Chaudhry)

We're close to releasing the new design of ocaml.org but need help from the OCaml community to identify and fix bugs before we switch next week.

Ashish, Christophe, Philippe and I have been discussing how we should go about this and below is the plan for migration. If anyone would like to discuss any of this, then the infrastructure list is the best place to do so.

  1. We've made a new branch on the main repository with the redesign. This branch is a fork of the master and we've simply cleaned up and replayed our git commits there.

  2. We've built a live preview version of the new site, which is rebuilt every few minutes from the branch mentioned above.

  3. Over the course of one week, we ask the community to review the new site and report any bugs or problems on the issue tracker. We triage those bugs to identify any blockers and work on those first. This is the phase we'll be in from today.

  4. After one week (7 days), and after blocking bugs have been fixed, we merge the redesign branch into the master branch. This would effectively present the new site to the world.

During the above, we would not be able to accept any new pull requests on the master branch, but would be happy to accept them on the new redesign branch; hence the restriction of the time frame to one week.

Please note that the above is only intended to merge the design and toolchain for the new site. Specifically, we've created new landing pages, have new style sheets and have restructured the site's contents, as well as making some new libraries (OMD and MPP). The new toolchain means people can write files in markdown, which makes contributing content a lot easier.

Since the files are on GitHub, people don't even need to clone the site locally to make simple edits (or even add new pages). Just click the 'Edit this page' link in the footer to be taken to the right file in the repository and GitHub's editing and pull request features will allow you to make changes and submit updates, all from within your browser (see the GitHub Article for details).

There is still work to be done on adding new features but the above changes are already a great improvement to the site and are ready to be reviewed by the OCaml community and merged.

by Amir Chaudhry at Wed, 06 Nov 2013

Third OCaml compiler hacking session (Compiler Hacking)

It's time for the third Cambridge OCaml compiler-hacking session! This time we're going to be back in the Computer Lab, where the first session was held.

If you're planning to come along, it'd be helpful if you could indicate interest via the Doodle poll and sign up to the mailing list to receive updates.

Where: Room FW11, Computer Laboratory, Madingley Road

When: 6pm, Wednesday 6th November

Who: anyone interested in improving OCaml. Knowledge of OCaml programming will obviously be helpful, but prior experience of working on OCaml internals isn't necessary.

What: fixing bugs, implementing new features, learning about OCaml internals


We're defining "compiler" pretty broadly, to include anything that's part of the standard distribution, which means at least the standard library, runtime, tools (ocamldep, ocamllex, ocamlyacc, etc.), camlp4, ocamlbuild, the documentation, and the compiler itself. We'll have suggestions for mini-projects for various levels of experience, but feel free to come along and work on whatever you fancy.

We'll also be ordering pizza, so if you want to be counted for food you should aim to arrive by 6.30pm.

by (OCaml Labs) at Wed, 30 Oct 2013

Review of the OCaml FPDays tutorial (Amir Chaudhry)

Last Thursday a bunch of us from the OCaml Labs team gave an OCaml tutorial at the FPDays conference (an event for people interested in Functional Programming). Jeremy and I led the session with Leo, David and Philippe helping everyone progress and dealing with questions.

It turned out to be by far the most popular session at the conference, with over 20 people all wanting to get to grips with OCaml! An excellent turnout and a great indicator of the interest that's out there, especially when you offer a hands-on session. This shouldn't be a surprise, as we've had good attendance for the general OCaml meetups I've run and also for the compiler hacking sessions that Jeremy and Leo have been building up (do sign up if you're interested in either of those!). We also had a nice surprise for attendees: uncorrected proof copies of Real World OCaml, and luckily we had just enough to go around.

For the tutorial itself, Jeremy put together a nice sequence of exercises and a skeleton repo (with helpful comments in the code) so that people could dive in quickly. The event was set up to be really informal and the rough plan was as follows:

  1. Installation/Intro - We checked that people had been able to follow the installation instructions, which we'd sent them in advance. We also handed out copies of the book and made sure folks were comfortable with OPAM.

  2. Hello world - A light intro to get people familiar with the OCaml syntax and installing packages with OPAM. This would also help people to get familiar with the toolchain, workflow and compilation.

  3. Monty Hall browser game - Using js_of_ocaml, we wanted people to create and run the Monty Hall problem in their browser. This would give people a taste of some real world interaction by having to deal with the DOM and interfaces. If folks did well, they could add code to keep logs of the game results.

  4. Client-server game - The previous game ran entirely in the browser (and so could be examined by players), so here the task was to split it into a client and a server, ensuring the two stay in sync. This would demonstrate the re-usability of the OCaml code already written and give people a feel for client-server interactions. If people wanted to do more, they could use ctypes to get better random numbers.
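A first js_of_ocaml step of the kind used in exercise 3 might look like this; a sketch in the method-call syntax of the js_of_ocaml releases of that period, not the actual tutorial code.

```ocaml
(* Pop up a dialog from OCaml compiled to JavaScript. Dom_html and Js
   come from the js_of_ocaml library; compile with the js_of_ocaml
   toolchain rather than the native compiler. *)
let () =
  Dom_html.window##alert (Js.string "Behind which door is the car?")
```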

We did manage to stick to the overall scheme as above and we think this is a great base from which to improve future tutorials. It has the really nice benefit of having visual, interactive elements and the ability to run things both in the browser as well as on the server is a great way to show the versatility of OCaml. js_of_ocaml is quite a mature tool and so it's no surprise that it's also used by companies such as Facebook (see the recent CUFP talk by Julien Verlaguet - skip to 19:00).

We learned a lot from running this session so we've captured the good, the bad and the ugly below. This is useful for anyone who'd like to run an OCaml tutorial in the future and also for us to be aware of the next time we do this. I've incorporated the feedback from the attendees as well as our own thoughts.

Heads down and hands on

Things we learnt

The Good

  • Most people really did follow the install instructions beforehand. This made things so much easier on the day as we didn't have to worry about compile times and people getting bored. A few people had even got in touch with me the night before to sort out installation problems.

  • Many folks from OCaml Labs also came over to help people, which meant no-one was waiting longer than around 10 seconds before getting help.

  • We had a good plan of the things we wanted to cover but we were happy to be flexible and made it clear the aim was to get right into it. Several folks told us that they really appreciated this loose (as opposed to rigid) structure.

  • We didn't spend any time lecturing the room but instead got people right into the code. Having enough of a skeleton to get something interesting working was a big plus in this regard. People did progress from the early examples to the later ones fairly well.

  • We had a VM with the correct set up that we could log people into if they were having trouble locally. Two people made use of this.

  • Of course, it was great to have early proofs of the book, and these were well-received.

RWO books galore!

The Bad

  • In our excitement to get right into the exercises, we didn't really give an overview of OCaml and its benefits. A few minutes at the beginning would be enough and it's important so that people can leave with a few sound-bites.

  • Not everyone received my email about installation, and certainly not the late arrivals. This meant some pain getting things downloaded and running, especially over the wifi (see 'Ugly' below).

  • A few of the people who had installed didn't complete the instructions fully, but didn't realise this until the morning of the session. There was a good suggestion about having some kind of test script to run that would check everything, so you'd know if something was missing.

  • We really should have had a cut-off point after which we told people to use the VMs instead of fixing installation issues; 10-15 minutes would have been enough. This would have been especially useful for the late-comers.

  • We didn't really keep a record of the problems folks were having so we can't now go back and fix underlying issues. To be fair, this would have been a little awkward to do ad-hoc but in hindsight, it's a good thing to plan for.

The Ugly

  • The only ugly part was the wifi. It turned out that the room itself was a bit of a dead-spot and that wasn't helped by 30ish devices trying to connect to one access point! Having everyone grab packages at the same time in the morning probably didn't help. It was especially tricky as all our mitigation plans seemed to revolve around at least having local connectivity. In any case, this problem only lasted for the morning session and was a little better by the afternoon. I'd definitely recommend a backup plan in the case of complete wifi failure next time! One such plan that Leo got started on was to put the repository and other information onto a flash drive that could be shared with people. We didn't need this in the end but it'll be useful to have something like this prepared for next time. If anyone fancies donating a bunch of flash drives, I'll happily receive them!

Overall, it was a great session and everyone left happy, having completed most of the tutorial (and with a book!). A few even continued at home afterwards and got in touch to let us know that they got everything working. Thanks to Mark, Jacqui and the rest of the FPDays crew for a great conference!

RWO Book giveaway

(Thanks to Jeremy, Leo, David and Philippe for contributions to this post)

by Amir Chaudhry at Mon, 28 Oct 2013

FP Days OCaml Session (Amir Chaudhry)

On Thursday, along with Jeremy and Leo, I'll be running an OCaml Hands-on Session at the FPDays conference. Below are some prep instructions for attendees.

Preparation for the session

If you're starting from scratch, installation can take some time so it's best to get as much done in advance as possible. You'll need OPAM (the package manager), OCaml 4.01 (available through OPAM) and a few libraries before Thursday. If you have any issues, please contact Amir.

  • OPAM: Follow the installation instructions for your platform on the OPAM website. OPAM requires OCaml, so hopefully the relevant dependencies will kick in and you'll get OCaml too (most likely version 3.12). You can get a cup of coffee while you wait. After installation, run opam init to initialise OPAM.

  • OCaml 4.01: We actually need the latest version of OCaml but OPAM makes this easy. Just run the following (and get more coffee):

$ opam update
$ opam switch 4.01.0
$ eval `opam config env`
  • Libraries: For the workshop you will need to check that you have the following installed: libffi, pcre and pkg-config. How you install these depends on your platform: on a Mac with Homebrew I would do brew install libffi pcre pkg-config, while on Debian or Ubuntu it's apt-get install libffi-dev. After this, two OCaml packages worth installing in advance are core and js_of_ocaml, so simply run:
$ opam install core js_of_ocaml

OPAM will take care of the dependencies and the rest we can get on the day!

by Amir Chaudhry at Tue, 22 Oct 2013

FPDays 2013 Real World OCaml tutorial in Cambridge (Anil Madhavapeddy)

Yaron Minsky and I have been running OCaml tutorials for a few years at ICFP and CUFP, but haven’t really spread out into the wider conference circuit. Now that Real World OCaml is almost finished, the scene is set for doing much more. The first such tutorial is being held at FPDays 2013 on October 24th in the lovely Murray Edwards College in Cambridge. Check out the Lanyrd page for ticket information, and the OCaml session page for more information.

The basic layout of the tutorial is to get started with the guided tour of the book, and then work through building a distributed message broker. This gets you familiar with the Core standard library, the Async event-driven I/O library, and all the strongly-typed RPC plumbing that goes in between. We’re hoping to have physical preprints of the book available for free to attendees, so do sign up fast if you wish to attend.
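For a taste of the Core/Async style the tutorial builds on, here is a minimal echo-server sketch; module names are as I recall them from the Core/Async releases of that period, and this is emphatically not the message broker itself.

```ocaml
open Core.Std
open Async.Std

(* Accept TCP connections and echo one line back to each client. *)
let run port =
  let server =
    Tcp.Server.create (Tcp.on_port port)
      (fun _addr reader writer ->
         Reader.read_line reader >>| function
         | `Ok line -> Writer.write_line writer line
         | `Eof -> ())
  in
  ignore server;
  (* Hand control to the Async scheduler; this call never returns. *)
  never_returns (Scheduler.go ())

let () = run 8080
```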

As a bonus, the Cambridge FPDays session will feature Jeremy Yallop working through the book and conducting the tutorial: he has an incredible depth of knowledge about the innards of OCaml’s type system, and so advanced users will also find a good home in this tutorial to throw questions at him too! For those of you interested in other programming languages, there are also excellent-looking sessions on Erlang, F# and Scala, and Phil Wadler is giving a keynote speech. I’m most excited about Sam Aaron’s session on live coding and music though. You have to hear it to believe it…

by Anil Madhavapeddy at Mon, 07 Oct 2013

High-power LEDs for super-bright light (Philippe Wang)

I've been "playing" with COB LEDs for about half a year now. Here's a summary of my experience.



A "COB LED" is basically a bunch of tiny LEDs put together to provide a very energy-efficient and usually very bright LED (using both parallel and series circuit schemes). Note that this means the voltage of a COB LED is not the voltage of a "normal LED", since putting LEDs in series means their forward voltages add up!
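The series arithmetic is easy to check; the 3.2V per die below is an assumed typical forward voltage, not a figure from any manufacturer's datasheet.

```ocaml
(* Forward voltages add along a series string, so a COB built from
   strings of six ~3.2V dies lands near the 19V modules described
   later in this post. *)
let string_voltage vf_per_die dies = vf_per_die *. float_of_int dies

let () = Printf.printf "%.1f V\n" (string_voltage 3.2 6)  (* prints 19.2 V *)
```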

An LED does not behave like an incandescent light bulb

Also, note that while there's a roughly linear relation between light intensity and voltage for classic lamps, that's not the case at all with LEDs! With LEDs, the light intensity basically varies with the current, not the voltage. So if you lower the voltage by a very little bit, the LED will shine just as bright provided the current keeps up; if you lower the voltage too much, it basically stops emitting light (e.g., if you give it 16V instead of 19V, you get so little light that it's really useless); but if you increase the voltage too much, you still fry your LEDs, so you don't want to do that at all.

Light colors, measured in Kelvin

Basically, 2700K to 3300K corresponds to classic (discontinued) incandescent light bulbs, and is quite close to classic candle light, which is about 1850K. 6500K is daylight, as in "the light provided by the Sun on a sunny day at noon", which seems very cool if it's dark outside or if you have warm light around. My favourite artificial light color is generally from 4000K to 5000K. I find 3000K quite acceptable; below that is too warm, and above 6000K is too cool.

Don't get me wrong: in this post, "cool" is not a synonym for "awesome"; stick to its original meaning instead.

Read more on Wikipedia.

Color spectral distribution

Note that LEDs generally have a very narrow light color spectrum. That means that instead of emitting a broad mix of colors, including some you don't see or barely notice because they are emitted less intensely, LEDs emit just the color you asked for. That is not such a good thing (though it's not so bad either), and as a consequence you might feel that the light emitted by an LED is not very comfortable. What I do is mainly use warm light LEDs and add a few cool light LEDs amongst them: the ratio goes from 1 cool for 6 warms up to 1 for 1. The result of 1 for 1 is quite nice when it's bright outside, but not so nice in the middle of the night.

Why not simply buy normal lamps?

Well, buying new lamps to get more light means buying light bulbs as well. I'm not a very big fan of light bulbs that contain mercury, because if you ever break one, you'll breathe mercury gas (most likely in small quantities, though). And I can't install lights on the wall without asking my landlady for permission; even if I did have permission, it would mean putting electric wires inside the wall, which is not really an idea that speaks to me (I've never done that), and it would mean painting the wall afterwards (not a big fan of painting walls either). What about LEDs then? Don't I need to install them on the wall anyway? Well, no, I don't! I have access to some of the iron-based framework, and as LEDs are lightweight and flat, I stick them on some material on which I put some magnets, et voilà! (Now I'm an even bigger fan of strong magnets than before.)

Well, I won't deny it: the main reason I wanted to use LEDs is that they're super high tech, fun, powerful and energy-efficient (and ecological), nice, the future, geek. Also, we plan on using LEDs massively in University of Cambridge buildings, starting with the department of computer science.

What I've used


I've essentially used 6W COB LEDs, which — according to the specs given by their manufacturers — are as efficient as 120lm/W or 108lm/W, depending on whether you prefer white light or warm white light. I gave a try to a more classic (though high-power) LED before: 9 pieces of 1W LEDs put together on an aluminium disc. Well, I switched to COB LEDs for three reasons. The first is that COB LEDs are more energy-efficient. The second is that the voltage of the COB LEDs I use is higher (19V instead of 10V), which means that for a given wattage, the current (amperage) is lower, and that means less chance for the wires to heat up... (I'm neither an electronics engineer nor a physicist; please correct me if I'm wrong.) And the third is that COB LEDs are very compact: the "9 pieces of 1W LEDs on a disc" is really big, with a diameter of 88mm. The drawback is that the COB LEDs I use are only available in the 6500K and 3000K color temperatures, whereas the other ones are available in 6500K, 4000K, 3000K and 2500K; 4000K being my favourite.
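As a rough sanity check on those figures (assuming the manufacturer's numbers are honest, and using the commonly quoted ballpark of roughly 800lm for a traditional 60W incandescent bulb), each 6W COB lands in the same flux range as a 60W incandescent:

```shell
# Back-of-the-envelope luminous flux from the spec-sheet figures above.
watts=6
lm_per_w_white=120       # white, per the manufacturer
lm_per_w_warm=108        # warm white, per the manufacturer
echo "white:      $((watts * lm_per_w_white)) lm"
echo "warm white: $((watts * lm_per_w_warm)) lm"
```

So 720lm (white) or 648lm (warm white) per 6W module, which is why a handful of them can light a room.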

Power supply units

It's not that easy to provide electrical power to LEDs... because we need a potentially quite high current, the right voltage, and the latter has to be very stable (or at least never exceed the expected value, otherwise the LEDs burn out and die quite easily, and we don't want to kill hardware that's supposed to last ten and up to thirty years).
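To get a feel for the currents involved (assuming the 19V COB LEDs described in this post and a 30W load), the usual P = V × I arithmetic gives about 1.6A:

```shell
# Current drawn by a 30W LED load at 19V: I = P / V.
awk 'BEGIN { printf "current: %.2f A\n", 30 / 19 }'
```

That's a non-trivial current for thin hookup wire, which is one reason the higher-voltage COB LEDs are attractive.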

I recommend this power supply unit (PSU) from Rapid if you're powering at most 30 watts of LEDs at a voltage of 24V or less, for three reasons: it seems to me so far to be a very good quality PSU; the voltage is well regulated and it provides up to 30W without complaining or heating up; and it switches on very quickly (virtually instantaneously). I have used 2 of them for about 6 months. (I haven't tried to make it work at 36W because it would supposedly diminish its lifespan.)

Well, that 36W PSU is currently (early October 2013) out of stock, and since I visited (randomly) a Maplin store and saw this 120W laptop power supply unit for less than £40, I bought two of them to give them a shot. They are a lot cheaper than the 36W PSU I've been using so far, if you compare the price-over-power ratios. The voltage they deliver seems okay: according to my cheap multimeter (so not really the most reliable measurement), they deliver 19.2V when using a 19V head. The main drawback of this adaptor is that it takes about 5 seconds to "boot" (yes, that means it will not deliver power during those 5 seconds). That's not really a problem for a laptop, but for a lighting system, waiting 5 full seconds just seems like forever. It means that if I want an instantaneous light-on feature, I need a switch between the PSU and the LED instead of between the wall plug and the PSU. In practice, I think I'll simply wait the 5 seconds, because I'll only use them when I want a very, very bright light, since I already have 2 LED-based lighting devices powered by those 36W PSUs that provide power instantaneously.

The other drawback is that they're made to work with (existing) laptops, with one connector per model (i.e., shape of the head and voltage), so you can't use just any connector and still assume you'll be able to select the voltage: here, the head determines the voltage you'll get.


One of the biggest (potential) problems with COB LEDs is that, as they are very powerful and very tiny, you're supposed to stick them on something that can dissipate the heat they produce. What I learnt from this is that iron is a very, very slow heat conductor compared to aluminium. I'm using aluminium rulers because they're the cheapest flat things made of aluminium that I could find. Three rulers very easily dissipate the heat produced by 4 LEDs. I tried tripling the density of LEDs... It was not a good idea, as the rulers become hot quite rapidly: within a few minutes, they get close to 50°C. So with 4 LEDs per ruler, active fan cooling may be needed to dissipate the heat.

Sticking COB LEDs

To stick LEDs, I used some akasa thermal adhesive tape that I bought on Amazon, and it works pretty well.

The worst part...

... is obviously, to me, the soldering part. It's quite hard to solder those COB LEDs because the contact spots are very small, so it takes a tremendous amount of time.


Why bright light is important...

I feel that it's important to me to face (very) bright light during the day, and it's best if it comes directly from the sun. Apparently (although I'm not a hundred percent sure) it's actually important to be exposed to (very) bright light at least 30 minutes a day.

So, instead of using a very expensive S.A.D. light to prevent potential seasonal affective disorder, well, I'm using very bright state-of-the-art LEDs to light up my home. :-)


LEDs are awesome because they need very little space compared to other light-emitting devices.

As they're very energy-efficient, using them for indirect lighting doesn't consume too much electricity.

They can last 10 to 30 years if they're well taken care of. So once you make them, you don't need to worry about them for a while... Well, I'm not sure the power supply units can last that many years, but they're not the hard part to change.

If you live in Cambridge (UK), you could use Makespace to build your own LED-based super bright and/or super smart lighting system. Given my low budget when I started building these LED-based lighting systems, I couldn't afford Makespace membership, so I've ended up with all I need to handcraft at home.


4 COB LEDs on 3 aluminium rulers, in parallel. The 3 aluminium rulers are stuck together using aluminium tape, which has good thermal conductivity, reflects light not too badly and is not too expensive.




If you do want LEDs but don't want to solder anything

Well, you may find a lot of websites on Google selling LEDs, you may also try this one, which belongs to a friend of mine.

by Philippe Wang at Mon, 07 Oct 2013

Using Travis for secure deployments with SSH (Anil Madhavapeddy)

In my previous post on Travis, I explained how it can be used to easily test OCaml packages on GitHub without having to host any infrastructure yourself.

The next step I wanted to investigate was how to use Travis to trigger service deployments after a successful build. One nice feature of Travis is its support for encrypted environment variables. The basic workflow is that you encrypt key/value pairs using a public key that they publish per GitHub repository. Once registered, these are made available as decrypted environment variables within the Travis worker. You can use this to transmit API keys or other authentication data that you need to commit to the .travis.yml file, but obviously can't leave on a public repository for the world to see.

The small hitch with this whole scheme is that there's a very small limit of about 90 bytes or so on the size of each individual environment variable that's exported, so you can't just stash an SSH private key in there. Instead, it needs to be Base64 encoded, split up into multiple environment variables of the right size, and then reassembled within the Travis VM. Rather than deal with shell scripts portable across MacOS X and Linux, I created a small travis-senv command-line binary to make this easier.
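The trick itself is easy to see in isolation. Here's a sketch of the split/reassemble idea in plain shell (the file names and the exact 90-byte chunk size here are illustrative assumptions, not travis-senv's actual internals):

```shell
# Illustrative only: split a key into env-var-sized pieces and put it back.
key=$(mktemp)
printf 'pretend this is a long SSH private key\n' > "$key"

# Sender side: base64-encode, then split into ~90-byte pieces, each small
# enough to fit into one encrypted Travis environment variable.
base64 < "$key" | tr -d '\n' | fold -w 90 > chunks.txt

# Receiver side (inside the build VM): concatenate the pieces and decode.
tr -d '\n' < chunks.txt | base64 -d > key.out

cmp "$key" key.out && echo "key recovered intact"
```

travis-senv automates both halves, so you never juggle the chunks by hand.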

To use it, just opam install travis-senv and follow the instructions on the README at the homepage. Here’s the fragment of shell script that pushes the build output to another GitHub repository:

if [ "$DEPLOY" = "1" ]; then
  # get the secure key out for deployment
  opam install travis-senv
  mkdir -p ~/.ssh
  travis-senv decrypt > $SSH_DEPLOY_KEY
  chmod 600 $SSH_DEPLOY_KEY
  echo "Host mirdeploy" >> ~/.ssh/config
  echo "   Hostname" >> ~/.ssh/config
  echo "   StrictHostKeyChecking no" >> ~/.ssh/config
  echo "   CheckHostIP no" >> ~/.ssh/config
  echo "   UserKnownHostsFile=/dev/null" >> ~/.ssh/config
  git config --global user.email ""
  git config --global user.name "Travis the Build Bot"
  git clone git@mirdeploy:mirage/mirage-www-deployment
  cd mirage-www-deployment
  mkdir -p xen/$TRAVIS_COMMIT
  cp ../src/mir-www.xen ../src/ ../src/www.conf xen/$TRAVIS_COMMIT
  bzip2 -9 xen/$TRAVIS_COMMIT/mir-www.xen
  git pull --rebase
  git add xen/$TRAVIS_COMMIT
  git commit -m "adding $TRAVIS_COMMIT"
  git push
fi

I’ve been using this to automate the construction of the Mirage Xen unikernel homepage. Every time there’s a push to mirage-www, the Travis scripts now retrieve an SSH deployment key using travis-senv, and push the results of the build to the mirage-www-deployment repository that stores the build output. This repository is polled by our hosting machines, which look for new kernels and rotate the website (but more on this later – I’m just integrating the EC2 and Rackspace scripts to remove this step entirely next!)

Sat, 05 Oct 2013

(Example) A use case of MPP in an OCaml program (Philippe Wang)

Let's say you have a big variant type such as this one:

type t = A | B | C | D | E | F | G

and you want to implement a t -> string function. With MPP, you can write this:

{< let l = ["A"; "B"; "C"; "D"; "E"; "F"; "G";] >}
type t = {< let _ = List.iter (Printf.printf "| %s ") l >}
let to_string = function
{< let _ = List.iter (fun e -> Printf.printf "| %s -> %S" e e) l >}

and then you have that:

type t = | A | B | C | D | E | F | G 
let to_string = function
| A -> "A"
| B -> "B"
| C -> "C"
| D -> "D"
| E -> "E"
| F -> "F"
| G -> "G"

As you can see, if you grow the type t by adding more cases, you don't have to manually update the to_string function.
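To make the expansion concrete, the same "generate everything from one list" idea can be sketched in plain shell (this script is just an illustration of what the MPP blocks above do for you; it is not MPP itself):

```shell
# Illustrative shell version of the MPP example above: both the type and
# the to_string function are generated from a single list of cases.
cases="A B C D E F G"

printf 'type t ='
for c in $cases; do printf ' | %s' "$c"; done
printf '\n'

printf 'let to_string = function\n'
for c in $cases; do printf '| %s -> "%s"\n' "$c" "$c"; done
```

Adding a case to the list regenerates both the type and the function consistently.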

by Philippe Wang at Fri, 04 Oct 2013

OPAMaging MPP (Philippe Wang)

Please read this if you want to know more about MPP.

Announcement: I've OPAMaged MPP!

So, I've eventually OPAMaged MPP (i.e., made an OPAM package for MPP)! Now MPP officially has a version number, and at the time I'm writing these lines, the current version is 0.1.0.

Happy that it eventually worked...

Yes, I'm quite happy that MPP now has an OPAM package. To do so, I copied the _oasis and Makefile files from another of my projects and adapted them for MPP. At first I wanted to avoid using oasis because, for reasons that are still unknown to me, some people told me (half-joking?) that it was better not to depend on it... Well, I actually tried not to depend on oasis! I tried, but just couldn't build, in a reasonable amount of time, a Makefile that met the requirements for an OPAM package. Building MPP is quite trivial; installing it in the right place and being able to uninstall it is, well, less trivial (perhaps I've just missed a simple step in the process).

A side note

Although the MPP project started a while ago, it's still at a pretty early stage of development, since I've mostly been working on the tools needed to build the new website, such as OMD (read about the experience here), which I developed from scratch, and OPAM-DOC, which I had to contribute to at some point because we had to fix it with very little time.

Now that OMD has become quite stable (although some bugs are still found from time to time, which is normal) ../..

No product of human intellect comes out right the first time. We rewrite sentences, rip out knitting stitches, replant gardens, remodel houses, and repair bridges. Why should software be any different? – Lauren Ruth Wiener

../.. and since I'm not the only contributor to opam-doc (Vincent Botbol is still participating; he fixed quite a nasty-to-fix bug last weekend, and Leo White is still working on the JavaScript part), it's not as time-consuming as OMD; but because it's not my code and because it uses a syntax extension (yeah, CamlP4's in the place, yo), patching it is not as simple as it could be...

by Philippe Wang at Thu, 03 Oct 2013

Using MPP in two different ways (Philippe Wang)

MPP is now an OPAM package

First of all, I have to say I've finally made an OPAM package for MPP! At the time I'm writing these words, it's still waiting to be merged. Read more here.

Description of MPP

MPP is a preprocessor designed to allow any (non-insane) programming language to be used as a preprocessor for any text-based file. Its particularity is that users keep their initial language as their primary language: the language(s) brought by MPP is (are) secondary. Ideally, this means that if you give a random text file to MPP, MPP will return it unchanged! And in practice, you can always parameterize MPP (on the command line, or in an MPP file) so that for a given text file, MPP returns it unchanged.

Of course, the purpose of MPP is not to replace the "cat" command! But having the property that you can guarantee that MPP will behave as the "cat" command if you want it to is important, because it means you can customise MPP such that MPP can work for any text file.

Side note:

Well, if you discover a language with some kind of infinite semantics, somehow automatically defined by itself, such that every possible sequence of bytes has a meaning in the language, MPP might not work, because you would have a hard time finding a token that's not already taken, since they're all already taken... But I don't think that's a realistic scenario, so let's just forget about it.

Principle of MPP

The principle on which MPP is based is pretty simple. The users define the tokens they want to use to determine the beginning and the end of MPP blocks. And there are 3 kinds of MPP blocks:

  1. basic MPP block;
  2. basic MPP block with possible nesting;
  3. foreign MPP block.

The first and the second are used for the same things, except that the second allows nesting. The third kind is the most interesting one, as it's the one that allows you to embed virtually any programming language as a preprocessor language.

Side note:

Well, you can't easily use pure lambda calculus as a preprocessor language. Your programming language has to have some notion of input/output to process text. Using pure lambda calculus might be very painful and it would start by making it impure enough to interact with the rest of the world.

Example 1: the basics

So, let's say you're writing a program using the OCaml language, and you want x to be a random integer.

You can write this:

let x = Random.int max_int

and if you want to be sure that the number is different each time you execute your program, you should use Random.self_init or an equivalent.

Now what if you want a random number to be generated at compile time? You can't do it in pure OCaml. You need to preprocess your file and/or to read from another file...

With MPP, you can write this:

let x = << cmd echo $RANDOM >>

if you use mpp with the parameters -so '<<' -sc '>>' to set the opening and closing tokens to << and >>. The default tokens are respectively (( and )), but if you're writing OCaml code, you shouldn't use (( and )) because there's quite a chance they appear as OCaml code in your file.

What happens is that MPP simply echoes everything that is not MPP, and when it is MPP, it interprets the commands. The cmd builtin is probably the simplest and the most powerful, since it calls your shell with what follows cmd: the first line is the command to call, and what follows is given to the command on its standard input.

So, basically,

let x = << cmd echo $RANDOM >> mod 42

given to MPP has the same semantics as this Bash script

echo -n 'let x = '
echo $RANDOM
echo ' mod 42'

Now, instead of using non-nesting tokens, you can use nesting tokens, the only difference being that you can nest commands. Wait, what if you nest within non-nesting tokens? Well, MPP has a very simple syntax: if you can't nest, you really can't nest. So if you write << echo << echo 42 >> >>, it just doesn't work! Moreover, you get a syntax error, because the first >> is seen as the closing token for the first <<, and MPP stops when there's a syntax error. If you do want to echo << echo 42 >>, it is possible; you just have to name your block, as in <<name echo << echo 42 >> name>>, and then it just works!

Note that with nesting tokens, you have to name all outer tokens. The default nesting tokens are {{ and }}. So, to nest some commands, you can write for instance {{name set x {{ cmd cat foo.txt }} name}}. Note that the innermost block doesn't have to be a nesting one, as it doesn't nest anything, meaning you could write {{name set x << cmd cat foo.txt >> name}} instead. And note that only the innermost nesting block can be nameless, and only in some cases.

Note: the technique presented above (in example 1) has been used to generate the new OCaml website.

Example 2: using OCaml as a preprocessor language

When MPP is used with the option -l, it produces a program that prints the file, using information provided to MPP about the language specified with -l. For instance, mpp -l ocaml will produce an OCaml program.

If you apply mpp -l ocaml to a file F that contains no MPP commands, then you have an OCaml program that, when executed, produces the original file F.
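This round-trip property is easy to demonstrate with an analogous trick in plain shell (a hedged sketch of the idea only; mpp itself is not used here): turn a file into a program that prints the file, run the program, and compare.

```shell
# Sketch of the mpp -l <lang> round-trip property, using shell as the target
# language instead of OCaml. Works for files that don't contain an EOF line.
printf 'hello MPP\n' > F

# "Compile" F into a program that prints F.
{ printf 'cat <<"EOF"\n'; cat F; printf 'EOF\n'; } > F.sh

# Running the program reproduces the original file.
sh F.sh > F2
cmp F F2 && echo "round-trip ok"
```

mpp -l ocaml does the same thing, except the generated program is OCaml and MPP blocks are interpreted instead of copied through.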

When you use -l ocaml, all "foreign blocks" are considered as OCaml code and it will just "pass through". So, if you write 42 {< let x = 23 >} Hello, it will be translated to this program:

let _ = print_string "42 "
let x = 23
let _ = print_string " Hello"

It's not exactly as flexible as "you can write anything at the top-level". Indeed, you may write any valid top-level definition: a type definition, a module definition, really anything that is valid at the top level and does not require double semicolons. That is, you cannot write {< print_int 42 >} because it's not a valid top-level phrase; you should write {< let _ = print_int 42 >} instead. If you really love playing at the edges, you could write {< ;; print_int 42 ;; >} and it might work... if there is something preceding it. If there isn't (but that's not trivial to guess, because the semantics of MPP might change on this precise point, I don't know, I can't promise right now), then you could perhaps write {< print_int 42;; >}.

But you can't convert a block of text into an OCaml function. You can't write something like you would do in PHP, such as:

  {< for i = 1 to 42 do >}
    <li> {< print_int i >} </li>
  {< done >}

Oh but wait! You can use functors...

So in functor-style, you can write this:

  {< module I (X:sig val i : int end) = struct >}
    <li> {< let _ = print_int X.i >} </li>
  {< end >}
  {< let _ = for i = 1 to 42 do
       let module M = I(struct let i = i end) in ()
     done >}

Sorry if that hurts your eyes, but life is not always easy. Oh, and if you think it's overkill: well, of course it is, it's a tiny toy example! I'm actually using this functor style to write this blog. It means I generate a template with the functor-style headers and footers, I write inside the functor using Markdown, and when I need to write some OCaml code, I use MPP features. And when I need to print MPP stuff, I use the two worlds of MPP, but it's late and I don't want to go into details right now. What's important is that it's nice and it works. I've always wanted to be able to easily embed the OCaml language in any text file (LaTeX, ML, HTML, Bash, etc.), and now I can!! :)

by Philippe Wang at Thu, 03 Oct 2013

Intellisense for OCaml with Vim and Merlin (Anil Madhavapeddy)

I’m finalizing the installation instructions for Real World OCaml and finally got around to configuring the awesome Merlin editor tool. I pretty much never configure my editor, so here are my notes on getting Merlin working with Vim completely from scratch on MacOS X (I’m on 10.9DP7, but this should work with earlier versions of MacOS X too).

First, install some basic plugin tools by following the Syntastic installation instructions.

Install vim-pathogen:

mkdir -p ~/.vim/autoload ~/.vim/bundle; \
curl -so ~/.vim/autoload/pathogen.vim \

Install vim-sensible for useful defaults:

cd ~/.vim/bundle
git clone git://

Install Syntastic:

cd ~/.vim/bundle
git clone

Follow the Merlin/vim from scratch directions:

opam install merlin

Finally, add this to your ~/.vimrc to turn everything on.

execute pathogen#infect()
let s:ocamlmerlin=substitute(system('opam config var share'),'\n$','','''') .  "/ocamlmerlin"
execute "set rtp+=".s:ocamlmerlin."/vim"
execute "set rtp+=".s:ocamlmerlin."/vimbufsync"

Trying it out

Read the excellent instructions at the Merlin for Vim from scratch wiki page. A few immediately useful things I did are below.

Displaying types: You can get the local type of something by typing in \t anywhere. This is the <LocalLeader> for anyone who has customized their setup (it is mapped to backslash by default).

Tab completing functions: You can tab-complete module signatures by pressing Ctrl-X-O (make sure you keep Ctrl pressed the whole time). This activates vim’s OmniComplete mode. To get this to work with your external projects, create a .merlin file. I use this for ocaml-dns for example.

PKG lwt cstruct
S lib
S lib_test
S lwt
B _build/lib
B _build/lib_test
B _build/lwt

Also, don’t forget to pass the -bin-annot flag to the compiler to generate the typed-AST cmt files. You can add true: bin_annot to your _tags file if you’re using OCamlbuild (in OCaml 4.01.0 or higher; it’ll be silently ignored on earlier versions, so you don’t need to worry about breaking older OCaml compilers).

Wed, 02 Oct 2013

Liveblogging the first Human Data Interaction workshop (SRG Syslog)

I’m at the Open Data Institute with Richard Mortier, Jon Crowcroft, Amir Chaudhry and Hamed Haddadi, live-blogging a day-long workshop about our emerging Human-Data Interaction research initiative. The room is packed with notable researchers from all over the UK, so this promises to be an exciting day!

Mort opens up with an introduction to the format of the workshop and lays out the problem space:

  • Visualisation and sense making: how do we make sense of such complex, technical systems, and crucially, what is the process of translating the information to
  • Transparency and audit: is essential to give a feedback loop to users
  • Privacy and control: 
  • Analytics and commerce: there is clearly an ecosystem forming around personal data (c.f. several startups specifically around this
  • Data to knowledge: what are the new business models around

Attendee introductions

Danielle (missed her last name, is from Switzerland): did a PhD on anomaly detection and then worked at Swiss banks on security monitoring. The key challenges in the area are around the nature of the questions we need to ask about all the data that we have access to.

Steve Brewer: coordinator of the IT as a Utility network, which is funding all this. Interested in the next steps: both the immediate ones and the overall vision and direction it's all going in. Concrete actions emphasised.

Ian Brown: Oxford University and cybersecurity. Interested broadly in these issues and also has a degree in psychology and behavioural psychology. The challenge is how to balance the "engineering" (this isn't fundamental computer science) and understand why there is so little take-up of this.

Amir Chaudhry: working on a toolstack for the Internet of Things. When we have tweeting toasters, how do we make it all useful to people? It's not the raw data, but the insights we present back to people that make it useful.

Elizabeth Churchill: was at Yahoo, now at eBay Research. Right now the challenge she's facing is personalisation, and trying to understand why the models they have of people are rubbish (this is why so much advertising isn't very useful). The reasoning processes and data quality are important. We are making assertions about people based on data gathered for another purpose, and the ethical and business issues are important. Has been pushing on: what is the data source, what is the data quality, are these assertions based on data that's appropriate, and how can we design new algorithms?

Jon Crowcroft: from the Computer Lab in Cambridge and does anything to do with communication systems; leads the Horizon project with Mort, and PAWS, which is working on wifi access in disadvantaged areas. Also working on Nymote. Is interested in new human/data rights vs t&cs. Rights and duties are encoded in terms and conditions (badly); see Lanier's latest book about this, and see how poor Amazon's recommendations are. We're interested in building a technology where you own your own data but business practices all gel together. We had a workshop on how the EU is pushing the right to be forgotten, so how can we ensure that data can be removed, including all references? People say "it's too difficult", but this isn't true: takedowns work, so why is it that only big corporations can afford to take stuff down? The right to oblivion ("not be there in the first place"), and data shouldn't be a public good; people should have the right to be in paid avoidance zones. (See Shockwave Rider by Brunner, 30 years old, loads of great technical ideas; Future Shock is a good read too.) Can we shift regulatory positions and have

Thomas Dedis: runs the Fab Lab in Barcelona. Co-founder of Smart Citizen, a crowdfunded environmental sensing platform based on Arduino hardware. It allows people to capture data in their own homes and push it to an online platform, in a 'conscious' way; see the data platform online. Intended to grow into other places, capturing air pollution, humidity, temperature and other environmental data. The main thing is that you own your own data. Much cheaper than the competition too, and crowdfunded.

Martin Dittus (@dekstop) is a PhD student at UCL. Thousands of people are mapping the planet in incredibly detailed ways, and it works with both commercial and non-commercial stuff. What makes these systems work, what are the processes to coordinate things, and what questions of data quality arise and how do we assert things over it? He used to be part of the data team, which was about personal data gathering and the detailed profiling that users themselves put up.

Christos Efstratiou: as of September a lecturer at the University of Kent (and still a visiting researcher at Cambridge). Works on sensing: more broadly, anything that has to do with people and sensors, and embedded sensing in the environment. Privacy is a huge issue and is his key challenge. This isn't sensing in the old style like Active Badge: back then, people weren't aware of the issues, whereas nowadays people are more aware of privacy. So the challenge is evolving user perceptions and how our system design copes. Anecdote: at a recent wedding he attended, there was a request from the bride and groom not to post any public photos to Twitter/Facebook. We've lost control over our public personas.

Julie Freeman: an actual resident in this building and the art associate for the ODI space!  Is a PhD student at QMW and is interested in echoing other people.  Interested in physical manifestations of digital data, and how we can “step away from the screen”.  Broad interests in transformation of digital data.

Barbara Grimpe: from Oxford, a sociologist in an interdisciplinary EU project on "Governance for Responsible Innovation". They have a network notion of responsibility, which takes into account that a lot of data production and data use takes place in networks that possess strong ties between people. She has started two case studies in two areas. The first is telecare technologies for elderly people (relevant due to the EU Horizon 2020 societal challenge in ageing and wellbeing arising from demographic change in western societies, which brings ethical issues around the use of personal data at scale; there is also the difficulty of informed consent due to dementia). The other case study is completely different and is about financial market technology, where the data is highly commercially sensitive, so understanding how transparency can be balanced against financial control and the need for genuine market secrets to facilitate trade.

Wifak Gueddana is a postdoc in the information systems group at LSE. Did her PhD on open source communities and how open source can work for NGOs and grass-roots organisations. Working on a research project on online platforms: how to use computational analytics to collect and process data and deal with the qualitative and subjective issues.

Hamed Haddadi: lecturer at QMUL, asking how we can compute over mobile data; has worked on advertising and privacy-aware systems. Linking the temporal nature of the data and understanding how much of your data is exposed (amusing anecdote about wearing shorts).

Muki Haklay: professor of GIScience and leads the Extreme Citizen Science group. He's interested in participatory citizen science and cyberscience. Interested in the GeoWeb and mobile GeoWeb technologies, and in understanding how tools for voluntary participation work (OpenStreetMap, the participatory geoweb). "How can you give control to people who are not technical?" And how do we build protocols for cultural sensitivity? Working on OpenStreetMap: while it's easy for techies to contribute and feel happy, it might be an issue from a privacy perspective without GPS fuzzing or pseudonyms (you can say "I know where you live" to every OSM user). (Discussion about most data being rubbish, and a psychology question about whether this depresses people!)

Jenny Harding: from the Ordnance Survey, who control most of the UK's mapping data; a principal scientist in the research team working on how people interact with geography in business, public service and leisure. Moving beyond just GPS into the connective function of what's going on in places, and the connections between different objects and places. What is the purpose of needing all this connected information, and how can the Ordnance Survey better serve its users with such usable connected data? She commissions internal and external research, including PhDs and postdocs. Challenge for this workshop: how personal data relates to location and the different levels of granularity at which such relationships can be made. Beyond GPS, there is address data in (e.g.) supermarket loyalty cards, other data at postcode level, and different types of geographies all have connections. Understanding the provenance of data is really, really important.

Pat Healey: Professor of Human Interaction, head of the cognitive science research group, and works on health issues. Not here yet, so introduced by Hamed.

Tristan Henderson: lecturer in computer science at St Andrews in Scotland; did his PhD on first-person shooter games and runs a widely used mobile data set called CRAWDAD. They archive and share it, and so work a lot on the redistribution of data. His undergrad was economics, so his interest is in behavioural aspects and usability issues too (Tristan has recently joined the HCI group and the ethics committee at St Andrews). How can we get researchers to further engage with the ethics process and refine the notions of informed consent in electronic terms? Challenges: what is acceptable HDI and are we conducting it in an acceptable way (q from Mort: how broad? a: everything). And is HDI unique enough that we might need another term?

Laura James: from OKF (not here yet)

Helene Lambrix: visiting LSE and usually at Paris-Dauphine University in France and is hoping to finish her PhD this year! Interested in corporate reputation and social media.

Neal Lathia: researcher at Cambridge University who did his PhD on online recommender systems. Noticed a disparity between data services online and the offline world, so started working on recommender systems for urban systems (banging his head against TfL data). At Cambridge, started working with psychologists and leads EmotionSense (how satisfied are you with your life, alongside ongoing smartphone sensor data); it’s really popular. Challenges: the language issue around how we present issues and motivate people around using that data (how does using EmotionSense affect their behaviour?)

Panos from Brunel: interested in cybersecurity and intelligence from media data mining and cloud-based work. The commodification process of digital economy data, and what is the legal framework surrounding this? What are the personas for data, so that we can apply frameworks and data mining techniques to it? Challenges: a regulatory system for using big personal data.

Eva from the University of Nottingham, who has just submitted her PhD and is awaiting her viva. Background in political science and international relations; interested in informed consent and how we sustain consent rather than just secure it as a one-off. The challenges: the human bit, how we communicate the complexity of systems, and how people can make meaningful decisions. If you want people to be engaged with the process, then we need data to be more social and human.

Ursula Martin: professor at a Russell Group university on Mile End Road. A mathematician researching the production of mathematics and how it happens in the first place. Producing maths is a slow, painstaking thing, and she wonders how the rate of production can keep up with our needs. When interacting with an outfit collecting her data, she’s not just an isolated individual, but actually part of a large complex system.

Richard Mortier “mort”: From Nottingham and is the dude running this workshop so has introed before!

Nora Ni Loideain: doing a PhD on the European data protection framework that requires the mandatory retention of data by telecoms providers. References recent US events (cf. Snowden); her PhD is on privacy oversight and the safeguards (or lack thereof) in current frameworks. Challenges: how can we build these safeguards, and what is the nature of informed consent under them?

Erinma Ochu: based in Manchester. Sometimes an activist, sometimes a writer, sometimes an artist. Background in neuroscience! Interested in the social life of data, what happens around it, and how people get around based on it.

Yvonne Rogers: from team UCL; director of UCLIC (the UCL Interaction Centre) and also of an Intel-funded institute at UCL where they work on connected data. Given lots of data from many sources, she is interested in how people can engage with it and how to visualise it. (Hamed: “Human City Interaction is the next thing!”)

Geetanjali Sampemane: background in computer science; works at Google in the Infrastructure Security and Privacy group. The challenge is how to help people calibrate benefits and risks to make appropriate tradeoffs. How can humans make an informed choice, when the current approach isn’t really based on informed choice? Giving people buttons and options isn’t the most useful way to approach this; we need a mental model similar to how we judge risks in the physical world. In the online world, no one understands how dangerous it is to reuse passwords. Security people have made it a little worse by telling people to use complicated passwords, but brute force isn’t the big problem right now, it’s the connectivity of services.

Cerys Willoughby: from Southampton, looking at the usability of notebooks and wondering how to make the interfaces usable without being a technological guru. (Missed the rest due to reading the cool comic she projected about her work. Sorry!)

Eiko Yoneki: from Cambridge; she works on digital epidemiology and real-world mobility data collection in Africa (e.g. EipPhone). She analyses network structure to understand graphs and connectivity. Also works on CDNs, building self-adaptive CDNs, and on graph-specific data-parallel algorithms.

Jonathan Cave: game theorist (Yale, Cambridge, Stanford) who has worked in a lot of government/academia/research areas. Works on economics and IoT (a festival of things and the boundaries of humanity is coming up soon in Cambridge on 29th October). Fascinating anecdote about the price of sick animals.

George Danezis: formerly MSR, now UCL. Interested in the technical security aspects of how to glue together distributed mobile devices without leaving our personal data exposed. Has done work on privacy-friendly statistics and how we can process and analyse aggregate data sets, including in the context of smart metering in Europe.

Breakout sessions

We then had breakout sessions to brainstorm the challenges in this area (lots of post it notes and arguments), and Amir Chaudhry has summarised the results of the 5 presentations here:

Disambiguating data, for example from the home and from organisations. This isn’t a new problem, but it becomes more important as more data collection occurs using different sources. Who are the principals in terms of ownership and provenance? How do we deal with communities/groups and data control?

Why not try crowd-sourcing mechanisms that people can use to improve the use of sites like Ryanair (who deliberately obfuscate things)? Changing the mindset from a consumer perspective to a producer perspective, i.e. humans make data and perhaps can provide it to others for economic benefit.

We have data and different notions of data quality. It’s not always the case that the most scientific data is the best, depending on how it’s used. We have collectivist notions of data culture: e.g. this conversation right now in the room isn’t just individual, it’s all of us, so we can’t ascribe it to individuals. Then there are network notions, based on transactions, that use a reductionist view to decide what data is worth. We can think of these on a continuum, and the research challenge is how well the different ends of this scale do at producing data and creating value. An interesting question: if people opt out (right to be forgotten, or right to oblivion), then the data that is left is flawed.

We need to examine current assumptions and values around data. What is the unit of analysis? We must be clearer on this. Where and when we look at what data also matters, as well as global aggregation. Sometimes we also want to look at trajectories and data flows and how these change over time. How do we interact with this data? Do users interact with data directly, or with something that sits in front of it? There’s an interaction between data science and the creative process of presenting information in a certain way. This depends on the ultimate goal being served, for example behavioural change or just increased engagement.

There are big challenges in the integration of the groups who want to construct humans: social sciences (think about risk), psychological science (reputation), data sciences (epistemology and ontology), design science (interfaces and interactions). What is the new meaning of ownership and liability? E.g. who owns this collection of posters and the ideas that have come out? [Hamed and Mort clarify that it's all theirs!] What happens if there are negative consequences as a result of using poor data?

Business models are also important in order to go from studies to practice. Are there new social structures we could create to help this? For example, we have venture capital that takes risks, but what about social capital to spread risk to create new businesses, e.g. Kickstarter and the like.

What does informed consent mean? The current system puts the onus on the user to understand all the contractual conditions before deciding. Perhaps there’s a social-network method of crowd-sourcing opinions on ToS, or of providing some kind of health rating? Perhaps data protection agencies could certify terms, or maybe the EFF or non-profits could provide some kind of rating system (c.f. Moody’s etc.?)

Next steps

Ian Brown took notes on our breakout session on business models for privacy:

Collectives/cooperatives sharing data through PDSes

  • How to incentivise membership? Dividends, social benefit (e.g. medical research)
  • What currently exists where the data controller has a strong incentive not to leak/breach data, e.g. Boots for brand loyalty, Apple/Facebook? So long as the customer can switch effectively (portability, erasure)
  • Power tool sharing at village level. Hang off existing structures e.g. local councils.
  • New forms of micro-markets e.g. physical gatherings? Alternatives to currencies. Kickstarter? Distribution reduces risk of centralised architectures.
  • What do syndicalist-anarchist models of data management look like?
  • Current uses of data are optimising existing business practices. But what totally new practices could be enabled? Human-facing efficiencies?

What are the types of business models? Startups, personal profit, NGOs, medical research, banks. A balanced investment portfolio.

by Anil Madhavapeddy at Wed, 02 Oct 2013

Test your OCaml packages in minutes using Travis CI (Anil Madhavapeddy)

A few months ago, Mike Lin posted instructions on how to bootstrap an OCaml testing environment within the Travis continuous integration tool. I finally got around to integrating his prototype scripts properly using the latest OCaml and OPAM versions during my travels last week to ICFP. It’s been an extraordinarily quick and pleasant experience to add the (free!) Travis test runs to my OCaml programs on GitHub, so here’s how you can do it too. Dave Scott and I have used this for about 15 of our own projects already, and I’m switching the whole of the Mirage repo infrastructure over to it this week.

(edit: I’ve done a followup post about integrating Travis with SSH to make secure deployments easier.)

Getting started

Getting my first Travis build working with one of my OCaml projects took about 2 minutes in total:

First, log into Travis and sign in via GitHub. Click on the Accounts button on the top-right and you should see a list of all the GitHub repositories that you have access to. Just click the On switch for the one you want to start testing. Nothing will actually happen until the next code push or pull request goes to that repository. Behind the scenes, the On button that you clicked uses the GitHub APIs to turn on the Travis post-commit hook for your repository.

Create a .travis.yml file in your main repository (see below or this gist). Travis doesn’t have native support for OCaml, but it isn’t really needed since we can just use C and write our own shell scripts. The env variables define a matrix of the different combinations of OCaml and OPAM that we want to test. Just remove variations that you don’t care about to avoid wasting Travis’ CPU time (open source projects are supported on a fair-use basis by them). Here’s the .travis.yml that I used for my ocaml-uri library:

language: c
script: bash -ex .travis-ci.sh
env:
  - OCAML_VERSION=3.12.1 OPAM_VERSION=1.0.0
  - OCAML_VERSION=3.12.1 OPAM_VERSION=1.1.0
  - OCAML_VERSION=4.00.1 OPAM_VERSION=1.0.0
  - OCAML_VERSION=4.00.1 OPAM_VERSION=1.1.0
  - OCAML_VERSION=4.01.0 OPAM_VERSION=1.0.0
  - OCAML_VERSION=4.01.0 OPAM_VERSION=1.1.0

Now you just need the shell script to run the actual tests. Travis provides an Ubuntu Precise/i386 VM that is destroyed after every test run, so we need to initialize it with the OCaml and OPAM binary packages. Since you often want to test different versions of all of these, I created a series of stable Ubuntu PPAs that have OCaml 3.12, 4.00 and 4.01, and the OPAM 1.0 and (currently beta) 1.1 package managers. You can find them all on my Launchpad page, but the script below takes care of it all for you.

# Edit this for your own project dependencies
OPAM_DEPENDS="ocamlfind ounit re"

case "$OCAML_VERSION,$OPAM_VERSION" in
3.12.1,1.0.0) ppa=avsm/ocaml312+opam10 ;;
3.12.1,1.1.0) ppa=avsm/ocaml312+opam11 ;;
4.00.1,1.0.0) ppa=avsm/ocaml40+opam10 ;;
4.00.1,1.1.0) ppa=avsm/ocaml40+opam11 ;;
4.01.0,1.0.0) ppa=avsm/ocaml41+opam10 ;;
4.01.0,1.1.0) ppa=avsm/ocaml41+opam11 ;;
*) echo Unknown $OCAML_VERSION,$OPAM_VERSION; exit 1 ;;
esac

echo "yes" | sudo add-apt-repository ppa:$ppa
sudo apt-get update -qq
sudo apt-get install -qq ocaml ocaml-native-compilers camlp4-extra opam
export OPAMYES=1
opam init
opam install ${OPAM_DEPENDS}
eval `opam config env`
make test

Now just do a push to your repository (a commit adding the Travis files above will do), and you will soon see the Travis web interface update. For example, here’s the output of ocaml-uri that the above example files are for. Of course, you should tweak the scripts to run the tests that your own project needs. Let me know if you make any useful modifications too, by forking the gist or e-mailing me.

Testing pull requests in OPAM

Travis isn’t just for code pushes though; as of a few months ago it can also test pull requests. This is an incredibly useful feature for complex projects such as the OPAM repository that has lots of contributors. You don’t need to do anything special to activate it: whenever someone issues a pull request, Travis will merge it locally and trigger the test runs just as if the code had been pushed directly.

I did do some special scripting to make this work with the OPAM package repository. Ideally, every new package or metadata change in OPAM will attempt to rebuild just that package set (rebuilding the entire repository would far exceed the 50 minute testing budget that Travis imposes). I put together a very hacked up shell script that greps the incoming diff and rebuilds the subset of packages. This is now live, and you can see both successful and failed pull requests (once the request has been merged, there’s a tiny little green arrow beside the commit that was tested).

This is a rather rough test matrix, since a package update also affects the reverse transitive cone of packages that depend on it (which isn’t rebuilt), but it does catch several common typos and incompatibilities (for example, packages that use OPAM 1.1-only features by mistake). The longer-term plan is to use the OCamlot command-line tool to parse the pull request and compute an exact set of packages that need rebuilding.


Travis has one last very cool feature up its sleeve. When a project has successfully built, it can run a scripting hook in the VM, which can be used to trigger a further code push or service update. If the service requires authentication, you can encrypt a secret in the .travis.yml using their public key, and it will be available in the VM as an environment variable (but won’t be useful to anyone else looking at the code repository).

There are quite a few uses for these Travis deployment scripts: automating the rebuilding of the website infrastructure, rebuilding the central opam-doc cmt file archive, or even autoupdating Mirage microkernels to Amazon EC2 and Rackspace.

So how does this tie into the ongoing work on OCamlot? Quite simply, it’s saved us a ton of frontend work, and lets us focus on the more interesting OCaml-specific problems. Travis is also somewhat reactive (since it only runs in response to pushes or pull requests), and we still need to be able to run complete repository sweeps to look for more systematic breakage. It also doesn’t support any non-Ubuntu operating systems yet. However, Travis takes the burden off us for handling the steadily increasing number of package updates, and can be used to launch further OCamlot jobs on the other architectures and distributions. All in all, I’m very grateful to Mike for taking the trouble to blog about it back in March!

And a little bit of cheeky ARM hackery

I couldn’t resist one final little hack to see if I could actually do some ARM/OCaml testing using the Travis infrastructure. I adapted Ian Campbell’s excellent guide to ARM cross-chroots, and ended up with a rough script that builds an unattended chroot and can run OCaml/OPAM installations there too.

This isn’t something I’m quite comfortable running on all of my repositories just yet though, since an OCaml cross-build using qemu took over 5 hours and was quite rightly terminated when running within Travis. To mitigate this, I’ve built a custom apt-repository for ARMel packages to install a binary toolchain, and you can see the results of Lwt/ARM building successfully. I’ll update on this as I get it working on more packages, as I’m quite interested in getting a solid Xen/ARM toolstack up and running for another exciting project over in OCaml Labs…

Sun, 29 Sep 2013

(Exercise) Odd or even? (Philippe Wang)

This is the exercise:

Given a string (of bytes), is the number of bits set to 1 odd or even?

Let's now find some possible solutions.

Well, the most naive way is to count them all, and then see if the number is even or odd. But it's probably better not to use an integer counter at all: we can use a Boolean instead, and each time we see a 1, simply flip the Boolean value!

let has_an_odd_number_of_1 s =
  let res = ref false in
  for i = 0 to String.length s - 1 do
    let x = int_of_char s.[i] in
    for j = 0 to 7 do
      if (x lsr j) land 1 = 1 then
        res.contents <- not res.contents
    done
  done;
  res.contents

Side note:

Moreover, using a Boolean means there is no limit on the number of bits we may read, whereas with an integer... well, actually, when we reach the maximum value of integers (of type int, Int32.t or Int64.t, which in OCaml are respectively Pervasives.max_int, Int32.max_int and Int64.max_int), adding one wraps around to a negative number. Since max_int (from Pervasives, Int32 or Int64) is always odd and the wrap-around changes the counter by an even amount, we should still get the odd/even outcome right. But then, we would have to be sure that we have correctly designed the code that tells us whether an integer is odd or even. For instance, we can write is_odd in the following ways.

Well, let's start with a very classic mathematically-elegant implementation of two mutually recursive functions:

let rec is_odd = function
 | 0 -> false
 | n -> is_even (n-1)
and is_even = function
 | 0 -> true
 | n -> is_odd (n-1)

Note that it needs a lot of computation when the integer is "big"...

And this is probably the shortest version:

let is_odd n = n mod 2 <> 0

Now note that we cannot write this and expect it to work properly in all cases:

let is_odd n = n mod 2 = 1

and that's because if n is negative and odd, then n mod 2 is -1, not 1.

A tricky version that uses properties on integers:

let is_odd n = n <> 0 && ((n / 2) * 2) <> n

Another tricky version that uses properties on machine integers:

let is_odd n = n land 1 = 1

Well, it has the same number of characters as the shortest version, and this version is probably the most efficient, because the land operation is cheap for the computer, whereas the mod operation implies a division, which is not. A compiler could optimise mod in some cases, such as mod 2, because it's basically the same as land 1 except that the sign has to be preserved; so it's still likely to cost a bit more...
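As a quick sanity check (my addition, not part of the original exercise), we can verify that the mod-based and land-based versions agree on a range of values, including negative ones:

```ocaml
(* Check that both is_odd variants agree, including on negative numbers. *)
let is_odd_mod n = n mod 2 <> 0
let is_odd_land n = n land 1 = 1

let () =
  List.iter
    (fun n -> assert (is_odd_mod n = is_odd_land n))
    [min_int; -3; -2; -1; 0; 1; 2; 3; max_int];
  assert (is_odd_land (-3));       (* negative odd numbers work too *)
  assert (not (is_odd_land (-4)))
```

The agreement on negatives holds because OCaml's native ints are two's complement, so the lowest bit of a negative odd number is still 1.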

But we can be far more efficient than that! Indeed, unless we really don't have a choice, we shouldn't read bits one by one, because it's slow!

So what can we do instead?

Well, there's a very very convenient logical operation on integers. It's called XOR (and the OCaml operator for that is lxor), and yes we can use it!

Note that 1 XOR 1 = 0; 0 XOR 0 = 0; but 1 XOR 0 or 0 XOR 1 = 1. It's very convenient because it's basically giving us the answer! When we have two 1s, we can eliminate them, that's what happens with XOR. When we have a 1 and a 0, we keep the 1. So if we take those bytes byte by byte, and XOR them with each other, at the end we will have one remaining byte that will tell us if the whole string had an odd number of 1 or not.

let has_an_odd_number_of_1__efficient s =
  let l = String.length s in
  let rec loop c i =
    if i = l then begin
      let res = ref false in
      for j = 0 to 7 do
        if (c lsr j) land 1 = 1 then
          res.contents <- not res.contents
      done;
      res.contents
    end else
      loop (c lxor int_of_char s.[i]) (succ i)
  in loop 0 0

And... There we go! :-)

Note that int_of_char is a zero-cost operation (but char_of_int is not, because it has to check that the value of the given integer is between 0 and 255 inclusive).
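To convince ourselves that the byte-folding trick is right, here is a small self-contained check (my addition): it compares a naive bit-by-bit count against an XOR fold over the bytes, finishing with the standard shift-and-XOR reduction of the last byte.

```ocaml
(* Naive version: count every set bit, then take the parity. *)
let parity_naive s =
  let count = ref 0 in
  String.iter
    (fun ch ->
       let x = int_of_char ch in
       for j = 0 to 7 do
         if (x lsr j) land 1 = 1 then incr count
       done)
    s;
  !count mod 2 = 1

(* XOR fold: XOR all bytes together, then fold the remaining byte's
   bits down to a single parity bit. *)
let parity_xor s =
  let c = ref 0 in
  String.iter (fun ch -> c := !c lxor int_of_char ch) s;
  let c = !c lxor (!c lsr 4) in
  let c = c lxor (c lsr 2) in
  let c = c lxor (c lsr 1) in
  c land 1 = 1

let () =
  List.iter
    (fun s -> assert (parity_naive s = parity_xor s))
    [""; "\x00"; "\x01"; "\x03"; "abc"; "hello, world"]
```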

by Philippe Wang at Thu, 26 Sep 2013

(Exercise) Are there more 1s or 0s? (Philippe Wang)

The challenge is this:

Provide a program that will determine, given a set of 1s and 0s, if there are more 0s or more 1s.

So the first question that comes to my mind is about the representation of that set of 1s and 0s. Is it a linked list, a double linked list, an array, a tree, a string, or something else?

If it's a list (singly or doubly linked actually doesn't matter much), it's going to be pretty straightforward: just read the list and use a counter. Yes, a single counter is enough: add one to the counter when you meet a 1, and subtract one when you meet a 0. At the end, if your counter has its initial value, then you have the same number of 1s and 0s; if it's less, there are more 0s; otherwise there are more 1s.

Let's declare a type for the result.

type result = Equal | More_zeros | More_ones

let count l =
  let rec loop counter = function
    | [] -> counter
    | 1::tl -> loop (counter+1) tl
    | 0::tl -> loop (counter-1) tl
    | _::tl -> loop counter tl
    (* you might want to debate whether you should stop
       when you get something else than a 0 or a 1. *)
  in
  let r = loop 0 l in
  if r = 0 then Equal
  else if r > 0 then More_ones
  else More_zeros

Well, what's the risk with this program? The risk is that the integer counter overflows, of course! If you have a very long list of 1s (or 0s) only, you may get things just wrong. However, in OCaml, in reality, there's not really a chance that the integer overflows because of the length of a single linked list, especially if you're using a 64-bit-architecture-based OCaml, on which the greatest integer is 4_611_686_018_427_387_903 (about 4×10¹⁸). There's really a long way to go to get such a long list, because you would basically need to allocate more than about 32×10⁶ terabytes at once, since a linked list of integers is made of blocks that have 2 cells each (one for the integer, one for the address of the next cell), each cell taking 64 bits (or 8 bytes).

But then, what if you don't have linked lists but some stream that gives you a very large number of 0s and/or 1s? Well, to begin with, counting from 0 to 4×10¹⁸ takes a really long time. If your machine can count from 0 to 10⁹ in a single second (that would mean your machine is very fast), it would still take 4×10⁹ seconds, which is about 4×10⁹/(60×60×24×365) years. Oh, that means about 126 years! So let's just assume that a 63-bit signed integer is enough for us. And if you really can't assume that for some reason, you can always implement 128-bit signed integers quite easily, and if you don't know how to do that or if you're too lazy to do it, use the Big_int module.
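For what it's worth, the arithmetic can be double-checked in a couple of lines of OCaml:

```ocaml
(* Counting to 4*10^18 at 10^9 increments per second. *)
let () =
  let seconds = 4e18 /. 1e9 in                        (* 4*10^9 seconds *)
  let seconds_per_year = 60. *. 60. *. 24. *. 365. in
  Printf.printf "%.1f years\n" (seconds /. seconds_per_year)
  (* prints "126.8 years" *)
```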

But let's go back to the representation of those 0s and 1s. I'd like to make the computation as fast as possible, so I'll put those 0s and 1s in a very compact representation: each 0 and 1 will now only take one bit in memory (with a possible constant overhead for the whole). For that, let's use OCaml's strings, which are basically arrays of bytes. The longest string I can have on my machine will bear 1_152_921_504_606_846_904 bits (I know that because I multiplied Sys.max_string_length by 8), and that's a lot (more than 10¹⁸).

Now say we want to know whether there are more 0s or 1s as fast as possible. How do we do that?

We don't want to count all the 0s and 1s bit by bit, because that would have quite a high cost! Indeed, we would have to get each byte, and for each byte we would have to read each of its 8 bits one by one. We don't want to do that.

Instead, since we have bytes, we can conveniently allocate an array of size 256. Each cell of that array will contain the right number to add to the counter for the corresponding byte value. This way, we can read a byte and update the counter in O(1).

(* this table is computed only once *)
let table =
  let number_for_a_byte b =
    let r = ref 0 in
    for i = 0 to 7 do
      if (b lsr i) land 1 = 0 then
        decr r
      else
        incr r
    done;
    !r
  in
  let a = Array.make 256 0 in
  for i = 0 to 255 do
    a.(i) <- number_for_a_byte i
  done;
  a

Then let's abstract from the means to read new 0s and 1s, by assuming we'll be provided a function f that given () will return 8 bits in a value of type char, and will raise the exception End_of_file when it has no more bits to give.

let more_zeros_or_ones f =
  let c = ref 0 in
  begin try while true do
    c := !c + table.(int_of_char (f ()))
  done with End_of_file -> () end;
  if !c = 0 then Equal
  else if !c > 0 then More_ones
  else More_zeros

Note that int_of_char has zero cost (i.e., it "disappears" at compile time, because int and char sort of share the same memory representation). If you want better performance, you should inline f, provided that you know what it is (you may want to check whether the compiler does the inlining itself first, just in case).

Also, you may want to use a table with a size larger than 256 if you have a lot of memory but I'm not so sure you'd gain performance unless you use nasty tricks to read larger integers from a string. Then again, you may not end up using a string, in which case you have to think with the whole problem in mind.
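Putting the pieces together, here is a self-contained sketch of the table-based approach driven from an in-memory string (the reader function and test strings are my own, for illustration):

```ocaml
type result = Equal | More_zeros | More_ones

(* 256-entry table: for each byte value,
   (number of 1 bits) - (number of 0 bits). *)
let table =
  Array.init 256 (fun b ->
    let r = ref 0 in
    for i = 0 to 7 do
      if (b lsr i) land 1 = 0 then decr r else incr r
    done;
    !r)

let more_zeros_or_ones f =
  let c = ref 0 in
  (try
     while true do c := !c + table.(int_of_char (f ())) done
   with End_of_file -> ());
  if !c = 0 then Equal else if !c > 0 then More_ones else More_zeros

(* A reader over an in-memory string, raising End_of_file when exhausted. *)
let reader_of_string s =
  let i = ref 0 in
  fun () ->
    if !i >= String.length s then raise End_of_file
    else (incr i; s.[!i - 1])

let () =
  assert (more_zeros_or_ones (reader_of_string "\x00") = More_zeros);
  assert (more_zeros_or_ones (reader_of_string "\xff") = More_ones);
  assert (more_zeros_or_ones (reader_of_string "\x0f\xf0") = Equal)
```

The same more_zeros_or_ones function works unchanged over a file or network stream: only the reader closure changes.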

by Philippe Wang at Wed, 25 Sep 2013

Liveblogging OCaml Workshop 2013 (SRG Syslog)

I’m here bright and early at the OCaml 2013 workshop at ICFP with Heidi Howard, Leo White, Jeremy Yallop and David Sheets!  There’s a packed session today with lots of us presenting, so we’ll swap over editing this post as the day progresses.  I also need to pop out to attend a panel on the future of Haskell so I’ll be missing the bindings session, sadly!

Accessing and using weather data in OCaml

Hez Carty from MDA Information Systems on doing large-scale weather data processing.

He started with OCaml in graduate school while looking for something better than Python; they are building systems that need to run for many years with high stability.  He’s not from a formal CS background but from the physical sciences, and found that static typing made it easy to explore the language's features.  The native-code output is also important, and he finds predictability very important in day-to-day engineering.  The simple FFI also made a big difference.

They do weather data retrieval and use HTTP/zeromq as a message bus to do language interop, and a workflow management tool to handle dependency graphs.  They take weather models from all these sources, and do various analyses (principal component analysis, probability models for precipitation via Monte Carlo simulations with first-order Markov chains).  More generally, there is a bunch of processing for climatology and data extraction and insertion.

A NAM forecast plots the geo data of the USA and overlays semi-transparent layers (e.g. clouds) that are stacked up to get a fake satellite view of the earth, plus simulated radar returns where the model (e.g.) rains.  In addition to the structured formats, there is raw streaming data from sensor stations that send back raw binary data such as precipitation events; a small OCaml library parses these into more structured formats.

The big data formats are HDF4 and GRIB, which are binary data formats. There are mature C libraries to parse them and lots of existing analysis tools (e.g. wgrib, which can extract lat/longs from a big file).  However, the tools aren’t great for doing larger-scale analysis, as shelling out all the time for thousands of points is not ideal, so they bind to OCaml to do the analysis in-house.

The GRIB bindings use int, float and float arrays, and the bindings are entirely hand-written (I’m looking forward to the ctypes talk! –anil).  They expose only an OCaml-friendly interface that is safe for callers.  The speaker now shows several nice graphs from CloudSat tracks from earth orbit, and how they take intersecting profiles to deduce things about what multiple satellites are seeing.

They’re mixing lots of different tools and satellite sources — derived information from teleconnections (air surface pressure and how it varies in terms of oscillations between the north and south Atlantic).  All the analysis is written in OCaml, and he had to work with several scientists who didn’t have access to their initial source code in many cases, so the experience of working with his fresh system was very positive.


  • Standard issues when interfacing with C (type mismatches between C and OCaml in particular, and the question of sharing vs copying).  The worst failure mode is silent failure with bad data, which leads to bad science!
  • “Getting used to the functional way of life” can be tough, as is knowing when to balance elegance against the speed benefits of not copying hundreds of megabytes of data around.
  • Development environments other than Emacs and Vim are poorly supported.

The overall project has been very successful, and the HDF4/GRIB API bindings have been used to process 1TB+/day for over 5 years now.  It’s based on Bigarray and mmap, and the FFI makes it quick to get up and running.

He finds OPAM with local repositories fantastic, and used GODI and odb/oasis-db before that.

Q: how did he do the cool plots?

A: plplot bindings on github

Q: Anil: did you move from Fortran

Q: Nate Foster: terabytes of data makes me wonder about GC?

A: going through and folding over data generates a huge amount of garbage and slows things down considerably. If the project is a one-off, he tweaks the GC or manually calls the GC more often. It really depends on the length of the code.
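The kind of tweaking he describes can be done with the stdlib Gc module; a sketch (the concrete values below are arbitrary, not from the talk):

```ocaml
(* Sketch of one-off GC tuning for a batch job: enlarge the minor heap and
   relax the space overhead, then force a major collection between chunks.
   The concrete numbers are arbitrary, not from the talk. *)
let () =
  let c = Gc.get () in
  Gc.set { c with Gc.minor_heap_size = 8 * 1024 * 1024;   (* in words *)
                  Gc.space_overhead  = 200 };
  (* ... fold over a chunk of the dataset ... *)
  Gc.full_major ()   (* manually trigger a major collection *)
```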

Q: The data processed is huge, so what was the parallel computation approach?

A: The task distribution system doesn’t need a lot of parallelism, as he can work on one file at a time or chunk the files manually and aggregate the results.  He’s looking forward to doing some semi-shared memory parallelism.

Q: Pierre Chambart: Is the numerical code in OCaml or C?

A: LAPACK in OCaml, but the Monte Carlo is pure OCaml and interpolation is in pure OCaml too.  Unless speed is absolutely required, it’s written in OCaml.

Frenetic Network Controller

Arjun Guha talks about their multiyear project to program software-defined networks. When he started on this, he thought that networks were just a series of tubes! The reality is very different though, as there are numerous intermediate boxes (firewalls, routers, wireless auth). The real problem is that each of these intermediate boxes runs specialized, vendor-specific software and is configured independently via some proprietary CLI. They’re difficult to configure and difficult to statically reason about.

SDN helps with this by introducing standardized programmable switches that can deploy in-network features, and a logically centralized controller (a big beefy server) that enables reasoning about whole-network behaviour. SDN is a hot topic right now (lots of commercial startups, 200+ people at HotSDN, generally very busy).

There are several Openflow controllers out there: Java (Floodlight), Python (Pox), C++ (Nox) and Haskell (Nettle), but Arjun introduces a novel OCaml one!

Openflow works via the controller “faulting in” new packets, which are routed to the controller when unknown packets show up on the wire. The source code consists of a generic Openflow packet parser (it’s a binary protocol), a generic packet parser (e.g. TCP/IP) and a routing engine.

First library: ocaml-packet, which deals with deserializing TCP/IP/ARP/ICMP/Ethernet/802.1Q, and only depends on cstruct by Anil et al. (shows an example of a C struct for IPv4); the nice thing is that they get network byte order for free (thanks to Pierre Chambart’s ocplib-endian of course).
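Without cstruct’s code generation, the equivalent network-byte-order reads look like this stdlib-only sketch (offsets follow the standard IPv4 header layout; these helper names are invented, not the cstruct API):

```ocaml
(* Stdlib-only sketch of what the generated accessors do: read fields of an
   IPv4 header in network (big-endian) byte order. *)
let get_u8 b off = Char.code (Bytes.get b off)
let get_u16_be b off = (get_u8 b off lsl 8) lor get_u8 b (off + 1)

let ipv4_version b = get_u8 b 0 lsr 4       (* high nibble of the first byte *)
let ipv4_total_length b = get_u16_be b 2    (* big-endian uint16 at offset 2 *)
```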

ocaml-openflow handles serialization for OpenFlow 1.0 and 1.3, and is based on a (cleaned up) fork of the mirage-openflow library, using ideas from Nettle about how to structure the code. OpenFlow 1.3 support is less complete currently but is still quite usable.

The runtime systems for OpenFlow 1.0 and 1.3 listen for TCP connections from switches, perform the OpenFlow handshake, and keep connections to switches alive.

The source code uses Lwt and so exposes the monad to external users, e.g.:

type t
val connect : Lwt_unix.file_descr -> t option Lwt.t

They have a tutorial for non-OCaml hackers, and found teaching OCaml, Openflow AND Lwt was too much for new users. Therefore they built a new controller called Ox that hides much of the detail of Lwt threading. It still uses Lwt and ocaml-openflow internally, with a design inspired by controllers like Nox and Pox.

Therefore simple OpenFlow tutorials can be built using Ox, while advanced controllers such as Frenetic can be built directly on the low-level libraries. Good for beginners and weathered OCaml hackers alike!
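The Ox style of hiding the monad can be pictured with a callback interface; everything below (module names, signature, fields) is invented to illustrate the idea, and is not the actual Ox API:

```ocaml
(* Invented illustration: the application author writes plain callbacks and
   never sees Lwt; a runtime (not shown) would call these from its event loop. *)
module type APP = sig
  val packet_in : switch_id:int -> payload:string -> unit
end

module Repeater = struct
  let seen : (int, string) Hashtbl.t = Hashtbl.create 16
  let packet_in ~switch_id ~payload = Hashtbl.replace seen switch_id payload
end

(* the callback module can be packed and handed to the runtime *)
let _app = (module Repeater : APP)
```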

The problem with the controller seeing every packet is that it gets flooded unless it inserts rules into the switches. Every switch has a flow table with priorities, patterns and actions, and tries to take action without bothering the controller. At the start of day the controller does see everything, but eventually flows settle down. The problem they are solving is avoiding having two copies of the program, one on the controller and one on the switch, and figuring out how to deal with race conditions across the two. The simple, logical structure of the OCaml program gets torn apart if they try to write these two parallel sets of logic rules by hand.
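The flow-table semantics just described (highest-priority matching rule wins, a miss goes to the controller) can be modelled in a few lines. This is a toy model for illustration, not the ocaml-openflow representation:

```ocaml
(* Toy model of a switch flow table: rules are (priority, pattern, action),
   the highest-priority matching rule wins, and a miss goes to the controller. *)
type action = Forward of int | Drop | ToController

let lookup (table : (int * (string -> bool) * action) list) pkt =
  (* sort rules by descending priority, then take the first match *)
  let by_priority =
    List.sort (fun (p1, _, _) (p2, _, _) -> compare p2 p1) table
  in
  match List.find_opt (fun (_, matches, _) -> matches pkt) by_priority with
  | Some (_, _, act) -> act
  | None -> ToController
```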

They deal with this in the Frenetic network controller. This is a DSL for programming OpenFlow networks; it has boolean predicates and several policy composition operators, and compiles to OpenFlow flow rules. It implements ideas from several papers (ICFP/POPL/SIGCOMM/PLDI) and is fully open source on GitHub (frenetic-lang/frenetic).

They made a transition from Haskell to OCaml for a few reasons. They specified several parts of the core compiler and runtime in Coq, and found Haskell/Coq extraction very painful (specifically, they had to abandon Coq functors, which had helped them discover key algebraic properties in their work on Kleene algebras). They also found it easier for their collaborators to use OCaml than Haskell in general.

Lwt programming is mind-boggling for beginners, although the syntax extension does help, and the lack of multicore is a future problem as the controller needs to scale up. Overall, they find that hacks at any layer are now possible and lots of projects have spun up!

Q: multicore is a common criticism, but could you give an example of how you would actually solve this if you had shared memory vs a fork model?
A: There are several networking algorithms in the literature, and it’s a barrier for consumers. Nate notes that they could write one by hand, but there’s a time issue there and they would rather have it automagically.

PHP Program analysis at Facebook

Pfff is all about dead-code removal, test coverage, checking for undefined functions and use of undeclared variables, and syntactical grep rules. They also do taint analysis via abstract interpretation, type-driven flow control and eventually (via Monoidics) a separation-logic model (they use OCaml too).

This talk is about static analysis of a huge codebase (5M+ lines of PHP code is scary!). Since he has a lot of big monitors, he built the tools to do analysis over huge codebases. He demonstrates this on the OCaml compiler itself (camlp4 is huge! boo!!! -anil). They can also zoom in in real time to each box and eventually see the contents of the file itself!

The goal is NOT to build an editor, but to analyse and explore large code bases; it opens an editor if required at any time. The idea is to transform semantic program analysis into a visual format that’s more easily digestible, just like Google Maps shows you the big important roads and only shows the details when you zoom in. He uses this tool all day on his own codebase too, and gradually realised that when you want to learn a new codebase, you want to start by learning the software architecture, and then figure out the dependency graph before seeing details.

He then shows an interactive grid map that explains the dependencies of whole subdirectories of the OCaml compiler, and drill down to understand what they’re for.

Audience Q: did you consider a flare graph? A: others have done edge graphs and graphviz etc., and none of these scale to quite the heights that this particular grid does. The ordering of the grid is also the transitive order of dependencies, so it starts with the simple dependencies (stdlib) and ends with the toplevel (which depends on everything). You can spot, for example, the OCaml dev rule that things in stdlib are only used by the …

With good software, the top right of the grid is empty; in OCaml it is forbidden anyway (no cyclic module deps). With other software, the situation is far worse (audience laughter).

There are many other tools (CodeQuery, stags, sgrep/spatch and scheck).  There are plugins for PHP, OCaml (using the cmt files), Java (using Joust) and C/C++ (using clang), getting all the semantic information from varied source code.  There is a plan to move it to the web (from a GTK interface) by using Ocsigen.


Conclusion: CodeMap is a scalable semantic source-code visualizer/searcher for huge codebases, and CodeGraph visualizes dependencies.   Future work is to minimize backward deps.

All on Github at

Q: can you insert other metadata such as Linux perf output?

A: yes! (I think)

Extension points: an alternative to camlp4

Leo White talking about alternatives to camlp4. Why? Camlp4 is very complex, and much of that complexity comes from its support for an extensible grammar. A wg-camlp4 working group was set up to figure out how to evolve common uses towards a more easily parsed standard, to reduce the day-to-day dependency on camlp4. After 400 emails, a consensus emerged.

sexplib as an example: the old version was

type t = {
  foo: int default(42);
  bar: float;
} with sexp

The annotated version is:

type t = {
  foo: int; [@default 42]
  bar: float
} [@@sexp]

This annotation data is passed straight through to a transformer, with no extensions to the grammar. Thus the sexplib transformer can understand what @default 42 means, and editors can easily hide or fold annotations without having to understand their details precisely.

The compiler ignores any AST extension fragments that remain after all the transformers are done.


  match foo with
  (Range ('a','z') | Range(1000,1500)), 65 -> Foo
  | Star (Range('a','z')) -> Bar

These are genuine AST nodes, so if they aren’t handled by the transformer an error will result. Another example is embedding properly foreign syntax, via COW (Caml on the Web):

let world = "world" in
let html = [%html {|<h1>Hello $str:world$!</h1>|}]

There’s also a shortcut convention for the syntax, such as:

match%lexer foo with

There’s also a shorthand let form to avoid having lots of brackets from deeply nested lets such as:

[%lwt let x = f () in
[%lwt let y = g () in

These have all been implemented by Alain Frisch and merged onto trunk (in OCaml 4.02.0dev). Thoughts and feedback are very welcome on the wg-camlp4 list.

Q: how does this affect the quality of error messages?
A: Leo -- do you mean error messages from the extension itself?
Q: what if there is an error in the extension?
A: The 42 that was in the default is an AST fragment, and so the transformer will transplant the location. A sensible transformer will therefore be non-lossy.

Q: can you chain transforms?
A: yes, that’s why we are going with an AST fragment rather than a string, so chaining is easier. If you’ve got some macro expansion transformer and you want to use it with the sedlex example, they are executed in the order they are given on the command line. The next transformer will just see the resulting AST fragment, and therefore they should compose better than camlp4 extensions have in the past.

Q: do you have ideas about hygiene as Scheme does?
A: Not particularly in this case, although Leo has some ideas for future systems.

Q: Ashish Agarwal -- do existing extensions need to be reimplemented?
A: The good thing about all these existing extensions is that they have an existing parser, so you can take your existing camlp4 implementation and mechanically translate it into an extension_points version!

Q: Yaron: is anyone thinking about good libraries (hygiene etc.)?
A: Alain Frisch has merged ppxtools into OPAM stable (it depends on 4.02.0+)

Q: Will this improve performance over camlp4?
A: Parsing is an expensive process and this is still doing marshalling. The cost is often that we are shelling out to this other process, so we still need to look into improving performance in other ways.

GPGPU programming with SPOC

M Bourgoin presents his PhD work on SPOC.

GPGPU is odd hardware: several multiprocessors with dedicated memory, connected via PCI-E, with data transferred between the GPU and host using DMA. Current hardware has 300-2000 cores vs 4-16 CPU cores (wow). However, this demands a very specific programming model to take advantage of all this hardware.

Shows an example of vector addition in OpenCL:

__kernel void vec_add(__global double *c,
                      __global const double *a,
                      __global const double *b,
                      int N)
{
  int nIndex = get_global_id(0);
  if (nIndex >= N) return;
  c[nIndex] = a[nIndex] + b[nIndex];
}

To actually use this though, you need a massive program to set up the kernel and execute it. It does the transforms, initialization, scheduling and so forth.

OCaml and GPGPU complement each other: we have sequential, high-level OCaml and highly parallel, low-level GPGPU. So the idea with SPOC is to use OCaml to develop high-level abstractions for the GPGPU and program it from OCaml.

SPOC is a stream-processing engine in OCaml that emits Cuda or OpenCL, making programs more hardware independent.

let dev = Devices.init ()
let n = 100_000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let k = vector_add (v1, v2, v3, n)
let block = { blockX = 1024; blockY = 1; blockZ = 1 }
let grid = { gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1 }

let main () =
  random_fill v1;
  random_fill v2;
  k (block, grid) dev.(0);
  (* printf of result *)

SPOC takes care of transferring the kernel and data to the GPU and back, and handles the scheduling and exchange of data. Vectors are very similar to Bigarrays in that they are external to the OCaml heap.

We want: simple to express, predictable performance and extensible libraries, and for current high performance libraries to be usable more safely.

Two solutions: interop with Cuda/OpenCL kernels (better performance and more compatible, but less safe), or a DSL in OCaml called Sarek, which the talk focusses on.

Sarek vector addition has an ML-like syntax; the language is monomorphic, imperative, and statically type checked (with type inference). It’s statically compiled to OCaml code and dynamically compiled to Cuda and OpenCL.

SPOC+Sarek achieves 80% of hand-tuned Fortran performance, and SPOC+external kernels is on par with Fortran (93%). It’s type safe, gives a 30% code reduction, memory is managed by the GC and there are fewer memory transfers. This is ready for the real world and has great performance! Available on OPAM.

Great performance, portable, great for GPU and for multicore CPU and nice playground for future abstractions.
Who can benefit?

OCaml programmers -> better performance
HPC programmers -> much more flexible and extensible than current state of the art.

Future work: Sarek needs custom types, function decls, recursions, exceptions, and to build parallel skeletons using SPOC and Sarek.

Why not use Sarek for multicore? Well, here are the results!

  • ParMap: data parallel and similar to OCaml mapfold
  • OC4MC: posix threads, compatible with current OCaml code
  • SPOC : GPGPU kernels on a CPU, mainly data parallel and needs OpenCL.

Simple program to calculate power series and matrix multiplication.

Power: OCaml 11.14s, ParMap 3.30s, SPOC <1s
Matmul: OCaml 85s, OC4MC 28s, SPOC 6.2s

Significant improvements using SPOC — 11x for the power series and 12x for the matrix multiplication, running on a quad-core Intel i7.

SPOC is available via OPAM and compatible with x86_64 on Linux and MacOS X, and speaker would love feedback.

Q: Anil – how hard is OpenCL to install?
A: easy if you know your hardware model and go to the vendor website. Harder with multiple GPUs though.

A: if you want to target CPUs, you need to target OpenCL and not CUDA.

High level performance optimisations for OCaml

Pierre Chambart at OCamlPro (work funded by Jane Street). OCaml is fast, but it does not have an optimizing compiler (it has predictable performance and good generated code, but what can we do to make it go faster?).

A small modification to high-level code shouldn’t influence the low-level bits.

let f x = 
  let cmp = x > 3 in
  if cmp then A else B

let g x =
  if x > 3 then A else B

The second bit of code is faster than the first here since ‘g’ is spotted by the compiler peephole optimizer.

let g x = 
  let f v = x + v in
  f 3

Abstract code should be compiled less abstractly. Above, we want to inline f, but a closure is still allocated per call unnecessarily. But we also don’t want the compiler to be too smart and lose predictability!

The current compiler pipeline goes: parsetree – typedtree – lambda (untyped) – clambda (lambda + closures) – cmm (simple C-like) – mach (instruction graph, similar to LLVM IR) – lin (assembly-like).

Details: typedtree to lambda (construct elimination); lambda (high-level simplification such as reference elimination and pattern-matching generation); lambda to clambda (inlining, constant propagation and book-keeping in an intricate recursive loop); clambda to cmm (unboxing, lots of peephole checks for boxing/unboxing and tagging/untagging); cmm to mach (instruction selection and some more peephole); mach (allocation fusion, register allocation and scheduling).

Where do we do high-level optimizations then? For example, where do we do inlining (since it allows other transformations to apply)? typedtree is too complicated; lambda is where we want inlining, but it’s simpler with closures; clambda is too difficult to improve (and Pierre did try); cmm is good for local optimization but not global; and mach is architecture-specific.

They are adding a new intermediate form between lambda and clambda known as flambda. This needs a high-level view of values, simple manipulations, explicit closures and explicit value dependencies. lambda to flambda introduces closures, and flambda to clambda is mainly book-keeping (and preparing cross-module information). The magic for optimization will be in flambda-to-flambda passes, and not in the flambda-to-clambda layer.

Shows several examples of inlining, simplification and dead-code elimination, and why it’s significantly easier in the flambda form for simple micro-examples. Also shows lambda lifting!

They also want to change the performance model. Now it’s WYSIWYG, but we want some sort of understandable compile-time evaluation. For example, consider a map function as a metaprogramming function that you want to execute at compile time without all the intermediate closure allocation.

Add build_test to your OPAM packages to make it easier to test the compiler. Obj-using code is very likely to break if you do certain mutations behind the compiler’s back, and dead-code elimination might happen if you use a value behind the compiler’s back.

Q: SimonPJ: why is it easier to do closure conversion first, as it’s a lot easier in GHC to do the opposite. There are now two ways to get a value in scope doing it the other way.
A: In Haskell you have everything in your closure, since you get the information from the closure.
Q: Think of lambda calculus with let
A: We do flow analysis to get around this!
Q: This is quite complex. Is it whole program analysis?
A: Restricted to a single compilation unit, and a few other cases, not cross-module yet.

Q: Yoann: if the prediction is “it’s always faster” then why do you care about predictability?

Replacing the format string hack with GADTs

Benoit Vaugon talks about how to improve the typing of format strings in OCaml by using GADT encodings. He starts by showing basic examples of the Printf module, and advanced examples that include the “%a” callback, date scanning and hex.

The OCaml type checker has an inferred type that encodes the details of the format string in a “format6” type with lots of type parameters that describe the format string contents.
During type checking, the literal string is parsed and there is manual inference of the format6 type parameters. At runtime, format strings are represented by the string from the source code.

The printing functions parse the format, count parameters and accumulate parameters. They then extract and patch subformats, and finally call the C printf function. Scanning is more complex but broadly similar.

There are quite a few problems with this. There are 5 (!) format string parsers (two in Printf, two in Scanf, one in the type checker) and lots of little incompatibilities as a result. For example:

Printf.printf "%1.1s" "hello"

results in an Invalid_argument exception despite the static type checking, with a useless error message.

There is a weakness in the type checker too; for example

Printf.sprintf "%s.+f" 3.14

results in the format string itself being printed.

You can also segfault from there, such as

Format.printf "@%d%s" 42 "hello"

will segfault.

Speed is also a concern, as parsing the format at runtime is slow, and reparsing is required by another slow C printing function. Lots of memory allocation is required.

The new implementation sorts all this out by using GADTs to represent formats. The format6 type is now concrete rather than magically predefined. Formats also become statically allocated, rather than being dynamically passed around multiple times.

The new implementation (which he shows) is a fairly epic GADT but quite regular when broken down into pattern clauses and examined with respect to the old implementation.
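The flavour of the encoding can be seen in a stripped-down sketch (the constructors below are invented and far simpler than the real CamlinternalFormat type): the format’s type parameter tracks the argument types, so no runtime string parsing is needed.

```ocaml
(* Minimal sketch of a GADT-typed format, not the stdlib implementation:
   the first type parameter accumulates the printer's argument types. *)
type (_, _) fmt =
  | End : ('r, 'r) fmt
  | Lit : string * ('a, 'r) fmt -> ('a, 'r) fmt
  | Int : ('a, 'r) fmt -> (int -> 'a, 'r) fmt
  | Str : ('a, 'r) fmt -> (string -> 'a, 'r) fmt

(* interpret the format directly; no string is parsed at runtime *)
let rec kprintf : type a. Buffer.t -> (a, unit) fmt -> a = fun buf -> function
  | End -> ()
  | Lit (s, k) -> Buffer.add_string buf s; kprintf buf k
  | Int k -> fun i -> Buffer.add_string buf (string_of_int i); kprintf buf k
  | Str k -> fun s -> Buffer.add_string buf s; kprintf buf k
```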

Issues: evaluation order changes slightly, as for printing functions the parameters are accumulated before printing. string_of_format is implemented by “%identity”, and in the new implementation the representation needs to change (either regenerate the string from the GADT, or implement formats as a tuple).

There’s now only one format parser (down from 4-5) for the standard library and OCaml type checker:

type ('b,'c,'e,'f) fmt_ebb = Fmt_EBB:
 ('a,'b,'c,'d,'e,'f) CamlinternalFormatBasics.fmt ->
 ('b,'c,'e,'f) fmt_ebb
val fmt_ebb_of_string : string -> ('b,'c,'e,'f) fmt_ebb
val type_format: ('x,'b,'c,'t,'u,'v) format6 ->
   ('a,'b,'c,'d,'e,'f) fmtty -> ('a,'b,'c,'d,'e,'f) format6

(This looks verbose, but isn’t really exposed outside of the compiler.)

Another issue is the “%(..%r..%)” construction, since we need to include a proof term of the number of “%r” occurrences. The encoding for this is quite complex.

Performance of the new system is significantly better across the board. A hello world goes from (230ns to 55ns) timing and (732 to 24 bytes allocated) for example.

Q: Oleg did it already in Haskell (ha ha -- editor). He notes that six parameters aren’t needed, as the GADT is the interpreter over the parameters.
A: yes, we can remove them if we split the Printf and Scanf formats, and there is also a format4 which removes several parameters. We also have a problem with error messages, since all the type parameters make things really complicated.
Q: Oleg: the work of interpreting the stream is pushed into the GADT, but if you write an interpreter over the GADT then we have simpler external types.
…discussion moves into offline discussion.

Dynamic types in OCaml

Gregoire Henry wants to do dynamic types in OCaml:

  • Structural type introspection: generic I/O primitives, type-safe (unlike Marshal) and typed (not written in Camlp4) — these are polytypic functions.
  • Normal type introspection, for dynamic values (a k/v store or a DSL with dynamic typing), or extensible polytypic functions for abstract types.
  • This would also give us a common type representation for (e.g.) Eliom services that will be converted to Javascript, or for FFI libraries, and to let the debugger explore the heap with exact typing information.

There probably isn’t a single representation that fits all these use cases: we need to preserve abstraction when desired, but also to break abstraction when desired (e.g. by the debugger or a generic printer, but not by accident!), and it shouldn’t impose a big performance hit.

Gregoire gives a description of a polytypic printing function:

let rec print (type t) (ty : t ty) (v : t) =
  match head ty with
  | Int -> print_int v
  | String -> print_string v
  | List ty -> print_list (print ty) v
  | Sum desc ->
    let (name, args) = sum_get desc v in
    print_string name

We define a predefined type “t ty” and a syntax for type expressions “(type val t)” of type “t ty”. We then introduce implicit type arguments, which are optional arguments instantiated at the call site with the dynamic representation of the expected type.
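A self-contained version of the same idea can be written in plain OCaml today, with a minimal invented `ty` (the proposal’s representation type is much richer than this):

```ocaml
(* A minimal runtime type representation, invented for illustration. *)
type _ ty =
  | Int : int ty
  | String : string ty
  | List : 'a ty -> 'a list ty

(* A polytypic printer over that representation: the GADT match refines
   the type of [v] in each branch. *)
let rec print : type t. t ty -> t -> string = fun ty v ->
  match ty with
  | Int -> string_of_int v
  | String -> v
  | List elt -> "[" ^ String.concat "; " (List.map (print elt) v) ^ "]"
```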

val print : ?t:(typ val 'a) -> 'a -> string
let print ?(type val t) (v:t) = ...

type t = R of int * int | ...
let x = R (1,2)
let () = print x (* implicit arg is (type val t) *)

This is experimental syntax, but it shows how we can keep the representation abstract while the printer can still operate over it, since the implicit argument passes the type representation around.

The main problem is now how we mix polytypic functions and abstraction, without always printing “abstract” instead of the real underlying structure. One solution is an extensible polytypic function.

module M: sig
  type t
  val x: t
end = struct
  type t = R of int * int
  let x = R (22, 7)
  let () = register_printer (external type val t) (fun x -> ... )
end

let () = print x

We register a printer internally in the module, without exposing its internal type externally, but still letting it be printed. The internal and external types are different though. We do need to figure out a canonical name for types defined outside the current compilation unit (their absolute path).

By default though, abstraction should consistently introduce new nominal types, but we’re not sure how to reference all of the external names of a given type within its initial compilation unit (open problem). A pragmatic approach is the manual or semi-automatic creation of runtime “type names”.

In conclusion, they have a GADT for structural introspection and an RTT representation with global names.  There is also a type-constructor-indexed association table, plus implicit type arguments (lightweight syntax for calling polytypic functions, with explicit type parameters for polymorphic functions).

Q: what’s the gain of putting this into the compiler?

A: you may miss language corners, such as first-class modules or functors, as your “userspace” OCaml code has to pass a type witness into a functor or first-class module.  Compiler support makes this passing much easier.  The other big benefit is clear error messages from compiler integration in the long term.  It’s not currently easy for beginners, and this is very important to get right if this becomes a supported feature.

On variance, injectivity and abstraction

Jacques Garrigue speaks about an interesting bug in the type checker that needs some language changes: PR#5985, a serious type-checker bug that lost injectivity.

module F (S : sig type 'a s end) = struct
  include S
  type _ t = T : 'a -> 'a s t
end
module M = F (struct type 'a s = int end)
let M.T x = M.T 3 in x
- : 'a = <poly>   (* the type is lost! *)

After expanding “s”, the definition of “M.t” is actually “type _ t = T : ‘a -> int t”, but “‘a” is not marked as an existential.

In order to protect against this unsoundness, all variables in type declarations must be bound: either they appear inside the type parameters, or they are existentially bound (i.e. only present in GADTs). Inside type parameters, these variables must be injective.

OCaml doesn’t check injectivity directly, since it relies on variance inference (because we have subtyping). The variance of a parameter is either explicit (for abstract and private types, or constrained parameters), or it is inferred from its occurrences otherwise.

Constrained type parameters have been present since the very start.

type 'a t = T of 'b constraint 'a = 'b list

The rules for checking variance in this case become much more complicated, since the variance of constrained parameters must now be explicit. The variance of type variables inside constrained parameters must also be weaker than or equal to the variance inside the body of the definition.

type +'a t = T of 'b constraint 'a = 'b list (* 'b covariant *)
type 'a r = 'a -> int                        (* contravariant *)
type +'a t = T of 'b constraint 'a = 'b r    (* Fails *)
(* 'b is contravariant in parameters but covariant inside *)

In OCaml though, the variance of a parameter is allowed to be weakened through abstraction.

module M : sig
  type +'a u
end = struct
  type 'a u = int
end
This is correct for the types, but becomes incorrect when used for type parameters.

module F (X : sig type 'a r end) = struct
  type +'a t = T of 'b constraint 'a = 'b X.r
end
module N = F (struct type 'a r = 'a -> int end)

Even when we assume that r is invariant, ‘b is inferred as invariant from the parameter of t, which overrides the covariance of the body. Meanwhile in the module N, the variance is now wrong.

How do we fix this? We need to refine the definition of variance: traditional variance subsumption defines a lower bound on the variance, but we need to add some upper-bound information too, to be sure that parameters cannot have a stronger variance.

(technical explanation of how to compose variances omitted here)

In OCaml 4.01, full variance inference is done using 7 flags (the last is a special case required for principality). Some uses will no longer type check, so some programs may be rejected for safety.

In the future, Jeremy Yallop, Leo White and Jacques would like to add injectivity annotations for abstract types.

type #'a s
type _ t = T : 'a -> 'a s t

Add new types for isomorphic abbreviations, as Haskell does

module M : sig type #'a t val f : int -> ['pos] t end = struct
  type 'a t = new int
  let f x = (abs x : int :> 'a t)
end

This is similar to private, but subtyping works in both directions. This is useful for efficiency and for encoding runtime types, and allows coercions to be delayed until the signature rather than the body (in the example above, the coercion in “f” could be omitted due to the signature).

Another problem is that one cannot prove the uniqueness of abstract types, and we don’t know whether an abstract type is contractive (so even with rectypes, the following functor can never be defined):

module Fixpoint (M : sig type 'a t end) = struct
  type fix = fix M.t
end
Error: the type abbreviation fix is cyclic

One also never knows whether an abstract type may be a float, so the float array optimization may be mismatched.

OCamlot: lots of OCaml testing!

(We will post David’s slides on the OCaml Labs website, since it’s one of our talks, and will edit this post when that happens.)

Merlin: editor support for OCaml

Merlin is a sophisticated editor assistant for OCaml that provides autocompletion in vim and emacs. They leap straight into an interactive demo that’s awesome: autocompletion, module introspection when writing a functor, and so on.

Unlike other tools, this doesn’t rely on a CMT file, so code doesn’t need to compile (except for source browsing). Your project can have type errors when using Merlin!

It integrates with the ecosystem and has direct support for findlib. There is no general support for camlp4, but there is specific support for some extensions. If you want a shiny new extension supported, talk directly to them.

There is build system support for running either omake or Jenga in polling mode (for iterative recompilation). Omake rules can inform Merlin to get a consistent state of the project!

A typical session: parsing is done in two stages, and the OCaml parser has been modified to be much more resilient to errors. The usual typing rules are applied, but passed on to a more “relaxed” typer in case of errors, to give some support for incremental parsing.

Coming soon is documentation (ha!), as well as retrieval of ocamldoc comments to further augment hints, and even more error-recovery heuristics. Some of the patches will also be submitted upstream to improve error handling (by accumulating errors instead of printing them directly).

It’s available on OPAM under an MIT license via “opam install merlin”. Thanks to Jane Street for letting them work on it during an internship there, and to all the bug reporters!

Q: Have you considered running Merlin on a remote system, speaking the same JSON protocol?
A: They haven’t, but it really shouldn’t be a problem.

Q: Julien V: do we have to build first?
A: No, you don’t have to build first.
Q: What happens when you change branches in a project?
A: For project-wide information you need to compile and rebuild.

Q: In the second pass of type checking, what does “relaxed” mean?
A: When typing fails for some branch of the AST, in the simple case we introduce a fresh type variable and attempt unification with the result. This will never be sound, but it lets Merlin make progress and give SOME feedback to the user.

Q: Is this a new type checker?
A: No, it’s the original type checker, with augmented error-handling paths that do more unification.

Q: Is the relaxed checking really worthwhile? It wasn’t much use in gcc, since you get a lot of follow-on errors that are meaningless.
A: Lots of discussion, opinions split; it can be disabled!

Understanding memory behaviour of programs

Cagdas Bozman is doing a PhD on memory profiling. He wants to study the behaviour of OCaml programs, work on decreasing memory footprint, and generally reduce GC pressure in real OCaml programs.

Shows a Why3 graph (see photograph); they are running nice fine-grained analysis over it on real-world workloads. How do they generate these graphs?

$ opam switch 4.00.1+ocp-memprof
$ opam install why3
$ OCAMLRUNPARAM=m why3replayer.opt -C why3.conf p9_16

A patched compiler plus the OCAMLRUNPARAM setting will generate a lot of sampled snapshots of the OCaml heap. There’s no need to change the code or the compilation options. Then install the analysis tools:

$ opam install ocp-memprof
$ ocp-memprof -loc -sizes PID

The final step analyses all the snapshots. But what is a snapshot? It’s a compressed version of the OCaml heap, with location ids and a pointer graph, and it saves the globals (toplevel modules). The snapshots are obtained by a linear scan of all the heap values.

OCAMLRUNPARAM=m forces the program to generate a snapshot after every major GC; you can also request a snapshot via a HUP signal, or via a new “dump_heap” function that has been added to the Gc module.

It works by reserving some space inside the standard OCaml value header, so there is minimal runtime performance impact except when dumping out snapshots after a GC. There’s no space overhead since the header size isn’t changed. However, it’s only available on 64-bit platforms and reduces the maximum block size to 64GB. They ran an experiment with OPAM, and location identifiers are limited.
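As a back-of-the-envelope reading of that limit (my arithmetic and my assumed header layout, not figures from the talk): a 64-bit block header normally splits into an 8-bit tag, a 2-bit GC colour and a 54-bit size field; capping blocks at 64GB shrinks the size field and frees bits for a bounded location identifier:

```latex
\underbrace{8}_{\text{tag}} + \underbrace{2}_{\text{GC colour}}
+ \underbrace{33}_{\text{size: } 2^{33}\,\text{words}\,=\,64\,\mathrm{GB}}
+ \underbrace{21}_{\text{location id}} = 64\ \text{header bits}
```

This would also explain why the number of distinct location identifiers is limited.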

There’s one tool based on the snapshot identifiers: the “memprof” tool gets all identifiers from the heap and their corresponding locations, then looks up the right types for a specific location using cmt files (the typed AST).

Plan for this work is to improve the current framework (aggregate information by type and location) and to recover more types, and build more tools around it!

Q: Roshan: one common use case is that when grabbing locations, it’s boring to see Array.create; it would be nice to see a backtrace of the execution to see where the allocation happened.
A: This is a problem indeed (discussion about the ongoing Mark Shinwell patches too, and the previous talk about backtrace analysis from last year’s workshop).

Benchmarking in Core

Roshan James on microbenchmarking in Jane Street’s Core. Precise measurement is essential for performance-critical code. We want to measure the execution cost of functions that are relatively cheap: functions with a short execution time that run millions of times in a tight loop.

The essence of benchmarking is to run a timing loop (a lot of times). We need to figure out how to compute a batch size that accounts for the time it takes to run the actual timing measurement. Similar to Criterion for Haskell, where they loop to figure out how long it takes to read the clock itself, then run the benchmark function and compute a full estimate. This works OK, but it requires density plots to let you visualise the form of the error (bimodal, or something else, for example).
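The batching idea can be sketched as follows (my illustration, not Core_bench’s actual code): time a whole batch of calls, so the cost of reading the clock is amortised across the batch.

```ocaml
(* Time [n] calls to [f] as a single batch. The clock is read only
   twice per batch, so its overhead shrinks relative to the measured
   work as the batch size grows. *)
let time_batch f n =
  let t0 = Sys.time () in
  for _ = 1 to n do ignore (f ()) done;
  Sys.time () -. t0

(* Collect (batch size, total time) samples over geometrically
   increasing batch sizes. *)
let samples f =
  List.map
    (fun n -> (float_of_int n, time_batch f n))
    [ 1; 2; 4; 8; 16; 32; 64; 128 ]
```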

There is a lot of noise in microbenchmarking, such as delayed costs due to GC, and variance in execution times is affected by batch sizes. So if we vary the batch size, the variance changes quite a bit since the GC is running in different intervals during the batches. The GC noise has a periodicity though!

They treat microbenchmarking as a linear regression, fitting execution time to batch size. They vary the batch size geometrically to get a better linear fit (shows a very linear graph). There’s now no need to estimate the clock and other constant errors, since they are accounted for by the y-intercept, which captures the constant costs.
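A minimal version of that fit (plain ordinary least squares, written here for illustration; Core_bench’s estimator is more elaborate): given (batch size, total time) samples, the slope estimates the per-iteration cost and the y-intercept absorbs the constant clock and measurement overhead.

```ocaml
(* Ordinary least squares for y = intercept + slope * x over (x, y)
   samples: slope = per-run cost, intercept = fixed costs. *)
let fit samples =
  let n = float_of_int (List.length samples) in
  let sum f = List.fold_left (fun acc p -> acc +. f p) 0. samples in
  let sx = sum fst and sy = sum snd in
  let sxx = sum (fun (x, _) -> x *. x) in
  let sxy = sum (fun (x, y) -> x *. y) in
  let slope = (n *. sxy -. sx *. sy) /. (n *. sxx -. sx *. sx) in
  let intercept = (sy -. slope *. sx) /. n in
  (slope, intercept)
```

Feeding in samples that lie on, say, t = 1.0 + 2.0·n recovers slope 2.0 and intercept 1.0.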

Other costs can also be predicted in the same way, such as estimating memory allocations and promotions from batch size too. This also allows measuring goodness of fit, using R² and bootstrapping techniques.

Example use of Core_bench is really easy, just:

open Core.Std
open Core_bench.Std
let t1 = Bench.Test.create ~name:"id" (fun () -> ())
let () = Bench.bench [t1]  (* runs the tests and prints the table *)

This benchmarks the identity function, for example; just put something more complex in there. The output is an Ascii_table with lots of columns showing interesting statistics.

Some functions have a strange execution time, and the observations aren’t a great fit. Between runs, they also query the GC directly to count how many times certain events happened (minor collections and so on), and include these along with the batch size in the predictors. With multiple predictors, they get really close to the observed results.

(lots of interesting maths about how they perform runtime cost decomposition)

It’s available via “opam install core_bench”. They want to expose more predictors by measuring the effect of live words on performance, and counters for major collection work per minor GC. Accuracy of results is interesting: least-squares is susceptible to outliers, so they can incorporate the knowledge that the measurement noise is heavy-tailed.

by Anil Madhavapeddy at Tue, 24 Sep 2013

On the Implementation of OPAMDOC (Philippe Wang)

Early this year, Leo White started the implementation of opamdoc. Then Vincent Botbol worked on it from late May to mid August during his stay in Cambridge.

Now Vincent is back to studying computer science research in Paris, and he continues working on opam-doc when he can, in his free time.

I didn’t really want to get too deeply involved in the implementation of opamdoc (mainly because it takes time). But since I ended up doing most of the technical side of the new web site, I eventually had to dig into opamdoc’s source code to integrate its output into the website.

If you look at the source code of opamdoc, you’ll see a bunch of files in the root directory. Some of them are inherited from the implementation of ocamldoc, and some are new. I’ve mostly contributed to two of them: one implements the HTML backend of opamdoc, and the other contains the JavaScript engine that loads the documentation. This blog post is mostly about my experience with those two files.

This big file (about 1.5 kloc) contains the functions that retrieve information from cmt, cmd, cmti, cmdi and cmi files in order to build the HTML document.

Side note:

The .cmt files were officially introduced in the standard OCaml compiler in version 4.00.0, so it's pretty recent work. Previously, they were produced by a separate tool developed at OCamlPro for TypeRex.

Two reasons why it's not super easy

Well, this can actually be explained in just a few words. To begin with, there are many cases to address. Actually, that is not quite exact: I should say that there are many cases to address, and those cases are fairly poorly documented, which is “normal” given that there was probably no particular motivation to put effort into documentation. This is true for the source code of ocamldoc and for its official documentation.

For instance, if you look into info.mli, you can see that the first type definition is:

type style_kind =
  | SK_bold
  | SK_italic
  | SK_emphasize
  | SK_center
  | SK_left
  | SK_right
  | SK_superscript
  | SK_subscript
  | SK_custom of string

and the only documentation for this type is (** The differents kinds of style. *). You can see that there isn’t really any documentation needed for those SK_... constructors, until you reach SK_custom of string. There you go! You have to guess...

It's not that hard when you have to guess a few times, it's a lot harder when you keep having to guess. That's my point.

Well, the other issue is more interesting. At the time ocamldoc was designed and implemented, I bet no one imagined what the folks at Jane Street would do with OCaml! I’m talking about the implementation of their open source library, known as Core. Core & co use a lot of include directives: the module Core.Std includes a lot of other modules, which in turn include other modules. If you wanted to generate a single full HTML page for Core.Std, you’d end up with a huge page, and such a page would contain a lot of information coming straight from other pages, so you’d end up with hundreds of megabytes. Instead, opamdoc generates only one page per package and one per module. If a module includes another one, the first fetches the documentation of the second on the fly. So we only have 8.4MB of HTML for {async, async_core, async_extra, async_unix, core, core_bench, core_extended, core_kernel} in total (this number should increase in the future, but linearly, as people hopefully start documenting all those packages). And that’s why we have a JavaScript loader, docs_loader.js.

Since the JavaScript may have to load tens of megabytes of HTML, you have to program some nasty functions and loops... and at some point the data becomes big enough for your browser to stop responding while it’s busy loading your documentation. There are several solutions to that. The best would probably be to stop writing in JavaScript (and use something else that compiles to JavaScript), but that’s for the next step. For now, we’re trying to make the JavaScript work.

The problem with JavaScript is that there is basically one event loop: events are queued and run one after another, and while one event handler is running, the browser cannot do anything else. That’s why your browser may tell you at some point that a script is making it ill, and ask whether you want to stop executing that script. One workaround, when you know you are asking for a lot of computation, is to divide the big computation into smaller ones. You can use window.setTimeout for that: transform a recursive call like f() into window.setTimeout(f, 0), so that f will be called at some later point (and if you have iterative loops instead, write them recursively and use window.setTimeout). That’s still bad, because the browser cannot tell that it’s crazy busy... and it really is busy. So you can increase the value from 0 to 1 or 10 or more; but if you increase it too much, everything becomes too slow...

Ok, well, don't use JavaScript. It's a nightmare.

We will probably rewrite the script using Js_of_ocaml at some point, mainly because when you write in OCaml, you have static type checking, and that saves so much time!

To be followed...

by Philippe Wang at Tue, 24 Sep 2013

Feedback requested on the redesign (Amir Chaudhry)

There is a work-in-progress sandbox site where we’ve been developing both the tools and the design for the new pages. This allows us to test our tools and fix issues before we consider merging changes upstream.

There is a more detailed post coming about all the design work to date and the workflow we're using, but in the meantime, feedback on the following areas would be most welcome. Please leave feedback in the form of issues on the sandbox repo. You can also raise points on the infrastructure mailing list.

  1. OCaml Logo - There was some feedback on the last iteration of the logo, especially regarding the font, so there are now several options to consider. Please look at the images on the GitHub wiki and then leave your feedback on issue #16 on the sandbox repo.

  2. Site design - Please do give feedback on the design and any glitches you notice. Text on each of the new landing pages is still an initial draft so comments and improvements there are also welcome (specifically: Home Page, Learn, Documentation, Platform, Community). There are already a few known issues, so do add your comments to those threads first.

by Amir Chaudhry at Tue, 24 Sep 2013

OPAM-DOC with multiple OCaml packages (Philippe Wang)


The tool ocamldoc, shipped with the standard distribution of the OCaml compiler, suffers from some limitations because it doesn’t have access to all the information it needs. Notably, if you want to build the documentation of some package a that depends on some package b, you’ll find that the documentation of package a does not automatically refer to the documentation of package b. Well, opamdoc solves this problem, and it’s very nice.

On this site, which is a work in progress, you can find a list of packages with their descriptions. This list is built using opam2web, and each package description links to its documentation. The documentation for all packages is built as a whole, which means that the documentation of a package links to the documentation of all the packages it depends on. Isn’t it nice? :)

Install and run opamdoc

opamdoc is still under active development, the current step is finding and fixing remaining issues and packaging.

If you want to use opamdoc now, here are some instructions, based on Anil’s:

opam remote add opamdoc git://
opam switch 4.01.0beta1+opamdoc
eval `opam config env`
opam install opamdoc
open ~/.opam/4.01.0beta1+opamdoc/opamhtml/index.html
  • opamdoc-rebuild builds a database of cmt/cmd files in ~/.opam/4.01.0beta1+opamdoc/opamdoc.
  • opamdoc-generate builds a package HTML list in ~/.opam/4.01.0beta1+opamdoc/opamhtml.

Well, they work quite well, except that opamdoc-rebuild uses .cmd files as its base, whereas it should use .cmt files instead, because sometimes .cmt files exist while .cmd files don’t. This is a problem for a few packages (Core is one of them).

My advice: if something doesn’t work as expected, please open an issue; and if you want to customise the way the documentation is built, don’t hesitate to edit opamdoc-rebuild and/or opamdoc-generate (they are Bash scripts).

A few notes about those instructions

The first line, opam remote add opamdoc git://, gives meaning to the second one, opam switch 4.01.0beta1+opamdoc. And 4.01.0beta1+opamdoc is a customised compiler that produces, as a side effect, the .cm{t,d}{,i} files used by opamdoc. Using it is very convenient! If you don’t use it, you need to produce those files in some other way of your choosing (really? you want to go through such trouble?).

Note that opamdoc is written in such a way that it would take a lot of work to make it work with compiler versions prior to 4.01.0, because it uses features introduced in that very version.

Just in case, here follow the customised versions of opamdoc-rebuild and opamdoc-generate that I used to build the documentation available online.


Here's my customised version of opamdoc-rebuild:

#!/usr/bin/env bash
# Rebuild the cmd/cmt archive in ~/.opam/<switch>/opamdoc
# Copies all the cmt/cmti/cmd files found in the OPAM build
# dir into a single directory structure, separated by MD5
# to keep files distinct.

set -e
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
  echo Must be using a custom OPAM switch for this to work.
  exit 1
fi

function calc_md5_for_file() {
  if builtin command -v md5 > /dev/null; then
    md5=$(cat $1 | md5)
  elif builtin command -v md5sum > /dev/null ; then
    md5=$(cat $1 | md5sum | awk '{print $1}')
  else
    echo "Neither md5 nor md5sum were found in the PATH"
    exit 1
  fi
}

BASE="$(dirname $(dirname $(ocamlc -where)))"
# OPAM build dir (input) and opamdoc archive dir (output) for this switch
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
rm -rf ${DOC}
mkdir -p ${DOC}

PKGS=$(find ${BUILD}/ -mindepth 1 -maxdepth 1 -type d "$@")
echo ${PKGS}

for pkg in ${PKGS}; do
  pkgname=$(basename $pkg)
  echo $pkgname
  mkdir -p ${DOC}/$pkgname

  CMTS=$(find ${pkg} -type f -name '*.cmt')
  for cmt in ${CMTS}; do
    d=$(dirname $cmt)
    calc_md5_for_file "$cmt"

    f=$(basename $cmt .cmt)
    mkdir -p ${DOC}/$pkgname/$md5
    r=${DOC}/$pkgname/$md5/$f
    for ext in cmt cmti cmd cmdi cmi ; do
        if [ -e $d/$f.$ext ]; then
            $dry cp $d/$f.$ext $r.$ext
        fi
    done
  done
done


Here's my customised version of opamdoc-generate.

#!/usr/bin/env bash
# Generate the package HTML in ~/.opam/<switch>/opamhtml
# from the cmd/cmt archive in ~/.opam/<switch>/opamdoc.

set -e
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
  echo Must be using a custom OPAM switch for this to work.
  exit 1
fi

BASE="$(dirname $(dirname $(ocamlc -where)))"
# Input archive, output HTML dir, and the opamdoc binary to drive
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
HTML=${BASE}/opamhtml
OPAMDOC=${OPAMDOC:-opamdoc}

rm -rf ${HTML}
mkdir -p ${HTML}
cd ${HTML}

# Grab the build dir in reverse mtime order to get the
# ordering of generation "correct".
PKGS=$(ls -1tr $BUILD)

for pkg in ${PKGS}; do
  fs="$(find $DOC/$pkg -type f)"
  if [ "$fs" != "" ]; then
    name=$(echo $pkg | awk -F. '{print $1}')
    echo $pkg $name
    $dry $OPAMDOC -p $name $fs
  fi
done


by Philippe Wang at Tue, 24 Sep 2013

Liveblogging CUFP 2013 (SRG Syslog)

I’m here at the Commercial Users of Functional Programming workshop (at ICFP 2013) with Heidi Howard, David Sheets and Leo White.


Marius Eriksen and Michael Sperber are co-chairing this year. Functional programming is a hot topic these days, and this year’s program reflects the maturing nature of the field. There was a record number of submissions, and we could easily have made this a multi-day program. Videos will be available online after the event.

The keynote is from Dave Thomas on what we can learn from the “language wars” of the past. His talk covers the business, social and technical ends of building Smalltalk, and his “objaholic” experiences with pushing language technology into businesses that never had any intention of changing in this regard. This worked over the years because it made a material difference to these businesses. Dave still uses the K programming language a lot at work.


Dave is responsible for the Eclipse IDE and the IBM Java virtual machines, and has delivered software on everything from mainframes to wristwatches.

There was a rush in the 70s with Symbolics and the amazingly cool hardware, followed by the rush to logic programming that led to his grants disappearing. Midway through the 80s, they discovered we wouldn’t get parallelism for free with all this magic. After that came the era of machine learning and knowledge-ontology crusades, but those are very specialised bits of tech (there remain sectors where ontologies are useful). Nowadays it’s the multicore crusades, and FP is emerging as useful.

Every road leads to some form of FP, because the principles underlying it apply. Dave does vector programming; Haskell does this with array transformations.

Dave was talking to Simon PJ outside and claimed that “FP has won!”, at which point Simon replied “really?” (laughter). OO techniques can also lead to a lot of tangled code: witness the evolution of Smalltalk to the current generation of languages.

FP technical challenges: what happens when you have a lot of functions, and a group of consenting adults are doing it together. This is called software engineering, or at worst, IT!

Lots of good work in the community on approaching large-scale programming. Scala is a bit too complex for Dave (and the audience rumbles in agreement or disagreement?). Java and JNI should just die. Erlang and actors are successful at decoupling software. But do we really need to hide all our queues? Static analysis of queues is desired.

Want to go fast? Computers like rectangles, all the way down! Array operations and vector processing is the way to go in the long-term.

To go to industry, we need end-to-end toolchains. If you use MS Project, then you probably shouldn’t introduce any new tools.

The first commercial customer for Smalltalk in the 1980s was Tektronix, who wanted to embed reusable tech in their oscilloscopes. If you’re building a serious product like this, it needs language interop between C++ (the logic sensor) and the Smalltalk. Respect the language cultures among components (anyone remember S-records?). ROM had to be replaced with Flash because no one can write code that doesn’t need upgrading (groans from the audience).

With the right tools, programmers will be able to go as quickly as in Smalltalk, or at worst Java. We need the work of Simon [PJ] and colleagues on Haskell. Scala is so complex that it will become the Java (which is the Cobol of tomorrow).

When shipping products, you need to track all the versions of language components in the toolchain. Too many FP toolchains are monolithic and make it hard to track components in the field. JVMs are so slow because they don’t have a good sharing model (DLLs).

What Dave would like is version-to-version transformation in languages. Y2K was a problem not because IBM hadn’t fixed it in their OS, but because customers had modified their local installations and were pinned 6-10 versions behind! The same will happen for existing FPs that fragment into multiple implementations: will they descend into multiple-version hell, or will standards emerge from the community?

Also, let’s not leave the data behind. Can we alias data from existing legacy sources into our new languages, and not aim for the perfect representations? Most of our programs are dealing with problems in existing data, not code.

Onto deployment issues. Erlang works very well for the problem domain it’s designed for, but more complex domains exist, like OSGi. Hot code-swap and live updates are often viewed as “boring problems” but are vital to real-world deployments. Space usage must be paid attention to: how long can a given JVM stay up without leaking? Serialization is important to persist data across versions. And if we’re going to use “esoteric control structures” like call/cc, tail calls or tags, the JVM isn’t going to get us there.

These days, if you can’t stay in your local cache, you lose several orders of magnitude in performance... you need directly connected CPU (or SSD!) memory, static initialization, and language interop (eagerness vs laziness). Functional data structures and immutable collections are still immature, and need more work to convince doubting Thomases about their scalability in real applications.

Hardware support for these is important. Intel Haswell has hardware transaction support (hence it can be used by Azul JVMs etc.), but we need more for functional language performance. There have been $2bn put into speeding up the JVM to make it as fast as C++. Google has a technology called Pinnacle which uses software write barriers in VMs that execute directly, with a register model rather than a stack model (which is easy to put together but hard to make fast). The new generation of language VMs will support direct execution and modern functional programming languages.

“We all love types” — Dave on objects, but not variables (“its easy to be cooperative with that definition!”). Pluggable types by Gilad are really cute, because they move us towards a world where particular features (Dave likes behavioural types) can be woven in. Perhaps can be done in a dependently typed world. Another good candidate are approximate types and dimensional types. When all these are put together though, it all needs to fit together with sound compositional semantics.

When using an interesting PL, Dave wants warnings to be emitted when the compiler makes space/time tradeoffs, so that the programmer can decide between elegance and performance (he also mentioned that Haskell programmers do this).

How does language evolution work? Arthur Whitney maintains K, and he ensures that every version of K is incompatible with the previous one. This is a blessing (!) since it avoids a kitchen sink language and ensures a high quality set of libraries. Quite often there’s a tussle between languages and libraries because there’s not enough coevolution. For example, are closures in Java really useful for the tradeoff in the complexity of the VM? Perhaps we should be looking towards a collection of small correct languages instead of a few partly right and largely wrong languages such as Ada or Java (the Java hate is strong here — editor).

Those of us who have programmed in vector languages (going back to APL) know they are largely write-once, read-once languages, because the code is so hard to read. One trick Dave uses with K is to annotate comments inside expressions and use code-folding, without a point-free style, to understand vector codeflow better. Also, seeing the environment and current name bindings is extremely useful when teaching new languages.

The real problem is: comprehensions, folds, monads... WTF? Dave has yet to see anyone from the monad community give a talk that makes monads understandable. It’s typically “you don’t need monads”, and then the word is used multiple times in the talk. If it’s not important, why mention it so often? Quite often the FP experience makes normal programmers feel stupid because they can’t break through this terminology barrier. We as a community need to help people break through the barrier of not being formally trained in mathematics for whatever reason (“category theory for breakfast?”) and *be productive* in FP.

Smalltalk was like this in the early days: people see a blank window and aren’t sure what to do next. Objects irritated people when they came out, but FP scares people. This is a danger as we go through and wow people with what we can do, but make it look like magic. People change their habits at different rates, and we need to understand that. With Haskell, it was odd that he had to get the program “right” before it worked at all, as opposed to merely not working well.

Bill Buxton (UI researcher friend) pointed out that people can’t much get beyond programming a VCR. 4GLs were pretty much their limit. Dave’s done a lot of work on evangelizing Smalltalk, but feels it was still too hard. People just didn’t understand how to separate behavior, as the best practices on how to build objects were foreign to most of them. This all settled on some mix of SQL and some glue code, which doesn’t require learning a lot of new technologies.

Given there are many different types of programmers, do they all really need the full power of FP? Some subset of features is more useful, with one way to do things that doesn’t change from release to release. For example, a simple dictionary keeps changing from HashMap to ConcurrentHashMap to something else, but ultimately it is still a dictionary. How do we separate the reasoning over performance from the reasoning over structure, and not scare people with both at the same time? Knuth started doing this decades ago with literate programming, and we *still* don’t have a decent way of writing documentation and code in the same place.

We are still missing a pure FP book built in the same style and comprehensiveness as SICP. We need things that let you look at the code and represent things more densely (for example, why don’t we use Unicode in the style of APL symbols to increase succinctness?). They are working on J programs that move glyphs around on the iPad as an example of this 4GL density.

We need to take all these ideas and *talk more* about metaphors in FP. Where’s the “Scala, F#, Clojure, the good parts” and a frank exposition about the antipatterns? Don’t lie about where the dragons are, but colour code them appropriately to not surprise programmers.

We need to understand how to design with types. In the vector languages this is called “shape shifting”: they don’t have types, but it’s a similar concept. It’s hard to get the community to talk about shape shifting, and as a result the art is being lost in the new generation of languages. For FP, can the Yodas in the audience help their audiences to “trust the types”?

Q: In the top 20 languages, every language has a free-or-open-source runtime. This will make a huge difference as it lets people experiment.

A: It’s not been a barrier to experimentation for Dave, but it has been a barrier to commercial use. Dave wouldn’t attribute particular significance to open source, as it’s never been a real problem for him (using trial versions and so forth). Open source is a mixed blessing, as it tends to produce impoverished language teams that don’t have enough resources to move forward. The really good virtual machine implementations are commercially polished (c.f. the $2bn behind JVM performance from earlier). Clojure is a great example, where they appealed to the community to donate for a production version. To build a serious production PL, you need a lot of government research money or a major commercial sponsor.

Q: Marius asks about mobile and embedded and why FP hasn’t cracked it.

A: They are often late to the game, and it’s often not easy to target those platforms with a high-performance runtime. You have to be willing to understand the interface.

Q: Phil Wadler says the scariest thing is that if you become a success, then Smalltalk takes the form of C++.

A: (or Java!). Being able to maintain the independence of the language creators versus a major player is crucial to conceptual integrity. (Phil) Tools that migrate between version X and Y, and a sustainable ecosystem. It’s important to ensure that the creators get paid to maintain the language: even the JVM only has 10-20 people in the core team, atop a vast ecosystem.

OCaml at Facebook via the Hack language

Julien Verlaguet is talking about how Facebook is adopting OCaml.

Why PHP is good: fast installation times, lots of libraries, easy to learn, and scales in a parallel way (multiple instances).

However, Facebook doesn’t care about any of these features. What Facebook wants is a fast feedback loop between writing code and having it appear on screen. There’s no good way to write a unit test for an experience, and instead a fast feedback loop between code and a live site is how people still work at Facebook today.

Performance at the scale of Facebook really matters. 1% can be a huge bump, but PHP is really hard to optimize due to all the dynamic features in the language. Aside from runtime performance, the other concern is development time for engineers, who need to avoid security problems and general bugs in an incredibly large codebase.

The first attempt at this was to scale the runtime via HipHop: run PHP under an interpreter in development sandboxes, and compile production code to C++. Unfortunately this led to a huge divergence in performance characteristics between the interpreter and compiled code, which caused confusion in production.

The current attempt is HHVM, a combined model: a JIT runtime that patches code at runtime and merges the two execution models.

HHVM is still a PHP platform, and a new model called Hack provides a *new programming language* that targets HHVM (which is essentially a PHP runtime).

Hack is a statically typed language for HHVM. It’s compatible with PHP and interoperates with no overhead and has the same runtime representation as PHP. It has evolved from PHP, and so if you know PHP, you know Hack. It’s designed for incremental adoption via introducing gradual typing (c.f. keynote talk about not scaring programmers not used to types).

function foo(): void {
  $x = 'hello';
  return $x;
}

This returns a readable error message in the style of a statically typed language, since the inference algorithm keeps a witness about every value’s type and shows counter examples.

The Hack type system must be annotated with class members, function parameters and return types. Everything else is inferred to minimize programmer burden.

Hack types include:

  • Nullable types (mainly because the huge existing code base had no pattern matching, so null propagation is important)
  • Tuples
  • Closures
  • Collections, which support generics
  • Constraints
  • Type aliasing
  • Extensible records

A demo of a simple mailbox follows. HH has a command-line mode that syntax-highlights which parts of the source code are dynamically checked, outputting potentially unsafe code in red. Awesome!

class Mailbox {
  private $data;
  public function send($data) {
    $this->data = $data;
  }
  public function fetch() {
    return $this->data;
  }
}

(very PHP like but now he annotates it with specific types)

class Mailbox {
  private string $data;
  public function send(string $data): void {
    $this->data = $data;
  }
  public function fetch(): string {
    return $this->data;
  }
}

As a programmer, I now want to know if this has all been statically checked (since the previous checking mode showed several red sections which were potentially unsafe). He runs the compiler with a strict mode that rejects the source if there are any unsafe bits.

He then demonstrates how to propagate null types through the code, and then introduces several subtle errors that are all caught.

An audience member asks what happens if you send both an int and a string to a mailbox, and Julien demonstrates a "mixed" type that covers both (is this some sort of row polymorphism? — editor).

The response time of the type checker is close to instantaneous (for the developer loop). Even with a thousand files, the type checker gets back in a second or less with an error message.

To get the tooling right, Hack uses js_of_ocaml to compile the entire OCaml type checker to Javascript and run it in a complete web-based IDE (he demonstrates a full Cloud9-like IDE they use internally at FB). To work at scale, they created a server that runs persistently in the background, type checks all the files and responds from a memory cache.

At scale, autocomplete requires very low latency in the web-based editor, and users also use version control (e.g. switching between branches). The server also has to use a reasonable amount of RAM and have a reasonable initialization time. Fundamentally it has to be stable, as developers have to depend on their tools.

A vast amount of code for the tools is written in OCaml. It's a good choice because it is ideal for symbolic computation, has excellent performance, and can be compiled to Javascript. C interop is really good too. The main challenge is the lack of multicore, although Facebook works around it with their multiprocess architecture.

Architecture: the master process spawns multiple worker processes which use shared memory to avoid having to go through the master. The underlying buffers are lock-free (similar to our Xen shared memory ring! — editor).
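
The shared-memory buffer idea can be sketched as a toy single-producer/single-consumer ring (in-process and illustrative only; the real system shares memory between OCaml processes):

```typescript
// Toy SPSC ring buffer: one producer advances `tail`, one consumer advances
// `head`, and neither touches the other's index, so no lock is needed.
class SpscRing<T> {
  private buf: (T | undefined)[];
  private head = 0; // consumer position
  private tail = 0; // producer position
  constructor(capacity: number) {
    this.buf = new Array(capacity + 1); // one slot left empty to detect "full"
  }
  push(item: T): boolean {
    const next = (this.tail + 1) % this.buf.length;
    if (next === this.head) return false; // full: producer backs off
    this.buf[this.tail] = item;
    this.tail = next; // the single "publish" step
    return true;
  }
  pop(): T | undefined {
    if (this.head === this.tail) return undefined; // empty
    const item = this.buf[this.head];
    this.head = (this.head + 1) % this.buf.length;
    return item;
  }
}
```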

Advice on OCaml at scale:

  • IPC: use lock-free structures, and think about sockets/pipes instead of just TCP
  • GC: keep a small heap; shared memory is compacted by the master
  • OCaml makes you think hard about shared objects, which is a good thing.

Q: is it open source??
A: soon, this is the first time Facebook has talked about it publicly.

OpenX and Erlang ads

What is OpenX: (i) an open-source PHP ad server; (ii) a global company with an ad service featuring a big ad exchange written in Erlang. This talk is about the transition from PHP to Erlang.

1998-2007: it was a conventional PHP ad-management system on MySQL. The company started using it as an EC2-based SaaS ad server, serving up clients successfully.

In 2008, they wanted to mix in an ad exchange, to add liquidity to marketplaces that have empty ads. At the same time, they moved offices to Pasadena and looked at new tech. Switched to Hadoop (Java) and Postgres (for some reason). Poor technical choices were made, such as writing to the database for every request (sometimes multiple times!).

In 2009 as the customers increased, there was a big push to scaling things up for performance (this is when the speaker joined the company). Switched Postgres to Cassandra to move from a relational database to a more NoSQL approach that had the appropriate consistency properties. At this time, the architecture was also decomposed to give them headroom to scale for a while, particularly by adding HTTP load balancers and so on.

They wanted to add even more low latency stuff, such as spot bids for immediate bids on empty slots. Anthony decides that the model was a good fit for Erlang, since one request needed to fan out to multiple backend services with a soft-realtime limit to return a response to the user. It took a few months to prototype an Erlang based solution which worked very well.

After this first experience with Erlang and evangelising it, and pointing it out to his coworkers, they came up with another good problem: analyzing the frequency of how often someone looks at an ad. This required a new database that would hold bespoke data efficiently, and also integrated Riak from Basho to interface with the database to run the query engine. This now holds on the order of billions of keys.

In 2007, they ran into problems with Cassandra due to an incompatible upgrade, which would have required OpenX to shut down their business to upgrade; that was unacceptable. Since they had a good experience with Riak core, they developed a new system to move away from Cassandra and added more Erlang to the management stack.

In 2011, they decided to split their UI layer into a more Web 2.0-like UI API. The API continues to be implemented in PHP (with a Javascript UI clientside), but the Erlang evolution continued with a gateway service and marketplace service that are both pure Erlang. The basic problem is ad selection, and the logic for this is written as a DSL implemented in both Erlang and Java to enable code reuse across the tools.

However, Cassandra was still sticking around, and they wanted to move away from it. They implemented a data service layer to abstract away from the database. From the early days, they made a bad (but fast) choice to have clients directly call Cassandra via Thrift, as it was easy. They moved away to a more abstract model to get away from Thrift, and so could change clients. The new system could call both Cassandra and Riak to help evolve with both services running simultaneously and provide a seamless migration path to the new service (in stark contrast to the Cassandra upgrade story from earlier).
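
The data-service idea can be sketched like this (names are illustrative, not OpenX's API): clients code against an abstract store, and a migrating implementation reads from both backends while writing only to the new one, so data moves over gradually with no downtime:

```typescript
// Abstract store interface: clients never talk to a backend directly.
interface KvStore {
  get(key: string): string | undefined;
  put(key: string, value: string): void;
}

// Stand-in backend for the sketch (the real backends were Cassandra and Riak).
class MemStore implements KvStore {
  private m = new Map<string, string>();
  get(k: string) { return this.m.get(k); }
  put(k: string, v: string) { this.m.set(k, v); }
}

// Reads fall back to the legacy store; writes land only in the new one.
class MigratingStore implements KvStore {
  constructor(private legacy: KvStore, private next: KvStore) {}
  get(k: string) { return this.next.get(k) ?? this.legacy.get(k); }
  put(k: string, v: string) { this.next.put(k, v); }
}
```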

They also decided that the PHP use for the API was a bad idea, and so switched to Python. They engaged Erlang Solutions to write them an API router since their core team was busy (an example of external consultants coming in and being useful — editor). The current system involved RabbitMQ coordination, and Riak’s replication logic to move out to the colos for publication.

Currently they have 14-15 services in Erlang, 7-8 in Java, and a blend of Python/PHP.

How did a PHP shop end up with so much Erlang? Well, they laid off all the PHP developers when they moved offices (lots of audience laughter).

More seriously, the architectural choices are important:

  • Cloud based (generic hardware, automated bootstrap, deployment, package oriented development and fault tolerance).
  • Service based (loosely coupled, single-purpose components, pools of components, polyglot so multiple languages can be used).
  • Cross-language tools: Thrift (RPC between components), protobuf (RTB and Riak) and Lwes (logging and monitoring), which was open sourced by Yahoo.
  • A system called Framework that provides templates for code layout, builds RPMs/debs and integrates with autotools. It enforces versioning and reproducibility across languages.
  • Finally, evangelism is really important. If possible fix the game via architectural choices to set the scene for change. Find a project to showcase the technology, and be responsible for the project’s success. Make sure to share work, and remember that tools are really important to make it easy for others to use.

Really important is figuring out the build toolchain, which can be so arcane. They did spend time on integrating with Erlang (erlrc integrates with the packaging system, erlstart manages lifecycle, erlnode for ?).

Hiring: train people rather than trying to hire knowledgeable people. Operations have unfamiliar systems to manage, and developers have new concepts and patterns to learn.

How did Erlang help OpenX developers? The language fit the service model very well. The let-it-crash philosophy and supervision trees led to fewer real service outages.

OpenX is doing really well as a business, too: 250+ billion monthly transactions, 12+ billion daily bids, thousands of machines in 5 colos, 300 employees and $150m+ revenue in 2012.

Q: Marius: with a service based solution, how do you library them up?
A: Their services aren’t hugely fine grained and so usually a module or two (not a lot of state)

Q: Anil: is Framework usable from other languages? RPM integration seems killer
A: It's still esoteric and the internals aren't documented, but they would welcome a chat!

BAE Systems on Redesigning the Computer for Security

Tom Hawkins speaking on a DARPA-funded project.

SAFE is a codesign of:

  • a new application language (Breeze)
  • a new systems programming language (Tempest)
  • a new OS
  • a new processor

with security at every level, for defense in depth.

Emphasising hardware-enforced security, since dynamic checking has too much overhead in software, for example for fine-grained information flow control. Hardware IFC handles the most general attack model (from scripting attacks down to machine-code injection).

Hardware overview: every word of data is a tuple of (group: 5 bits, tag: 59 bits, payload: 64 bits). The Atomic Group Unit checks atom types (e.g. instructions, data pointers and streams). The Fat Pointer Unit (confusingly, FPU) checks pointer operations, and the Tag Management Unit (TMU) checks tags.
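
A sketch of packing and unpacking such a 128-bit atom word, using BigInt for the arithmetic (the field order here is an assumption for illustration; the talk only gives the widths):

```typescript
// Atom layout (assumed): [ group: 5 bits | tag: 59 bits | payload: 64 bits ]
const GROUP_BITS = 5n, TAG_BITS = 59n, PAYLOAD_BITS = 64n;
const mask = (bits: bigint) => (1n << bits) - 1n;

function packAtom(group: bigint, tag: bigint, payload: bigint): bigint {
  return (group << (TAG_BITS + PAYLOAD_BITS)) | (tag << PAYLOAD_BITS) | payload;
}

function unpackAtom(word: bigint) {
  return {
    group: (word >> (TAG_BITS + PAYLOAD_BITS)) & mask(GROUP_BITS),
    tag: (word >> PAYLOAD_BITS) & mask(TAG_BITS),
    payload: word & mask(PAYLOAD_BITS),
  };
}
```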

The label model can be used for efficient hardware typing, and more recently they are using labels for Control Flow Integrity (e.g. to prevent return oriented attacks). Register file is 32 elements deep, memory is accessed through load/store, and every bit of data in the system is a well-formed atom (and so has tag/groups associated with it).

FPU deals with buffer range checks (e.g. underflow or overflows for a buffer).

The TMU is where the magic is. It combines the tags from the various operations in flight (instructions, memory) and acts as a "database", checking that the operation is consistent with the local policy; the machine can then continue execution or raise a security exception.

On day 1, they had an outline for an ISA but nothing else. The TIARA project was a baseline (Howie Shrobe, Andre DeHon, Tom Knight) and they had no languages, no hardware and no toolchain.

They started by sketching out an assembly language and then building an ISA simulator. They simulated simple assembly programs, and HW researchers started coding in Bluespec (a Haskell-like FP that Lennart keynoted about at CUFP 2011 in Japan).

The PL researchers started designing Breeze. The plan was to "steal" Andrew Myers' work on JIF, and "Breeze should be done in a couple of months"! It didn't quite work out to be that quick (laughter).

SAFE assembly is as low level as expected, but it adds some new properties:

  • frames manage fat pointer bounds
  • Atomic group declarations on data mark critical sections
  • Tags on data can be marked (e.g. bootloader private)
  • Secure closures are code/data with the authority to run the code over the data (they call them gates)

Assembly is tedious, so they started with macros. The Breeze interpreter was now running and there was pressure to start building the compiler. The OS guys wanted a language to build the higher-level components.

Solution: a SAFE assembly DSL embedded in Haskell that uses Haskell as the macro language and becomes the library for the future Breeze compiler.

Breeze Language, version 7: just the datatype for Booleans was hard and took 4-5 weeks ("this IFC stuff is kind of tricky"). The supposed convenience and modularity of lexical authority passing and one principal per module turned out to be extremely inconvenient!

SAFE assembly in Haskell looks like a typical EDSL. A monad captures program descriptions, and macros set up data tags and provide better control flow (such as while loops or if/else conditionals in the assembly).

At Year 1.5, the EDSL is still assembly language, no matter how many macros there are: manual register allocation, calling conventions and data structures. Meanwhile the Breeze compiler was inching off the ground, but there is an awkward transition from the high-level CPS IR to assembly; there needs to be an intermediate form somewhere. However, the SAFE EDSL in Haskell works great in the compiler backend, plugging in seamlessly with the Breeze code generator, which was nice.

The Breeze Language is now on version 12 and continues to raise questions. "What do we do on an access violation?" Simply stopping the machine isn't enough: "What if I maliciously send you data that you can't access?" "Simple, I'll just check the label before I attempt to read it." "But what if the label itself is private?" This was the big "aha" moment in IFC, since private labels end up being a poison pill in IFC.

At Year 2.0, the Breeze compiler is going through a major overhaul since it needs improvements (the middle IR is improved but not enough). They temporarily shelve the compiler once they realize it won't come to the rescue of the OS. They really needed a low-level language that sits above raw assembly.

Breeze language v23 has a solution to poison pills: make all labels public! To label data you must specify the label in advance (in brackets), but public labels are not compatible with lexical authority passing, so they had to relax their original vision a little.

Year 2.5: Tempest is started as the systems PL for SAFE. It is imperative, with register allocation, optimizations, and control over the generated assembly via inlining and user-specified calling conventions. It uses the SAFE EDSL to generate assembly (that was a successful DSL) and nicely fills the Breeze IR gap in the compiler, for when that is resurrected.

Breeze language 34: delayed exceptions with not-a-value values (Nav).
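
A toy illustration of the delayed-exception idea with "not-a-value" (Nav) values, sketched here in TypeScript (this is not the Breeze semantics, just the propagation pattern): operations produce and propagate a Nav instead of throwing at the faulting instruction, so the error surfaces only where the value is actually observed.

```typescript
// A Nav carries the failure reason along instead of raising immediately.
type Nav = { nav: true; reason: string };
type Val = number | Nav;
const isNav = (v: Val): v is Nav => typeof v === "object";

const div = (a: Val, b: Val): Val => {
  if (isNav(a)) return a;                              // propagate, don't throw
  if (isNav(b)) return b;
  if (b === 0) return { nav: true, reason: "div by zero" };
  return a / b;
};

const add = (a: Val, b: Val): Val =>
  isNav(a) ? a : isNav(b) ? b : a + b;
```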

The Tempest EDSL with inline assembly has a SAFE assembly sublanguage. The SAFE flow is now the Breeze, Tempest and SAFE assembly layers, with EDSLs for each gluing the compiler pipeline together. There is a SAFE ISA simulator and debugger, along with a Bluespec simulator to save on FPGA debugging cycles. The TMU is an elaborate piece of hardware (an exotic CAM and cache structure) into which a lot of the effort has gone.

Lessons learnt:

  • Designing a higher-order IFC language is very hard, even with some of the most brilliant PL researchers in the world. The optimal number of researchers is 2-7 rather than any more :) Ben Pierce and Greg Morrisett arguing over very hard problems was a lot of fun in a hard problem space.
  • Day 1 should have started with Tempest, not assembly. It's hard to be productive in assembly code, and Tempest is the right level for runtime/processor codesign. The level of indirection provides insulation from a changing ISA.
  • EDSLs are great for bootstrapping a language and make excellent backend libraries. However, engineers must be comfortable with the host language, and not all the engineers were comfortable with Haskell when they started. EDSLs are hard to debug (line-number information is apparently hard to get out of error messages, versus a concrete syntax).
  • Concrete syntax is more relevant for some languages than others (e.g. Tempest vs SAFE assembly). The best transition point to a concrete syntax is after developer pressure for modular programming. Once the language has modularity, the switch can be made without disruption.
  • Would a DSL have helped hardware design? They were forever debugging the ISS and FPGA instead of the raw Bluespec. A DSL describing the ISA semantics would have kept everything synchronized by generating the Bluespec, the ISS, the SAFE EDSL and Coq documentation.

This has been a huge project spanning a vast number of layers. It has produced a large number of interesting papers, including one at ICFP this week on testing non-interference quickly, using QuickCheck to generate programs that check non-interference properties.

FRP at Netflix by Jafar Husain

Netflix used to have tight coupling between the middle tier and the UI teams. One-size-fits-all messages led to inefficient call patterns.

The plan was to decouple this by giving UI developers the ability to decouple components (or “force!”).

Two developer personas at Netflix: Cloud and UI people. Netflix deals with a third of all US broadband traffic, so they need to think at scale but also to run on really tiny cheap devices that eke out all performance. How do they turn UI developers into effective cloud developers who think at scale?

They gave them UI comforts: Groovy, an OO API and a Reactive API. Most UI developers (or indeed any developers) can't be trusted with locks. This means that reactive programming isn't enough, which leads to: "how do we make parallel programs safe for developers?"

Enter Erik Meijer, the legend himself! He asked "What's the difference between a database query and a mouse drag event?". A database would use functional code and a mouse drag might be imperative. Erik points out that they are both collections.

The new JS closure syntax in ES6 is fantastic and makes these easier to use. If you imagine a query for well-rated movies, you need to map over video lists. He then shows a mouse drag event, and points out that the structure of the code is very similar. Instead of a "filter" operator, he uses "takeUntil", which stops the operation after some predicate has been satisfied.

The iterator pattern is similar to the observer pattern, except for the direction of signalling. Iterators have next/hasNext and can throw, but Observers don't have a similar completion semantic. What happens when we add these to the Observer? We end up with "onNext", "onCompleted" and "onError", so now the Observer has semantics comparable to Iterator, and the only thing that changes is the direction the data flows.

Some platforms also have an Iterable. There is now also an Observable to which you can pass an Observer, in the same vein as passing an iterator to a Java Iterable. We also add a Disposable to both, so that when the IO stream has terminated, it can be stopped immediately without having to junk the rest of the data.

Erik and his team at MSR demonstrated that these patterns are dual, so Iterable and Observable are dual! This became the Reactive Extensions library, a combinator library for the Observable type. It's open source and has been ported to a number of languages beyond C# (including Java).

The Observable monad is a vector version of the continuation monad. It has the null-propagation semantics of the Maybe monad and the error-propagation semantics of the Either monad. It is produced and consumed with side-effects and acts as a fault line between pure and impure code. It can be composed functionally and its cancellation semantics are clean. It can be synchronous or asynchronous, and a common mistake is to assume Observable is always async (either type can be sync, simply by calling the functions immediately).

If you map over an Observable, it looks like this:

(rapid transcript)
var map = (observable, func) => ({
  forEach: observer => {
    var subscription = observable.forEach({
      onNext: item => observer.onNext(func(item)),
      onError: error => observer.onError(error),
      onCompleted: () => observer.onCompleted()
    });
    return subscription;
  }
});

We can use this to come up with a single reactive type for cloud and UI developers! Jafar demonstrates example of social notifications in the middle tier, but using Observable to filter notifications from a composition of the socialService and messageService. Both underlying services are returning observables, and are non-blocking and completely parallel. Both can execute synchronously or asynchronously and decide which knobs to turn for their particular scheduling needs. The programmer doesn’t need to care about the details.

Another example is search autocomplete, which is surprisingly hard. Need to decide how often to send requests (throttle them to be not per-key) and also deal with concurrent responses from keypresses coming in out of order.

var res = keyPresses
    .throttle(/* ms */)
    .map(search => getSearchResults(search))
    .switchLatest();   // (middle combinators paraphrased; rapid transcript)

res.forEach(resultSet => listBox.setItems(resultSet));

This code fragment does all this composably and handles all that logic. It's throttled, ordered correctly, and waits until the keypresses pause (with respect to the throttling policy).
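
The throttling decision described above can be modeled as a tiny pure function (a toy sketch, not the Rx API; names are illustrative): fire a search only once no further keypress arrives within the quiet window.

```typescript
type KeyPress = { time: number; text: string };

// Given timestamped keypresses, return the queries that would actually be
// sent: one whenever the following keypress is at least `quietMs` later
// (or there is no following keypress).
function searchesToSend(presses: KeyPress[], quietMs: number): string[] {
  const out: string[] = [];
  for (let i = 0; i < presses.length; i++) {
    const next = presses[i + 1];
    if (!next || next.time - presses[i].time >= quietMs) {
      out.push(presses[i].text); // quiet period elapsed: fire the query
    }
  }
  return out;
}
```

switchLatest-style logic would additionally drop any response superseded by a later query, handling the out-of-order problem.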

The full stack is a UI tier (a huge amount of mutable state), dropping into a distributed functional style, and heading into the data tier. Recall this is at huge scale.

Wins: Rx is open sourced from MS and ported to Java (Observable) and generally better tooled now.
Evangelism comes up again, since you can't assume the best technical solution will win. Practice public speaking and focus on soft skills! The speaker got better in a short amount of time at distilling the essence of the benefits, and at explaining to the target audience why something matters (e.g. to a UI developer vs a cloud developer, as he explained earlier).

Challenges: training and hiring. Netflix didn't fire all their UI developers as the previous speaker did (audience laughter). Instead, they leveraged their skills at eking out mobile performance, and taught them alternatives: functional programming, vector programming and reactive programming. Be available to help people out (developers work weird hours!) and interactive training exercises really help too. Also help them understand when to apply each technique for a particular programming task.

Some of the Javascript programmers had some intuition for these techniques from things like Promises. It wasn't quite the right answer, as sometimes you need vector versions for performance instead (the vector version degenerates to the scalar one by passing a vector of size 1). Netflix don't bother teaching scalar, only vector.

Although the programmers are used to map/reduce, the embedded version is a lot harder for them to grasp: it's just a programming style, not magic in Hadoop. The hiring process has adapted to spot competent functional programmers.

bind/flatMap/concatMap/mapcat/mapMany are all versions of monadic bind with subtle semantics, so they use interactive training exercises to train developers in their runtime semantics.

Challenges: performance needs some chunking for low-end devices (the vector type acts as a manually chunked scalar monad) and is best applied to less chatty event streams. It worked for search on low-end devices since it takes so damn long to type anything in, giving a less chatty UI stream (audience laughter). "We are screwed if the hardware UIs improve."

Type-unsafe flatMap is easier to understand but needs to be used carefully; it is useful on low-end devices. All the source is available, so give it a try!

Q: How do you do data binding in Rx?
A: No templates, but many other tiers before the view layer are bound together. They introduce an Observable mixing pure and impure code (an Observable type with a value property). However, they learnt to do this differently, since Observable's lazy default is wrong for data binding: they implemented a different type that is strict and publishes the result directly.

Medical Device Automation using message passing concurrency in Scheme

What does a molecular diagnostic device do and what does Scheme have to do with all this?

It detects the presence of specific strands of DNA/RNA in a sample, doing nucleic acid extraction and purification. Samples are excited using lasers as they undergo PCR, which results in spectrographs.
The device contains 19 boards for temperature and motor control, with many parts that move simultaneously at very high precision. (The speaker shows demos of the motors running, playing music from the system!)

The client is written in C# and WPF and is thin/stateless. The server is written in Scheme, with an embedding of Erlang-style OTP. A supervision structure isolates hardware failures, and a DSL provides the glue logic.

Scheme has exceptionally clear and simple syntax; they love the hygienic syntactic abstractions for their DSL, and first-class functions and continuations are key. Arbitrary-precision arithmetic is important for result calculations.

The Erlang embedding in Scheme is how they interop. They embed records that refer to structured data by field name instead of field index, which allows copying a record while changing particular fields.
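
The copy-a-record-changing-some-fields operation is the same shape as, say, TypeScript's object spread (illustrative only; their system does this in Scheme, and the field names here are invented):

```typescript
// Fields are addressed by name, never by positional index.
type Sample = { id: number; temp: number; status: string };

const original: Sample = { id: 1, temp: 37.2, status: "pending" };

// Copy the record, overriding only `status`; other fields carry over by name.
const updated: Sample = { ...original, status: "done" };
```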

There’s no pattern matching built into Scheme and enables matching on all the usual datatypes (strings, lists, vectors and records). It also allows matching the value against a local variable and/or binding values to local variables in the pattern itself, which is new in Scheme (but quite ML-like? — editor)

Processes are one-shot continuations with software timer interrupts. The gen-server pattern provides a functional way of serializing messages and mutating state. The init/handle-call/handle-cast/handle-info and terminate callbacks are all encoded, so it's all there except code-change (the FDA wouldn't appreciate code changing on the fly!)
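
The gen-server shape can be sketched minimally (a synchronous toy, not the Scheme implementation; names are illustrative): a pure handler threads state through each call, and the server mutates its state in exactly one place.

```typescript
// A handler takes a message and the current state, and returns a reply plus
// the next state; it never mutates anything itself.
type Handler<S, M, R> = (msg: M, state: S) => { reply: R; state: S };

class GenServer<S, M, R> {
  constructor(private state: S, private handle: Handler<S, M, R>) {}
  call(msg: M): R {
    const { reply, state } = this.handle(msg, this.state);
    this.state = state; // the single point of mutation, serializing all calls
    return reply;
  }
}

// Usage: a counter server whose handler is a pure function.
const counter = new GenServer<number, "inc" | "get", number>(
  0,
  (msg, n) => (msg === "inc" ? { reply: n + 1, state: n + 1 }
                             : { reply: n, state: n })
);
```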

The event manager is responsible for event logging and subscriptions, and supervisors monitor processes and execute a restart or clean shutdown semantic (with enough logging for a field debug cycle).

How it works in the instrument server: "let it crash" throughout, and a notable lack of defensive programming thanks to the actor model. They manage some big hardware which fails all the time, and this architecture directly addresses that reality.

Each assay is executed via a DSL, which defines the pipeline in a confirm mode to estimate running time, suggest missing resources and warn about a lack of timeslots. Once satisfied this is all taken care of, the queuing is done via a Tetris-style bin-packing heuristic. Each action has its own process, for fault tolerance. The DSL exposes an option to ignore action failures, which is important for test runs and debugging (also during hardware development).

Some examples of device failures are very bad: the Therac-25 was an RTOS-based system that gave people fatal radiation doses, tracked down to bad use of mutable state and integer overflow (taken care of in their system with FP and arbitrary-precision math). Ariane 5 was lost to integer overflow from a lossy conversion, which also doesn't happen in their Scheme system with its non-lossy conversions. And finally, the space shuttle simulator crashed due to memory unsafety (a restart was unclean), which cannot happen in Scheme. The supervision hierarchy in Erlang/Scheme can handle crashes with a stable strategy at any layer.

Why a DSL? The assay developers need to write their own recipes such as “which recipes for this instrument, and in which order?”. The scientists shouldn’t need software engineers to help them work, so the DSL sorts that out. Mechanical engineers need to check reliability and timing/variability on the hardware, which they can do by configuring DSL parameters.

There are a few bits to the DSL: an extraction/purification language (tip loading, pipetting, moving magnets) and the protocol language (aspirating, dispensing, etc) and finally the motor language.

Shows an example of the PCR language, which is a complex recipe involving baking, thermal cycling in loops. The DSL follows the logic flow of the English language version of the recipe very closely (with more brackets).

They wrote a web server that supports HTTP requests, JSON/jQuery, and an s-expression-based HTML code generator. It can render pages with Scheme logic to web browsers. It's used by engineers for debugging and prototyping without having to touch the real UI, and non-engineers use it to map data sources from instruments into JSON (which can be fed into validation tools, for example to send to the FDA for regulation).

Lessons learnt:

  • Message passing is great but not a silver bullet. Gen-servers can deadlock with each other despite being deadlock-free internally. Timeouts in gen-servers and supervisor logging make this possible to catch at a higher level, which is crucial!
  • There's not much on the web about a design pattern for the supervisor structure, so they had to pioneer it (they didn't really want to pioneer it, though)
  • Automated unit test frameworks were also built by hand
  • Hiring remains hard in this unique environment.
  • Existing quality metrics like bug density are problematic, since Scheme is terse and so looks artificially worse than more verbose languages!
  • The FDA applies more scrutiny to this system since it is different from the usual C/C++/C# codebases, and also due to the large number of open-source tools such as SQLite/Emacs/git. Unique to this regulated problem space, of course.

Message-passing concurrency makes reasoning about the semantics of concurrent processes easier, as well as debugging problems via fault isolation. Let-it-crash leads to shorter, less defensive code with fewer branch points and test cases. Strong typing and runtime checks provide cognizant failure instead of silent corruption (crucial in this medical space).

Mixing languages has been nice, since they can pick the good bits of Scheme (macros, arbitrary precision math) and Erlang (OTP). DSLs provide the glue here, and rapid prototyping for non-expert software developers.

Gilt and Scala microengines

Gilt is a clothing group, and they need to customize the website with per-user targeting. The customer-facing stuff is the tip of the iceberg, with a bunch of back-office features that need to be plumbed through. The traffic patterns are all about flash traffic that spikes massively during sales; if they go down at these times they lose lots of money.

Gilt is about six years old, and it was built as a Rails app in the beginning that grew quite large. It had unclear ownership, complex dependencies, lengthy test cycles, and unexpected performance impacts from small changes. Rails encourages you to throw everything into one directory ("thousands of models in one directory"), which is really unscalable as the number of engineers grows. Some bad production issues came out of this problem.

Then they started the transition to microservices, moving a few core critical pieces such as inventory management first. The number of services grew to over 300 in just 4 years (each a small HTTP service), all built using Scala. They developed an architecture to support this sort of growth of microservices.

Cornerstones of infrastructure (goal==cool stuff for customers!)

  • Build system: SBT is a "simple build tool" which then turned into the "Scala build tool" due to its complex mental model. Configuring everything correctly can be tricky if you have a number of dependencies. Gilt wanted to abstract developers from this so they could focus on writing application code. They solved it via a large SBT plugin that takes care of dependency management and provides a standardized configuration for everything.
    A complete build specification is shown, which is very succinct: it just has a combinator for adding dependencies. Behind the scenes, the plugin provides a huge amount of support (testing, RPMs, dependency management, continuous builds, and more)
  • Configuration is handled by a Zookeeper cluster and can be overridden with local files or system properties. It’s mapped into strongly-typed classes in Scala and validated via Hibernate so that configuration must match up with service expectations.
  • Testing is the most challenging part of a microservice architecture. Functional tests are the gold standard, but it's hard for developers to run 300 services on their local laptops. Unit tests can run locally, but there's no mocking service for these (and overmocking results in useless testing that is non-representative). They use the Cake pattern for dependency injection, which enables type-safe and scalable composition; it uses Scala's self-types and multiple trait inheritance. Things like object repositories or configuration providers can be expressed in this way, and the pattern lets them mix in different sources, weaving in real or mocked ones. UI testing is handled via Selenium and is built on ScalaTest. Automated browser testing with reusable components works well, and leverages Scala's type system to reject invalid tests/code/configuration.
  • Continuous Delivery lets them ship 20 to 30 service versions every day using the aforementioned infrastructure.

Some cool stuff: on their website there are badges showing inventory for each item. The inventory badges update in realtime as you look at the page, which really motivates people to buy as they spot inventory dropping. Shoppers get excited, feel like they’re in a Black Friday sale, and are engaged in the shopping activity! Building this was quite easy using an event-driven Play application and Akka actors.

They build on this for “freefall sales”: essentially a Dutch auction, also built on web sockets and Play. It initiates a sale that lasts for a fixed time, with the price dropping regularly. The highlight is a $16000 purse, which leads to audience laughter.

Q: how to deal with production side of microservices such as service discovery or log aggregation?
A: no service discovery as Zookeeper tracks where everything is (which ports something is on). Log aggregation isn’t great at the moment, so they’re working on centralized collection.

Q: what’s the greatest win of FP here? Is type safety really important or something like that?
A: in the context of the company’s development and technical stack, one nice thing about Scala is that it combines the “good parts” of Ruby (concise, expressive) with the type safety and performance they had in Java. That balance between Java and Ruby is a big win in terms of expressivity and reuse of components.

Q: how do you handle versioning of microservices?
A: each service maintains backwards compatibility in its API until all its clients are upgraded. The URI path carries the version number, so service providers bump the version for new clients and keep the old version working. They then migrate all the clients of the service as quickly as is practical.

PalletOpts and infrastructure automation

(I missed a lot of this talk due to a lunch chat spilling over. I’ll fill it in later –editor)

A Clojure DSL that runs shell scripts and handles very complex infrastructure automation.

Q: Yaron: how do you handle error propagation through scripts (the equivalent of set -e in sh)?
A: Each combinator propagates errors and there are different run strategies that handle errors differently depending on needs.

Streaming data at Twitter with Summingbird

Speaker is @sritchie (his slides).

Summingbird is how they do streaming MapReduce at the scale they have at Twitter (scale is “staggering”). It has been open sourced since Sep 3rd and a bunch of internal systems at Twitter run on it. It’s mainly three people: @posco, @sritchie and Ashu, a small team.


  • “Write your logic once”: Twitter has a number of 1000+ host Hadoop clusters and a lot of their data processing is k/v work on Hadoop. Going streaming had to be done incrementally, by letting the same logic run either on Hadoop or on some new streaming service. Programmers experimenting with data shouldn’t have to make systems decisions early on in the process. Twitter’s scale: 200m+ active users, 500M tweets/day, and scale is pervasive.
  • Solve systems problems once so that systems that are scaling stay fixed.
  • Make non-trivial realtime compute as accessible as Scalding.

This leads onto Summingbird: a declarative map/reduce streaming DSL. It’s a realtime platform that runs on Storm, and a batch platform that runs on Hadoop. The datastructures that come out of the DSL can be run on either platform (so you can choose between batch/streaming very late in the process).

One of the big concepts that Summingbird embraces is the idea of “associative plus” operations — the monoid.
The 4 S’s of Summingbird (“you know you’re onto something when they all have something in common!”):


Source is where data comes from; Store is a k/v store that is aggregated into; Sink is a dump of items as they are processed; and Service permits k/v lookups during a job.

Store: what values are allowed? What constraints can we place on a MapReduce job such that any valid value can make progress?

trait Monoid[V] {
  def zero: V
  def plus(l: V, r: V): V
}
“Less scary than a monad!” If you can take any two values and smash them together, you have a monoid — e.g. integers under addition. Sets also form a monoid (union), as do lists (concatenation) and maps (recursively pulling in the monoid on the value type). Others include: HyperLogLog, CMS (count-min sketch), ExponentialMA, BloomFilter, Moments, MinHash, TopK. All you need to do is find a datastructure with a monoid and you can use Summingbird. There is an open source library of these called Algebird (birds birds birds!).
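
Algebird is Scala, but the recursive map monoid described above translates directly; a minimal Haskell sketch (the `Count` wrapper and `mergeCounts` are our own illustration, not Algebird’s API):

```haskell
import qualified Data.Map.Strict as Map

-- Integers form a monoid under addition; maps of counters combine by
-- recursively pulling in the monoid on their values, as described above.
newtype Count = Count Int deriving (Show, Eq)

instance Semigroup Count where
  Count a <> Count b = Count (a + b)

instance Monoid Count where
  mempty = Count 0

mergeCounts :: Ord k => Map.Map k Count -> Map.Map k Count -> Map.Map k Count
mergeCounts = Map.unionWith (<>)

main :: IO ()
main = print (mergeCounts (Map.fromList [("a", Count 1)])
                          (Map.fromList [("a", Count 2), ("b", Count 5)]))
```

Because the value type is itself a monoid, maps of maps (or maps of sketches) compose the same way with no extra wiring.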

It took some convincing (the words like “monoid” still scared people at Twitter, sadly).

Monoids are amazing due to Associativity. If you know that the values you are reducing are associative, then it’s easy to parallelise since they can be built “in any order”.

Even though realtime is important at Twitter, no one wanted to dip in due to the difficulty of getting started. Summingbird can pull in data from an existing static Hadoop cluster and union it with a realtime feed. If you give the logic to Summingbird, it will run it on Hadoop (consuming discrete batches of data) and the client will keep track of how far it has got. The realtime side comes from partial aggregations over fixed time buckets, which are merged together in the client using the monoid so that the result looks like a proper realtime feed.
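
A toy illustration of why this merging is sound (the buckets here are made up): because the combining operation is associative, per-bucket partial aggregates — whether computed in batch or in realtime — merge to the same answer as one sequential fold.

```haskell
import Data.List (foldl')

-- Hypothetical fixed time buckets: some processed in batch on Hadoop,
-- the recent tail processed in realtime.
batchBuckets, realtimeBuckets :: [[Int]]
batchBuckets    = [[1, 2, 3], [4, 5]]
realtimeBuckets = [[6], [7, 8]]

main :: IO ()
main = do
  -- merge partial per-bucket sums in the client...
  let merged     = sum (map sum (batchBuckets ++ realtimeBuckets))
  -- ...versus one big sequential fold over all the data
      sequential = foldl' (+) 0 (concat (batchBuckets ++ realtimeBuckets))
  print (merged, sequential)  -- both give 36
```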

One cool feature: when you visit a tweet, you want the reverse feed of things that have embedded that tweet. The MapReduce graph for this: when you see an impression, find the key of the tweet and emit a tuple of the tweet id and a Map[URL,Long]. Since Maps have a monoid, this can be run in parallel, and it will contain a list of who has viewed it and from where. The Map has a Long since popular tweets can be embedded in millions of websites, so they use a count-min sketch (an approximate data structure) to deal with the scale there. The Summingbird layer, which the speaker shows on stage, filters events, generates k/v pairs and emits events.

Twitter advertising is also built on Summingbird. Various campaigns can be built by building a backend using a Monoid that expresses the needs, and then the majority of the work is on the UI work in the frontend (where it should be — remember, solve systems problems once is part of the vision).

Future plans:

  • Akka, Spark, Tez platforms
  • Pluggable graph optimizations, since various planners do improved filter pushdown
  • Metadata publishing with HCatalog — in OSS, this means better tooling for EC2 and other cloud platforms.
  • And of course, more tutorials

All work happens on Github so contributors are welcome.

Summingbird is working well for the majority of realtime apps at Twitter. It’s all about the Monoid! Data scientists who aren’t familiar with systems can deploy realtime systems, and 90%+ of the code is reused across the realtime and batch systems.

Q: did Hadoop need any changes to cope with interfacing with Summingbird?
A: Scalding (written at Twitter) handles much of the interfacing with Hadoop. Hadoop is really the classical k/v store now. The real problem is figuring out which operations can work on both the batch and realtime platforms, and only admitting those.

Q: Yaron: associativity is one nice thing about Monoid, but commutativity is also important. Are there examples of non-commutative datastructures?
A: good examples. Non-commutativity should be baked into the algebra; this helps with data skew in particular. An important non-commutative application is Twitter itself! When you build the List monoid, the key is (userid, time) and the value is the list of tweets over that timeline, so ordering matters here. It’s not good to get a non-deterministic order when building up these lists in parallel, so that’s a good example of why the distinction between associativity and commutativity is important.
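
A quick illustration of that distinction: list concatenation is associative but not commutative, so parallel reduction may group the work however it likes but must preserve order (e.g. a timeline’s tweets by time).

```haskell
main :: IO ()
main = do
  -- associative: the grouping of the concatenations doesn't matter
  print ((([1, 2] ++ [3]) ++ [4]) == ([1, 2] ++ ([3] ++ [4])))
  -- not commutative: the order of the operands does matter
  print (([1, 2] ++ [3]) == ([3] ++ [1, 2]))
```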

Functional Probabilistic Programming

Avi from Charles River, introducing the Figaro language.

Problem: Suppose you have some information (“suppose Brian ate pizza last night”) and you want to answer some questions based on this (“is Brian a student or a programmer?”) and keep track of the uncertainty in the answers.

Solution: Create a joint probability distribution over the variables, assert the evidence and use probabilistic inference to get the answer back.

One common way to do this is generative models: probabilistic models in which variables are generated in order. Later values can depend on previous variables via a binding. This is a very popular approach with several implementations available over the years.

Developing a new model requires figuring out a representation, an inference algorithm and a learning algorithm. All of these are significant challenges with each of them still being considered paper-worthy.

The dominant paradigm is functional probabilistic programming. An expression describes a computation that produces a value.

let student = true in
let programmer = student in
let pizza = student && programmer in
(student, programmer, pizza)

let student = flip 0.7 in
let programmer = if student then flip 0.2 else flip 0.1 in
let pizza = if student && programmer then flip 0.9 else flip 0.3 in
(student, programmer, pizza)

How are we supposed to understand this kind of program? Sampling semantics: run the program many times, each run producing a sample outcome. In each run, each outcome has some probability of being generated; the program defines a probability distribution over all the outcomes.
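
The sampling semantics can be sketched directly (everything here — `flipP`, the hand-rolled generator, the driver — is our own illustration, not part of any PPL): run the generative program many times and estimate probabilities from frequencies.

```haskell
-- A tiny LCG so the sketch needs no libraries; not a good RNG!
next :: Int -> Int
next s = (1103515245 * s + 12345) `mod` 2147483648

-- A uniform draw in [0,1) threaded through the generator state.
flipP :: Double -> Int -> (Bool, Int)
flipP p s = let s' = next s in (fromIntegral s' / 2147483648 < p, s')

-- One run of the student/programmer/pizza program from above.
runOnce :: Int -> ((Bool, Bool, Bool), Int)
runOnce s0 =
  let (student,    s1) = flipP 0.7 s0
      (programmer, s2) = flipP (if student then 0.2 else 0.1) s1
      (pizza,      s3) = flipP (if student && programmer then 0.9 else 0.3) s2
  in ((student, programmer, pizza), s3)

main :: IO ()
main = do
  let go :: Int -> Int -> [(Bool, Bool, Bool)]
      go 0 _ = []
      go n s = let (v, s') = runOnce s in v : go (n - 1) s'
      samples  = go 10000 42
      pStudent = fromIntegral (length [() | (st, _, _) <- samples, st]) / 10000 :: Double
  -- the frequency should approximate the 0.7 the program denotes
  print (abs (pStudent - 0.7) < 0.05)
```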

This is a powerful approach: we take a Turing-complete language, add probabilistic primitives, and this naturally expresses a wide range of models.

Structured variable elimination, Markov chain Monte Carlo, importance sampling and so forth can all be built using functional probabilistic programming. PPLs aim to “democratize” model building — one shouldn’t need extensive training in ML or AI to build such models, or use them effectively.

IBAL (2000-2007) was the first practical PPL, implemented in OCaml but not embedded in OCaml. It did exact inference using structured variable elimination and later gained intelligent importance sampling. However, it was hard to integrate with applications and data (in IBAL the data had to be hardcoded into the program, which wasn’t practical), and it lacked continuous variables. The speaker built Figaro to work around these limitations.

Figaro (2009-present) is an EDSL in Scala and allows distributions over any data type. It has a highly expressive constraint system that allows it to express non-generative models. There is an extensible library of inference algorithms containing many of the popular probabilistic inference algorithms (e.g. variable elimination).

The goal of Figaro is to implement a PPL as a widely usable library: Scala is the interop language, as Java compatibility is a prime goal. Figaro also provides various good abstractions, such as the OO features that come with Scala (not sure, he skipped this slide rapidly — editor).

Figaro has an Element[T], which is a class of probabilistic models over type T. Atomic elements (Constant, Flip, Uniform, Geometric) are combined with compound elements (If(Flip(0.9), Constant(0.5), ...)).

Under the hood, there is a probability monad to track the state. Constant is the monadic unit, Chain(T,U) implements monadic bind, and Apply(T,U) implements monadic fmap. Most elements in Figaro are implemented using this monad.
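
A minimal sketch of such a probability monad in Haskell (exact distributions as weighted lists; `Dist` and `bernoulli` are hypothetical names, not Figaro’s API), with `pure` playing the role of Constant, `>>=` of Chain, and `fmap` of Apply:

```haskell
newtype Dist a = Dist { runDist :: [(a, Double)] }

instance Functor Dist where
  fmap f (Dist xs) = Dist [(f x, p) | (x, p) <- xs]

instance Applicative Dist where
  pure x = Dist [(x, 1)]  -- the monadic unit (cf. Constant)
  Dist fs <*> Dist xs = Dist [(f x, p * q) | (f, p) <- fs, (x, q) <- xs]

instance Monad Dist where
  -- monadic bind (cf. Chain): weight each downstream outcome
  Dist xs >>= k = Dist [(y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x)]

bernoulli :: Double -> Dist Bool
bernoulli p = Dist [(True, p), (False, 1 - p)]

-- The student/programmer example from earlier, written in the monad:
pizza :: Dist Bool
pizza = do
  s  <- bernoulli 0.7
  pr <- bernoulli (if s then 0.2 else 0.1)
  bernoulli (if s && pr then 0.9 else 0.3)

main :: IO ()
main = print (sum [p | (True, p) <- runDist pizza])  -- ≈ 0.384
```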

Example 1: a probabilistic walk over a graph. In Figaro, we start by defining datastructures for the webpage graph (Edge, Node, Graph) and then define some parameters of the random walk.

Example 2: friends and smokers. Create a person with a smokes variable. Friends are three times as likely to have the same smoking habit as different ones, so this is a constraint variable. This is applied to all pairs of friends by adding constraints to the pairs. It is then instantiated in a particular situation by building a set of people (alice, bob, clara) and running the inference algorithm (VariableElimination) to calculate the probability that Clara smokes.

Example 3: hierarchical reasoning. We observe an object (say a vehicle on a road, or a person) and we want to know what type of object it is. We have some observations about it (its size, color or shape) and we want to know if it’s something we should be interested in. It’s very natural to think of these objects via an inheritance hierarchy. In Figaro, every element has a name and belongs to an element collection. A reference is a sequence of names (such as the vehicle’s size). Starting with an element collection, we can get to the element associated with a reference by going through the sequence of nested element collections. However, there may be uncertainty in the identity of a reference (e.g. you don’t know what vehicle1 is); Figaro resolves the reference to the actual element.

Figaro is free and open source, and v2 is on GitHub (like everything else today, it seems).

Q: what are the limitations of Figaro, such as non-terminating Bernoulli tasks? And what constraints can’t be encoded?
A: anything you can express can be put in constraints, since any code can go in there. It’s very easy to express something that you can’t compute. Figaro is working on helping the programmer stick to what can be computed, such as via rejection sampling or simply rejecting the bad constraint.

Q: similar to Norman Ramsey’s work (such as Church), what’s the relationship?
A: the languages fall into two categories: standalone languages and embedded libraries. Figaro is in the latter category, along with Infer.NET and Factorie. Infer.NET is limited in expressivity and so can be compiled into a factor graph, which makes it very efficient. Similarly, Factorie is low-level and compiles to a factor graph. Lots to discuss here.

Haskell and FPComplete

Company exists to fix Haskell development tools. All the tools are cloud-based and run in a web browser. Content can be browsed anonymously. This talk will not have the Monad word in it!

There exists a feature to create tutorials and embed code within those tutorials.

Latest product (being announced today) is a Haskell IDE that is a web based tool which requires no setup and no download. It has a number of cool features:

  • Errors are displayed immediately, as it compiles as you type
  • Git based so full version control

Company goals: increase Haskell adoption and make it more accessible. Offer commercial grade tools and support and simplify Haskell development. Also to support the Haskell community and leverage the community to improve FP Complete’s own products.

The FP Haskell Center (released in Sep 2013) has:

  • a web front end (no setup, access to help, and integrated with Haskell)
  • a Haskell backend with project management, developer feedback with errors, a build system and a cloud-based execution and deployment platform. The cloud allows for faster product evolution!
  • soon there will be a personal version and a version that will work behind firewalls etc.

It is all implemented in Yesod, a Haskell web framework. There is lots of library support via conduits, and the UI uses Javascript (via Fay). All the heavy lifting is done in the backend rather than in Javascript. Very few issues surfaced using Yesod and no major modifications were required to build their product.

Challenges were usually outside the scope of Haskell, but Haskell was used to fix them.

  • Javascript coding was significant in the frontend
  • Stable set of libraries are important
  • Compiler integration (feedback and errors) must be available as a library to let the IDE spot identifier location problems
  • Uses LXC containers to create isolation containers — ephemeral storage and containers can run on either shared or dedicated systems.
  • The IDE uses isolated git state to separate users from each other. When projects are created, there is a visual representation of projects (each in its own git repository, in S3) and each repo is accessed via Gitlib-2. Interestingly, this offers multiple backends for git access: the Git C library, GitHub, the IDE-local repo in S3, or others in the future.

For building projects, code is precompiled continuously in the backend, so bytecode can be generated quickly and run within the IDE. Bytecode runs in the GHC frontend container and exceptions don’t break the IDE. Keter and Chef provide deployment assistance to get it onto a shared application server or container, and FP Complete offers this deployment as part of their product.

The billing processor provides SOAP APIs; they added SOAP support to Haskell via C++ bindings.

Mobile gaming from Japan

They normally built mobile games using Ruby, PHP, etc., but have switched to FP in recent years for reliability, high performance and productivity. However, they’ve suffered from memory leaks (blaming Haskell’s laziness, to audience laughter), and some data loss and performance problems.

Gree is one of the largest mobile games platforms, with 37 million users and 2000 games as of June 2013. They develop social games and a platform for 3rd party games. They also do social media and advertising and licensing, and also venture capital.

Employees number 2582, with a significant number of engineers (they can’t disclose exact numbers). On the client side, it’s normally Java, ObjC, Javascript and Unity/C#. The server side is usually PHP, MySQL and Flare (a KV store written in C++ developed internally at Gree). They started a Haskell project in Jun 2012 and have 6 Scala and 4 Haskell engineers.

They developed a KVS management system using Haskell, to setup and destroy KVS nodes in response to hardware faults and sudden access spikes. It’s used in a large social game application. Another example is an image storage gateway, which converts WebDAV to memcached/S3 for the SNS photo uploader to store images. It uses Warp and http-conduit to do this conversion.

Some pitfalls they faced with Haskell:

  • Memory leaks via lazy evaluation: the frontend server keeps a list of active thread ids in a TVar for monitoring. When deleting from the thread list, the function didn’t reduce the result to a normal form and so held onto a reference. This led to a serious memory leak that took the service down. It was fixed easily by evaluating to normal form and forcing sequentialisation. They felt it is too easy to mix up TVar/MVar writes with other IO operations (which do reduce to normal form and avoid the leak). There is much confusion over the strict/lazy alternatives for non-IO functions.
  • Race conditions: data put in a queue is very rarely lost. The reason is that the timeout function can only be used safely once with an IO action. (details omitted)
  • Performance degradation: from forking a lot of threads to do monitoring. This was easy to fix, but happens often since Haskell libraries are developed actively and the implications of changes are not always obvious.
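
The first pitfall (the thread-list TVar) can be sketched as follows — all names are hypothetical, but the shape of the fix is the one described: force the new list to normal form before writing it back, so the old list and the `delete` thunk are released.

```haskell
import Control.Concurrent.STM
import Control.DeepSeq (force)
import Data.List (delete)

-- Writing the new list back lazily would retain a chain of unevaluated
-- `delete` thunks (and through them the old lists); ($!) plus `force`
-- evaluates the result to normal form inside the transaction.
removeThread :: TVar [Int] -> Int -> IO ()
removeThread tv tid = atomically $ do
  tids <- readTVar tv
  writeTVar tv $! force (delete tid tids)

main :: IO ()
main = do
  tv <- newTVarIO [1, 2, 3]
  removeThread tv 2
  readTVarIO tv >>= print
```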

How to fix all this? Better testing, to start with. Haskell has a great QuickCheck framework, and they built on HUnit. This lets services be started in the setup (in their example, memcached on a spare port), tests be run against them, and the services be torn down in a timely fashion. They used this to extensively test their KVS system and Flare (written in C++): over 150 system tests and a control server with 5000+ assertions, so this is a big-scale test.
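
The setup/teardown pattern described can be sketched with `bracket` (the service here is a stand-in that just returns a port; in their suite the setup would actually start a memcached on a spare port and the body would be the test assertions):

```haskell
import Control.Exception (bracket)

-- Hypothetical stand-ins for starting and stopping a test service.
startService :: IO Int
startService = do putStrLn "setup: service started"; return 11311

stopService :: Int -> IO ()
stopService _ = putStrLn "teardown: service stopped"

-- bracket guarantees teardown runs even if the test body throws.
main :: IO ()
main = bracket startService stopService $ \port ->
  if port > 1024
    then putStrLn "assertion passed"
    else error "service on a privileged port"
```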

Next up is documentation. A problem report describes the details of a problem and is linked from a bug tracker. The timeline, workaround, impact and details must all be captured, but they are mixed up with a number of other reports, so none of the Haskell people read them. They aggregated and summarised these reports for other Haskell products; despite this, other Haskell programmers still don’t read them (audience laughter). To sort this out, they do automated checking using hlint. They customize hlint to “grep” for their particular pitfalls, and the checks run from Emacs. hlint isn’t quite good enough for a lot of their pitfalls though, such as the timeout issue from earlier, so they have some workarounds to deal with this. The good thing is that hlint is integrated into their text editor. Not all pitfalls can be caught by hlint: high-level design issues will remain problematic.

Education is also important. They have a brown bag (lunch? — editor) FP meeting where they “Make GREE a better place through the power of FP”. They cover both Scala and Haskell topics. They also run an education program for new graduates, and have a Haskell code puzzle series based on Project Euler to help the new graduates.

They conclude that they love FP and develop some key components of their service using FP. But there are many pitfalls, and they are developing tools to help sort them out.

Haskell for distributed systems

Jeff Epstein (formerly a masters student in the Cambridge SRG, so it’s good to see him again!) talks about the new distributed system they are building in Haskell. His company is also hiring!

The HA middleware manages networked clusters, aiming for exabyte-scale clusters. It’s architected by Mathieu Boespflug and Peter Braam (who has spoken about FPGA work at past CUFPs). It manages 10000+ nodes and creates consensus in the cluster about the location of active services. It can handle events due to failures, and recovery time should be approx. 5 seconds.

Zookeeper didn’t scale (audience interruption: why didn’t Zookeeper scale? Jeff says he doesn’t know and they should chat offline; the audience member notes that they maintain Zookeeper and wants to know more).

Events are sent by nodes and aggregated at a proxy node, then enqueued at an HA station. Events are processed by the HA coordinator, which updates a node state graph (which is also replicated — everything is replicated, a key theme in the talk). The HA coordinator is the decision maker which decides what to do on failure. Its main action is updating the node state graph to reflect the new strategy after failure.

Key components are: Haskell, Cloud Haskell and Paxos.

How they used Haskell:

  • Functional data structures result in a purely functional graph, replicated (impurely) between cluster managers. Most of the code feels imperative though, which was important.
  • Easier refactoring, thanks to strong typing, was important.
  • Laziness was yet again a problem causer in IO, due to space leaks in low-level networking code. He notes that distribution is a natural barrier to laziness, since the act of serializing a message forces its evaluation. He also notes that processes must behave the same whether they are on the same host or on different hosts, so even local communication must be forced.

    Phil Wadler interjects to ask “why did you use Haskell? It’s a bad fit!” Jeff replies that “we like Haskell” and introduces Cloud Haskell, the topic of his MPhil thesis. Cloud Haskell is a library for distributed concurrency in Haskell and provides an actor-style message-passing interface, similar to Erlang; it provides a “typed Erlang”. Cloud Haskell is a good fit for this project in terms of interacting communicating components. Refactoring is easy, and this isn’t a “big data” project in the sense of data size, though there is a huge amount of communication messages.

    CH has a pluggable transport layer, with TCP provided but also support for different underlying networks (they built Infiniband). However, CH requires ordered delivery of packets and this was a poor fit: they want out-of-order delivery, which could be built with UDP, but CH requires an ordering for some reason. Space leaks are a real problem in this layer. Cluster bringup is also tough for CH, since nodes need to find each other at start of day, and there are some hacks in there to work around a morning storm of messages.

    (Interjection from audience: why not use Erlang? Jeff: performance wasn’t up to par, and static types are important for them. Possibly the native code / bytecode Erlang might be useful, but the speaker wasn’t sure if it would work.)

    Paxos is all about making systems agree (a well-known proven algorithm for consensus). They implemented it as a general-purpose library on top of CH. The algorithm is a good match for the programming model, with the client, acceptor, proposer, learner and leader all built as CH processes. It ended up being a small, readable implementation (1.5kLOC) that closely matched the pseudocode in the original Paxos paper (which made them feel better about correctness). It’s used for synchronizing the state graph and for the distributed message queue.

    Challenges: Paxos is hard, and untyped messages in CH make it worse. Getting reasonable performance from it also requires the modified variants. Liveness is also tricky: it takes care to avoid overapproximations that fence nodes out of the consensus unnecessarily. Multi-Paxos solves some of this, and they’re working on more specific extensions too.

    Debugging: the CH programming model makes it tractable to debug a distributed application on a single machine. Nevertheless, distributed code is still parallel and involves race conditions and mixed-up messages. They would like a distributed ThreadScope, but instead mainly used oldskool logging.

    They did build a deterministic CH thread scheduler to replace CH’s core message handling primitives, and they also used PROMELA for model checking.

    Q: how much is open source or will be?
    A: none, as far as the speaker knows

    Q: Marius: how are non-deterministic sources composed with the deterministic scheduler?
    A: the deterministic scheduler is for testing only, so timeouts are also deterministic.

S&P Capital IQ: Functional Reporting

Speaker is Edward Kmett, on how they got FP in the door at S&P Capital IQ, their extensions to Haskell, and generally how they’re working on open-sourcing their tech. S&P is huge and has a number of products (S&P Capital IQ Platform, ClariFI, AlphaWorks, Compustat).

Getting FP in the door:

  • Monoids (introduced as in the Summingbird talk, for easy parallelisation)
  • Reducers are another really powerful abstraction, which can take a container full of values and give you back a structure; they have a comonadic structure that lets you (e.g.) pass in many containers full of structures and generally compose things nicely. Reduction can also be done simultaneously across monoidal structures, and essentially “we don’t see the wiring hanging out of our code any more”.
  • Performance attribution: this is a model that breaks down the reasons why you make money (what did you do that was significant). The old and busted model (implemented in Java, full dataset in memory, hard to extend) was horrible; the new hotness is implemented in Scala, runs in constant space, is really fast, and results are flattened elegantly via combinators to a database backend. In other words: better, though some details are glossed over (multipass algorithms, iteratees, caching layers, and handling missing data) — but all this was sorted out and really sold Scala to the bigger company as a real asset.

They developed Ermine, which is Haskell with row types. It’s not quite Haskell, but its main feature is that it’s not Scala either (audience laughter). Writing functional programs in Scala is quite painful (e.g. not blowing the stack with monads requires trampolining everything, and is labour-intensive just to get by day to day). They wrote Ermine in both Scala and Haskell, with a portable Core that can run on the JVM.

Ermine has a strong type system, with novel row types, polymorphic and constraint kinds (“they were just coming into GHC at the time and I wanted to understand them”) and Rank-N types via a derivative of HMF. It fits their domain requirements really well, has built-in database support, can be exposed to end users, has a structured code editor that can only generate type-safe code as you’re writing it, and gives full control over error messages.

Row types are a powerful way of describing the presence of a particular field, modelled via has, lacks and/or subsumes constraints. This is easily checked, but inference now flows unidirectionally through a join.
In Ermine, they have existentials in their constraint types (technical explanation follows, not shown here — editor).

Their reporting layer has one declarative specification for all the relational algebra they want, and pushes it into specific backends (SQL Server, etc.) to ensure queries run as close to the data as possible. They can also create extensive HTML reports that render pretty consistently across different target platforms. Their example Ermine report was written by a programmer who had done no functional programming previously; it required a little hand-holding to work through some of the more obscure type errors, but generally it worked ok.

All the code is being put onto Hackage/Github etc.

Q: has it been rolled out to a lot of users?
A: it’s being used on the main S&P website, on one big client-facing page. The next-gen stuff is using Ermine more widely. The desktop app coming soon embeds Ermine and will be used by a huge number of users. The trend is to use it in an increasing number of their products.

Enterprise scheduling with Haskell at

They have a cloud-based scheduling platform that takes care of complex enterprise scheduling. It has a branded customer experience (like Google Apps does), a flexible workflow and deep integration with backend systems and in-store workflow (e.g. on iPads).

The example is Sephora, who do makeup: their appointment system is integrated via an embedded iFrame that looks like a seamless part of their site, so that end users can’t tell the difference.

When the speaker started, there was a lot of fire-fighting. It was 43k lines of Groovy on Grails, and several major intractable bugs were affecting the business: timezones, double bookings, recurring events not firing and notifications being delayed. Performance was an issue, as interactive things were sometimes taking minutes to load. There was also a lot of inflexibility — hilariously, they sometimes had to ask clients to change their business model to fit with the platform’s model.

After some careful thought, they decided they had to rewrite to cope with growth and provide better service. The author had done Haskell before, but not specifically building a website with it.

Haskell had the perfect model for fixing their scaling problems. At the lowest level, they built a RawDB layer on top of the IO monad that did everything necessary to maintain ACID guarantees during transactions. An example: an effect should only happen once the transaction commits (on their old racy platform, an email could be sent confirming an appointment that was then cancelled). The RawDB layer gives them stronger consistency guarantees, with tracked effects and automatic retries on temporary failure.

On top of RawDB, they built a DB monad which provides a higher level than SQL via algebraic data types, caching and data validation. They then layer a security monad on top of the DB monad to guarantee that all accesses have been checked. Finally there is a business logic layer to enforce local policy and customise per customer (which was a real problem in the past).
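
A sketch of the layering idea (all names here are hypothetical, not their actual API): if the only way to obtain a runnable DB action is through a role check, every access is checked by construction.

```haskell
-- The higher-level DB layer: the only way to run one is via runDB.
newtype DB a = DB { runDB :: IO a }

data Role = Owner | Staff | Customer deriving (Eq, Show)

-- The security layer: a DB action is only handed out after a role check.
newtype Secure a = Secure { runSecure :: Role -> Either String (DB a) }

requireRole :: Role -> DB a -> Secure a
requireRole needed action = Secure $ \actual ->
  if actual == needed then Right action else Left "access denied"

cancelAppointment :: Secure ()
cancelAppointment = requireRole Owner (DB (putStrLn "appointment cancelled"))

main :: IO ()
main = do
  case runSecure cancelAppointment Customer of
    Left err -> putStrLn ("denied: " ++ err)   -- customers can't cancel
    Right _  -> error "should not happen"
  case runSecure cancelAppointment Owner of
    Left _   -> error "should not happen"
    Right db -> runDB db                        -- owners can
```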

Security is complex due to all the customization: they needed roles by client (owner, staff, customer), but clients can also customise verbs that only make sense within a component (appointments can be joined, rescheduled or cancelled, or notifications can be sent). Haskell lets all this fit within a multiparameter type class, instead of OO-style inheritance that would rapidly get out of control. They map customisations from the customer onto a more standard schema that they can manipulate generically on a component-by-component basis, and it’s all statically checked and generally easy to read (important for security-critical code).
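
The multiparameter type class approach can be sketched like this (names hypothetical): the class relates a role to the verbs it may perform on a component, and an unlisted combination simply fails to type-check.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

data Owner = Owner
data Staff = Staff

data Cancel = Cancel
data Reschedule = Reschedule

-- An instance exists only for role/verb pairs that are permitted.
class Allowed role verb where
  perform :: role -> verb -> IO ()

instance Allowed Owner Cancel where
  perform _ _ = putStrLn "cancelled"

instance Allowed Staff Reschedule where
  perform _ _ = putStrLn "rescheduled"

main :: IO ()
main = do
  perform Owner Cancel        -- type-checks: owners may cancel
  perform Staff Reschedule    -- type-checks: staff may reschedule
  -- perform Staff Cancel     -- would not compile: no instance
```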

Haskell “lets them build code for the long run”. Sometimes, though, you need to write code quickly and not care about longevity. This was the case for the importer, which is a quick-and-dirty hack to get data into their system. It has a limited set of inputs, the “end user” is the developer, and it’s thrown away once the job’s done. (He shows an epic type.) Often when people need to do this sort of thing, they resort to bash or dynamically typed systems. In this case, even for this “quick hack”, they couldn’t have achieved what they did “as badly as they did, without the help of the type system”. Generally very positive!

Libraries help them avoid writing code. Hackage is generally lower in package count than other ecosystems, but their *use* of Hackage libraries has been a lot higher than of JavaScript ones: they use 9 JavaScript libraries versus 71 unique Hackage libraries (plus 87 dependencies). It’s a lot harder to find a quality library in dynamically typed land. Only one of their Haskell libraries had to be replaced due to bugs. This is because Hackage is well organized, and it’s easier to vet a library since purely functional code is less risky (fewer side-effects flowing through other code). In contrast, JavaScript libraries can side-effect (e.g. on the DOM) very late in the QA process. So: fewer Haskell libraries, but the proportion of useful ones seems to be higher.

Problems: cyclic module dependencies with hs-boot files are unwieldy, so avoid them. Scalability beyond 4 cores was limited, but they haven’t tested GHC 7.8 (which apparently fixes a lot of this). Idle GC can cause long pauses, so disable it; whether that’s appropriate depends on your workload, but it was fine for their server-based workload.

Debugging production problems without stack traces is really hard, so consider using profiling mode for all libraries.

Bottom line: 8,200 lines of Haskell replaced 40k+ lines of Groovy. No recurring bugs, very good performance, and flexibility for customer needs. Used daily by thousands of people. Check out the website!

Q: what about testing (QuickCheck?)
A: they’ve done some QuickCheck on critical components (such as a map from times to events). Overall, though, just putting Haskell together in the most obvious way led to improvements over their previous platform. Their QA guys are now writing unit tests in Haskell (because the Selenium bindings were harder to use).

Integration of Mathematica with MapReduce

From Paul-Jean Letourneau, a data scientist at Wolfram Research. He started with a show of hands: most of the audience has used Mathematica and had heard of Hadoop.

The speaker has been responsible for a lot of the personal analytics posts on the Wolfram blog (the email analysis, genomics and several others). The theme of all his projects has been working with large data. Mathematica has always kept its data in-core, which is a problem with very large data sets, so this talk is about how they are migrating to a big-data model that can run out-of-core.

Fundamental principles of Mathematica: everything is an expression, expressions are transformed until they stop changing, and transformation rules are in the form of patterns. Expressions are data structures (actually m-expressions in Lisp parlance). (A full explanation of these was demonstrated in Mathematica but is omitted here.)

Expressions in Mathematica are immutable data structures (“program as data”), and symbols can be rebound. Homoiconicity is pervasive, as expressions are the data structure, even internally during evaluation (where they are represented as trees).

“Everything is a one-liner in Mathematica… for a sufficiently long line” (Theo Gray). (Audience laughter.) He demonstrates some cool one-liners that are tweetable (<140 chars) and shows how to construct an image recursively out of itself (wow).

He views it as a gateway drug to declarative programming

y = 0;
For[i = 1, i <= 10, i++, y += i^2]; y

(and then they discover Fold and all the advanced topics such as scoping, evaluation control and the MathLink protocol for IPC).
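In OCaml terms (this blog’s language rather than Mathematica), the same discovery looks like trading the loop for a fold:

```ocaml
(* Sum of squares of 1..10: imperative loop vs. a fold. *)
let sum_imperative =
  let y = ref 0 in
  for i = 1 to 10 do
    y := !y + (i * i)
  done;
  !y

let sum_fold =
  List.fold_left (fun acc i -> acc + i * i) 0
    (List.init 10 (fun i -> i + 1))

let () = Printf.printf "%d = %d\n" sum_imperative sum_fold
```

Both compute 385; the fold version makes the accumulation explicit and needs no mutable state.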

HadoopLink is how they link Mathematica to Hadoop. The example is WordCount: he shows code to run it on a single core first (to test the output), and then an export to HDFS so that Hadoop can see the data. This looks fairly straightforward via the DFSExport command from HadoopLink.

The Mapper looks something like:

    /@ StringSplit [k,RegularExpression[], ...

The reducer is similar: it uses an iterator to sum up each of the inputs. It actually communicates with the JVM to ensure that they don’t have to load all the values into memory. Running the job is just a matter of calling HadoopMapReduceJob with all the relevant expressions, and it goes ahead and submits the job to Hadoop.
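The shape of the job, sketched here in OCaml rather than Mathematica (with invented helper names), is the classic word count:

```ocaml
(* Word count in map/reduce form: the mapper emits (word, 1) pairs,
   and the reducer folds over a key's values as a sequence, so it
   never needs the whole value list in memory at once. *)
let mapper line =
  String.split_on_char ' ' line
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> (w, 1))

let reducer _key counts = Seq.fold_left ( + ) 0 counts

let () =
  let pairs = mapper "the cat and the hat" in
  let values_for key =
    pairs
    |> List.filter (fun (w, _) -> w = key)
    |> List.map snd
    |> List.to_seq
  in
  Printf.printf "the -> %d\n" (reducer "the" (values_for "the"))
```

In a real Hadoop job the framework, not the driver code, would group the pairs by key before calling the reducer.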

He used all this to build a genome search engine. He shows how to prep the data using mitochondrial genome data. Given that the mapper only knows about a single base in the genome, its position, and the query sequence, how can it possibly align the query sequence with the genome? The algorithm scales with the size of the query, not the size of the dataset. Hadoop handles the full dataset, so he could focus on getting the local algorithm right in Mathematica rather than on the bigger problem.

Challenges: when this was scaled up to search the entire human genome, his Hadoop cluster blew up. Nodes started dying, Nagios alerts, the works! After drilling down on the cryptic errors, the cause was “garbage collector overhead exceeded”, which was hard to interpret. Memory consumption on each worker node is quite high when running both Mathematica and the Hadoop slave, and the relevant setting is made on a cluster-wide basis, so it is problematic to tweak for a single job. What he’d like to do is add job-specific options to control things like heap size.

It’s on GitHub and open source, so go for it.


Ashish Agarwal is organizing BoFs! On Monday at 6pm there will be a bioinformatics session with Jeff Hammerbacher, with the location announced on the CUFP website.

On Tuesday there will be a BoF about organizing local events for FP. In the same way that politics is local, this BoF will help you figure out strategies for local events.

Yaron Minsky announces that 17 organizations have sponsored CUFP this year, which helps students attend. The industrial attention is really amazing to see, as FP moves into the mainstream. There is an industrial reception at 18:30 on Monday, downstairs.

Our thanks to Marius Eriksen and Mike Sperber for an amazing program!

by Anil Madhavapeddy at Mon, 23 Sep 2013

OPAM 1.1 beta available, with pretty colours (Anil Madhavapeddy)

Thomas just announced the availability of the OPAM 1.1 beta release. This has been a huge amount of work for him and Louis, so I’m excited to see this release! Aside from general stability, the main highlights for me are:

  • A switch to the CC0 public-domain-like license for the repository, and LGPL2+linking exception for OPAM itself. The cutover to the new license was the first non-gratuitous use of GitHub’s fancy issue lists I’ve seen, too! As part of this, we’re also beginning a transition over to hosting it at, to underline our commitment to maintaining it as an OCaml community resource.

  • Much-improved support for package pinning and updates. This is the feature that makes OPAM work well with MirageOS, since we often need to do development work on a low-level library (such as a device driver) and recompile all the reverse dependencies.

  • Support for post-installation messages (e.g. to display licensing information or configuration hints) and better support for the external library management issues I explained in an earlier post about OCamlot testing.

  • Better library structuring to let tools like Opam2web work with the package metadata. For instance, my group’s OCaml Labs has a comprehensive list of the software packages that we work on generated directly from an OPAM remote.

  • A growing set of administration tools (via the opam-admin binary) that run health checks and compute statistics over package repositories. For example, here’s the result of running opam-admin stats over the latest package repository to show various growth curves.

Number of unique contributors to the main OPAM package repository.

Total number of unique packages (including multiple versions of the same package).

Total packages with multiple versions coalesced so you can see new package growth.

Is it a hockey stick graph? Only time will tell! See Thomas’ full release announcement and let us know how you get along with this new release…

Thu, 19 Sep 2013

OCaml Monthly Meeting – Live Blog (Heidi Howard)

Today’s OCaml Labs Monthly Meeting is all about practice talks for OCaml 2013, so in that spirit I’ll be practising a bit of live-blogging too.

13:53 – Today’s SRG Meeting is over and it’s time for some work before the OCaml Labs meeting at 4:00, see you then…

16:02 Technical difficulties delayed the start

16:02 Intro from Anil

Anil introduces Gabriel Scherer, who is visiting us this week, and announces that we are going to the Maypole after this meeting. We won a cash prize from ASPLOS after winning the HiPEAC paper award, and the money will go towards SRG wine for the Xmas party. The Signpost paper was accepted to FOCI, and a HotNets paper on Trevi was also just accepted.

OCL Website – Too much manual management at the moment; moving to an OCaml Planet feed of blog posts. David has been busy hacking on opam2web. OPAM has 512 packages, and opam2web takes a subset of the OPAM packages and turns the metadata into a mini-site, like on OPAM. It doesn’t require manual updates, like an Atom feed.

Upcoming events – Tomorrow is the 2nd compiler hacking event, at Makespace. Anil will be talking at QCon on Mirage; the Mirage 1.0 release date is October 22nd, so maybe a workshop before then. We have 3 talks for OCaml 2013 (Platform, OCamlOT and Ctypes), so here we go…

16:09 Anil practice talk on OCaml Platform 1.0

Languages take many different approaches to a platform, but what does “platform” even mean? As a late mover in this field, we can learn from other languages. A platform is NOT a group of temporarily motivated hackers building a replacement standard library. It’s hard to adopt a particular approach without a domain-specific purpose; there are too many opinions. We need an objective way to determine what belongs in the platform, and a genuine community that is sustainable (even if a large party leaves). A platform is a bundle of tools that interoperate, with quantitative metrics to judge success, built-in agility, and support for developers throughout the whole development life cycle. Industrial partners have a range of needs, as each works in a different domain.

Tooling – Overview of 5 areas: OPAM from OCamlPro, IDE Tools, OPAM-DOC, OCaml compiler itself and

OPAM – 1.1 released today (maybe), over 100 contributors to OPAM,  500+ packages, 1500+ unique versions, external dependency solver using CUDF

IDE Support – OCaml has many intermediate files. From OCaml 4.00 onwards, we have a binary format for the abstract syntax tree with type annotations, called cmt (and cmti for interface files); we can now create external tools that query this, like opam-doc. ocp-index and ocp-indent from OCamlPro, and Merlin (I think this is EPIC), are also now available.

opam-doc – Now that we have cmt files, we need unified documentation across packages. This is much harder than it sounds, as it touches every part of the tool stack: not all packages can be installed at once due to conflicts, and module inclusion is tough to encode in static HTML. (Need to make a demo.) bindoc takes the typed AST (in cmt) and generates cmd files, which include the ocamldoc comments; opamdoc takes the cmt/cmd database for OPAM and outputs a single website with your universe of packages. Demo shown; feedback is welcome, says Amir.

Now that we have the tools, what metrics can we extract to see how well our tools are doing?

Portability – Windows compatibility?

Maintainer – is there a place for docs? Will people respond to issues/comments/emails? Where can issues be submitted?

Tests – code coverage, multi-variant benchmarking in core-bench

Stability – OPAM supports pinning; how stable are the interfaces of libraries?

OPAM tracks compiler constraints and statically analyses the build system from logs (OCamlOT)

Agility – Building a platform is EXHAUSTING. We want to ask “WHAT IF” questions: what if let was monomorphic? What if we removed camlp4? What if the syntax precedences changed?

Distributed workflow – built on git, distributing tasks between 3 actors: authors (library writers), OCamlOT workers and maintainers. As packages become more stable they move from staging to stable to inclusion in the platform.

We are building a tussle: we want to launch a game in January and let people put standard libraries into the ring, running OCamlOT to discover the winner.

No clear winner: Lwt – portability; Batteries – free of syntax extensions; Core – comprehensive.

16:36  Discussion over the battle of the standard libraries and talk feedback

C: talk is a bit long, not sure what to cut..

C: OPAM was discussed at last year’s OCaml workshop; we want to update everyone and follow on without overlapping too much

Q: Haven’t we already decided on Jane Street’s Core?

A: No, we use all of them; e.g. Mirage uses Lwt extensively

Q: What if we don’t want any of the new standard libraries? Maybe I just want to use domain-specific libraries from OPAM as and when I need them

A: We are not forcing the new standard libraries on anyone, but they are useful for beginners; it’s nice to have a consistent style, interoperability and few open statements, e.g. open Core.Std

Q: What if I have already decided which standard library I want to use?

A: Again, we are not forcing standard libraries on anyone; we are just trying to focus effort more directly. OCaml tools will always be standard-library agnostic

C: the diagram of OCamlOT is confusing

C: think about how not to overlap with David’s talk

16:41 David’s talk on OCamlOT

State of the open source OCaml community

Outline: what is quality software? What is the user experience? What is the feedback loop for package authors? How do we represent the thing underneath this all? A utopian future?

Quality: works on every core (ANIL: we want multi-core :P ). Consistent results: work, or die nicely with an obvious solution; no more “What have I forgotten?” questions, it should just tell you. We need addictive actions (not sure what they are), consistency, quality functions…

Universal concerns: compiler hypotheses (“what if” questions, as Anil already said), build system hypotheses (“what strange assumptions is the build system making?”), package manager hypotheses and environment hypotheses

Workflow: make a pull request, the curator observes the proposal, predicts the future, proposes amendments, a feedback loop runs, and finally agreement is reached. Core is released weekly, for example; we are trying to work like Linux kernel patches

New workflow: promote the health of the OCaml community, preaching compatibility, “observe, orient, decide and act”. A computer-assisted curator will help a human run the loop faster; the human can pose questions to the computer-assisted curator, e.g. “will this run on ARM?”

Repository Observation: a GitHub binding with web hooks, but we are not tied to GitHub. We merge into the world, and we need dependencies from each possible user’s perspective of the world

Dependency Orientation: capabilities with environmental dependencies, packages with constraint-based dependencies, repositories with revision dependencies and artifact dependencies. Example of the Android repo

Triage Decisions: taking plain-text errors and parsing them into categories such as unsatisfiability (you can’t have these two packages together), dependencies (if my dependency is broken, then I am broken), transient (network down), system, metadata, external dependencies (you forgot to declare a dependency), build errors, and combinations of the above.
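A toy version of that triage step, with entirely invented patterns and categories, might look like:

```ocaml
(* Classify a raw build log into a triage category by substring
   matching. Patterns and category names are illustrative only. *)
let contains haystack needle =
  let lh = String.length haystack and ln = String.length needle in
  let rec go i =
    i + ln <= lh && (String.sub haystack i ln = needle || go (i + 1))
  in
  go 0

let triage log =
  if contains log "unsatisfiable" then `Unsatisfiable
  else if contains log "Connection timed out" then `Transient
  else if contains log "No such file" then `External_dep
  else `Build_error

let () =
  match triage "curl: Connection timed out" with
  | `Transient -> print_endline "transient: retry the build"
  | _ -> print_endline "needs a human"
```

The real system orders categories from minor to severe so that a combination of failures is reported by its most actionable cause.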

State Action: commit intention, build, error analysis and build results

Internet REST: the agents negotiate over a REST API on HTTPS; independent metadata layers (not sure about this); everything is an s-exp; a branch-consistent store was explained, like git or Irminsule

Current state: GitHub web hooks; we are conservative, so a one-byte change means we rebuild everything; basic triage heuristics completed; no amendments are proposed by the system at the moment; we commit not the outcome but the evidence; simple reactions to results; a website with green and red boxes in one large table

History: we have found lots of metadata issues, many package bugs, and some tool bugs, like a non-relocatable compiler and ocamlbuild ignoring PATH. We currently have 30+ x86-64 and 30+ x86-32 machines, 8 ARMs, many Linux distros, a dead Raspberry Pi, panicking *nix filesystems, and lots of people who have set warnings as errors

Future: opamfu for DAG analysis; schema migration overhead; lower overhead for administering exotic workers contributed to OCamlOT; we need to authenticate machines using ocaml-sodium; we need more advanced automation and proposed amendments (let’s have a dialogue); better website integration; benchmarking your upgrades (how much an improvement costs); running experiments on the whole OPAM universe with differential analysis; and a VM-based test system, to specify the worker finely.

That covered: what I think quality is, a vision of the future, how it’s represented underneath, and what’s next.


C: that was 20 mins; feedback to David regarding content to be cut

17:23 Ctypes by Jeremy 

This is an update, not a practice talk.

An example of binding puts from C, showing how we can write no C at all and do the linking in OCaml.
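The canonical first ctypes example binds C’s puts by describing its C type in OCaml; roughly (assuming the ctypes and ctypes-foreign OPAM packages are installed):

```ocaml
(* Bind int puts(const char *) from libc without writing any C:
   describe the C type with ctypes combinators and let Foreign
   handle the dynamic linking. *)
open Ctypes
open Foreign

let puts = foreign "puts" (string @-> returning int)

let () =
  (* puts returns a non-negative value on success *)
  assert (puts "hello from ctypes" >= 0)
```

The `string @-> returning int` value is an ordinary OCaml value describing the C function’s type, which is what makes the rest of the features below (pretty-printing, stub generation) possible.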

NEW things in Ctypes:
prettyprinting – for C types and C values, making it much easier to examine values for debugging

Bigarray – support for lumps of C memory

More types – nullable strings, complex numbers

String conversions – much faster

Memory management issues – ctypes now gives the programmer more control over the lifetime of OCaml values passed to C

finalisers – which you can attach to memory


stub generation – instead of binding dynamically, it will generate stub code to call the API

capability-style memory safety – one rogue pointer in a C library can cause hell; loading each C library in a separate address space means a library can only kill itself, and you can then even run a C library on a foreign host or in a virtual machine

static struct/union layout – checking the layout of structures and unions against the API

17:40 Amir demo of (it looks great :) )


by Heidi-ann at Tue, 17 Sep 2013

Inaugural compiler hackers meeting (Compiler Hacking)


The first OCaml Labs compiler hacking session brought together around twenty people from OCaml Labs, the wider Computer Lab, and various companies around Cambridge for an enjoyable few hours learning about and improving the OCaml compiler toolchain, fuelled by pizza and home-made ice cream (thanks, Philippe!).

We benefited from the presence of a few experienced compiler hackers, but for most of us it was the first attempt to modify the OCaml compiler internals.

The first surprise of the day was the discovery that work on the list of projects was underway before we even arrived! Keen collaborators from The Internet had apparently spotted our triaged bug reports and submitted patches to Mantis.

Standard library and runtime

There was an exciting moment early on when it emerged that two teams had been working independently on the same issue! When Jon Ludlam and Euan Harris submitted a patch to add a get_extension function to the Filename module they found that they had been pipped to the post by Mike McClurg. There's still the judging stage to go, though, as the patches wait on Mantis for official pronouncement from the Inria team.

Vincent Bernardoff also spent some time improving the standard library, fleshing out the interface for translating between OCaml and C error codes, starting from a patch by Goswin von Brederlow.

Stephen Dolan looked at a long-standing issue with names exported by the OCaml runtime that can clash with other libraries, and submitted a patch which hides the sole remaining offender for the runtime library. As he noted in the comments, there are still a couple of hundred global names without the caml_ prefix in the otherlibs section of the standard library.


There was a little flurry of work on new command-line options for the standard toolchain.

A Mantis issue submitted by Gabriel Scherer suggests adding options to stop the compiler at certain stages, to better support tools such as OCamlfind and to make it easier to debug the compiler itself. The Ludlam / Harris team looked at this, and submitted a patch which provoked further thoughts from Gabriel.

Vincent looked at extending ocamldep with support for suffixes other than .ml and .mli. Since the issue was originally submitted, ocamldep has acquired -ml-synonym and -mli-synonym options that serve this purpose, so Vincent looked at supporting other suffixes in the compiler, and submitted a patch as a new issue.

The OCaml top level has a simple feature for setting up the environment — when it starts up it looks for the file .ocamlinit, and executes its contents. It's sometimes useful to skip this stage and run the top level in a vanilla environment, so David Sheets submitted a patch that adds a -no-init option, due for inclusion in the next release.


Error handling issues saw a good deal of activity. Raphaël Proust submitted a patch to improve the reporting of error-enabled warnings; David investigated handling out-of-range integer literals and return-code checking of C functions in the runtime, leading to some discussions on Mantis. Stephen submitted a patch to improve the diagnostics for misuses of virtual. Gabriel Kerneis and Wojciech looked at some typos in ocamlbuild error messages, and Mike opened an issue to clarify the appropriate use of the compiler-libs package.


The open operation on modules can make it difficult for readers of a program to see where particular names are introduced, so its use is sometimes discouraged. The basic feature of making names available without a module prefix is rather useful, though, so various new features (including local opens, warnings for shadowing, and explicit shadowing) have been introduced to tame its power. Stephen looked at adding a further feature, making it possible to open modules under a particular signature, so that open M : S will introduce only those names in M that are specified with S. There's an initial prototype already, and we're looking forward to seeing the final results.
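While waiting for that feature, a similar effect is available today by opening a signature-constrained alias of the module. A small sketch (the module M and signature S are made-up examples, not code from the session):

```ocaml
(* Approximating "open M : S": constrain an alias of M to S and open
   the alias, so only the names listed in S come into scope. *)
module M = struct
  let wanted = 42
  let internal = "not meant for clients"
end

module type S = sig
  val wanted : int
end

module M_view = (M : S)
open M_view

let () =
  (* [wanted] is in scope; [internal] is not (only M.internal works). *)
  Printf.printf "wanted = %d\n" wanted
```

The proposed `open M : S` would simply remove the need for the intermediate `M_view` binding.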

The second language feature of the evening was support for infix operators (such as the list constructor ::) for user-defined types, a feature that is definitely not in any way motivated by envy of Haskell. Mike’s prototype implementation is available, and there’s an additional patch that brings it closer to completion.

Next session

The next session is planned for 6pm on Wednesday 18th September 2013 at Makespace, Cambridge. If you're planning to come along it'd be helpful if you could add yourself to the Doodle Poll. Hope to see you there!

by (OCaml Labs) at Tue, 17 Sep 2013

Camlpdf, the first good command-line PDF tool I've found (Anil Madhavapeddy)

The fine folks at O’Reilly have been proof-reading the Real World OCaml book that I’ve been working on. My heart leapt with joy when the copyeditor commented that she thought it was very well written, but my joy was short-lived when it turned out that all her comments were safely ensconced as PDF comments. A few hundred comments that required a response from us, too.

“No problem! MacOS X Preview can handle these!” was of course my first response, but it turns out it’s totally broken with PDF notes. The first note that you select will appear as the content of all the other subsequent notes. Yaron Minsky then experimented with Adobe Acrobat, which I’ve sworn never to install again after an unfortunate incident involving the uninstaller a couple of years ago. That turned out to be incredibly slow. I tried a few open-source tools such as Skim which, while otherwise an excellent bit of software, couldn’t render these particular annotations.

I literally jumped off my seat upon discovering cpdf.

Meanwhile, John Whitington just announced the release of the Coherent PDF command-line tools. Since these are all written in OCaml (and have been developed over quite a few years now), he also sent in an OPAM pull request to add it to the database. And most conveniently, this ended up solving my little PDF conundrum in less than an hour of hacking, and has almost cured me of my lifelong fear of dealing with anything in a PDF-like format. Here’s what I did:

  • I installed the tools via opam install cpdf. This installed the library but not the binary (swiftly fixed).

  • Reading the license told me that it’s for non-commercial use only, so I bought a license from the Coherent PDF website (a bargain price, given how much it does!).

  • I ran cpdf -list-annotations over the PDF, and it dumped out all the comments as a text file to stdout. This wasn’t quite enough for me, since I needed to match the annotation to a page number. But since John has released it as open-source, I forked the repository and patched the support directly into the command-line tools, and sent a pull request back over to John. Since it’s under a non-standard license, I decided to place my patch in the public domain to make it easier for him to accept it if he chooses.

  • My co-authors can just run opam pin cpdf git:// to pin their local copy of CPDF to my forked branch in their own OPAM installations, and easily use my copy until John gets a chance to integrate my changes properly upstream.

Total time including this blog post: 40 minutes. Now, on to writing the author responses to the comments for Real World OCaml. I’m so happy to have cpdf as a simple, hackable PDF utility, as it does things like page combining and rotation that have always been a little flaky in other tools for me. It’s the Pandoc of PDFs!

Sun, 15 Sep 2013

OCaml Development in Vim (Heidi Howard)

This is a quick run-through of how I set up my development environment in vim:

Install pathogen.vim

mkdir -p ~/.vim/autoload ~/.vim/bundle; \
curl -Sso ~/.vim/autoload/pathogen.vim \

Add the following to ~/.vimrc:

execute pathogen#infect()
syntax on
filetype plugin indent on

Install Syntastic

cd ~/.vim/bundle
git clone

Then quit vim and use :Helptags to check that the installs so far have worked.

Install Merlin

opam switch 4.01.0dev+trunk
opam update
opam upgrade
opam install merlin

Add the following to ~/.vimrc

:set rtp+=~/.opam/4.01.0dev+trunk/share/ocamlmerlin/vim
:set rtp+=~/.opam/4.01.0dev+trunk/share/ocamlmerlin/vimbufsync
let g:syntastic_ocaml_checkers=['merlin']

:SyntasticInfo will return a list of syntax checkers available to Syntastic; check that this now includes merlin

Install OCP Indent

opam install ocp-indent

Add the following to ~/.vimrc

autocmd FileType ocaml source /home/heidi-ann/.opam/4.01.0dev+trunk/share/typerex/ocp-indent/ocp-indent.vim

by Heidi-ann at Mon, 09 Sep 2013

OCamlot--exploring the edges of OPAM packages (Anil Madhavapeddy)

The new release of OCaml (4.01.0) was just announced! The runup to a major release like this is normally a frantic time, testing that your favourite applications don’t break unexpectedly due to some incompatible language feature. This release cycle has been a little different though, as we’ve been working hard on using the OPAM package database to build an online regression-testing infrastructure to mechanize much of this process.

I wanted to share some of what OCamlot does today, some results from about 3 months’ worth of runs that may help OCaml package maintainers, and finally where we’re going with future developments. This work has been done in collaboration with David Sheets (who built the OCamlot daemon) and the OPAM team ably led by Thomas Gazagnaire at OCamlPro. We’ve also had a lot of help from Jeremie Dimino and Yury Sulsky from Jane Street, and Dave Scott and Jon Ludlam from Citrix, acting as guinea pigs with their regular releases of Core and Xen/XAPI to OPAM.

Towards a truly portable OCaml

The upstream OCaml toolchain is built on very UNIX-like principles, with a number of command-line tools that form a build pipeline. This process usually ends with linking the intermediate object files with a runtime library that provides the garbage collector and other intrinsic OS functions.

Given these raw compiler tools, it’s very easy to compile OCaml into all sorts of weird and wonderful architectures. We’ve seen it run on 8-bit PICs, several efficient Javascript backends (originally ocamljs and more recently js_of_ocaml), and of course my own Mirage Xen unikernel.

While the compiler tools themselves are quite portable and work on almost any UNIX-like system, the build system scaffolding around third-party packages is less portable. Some features such as C bindings often contribute to build breakage on some less-used operating systems such as FreeBSD or OpenBSD, as they usually require probing for header file locations or adding custom CFLAGS before building.

Every Internet of Things starts with a tangled pile of ARM Dreamplugs.

And in our server room, venerable Sparc and PowerPC G4 OpenBSD boxen still live.

Finding older machines is getting tough, but here's Dave's old iMac G5 running Linux.

Most OCaml developers use x86-based machines, and so foreign architectures get less day-to-day testing (OCaml has superb support for fast native code compilation on ARM, PowerPC and Sparc32, and we’re working on MIPS64 here as part of the CHERI project).

We want to make sure that OCaml and its package ecosystem works just as well in the embedded ecosystem as well as it does on vanilla x86 Linux. This includes running on my Linux iMac G5, my FreeBSD Raspberry Pi, my OpenBSD Pandaboard, or even on a bytecode-only architecture like an OpenBSD/Sparc64.

In the longer term, this paves the way to reliable cross-compiled packages for Windows, Android and iPhone (all of which have OCaml ports, but aren’t heavily tested with the full package database). The only practical way to get started on this is by building an automated test infrastructure for OCaml that explores the feature matrix (which eerily for me, happened in the early days of Xen too, via XenRT to stabilize the hypervisor).

Why not Jenkins or Travis?

When we first started hacking on the OPAM testing infrastructure earlier this year, I maintained a local Jenkins installation that monitored the repository. While Jenkins is a superb tool for many continuous integration tasks, it fell over badly when we tried to use it on non-x86 and non-Linux (or Windows) operating systems. Jenkins requires a full Java runtime stack to be available on each of the client machines, which was taking up more time to get up and running than a simple OCaml-based client and server that could compile to both portable bytecode and fast native code.

The other difficulty with OPAM is selecting which packages actually need to be tested, as it has a constraint-based package solver that supports expressive forwards and backwards version restrictions. While basic tests of the latest packages worked with Jenkins, we needed to increasingly customize it to automate interfacing directly with the OPAM libraries and calculating test schedules based on incoming change requests.

Another factor that ruled out depending on hosted services such as Travis is that they tend to support x86-only architectures, whereas we really want to test the full spectrum of CPU variants supported by OCaml. This doesn’t mean that there’s no place for Travis of course, and in fact Mike Lin has already made this work with OPAM.

For our full testing needs though, OCamlot was born: an OCaml client and server system which coordinates different versions of the compiler, architectures and OPAM versions and records the results for triage and fixing issues.

Running OCamlot

The latest alpha release of OCamlot is pretty straightforward to run locally, if you are so inclined. First start a server process:

$ git clone git://
$ cd ocamlot
$ ./
$ oasis setup
# (edit lib/ if you need to change ports)
$ make
$ ./_build/lib/ocamlot_cmd.native --help
$ ./_build/lib/ocamlot_cmd.native serve

The server listens on localhost only by default, and normally an SSL-to-TCP proxy is deployed to listen for external connections (I use stud, which is fast and easy to configure).

The OCamlot clients require a local compilation of OCaml, and they autodetect their local CPU architecture and operating system. I’ve put a gist script together that automates this on most Linux and BSD variants. Just customize the top variables (set MAKE to gmake under BSD) and set the hostname to your server process.

The results repository and auto-triage

Once the client is running, the server dispatches tasks from its test matrix, which is calculated from the OPAM package repository. The server maintains a results repository, which is a Git filesystem database that records the build results and logs via an s-expression per task. It also uses Cohttp to serve up the results for a web browser.

It’s very convenient using Git repositories like this since we can use GitHub (or any other Git host) to coordinate and record results, and restart the server without needing any local state. So convenient, in fact, that Thomas and I have been working on a more formal engine for this called Irminsule (more on that in a later post, though).

It’s almost unheard of for a full OCamlot run to go by without some errors, and so David put together the ocamlot triage command. This takes the state repository and runs a set of regular expressions over it to classify the failures into common errors. The full file is here, but an excerpt should give you an idea of what we look for:

	(* ordered minor to severe *)
	type analysis =
	  | Solver of solver_error option
	  | Dep of string * analysis
	  | Transient of transient_error
	  | System of system_error
	  | Meta of meta_error
	  | Ext_dep of ext_dep_error
	  | Build of build_error
	  | Multiple of analysis list
	    with sexp

The errors are ordered by severity to aid in color highlighting. They start with OPAM solver failures and dependency failures (e.g. due to trying to build a package that requires a specific OCaml version that isn’t available), and move onto missing package dependencies or system libraries.
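To illustrate the idea, here is a toy sketch of such a classifier. This is not OCamlot’s actual triage code: the patterns, the simplified category type, and the severity ordering below are all illustrative, and a real tool would use proper regular expressions rather than naive substring search.

```ocaml
(* Toy sketch of log triage: scan a build log for known patterns and
   keep the most severe matching category. All patterns illustrative. *)
type analysis = Unknown | Solver | Transient | Ext_dep | Build

let severity = function
  | Unknown -> 0 | Solver -> 1 | Transient -> 2 | Ext_dep -> 3 | Build -> 4

(* naive substring search; a real triage tool would use regexes *)
let contains log pat =
  let n = String.length pat in
  let rec go i =
    i + n <= String.length log && (String.sub log i n = pat || go (i + 1))
  in
  go 0

let classify log =
  [ "No solution", Solver;
    "Connection timed out", Transient;
    "command not found", Ext_dep;
    "Error: ", Build ]
  |> List.fold_left
       (fun acc (pat, cls) ->
          if contains log pat && severity cls > severity acc then cls else acc)
       Unknown
```

The fold keeps the worst category seen, mirroring the "ordered minor to severe" comment in the real type above.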

Testing the run up to OCaml 4.01

Of course, running all these tests is useless without taking action on the results. I’ve been keeping track of them in issue #1029. The nice thing about GitHub issues is that when this bug is referenced in commits (even in other repositories) a cross-reference shows up on that webpage and lets everything be tracked nicely.

So what were the most common failures in the runup to 4.01, and what should you avoid when writing your own code?

Different standard library module signatures

There have been a few changes to some of the functor signatures in the standard library, such as adding a find function to Set (mantis #5864). A third-party library that tries to match the functor signature will fail to compile with a type error, such as this one below for zipperposition.0.2:

Error: The implementation src/
       does not match the interface src/ptset.cmi:
       In module Big:
       The field `find' is required but not provided

The relevant code in zipperposition makes it clear what the problem is:

	(* Big-endian Patricia trees *)
	module Big : sig
	  include Set.S with type elt = int
	  val intersect : t -> t -> bool

This particular bug was reported upstream, and the fix requires implementing the find function for the Patricia-tree-based Set. Since the OPAM package will always be broken on OCaml 4.01, it was marked with a compiler version constraint to prevent it being selected for installation under that compiler. When a new version with the fix is uploaded to OPAM, it will always be selected in preference to this broken one.
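As a sketch of what such a fix can involve (this is not zipperposition’s actual patch): when the set is ordered by the polymorphic compare, the only element that can compare equal to x is x itself, so the new find can be phrased as a membership test.

```ocaml
(* Sketch, not zipperposition's actual fix: with polymorphic [compare],
   4.01's new [find : elt -> t -> elt] reduces to a membership test,
   returning the queried element or raising Not_found. *)
module IntSet = Set.Make (struct type t = int let compare = compare end)

let find x s =
  if IntSet.mem x s then x else raise Not_found
```

A Patricia-tree implementation would instead walk the tree directly, but the observable behaviour is the same.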

One other 4.01 change that temporarily broke most of the bigger community libraries such as Core, Lwt and Batteries was the addition of the close-on-exec flag to the Unix module. This change only affects upstream packages that redefine the Unix module for their own purposes (such as adding an asynchronous I/O monad, as Lwt does), hence it affects the standard library replacement packages.

The fix here was to locally add patches into the relevant OPAM packages to immediately unbreak things when the fix went into the 4.01 branch of the compiler, and to notify upstream maintainers to release new versions of their projects. There’s a subtle problem here: when a patch such as this goes into an unreleased branch of the compiler (such as 4.01.0dev), it’s hard to reliably detect whether the user has the very latest version of the compiler or not. If you do have problems like this in the future, try recompiling via opam switch reinstall <version> to get the latest branch.

It’s very useful to be able to drop bleeding-edge compiler tools into the OPAM repository using compiler constraints like this. For an example, see Alain Frisch’s ppx_tools, which requires the very latest 4.02dev trunk release to compile his new extension-points feature.
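For reference, such a constraint lives in the package’s opam metadata; in the OPAM 1.x format this was expressed with the ocaml-version field (the version string below is illustrative, not ppx_tools’ actual metadata):

```
ocaml-version: [ >= "4.02.0dev" ]
```

The solver then simply refuses to select the package on any older compiler switch.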

Multiple object definitions

OCaml 4.01 also restricts multiple method definitions with the same name in the same object. This leaves only inheritance as the way to override method names, but some packages such as OCamlnet and Mlorg had minor uses of the old mechanism.

You can see this by using opam switch:

$ opam switch 4.00.1
$ eval `opam config env`
$ ocaml
# object method x = 1 method x = 2 end;;
- : < x : int > = <obj>
$ opam switch 4.01.0
$ eval `opam config env`
$ ocaml
# object method x = 1 method x = 2 end;;
Error: The method `x' has multiple definitions in this object

New warnings, and the dreaded warnings-as-errors

After a decade of being deprecated, the (or) and (&) operators finally had a warning turned on by default.

$ ocaml
# true or false;;
Warning 3: deprecated feature: operator (or); you should use (||) instead
- : bool = true

This wouldn’t normally be so bad, except that a surprising number of released packages also turn warnings into fatal errors (by using the -w @ flags explained in the manual). Warnings-as-errors is extremely useful when developing code but is rather harmful in released code, since a future compiler can choose to emit new warnings that aren’t necessarily fatal bugs.

Packages that failed like this include ocamlgraph, spotlib, quickcheck, OPA, Lablgtk-extras and many more. Please do make an effort to not leave this option turned on in your packages, or else it makes life more difficult for testing your code on bleeding edge versions of the compiler in the future.

It’s worth noting here that OCaml 4.01 has introduced a fair number of new and very useful warnings across a number of areas, mainly to do with detecting unexpected ambiguity or shadowing of values. I’ll cover more on these in a future post about the new 4.01 goodies.

External system dependencies

While there are many packages in OPAM that are pure OCaml, there are also a substantial number that require other system tools to be installed. The Lablgtk GUI library obviously requires the C gtk library to be installed.

Determining if these libraries are installed on a particular OS is well beyond the scope of OPAM, as there are almost as many package managers as there are operating systems. However, it’s important for automated testing and user-friendly error messages to have some notion of detecting if the environment is ready for the OCaml package or not.

We’re solving this by using a depexts field in OPAM that consists of a set of tags that identify OS-specific packages that need to be present. A separate script can query these tags from OPAM and do the OS-specific tests or installation.

For example, here’s the sqlite3-ocaml OPAM description:

opam-version: "1"
maintainer: ""
build: [
  ["ocaml" "" "-configure"]
  ["ocaml" "" "-build"]
  ["ocaml" "" "-install"]
]
remove: [ ["ocamlfind" "remove" "sqlite3"] ]
depends: ["ocamlfind"]
depexts: [
  [ ["debian"  ] [ "libsqlite3-dev"]    ]
  [ ["freebsd" ] [ "database/sqlite3"]  ]
  [ ["openbsd" ] [ "database/sqlite3"]  ]
]

The depexts field here lists the APT package for Debian, and the ports tree locations for FreeBSD and OpenBSD. It could also list more specialised tags for particular versions of an OS. You can query this from OPAM as follows:

$ opam install -e debian sqlite3-ocaml
$ opam install -e openbsd sqlite3-ocaml

OCamlot therefore needs to query the depexts field from the package and run the right apt or pkg_add commands. I’ll write about this in more detail when it’s fully baked, as we’ve modified the semantics of the tag querying between OPAM 1.0 and OPAM 1.1 to make it easier to use in OCamlot.
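A hypothetical sketch of what an OCamlot worker could do with the queried tags: map the OS tag to a local package-manager invocation. The function name, tag strings and commands here are illustrative, not OCamlot’s actual code.

```ocaml
(* Hypothetical sketch: turn an OS tag plus depext package names into
   the shell command a worker would run. Names are illustrative. *)
let install_command ~os pkgs =
  let args = String.concat " " pkgs in
  match os with
  | "debian" | "ubuntu"   -> "apt-get install -y " ^ args
  | "freebsd" | "openbsd" -> "pkg_add " ^ args
  | other -> failwith ("no package-manager mapping for " ^ other)
```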

Portable shell scripts

Once we’ve gotten past the hurdle of the compiler version causing failures, there is the small matter of testing non-Linux operating systems, as well as non-x86 CPU architectures. The #1029 overview lists many of these failures under the Portability section.

Damien Doligez made some excellent points about how to write portable Makefiles that work across both GNU and BSD make. This is why the carefully crafted OCaml Makefiles do not require GNU make to be installed when compiling on FreeBSD or OpenBSD (MacOS X gave up the fight a long time ago and installs GNU make as its default make command).

OPAM tries to help BSD users out by providing a make macro in opam files that is substituted with either "make" (by default) or "gmake" (on BSD). While this works for the toplevel invocation of the Makefile, it fails if the Makefile recursively invokes further targets by calling the make command directly instead of using the $(MAKE) variable. Patching these sorts of things is easy but tedious: see the patchfile for the Facile constraint programming library for an example.
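The pattern looks like this (a minimal illustrative Makefile fragment, not taken from Facile itself):

```make
# Portable recursion: $(MAKE) expands to whichever make was invoked,
# so a BSD system running gmake keeps using gmake in sub-builds.
all:
	cd lib && $(MAKE) all

# Anti-pattern: hardcoding the command breaks when OPAM invokes gmake.
bad:
	cd lib && make all
```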

The real problem here, of course, is that package maintainers cannot reasonably be expected to test their code on systems that they don’t normally use; if we demanded perfect portability for the main OPAM repository, we wouldn’t get any submissions!

OCamlot automates this nicely though, by finding lots of portability bugs automatically, and maintainers are by-and-large very responsive when we report the problem upstream.

The emerging distributed workflow

The big drawback to OCamlot in its current form is the amount of triage effort it puts on the OPAM maintainers. The package database has now exceeded 500 packages in just a few short months, and has over 1500 unique versions that all need build testing and more accurate constraints. The wider community has been really keen to participate in helping with triage (just look at all the other people that leapt in on bug #1029), so it’s our immediate priority to make OCamlot more transparent for people who want to use it to improve their own packages, and in the future also to use it to test various hypotheses about all the available open-source OCaml code (see Jacques’ experiment with monomorphic let as an example of something that can benefit from wider automated compilation).

I’ll talk more about how we’re solving this in my upcoming OCaml 2013 Workshop talk about the Platform. I don’t want to spoil it too much, but it involves a lovely distributed Git workflow, an enhanced opam2web, and a brand new metadata overlay system for OPAM that lets us enhance the package database with extra information such as statistics, portability and test results, without polluting the main Git repository with all this extra non-essential data.

If you’re really curious to know right now, then you can see the outline of the new system at Amir’s new wireframes blog post, where Part III contains the continuous integration workflow. A lot of infrastructure work has gone into building all of this over the summer, and now it’s all starting to be deployed in a very satisfying way…

Sun, 08 Sep 2013

OMD: a Markdown parser in OCaml (Philippe Wang)


There are so many existing implementations of Markdown, so why yet another one? Well, after asking the OCaml community about existing ways to parse and manipulate Markdown documents, it seemed that there was no stand-alone, complete Markdown parser written in OCaml: the existing ones were either incomplete (i.e., not fully compatible with Markdown) or interfaces to Markdown parsers implemented in languages other than OCaml.

Since at OCaml Labs we're working a lot with Github, and Github uses Markdown a lot (Github web page hosting, Github issues, etc.), and other sites use Markdown as well, and Markdown is very popular and easy to learn, and flexible in the sense that you can always fall back to HTML when you want to express something that doesn't have a special syntax in Markdown, blah blah blah, it was - somehow - time to have a Markdown parser implemented in OCaml. And preferably in OCaml alone, meaning that anyone who has OCaml installed should be able to use it easily without having to deal with too many dependencies. Well, here it is: OMD!

Issues... were essentially with Markdown itself

Every computer scientist knows how to write a parser for a language that is as simple as Markdown. Well, no, not every computer scientist, sadly, but at least every programmer should. Ok, sadly it isn't really the case either. Anyways.

Markdown is a rather simple language, if you look at the specs. Well, that depends on what you actually call "specs", of course. According to Wikipedia, Markdown was invented by John Gruber, with Aaron Swartz helping him. The original document that describes the Markdown language is available online there: And you can see that searching for "Markdown+syntax" on Google makes that page the top result (and here again, I'm kind of helping it be and stay on the top...).

Actually, Markdown is not that simple to parse. Why? Because so many things are just left to the imagination of the people who implement parsers for this language, because programs can't decide by themselves what's right to do. It's really the author(s) of the program that decides what the program does. And in Markdown, so many things are ambiguous and Gruber's document doesn't actually describe a grammar at all. It just tells you how to write this and that, but if you write it slightly differently, it doesn't tell you what the outcome would be and by just reading the document, you really can't tell what the output will be.

The one thing to have in mind is that there are no errors in Markdown. Every text file is a valid Markdown file, it might just be converted to some HTML you wouldn't expect. For instance, you may write a link using this syntax:

[blah blah blah](the-url-which-can-be-absolute-or-relative)

and it gets converted to

<p><a href="the-url-which-can-be-absolute-or-relative">blah blah blah</a></p>

But if you forget the closing parenthesis, then it becomes this instead:

<p>[blah blah blah](the-url-which-can-be-absolute-or-relative</p>

precisely because nothing is wrong in Markdown.
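This "every input is valid" behaviour can be illustrated with a toy function (not OMD's actual implementation; the helper name is made up): try to recognise a [text](url) pattern, and on any failure fall back to emitting the input verbatim.

```ocaml
(* Toy illustration of Markdown's "no errors" rule: if the whole input
   is a well-formed [text](url), render a link; otherwise, emit it
   verbatim. Not OMD's real code. *)
let render_inline s =
  let is_link =
    String.length s > 0 && s.[0] = '['
    && (match String.index_opt s ']' with
        | Some i ->
          i + 1 < String.length s && s.[i + 1] = '('
          && s.[String.length s - 1] = ')'
        | None -> false)
  in
  if is_link then
    let i = String.index s ']' in
    let text = String.sub s 1 (i - 1) in
    let url = String.sub s (i + 2) (String.length s - i - 3) in
    Printf.sprintf "<a href=\"%s\">%s</a>" url text
  else s  (* nothing is an error in Markdown: fall back to plain text *)
```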

And what if there are parentheses in your URL? What if they are unbalanced?

Flagrant Ambiguity

The following is some text that has to mean something in Markdown...

* Alice is falling in love with Bob.
    * Bob is secretly in love with Alice, but he's seeing Eve.
  * Eve is in-between, she loves both Alice and Bob, she can't help it.

So Pandoc, which is a tool that converts a document in language A to a document in language B, where A and B can be the same or different languages amongst LaTeX, HTML, Markdown and (many) others, considers that Eve is on the same level as Bob. So its HTML output is

<li>Alice is falling in love with Bob.
<li>Bob is secretly in love with Alice, but he's seeing Eve.</li>
<li>Eve is in-between, she loves both Alice and Bob, she can't help it.</li>

But if instead you add Dylan as in

* Alice is falling in love with Bob.
   * Dylan, a new character, is being conceived here.
    * Bob is secretly in love with Alice, but he's seeing Eve.
  * Eve is in-between, she loves both Alice and Bob, she can't help it.

then of course Eve is not on the same level as Bob anymore: she goes with Dylan and Alice, and Bob is on his own.

<li>Alice is falling in love with Bob.</li>
<li>Dylan, a new character, is being conceived here.
<li>Bob is secretly in love with Alice, but he's seeing Eve.</li>
<li>Eve is in-between, she loves both Alice and Bob, she can't help it.</li>

This doesn't make much sense...

And Github's embedded Markdown-to-HTML converter chooses similarly broken semantics. If one writes bullets on different levels, it shouldn't be meaningless.

Also, on Github, if you write

* 1
  * 2
    * 3
      * 4
        * 5
          * 6
            * 7
              * 8
                * 9
                  * 10

Then 2 and 3 are on the same level, as are 4 and 5, 6 and 7, and 8 and 9. 1 and 10 are each on their own. And if you extract 2 and 3, meaning

  * 2
    * 3

then 2 and 3 are not on the same level anymore! See for yourself:

With OMD, hopefully, there is a deterministic meaning for each level of indentation. The list with Alice, Bob and Eve is converted to the least insane semantics I could think of.

The idea is that, since Eve is neither on the same level as Bob nor as Alice, Eve should be in a new list (because, obviously, she's the only one on that level anyway). So she is in a new list. And since she's not on a deeper level than Bob, she shouldn't be in a sub-list of his. But she is on a deeper level than Alice, so she has to be in a sub-list of hers. So, here is the HTML that OMD produces:

 <li>Alice is falling in love with Bob.
   <li>Bob is secretly in love with Alice, but he&apos;s seeing Eve.
   <li>Eve is in-between, she loves both Alice and Bob, she can&apos;t help it.

Oh, you might have noticed that OMD converts apostrophes to &apos;, because otherwise I would need to differentiate the cases where they have to be converted from those where it's optional.


Pandoc's documentation says

In contrast to most existing tools for converting markdown to HTML, which use regex substitutions, Pandoc has a modular design: it consists of a set of readers, which parse text in a given format and produce a native representation of the document, and a set of writers, which convert this native representation into a target format. Thus, adding an input or output format requires only adding a reader or writer.

Come on, most tools are using regular-expression substitutions?! I can only imagine the nightmare that it must be to implement and debug such an implementation -- no wait, I can't, because I just don't want to imagine such a nightmare.

I used functions and function calls. A lot of them are tail recursive; not all of them, but then that means those don't need to be. These functions basically take a list of tokens and return a new list with possibly fewer tokens, together with the expression that the consumed tokens were converted into.
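The shape of these functions can be sketched like this (the token and element types below are hypothetical, not OMD's real ones):

```ocaml
(* Hypothetical shape of the parsing functions described above: each
   consumes tokens from the front of the list and returns the element
   it built together with the leftover tokens (or None on no match). *)
type token = Star | Word of string | Newline
type elt = Emph of string | Text of string

let parse_emph = function
  | Star :: Word w :: Star :: rest -> Some (Emph w, rest)
  | _ -> None
```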

So far, in version 0.4 (which is not released yet at the time I write this), there's a little less than 8k lines of pure OCaml code. (Note that I didn't write "pure functional", I wrote "pure OCaml".)

OMD is an open-source, free and libre software library that any OCaml developer can use (hopefully quite easily, since it doesn't depend on anything other than the standard OCaml compiler and library). And of course, it's also a tool for anyone who writes Markdown and wants it converted (quickly) to HTML. OMD, so far, is about 10 times faster than Pandoc, and I haven't even made any effort to make it fast.


OMD has been developed using OCaml 4.00.1, and Christophe Troestler made me make it compatible with OCaml 3.12.1. I guess it might work with older versions of OCaml, but it's not certain (mainly because OCaml's standard library has slightly changed; I don't think I use any language features that were introduced in 3.12 or 4.0).

By the way, thank you Christophe for your support, interest and contributions to OMD :-)

Future of OMD

OMD is already in OPAM. A very stable version of OMD should be released soon. As a tool, it takes Markdown as input and produces HTML as output. A Markdown backend has been freshly implemented, so you can output Markdown as well, which is quite useful for debugging or if you want to know how many iterations you need before you reach the fix point. You can also output "text", in the sense that it's basically the HTML without the HTML tags, so it's very unformatted text. There are also options to output an HTML table of contents and parametrise its depth.
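The fix-point check mentioned above can be sketched as follows; parse and print stand in for a Markdown reader and the new Markdown backend, and both names are hypothetical rather than OMD's real API.

```ocaml
(* Sketch: count how many parse-then-print rounds it takes before the
   Markdown output stops changing. [parse]/[print] are placeholders. *)
let iterations_to_fixpoint ~parse ~print doc =
  let rec go n doc =
    let doc' = print (parse doc) in
    if doc' = doc then n else go (n + 1) doc'
  in
  go 0 doc
```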

OMD and

We are at OCaml Labs making a new website for The design is being provided by onespacemedia. At the time I'm writing these lines, I'm using the HTML/CSS/JS for the upcoming to style my new Github-hosted website, so I can play with it more.

Most pages will be written in Markdown instead of HTML, so that people of the OCaml community may contribute to it in a more convenient way.

And of course, that means that OMD will be used for

by Philippe Wang at Thu, 05 Sep 2013

ICFP, CUFP & OCaml2013 (Heidi Howard)

I’m busy planning my first trip across the Atlantic to attend ICFP, CUFP and OCaml 2013. Today, I’ve been given the duty of “live blogging” the event, over at the syslog, the Cambridge Systems Research Group blog.

My other job for the event is to improve the documentation for Jane Street's Async library. If anyone else is keen, I would love to organise a doc-a-thon to populate the .mli files.

by Heidi-ann at Wed, 28 Aug 2013

Introducing vchan (MirageOS)

In Theory

Unless you are familiar with Xen's source code, there is little chance that you've ever heard of the vchan library or protocol. Documentation about it is very scarce: a description can be found in vchan's public header file, which I quote here for convenience:

Originally borrowed from the Qubes OS Project, this code (i.e. libvchan) has been substantially rewritten [...] This is a library for inter-domain communication. A standard Xen ring buffer is used, with a datagram-based interface built on top. The grant reference and event channels are shared in XenStore under a user-specified path.

This protocol uses shared memory for inter-domain communication, i.e. between two VMs residing on the same Xen host, and uses Xen's mechanisms -- more specifically, ring buffers and event channels -- in order to achieve its aims. "Datagram-based interface" simply means that the interface resembles UDP, although there is support for stream-based communication (like TCP) as well.

Over the last two months or so, I worked on a pure OCaml implementation of this library, meaning that Mirage-based unikernels can now take full advantage of vchan to communicate with neighboring VMs! If your endpoint -- a Linux VM or another unikernel -- is on the same host, it is much faster and more efficient to use vchan rather than the network stack (unfortunately, it is currently incompatible with existing programs written against the socket library under UNIX or the Flow module of Mirage, although this will improve). It also provides a higher level of security compared to network sockets, as messages will never leave the host's shared memory.

Building the vchan echo domain

Provided that you have a Xen-enabled machine, do the following from dom0:

opam install mirari mirage-xen mirage vchan

This will install the library and its dependencies. mirari is necessary to build the echo unikernel:

git clone git://
cd test
mirari configure --xen --no-install
mirari build --xen
sudo mirari run --xen

This will boot a vchan echo domain for dom0, with connection parameters stored in xenstore at /local/domain/<domid>/data/vchan, where <domid> is the domain id of the vchan echo domain. The echo domain is simply a unikernel hosting a vchan server that accepts connections from dom0 and echoes everything that is sent to it.

The command xl list will give you the domain id of the echo server.

Building the vchan CLI from Xen's sources

You can try it using a vchan client that can be found in Xen's sources at tools/libvchan: Just type make in this directory. It will compile the executable vchan-node2 that you can use to connect to our freshly created echo domain:

./vchan-node2 client <domid>/local/domain/<domid>/data/vchan

If everything goes well, what you type in there will be echoed.

You can obtain the full API documentation for ocaml-vchan by doing a cd ocaml-vchan && make doc. If you are doing network programming under UNIX, vchan's interface will not surprise you. If you are already using vchan for a C project, you will see that the OCaml API is nearly identical to what you are used to.

Please let us know if you use or plan to use this library in any way! If you need tremendous speed or more security, this might fit your needs.

by Vincent Bernardoff ( at Fri, 23 Aug 2013

Real World OCaml beta3 release (Heidi Howard)

Beta3 of RWO is now available: and Anil (one of the co-authors) comments on the release.

Cover of RWO

by Heidi-ann at Mon, 19 Aug 2013

Mirage travels to OSCON'13: a trip report (MirageOS)

Now that Mirage OS is rapidly converging on a Developer Preview Release 1, we took it for a first public outing at OSCON'13, the O'Reilly Open Source Conference. OSCON is in its 15th year now, and is a meeting place for developers, business people and investors. It was a great opportunity to show MirageOS off to some of the movers and shakers in the OSS world.

Partly because MirageOS is about synthesising extremely specialised guest kernels from high-level code, and partly because both Anil and I are constitutionally incapable of taking the easy way out, we self-hosted the slide deck on Mirage: after some last-minute hacking -- on content not Mirage I should add! -- we built a self-contained microkernel of the talk.

This was what you might call a "full stack" presentation: the custom microkernel (flawlessly!) ran a type-safe network device driver, OCaml TCP/IP stack supporting an OCaml HTTP framework that served slides rendered using reveal.js. The slide deck, including the turbo-boosted screencast of the slide deck compilation, is hosted as another MirageOS virtual machine at We hope to add more slide decks there soon, including resurrecting the tutorial! The source code for all this is in the mirage-decks GitHub repo.

The Talk

The talk went down pretty well -- given we were in a graveyard slot on Friday after many people had left, attendance was fairly high (around 30-40), and the feedback scores have been positive (averaging 4.7/5) with comments including "excellent content and well done" and "one of the most excited projects I heard about" (though we are suspicious that just refers to Anil's usual high-energy presentation style...).

Probably the most interesting chat after the talk was with the Rust authors at Mozilla (@pcwalton and @brson) about combining the Mirage unikernel techniques with the Rust runtime. But perhaps the most surprising feedback was when Anil and I were stopped in the street while walking back from some well-earned sushi, by a cyclist who loudly declared that he'd really enjoyed the talk and thought it was a really exciting project -- never done something that achieved public acclaim from the streets before :)

Book Signing and

Anil also took some time to sit in a book signing for his forthcoming Real World OCaml O'Reilly book. This is really important to making OCaml easier to learn, especially given that all the Mirage libraries are using it. Most of the dev team (and especially thanks to Heidi Howard who bravely worked through really early alpha revisions) have been giving us feedback as the book is written, using the online commenting system.

The booth was also huge, and we spent quite a while plotting the forthcoming Mirage/Xen/ARM backend. We're pretty much just waiting for the Cubieboard2 kernel patches to be upstreamed (keep an eye here) so that we can boot Xen/ARM VMs on tiny ARM devices. There's a full report about this on the blog post about OSCon.

Galois and HalVM

We also stopped by Galois to chat with Adam Wick, who leads the HalVM project there. This is a similar project to Mirage but, since it's written in Haskell, has more of a focus on elegant compositional semantics rather than the more brutal performance and predictability that Mirage currently has at its lower levels.

The future of all this ultimately lies in making it easier for these multi-lingual unikernels to be managed and for all of them to communicate more easily, so we chatted about code sharing and common protocols (such as vchan) to help interoperability. Expect to see more of this once our respective implementations get more stable.

All-in-all OSCON'13 was a fun event and definitely one that we look forward returning to with a more mature version of MirageOS, to build on the momentum begun this year! Portland was an amazing host city too, but what happens in Portland, stays in Portland...

by Richard Mortier ( at Thu, 08 Aug 2013

Final Real World OCaml beta; the good, the bad and the ugly (Anil Madhavapeddy)

The second and final public beta of Real World OCaml is now available:

Release notes:

  • Over 2,000 comments from proofreaders have been resolved. We realize that reading early content is hard work, and hugely appreciate the spirited feedback! The book is now a week away from being handed over to the O’Reilly production team for copyediting, so the window for changes is limited after that. Comments reset between milestones, so beta2 is a clean slate; we’re still working through some remaining older issues.

  • The chapters on first-class modules, parsing with Menhir, and objects and classes have been significantly revised from beta1. Our thanks to Leo White for contributing significantly to the latter two chapters.

  • All the code snippets and terminal outputs are now mechanically generated. The source code is as close to public domain as practical, at:

  • The final version will have the installation chapter moved to be online only, and we intend to publish updates there to elaborate on installation and packaging mechanisms.

  • Exercises will be available after we go into production, and will also only be available online. We really like the collaborative spirit of the commenting system, and will likely extend it to collecting exercises from our readers on an ongoing basis.

There’s been quite a bit of feedback and conversation about the book, so this also seemed like a good point to checkpoint the process somewhat.

Crowd sourcing community feedback

Good: The decision to crowdsource feedback has been exhausting but very worthwhile, with over 2,200 comments posted (and over 2,000 resolved by us too!). O’Reilly has a similar platform called Atlas that wasn’t quite ready when we started our book, but I’d highly encourage new authors to go down this route and not stick with a traditional editorial scheme.

It’s simply not possible for a small group of technical reviewers to notice as many errors as the wider community has. Having said this, it’s interesting how much more focussed and critical the comments of our editor Andy Oram were when compared to most of the wider community feedback, so the commenting system is definitely a complement and not a replacement to the editorial process.

The GitHub requirement

Bad: After the first beta, we got criticized on a Hacker News thread for passing around Github oAuth tokens without SSL. This was entirely my fault, and I corrected the site to be pure-SSL within 24 hours.

Ugly: In my defence though, I don’t want the authority that all the reviewers have granted me over their Github accounts! We need just two things to enable commenting: an identity service to cut down on spam comments, and the ability to create issues in a public repository. Unfortunately, Github’s scope API requires you to also grant us access to commit to public code repositories. Add on the fact that around 6,000 people have clicked through the oAuth API to review RWO, and you start to see just how much code we potentially have access to. I did try to reduce the damage by not actually storing the oAuth tokens on the server side. Instead, we store them in the client using a secure cookie, so you can easily reset your browser to log out.

It’s not just about authentication either: another reader points out that if they use GitHub during work hours, they have no real way of separating the news streams that result.

Much of the frustration here is that there’s nothing I can do to fix this except wait for GitHub to hopefully improve their service. I very much hope that GitHub is listening to this and has internal plans to overhaul their privilege management APIs.

Infrastructure-free hosting

Good and Bad: One of my goals with the commenting infrastructure was to try and eliminate all server-side code, so that we could simply publish the book onto Github Pages and use JavaScript for the comment creation and listing.

This almost worked out. We still need a tiny HTTP proxy for comment creation, as we add contextual information such as a milestone to every new comment to make it easier to index. Setting a milestone requires privileged access to the repository, so our server-side proxy creates the issue using the user-supplied oAuth token (so that it originates from the commenter), and then updates it (via the bactrian account) to add the milestone and insert a little contextual comment pointing back to the book paragraph where the comment originated from.

Good: The other criticism from the online feedback was the requirement to have a Github login to read the book at all. This is a restriction that we intend to lift for the final release (which will be freely available online under a CC-BY-NC-ND license), but I think it’s absolutely the right decision to gateway early adopters to get useful feedback. Even if we lost 90% of our potential reviewers through the Github auth wall, I don’t think we could have coped with another 10,000 comments in any case.

On the positive side, we didn’t have a single spam comment or other abuse of the commenting system at all.

I’ve had quite a few queries about open-sourcing the scripts that drive the server-side commenting, and this is on my TODO list for after the final book has gone to production.

Auto-generating the examples

Bad: We tried for far too long during the book writing to stumble through with manual installation instructions and hand-copied code snippets and outputs. Some of our alpha reviewers pointed out vociferously that spending time on installation and dealing with code typos was not a good use of their time.

Good: Michael Bolin was entirely correct in his criticism (and incidentally, one of our most superstar reviewers). The latest beta has an entirely mechanically generated toolchain that lets us regenerate the entire book output from a cold start by cloning the examples repository. In retrospect, I should have written this infrastructure a year ago, and I’d recommend any new books of this sort focus hard on automation from the early days.

Luckily, my automation scripts could crib heavily from existing open-source OCaml projects that had portions of what we needed, such as uTop (and my thanks to Jeremie Dimino and Christophe Troestler for their help here).

Awesome: We’re hacking on a little surprise for the final online version of the book, based on this build infrastructure. Stay tuned!

Mon, 05 Aug 2013

OCaml Lecture Notes (Heidi Howard)

I just wanted to share these slides from Yaron Minsky’s guest lecture to the COS 326 class at Princeton, “Abstractions and Types for Concurrent Programming” [pdf], and the notes from Cornell’s OCaml course, “Data Structures and Functional Programming” (edit the URL to see notes from different years).

by Heidi-ann at Thu, 01 Aug 2013

Creating Xen block devices with Mirage (MirageOS)

Mirage is a unikernel or "library operating system" that allows us to build applications which can be compiled to very diverse environments: the same code can be linked to run as a regular Unix app, relinked to run as a FreeBSD kernel module, and even linked into a self-contained kernel which can run on the Xen hypervisor.

Mirage has access to an extensive suite of pure OCaml libraries, covering everything from Xen block and network virtual device drivers, a TCP/IP stack, OpenFlow learning switches and controllers, to SSH and HTTP server implementations.

I normally use Mirage to deploy applications as kernels on top of a XenServer hypervisor. I start by first using the Mirage libraries within a normal Unix userspace application -- where I have access to excellent debugging tools -- and then finally link my app as a high-performance Xen kernel for production.

However Mirage is great for more than simply building Xen kernels. In this post I'll describe how I've been using Mirage to create experimental virtual disk devices for existing Xen VMs (which may themselves be Linux, *BSD, Windows or even Mirage kernels). The Mirage libraries let me easily experiment with different backend file formats and protocols, all while writing only type-safe OCaml code that runs in userspace in a normal Linux domain 0.

Disk devices under Xen

The protocols used by Xen disk and network devices are designed to permit fast and efficient software implementations, avoiding the inefficiencies inherent in emulating physical hardware in software. The protocols are based on two primitives:

  • shared memory pages: used for sharing both data and metadata

  • event channels: similar to interrupts, these allow one side to signal the other

In the disk block protocol, the protocol starts with the client ("frontend" in Xen jargon) sharing a page with the server ("backend"). This single page will contain the request/response metadata, arranged as a circular buffer or "ring". The client ("frontend") can then start sharing pages containing disk blocks with the backend and pushing request structures to the ring, updating shared pointers as it goes. The client will give the server end a kick via an event channel signal and then both ends start running simultaneously. There are no locks in the protocol so updates to the shared metadata must be handled carefully, using write memory barriers to ensure consistency.
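
The shape of that protocol can be sketched in plain OCaml. This is purely illustrative (the names and types here are hypothetical): the real shared-memory-ring library operates on a genuinely shared page and must insert memory barriers, which a single-process sketch like this cannot show.

```ocaml
(* Illustrative sketch of a request ring: a producer ("frontend") and a
   consumer ("backend") share an array of slots plus two counters.
   The real protocol shares a page between domains and issues write
   barriers before publishing the updated producer index. *)
type 'a ring = {
  slots : 'a option array;   (* stands in for the shared page *)
  mutable req_prod : int;    (* requests pushed so far *)
  mutable req_cons : int;    (* requests consumed so far *)
}

let create n = { slots = Array.make n None; req_prod = 0; req_cons = 0 }

(* The frontend pushes a request, then would signal the event channel. *)
let push r x =
  r.slots.(r.req_prod mod Array.length r.slots) <- Some x;
  (* a real implementation issues a write barrier here *)
  r.req_prod <- r.req_prod + 1

(* The backend drains any outstanding requests. *)
let pop r =
  if r.req_cons < r.req_prod then begin
    let i = r.req_cons mod Array.length r.slots in
    let x = r.slots.(i) in
    r.slots.(i) <- None;
    r.req_cons <- r.req_cons + 1;
    x
  end else None
```

In the real protocol both counters live in the shared metadata page, which is why the barrier between writing the slot and publishing `req_prod` matters: without it, the backend could observe the new index before the request data is visible.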

Xen disk devices in Mirage

Like everything else in Mirage, Xen disk devices are implemented as libraries. The ocamlfind library called "xenctrl" provides support for manipulating blocks of raw memory pages, "granting" access to them to other domains and signalling event channels. There are two implementations of "xenctrl": one that invokes Xen "hypercalls" directly and one which uses the Xen userspace library libxc. Both implementations satisfy a common signature, so it's easy to write code which will work in both userspace and kernelspace.

The ocamlfind library shared-memory-ring provides functions to create and manipulate request/response rings in shared memory as used by the disk and network protocols. This library is a mix of 99.9% OCaml and 0.1% asm, where the asm is only needed to invoke memory barrier operations to ensure that metadata writes issued by one CPU core appear in the same order when viewed from another CPU core.

Finally the ocamlfind library xenblock provides functions to hotplug and hotunplug disk devices, together with an implementation of the disk block protocol itself.

Making custom virtual disk servers with Mirage

Let's experiment with making our own virtual disk server based on the Mirage example program, xen-disk.

First, install Xen, OCaml and OPAM. Second initialise your system:

opam init
eval `opam config env`

At the time of writing, not all the libraries were released as upstream OPAM packages, so it was necessary to add some extra repositories. This should not be necessary after the Mirage developer preview at OSCON 2013.

opam remote add mirage-dev git://
opam remote add xapi-dev git://

Install the unmodified xen-disk package; this will ensure all the build dependencies are installed:

opam install xen-disk

When this completes it will have installed a command-line tool called xen-disk. If you start a VM using your Xen toolstack of choice ("xl create ..." or "xe vm-install ..." or "virsh create ...") then you should be able to run:

xen-disk connect <vmname>

which will hotplug a fresh block device into the VM "<vmname>" using the "discard" backend, which returns "success" to all read and write requests, but actually throws all data away. Obviously this backend should only be used for basic testing!

Assuming that worked ok, clone and build the source for xen-disk yourself:

git clone git://
cd xen-disk

Making a custom virtual disk implementation

The xen-disk program has a set of simple built-in virtual disk implementations. Each one satisfies a simple signature, contained in src/storage.mli:

type configuration = {
  filename: string;      (** path where the data will be stored *)
  format: string option; (** format of physical data *)
}
(** Information needed to "open" a disk *)

module type S = sig
  (** A concrete mechanism to access and update a virtual disk. *)

  type t
  (** An open virtual disk *)

  val open_disk: configuration -> t option Lwt.t
  (** Given a configuration, attempt to open a virtual disk *)

  val size: t -> int64
  (** [size t] is the size of the virtual disk in bytes. The actual
      number of bytes stored on media may be different. *)

  val read: t -> Cstruct.t -> int64 -> int -> unit Lwt.t
  (** [read t buf offset_sectors len_sectors] copies [len_sectors]
      sectors beginning at sector [offset_sectors] from [t] into [buf] *)

  val write: t -> Cstruct.t -> int64 -> int -> unit Lwt.t
  (** [write t buf offset_sectors len_sectors] copies [len_sectors]
      sectors from [buf] into [t] beginning at sector [offset_sectors]. *)
end

Let's make a virtual disk implementation which uses an existing disk image file as a "gold image", but uses copy-on-write so that no writes persist. This is a common configuration in Virtual Desktop Infrastructure deployments and is generally handy when you want to test a change quickly, and revert it cleanly afterwards.

A useful Unix technique for file I/O is to "memory map" an existing file: this associates the file contents with a range of virtual memory addresses so that reading and writing within this address range will actually read or write the file contents. The "mmap" C function has a number of flags, which can be used to request "copy on write" behaviour. Reading the OCaml manual Bigarray.map_file it says:

If shared is true, all modifications performed on the array are reflected in the file. This requires that fd be opened with write permissions. If shared is false, modifications performed on the array are done in memory only, using copy-on-write of the modified pages; the underlying file is not affected.

So we should be able to make a virtual disk implementation which memory maps the image file and achieves copy-on-write by setting "shared" to false. For extra safety we can also open the file read-only.

Luckily there is already an "mmap" implementation in xen-disk; all we need to do is tweak it slightly. Note that the xen-disk program uses a co-operative threading library called lwt which replaces functions from the OCaml standard library which might block with non-blocking variants. In particular lwt uses Lwt_bytes.map_file as a wrapper for the Bigarray.Array1.map_file function. In the "open_disk" function we simply need to set "shared" to "false" to achieve the behaviour we want i.e.

let open_disk configuration =
  let fd = Unix.openfile configuration.filename [ Unix.O_RDONLY ] 0o0 in
  let stats = Unix.LargeFile.fstat fd in
  let mmap = Lwt_bytes.map_file ~fd ~shared:false () in
  Unix.close fd;
  return (Some (stats.Unix.LargeFile.st_size, Cstruct.of_bigarray mmap))

The read and write functions can be left as they are:

let read (_, mmap) buf offset_sectors len_sectors =
  let offset_sectors = Int64.to_int offset_sectors in
  let len_bytes = len_sectors * sector_size in
  let offset_bytes = offset_sectors * sector_size in
  Cstruct.blit mmap offset_bytes buf 0 len_bytes;
  return ()

let write (_, mmap) buf offset_sectors len_sectors =
  let offset_sectors = Int64.to_int offset_sectors in
  let offset_bytes = offset_sectors * sector_size in
  let len_bytes = len_sectors * sector_size in
  Cstruct.blit buf 0 mmap offset_bytes len_bytes;
  return ()

Now if we rebuild and run something like:

dd if=/dev/zero of=disk.raw bs=1M seek=1024 count=1
losetup /dev/loop0 disk.raw
mkfs.ext3 /dev/loop0
losetup -d /dev/loop0

dist/build/xen-disk/xen-disk connect <myvm> --path disk.raw

Inside the VM we should be able to do some basic speed testing:

# dd if=/dev/xvdb of=/dev/null bs=1M iflag=direct count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.125296 s, 837 MB/s

Plus we should be able to mount the filesystem inside the VM, make changes and then disconnect (send SIGINT to xen-disk by hitting Control+C on your terminal) without disturbing the underlying disk contents.

So what else can we do?

Thanks to Mirage it's now really easy to experiment with custom storage types for your existing VMs. If you have a cunning scheme where you want to hash block contents, and use the hashes as keys in some distributed datastructure -- go ahead, it's all easy to do. If you have ideas for improving the low-level block access protocol then Mirage makes those experiments very easy too.

If you come up with a cool example with Mirage, then send us a pull request or send us an email to the Mirage mailing list -- we'd love to hear about it!

by Dave Scott ( at Thu, 18 Jul 2013

Profiling OCaml – Getting Started Guide (Heidi Howard)

“Perf” is a common Linux command-line tool used for code profiling (see the perf wiki). An alpha version of an OCaml native-code compiler that outputs code that can be analysed by perf is now available in OPAM.


Installing the perf compatible OCaml compiler is straightforward with OPAM, though quite time-consuming due to the need to re-install many packages.

$ opam remote add perf git://
$ opam switch 4.01-perf-annotate
$ eval `opam config env`
$ opam install

Installing perf was also straightforward; in fact I already had it via the linux-tools package in apt-get for Ubuntu.

sudo apt-get install linux-tools

Compiling with the new perf-compatible OCaml compiler was beautifully simple: running make within an existing project worked first time, without any further changes necessary.

Basic reporting is collected and viewed using:

sudo perf record ./myprogram.native -o
sudo perf report -i
sudo perf script -i

Similarly basic stats can be collected using:

sudo perf stat ./myprogram.native -o
sudo cat
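
For reference, a complete record-and-report cycle might look like the following. The file name here is an assumption on my part (perf.data is simply perf's default output file), since the exact names will depend on your setup:

```shell
# record samples while the program runs (perf.data is the default output)
sudo perf record -o perf.data ./myprogram.native

# interactive summary of where time was spent
sudo perf report -i perf.data

# raw per-sample trace, useful for post-processing
sudo perf script -i perf.data
```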

When finished you can switch back to your normal compiler version, i.e.

$ opam switch 4.00.1
$ eval `opam config env`

by Heidi-ann at Wed, 10 Jul 2013

Learning Async (Heidi Howard)

If you're taking your first steps in Janestreet’s Async library, here are some sources of help & support:

My top 5 sources of Async Documentation

  1. Official Janestreet Docs for Async/Core: These are generated from the .mli’s in the Async source and you will quickly find places where ocamldoc has failed to cope with Core/Async’s complex module structure. In that case, I recommend the source code:
  2. Dummy’s guide to Async, a useful intro to bind, upon and map (the 3 fundamental functions):
  3. Real World OCaml chapter on Async, covers Deferred basics, Ivars, Reader/Writer, TCP server, running the scheduler, Command, Pipes, Monitors, Delays, Choose and System Threads:
  4. Yaron Minsky (one of the co-authors of RWO) provides a nice little example of Async’s RPC libraries:
  5. My favourite project for example Async and Lwt code is cohttp; otherwise there’s OPAM’s list of Async-dependent projects, useful for finding more example code:
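
To make the three fundamental functions from item 2 concrete, here is a minimal sketch assuming the 2013-era Async.Std interface (where `>>=` is bind, `` is map, and `>>>` is upon):

```ocaml
(* Minimal Async sketch: map, bind and upon on deferred values.
   Assumes Core.Std and Async.Std as provided by the 2013-era
   core/async OPAM packages. *)
open Core.Std
open Async.Std

let main () =
  let d = return 21 in                       (* an already-determined deferred *)
  let doubled = ~f:(fun x -> x * 2) in  (* map: transform the eventual value *)
  doubled >>= fun n ->                       (* bind: sequence a dependent computation *)
  printf "result: %d\n" n;
  return ()

let () =
  (* upon (written >>>) attaches a callback for its side effect only *)
  (main () >>> fun () -> shutdown 0);
  never_returns (Scheduler.go ())
```

Nothing runs until `Scheduler.go ()` starts the event loop, which is why the final `shutdown` is attached with `>>>` rather than called directly.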


Google group for Janestreet’s Core & Async, the JS team are really helpful :)!forum/ocaml-core

Async issue tracker on Github is quiet but still a good point to start a conversation:


I recommend the installation chapter of Real World OCaml: and the install instructions for OPAM:

Janestreet provide some installation hints, though this is covered more comprehensively in RWO (linked above):

Very light background reading:



by Heidi-ann at Wed, 03 Jul 2013

Phew, Real World OCaml beta now available. (Anil Madhavapeddy)

When I finished writing my PhD, I swore (as most recent graduates do) to never write a thesis again. Instead, life would be a series of pleasantly short papers, interspersed with the occasional journal article, and lots of not-writing-huge-book-activity in general.

Then CUFP 2011 happened, and I find myself in a bar in Tokyo with Yaron Minsky and Marius Eriksen, and a dangerous bet ensued. A few short weeks after that, and Yaron and I are chatting with Jason Hickey in California about writing a book about the language we love. I’m still telling myself that this will never actually happen, but then everyone’s favourite Californirishman Bryan O’Sullivan puts us in touch with O’Reilly, who published his excellent Haskell tome.

O’Reilly arranged everything incredibly fast, with our editor Andy Oram driving us through the process. We decided early on that we wanted to write a book that had opinions based on our personal experience: about how OCaml code should be written, about the standard library involved, and generally making functional programming more accessible. Along the way, we’ve been working incredibly hard on the underlying software platform too, with Jane Street, OCamlPro and my own group OCaml Labs working together on all the pieces. There’s still a lot of work left to do, of course, but we’re right on track to get all this released very soon now.

So, without further ado, I was very pleased to send this e-mail this morning. (and once again reaffirm my commitment to never writing another book ever again. Until next time!)

Yaron Minsky, Jason Hickey and I are pleased to announce the beta release of our forthcoming O’Reilly book, called “Real World OCaml”, available online at

The book is split into three parts: language concepts, tools and techniques, and understanding the runtime. As promised last year, we are making a public beta available for community review and to help us hunt down inaccuracies and find areas that need more clarification.

We’ve had the book in closed alpha for six months or so and have developed a feedback system that uses Github to record your comments. This lets us follow up to each review with clarifications and keep track of our progress in fixing issues. During alpha, we’ve received over 1400 comments in this fashion (and addressed the vast majority of them!). However, since we anticipate more comments coming in from a public beta, we would request that you read the FAQ to avoid drowning us in repeat comments:

(TL;DR: follow up on an existing comment on Github directly if you can, instead of creating a new issue via the web interface)

This release is available in HTML format online at:

O’Reilly is currently preparing a Rough Cuts release that will make the beta available as PDF and in popular eBook formats. We anticipate that this will be available later this week, and I’ll send a followup when that happens.

Finally, we would especially like to thank our alpha reviewers. Their feedback has been invaluable to the beta release. The book also includes substantial contributions to individual chapters from Jeremy Yallop (FFI), Stephen Weeks (GC) and Leo White (objects).

If you have any comments that you’d like to send directly by e-mail, please contact us at

Release notes for beta1:

  • The first-class modules chapter is incomplete, pending some portability improvements to the ocaml-plugins Core library.
  • The binary serialization chapter is also incomplete, but has just enough to teach you about the Async RPC library.
  • The installation chapter will be revised in anticipation of the OCaml 4.1 release, and is currently quite source-based.
  • The packaging and build systems chapter hasn’t been started yet. We’re still deciding whether or not to make this an online pointer rather than a print chapter, since it’s likely to change quite fast.
  • We are preparing exercises per chapter that are not included in this particular beta release, but will be available online as soon as possible.
  • The code examples will all be clonable as a separate repository in beta2.

best, Yaron, Jason and Anil

Sun, 16 Jun 2013

Last updated:
Thu, 19 Dec 2013
All times are UTC.

Powered by: Planet