Computer Laboratory

OCaml Labs

Recent Posts

Thu, 08 May 2014 Welcome to the summer Mirage hackers MirageOS
Tue, 29 Apr 2014 Writing Planet in pure OCaml Amir Chaudhry
Thu, 24 Apr 2014 Fifth OCaml compiler hacking session Compiler Hacking
Tue, 25 Feb 2014 MirageOS is in Google Summer of Code 2014 MirageOS
Tue, 11 Feb 2014 Mirage 1.1.0: the eat-your-own-dogfood release MirageOS
Tue, 11 Feb 2014 Fourth OCaml compiler hacking session Compiler Hacking
Tue, 04 Feb 2014 How to handle success Compiler Hacking
Mon, 06 Jan 2014 Functional visitors for a complex tree-shaped data structure Philippe Wang
Fri, 03 Jan 2014 Presenting Decks MirageOS
Sun, 29 Dec 2013 Dec 2013 news update Monthly News
Thu, 19 Dec 2013 Mirage 1.0.3 released; tutorial on building this website available MirageOS
Mon, 09 Dec 2013 Mirage 1.0: not just a hallucination! MirageOS
Tue, 26 Nov 2013 Switching from Bootstrap to Zurb Foundation Amir Chaudhry
Wed, 20 Nov 2013 Announcing the new OCaml.org Amir Chaudhry
Wed, 06 Nov 2013 Migration plan for the OCaml.org redesign Amir Chaudhry
Wed, 30 Oct 2013 Third OCaml compiler hacking session Compiler Hacking
Mon, 28 Oct 2013 Review of the OCaml FPDays tutorial Amir Chaudhry
Tue, 22 Oct 2013 FP Days OCaml Session Amir Chaudhry
Mon, 07 Oct 2013 FPDays 2013 Real World OCaml tutorial in Cambridge Anil Madhavapeddy
Mon, 07 Oct 2013 High-power LEDs for super-bright light Philippe Wang
Sat, 05 Oct 2013 Using Travis for secure deployments with SSH Anil Madhavapeddy
Fri, 04 Oct 2013 (Example) A use case of MPP in an OCaml program Philippe Wang
Thu, 03 Oct 2013 OPAMaging MPP Philippe Wang
Thu, 03 Oct 2013 Using MPP in two different ways Philippe Wang
Wed, 02 Oct 2013 Intellisense for OCaml with Vim and Merlin Anil Madhavapeddy
Wed, 02 Oct 2013 Liveblogging the first Human Data Interaction workshop SRG Syslog
Sun, 29 Sep 2013 Test your OCaml packages in minutes using Travis CI Anil Madhavapeddy
Thu, 26 Sep 2013 (Exercise) Odd or even? Philippe Wang
Wed, 25 Sep 2013 (Exercise) Are there more 1s or 0s? Philippe Wang
Tue, 24 Sep 2013 Liveblogging OCaml Workshop 2013 SRG Syslog
Tue, 24 Sep 2013 On the Implementation of OPAMDOC Philippe Wang
Tue, 24 Sep 2013 Feedback requested on the OCaml.org redesign Amir Chaudhry
Tue, 24 Sep 2013 OPAM-DOC with multiple OCaml packages Philippe Wang
Mon, 23 Sep 2013 Liveblogging CUFP 2013 SRG Syslog
Thu, 19 Sep 2013 OPAM 1.1 beta available, with pretty colours Anil Madhavapeddy
Tue, 17 Sep 2013 OCaml Monthly Meeting – Live Blog Heidi Howard
Tue, 17 Sep 2013 Inaugural compiler hackers meeting Compiler Hacking
Sun, 15 Sep 2013 Camlpdf, the first good command-line PDF tool I've found Anil Madhavapeddy
Mon, 09 Sep 2013 OCaml Development in Vim Heidi Howard
Sun, 08 Sep 2013 OCamlot--exploring the edges of OPAM packages Anil Madhavapeddy

Welcome to the summer Mirage hackers (MirageOS)


Following our participation in the Google Summer of Code program, we've now finalised selections. We've also got a number of other visitors joining us to hack on Mirage over the summer, so here are the introductions!

  • SSL support: Hannes Mehnert and David Kaloper have been working hard on a safe OCaml TLS implementation. They're going to hack on integrating it all so that it works under Xen and we can make HTTPS requests (and our Twitter bot will finally be able to tweet!). Both are also interested in formal verification of the result, and several loooong conversations with Peter Sewell will magically transform into machine specifications by summer's end, I'm reliably informed.
  • Cloud APIs: Jyotsna Prakash will spend her summer break as part of Google Summer of Code working on improving cloud provider APIs in OCaml (modelled from her notes on how the GitHub bindings are built). This will let the mirage command-line tool have much more natural integration with remote cloud providers for executing the unikernels straight from a command-line. If you see Jyotsna wandering around aimlessly muttering darkly about HTTP, JSON and REST, then the project is going well.
  • Network Stack fuzzing: Mindy Preston joins us for the summer after her Hacker School stay, courtesy of the OPW program. She's been delving into the network stack running on EC2 and figuring out how to debug issues when the unikernel is running in a cloud far, far away (see the post series here: 1, 2, 3, 4).
  • Visualization: Daniel Buenzli returns to Cambridge this summer to continue his work on extremely succinctly named graphics libraries. His Vz, Vg and Gg libraries provide a set of primitives for 2D graphics programming. Since the libraries compile to JavaScript, we're planning to use this as the basis for visualization of Mirage applications via a built-in webserver.
  • Modular implicits: Frederic Bour, author of the popular Merlin IDE tool, is also in Cambridge this summer working on adding modular implicits to the core OCaml language. Taking inspiration from Modular Type-classes and Scala's implicits, modular implicits allow functions to take implicit module arguments which the compiler fills in by searching the environment for a module with the appropriate type. This enables ad-hoc polymorphism in a very similar way to Haskell's type classes (a small sketch follows this list).
  • Irminsule storage algorithms: Benjamin Farinier (from ENS Lyon) and Matthieu Journault (from ENS Cachan) will work on data structures for the Irminsule storage system that the next version of Mirage will use. They'll be grabbing copies of the classic Okasaki text and porting some of its data structures into a branch-consistent form.
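
To give a flavour of the modular implicits idea mentioned above, here is a small sketch. The explicit version below compiles with today's OCaml, passing the "dictionary" module by hand as a first-class module; under the modular implicits proposal the module argument would be marked implicit and inferred by the compiler, so the call site becomes just print 3. The module names here are made up purely for illustration.

module type Show = sig
  type t
  val show : t -> string
end

(* explicit version: the caller passes the Show implementation itself *)
let print (type a) (module S : Show with type t = a) (x : a) =
  print_endline (S.show x)

module Show_int = struct
  type t = int
  let show = string_of_int
end

let () = print (module Show_int : Show with type t = int) 3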

Of course, work continues apace by the rest of the team as usual, with a steady stream of releases that are building up to some exciting new features. We'll be blogging about ARM support, PVHVM, Irminsule storage and SSL integration just as soon as they're pushed into the stable branches. As always, get in touch via the IRC channel (#mirage on Freenode) or the mailing lists with questions.

by Anil Madhavapeddy (anil@recoil.org) at Thu, 08 May 2014

Writing Planet in pure OCaml (Amir Chaudhry)


I've been learning OCaml for some time now but not really had a problem that I wanted to solve. As such, my progress has been rather slow and sporadic and I only make time for exercises when I'm travelling. In order to focus my learning, I have to identify and tackle something specific. That's usually the best way to advance and I recently found something I can work on.

As I've been trying to write more blog posts, I want to be able to keep as much content on my own site as possible and syndicate my posts out to other sites I run. Put simply, I want to be able to take multiple feeds from different sources and merge them into one feed, which will be served from some other site. In addition, I also want to render that feed as HTML on a webpage. All of this has to remain within the OCaml toolchain so it can be used as part of Mirage (i.e. I can use it when building unikernels).

What I'm describing might sound familiar and there's a well-known tool that does this called Planet. It's a 'river of news' feed reader that aggregates feeds and can display posts on webpages, and you can find the original Planet and its successor Venus, both written in Python. However, Venus seems to be unmaintained as there are a number of unresolved issues and pull requests, which have been languishing for quite some time with no discussion. There does appear to be a more active Ruby implementation called Pluto, with recent commits and no reported issues.

Benefits of a Planet in pure OCaml

Although I could use one of the above options, it would be much more useful to keep everything within the OCaml ecosystem. This way I can make the best use of the unikernel approach with Mirage (i.e. lean, single-purpose appliances). Obviously, the existing options don't lend themselves to this approach and there are known bugs as a lot has changed on the web since Planet Venus (e.g. the adoption of HTML5). Having said that, I can learn a lot from the existing implementations and I'm glad I'm not embarking into completely uncharted territory.

In addition, the OCaml version doesn't need to (and shouldn't) be written as one monolithic library. Instead, pulling together a collection of smaller, reusable libraries that present clear interfaces to each other would make things much more maintainable. This would bring substantially greater benefits to everyone and OPAM can manage the dependencies.

Breaking down the problem

The first cut is somewhat straightforward as we have a piece that deals with the consumption and manipulation of feeds and another that takes the result and emits HTML. This is also how the original Planet is put together, with a library called feedparser and another for templating pages.

For the feed-parsing aspect, I can break it down further by considering Atom and RSS feeds separately and then even further by thinking about how to (1) consume such feeds and (2) output them. Then there is the HTML component, where it may be necessary to consider existing representations of HTML. These are not new ideas and since I'm claiming that individual pieces might be useful then it's worth finding out which ones are already available.

Existing components

The easiest way to find existing libraries is via the OPAM package list. Some quick searches for RSS, XML, HTML and net bring up a lot of packages. The most relevant of these seem to be xmlm, ocamlrss, cow and maybe xmldiff. I noticed that nothing appears when searching for Atom, but I do know that cow has an Atom module for creating feeds. In terms of turning feeds into pages and HTML, I'm aware of rss2html used on the OCaml website and parts of ocamlnet that may be relevant (e.g. nethtml and netstring) as well as cow. There is likely to be other code I'm missing but this is useful as a first pass.
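
To get a feel for xmlm before committing to it, here's a tiny sketch of its pull interface that just prints element names from an XML string. It reflects the Xmlm API as I currently understand it and isn't part of the planned parser.

let print_element_names xml_string =
  (* xmlm's pull interface: make an input source, then read signals *)
  let input = Xmlm.make_input (`String (0, xml_string)) in
  let rec loop () =
    if Xmlm.eoi input then ()
    else begin
      (match Xmlm.input input with
       | `El_start ((_ns, name), _attrs) -> print_endline name
       | `El_end | `Data _ | `Dtd _ -> ());
      loop ()
    end
  in
  loop ()

let () =
  print_element_names
    "<feed xmlns='http://www.w3.org/2005/Atom'><title>example</title></feed>"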

Overall, a number of components are already out there but it's not obvious if they're compatible (e.g. the HTML representations) and there are still gaps (e.g. Atom support). Since I also want to minimise dependencies, I'll try to use whatever works but may ultimately have to roll my own. Either way, I can learn from what already exists. Perhaps I'm being overconfident but if I can break things down sensibly and keep the scope constrained then this should be an achievable project.

The first (baby) steps - an Atom parser

As this is an exercise for me to learn OCaml by solving a problem, I need to break it down into bite-size pieces and take each one at a time. Practically speaking, this means limiting the scope to be as narrow as possible while still producing a useful result for me. That last part is important as I have specific needs and it's likely that the first thing I make won't be particularly interesting for many others.

For my specific use-case, I'm only interested in dealing with Atom feeds as that's what I use on my site and others I'm involved with. Initial feedback is that creating an Atom parser will be the bulk of the work and I should start by defining the types. To keep this manageable, I'm only going to deal with my own feeds instead of attempting a fully compliant parser (in other words, I'll only consider the subset of RFC4287 that's relevant to me). Once I can parse, merge and write such feeds I should be able to iterate from there.

To make my requirements more concrete:

  • Only consider my own Atom feeds for now
  • Initially, be able to parse and emit just one Atom feed
  • Then be able to merge 2+ feeds, specifically:
    • Use tag-based feeds from my personal site as starting points
    • Be able to de-dupe content
  • No database or storage (construct it afresh every time)
  • Minimise library dependencies
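
To make a start on the types, here's a rough sketch of the shape I have in mind for the subset above; the field names and the choice of plain strings (e.g. for timestamps) are provisional and only need to cover my own feeds.

type author = {
  name : string;
  uri  : string option;
}

type entry = {
  id      : string;          (* Atom id, used for de-duplicating entries *)
  title   : string;
  updated : string;          (* RFC 3339 date, kept as a string for now *)
  author  : author option;
  content : string;          (* the HTML body of the post *)
}

type feed = {
  feed_id      : string;
  feed_title   : string;
  feed_updated : string;
  entries      : entry list;
}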

Timeframes and workflow

I've honestly no idea how long this might take and I'm treating it as a side-project. I know there are many people out there who could produce a working version of everything in a week or two but I'm not one of them (yet). There are also a lot of ancillary things I need to learn on the way, like packaging, improving my knowledge of Git and dealing with build systems. If I had to put a vague time frame on this, I'd be thinking in months rather than weeks. It might even be the case that others start work on parts of this and ship things sooner but that's great as I'll probably be able to use whatever they create and move further along the chain.

In terms of workflow, everything will be done in the open, warts and all, and I expect to make embarrassing mistakes as I go. You can follow along on my freshly created OCaml Atom repo, and I'll be using the issue tracker as the main way of dealing with bugs and features. Let the fun begin.


Acknowledgements: Thanks to Daniel, Ashish, Christophe, Philippe and Thomas for discussions on an earlier draft of this post and providing feedback on my approach.

by Amir Chaudhry at Tue, 29 Apr 2014

Fifth OCaml compiler hacking session (Compiler Hacking)


It's time for the fifth Cambridge OCaml compiler-hacking session! We'll be meeting in the Computer Lab again next Tuesday evening.

If you're planning to come along, it'd be helpful if you could indicate interest via Doodle and sign up to the mailing list to receive updates.

Where: Room FW11, Computer Laboratory, Madingley Road

When: 6pm, Tuesday 29th April

Who: anyone interested in improving OCaml. Knowledge of OCaml programming will obviously be helpful, but prior experience of working on OCaml internals isn't necessary.

What: fixing bugs, implementing new features, learning about OCaml internals

Wiki: https://github.com/ocamllabs/compiler-hacking/wiki

We're defining "compiler" pretty broadly, to include anything that's part of the standard distribution, which means at least the standard library, runtime, tools (ocamldep, ocamllex, ocamlyacc, etc.), ocamlbuild, the documentation, and the compiler itself. We'll have suggestions for mini-projects for various levels of experience (see also some things we've worked on in previous sessions), but feel free to come along and work on whatever you fancy.

We'll also be ordering pizza, so if you want to be counted for food you should aim to arrive by 6.30pm.

by cl-ocamllabs@lists.cam.ac.uk (OCaml Labs) at Thu, 24 Apr 2014

MirageOS is in Google Summer of Code 2014 (MirageOS)


MirageOS will be part of the Google Summer of Code 2014 program, thanks to the Xen Project's participation! It's been a few years since I've mentored for GSoC, but I still have fond memories of some great projects in the past (such as the legendary Quake testing we were forced to do for hours on end). I've already received a number of queries about this year's program from potential students, so here are a few things to note to become a successful applicant.

  • Students still need to apply and be accepted. Your chances of being selected are much higher if you demonstrate some participation and code contributions (even minor) before submitting an application. Thus, even if you don't have a copy of Xen around, roll up your sleeves and head over to the installation instructions.

  • Contributions do not have to be just code. They can be documentation, help with packaging, wiki posts about a particular use, or test cases to improve code coverage.

  • It's unlikely that we'll get students who are very familiar with both OCaml and Xen (if you are, definitely get in touch with us!). You should therefore look over the project ideas as a set of guidelines and not firm suggestions. If you have a particular thing you'd like to do with Mirage (for example, work on the JavaScript backend, an IPython interface or a particular protocol implementation such as XMPP), then that's fine. Just get in touch with us on the mailing lists or directly via e-mail, and we can work through it.

  • Under some circumstances, we can provide resources such as a login to a Xen machine, or delegated credits on a cloud provider. Don't let that stop you from applying for a project idea. In general though, it's best to only depend on your own computer resources if practical to do so.

by Anil Madhavapeddy (anil@recoil.org) at Tue, 25 Feb 2014

Mirage 1.1.0: the eat-your-own-dogfood release (MirageOS)


We've just released Mirage 1.1.0 into OPAM. Once the live site updates, you should be able to run opam update -u and get the latest version. This release is the "eat our own dogfood" release; as I mentioned earlier in January, a number of the Mirage developers have decided to shift our own personal homepages onto Mirage. There's nothing better than using our own tools to find all the little annoyances and shortcomings, and so Mirage 1.1.0 contains some significant usability and structural improvements for building unikernels.

Functional combinators to build device drivers

Mirage separates the application logic from the concrete backend in use by writing the application as an OCaml functor that is parameterized over module types that represent the device driver signature. All of the module types used in Mirage can be browsed in one source file.
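
As a flavour of what those signatures look like, here is a heavily simplified, illustrative console signature in the spirit of the V1 module types; the real signature in mirage-types carries more operations, an error type and device connection handling.

module type CONSOLE = sig
  type t                               (* a connected console device *)
  type 'a io                           (* the blocking (Lwt) monad *)
  val log   : t -> string -> unit      (* best-effort, non-blocking log *)
  val log_s : t -> string -> unit io   (* log and wait for completion *)
end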

In Mirage 1.1.0, Thomas Gazagnaire implemented a combinator library that makes it easy to separate the definition of application logic from the details of the device drivers that actually execute the code (be it a Unix binary or a dedicated Xen kernel). It lets us write code of this form (taken from mirage-skeleton/block):

(* config.ml: describe the unikernel and the devices it needs *)
open Mirage

let () =
  let main = foreign "Unikernel.Block_test" (console @-> block @-> job) in
  let img = block_of_file "disk.img" in
  register "block_test" [main $ default_console $ img]

In this configuration fragment, our unikernel is defined as a functor over a console and a block device by using console @-> block @-> job. We then define a concrete version of this job by applying the functor (using the $ combinator) to a default console and a file-backed disk image.
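
For reference, here is a minimal sketch of what the Unikernel.Block_test functor named in that fragment could look like, written against the V1_LWT module types; the real mirage-skeleton example exercises the block device rather more thoroughly.

module Block_test (C : V1_LWT.CONSOLE) (B : V1_LWT.BLOCK) = struct
  (* the mirage tool generates code that connects the devices and then
     calls start with the resulting handles *)
  let start c _b =
    C.log c "block device connected; tests would run here";
    Lwt.return ()
end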

The combinator approach lets us express complex assemblies of device driver graphs by writing normal OCaml code, and the mirage command line tool parses this at build-time and generates a main.ml file that has all the functors applied to the right device drivers. Any mismatches in module signatures will result in a build error, thus helping to spot nonsensical combinations (such as using a Unix network socket in a Xen unikernel).

This new feature is covered in the tutorial, which now walks you through several skeleton examples to explain all the different deployment scenarios. It's followed by the website tutorial that explains how this website works, and how our Travis autodeployment throws the result onto the public Internet.

Who will win the race to get our website up and running first? Sadly for Anil, Mort is currently in the lead with an all-singing, all-dancing shiny new website. Will he finish in the lead though? Stay tuned!

Less magic in the build

Something that's more behind-the-scenes, but important for easier development, is a simplification in how we build libraries. In Mirage 1.0, we had several packages that couldn't be installed simultaneously, as they had to be compiled in just the right order to satisfy their dependencies.

With Mirage 1.1.0, this is all a thing of the past. All the libraries can be installed fully in parallel, including the network stack. The 1.1.0 TCP/IP stack is now built in the style of the venerable FoxNet network stack, and is parameterized across its network dependencies. This means that one can quickly assemble a custom network stack from modular components, such as this little fragment below from mirage-skeleton/ethifv4/:

module Main (C: CONSOLE) (N: NETWORK) = struct

  module E = Ethif.Make(N)
  module I = Ipv4.Make(E)
  module U = Udpv4.Make(I)
  module T = Tcpv4.Flow.Make(I)(OS.Time)(Clock)(Random)
  module D = Dhcp_clientv4.Make(C)(OS.Time)(Random)(E)(I)(U)

  (* ... the start function that drives these modules is elided;
     see the full file for the rest ... *)
end

This functor stack starts with a NETWORK (i.e. Ethernet) device, and then applies functors until it ends up with a UDPv4, TCPv4 and DHCPv4 client. See the full file for how the rest of the logic works, but this serves to illustrate how Mirage makes it possible to build custom network stacks out of modular components. The functors also make it easier to embed the network stack in non-Mirage applications, and the tcpip OPAM package installs pre-applied Unix versions for your toplevel convenience.

To show just how powerful the functor approach is, the same stack can also be mapped onto a version that uses kernel sockets simply by abstracting the lower-level components into an equivalent that uses the Unix kernel to provide the same functionality. We explain how to swap between these variants in the tutorials.

Lots of library releases

While doing the 1.1.0 release in January, we've also released quite a few libraries into OPAM. Here are some of the highlights.

Low-level libraries:

  • mstruct is a streaming layer for handling lists of memory buffers with a simpler read/write interface.
  • nbd is an implementation of the Network Block Device protocol for block drivers.

Networking and web libraries:

  • ipaddr now has IPv6 parsing support thanks to Hugo Heuzard and David Sheets. This is probably the hardest bit of adding IPv6 support to our network stack!
  • cowabloga is slowly emerging as a library to handle the details of rendering Zurb Foundation websites. It's still in active development, but is being used for a few of our personal websites as well as this website.
  • cohttp has had several releases thanks to external contributions, particularly from Rudy Grinberg who added s-expression support and several other improvements.
  • uri features performance improvements and the elimination of Scanf (considered rather slow by OCaml standards).
  • cow continues its impossible push to make coding HTML and CSS a pleasant experience, with better support for Markdown now.
  • The github bindings are now also in use as part of an experiment to make upstream OCaml development easier for newcomers, thanks to Gabriel Scherer.

Dave Scott led the splitting up of several low-level Xen libraries as part of the build simplification. These now compile on both Xen (using the direct hypercall interface) and Unix (using the dom0 /dev devices) where possible.

  • xen-evtchn for the event notification mechanism. There are a couple of wiki posts that explain how event channels and suspend/resume work in Mirage/Xen guests.
  • xen-gnt for the grant table mechanism that controls inter-process memory.
  • The io-page library no longer needs Unix and Xen variants, as the interface has been standardized to work in both.

All of Dave's hacking on Xen device drivers is showcased in this xen-disk wiki post that explains how you can synthesize your own virtual disk backends using Mirage. Xen uses a split device model, and now Mirage lets us build backend device drivers that service VMs as well as the frontends!

Last, but not least, Thomas Gazagnaire has been building a brand new storage system for Mirage guests that uses git-style branches under the hood to help coordinate clusters of unikernels. We'll talk about how this works in a future update, but there are some cool libraries and prototypes available on OPAM for the curious.

  • lazy-trie is a lazy version of the Trie data structure, useful for exposing Git graphs.
  • git is a now fairly complete implementation of the Git protocol in pure OCaml, which can interoperate with normal Git servers via the ogit command-line tool that it installs.
  • irminsule is the main library that abstracts Git DAGs into an OCaml programming API. The homepage has instructions on how to play with the command-line frontend to experiment with the database.
  • git2fat converts a Git checkout into a FAT block image, useful when bundling up unikernels.

We'd also like to thank several conference organizers for giving us the opportunity to demonstrate Mirage. The talk video from QCon SF is now live, and we also had a great time at FOSDEM recently (summarized by Amir here). So lots of activities, and no doubt little bugs lurking in places (particularly around installation). As always, please do let us know of any problem by reporting bugs, or feel free to contact us via our e-mail lists or IRC. Next stop: our unikernel homepages!

by Anil Madhavapeddy (anil@recoil.org) at Tue, 11 Feb 2014

Fourth OCaml compiler hacking session (Compiler Hacking)


It's time for the fourth Cambridge OCaml compiler-hacking session! We'll be meeting in the Computer Lab again next Tuesday evening.

If you're planning to come along, it'd be helpful if you could indicate interest via Doodle and sign up to the mailing list to receive updates.

Where: Room FW11, Computer Laboratory, Madingley Road

When: 6pm, Tuesday 18th February

Who: anyone interested in improving OCaml. Knowledge of OCaml programming will obviously be helpful, but prior experience of working on OCaml internals isn't necessary.

What: fixing bugs, implementing new features, learning about OCaml internals

Wiki: https://github.com/ocamllabs/compiler-hacking/wiki

We're defining "compiler" pretty broadly, to include anything that's part of the standard distribution, which means at least the standard library, runtime, tools (ocamldep, ocamllex, ocamlyacc, etc.), ocamlbuild, the documentation, and the compiler itself. We'll have suggestions for mini-projects for various levels of experience (see also some things we've worked on in previous sessions), but feel free to come along and work on whatever you fancy.

We'll also be ordering pizza, so if you want to be counted for food you should aim to arrive by 6.30pm.

by cl-ocamllabs@lists.cam.ac.uk (OCaml Labs) at Tue, 11 Feb 2014

How to handle success (Compiler Hacking)


(Update: there's a Mantis issue open to discuss this proposal.)

OCaml's try construct is good at dealing with exceptions, but not so good at handling the case where no exception is raised. This post describes a simple extension to try that adds support for handling the "success" case.

Here's an example of code that benefits from the extension. On a recent caml-list thread, Simon Cruanes posted the following function for iterating over a stream:

let rec iter_stream f s =
  match (try Some (MyStream.get s) with End_of_stream -> None) with
  | None -> ()
  | Some (x, s') ->
      f x;
      iter_stream f s'

For each element of a stream, iter_stream wraps the element with Some, then unwraps it again and passes it to f. At first glance, wrapping and immediately unwrapping in this way seems like needless obfuscation. However, moving the last two lines out of the body of the try in this way serves two essential purposes: it turns the recursive call to iter_stream into a tail call, and it allows exceptions raised by f to propagate. More generally, this use of options makes it easy to specify the success continuation of a try expression, i.e. the piece of code that receives the value of the body when no exception is raised.
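
For contrast, here is a sketch of the naive version with the recursive call left inside the try body (reusing the hypothetical MyStream module from the example above); it shows exactly what the option-wrapping idiom avoids.

let rec iter_stream_naive f s =
  try
    let (x, s') = MyStream.get s in
    f x;
    iter_stream_naive f s'   (* not a tail call: the handler is still live *)
  with End_of_stream -> ()   (* also swallows an End_of_stream raised by f *)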

As Simon notes, the match (try Some ...) idiom is widely used in OCaml code. Examples can be found in the source of lwt, batteries, liquidsoap, sexplib, opa, uri, coq, unison, and many other packages.

In response to Simon's message, Oleg pointed out a solution: the 2001 paper Exceptional Syntax (Benton and Kennedy) extends try with a let-like binding construct that supports the success continuation idiom directly without the need for the option value.

This post describes a patch to OCaml that implements a variant of Benton and Kennedy's design called handler case. Like Exceptional Syntax, handler case extends try with explicit success continuation handling. However, unlike Exceptional Syntax, handler case uses match binding for both the success continuation and the exception-handling clauses. Here's the extended try syntax:

try expr
with pattern_1 -> expr_1
   | ...
   | pattern_n -> expr_n
   | val pattern_1' -> expr_1'
   | ...
   | val pattern_n' -> expr_n'

As in current OCaml, the clauses pattern_1 -> expr_1 ... pattern_n -> expr_n handle exceptions raised during the evaluation of expr. The clauses val pattern_1' -> expr_1' ... val pattern_n' -> expr_n' handle the case where no exception is raised; in this case the value of expr is matched against pattern_1' ... pattern_n' to select the expression to evaluate to produce the result value. (The actual syntax is implemented slightly more permissively: it allows value-matching and exception-matching clauses to be freely interleaved.)

Using handler case we can rewrite iter_stream to remove the extraneous option value:

let rec iter_stream f s =
  try MyStream.get s
  with End_of_stream -> ()
     | val (x, s') -> f x;
                      iter_stream f s'

We don't need to look far to find other code that benefits from the new construct. Here's a function from the Big_int module in the standard library:

let int_of_big_int bi =
  try let n = int_of_nat bi.abs_value in
    if bi.sign = -1 then - n else n
  with Failure _ ->
    if eq_big_int bi monster_big_int then monster_int
    else failwith "int_of_big_int"

The core of the function --- the call to int_of_nat --- is rather buried in the complex control flow. There are two if-then-else constructs, a let binding, and a try expression with a complex body. Using handler case we can disentangle the code to make the four possible outcomes from the call to int_of_nat explicit:

let int_of_big_int bi =
  try int_of_nat bi.abs_value with
  | val n when bi.sign = -1 ->
     -n
  | val n ->
     n
  | Failure _ when eq_big_int bi monster_big_int ->
     monster_int
  | Failure _ ->
     failwith "int_of_big_int"

Here's a simpler example from the String module, which also involves code that cannot raise an exception in the body of a try block:

try ignore (index_rec s l i c); true with Not_found -> false

Using handler case we can separate the code that may raise an exception (the call to index_rec) from the expression that produces the result:

try index_rec s l i c with val _ -> true | Not_found -> false

Trying it out

Using opam you can install an OCaml compiler extended with handler case as follows:

$ opam remote add ocamllabs git@github.com:ocamllabs/opam-repo-dev.git
ocamllabs Fetching git@github.com:ocamllabs/opam-repo-dev.git
[...]
$ opam switch 4.02.0dev+handler-syntax
# To complete the configuration of OPAM, you need to run:
eval `opam config env`
$ eval `opam config env`

js_of_ocaml

You can also try out the handler case construct in your browser, using a modified version of OCamlPro's Try OCaml application.

The discoveries of success continuations

As Philip Wadler notes, constructs for handling success continuations have been independently discovered multiple times. In fact, the history goes back even further than described in Wadler's blog; constructs like handler case date back over thirty years and have been introduced, apparently independently, into at least four languages. Curiously, all the languages use let-binding for success continuations and match binding for failure continuations.

Lisp

In Common Lisp the construct analogous to try is handler-case (from which the construct discussed here borrows its name). A handler-case expression has a body and a sequence of clauses which specify how various conditions (exceptions) should be handled. The special condition specification :no-error specifies the code to run when no condition is signalled. The iter_stream function might be written as follows in Common Lisp:

(defun iter-stream (f s)
   (handler-case (get-stream s)
      (end-of-stream (_) nil)
      (:no-error (x |s'|)
         (funcall f x)
         (iter-stream f |s'|))))

The Common Lisp specification was completed in 1994 but the handler-case construct and its :no-error clause were present in some of Common Lisp's progenitors. The construct was apparently introduced to Symbolics Lisp some time around 1982: it appears in the 5th edition of the Lisp Machine manual (January 1983) but not the 4th edition from 18 months earlier (July 1981).

Python

Python has supported success continuations in exception handlers since May 1994, when the else clause was added to try blocks. The Changelog in old versions of the Grammar/Grammar file has an entry

# 3-May-94:
#Added else clause to try-except

introduced in a commit from August 1994:

changeset:   1744:6c0e11b94009
branch:      legacy-trunk
user:        Guido van Rossum <guido@python.org>
date:        Mon Aug 01 11:00:20 1994 +0000
summary:     Bring alpha100 revision back to mainline

Unlike the :no-error clause in Lisp's handler-case, Python's else clause doesn't bind variables. Since Python variables have function scope, not block scope, bindings in the body of the try block are visible throughout the function. In Python we might write iter_stream as follows:

def iter_stream(f, s):
   try:
      (x, s_) = MyStream.get(s)
   except End_of_stream:
      pass
   else:
      f(x)
      iter_stream(f, s_)

The provenance of the else clause is unclear, but it doesn't seem to derive from Lisp's handler-case. The design of Python's exception handling constructs comes from Modula-3, but the exception handling construct described in the Modula-3 report does not include a way of specifying the success continuation. The syntax of the Modula-3 TRY/EXCEPT statement (found on p21 of the report) does include an ELSE clause:

   TRY    
     Body
   EXCEPT
     id1 (v1) => Handler1
   | ...
   | idn (vn) => Handlern
   ELSE Handler0
   END

However, whereas Python's else handles the case where no exception is raised, Modula-3's ELSE handles the case where an exception not named in one of the EXCEPT clauses is raised: it is equivalent to Python's catch-all except:.

Python also adds success handlers to other constructs. Both the for and the while statements have an optional else clause which is executed unless the loop terminates prematurely with an exception or break.

Exceptional Syntax

The 2001 paper Exceptional Syntax (Benton and Kennedy) proposed the following construct for handling exceptions in Standard ML:

let val pattern_1 <= expr_1
    ...
    val pattern_n <= expr_n
 in
    expr
unless
    pattern_1' => expr_1'
  | ...
  | pattern_n' => expr_n'
end

Evaluation of the let binding proceeds as normal, except that if any of expr_1 to expr_n raises an exception, control is transferred to the right hand side of the first of the clauses after unless whose left hand side matches the exception. The construct is largely similar to our proposed variation, except that the bindings used in the success continuation are based on let, so scrutinising the values requires a separate case (i.e. match) expression.

Using the Exceptional Syntax construct we might write iter_stream as follows:

fun iter_stream f s =
 let val (x, s') <= MyStream.get s in
     f x;
     iter_stream f s'
 unless End_of_stream => ()
 end

Exceptional Syntax has been implemented in the SML-to-Java compiler MLj.

Erlang

The 2004 paper Erlang's Exception Handling Revisited (Richard Carlsson, Björn Gustavsson and Patrik Nyblom) proposed an exception-handling construct for Erlang along the same lines as exceptional syntax, although apparently developed independently. In the proposed extension to Erlang we might write iter_stream as follows:

iter_stream(F, S) ->
   try Mystream:get(S) of
      {X, S_} ->
        _ = F(X),
        iter_stream(F, S_)
   with
    End_of_stream -> {}

Eff

Plotkin and Pretnar's work on handlers for algebraic effects generalises Exceptional Syntax to support effects other than exceptions. The programming language eff implements a design based on this work, and supports Exceptional Syntax, again with let binding for the success continuation. (Although the success continuation is incorporated into the exception matching construct, only a single success continuation pattern is allowed.) In eff we might write iter_stream as follows:

let rec iter_stream f s =
  handle my_stream_get s
  with exn#end_of_stream _ _ -> ()
     | val (x, s') -> f x;
                      iter_stream f s'

The second argument in the end_of_stream clauses binds the continuation of the effect, allowing handling strategies other than the usual stack unwinding. Since we ignore the continuation argument the behaviour is the same as for a regular exception handler.

The eff implementation uses the term "handler case" for the clauses of the handle construct.

OCaml

Several OCaml programmers have proposed or implemented constructs related to handler case.

Martin Jambon has implemented a construct equivalent to Exceptional Syntax for OCaml as part of the micmatch extension. His implementation allows us to write iter_stream in much the same way as Benton and Kennedy's proposal:

let rec iter_stream f s =
  let try (x, s') = my_stream_get s
   in f x;
      iter_stream f s'
  with End_of_stream -> ()

The details of the implementation are discussed in Jake Donham's articles on Camlp4. The micmatch implementation has a novel feature: the let binding associated with the success continuation may be made recursive.

Alain Frisch has proposed and implemented a more powerful extension to OCaml, Static Exceptions, which allow transfer of control to lexically-visible handlers (along the lines of Common Lisp's block and return-from). Static exceptions are based on an equivalent feature in OCaml's intermediate language.

There is a straightforward translation from OCaml extended with handler case into OCaml extended with static exceptions by wrapping the body of each try expression in raise (`Val (...)), and changing the val keyword in the binding section to `Val. For example, iter_stream can be written using static exceptions as follows:

let rec iter_stream f s =
  try raise (`Val (MyStream.get s))
  with End_of_stream -> ()
     | `Val (x, s') -> f x;
                       iter_stream f s'

Of course, static exceptions allow many other programs to be expressed that are not readily expressible using handler case.

Finally, I discovered while writing this article that Christophe Raffalli proposed the handler case design fifteen years ago in a message to caml-list! Christophe's proposal wasn't picked up back then, but perhaps the time has now come to give OCaml programmers a way to handle success.

Postscript: a symmetric extension

The try construct in current OCaml supports matching against raised exceptions but not against the value produced when no exception is raised. Contrariwise, the match construct supports matching against the value produced when no exception is raised, but does not support matching against raised exceptions. As implemented, the patch addresses this asymmetry, extending match with clauses that specify the "failure continuation":

match expr
with pattern_1 -> expr_1
   | ...
   | pattern_n -> expr_n
   | exception pattern_1' -> expr_1'
   | ...
   | exception pattern_n' -> expr_n'

With this additional extension the choice between match and try becomes purely stylistic. We might optimise for succinctness, and use try in the case where exceptions are expected (for example, where they're used for control flow), reserving match for the case where exceptions are truly exceptional.

For the sake of completeness, here's iter_stream written with the extended match construct:

let rec iter_stream f s =
  match MyStream.get s with
     (x, s') -> f x;
                iter_stream f s'
   | exception End_of_stream -> ()

Since both val and exception are existing keywords, the extensions to both try and match are fully backwards compatible.

by cl-ocamllabs@lists.cam.ac.uk (OCaml Labs) at Tue, 04 Feb 2014

Presenting Decks (MirageOS)


A few months ago, partly as a stunt, mostly because we could, Anil and I put together a presentation for OSCON'13 about Mirage in Mirage. That is, as a self-hosting Mirage web application that served up slides using RevealJS. It was a bit of a hack, but it was cool (we thought!) and it worked. Several more presentations were written and given this way, at venues ranging from the XenSummit 2013 to ACM FOCI 2013 to the Cambridge Computer Lab's MSc in Advanced Computer Science.

With the release of Mirage 1.0, CoHTTP, Cowabloga and the new Zurb Foundation based website, it was time to refresh them and, as a little seasonal gift, give them a shiny new index with some actual CSS styling. So here they are: a set of presentations that have been given by various members of the Mirage team over the last 6 months or so. They cover a range of topics, from general introductions to the Xen roadmap to more detailed technical background. And, of course, as Mirage is under constant rapid development, some of the older content may already be outdated. But the code for the site itself serves as another example of a simple -- somewhat simpler than the Mirage website in fact -- Mirage web application.

by Richard Mortier (mort@cantab.net) at Fri, 03 Jan 2014

Dec 2013 news update (Monthly News)


This time last year in 2012, I had just announced the formation of a new group called OCaml Labs in the Cambridge Computer Lab that would combine research and community work towards the practical application of functional programming. An incredible year has absolutely flown by, and I've put together this post to summarise what's gone on, and point to our future directions for 2014.

The theme of our group was not to be pure research, but rather a hybrid group that would take on some of the load of day-to-day OCaml maintenance from INRIA, as well as help grow the wider OCaml community. To this end, all of our projects have been highly collaborative, often involving colleagues from OCamlPro, INRIA, Jane Street, Lexifi and Citrix.

This post covers progress in tooling, the compiler and language, community efforts, research projects and concludes with our priorities for 2014.

Tooling

At the start of 2013, OCaml was in the interesting position of being a mature decades-old language with a small, loyal community of industrial users who built mission critical applications using it. We had the opportunity to sit down with many of them at the OCaml Consortium meeting and prioritise where we started work. The answer came back clearly: while the compiler itself is legendary for its stability, the tooling around it (such as package management) was a pressing problem.

OPAM

Our solution to this tooling problem was centered around the OPAM package manager, which OCamlPro released into beta just at the end of 2012 and which had its first stable release in March 2013. OPAM differs from most system package managers by emphasising a flexible distributed workflow that uses version constraints to ensure incompatible libraries aren't mixed up (important for statically-typed OCaml, which is very careful about dependencies). Working closely with OCamlPro we developed a git-based workflow to make it possible for users (both individual and industrial) to easily build up their own package repositories and redistribute OCaml code, and started curating the package repository.

The results have been satisfying: we started with an initial set of around 100 packages in OPAM (mostly imported by the 4 developers), and ended 2013 with 587 unique packages and 2000 individual versions, with contributions from 160 individuals. We now have a curated central package repository to which anyone can submit their OCaml code, and several third-party remotes are maintained (e.g. by the Xen Project and Ocsigen). We also regularly receive releases of the Core libraries from Jane Street, and updates from sources as varied as Facebook, Coherent PDF and the Frenetic SDN research.

A notable contribution from OCamlPro during this time was to clarify the licensing on the package repository to be the liberal CC0, and also to pass ownership to the OCaml organization on GitHub, where it's now jointly maintained by OCaml Labs, OCamlPro and anyone else that wishes to contribute.

A lens into global OCaml code

It's been quite interesting just watching all the varied code fly into the repository, but stability quickly became a concern as the new packages piled up. OCaml compiles to native code on not just x86, but also PowerPC, Sparc and ARM CPUs. We kicked off various efforts at automated testing: first, David Sheets built the OCamlot daemon to schedule builds across all the exotic hardware. Later in the year, the Travis service launched support for testing from GitHub pull requests, and this became the front line of automated checking for all new packages coming into OPAM.

A major headache with automated testing is usually setting up the right build environment with external library dependencies, and so we added Docker support to make it easier to bulk-build packages for local developer use, with the results of builds available publicly for anyone to help triage. Unfortunately, fixing the bugs themselves is still a very manual process, so more volunteers are always welcome to help out!

We're really going to see the rewards from all this effort as OCaml 4.02 development proceeds, since we can now adopt a data-driven approach to changing language features instead of guessing how much third-party code will break. If your code is in OPAM, then it'll be tested as new features such as module aliases, injectivity and extension points show up.

Better documentation

The venerable OCamlDoc tool has done an admirable job for the last decade, but is increasingly showing its age due to a lack of support for cross-referencing across packages. We started working on this problem in the summer when Vincent Botbol visited us on an internship, expecting it to be a quick job to come up with something as good as Haskell's excellent Haddock online documentation.

Instead, we ran into the "module wall": since OCaml makes it so easy to parameterise code over other modules, it makes it hard to generate static documentation without outputting hundreds of megabytes of HTML every time. After some hard work from Vincent and Leo, we've got a working prototype that lets you simply run opam install opam-doc && opam doc core async to generate package documentation. You can see the results for Mirage online, but expect to see this integrated into the main OCaml site for all OPAM packages as we work through polishing up the user interface.

Turning OPAM into libraries

The other behind-the-scenes effort for OPAM has been to keep the core command-line tool simple and stable, and to have it install OCaml libraries that can be interfaced with by other tools to do domain-specific tasks. Thomas Gazagnaire, Louis Gesbert and David Sheets have been steadily hacking away at this and we now have opamfu to run operations over all packages, and an easy-to-template opam2web that generates the live opam.ocaml.org website.

This makes OPAM easier to deploy within other organizations that want to integrate it into their workflow. For example, the software section of the OCaml Labs website is regularly generated from a search of all OPAM packages tagged ocamllabs. We also used it to rewrite the entire OPAM repository in one epic diff to add external library dependencies via a command-line shim.

OPAM-in-a-Box

All of this effort is geared towards making it easier to maintain reusable local OPAM installations. After several requests from big universities to help out their teaching needs, we're putting together all the support needed to easily redistribute OPAM packages via an "OPAM-in-a-Box" command that uses Docker containers to let you clone and do lightweight modifications of OCaml installations.

This will also be useful for anyone who'd like to run tutorials or teach OCaml, without having to rely on flaky network connectivity at conference venues: a problem we've suffered from too!

Core Compiler

Starting to work on a real compiler can often be a daunting prospect, and so one initiative we started this year is to host regular compiler hacking sessions where people could find a curated list of features to work on, with the regular developers at hand to help out when people get stuck, and free beer and pizza to oil the coding wheels. This has worked out well, with around 20 people showing up on average for the three we held, and several patches submitted upstream to OCaml. Gabriel Scherer and Damien Doligez have been helping this effort by tagging junior jobs in the OCaml Mantis bug tracker as they are filed.

Syntax transformations and extension points

Leo White started the year fresh out of completing his PhD with Alan Mycroft, and before he realized what he'd gotten himself into, he was working with Alain Frisch on the future of syntax transformations in OCaml. We started our first working group, wg-camlp4, on the new lists.ocaml.org host, and a spirited discussion began that went on for several months. It ended with a very satisfying design for a simpler extension points mechanism, which Leo presented at the OCaml 2013 workshop at ICFP and which is now merged into OCaml 4.02-trunk.

Namespaces

Not all of the working groups were quite as successful in coming to a conclusion as the Camlp4 one. On the Platform mailing list, Gabriel Scherer started a discussion on the design for namespaces in OCaml. The resulting discussion was useful in separating multiple concerns that were intermingled in the initial proposal, and Leo wrote a comprehensive blog post on a proposed namespace design.

After further discussion with Jacques Garrigue at ICFP 2013 later in the year, it turned out that adding support for module aliases would remove much of the compilation cost associated with large libraries such as Core, with no backwards-compatibility issues. This solution has now been integrated into OCaml 4.02.0dev and is being tested with Core.

Delving into the bug tracker

Jeremy Yallop joined us in April, and he and Leo also leapt into the core compiler and started triaging issues on the OCaml bug tracker. This seems unglamorous at first, but there rapidly turned out to be many fascinating threads that shed light on OCaml's design and implementation through seemingly harmless bugs. Here is a pick of some of the interesting threads from the year that we've been involved with.

This is just a sample of some of the issues solved in Mantis; if you want to learn more about OCaml, it's well worth browsing through it to learn from over a decade of interesting discussions from all the developers.

Thread-local storage runtime

While OCamlPro was working on their reentrant OCaml runtime, we took a different tack by adding thread-local storage to the runtime instead, courtesy of Stephen Dolan. This is an important choice to make at the outset of adding multicore, so both approaches are warranted. The preemptive runtime adds a lot of code churn (due to adding a context parameter to most function calls) and takes up a register, whereas the thread-local storage approach we tried doesn't permit callbacks to different threads.

Much of this work isn't interesting on its own, but forms the basis for a fully multicore runtime (with associated programming model) in 2014. Stay tuned!

Ctypes

One other complaint from the Consortium members was quite surprising: the difficulty of safely using the OCaml foreign function interface to bind to C code. Jeremy Yallop began working on the ctypes library with the goal of eliminating the need to write any C code at all for the vast majority of foreign bindings.

Instead, Ctypes lets you describe any C function call as an OCaml value, and provides various linkage options for actually invoking the underlying C function. The first option he implemented was a dlopen interface, which immediately brought us the same level of functionality as the Python or Haskell ctypes equivalents. This early code was in itself startlingly useful and more pleasant to use than the raw FFI, and various folk (such as David Sheets with his libsodium cryptography bindings) started adopting it.
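
To give a flavour of the dlopen-style interface, here is a minimal sketch that binds the C function puts; it assumes the ctypes and ctypes.foreign packages are installed, and follows the library's introductory example as I remember it.

open Ctypes
open Foreign

(* describe puts : char* -> int as an OCaml value; foreign looks up the
   symbol via the dynamic linker and builds a callable OCaml function *)
let puts = foreign "puts" (string @-> returning int)

let () = ignore (puts "hello from C, via ctypes")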

At this point, I happened to be struggling to write the Foreign Function Interface chapter of Real World OCaml without blowing through our page budget with a comprehensive explanation of the existing system. I decided to take a risk and write about Ctypes instead, since it lets new users of the language have a far more productive experience when getting started. Xavier Leroy pointed out some shortcomings of the library in his technical book review, most notably the lack of an interface via generated C code. The design of Ctypes fully supports alternate linking mechanisms beyond dlopen though, and Jeremy has added automatic C stub generation support as well. This means that if you use Ctypes to build an OCaml binding in 2014, you can choose several mechanisms for the same source code to link to the external system. Jeremy even demonstrated a forking model at OCaml 2013 that protects the OCaml runtime from the C binding via process separation.

The effort is paying off: Daniel Bünzli ported SDL2 using ctypes and gave us extensive feedback about missing corner cases, and the resulting bindings don't require any C code to be written.

Community Efforts

Our community efforts were largely online, but we also hosted visitors over the year and ran regular face-to-face tutorials.

Online at OCaml.org

While the rest of the crew were hacking on OPAM and OCaml, Amir Chaudhry and Philippe Wang teamed up with Ashish Agarwal and Christophe Troestler to redesign and relaunch the OCaml website. Historically, OCaml's homepage has been the caml.inria.fr domain, and the ocaml.org effort was begun by Christophe and Ashish some years ago to modernize the web presence.

The webpages were already rather large with complex scripting (for example, the 99 Problems page runs the OCaml code to autogenerate the output). Philippe developed a template DSL that made it easier to unify a lot of the templates around the website, and also a Markdown parser that we could link to as a library from the rest of the infrastructure without shelling out to Pandoc.

Meanwhile, Amir designed a series of interactive wireframe sketches and gathered feedback on them from the community (http://amirchaudhry.com/ocamlorg-request-for-feedback/). A local design agency in Cambridge helped with the visual look and feel, and finally at the end of the summer we began the migration to the new website, followed by a triumphant switchover in November to the design you see today.

The domain isn't just limited to the website itself. Leo and I set up a SVN-to-Git mirror of the OCaml compiler Subversion repository on the GitHub OCaml organization, which is proving popular with developers. There is an ongoing effort to simplify the core compiler tree by splitting out some of the larger components, and so camlp4 is also now hosted on that organization, along with OASIS. We also administer several subdomains of ocaml.org, such as the mailing lists and the OPAM repository, and other services such as the OCaml Forge are currently migrating over. This was made significantly easier thanks to sponsorship from Rackspace Cloud (users of XenServer which is written in OCaml). They saw our struggles with managing physical machines and gave us developer accounts, and all of the ocaml.org infrastructure is now hosted on Rackspace. We're very grateful to their ongoing help!

If you'd like to help with the infrastructure (for example, I'm experimenting with a GitLab mirror), then please join the infrastructure@lists.ocaml.org mailing list and share your thoughts. The website team also needs help with adding content and international translations, so head over to the website issue tracker and start proposing improvements you'd like to see.

Next steps for ocaml.org

The floodgates requesting features opened up after the launch of the new look and feel. Pretty much everyone wanted deeper OPAM integration into the main website, for features such as:

  • Starring and reviewing packages
  • Integrating the opam-doc documentation with the metadata
  • Displaying test results and a compatibility matrix for non-x86 and non-Linux architectures.
  • Linking to blog posts and tutorials about each package.

Many of these features were part of the original wireframes, but we're being careful to take a long-term view of how they should be created and maintained. Rather than building all of this as a huge bloated opam2web extension, David Sheets (our resident reluctant-to-admit-it web expert) has designed an overlay directory scheme that permits overlaying different kinds of metadata onto the website. This lets one particular feature (such as blog post aggregation) be handled separately from the others via Atom aggregators.

Real World OCaml

A big effort that took up most of the year for me was finishing and publishing an O'Reilly book called Real World OCaml with Yaron Minsky and Jason Hickey. Yaron describes how it all started in his blog post, but I learnt a lot from writing a book using the open commenting scheme that we built just for this.

In particular, the book ended up shining a bright light into dark language corners that we might otherwise not have explored in OCaml Labs. Two chapters I wasn't satisfied with were the ones on objects and classes, largely because neither Yaron nor Jason nor I had ever really used their full power in our own code. Luckily, Leo White decided to pick up the baton and champion these oft-maligned (but very powerful) features of OCaml, and the result is the clearest explanation of them that I've read yet. Meanwhile, Jeremy Yallop helped out with an extensive review of the Foreign Function Interface chapter, which uses his ctypes library. Finally, Jérémie Dimino at Jane Street worked hard on adding several features to his utop toplevel that made it compelling enough to become our default recommendation for newcomers.

All in all, we ended up closing over 2000 comments in the process of writing the book, and I'm very proud of the result (freely available online, but do buy a copy if you can to support it). Still, there's more I'd like to do in 2014 to improve the ease of using OCaml further. In particular, I removed a chapter on packaging and build systems since I wasn't happy with its quality, and both Thomas Gazagnaire and I intend to spend time in 2014 on improving this part of the ecosystem.

Tutorials and Talks

We had a lively presence at ICFP 2013 this year, with the third iteration of the OCaml Workshop (OCaml 2013) held there and Stephen Dolan presenting a paper in the main conference. I liveblogged the workshop as it happened, and all the talks we gave are linked from the program. The most exciting part of the conference for a lot of us was the pair of talks by Facebook on their use of OCaml: first for program analysis using Pfff, and then to migrate their massive PHP codebase using an OCaml compiler. I also had the opportunity to participate in a panel at the Haskell Workshop on whether Haskell is too big to fail yet; lots of interesting perspectives on scaling another formerly academic language into the real world.

Yaron Minsky and I have been giving tutorials on OCaml at ICFP for several years, but the release of Real World OCaml has made it significantly easier to give them without the labor intensity of previous years (one memorable ICFP 2011 tutorial took almost two hours just to get OCaml installed on everyone's machine; at ICFP 2013 it took us 15 minutes or so to get everyone started). Still, giving tutorials at ICFP is very much preaching to the choir, and so we've started speaking at more general-purpose events.

Our first local effort was FPDays in Cambridge, where Jeremy Yallop and Amir Chaudhry ran the tutorial with help from Philippe Wang, Leo White and David Sheets. The OCaml session there ended up being the biggest of the entire two days, and Amir wrote up their experiences. One interesting change from our ICFP tutorial is that Jeremy used js_of_ocaml to teach OCaml via JavaScript by building a fun Monty Hall game.

Visitors and Interns

Since OCaml Labs is a normal group within the Cambridge Computer Lab, we often host academic visitors and interns who pass through. This year was certainly diverse, and we welcomed a range of colleagues:

We were also visited several times by Wojciech Meyer from ARM, an OCaml developer who maintained (among other things) the ocamlbuild system and worked on DragonKit (an extensible LLVM-like compiler written in OCaml). Wojciech very sadly passed away on November 18th, and we all fondly remember his enthusiastic and intelligent contributions to our small Cambridge community.

We also hosted visitors to live in Cambridge and work with us over the summer. In addition to Vincent Botbol (who worked on OPAM-doc as described earlier) we had the pleasure of having Daniel Bünzli and Xavier Clerc work here. Here's what they did in their own words.

Xavier Clerc: OCamlJava

Xavier Clerc took a break from his regular duties at INRIA to join us over the summer to work on OCaml-Java and adapt it to the latest JVM features. This is an incredibly important project to bridge OCaml with the huge Java community, and here's his report:

After a four-month visit to the OCaml Labs dedicated to the OCaml-Java project, the time has come for an appraisal! The work undertaken can be split into two areas: improvements to code generation, and interaction between the OCaml and Java languages. Regarding code generation, several classical optimizations have been added to the compiler, for example loop unrolling, more aggressive unboxing, better handling of globals, and partial evaluation (at the bytecode level). A new tool, ocamljar, has been introduced to perform post-compilation optimizations. The underlying idea is that some optimizations cannot always be applied (e.g. depending on whether multiple threads/programs will coexist), but enabling them through command-line flags would lead to recompilation and/or multiple installations of each library according to the set of chosen optimizations. It is thus far easier to first build an executable jar file, and then modify it according to these optimizations. Furthermore, this workflow allows the ocamljar tool to take advantage of whole-program information for some optimizations. All these improvements, combined, often lead to a gain of roughly 1/3 in execution time.

Regarding language interoperability, there are two directions, depending on whether you want to call OCaml code from Java or Java code from OCaml. For the first direction, a tool generates Java source files from compiled OCaml interfaces, mapping the various constructs of the OCaml language to Java classes. It is then possible to call functions and manipulate instances of OCaml types in pure Java, still benefiting from the type safety provided by the OCaml language. In the other direction, an extension of the OCaml typer allows Java instances to be created and manipulated directly from OCaml sources. This typer extension is a thin layer over the original OCaml typer that is mainly responsible for encoding Java types into OCaml types. This encoding uses a number of advanced elements such as polymorphic variants, subtyping, variance annotations, phantom typing, and the printf hack, but the end user does not have to be aware of it. On the surface, the type of instances of the Java Object class is java'lang'Object java_instance, and instances can be created by calling Java.make with the Object() constructor.

While still under heavy development, a working prototype is available, and bugs can be reported. Finally, I would like to thank the OCaml Labs for providing a great working environment.

Daniel Bünzli: Typography and Visualisation

Daniel joined us from Switzerland, and spent some time at Citrix before moving over to OCaml Labs. All of his software is now on OPAM, and is seeing ever-increasing adoption from the community.

Released a first version of Vg [...] I'm especially happy about that, as I have wanted to work on these ideas since at least 2008. It is a long-term project and is certainly not finished yet, but this is already a huge step.

Adjusted and released a first version of Gg. While the module was already mostly written before my arrival to Cambridge, the development of Vg and Vz prompted me to make some changes to the module.

[...] released Otfm, a module to decode OpenType fonts. This is a work in progress, as not every OpenType table has built-in support for decoding yet. But since it is needed by Vg's PDF renderer I had to cut a release. It can however already be used to implement certain simple things like font kerning with Vg; this can be seen in action in the vecho binary installed by Vg.

Started to work on Vz, a module to help map data to Vg images. This is really unfinished and is still considered to be at a design stage. A few things are, however, well implemented, like (human) perceptually meaningful color palettes and the small folding stat module (Vz.Stat). However, it quickly became evident that I needed more in the box w.r.t. text rendering in Vg/Otfm. Things like d3js rely entirely on the SVG/CSS support for text, which makes it easy to e.g. align things (like tick labels on such drawings). If you can't rely on that, you need ways of measuring rendered text. So I decided to suspend the work on Vz and put more energy into making a first good release of Vg. Vz still needs quite some design work, especially since it tries to be independent of Vg's backend and of the mechanism for user input.

Spent some time figuring out a new "opam-friendly" release workflow in pkgopkg. One of my problems is that by designing in the small for programming in the large --- what a slogan --- the number of packages I'm publishing is growing (12 and still counting). This means that I need to scale horizontally maintenance-wise, unhelped by the sad state of build systems for OCaml. I need tools that make the release process flawless, painless and up to my quality standards. This led me to enhance and consolidate my old scattered distribution scripts in that repo, killing my dependencies on Oasis and ocamlfind along the way. (edited for brevity, see here)

Daniel also left his bicycle here for future visitors to use, and the "Bünzli-bike" is available for our next visitor! (Louis Gesbert even donated lights, giving it a semblance of safety).

Industrial Fellows

Most of our regular funding bodies, such as EPSRC or the EU FP7, provide funding but leave all the intellectual input to the academics. A compelling aspect of OCaml Labs has been how involved our industrial colleagues have been with the day-to-day problems that we solve. Both Jane Street and Citrix have senior staff regularly visiting our group and working alongside us as industrial fellows in the Computer Lab.

  • Mark Shinwell from Jane Street Europe has been working on improving the state of native debugging in OCaml, by adding frame pointers and DWARF symbols to the compiler output. Mark is also a useful source of feedback about the forthcoming design of multicore, since he has daily insight into a huge production codebase at Jane Street (and can tell us about it without us requiring access!).
  • Dave Scott is the principal architect of XenServer at Citrix in Cambridge. This year has been transformative for that project, since Citrix open-sourced XenServer to GitHub and fully adopted OPAM into their workflow. Dave is the author of numerous libraries that have all been released to OPAM, and his colleagues Jon Ludlam and Euan Harris are also regular visitors and contributors to the OPAM and Mirage ecosystems.

Research Projects

The other 100% of our time at the Labs is spent on research projects. When we started the group, I wanted to set up a feedback loop between local people using OCaml to build systems and the folks developing OCaml itself. This has worked out particularly well with a couple of big research projects in the Lab.

Mirage

Mirage is a library operating system written in OCaml that compiles source code into specialised Xen microkernels, developed at the Cambridge Computer Lab, Citrix and the Horizon Digital Economy institute at Nottingham. This year saw several years of effort culminate in the first release of Mirage 1.0 as a self-hosting entity. While Mirage started off as a quick experiment into building specialised virtual appliances, it rapidly became useful to make into a real system for use in bigger research projects. You can learn more about Mirage here, or read the Communications of the ACM article that Dave Scott and I wrote to close out the year.

This project is where the OCaml Labs "feedback loop" has been strongest. A typical Mirage application consists of around 50 libraries that are all installed via OPAM. These range from device drivers to protocol libraries for HTTP or DNS, to filesystems such as FAT32. Coordinating regular releases of all of these would be near impossible without using OPAM, and has also forced us to use our own tools daily, helping to sort out bugs more quickly. You can see the full list of libraries on the OCaml Labs software page.

Mirage is also starting to share code with big projects such as XenServer, and we have been working with Citrix engineers to help them move to the Core library that Jane Street has released (and that is covered in Real World OCaml). Moving production codebases this large can take years, but OCaml Labs is turning out to be a good place to start unifying some of the bigger users of OCaml in one place. We're also now an official Xen Project incubator project, which helps us make the case for functional programming to other Linux Foundation efforts.

Nymote and User Centric Networking

The release of Mirage 1.0 has put us on the road to simplifying embedded systems programming. The move to the centralized cloud has led to regular, well-publicised privacy and security threats to the way we handle our digital infrastructure, and so Jon Crowcroft, Richard Mortier and I are leading an effort to build an alternative privacy-preserving infrastructure using embedded devices as part of the User Centric Networking project, in collaboration with a host of companies led by Technicolor Paris. This work also plays to the strong points of OCaml: it already has a fast ARM backend, and Mirage can easily be ported to the new Xen/ARM target as hardware becomes available.

One of the most difficult aspects of programming on the "wide area" Internet is the lack of a fully secure distributed identity service. We published our thoughts on this at the USENIX Free and Open Communications on the Internet workshop, and David Sheets is working towards a full implementation using Mirage. If you're interested in following this effort, Amir Chaudhry is blogging at the Nymote project website, where we'll talk about the components as they are released.

Data Center Networking

At the other extreme from embedded programming is datacenter networking, and we started the Network-as-a-Service research project with Imperial College and Nottingham. With the rapid rise of Software Defined Networking this year, we are investigating how application-specific customisation of network resources can build faster, better, cheaper infrastructure. OCaml is in a good position here: several other groups have built OpenFlow controllers in OCaml (most notably the Frenetic project), and Mirage is specifically designed to assemble such bespoke infrastructure.

Another aspect we've been considering is how to solve the problem of optimal connectivity across nodes. TCP is increasingly considered harmful in high-throughput, high-density clusters, and George Parisis led the design of Trevi, a fountain-coding-based alternative for storage networking. Meanwhile, Thomas Gazagnaire (who joined OCaml Labs in November) has been working on a branch-consistent data store called Irminsule, which supports scalable data sharing and reconciliation using Mirage. Both of these systems will see implementations based on the research done this year.

Higher Kinded Programming

Jeremy Yallop and Leo White have been developing an approach that makes it possible to write programs with higher-kinded polymorphism (such as monadic functions that are polymorphic in the monad they use) without using functors. It's early days yet, but there's a library available on OPAM that implements the approach, and a draft paper that outlines the design.
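To give a flavour of the idea, here is an illustrative sketch of the encoding described in the draft paper (not necessarily the released library's exact interface): a type constructor is named by an abstract "brand", an ('a, 'f) app type stands for the application of the brand 'f to 'a, and inj/prj convert between the encoded and real representations, so monad-generic code needs no functor.

(* Sketch of lightweight higher-kinded polymorphism.  ('a, 'f) app encodes
   the application of the type constructor named by brand 'f to 'a.  This
   version (ab)uses Obj purely for illustration. *)
module App : sig
  type ('a, 'f) app
  module Newtype1 (T : sig type 'a t end) : sig
    type t                               (* the brand for T.t *)
    val inj : 'a T.t -> ('a, t) app
    val prj : ('a, t) app -> 'a T.t
  end
end = struct
  type ('a, 'f) app = Obj.t
  module Newtype1 (T : sig type 'a t end) = struct
    type t
    let inj = Obj.repr
    let prj = Obj.obj
  end
end

open App

(* A monad packaged as a value, polymorphic in its brand 'm. *)
type 'm monad = {
  return : 'a. 'a -> ('a, 'm) app;
  bind   : 'a 'b. ('a, 'm) app -> ('a -> ('b, 'm) app) -> ('b, 'm) app;
}

(* Monad-generic code, with no functor in sight. *)
let sequence m xs =
  List.fold_right
    (fun x acc -> m.bind x (fun v -> m.bind acc (fun vs -> m.return (v :: vs))))
    xs (m.return [])

module Opt = Newtype1 (struct type 'a t = 'a option end)

let option_monad : Opt.t monad = {
  return = (fun x -> Opt.inj (Some x));
  bind = (fun x f -> match Opt.prj x with None -> Opt.inj None | Some v -> f v);
}

let () =
  match Opt.prj (sequence option_monad [Opt.inj (Some 1); Opt.inj (Some 2)]) with
  | Some xs -> Printf.printf "sequenced %d values\n" (List.length xs)
  | None -> print_endline "nothing"

The released library implements inj and prj more carefully than the Obj trick above, but the programming pattern is the same: pick a brand per constructor, then write code that is polymorphic over brands.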

Priorities for 2014

This year has been a wild ride to get us up to speed, but we now have a solid sense of what to work on for 2014. We've decided on a high-level set of priorities led by the senior members of the group:

  • Multicore: Leo White will be leading efforts in putting an end-to-end multicore capable OCaml together.
  • Metaprogramming: Jeremy Yallop will direct the metaprogramming efforts, continuing with Ctypes and into macros and extension points.
  • Platform: Thomas Gazagnaire will continue to drive OPAM development towards becoming the first OCaml Platform.
  • Online: Amir Chaudhry will develop the online and community efforts that started in 2013.

These are guidelines for choosing where to spend our time, not exclusions of other work or day-to-day bugfixing. Our focus on collaboration with Jane Street, Citrix, Lexifi, OCamlPro and our existing colleagues will continue, and we warmly welcome new community members who wish to work with us on any of the projects, whether via internships, studentships or good old-fashioned open source hacking.

I appreciate the whole team's feedback in editing this long post into shape, the amazing professorial support from Jon Crowcroft, Ian Leslie and Alan Mycroft throughout the year, and of course the funding and support from Jane Street, Citrix, RCUK, EPSRC, DARPA and the EU FP7 that made all this possible. Roll on 2014, and please do get in touch with me with any queries!

by Anil Madhavapeddy (anil@recoil.org) at Sun, 29 Dec 2013

Mirage 1.0.3 released; tutorial on building this website available (MirageOS)


We've had a lot of people trying out Mirage since the 1.0 release last week, and so we've been steadily cutting point releases and new libraries to OPAM as they're done. The most common build error by far has been people using outdated OPAM packages. Do make sure that you have at least OPAM 1.1 installed, and that you've run opam update -u to get the latest package lists from the package repository.

Mirage 1.0.3 improves Xen configuration generation, cleans up HTTP support, and adds support for FAT filesystems. Here are some of the libraries we've released this week to go along with it:

  • mirage-www (update): the live website now runs on the 1.0 tools. Explanation of how to build it in various configurations is available here.
  • alcotest (new): a lightweight and colourful test framework built over oUnit. The interface is simpler to facilitate writing tests quickly, and it formats test results nicely.
  • mirage-block-xen.1.0.0 (new): the stable release of the Xen Blkfront driver for block devices. The library supports both frontend and backend operation, but only the frontend is plumbed through to Mirage for now (although the backend can be manually configured).
  • mirage-block-unix.1.2.0 (update): fixed some concurrency bugs and added support for buffered I/O to improve performance.
  • fat-filesystem.0.10.0 (update): copes with more sector sizes, uses buffered I/O on Unix, and adds a KV_RO key/value interface as well as a more complicated filesystem one.
  • mirage-fs-unix.1.0.0 (update): implements the KV_RO signature as a passthrough to a Unix filesystem. This is convenient during development to avoid recompile cycles while changing data.
  • mirage-xen.1.0.0 (update): removed several distracting-but-harmless linker warnings about code bloat.
  • cohttp.0.9.14 (update): supports Server-Side Events via better channel flushing, has a complete set of HTTP codes autogenerated from httpstatus.es and exposes a platform-independent Cohttp_lwt module.
  • cow.0.8.1 (update): switch to the Omd library for Markdown parsing, which is significantly more compatible with other parsers.
  • ezjsonm.0.2.0 (new): a combinator library to parse, select and manipulate JSON structures.
  • ezxmlm.1.0.0 (new): a combinator library to parse, select and transform XML tags and attributes.
  • mirage-http-xen and mirage-http-unix provide the HTTP drivers on top of Cohttp for Mirage. Although they are very similar at the moment, they will diverge as the Unix variant gains options to use kernel sockets instead of only the network stack.

We're making great progress on moving our personal homepages over to Mirage. The first two introductory wiki posts are also now available:

  • Building a hello world example takes you through the basic steps to build a Unix and Xen binary (a minimal sketch of the files involved follows this list).
  • Building the Mirage website lets you build this website with several variants, demonstrating the Unix passthrough filesystem, the OCaml FAT filesystem library, and how to attach a network stack to your application.
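For a rough idea of what the hello-world post covers, here is an illustrative sketch in the style of the mirage-skeleton console example; the configuration DSL changed across early Mirage releases, so treat names like V1_LWT.CONSOLE and default_console as approximate rather than the exact 1.0.3 interface:

(* config.ml -- wires the unikernel's main functor to a console device. *)
open Mirage

let main = foreign "Unikernel.Main" (console @-> job)

let () = register "hello" [ main $ default_console ]

(* unikernel.ml -- the application itself, functorised over the console. *)
module Main (C : V1_LWT.CONSOLE) = struct
  let start c = C.log_s c "hello world"
end

(* Roughly: run `mirage configure --unix` (or --xen) and then build; the
   wiki post walks through the exact commands. *)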

As always, please feel free to report any issues via the bug tracker and ask questions on the mailing list.

by Anil Madhavapeddy (anil@recoil.org) at Thu, 19 Dec 2013

Mirage 1.0: not just a hallucination! (MirageOS)


First: read the overview and technical background behind the project.

When we started hacking on Mirage back in 2009, it started off looking like a conventional OS, except written in OCaml. The monolithic repository contained all the libraries and boot code, and exposed a big OS module for applications to use. We used this to do several fun tutorials at conferences such as ICFP/CUFP and get early feedback.

As development continued, though, we started to understand what it was we were building: a "library operating system". As the number of libraries grew, putting everything into one repository just wasn't scaling, and it made it hard to work with third-party code. So we spent some time developing tools to make Mirage fit into the broader OCaml ecosystem.

Three key things have emerged from this effort:

  • OPAM, a source-based package manager for OCaml. It supports multiple simultaneous compiler installations, flexible package constraints, and a Git-friendly development workflow. Since the releases of 1.0 in March 2013 and 1.1 in October, the community has leapt in, contributing over 1800 packages in that short time. All of the Mirage libraries are now tracked using it, including the Xen libraries.
  • The build system for embedded programming (such as the Xen target) is a difficult one to get right. After several experiments, Mirage provides a single command-line tool that combines configuration directives (also written in OCaml) with OPAM to make building Xen unikernels as easy as building Unix binaries.
  • All of the Mirage-compatible libraries satisfy a set of module type signatures in a single file. This is where Mirage lives up to its name: we've gone from the early monolithic repository to a single, standalone file that describes the interfaces. Of course, we also have libraries to go along with this signature, and they all live in the Mirage GitHub organization.

With these components, I'm excited to announce that Mirage 1.0 is finally ready to see the light of day! Since it consists of so many libraries, we've decided not to have a "big bang" release where we dump fifty complex libraries on the open-source community. Instead, we're going to spend the month of December writing a series of blog posts that explain how the core components work, leading up to several use cases:

  • The development team have all decided to shift our personal homepages to be Mirage kernels running on Xen as a little Christmas present to ourselves, so we'll work through, step by step, how to build a dedicated unikernel and then maintain and deploy it (spoiler: see this repo). This will culminate in a webservice that our colleagues at Horizon have been building using Android apps and an HTTP backend.
  • The XenServer crew at Citrix are using Mirage to build custom middlebox VMs such as block device caches.
  • For teaching purposes, the Cambridge Computer Lab team want a JavaScript backend, so we'll explain how to port Mirage to this target (which is rather different from either Unix or Xen, and serves to illustrate the portability of our approach).

How to get involved

Bear with us while we update all the documentation and start the blog posts off today (the final libraries for the 1.0 release are all being merged into OPAM while I write this, and the usually excellent Travis continuous integration system is down due to a bug on their side). I'll edit this post to contain links to the future posts as they happen.

Since we're now also a proud Xen and Linux Foundation incubator project, our mailing list is shifting to mirageos-devel@lists.xenproject.org, and we very much welcome comments and feedback on our efforts over there. The #mirage channel on FreeNode IRC is also growing increasingly popular, as is simply reporting issues on the main Mirage GitHub repository.

Several people have also commented that they want to learn OCaml properly to start using Mirage. I've just co-published an O'Reilly book called Real World OCaml that's available for free online and also as hardcopy/ebook. Our Cambridge colleague John Whittington has also written an excellent introductory text, and you can generally find more resources online. Feel free to ask beginner OCaml questions on our mailing lists and we'll help as best we can!

by Anil Madhavapeddy (anil@recoil.org) at Mon, 09 Dec 2013

Switching from Bootstrap to Zurb Foundation (Amir Chaudhry)


I've just updated my site's HTML/CSS and moved from Twitter Bootstrap to Zurb Foundation. This post captures my subjective notes on the migration.

My use of Bootstrap

When I originally set this site up, I didn't know what frameworks existed or anything more than the basics of dealing with HTML (and barely any CSS). I came across Twitter Bootstrap and immediately decided it would Solve All My Problems. It really did. Since then, I've gone through one 'upgrade' with Bootstrap (from 1.x to 2.x), after which I dutifully ignored all the fixes and improvements (note that Bootstrap was up to v2.3.2 while I was still using v2.0.2).

Responsive Design

For the most part, this was fine with me but for a while now, I've been meaning to make this site 'responsive' (read: not look like crap from a mobile). Bootstrap v3 purports to be mobile-first so upgrading would likely give me what I'm after but v3 is not backwards compatible, meaning I'd have to rewrite parts of the HTML. Since this step was unavoidable, it led me to have another look at front-end frameworks, just to see if I was missing anything. This was especially relevant since we'd just released the new OCaml.org website, itself built with Bootstrap v2.3.1 (we'd done the design/templating work long before v3 was released). It would be useful to know what else is out there for any future work.

Around this time I discovered Zurb Foundation and also the numerous comparisons between the two (note: Foundation seems to come out ahead in most of those). A few days ago, the folks at Zurb released version 5, so I decided that now was the time to kick the tires. For the last few days, I've been playing with the framework, and in the end I decided to migrate my site over completely.

Foundation's Yeti

Swapping out one framework for another

Over time, I've become moderately experienced with HTML/CSS and I can usually wrangle things to look the way I want, but my solutions aren't necessarily elegant. I was initially concerned that I'd already munged things so much that changing anything would be a pain. When I first put the styles for this site together, I had to spend quite a bit of time overwriting Bootstrap's defaults so I was prepared for the same when using Foundation. Turns out that I was fine. I currently use Jekyll (and Jekyll Bootstrap) so I only had three template files and a couple of HTML pages to edit and because I'd kept most of my custom CSS in a separate file, it was literally a case of swapping out one framework for another and bug-fixing from there onwards. There's definitely a lesson here in using automation as much as possible.

Customising the styles was another area of concern but I was pleasantly surprised to find I needed less customisation than with Bootstrap. This is likely because I didn't have to override as many defaults (and probably because I've learned more about CSS since then). The one thing I seemed to be missing was a way to deal with code sections, so I just took what Bootstrap had and copied it in. At some point I should revisit this.

It did take me a while to get my head around Foundation's grid but it was worth it in the end. The idea is that you should design for small screens first and then adjust things for larger screens as necessary. There are several different default sizes which inherit their properties from the size below, unless you explicitly override them. I initially screwed this up by explicitly defining the grid using the small-# classes, which obviously looks ridiculous on small screens. I fixed it by swapping out small-# for medium-# everywhere in the HTML, after which everything looked reasonable. Items flowed sensibly into a default column for the small screens and looked acceptable for larger screens and perfectly fine on desktops. I could do more styling of the mobile view but I'd already achieved most of what I was after.

Fixing image galleries and embedded content

The only additional thing I used from Bootstrap was the Carousel. I'd written some custom helper scripts that would take some images and thumbnails from a specified folder and produce clickable thumbnails with a slider underneath. Foundation provides Orbit, so I had to spend time rewriting my script to produce the necessary HTML. This actually resulted in cleaner HTML, and one of the features I wanted (the ability to link to a specific image) was available by default in Orbit. At this point I also tried to make the output look better for the case where JavaScript is disabled (in essence, the images are just displayed as a list). Below is an example of an image gallery, taken from a previous post, when I joined the computer lab.

Foundation also provides a component called Flex Video, which allows the browser to scale videos to the appropriate size. This fix was as simple as going back through old posts and wrapping anything that was <iframe> in a <div class="flex-video">. It really was that simple and all the Vimeo and YouTube items scaled perfectly. Here's an example of a video from an earlier post, where I gave a walkthrough of the ocaml.org site. Try changing the width of your browser window to see it scale.

Video demo

Framework differences

Another of the main differences between the two frameworks is that Bootstrap uses LESS to manage its CSS whereas Foundation uses SASS. Frankly, I've no experience with either of them, so it makes little difference to me, but it's worth bearing in mind for anyone whose workflow does involve pre-processing. Also, Bootstrap is available under the Apache 2 License, while Foundation is released under the MIT license.

Summary

Overall, the transition was pretty painless and most of the time was spent getting familiar with the grid, hunting for docs/examples and trying to make the image gallery work the way I wanted. I do think Bootstrap's docs are better but Foundation's aren't bad.

Although this isn't meant to be a comparison, I much prefer Foundation to Bootstrap. If you're not sure which to use then I think the secret is in the names of the frameworks.

  • Bootstrap (for me) was a great way to 'bootstrap' a site quickly with lots of acceptable defaults -- it was quick to get started but took some work to alter.
  • Foundation seems to provide a great 'foundation' on which to create more customised sites -- it's more flexible but needs more upfront thought.

That's pretty much how I'd recommend them to people now.

by Amir Chaudhry at Tue, 26 Nov 2013

Announcing the new OCaml.org (Amir Chaudhry)


As some of you may have noticed, the new OCaml.org site is now live!

The DNS may still be propagating so if http://ocaml.org hasn't updated for you then try http://166.78.252.20. This post is in two parts: the first is the announcement and the second is a call for content.

New OCaml.org website design!

The new site represents a major milestone in the continuing growth of the OCaml ecosystem. It's the culmination of a lot of volunteer work over the last several months and I'd specifically like to thank Christophe, Ashish and Philippe for their dedication (the commit logs speak volumes).

OCaml.org Wireframes

We began this journey just over 8 months ago with paper, pencils and a lot of ideas. This led to a comprehensive set of wireframes and walk-throughs of the site, which then developed into a collection of Photoshop mockups. In turn, these formed the basis for the html templates and style sheets, which we've adapted to fit our needs across the site.

Alongside the design process, we also considered the kind of structure and workflow we aspired to, both as maintainers and contributors. This led us to develop completely new tools for Markdown and templating in OCaml, which are now available in OPAM for the benefit of all.

Working on all these things in parallel definitely had its challenges (which I'll write about separately), but the result has been worth the effort.

OCaml.org

The journey is ongoing and we still have many more improvements we hope to make. The site you see today primarily improves upon the design, structure and workflows but in time, we also intend to incorporate more information on packages and documentation. With the new tooling, moving the website forward will become much easier and I hope that more members of the community become involved in the generation and curation of content. This brings me to the second part of this post.

Call for content

We have lots of great content on the website but there are parts that could do with a refresh and gaps that could be filled. As a community driven site, we need ongoing contributions to ensure that the site best reflects its members.

For example, if you do commercial work on OCaml then maybe you'd like to add yourself to the support page? Perhaps there are tutorials you can help to complete, like 99 problems? If you're not sure where to begin, there are already a number of content issues you could contribute to.

Although we've gone through a bug-hunt already, feedback on the site is still very welcome. You can either create an issue on the tracker (preferred), or email the infrastructure list.

It's fantastic how far we've come and I look forward to the next phase!

by Amir Chaudhry at Wed, 20 Nov 2013

Migration plan for the OCaml.org redesign (Amir Chaudhry)


We're close to releasing the new design of ocaml.org but need help from the OCaml community to identify and fix bugs before we switch next week.

Ashish, Christophe, Philippe and I have been discussing how we should go about this and below is the plan for migration. If anyone would like to discuss any of this, then the infrastructure list is the best place to do so.

  1. We've made a new branch on the main ocaml.org repository with the redesign. This branch is a fork of the master and we've simply cleaned up and replayed our git commits there.

  2. We've built a live version of the new site, which is visible at http://preview.ocaml.org - this is rebuilt every few minutes from the branch mentioned above.

  3. Over the course of one week, we ask the community to review the new site and report any bugs or problems on the issue tracker. We triage those bugs to identify any blockers and work on those first. This is the phase we'll be in from today.

  4. After one week (7 days), and after blocking bugs have been fixed, we merge the redesign branch into the master branch. This would effectively present the new site to the world.

During the above, we would not be able to accept any new pull requests on the master branch but would be happy to accept them on the new redesign branch. Hence, we're restricting the time frame to one week.

Please note that the above is only intended to merge the design and toolchain for the new site. Specifically, we've created new landing pages, have new style sheets and have restructured the site's contents as well as made some new libraries (OMD and MPP). The new toolchain means people can write files in markdown, which makes contributing content a lot easier.

Since the files are on GitHub, people don't even need to clone the site locally to make simple edits (or even add new pages). Just click the 'Edit this page' link in the footer to be taken to the right file in the repository and GitHub's editing and pull request features will allow you to make changes and submit updates, all from within your browser (see the GitHub Article for details).

There is still work to be done on adding new features but the above changes are already a great improvement to the site and are ready to be reviewed by the OCaml community and merged.

by Amir Chaudhry at Wed, 06 Nov 2013

Third OCaml compiler hacking session (Compiler Hacking)


It's time for the third Cambridge OCaml compiler-hacking session! This time we're going to be back in the Computer Lab, where the first session was held.

If you're planning to come along, it'd be helpful if you could indicate interest via Doodle and sign up to the mailing list to receive updates:

Where: Room FW11, Computer Laboratory, Madingley Road

When: 6pm, Wednesday 6th November

Who: anyone interested in improving OCaml. Knowledge of OCaml programming will obviously be helpful, but prior experience of working on OCaml internals isn't necessary.

What: fixing bugs, implementing new features, learning about OCaml internals

Wiki: https://github.com/ocamllabs/compiler-hacking/wiki

We're defining "compiler" pretty broadly, to include anything that's part of the standard distribution, which means at least the standard library, runtime, tools (ocamldep, ocamllex, ocamlyacc, etc.), camlp4, ocamlbuild, the documentation, and the compiler itself. We'll have suggestions for mini-projects for various levels of experience, but feel free to come along and work on whatever you fancy.

We'll also be ordering pizza, so if you want to be counted for food you should aim to arrive by 6.30pm.

by cl-ocamllabs@lists.cam.ac.uk (OCaml Labs) at Wed, 30 Oct 2013

Review of the OCaml FPDays tutorial (Amir Chaudhry)


Last Thursday a bunch of us from the OCaml Labs team gave an OCaml tutorial at the FPDays conference (an event for people interested in Functional Programming). Jeremy and I led the session with Leo, David and Philippe helping everyone progress and dealing with questions.

It turned out to be by far the most popular session at the conference, with over 20 people all wanting to get to grips with OCaml! An excellent turnout and a great indicator of the interest that's out there, especially when you offer a hands-on session. This shouldn't be a surprise, as we've had good attendance for the general OCaml meetups I've run and also the compiler hacking sessions, which Jeremy and Leo have been building up (do sign up if you're interested in either of those!). We also had a nice surprise for attendees: uncorrected proof copies of Real World OCaml, and luckily we had just enough to go around.

For the tutorial itself, Jeremy put together a nice sequence of exercises and a skeleton repo (with helpful comments in the code) so that people could dive in quickly. The event was set up to be really informal and the rough plan was as follows:

  1. Installation/Intro - We checked that people had been able to follow the installation instructions, which we'd sent them in advance. We also handed out copies of the book and made sure folks were comfortable with OPAM.

  2. Hello world - A light intro to get people familiar with the OCaml syntax and installing packages with OPAM. This would also help people to get familiar with the toolchain, workflow and compilation.

  3. Monty Hall browser game - Using js_of_ocaml, we wanted people to create and run the Monty Hall problem in their browser. This would give people a taste of some real world interaction by having to deal with the DOM and interfaces. If folks did well, they could add code to keep logs of the game results.

  4. Client-server game - The previous game was all in the browser (so could be examined by players) so here the task was to split it into a client and server, ensuring the two stay in sync. This would demonstrate the re-usability of the OCaml code already written and give people a feel for client server interactions. If people wanted to do more, they could use ctypes and get better random numbers.

We did manage to stick to the overall scheme as above and we think this is a great base from which to improve future tutorials. It has the really nice benefit of having visual, interactive elements and the ability to run things both in the browser as well as on the server is a great way to show the versatility of OCaml. js_of_ocaml is quite a mature tool and so it's no surprise that it's also used by companies such as Facebook (see the recent CUFP talk by Julien Verlaguet - skip to 19:00).

We learned a lot from running this session, so we've captured the good, the bad and the ugly below. This is useful for anyone who'd like to run an OCaml tutorial in the future, and also for us to bear in mind the next time we do this. I've incorporated the feedback from the attendees as well as our own thoughts.

Heads down and hands on

Things we learnt

The Good

  • Most people really did follow the install instructions beforehand. This made things so much easier on the day as we didn't have to worry about compile times and people getting bored. A few people had even got in touch with me the night before to sort out installation problems.

  • Many folks from OCaml Labs also came over to help people, which meant no-one was waiting longer than around 10 seconds before getting help.

  • We had a good plan of the things we wanted to cover but we were happy to be flexible and made it clear the aim was to get right into it. Several folks told us that they really appreciated this loose (as opposed to rigid) structure.

  • We didn't spend any time lecturing the room but instead got people right into the code. Having enough of a skeleton to get something interesting working was a big plus in this regard. People did progress from the early examples to the later ones fairly well.

  • We had a VM with the correct set up that we could log people into if they were having trouble locally. Two people made use of this.

  • Of course, it was great to have early proofs of the book and these were well-received.

RWO books galore!

The Bad

  • In our excitement to get right into the exercises, we didn't really give an overview of OCaml and its benefits. A few minutes at the beginning would be enough and it's important so that people can leave with a few sound-bites.

  • Not everyone received my email about installation, and certainly not the late arrivals. This meant some pain getting things downloaded and running especially due to the wifi (see 'Ugly' below).

  • A few of the people who had installed didn't complete the instructions fully, but didn't realise this until the morning of the session. There was a good suggestion about having some kind of test to run that would check everything, so you'd know if anything was missing.

  • We really should have had a cut-off where we told people to use VMs instead of fixing installation issues and 10-15 minutes would have been enough. This would have been especially useful for the late-comers.

  • We didn't really keep a record of the problems folks were having so we can't now go back and fix underlying issues. To be fair, this would have been a little awkward to do ad-hoc but in hindsight, it's a good thing to plan for.

The Ugly

  • The only ugly part was the wifi. It turned out that the room itself was a bit of a dead-spot and that wasn't helped by 30ish devices trying to connect to one access point! Having everyone grab packages at the same time in the morning probably didn't help. It was especially tricky as all our mitigation plans seemed to revolve around at least having local connectivity. In any case, this problem only lasted for the morning session and was a little better by the afternoon. I'd definitely recommend a backup plan in the case of complete wifi failure next time! One such plan that Leo got started on was to put the repository and other information onto a flash drive that could be shared with people. We didn't need this in the end but it'll be useful to have something like this prepared for next time. If anyone fancies donating a bunch of flash drives, I'll happily receive them!

Overall, it was a great session and everyone left happy, having completed most of the tutorial (and with a book!). A few even continued at home afterwards and got in touch to let us know that they got everything working. Thanks to Mark, Jacqui and the rest of the FPDays crew for a great conference!

RWO Book giveaway

(Thanks to Jeremy, Leo, David and Philippe for contributions to this post)

by Amir Chaudhry at Mon, 28 Oct 2013

FP Days OCaml Session (Amir Chaudhry)


On Thursday, along with Jeremy and Leo, I'll be running an OCaml Hands-on Session at the FPDays conference. Below are some prep instructions for attendees.

Preparation for the session

If you're starting from scratch, installation can take some time so it's best to get as much done in advance as possible. You'll need OPAM (the package manager), OCaml 4.01 (available through OPAM) and a few libraries before Thursday. If you have any issues, please contact Amir.

  • OPAM: Follow the instructions for your platform at http://opam.ocaml.org/doc/Quick_Install.html. OPAM requires OCaml so hopefully the relevant dependencies will kick in and you'll get OCaml too (most likely version 3.12). You can get a cup of coffee while you wait. After installation, run opam init to initialise OPAM.

  • OCaml 4.01: We actually need the latest version of OCaml but OPAM makes this easy. Just run the following (and get more coffee):

$ opam update
$ opam switch 4.01.0
$ eval `opam config env`
  • Libraries: For the workshop you will need to check that you have the following installed: libffi, pcre and pkg-config. This will depend on your platform so on a Mac with homebrew I would do brew install libffi pcre pkg-config or on Debian or Ubuntu apt-get install libffi-dev. After this, two OCaml packages it's worth installing in advance are core and js_of_ocaml so simply run:
$ opam install core js_of_ocaml

OPAM will take care of the dependencies and the rest we can get on the day!

by Amir Chaudhry at Tue, 22 Oct 2013

FPDays 2013 Real World OCaml tutorial in Cambridge (Anil Madhavapeddy)


Yaron Minsky and I have been running OCaml tutorials for a few years at ICFP and CUFP, but haven’t really spread out into the wider conference circuit. Now that Real World OCaml is almost finished, the scene is set for doing much more. The first such tutorial is being held at FPDays 2013 on October 24th in the lovely Murray Edwards College in Cambridge. Check out the Lanyrd page for ticket information, and the OCaml session page for more information.

The basic layout of the tutorial is to get started with the guided tour of the book, and then work through building a distributed message broker. This gets you familiar with the Core standard library, the Async event-driven I/O library, and all the strongly-typed RPC plumbing that goes in between. We’re hoping to have physical preprints of the book available for free to attendees, so do sign up fast if you wish to attend.
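The broker itself is well beyond a blog post, but to give a taste of the Core/Async style the tutorial builds on, here is a hedged sketch against the Core.Std/Async.Std interfaces of that era (not the tutorial's own code):

(* Schedule a deferred computation and hand control to the Async
   scheduler.  Sketch only, assuming RWO-era Core.Std / Async.Std. *)
open Core.Std
open Async.Std

let greet () =
  after (sec 1.0) >>= fun () ->
  printf "hello from the Async scheduler\n";
  Deferred.unit

let () =
  don't_wait_for (greet ());
  never_returns (Scheduler.go ())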

As a bonus, the Cambridge FPDays session will feature Jeremy Yallop working through the book and conducting the tutorial: he has an incredible depth of knowledge about the innards of OCaml’s type system, and so advanced users will also find a good home in this tutorial to throw questions at him too! For those of you interested in other programming languages, there are also excellent-looking sessions on Erlang, F# and Scala, and Phil Wadler is giving a keynote speech. I’m most excited about Sam Aaron’s session on live coding and music though. You have to hear it to believe it…

by Anil Madhavapeddy (anil@recoil.org) at Mon, 07 Oct 2013

High-power LEDs for super-bright light (Philippe Wang)


I've been "playing" with COB LEDs for about half a year now. Here's a summary of my experience.

Preliminaries

COB LEDs

A "COB LED" is a basically a bunch of tiny LEDs put on together to provide a very energy-efficient and usually very bright LED (using both parallel and series circuit schemes). Note that it means the voltage of COB LEDs are not the voltage of "normal LEDs", since putting some LEDs in series means adding up voltage values!

An LED does not behave like an incandescent light bulb

Also, note that while there's a roughly linear relation between light intensity and voltage for classic lamps, that's not the case at all with LEDs! With LEDs, basically, the light intensity varies with the amperage, not the voltage. So if you lower the voltage by a very little bit, it will shine just as brightly provided the amperage keeps up; if you lower the voltage too much, it basically stops emitting light (e.g., if you give it 16V instead of 19V, you get so little light that it's really useless); but if you increase the voltage too much you still fry your LEDs, so you don't want to do that at all.

Light colors, measured in Kelvin

Basically, 2700K to 3300K correspond to classic (discontinued) incandescent light bulbs, and it's quite close to classic candle light, which is about 1850K. 6500K is day light as in "the light provided by the Sun on a sunny day at noon", which seems very cool if it's dark outside or if you have warm light around. My favourite artificial light color is generally from 4000K to 5000K. I find the 3000K quite acceptable, below that is too warm, above 6000K is too cool.

Don't get me wrong: in this post, "cool" is not a synonym for "awesome"; stick to its original meaning instead.

Read more on Wikipedia.

Color spectral distribution

Note that LEDs generally have a very narrow light color spectrum. That means that instead of emitting light of a lot of different colors, including some you don't see or that go largely unnoticed because they are emitted less intensely, LEDs really emit just the color you asked for. That is not such a good thing (it's not so bad either). As a consequence, you might feel that the light emitted by an LED is not very comfortable. What I do is mainly use warm light LEDs and add a few cool light LEDs amongst them: the ratio goes from 1 cool for 6 warms up to 1 for 1. The result of 1 for 1 is quite nice when it's bright outside, but not so nice in the middle of the night.

Why not simply buy normal lamps?

Well, buying new lamps to get more light implies buying light bulbs as well. I'm not a very big fan of light bulbs that contain mercury, because if you ever break one you'll breathe mercury gas (most likely in small quantities, though). And I can't install lights on the wall without asking my landlady for permission, and even if I did have permission, it would mean putting electric wires inside the wall, which is not really an idea that speaks to me (I've never done that), and that means painting the wall afterwards (not a big fan of painting walls either). What about LEDs then? Don't I need to install them on the wall anyway? Well, no, I don't! I have access to some of the iron-based framework, and as LEDs are lightweight and flat, I stick them on some material on which I put some magnets, et voilà! (Now I'm an even bigger fan of strong magnets than before.)

Well, I won't deny it, the main reason why I wanted to use LEDs is that it's super high tech, it's fun, it's powerful and energy-efficient (and ecological), it's nice, it's the future, it's geek. Also, we plan on using LEDs massively in University of Cambridge buildings, starting with the department of computer science.

What I've used

LEDs

I've essentially used 6W COB LEDs, which — according to the specs given by their manufacturers — are as efficient as 120lm/W or 108lm/W, depending on whether you prefer white light or warm white light. I gave a more classic (though high-power) LED a try before: 9 pieces of 1W LEDs put together on an aluminium disc. Well, I switched to COB LEDs for three reasons. The first is that COB LEDs are more energy-efficient. The second is that the voltage of the COB LEDs that I use is higher (19V instead of 10V), which means that for a given wattage, the current intensity (amperage) is lower, and that means less chance for the wires to heat up... (I'm neither an electronics engineer nor a physicist, so please correct me if I'm wrong.) And the third is that COB LEDs are very compact. The "9 pieces of 1W LEDs on a disc" is really big: the diameter is 88mm. The drawback is that the COB LEDs I use are only available in the 6500K and 3000K color temperatures, whereas the other ones are available in 6500K, 4000K, 3000K and 2500K; 4000K being my favourite.
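To make the second point concrete (my arithmetic, not the manufacturer's figures): current is power divided by voltage, so for the same 6W load,

$$ I = \frac{P}{V}: \qquad \frac{6\,\mathrm{W}}{19\,\mathrm{V}} \approx 0.32\,\mathrm{A} \qquad \text{versus} \qquad \frac{6\,\mathrm{W}}{10\,\mathrm{V}} = 0.6\,\mathrm{A}, $$

so the higher-voltage parts push roughly half the current through the wiring for the same power.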

Power supply units

It's not that easy to provide electrical power to LEDs... because we need a potentially quite high amperage, the right voltage, and the latter has to be very stable (or at least never reach a higher value than the expected one, otherwise the LEDs burn and die quite easily, and we don't want to kill some hardware that's supposed to last ten and up to thirty years).

I recommend this power supply unit (PSU) from Rapid if you're powering at most 30 watts of LEDs at a voltage of 24V or less, for three reasons: it seems to me so far to be a very good quality PSU with a well-regulated voltage; it provides up to 30W without complaining or heating up; and it switches on very quickly (virtually instantaneously). I have used two of them for about 6 months. (I haven't tried to make it work at 36W because it would supposedly diminish its lifespan.)

Well, that 36W PSU is currently (early October 2013) out of stock, and since I visited (randomly) a Maplin store and saw this 120W laptop power supply unit for less than £40, I bought two of them to give them a shot. They are a lot cheaper than the 36W PSU I've been using so far, if you compare the price-over-power ratio. The voltage they deliver seems okay: according to my cheap multimeter (so not the most reliable measurement), they deliver 19.2V when using a 19V head. The main drawback of this adaptor is that it takes about 5 seconds to "boot"! (Yes, that means it will not deliver power during those 5 seconds.) That's not really a problem for a laptop, but for a lighting system, having to wait 5 full seconds just seems like forever. It means that if I want an instantaneous light-on feature, I need a switch between the PSU and the LED instead of between the wall plug and the PSU. In practice, I think I'll simply wait the 5 seconds, because I'll only use them when I want a very, very bright light, since I already have two LED-based lighting devices powered by those 36W PSUs that provide power instantaneously.

The other drawback is that it's made to work with (existing) laptops and they provide one connector per model (i.e., shape of the head and voltage), so you can't use any connector and still assume you'll be able to select the voltage: here, the head determines the voltage you'll get.

Heat!

One of the biggest (potential) problems with COB LEDs is that, as they are very powerful and very tiny, you're supposed to stick them on something that can dissipate the heat they produce. What I learnt from that is that iron is a very, very slow heat conductor compared to aluminium. I'm using aluminium rulers because they're the cheapest flat things made of aluminium that I could find. Three rulers very easily dissipate the heat produced by 4 LEDs. I tried tripling the density of LEDs... It was not a good idea, as the rulers become hot quite rapidly: within a few minutes, they reach close to 50°C. So with 4 LEDs per ruler, active fan cooling would be needed to dissipate the heat.

Sticking COB LEDs

To stick LEDs, I used some akasa thermal adhesive tape that I bought on Amazon, and it works pretty well.

The worst part...

... is obviously, to me, the soldering part. It's quite hard to solder those COB LEDs because the contact spots are very small, so it takes a tremendous amount of time.

Overall

Why bright light is important...

I feel that it's important to me to face (very) bright light during the day, and it's best if it comes directly from the sun. Apparently (although I'm not a hundred percent sure) it's actually important to be exposed to (very) bright light at least 30 minutes a day.

So, instead of using a very expensive S.A.D. light to prevent potential seasonal affective disorder, well, I'm using very bright state-of-the-art LEDs to light up my home. :-)

Conclusion

LEDs are awesome because they need very little space compared to other light-emitting devices.

As they're very energy-efficient, using them for indirect lighting doesn't consume too much electricity.

They can last up to 10 or 30 years if they're taken well care of. So once you make them, you don't need to worry about them for a while... Well, I'm not sure the power supply units can last that many years, but it's not the hard part to change.

If you live in Cambridge (UK), you could use Makespace to build your own LED-based super bright and/or super smart lighting system. Given my low budget when I started building these LED-based lighting systems, I couldn't afford a Makespace membership, so I've ended up with everything I need to handcraft them at home.

Pictures

4 COB LEDs on 3 aluminium rulers, in parallel. The 3 aluminium rulers are stuck together using aluminium tape, which has good thermal conductivity, reflects light reasonably well and is not too expensive.

4-COB-LEDs-off-640.jpg

4-COB-LEDs-off-unplugged-640.jpg

4-COB-LEDs-off2-640.jpg

If you do want LEDs but don't want to solder anything

Well, you can find plenty of websites selling LEDs via Google; you may also try this one, which belongs to a friend of mine.

by Philippe Wang at Mon, 07 Oct 2013

Using Travis for secure deployments with SSH (Anil Madhavapeddy)


In my previous post on Travis, I explained how it can be used to easily test OCaml packages on GitHub without having to host any infrastructure yourself.

The next step I wanted to investigate was how to use Travis to trigger service deployments after a successful build. One nice feature of Travis is its support for encrypted environment variables. The basic workflow is that you encrypt key/value pairs using a public key that they publish per GitHub repository. Once registered, each pair is made available as a decrypted environment variable within the Travis worker. You can use this to transmit API keys or other authentication data that needs to be committed to the .travis.yml file, but that obviously can't be left in the clear on a public repository for the world to see.

The small hitch with this whole scheme is that there's a very small limit of about 90 bytes or so on the size of each individual environment variable that's exported, so you can't just stash an SSH private key in there. Instead, the key needs to be Base64 encoded, split up into multiple environment variables of the right size, and then reassembled within the Travis VM. Rather than deal with shell scripts that work portably across MacOS X and Linux, I created a small travis-senv command-line binary to make this easier.
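
For illustration, here is a minimal OCaml sketch of that split-and-reassemble idea (my own example, not travis-senv's actual implementation; the chunk size, variable names and dummy secret are all made up):

let chunk_size = 90

(* cut an already-Base64-encoded secret into pieces small enough to fit
   into separate environment variables, then glue them back together *)
let split_into_chunks s =
  let len = String.length s in
  let rec go i acc =
    if i >= len then List.rev acc
    else
      let n = min chunk_size (len - i) in
      go (i + n) (String.sub s i n :: acc)
  in
  go 0 []

let reassemble chunks = String.concat "" chunks

let () =
  let secret = "QSBmYWtlIEJhc2U2NC1lbmNvZGVkIGtleQ==" in
  let chunks = split_into_chunks secret in
  List.iteri (fun i c -> Printf.printf "SECRET_%d=%s\n" i c) chunks;
  assert (reassemble chunks = secret)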

To use it, just opam install travis-senv and follow the instructions on the README at the homepage. Here’s the fragment of shell script that pushes the build output to another GitHub repository:

if [ "$DEPLOY" = "1" ]; then
  # get the secure key out for deployment
  opam install travis-senv
  mkdir -p ~/.ssh
  SSH_DEPLOY_KEY=~/.ssh/id_dsa
  travis-senv decrypt > $SSH_DEPLOY_KEY
  chmod 600 $SSH_DEPLOY_KEY
  echo "Host mirdeploy github.com" >> ~/.ssh/config
  echo "   Hostname github.com" >> ~/.ssh/config
  echo "   StrictHostKeyChecking no" >> ~/.ssh/config
  echo "   CheckHostIP no" >> ~/.ssh/config
  echo "   UserKnownHostsFile=/dev/null" >> ~/.ssh/config
  git config --global user.email "travis@openmirage.org"
  git config --global user.name "Travis the Build Bot"
  git clone git@mirdeploy:mirage/mirage-www-deployment
  cd mirage-www-deployment
  mkdir -p xen/$TRAVIS_COMMIT
  cp ../src/mir-www.xen ../src/mir-www.map ../src/www.conf xen/$TRAVIS_COMMIT
  bzip2 -9 xen/$TRAVIS_COMMIT/mir-www.xen
  git pull --rebase
  git add xen/$TRAVIS_COMMIT
  git commit -m "adding $TRAVIS_COMMIT"
  git push
fi

I've been using this to automate the construction of the Mirage Xen unikernel homepage. Every time there's a push to mirage-www, the Travis scripts now retrieve an SSH deployment key using travis-senv, and push the results of the build to the mirage-www-deployment repository that stores the build output. Our hosting machines poll this repository to look for new kernels and rotate the website (but more on this later – I'm just integrating the EC2 and Rackspace scripts to remove this step entirely next!)

Sat, 05 Oct 2013

(Example) A use case of MPP in an OCaml program (Philippe Wang)


Let's say you have a big variant type such as this one:

type t = A | B | C | D | E | F | G

and you want to implement a t -> string function. With MPP, you can write this:

{< let l = ["A"; "B"; "C"; "D"; "E"; "F"; "G";] >}
type t = {< let _ = List.iter (Printf.printf "| %s ") l >}
let to_string = function
{< let _ = List.iter (fun e -> Printf.printf "| %s -> %S" e e) l >}

and then you have that:

type t = | A | B | C | D | E | F | G 
let to_string = function
| A -> "A"
| B -> "B"
| C -> "C"
| D -> "D"
| E -> "E"
| F -> "F"
| G -> "G"

As you may see, if you grow the type t by adding more cases, you don't have to manually update the function to_string.
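
Other functions can be derived from the same list l in exactly the same way. As a sketch (mine, shown in its expanded OCaml form rather than as the MPP block that would generate it), an inverse of_string could look like this:

let of_string = function
  | "A" -> A | "B" -> B | "C" -> C | "D" -> D
  | "E" -> E | "F" -> F | "G" -> G
  | s -> invalid_arg ("of_string: " ^ s)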

by Philippe Wang at Fri, 04 Oct 2013

OPAMaging MPP (Philippe Wang)


Please read this if you want to know more about MPP.

Announcement: I've OPAMaged MPP!

So, I've eventually OPAMaged MPP (i.e., made an OPAM package for MPP)! Now MPP officially has a version number, and at the time I'm writing these lines, the current version is 0.1.0.

Happy that it eventually worked...

Yes, I'm quite happy that MPP now has an OPAM package. To do so, I copied the _oasis and Makefile files from another of my projects and adapted them for MPP. At first I wanted to avoid using oasis because, for reasons that are still unknown to me, some people told me (half-joking?) that it was better not to depend on oasis... Well, I did try not to depend on oasis! But I just couldn't put together, in a reasonable amount of time, a Makefile that met the requirements for an OPAM package. Building MPP is quite trivial; installing it in the right place and being able to uninstall it is, well, less trivial (perhaps I've just missed a simple step in the process).

A side note

Although the MPP project started a while ago now, it's still at a pretty early stage of development, since I've mostly been working on the tools that are needed to build the new ocaml.org website, such as OMD (read about the experience here), which I developed from scratch, and OPAM-DOC, which I had to contribute to at some point because we had to fix it with very little time available.

Now that OMD has become quite stable (although some bugs are still found from time to time, which is normal) ../..

No product of human intellect comes out right the first time. We rewrite sentences, rip out knitting stitches, replant gardens, remodel houses, and repair bridges. Why should software be any different? – Lauren Ruth Wiener

../.. and since I'm not the only contributor on opam-doc (Vincent Botbol is still participating, and he fixed quite a nasty-to-fix bug last weekend; Leo White is still working on the JavaScript part), it doesn't take as much of my time as OMD; but because it's not my code and because it uses a syntax extension (yeah, Camlp4's in the place, yo), patching it is not as simple as it could be...

by Philippe Wang at Thu, 03 Oct 2013

Using MPP in two different ways (Philippe Wang)


MPP is now an OPAM package

First of all, I have to say I've finally made an OPAM package for MPP! At the time I'm writing these words, it's still waiting to be merged. Read more here.

Description of MPP

MPP is a preprocessor that has been designed to allow any (non-insane) programming language to be used as a preprocessor for any text-based file. The particularity is that the users keep their initial language as their primary language: the language(s) brought by MPP is (are) secondary. It means that ideally, if you give a random text file to MPP, MPP will return it unchanged! And in practice, you can always parameterize MPP (on the command line, or in an MPP file) such that for a given text file, MPP will return it unchanged.

Of course, the purpose of MPP is not to replace the "cat" command! But having the property that you can guarantee that MPP will behave as the "cat" command if you want it to is important, because it means you can customise MPP such that MPP can work for any text file.

Side note:

Well, if you discover a language that has some kind of infinite semantics, somehow automatically defined by itself, such that every possible sequence of bytes has a meaning in this language, MPP might not work, because you would have a hard time finding a token that's not already taken, since they're all already taken... But I don't think that's a realistic scenario, so let's just forget about it.

Principle of MPP

The principle on which MPP is based is pretty simple. The users define the tokens they want to use to determine the beginning and the end of MPP blocks. And there are 3 kinds of MPP blocks:

  1. basic MPP block;
  2. basic MPP block with possible nesting;
  3. foreign MPP block.

The first and the second are used for the same things, except that the second allows nesting. The third kind is the most interesting one, as it's the one that allows you to embed virtually any programming language as a preprocessor language.

Side note:

Well, you can't easily use pure lambda calculus as a preprocessor language. Your programming language has to have some notion of input/output to process text. Using pure lambda calculus might be very painful and it would start by making it impure enough to interact with the rest of the world.

Example 1: the basics

So, let's say you're writing a program using the OCaml language, and you want x to be a random integer.

You can write this:

let x = Random.int max_int

and if you want to be sure that the number is different each time you execute your program, you should use Random.self_init or an equivalent.

Now what if you want a random number to be generated at compile time? You can't do it in pure OCaml. You need to preprocess your file and/or to read from another file...

With MPP, you can write this:

let x = << cmd echo $RANDOM >>

if you use mpp with the parameters -so '<<' -sc '>>' to set the opening and closing tokens to << and >>. The default tokens are respectively (( and )), but if you're writing some OCaml code, you shouldn't use (( and )) because there's a good chance that they already appear as OCaml code in your file.

What happens with MPP is that it simply echoes everything that is not MPP, and when it's MPP, it interprets the commands. The cmd builtin is probably the simplest and the most powerful, since it calls your shell with what follows cmd. The first line is the command that is called, and what follows is given to the command as its standard input.

So, basically,

let x = << cmd echo $RANDOM >> mod 42

given to MPP has the same semantics as this Bash script

echo -n 'let x = '
echo $RANDOM
echo ' mod 42'

Now, instead of using non-nesting tokens, you can use nesting tokens, the only difference being that you can nest commands. Wait, what if you nest with non-nesting commands? Well, MPP has a very simple syntax. If you can't nest, you really can't nest. So, if you write << echo << echo 42 >> >> it just doesn't work! Moreover, you have a syntax error because the first >> is seen as the closing token for the first <<, and MPP stops when there's a syntax error. If you want to echo << echo 42 >>, it is possible, you just have to name your block, as in <<name echo << echo 42 >> name>> and then it just works!

Note that with nesting tokens, you have to name all outer tokens. The default tokens are {{ and }}. So, to nest some commands, you can write for instance {{name set x {{ cmd cat foo.txt }} name}}. Note that the innermost block doesn't have to be a nesting one, as it doesn't nest anything, which means you could write {{name set x << cmd cat foo.txt >> name}} instead. And note that only the innermost nesting block can be nameless, and only in some cases.

Note: the technique presented above (in example 1) has been used to generate the new OCaml website.

Example 2: using OCaml as a preprocessor language

When MPP is used with the option -l, it produces a program that prints the file, using information provided to MPP about the language specified with -l. For instance, mpp -l ocaml will produce an OCaml program.

If you apply mpp -l ocaml to a file F that contains no MPP commands, then you have an OCaml program that, when executed, produces the original file F.

When you use -l ocaml, all "foreign blocks" are considered as OCaml code and it will just "pass through". So, if you write 42 {< let x = 23 >} Hello, it will be translated to this program:

let _ = print_string "42 "
let x = 23
let _ = print_string " Hello"

It's not exactly as flexible as "you can write anything at the top level". Indeed, you may write any valid top-level definition. It could be a type definition, a module definition, really anything you want that is valid at the top level and that does not require double semi-colons. That is, you cannot write {< print_int 42 >} because it's not a valid top-level phrase; you should write {< let _ = print_int 42 >} instead. If you really love playing at the edges, you could write {< ;; print_int 42 ;; >} and it might work... if there is something preceding it. If there isn't (but that's not trivial to guess, because the semantics of MPP might change on this precise point; I don't know, I can't promise right now), then you could perhaps write {< print_int 42;; >}.

But you can't convert a block of text into an OCaml function. You can't write something like you would do in PHP, such as:

<ul>
  {< for i = 1 to 42 do >}
    <li> {< print_int i >} </li>
  {< done >}
</ul>

Oh but wait! You can use functors...

So in functor-style, you can write this:

<ul>
  {< module I (X:sig val i : int end) = struct >}
    <li> {< let _ = print_int X.i >} </li>
  {< end >}
  {< let _ = for i = 1 to 42 do
       let module M = I(struct let i = i end) in
       ()
     done >}
</ul>

Sorry if that hurts your eyes, but life is not always easy. Oh, and if you think it's overkill, well, of course it is, that's a tiny toy example! I'm actually using this functor-style to write this blog. It means that I generate a template with the functor-style headers and footers and I write inside the functor using Markdown, and when I need to write some OCaml code, I use MPP features. And when I need to print MPP stuff, I use the 2 worlds of MPP, but it's late and I don't want to go into details just right now. What's important is that it's nice and it works. I've always wanted to be able to easily embed the OCaml language in any text (LaTeX, ML, HTML, Bash, etc.) file, and now I can!! :)
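
To see why the functor trick works, here is roughly the OCaml program that mpp -l ocaml would produce for the template above (my reconstruction, not verified output): the prints inside the functor body only run each time the functor is applied, so the loop emits one <li> per iteration.

(* roughly what mpp -l ocaml produces for the functor-style template above *)
let _ = print_string "<ul>\n  "
module I (X : sig val i : int end) = struct
  let _ = print_string "\n    <li> "
  let _ = print_int X.i
  let _ = print_string " </li>\n  "
end
let _ = print_string "\n  "
let _ =
  for i = 1 to 42 do
    let module M = I (struct let i = i end) in
    ()
  done
let _ = print_string "\n</ul>\n"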

by Philippe Wang at Thu, 03 Oct 2013

Intellisense for OCaml with Vim and Merlin (Anil Madhavapeddy)


I’m finalizing the installation instructions for Real World OCaml and finally got around to configuring the awesome Merlin editor tool. I pretty much never configure my editor, so here are my notes on getting Merlin working with Vim completely from scratch on MacOS X (I’m on 10.9DP7, but this should work with earlier versions of MacOS X too).

First, install some basic plugin tools by following the Syntastic installation instructions.

Install vim-pathogen:

mkdir -p ~/.vim/autoload ~/.vim/bundle; \
curl -so ~/.vim/autoload/pathogen.vim \
https://raw.github.com/tpope/vim-pathogen/master/autoload/pathogen.vim

Install vim-sensible for useful defaults:

cd ~/.vim/bundle
git clone git://github.com/tpope/vim-sensible.git

Install Syntastic:

cd ~/.vim/bundle
git clone https://github.com/scrooloose/syntastic.git

Follow the Merlin/vim from scratch directions:

opam install merlin

Finally, add this to your ~/.vimrc to turn everything on.

execute pathogen#infect()
let s:ocamlmerlin=substitute(system('opam config var share'),'\n$','','''') .  "/ocamlmerlin"
execute "set rtp+=".s:ocamlmerlin."/vim"
execute "set rtp+=".s:ocamlmerlin."/vimbufsync"

Trying it out

Read the excellent instructions at the Merlin for Vim from scratch wiki page. A few immediately useful things I did are below.

Displaying types: You can get the local type of something by typing in \t anywhere. This is the <LocalLeader> for anyone who has customized their setup (it is mapped to backslash by default).

Tab completing functions: You can tab-complete module signatures by pressing Ctrl-X-O (make sure you keep Ctrl pressed the whole time). This activates vim’s OmniComplete mode. To get this to work with your external projects, create a .merlin file. I use this for ocaml-dns for example.

PKG lwt cstruct
S lib
S lib_test
S lwt
B _build/lib
B _build/lib_test
B _build/lwt

Also, don't forget to pass the -bin-annot flag to the compiler to generate the typed-AST cmt files. You can add true: bin_annot to your _tags file if you're using OCamlbuild (in OCaml 4.01.0 or higher; it'll be silently ignored on previous versions, so you don't need to worry about breaking older OCaml compilers).

Wed, 02 Oct 2013

Liveblogging the first Human Data Interaction workshop (SRG Syslog)


I'm at the Open Data Institute with Richard Mortier, Jon Crowcroft, Amir Chaudhry and Hamed Haddadi, live-blogging a daylong workshop about our emerging Human-Data Interaction research initiative.  The room is packed with notable researchers from all over the UK, so this promises to be an exciting day!

Mort opens up with an introduction to the format of the workshop and lays out the problem space:

  • Visualisation and sense making: how do we make sense of such complex, technical systems, and crucially, what is the process of translating the information to
  • Transparency and audit: essential to give users a feedback loop
  • Privacy and control:
  • Analytics and commerce: there is clearly an ecosystem forming around personal data (cf. several startups specifically around this)
  • Data to knowledge: what are the new business models around

Attendee introductions

Danielle (missed her last name, is from Switzerland): did a PhD on anomaly detection and then worked at Swiss banks on security monitoring. The key challenges in the area are around the nature of the questions we need to ask about all the data that we have access to.

Steve Brewer: coordinator of the IT as a Utility network which is funding all this.  Interested in the next steps: both immediate and the overall vision and direction it’s all going.  Concrete actions emphasised.

Ian Brown: Oxford University and cybersecurity. Interested broadly in these issues and also has a degree in psychology and behavioural psychology.  The challenge is how to balance the "engineering" (this isn't fundamental computer science) and understand why there is so little take-up of this.

Amir Chaudhry: working on a toolstack for the Internet of Things.  When we have tweeting toasters, how do we make it all useful to people. It’s not the raw data, but the insights we present back to people to make it useful.

Elizabeth Churchill: was at Yahoo, now at eBay Research.  Right now the challenge she's facing is personalisation, and trying to understand why the models they have of people are rubbish (this is why so much advertising isn't very useful).  The reasoning processes and data quality are important.  We are making assertions about people based on data gathered for another purpose, and the ethical and business issues are important. Has been pushing on: what is the data source, what is the data quality, are these assertions based on data that's appropriate, and how can we design new algorithms?

Jon Crowcroft: from the Computer Lab in Cambridge and does anything to do with communication systems; leads the Horizon project with Mort and PAWS, which is working on wifi access in disadvantaged areas. Also working on Nymote.  Is interested in new human/data rights vs t&cs.  Rights and duties are encoded in terms and conditions (badly); see Lanier's latest book about this, and see how poor Amazon's recommendations are.  We're interested in building a technology where you own your own data but business practices all gel together.  We had a workshop on how the EU is pushing the right to be forgotten, so how can we ensure that data can be removed, including all references?  People say "it's too difficult", but this isn't true: takedowns work, and why is it that only big corporations can afford to take stuff down?  There is also the right to oblivion ("not be there in the first place"), and data shouldn't be a public good but people should have the right to be in paid avoidance zones. (See Shockwave Rider by Brunner, 30 years old, loads of great technical ideas, and Future Shock is a good read too.)  Can we shift regulatory positions and have

Thomas Dedis: runs Fab Lab in Barcelona.  Cofounder of Smart Citizen, a crowdfunded environmental sensing platform built around Arduino-based hardware.  It allows people to capture data in their own homes and push it to an online platform, in a 'conscious' way. Go to smartcitizen.me to see the data platform.  Intended to grow into other places, but currently capturing air pollution, humidity, temperature and other environmental data.  The main thing is that you own your own data.  Much cheaper than the competition too, and crowdfunded.

Martin Dittus (@dekstop) is a PhD student at UCL and works on cities.io. Thousands of people mapping the planet in incredibly detailed ways and it works with both commercial and non-commercial stuff. What makes these systems work, what are the processes to coordinate things, questions of data quality and how to assert stuff over it?  Used to be part of the last.fm data team which is about personal data gathering and detailed profiling that users themselves put up.

Christos Efstratiou: as of September is a lecturer at the University of Kent (and is still a visiting researcher at Cambridge).  Works on sensing, broadly anything to do with people and sensors, and embedded sensing in the environment.  Privacy is a huge issue and is his key challenge.  This isn't sensing in the old style like Active Badge: back then, people weren't aware of the issues, whereas nowadays people are more aware of privacy.  So the challenge is the evolving user perceptions and how our system design works.  Anecdote: at a recent wedding he was at, there was a request from the bride/groom not to post any public photos to Twitter/Facebook. We've lost control over our public personas.

Julie Freeman: an actual resident in this building and the art associate for the ODI space!  Is a PhD student at QMW and is interested in echoing other people.  Interested in physical manifestations of digital data, and how we can “step away from the screen”.  Broad interests in transformation of digital data.

Barbara Grimpe: from Oxford and is a sociologist in an interdisciplinary EU project on "Governance for Responsible Innovation".  They have a network notion of responsibility, which takes into account that a lot of data production and data use takes place in networks that possess strong ties between people.  She has started two case studies in two areas: telecare technologies for elderly people (relevant to the EU Horizon 2020 societal challenge in ageing and wellbeing, given the demographic change in western societies; this brings ethical issues around the use of personal data at scale, and there is also the difficulty of informed consent due to dementia). The other case study is completely different and is about financial market technology, where the data is highly commercially sensitive, so it's about understanding how transparency can be balanced against financial control and the need for genuine market secrets to facilitate trade.

Wifak Gueddana is a postdoc in the information systems group at LSE.  Did her PhD on open source communities and how open source can work for NGOs and grassroots organisations.  Working on a research project on online platforms: how to use computational analytics to collect and process data and deal with the qualitative and subjective issues.

Hamed Haddadi: lecturer at QMUL, asking how we can compute over mobile data; has worked on advertising and privacy-aware systems.  Linking the temporal nature of the data and understanding how much of your data is exposed (amusing anecdote about wearing shorts).

Muki Haklay: professor of GIScience and the Extreme Citizen Science group. He's interested in participatory citizen science and cyberscience.  Interested in the GeoWeb and mobile GeoWeb technologies and understanding how tools for voluntary participation work (OpenStreetMap, participatory geoweb).  "How can you give control to people that are not technical?", and how do we build protocols for cultural sensitivity?  Working on OpenStreetMap: while it's easy for techies to contribute and feel happy, it might be an issue from a privacy perspective without GPS fuzzing or pseudonyms (you can say "I know where you live" to every OSM user). (Discussion about most data being rubbish, and a psychology question about whether this depresses people!)

Jenny Harding: from the Ordnance Survey, who control most of the UK's mapping data; she's a principal scientist in the research team, working on how people interact with geography in business, public service and leisure.  Moving beyond just GPS into the connective function of what's going on in places, and the connections between different objects and places.  What is the purpose of needing all this connected information, and how can the Ordnance Survey better serve its users with such usable connected data?  She commissions internal and external research, and this includes PhDs and postdocs.  Challenge for this workshop: how personal data relates to location and the different levels of granularity at which such relationships can be made; beyond GPS, there is address data in (e.g.) supermarket loyalty cards, and other data at a postcode level, and different types of geographies all have connections.  Understanding the provenance of data is really, really important.

Pat Healey: Professor of Human Interaction and head of the cognitive science research group; works on health issues.  Not here yet, so introduced by Hamed.

Tristan Henderson: lecturer in compsci at St Andrews in Scotland; did his PhD on first-person shooter games and runs a widely used mobile data set called CRAWDAD. They archive and share it, and so work a lot on redistribution of data. His undergrad was economics, so his interest is in behavioural aspects and usability issues too (Tristan has recently joined the HCI group and the ethics committee at St Andrews).  How can we get researchers to further engage with the ethics process and to refine the notions of informed consent in electronic terms?  Challenges: what is acceptable HDI and are we conducting it in an acceptable way (q from Mort: how broad? a: everything).  And is HDI unique enough that we might need another term?

Laura James: from OKF (not here yet)

Helene Lambrix: visiting LSE and usually at Paris-Dauphine University in France and is hoping to finish her PhD this year! Interested in corporate reputation and social media.

Neal Lathia: researcher at Cambridge University; did his PhD on online recommender systems. Noticed a disparity between data services online and the offline world, so started working on recommender systems for urban systems (banging his head against TfL data).  At Cambridge, started working with psychologists and leads EmotionSense (how satisfied are you with your life, alongside ongoing smartphone sensor data); it's really popular.  Challenges: the language issue around how we present issues and motivate people around using that data (how does using EmotionSense affect their behaviour?)

Panos from Brunel: interested in cybersecurity and intelligence from media data mining and cloud-based work. The commodification process of digital economy data and what the legal framework surrounding this is.  What are the personas for data, to apply frameworks and data mining techniques to it?  Challenges: the regulatory system around using big personal data.

Eva from the University of Nottingham; has just submitted her PhD and is awaiting her viva. Background is political science and international relations, and she is interested in informed consent and how we sustain consent rather than just secure it as a one-off. The challenges: the human bit, how we communicate the complexity of systems, and how people can make meaningful decisions. If you want people to be engaged with the process then we need data to be more social and human.

Ursula Martin: professor at a Russell Group university on Mile End Road. Is a mathematician researching the production of mathematics, and how it happens in the first place. Producing maths is a slow, painstaking thing, and she wonders how the rate of production can keep up with our needs. When interacting with an outfit getting her data, she's not just an isolated individual, but is actually part of a large complex system.

Richard Mortier “mort”: From Nottingham and is the dude running this workshop so has introed before!

Nora Ni Loideain: doing a PhD on European data protection and the mandatory retention of data by telecoms providers. References recent US events (cf. Snowden); her PhD is on privacy oversight and the safeguards (or lack thereof) in current frameworks.  Challenges: how can we build these safeguards, and what is the nature of informed consent with them?

Ernima Ochu: based in Manchester. Sometimes an activist, sometimes a writer, sometimes an artist. Background in neuroscience!  The social life of data and what happens around it, and how people get around based on it.

Yvonne Rogers: from team UCL and is director of UCLIC (Interaction Centre) and also an Intel-funded institute at UCL where they work on connected data.  Given lots of data from sources, interested in how people can engage with it and how to visualise it. (Hamed: "Human City Interaction is the next thing!")

Geetanjali Sampemane: background in Computer Science, and works at Google in the Infrastructure Security and Privacy group.  The challenge is how to help people calibrate the benefits and risks for appropriate tradeoffs.  How can humans make an informed choice, when what we have today isn't really informed choice?  Giving people buttons and options isn't the most useful way to approach this, and we need a mental model similar to how we judge risks in the physical world.  In the online world, no one understands how dangerous it is to reuse passwords. Security people have made it a little worse by telling people to use complicated passwords, but brute force isn't the big problem right now, it's the connectivity of services.

Cerys Willoughby: Southampton, looking at the usability of notebooks and wondering how to make the interfaces usable without being a technological guru. (Missed the rest due to reading the cool comic she projected about her work. Sorry!)

Eiko Yoneki: from Cambridge, and she works on digital epidemiology and real world mobility data collection in Africa (e.g. EipPhone).  She analyses network structure to understand graphs and connectivity.  Also works on CDNs and builds self-adaptive CDNs, and works on graph-specific data parallel algorithms.

Jonathan Cave: game theorist (Yale, Cambridge, Stanford) and worked in a lot of government/academia/research areas. Works on economics and IoT (festival of things and the boundaries of humanity is coming up soon in Cambridge on 29th October).  Fascinating anecdote about price of sick animals

George Danezis: formerly MSR, now UCL; interested in the technical security aspects of how to glue together distributed mobile devices without leaving our personal data readable by everyone. Has done work on privacy-friendly statistics and how we can process and analyse data as an aggregate data set. Has worked in the context of smart metering in Europe.

Breakout sessions

We then had breakout sessions to brainstorm the challenges in this area (lots of post it notes and arguments), and Amir Chaudhry has summarised the results of the 5 presentations here:

Disambiguating data, for example from the home and from organisations. This isn't a new problem, but it becomes more important the more data collection occurs using different sources.  Who are the principals in terms of ownership and provenance?  How do we deal with communities/groups and data control?

Why not try crowd-sourcing mechanisms that people can use to improve the use of sites like Ryanair (who deliberately obfuscate things)?  Changing mindset from a consumer perspective to a producer perspective, i.e. humans make data and perhaps can provide this to others for economic benefit.

We have data and different notions of data quality.  It's not always the case that the most scientific data is the best, depending on how it's used.  We have collectivist notions of data culture: e.g. this conversation right now in the room isn't just individual, it's all of us, so we can't ascribe it to individuals. Then there are network notions, based on transactions, that use a reductionist view to decide what they're worth.  We can think of these on a continuum, and the research challenge is how well the different ends of the above scale do in producing data and making value.  An interesting question: if people opt out (right to forget or right to oblivion), then the data that is left is flawed.

We need to examine current assumptions and values around data.  What is the unit of analysis? We must be clearer on this. Where and when we look at what data also matters, as well as global aggregation.  Sometimes we also want to look at trajectories and data flows and how these change over time.  How do we interact with this data?  Do users interact with data directly or with something that sits in front of it?  There's an interaction between data science and the creative process of presenting information in a certain way.  This depends on the ultimate goal being served, for example behavioural changes or just increased engagement.

There are big challenges in the integration of groups who want to construct humans: groups like the social sciences (think about risk), psychological science (reputation), data sciences (epistemology and ontology), and design science (interfaces and interactions). What is the new meaning of ownership and liability? E.g. who owns this collection of posters and the ideas that have come out? [Hamed and Mort clarify that it's all theirs!]  What happens if there are negative consequences as a result of using poor data?

Business models are also important in order to go from studies to practice. Are there new social structures we could make to help this?  For example, we have venture capital that takes risks, but what about social capital to spread risk to create new businesses, e.g. Kickstarter and the like?

What does informed consent mean? The current system puts the onus on the user to understand all the contractual conditions before deciding.  Perhaps there's a social-network method of crowd-sourcing opinions on ToS, or providing some kind of health rating?  Perhaps data protection agencies could certify terms, or maybe the EFF or non-profits could provide some kind of rating system (cf. Moody's etc.?)


Next steps

Ian Brown took notes on our breakout session on business models for privacy:

Collectives/cooperatives sharing data through PDSes

  • How to incentivise membership? Dividends, social benefit (e.g. medical research)
  • what currently exists where data controller has strong incentive not to leak/breach data e.g. Boots for brand loyalty, Apple/Facebook? So long as customer can switch effectively (portability, erasure)
  • Power tool sharing at village level. Hang off existing structures e.g. local councils.
  • New forms of micro-markets e.g. physical gatherings? Alternatives to currencies. Kickstarter? Distribution reduces risk of centralised architectures.
  • What do syndicalist-anarchist models of data management look like?
  • Current uses of data are optimising existing business practices. But what totally new practices could be enabled? Human-facing efficiencies?

What are types of biz models? Startups, personal profit, NGO, medical research, banks. Balanced investment portfolio.

by Anil Madhavapeddy at Wed, 02 Oct 2013

Test your OCaml packages in minutes using Travis CI (Anil Madhavapeddy)


A few months ago, Mike Lin posted instructions on how to bootstrap an OCaml testing environment within the Travis continuous integration tool. I finally got around to integrating his prototype scripts properly, using the latest OCaml and OPAM versions, during my travels last week to ICFP. It's been an extraordinarily quick and pleasant experience to add the (free!) Travis test runs to my OCaml programs on GitHub, so here's how you can do it too. Dave Scott and I have used this for about 15 of our own projects already, and I'm switching the whole of the Mirage repo infrastructure over to it this week.

(edit: I’ve done a followup post about integrating Travis with SSH to make secure deployments easier.)

Getting started

Getting my first Travis build working with one of my OCaml projects took about 2 minutes in total:

First, log into Travis by signing in with your GitHub account. Click on the Accounts button on the top-right and you should see a list of all the GitHub repositories that you have access to. Just click the On switch for the one you want to start testing. Nothing will actually happen until the next code push or pull request goes to that repository. Behind the scenes, the On button that you clicked uses the GitHub APIs to turn on the Travis post-commit hook for your repository.

Create a .travis.yml file in your main repository (see below or this gist). Travis doesn’t have native support for OCaml, but it isn’t really needed since we can just use C and write our own shell scripts. The env variables define a matrix of the different combinations of OCaml and OPAM that we want to test. Just remove variations that you don’t care about to avoid wasting Travis’ CPU time (open source projects are supported on a fair-use basis by them). Here’s the .travis.yml that I used for my ocaml-uri library:

language: c
script: bash -ex .travis-ci.sh
env:
  - OCAML_VERSION=4.01.0 OPAM_VERSION=1.0.0
  - OCAML_VERSION=4.01.0 OPAM_VERSION=1.1.0
  - OCAML_VERSION=4.00.1 OPAM_VERSION=1.0.0
  - OCAML_VERSION=4.00.1 OPAM_VERSION=1.1.0
  - OCAML_VERSION=3.12.1 OPAM_VERSION=1.0.0
  - OCAML_VERSION=3.12.1 OPAM_VERSION=1.1.0

Now you just need the .travis-ci.sh shell script to run the actual tests. Travis provides an Ubuntu Precise/i386 VM that is destroyed after every test run, so we need to initialize it with the OCaml and OPAM binary packages. Since you often want to test different versions of all of these, I created a series of stable Ubuntu PPAs that have OCaml 3.12, 4.00 and 4.01, and OPAM 1.0 and the (currently beta) 1.1 package manager. You can find them all on my Launchpad page, but the script below takes care of it all for you.

# Edit this for your own project dependencies
OPAM_DEPENDS="ocamlfind ounit re"
	 
case "$OCAML_VERSION,$OPAM_VERSION" in
3.12.1,1.0.0) ppa=avsm/ocaml312+opam10 ;;
3.12.1,1.1.0) ppa=avsm/ocaml312+opam11 ;;
4.00.1,1.0.0) ppa=avsm/ocaml40+opam10 ;;
4.00.1,1.1.0) ppa=avsm/ocaml40+opam11 ;;
4.01.0,1.0.0) ppa=avsm/ocaml41+opam10 ;;
4.01.0,1.1.0) ppa=avsm/ocaml41+opam11 ;;
*) echo Unknown $OCAML_VERSION,$OPAM_VERSION; exit 1 ;;
esac
	 
echo "yes" | sudo add-apt-repository ppa:$ppa
sudo apt-get update -qq
sudo apt-get install -qq ocaml ocaml-native-compilers camlp4-extra opam
export OPAMYES=1
opam init 
opam install ${OPAM_DEPENDS}
eval `opam config env`
make
make test

Now just do a push to your repository (a commit adding the Travis files above will do), and you will soon see the Travis web interface update. For example, here’s the output of ocaml-uri that the above example files are for. Of course, you should tweak the scripts to run the tests that your own project needs. Let me know if you make any useful modifications too, by forking the gist or e-mailing me.

Testing pull requests in OPAM

Travis isn’t just for code pushes though; as of a few months ago it can also test pull requests. This is an incredibly useful feature for complex projects such as the OPAM repository that has lots of contributors. You don’t need to do anything special to activate it: whenever someone issues a pull request, Travis will merge it locally and trigger the test runs just as if the code had been pushed directly.

I did do some special scripting to make this work with the OPAM package repository. Ideally, every new package or metadata change in OPAM will attempt to rebuild just that package set (rebuilding the entire repository would far exceed the 50 minute testing budget that Travis imposes). I put together a very hacked up shell script that greps the incoming diff and rebuilds the subset of packages. This is now live, and you can see both successful and failed pull requests (once the request has been merged, there’s a tiny little green arrow beside the commit that was tested).
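
As a rough illustration of the idea (a sketch of mine, not the actual hacked-up shell script): scan the incoming diff for paths under packages/ and keep the distinct package directories it mentions.

(* read a unified diff on stdin and print each distinct
   "name.version" directory mentioned under packages/ *)
let marker = "packages/"

let package_of_line line =
  let mlen = String.length marker in
  let len = String.length line in
  let rec find i =
    if i + mlen > len then None
    else if String.sub line i mlen = marker then Some (i + mlen)
    else find (i + 1)
  in
  match find 0 with
  | None -> None
  | Some start ->
    let stop = try String.index_from line start '/' with Not_found -> len in
    Some (String.sub line start (stop - start))

let () =
  let seen = Hashtbl.create 16 in
  try
    while true do
      match package_of_line (input_line stdin) with
      | Some p when not (Hashtbl.mem seen p) ->
        Hashtbl.add seen p ();
        print_endline p
      | _ -> ()
    done
  with End_of_file -> ()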

This is a very unconservative test matrix estimate, since a package update also affects the reverse transitive cone of packages that depend on it, but it does catch several common typos and incompatibilities (for example, packages that use OPAM 1.1-only features by mistake). The longer term plan is to use the OCamlot command line tool to parse the pull request and compute an exact set of packages that need rebuilding.

Deployment

Travis has one last very cool feature up its sleeve. When a project has successfully built, it can run a scripting hook in the VM, which can be used to trigger a further code push or service update. If the service requires authentication, you can encrypt a secret in the .travis.yml using their public key, and it will be available in the VM as an environment variable (but won't be useful to anyone else looking at the code repository).

There are quite a few uses for these Travis deployment scripts: automating the rebuilding of the ocaml.org website infrastructure, rebuilding the central opam-doc cmt file archive, or even autoupdating Mirage microkernels to Amazon EC2 and Rackspace.

So how does this tie into the ongoing work on OCamlot? Quite simply, it’s saved us a ton of frontend work, and lets us focus on the more interesting OCaml-specific problems. Travis is also somewhat reactive (since it only runs in response to pushes or pull requests), and we still need to be able to run complete repository sweeps to look for more systematic breakage. It also doesn’t support any non-Ubuntu operating systems yet. However, Travis takes the burden off us for handling the steadily increasing number of package updates, and can be used to launch further OCamlot jobs on the other architectures and distributions. All in all, I’m very grateful to Mike for taking the trouble to blog about it back in March!

And a little bit of cheeky ARM hackery

I couldn’t resist one final little hack to see if I could actually do some ARM/OCaml testing using the Travis infrastructure. I adapted Ian Campbell’s excellent guide to ARM cross-chroots, and ended up with a rough script that builds an unattended chroot and can run OCaml/OPAM installations there too.

This isn’t something I’m quite comfortable running on all of my repositories just yet though, since an OCaml cross-build using qemu took over 5 hours and was quite rightly terminated when running within Travis. To mitigate this, I’ve built a custom apt-repository for ARMel packages to install a binary toolchain, and you can see the results of Lwt/ARM building successfully. I’ll update on this as I get it working on more packages, as I’m quite interested in getting a solid Xen/ARM toolstack up and running for another exciting project over in OCaml Labs…

Sun, 29 Sep 2013

(Exercise) Odd or even? (Philippe Wang)


This is the exercise:

Given a string (of bytes), is the number of bits set to 1 odd or even?

Let's now find some possible solutions.

Well, the most naive way is to count them all, and then see if the number is even or odd. But it's probably better not to use an integer to count them all and then see if that counter is odd or even, since we can instead use a Boolean: each time we see a 1, we can simply flip the Boolean value!

let has_an_odd_number_of_1 s =
  let res = ref false in
  for i = 0 to String.length s - 1 do
    let x = int_of_char s.[i] in
    for j = 0 to 7 do
      if (x lsr j) land 1 = 1 then
        res.contents <- not res.contents;
    done
  done;
  res.contents

Side note:

Moreover, we can see that using a Boolean means there is no limit on the number of bits we may read, whereas with an integer... well, actually, when we reach the maximum value of integers (of type int, Int32.t or Int64.t, which in OCaml are respectively Pervasives.max_int, Int32.max_int and Int64.max_int), adding one wraps around to an even negative number (max_int, from Pervasives, Int32 or Int64, are all odd numbers), so we would still get the odd/even outcome right. But then, we would have to be sure that we have correctly designed the code that tells us whether an integer is odd or even. For instance, we can write is_odd these ways:

Well, let's start with a very classic mathematically-elegant implementation of two mutually recursive functions:

let rec is_odd = function
 | 0 -> false
 | n -> is_even (n-1)
and is_even = function
 | 0 -> true
 | n -> is_odd (n-1)

Note that it needs a lot of computation when the integer is "big"...

And this is probably the shortest version:

let is_odd n = n mod 2 <> 0

Now note that we cannot write this and expect it to work properly in all cases:

let is_odd n = n mod 2 = 1

and that's because if n is negative and odd, then n mod 2 is -1, not 1.
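
A quick check of that pitfall (my addition, easy to confirm in the toplevel):

let () =
  assert ((-3) mod 2 = -1);   (* not 1, so the "= 1" test misses negative odd numbers *)
  assert ((-3) mod 2 <> 0)    (* the "<> 0" version gets it right *)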

A tricky version that uses properties on integers:

let is_odd n = n <> 0 && ((n / 2) * 2) <> n

Another tricky version that uses properties on machine integers:

let is_odd n = n land 1 = 1

Well, it has the same number of characters as the shortest version, and this version is probably the most efficient, because the land operation is cheap for the computer, whereas the mod operation implies a division, which is not. A mod could be optimised in some cases, such as mod 2, because it's basically the same as land 1 except that the sign has to be preserved; so it's still likely to cost more...

But we can do much better than that! Indeed, unless we really don't have a choice, we shouldn't read bits one by one, because it's slow!

So what can we do instead?

Well, there's a very very convenient logical operation on integers. It's called XOR (and the OCaml operator for that is lxor), and yes we can use it!

Note that 1 XOR 1 = 0 and 0 XOR 0 = 0, but 1 XOR 0 = 1 and 0 XOR 1 = 1. It's very convenient because it's basically giving us the answer! When we have two 1s, we can eliminate them: that's what happens with XOR. When we have a 1 and a 0, we keep the 1. So if we take the string byte by byte and XOR the bytes with each other, at the end we will have one remaining byte that tells us whether the whole string had an odd number of 1s or not.

let has_an_odd_number_of_1__efficient s =
  let l = String.length s in
  let rec loop c i =
    if i = l then
      let res = ref false in
      for j = 0 to 7 do
        if (c lsr j) land 1 = 1 then
          res.contents <- not res.contents;
      done;
      res.contents
    else
      loop (c lxor int_of_char s.[i]) (succ i)
  in loop 0 0

And... There we go! :-)

Note that int_of_char is a zero-cost operation (but char_of_int is not, because it has to check that the value of the given integer is between 0 and 255 inclusive).
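
As a further micro-optimisation (my own sketch, not from the original post), the parity of that final byte can itself be computed without a per-bit loop, by XOR-folding its two nibbles and looking the result up in a 16-bit constant whose bit i is the parity of i; the i = l branch above could then simply return parity_of_byte c.

let parity_of_byte b =
  (* fold the high nibble onto the low one, then index into
     0b0110_1001_1001_0110 (0x6996), whose bit i is the parity of i *)
  let x = b lxor (b lsr 4) in
  (0x6996 lsr (x land 0xF)) land 1 = 1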

by Philippe Wang at Thu, 26 Sep 2013

(Exercise) Are there more 1s or 0s? (Philippe Wang)


The challenge is this:

Provide a program that will determine, given a set of 1s and 0s, if there are more 0s or more 1s.

So the first question that comes to my mind is about the representation of that set of 1s and 0s. Is it a linked list, a double linked list, an array, a tree, a string, or something else?

If it's a list (singly linked or doubly linked actually doesn't matter much), it's going to be pretty straightforward: just read the list and use a counter. Yes, a single counter is enough: you can add one to the counter when you meet a 1, and subtract one when you meet a 0. At the end, if your counter has its initial value, then you have the same number of 1s and 0s. If it's less, then there are more 0s, else there are more 1s.

Let's declare a type for the result.

type result = Equal | More_zeros | More_ones
let count l =
  let rec loop counter = function
    | [] -> counter
    | 1::tl -> loop (counter+1) tl
    | 0::tl -> loop (counter-1) tl
    | _::tl -> loop counter tl
    (* you might want to debate whether you should stop 
       when you get something else than a 0 or a 1. *)
  in
  let r = loop 0 l in
  if r = 0 then
    Equal
  else if r > 0 then
    More_ones
  else
    More_zeros    

Well, what's the risk with this program? The risk is that the integer counter overflows, of course! If you have a very long list of 1s (or 0s) only, you may get things just wrong. However, in OCaml, in reality, there's not really a chance that the integer overflows because of the length of a single linked list, especially if you're using a 64-bit OCaml, on which the greatest integer is 4_611_686_018_427_387_903 (about 4×10^18). There's really a long way to go to get such a long list, because you would basically need to allocate more than about 32×10^6 terabytes at once, since a linked list of integers is made of blocks that have 2 cells each (one for the integer, one for the address of the next cell), each cell taking 64 bits (or 8 bytes).

But then, what if you don't have linked lists but some stream that gives you a very large number of 0s and/or 1s? Well, to begin with, counting from 0 to 4×10^18 takes a really long time. If your machine can count from 0 to 10^9 in a single second (that would mean your machine is very fast), it would still take 4×10^9 seconds, which is about 4000000000/(60*60*24*365) years. Oh, that means about 126 years! So let's just assume that a 63-bit signed integer is enough for us. And if you really can't assume that for some reason, you can always implement 128-bit signed integers quite easily, and if you don't know how to do that or if you're too lazy to do it, use the Big_int module.

But let's go back to the representation of those 0s and 1s. I'd like to make the computation as fast as possible, so I'll put those 0s and 1s in a very compact representation. Each 0 and 1 will now only take one bit in memory (with a possible constant overhead for the whole). For that, let's use OCaml's strings, which are basically arrays of bytes. The longest string I can have on my machine can hold 1_152_921_504_606_846_904 bits (I know that because I multiplied Sys.max_string_length by 8), and that's a lot (more than 10^18).

Now say we want to know whether there are more 0s or 1s as fast as possible. How do we do that?

We don't want to count all 0s and 1s bit by bit, because that would have quite a high cost! Indeed, we would have to get each byte, and for each byte we would have to read each of its 8 bits (that can be done) one by one. We don't want to do that.

Instead, since we have bytes, we can conveniently allocate an array of size 256. Each cell of that array will contain the right number to add to the counter. This way, we can read a byte and get its contribution (its number of 1s minus its number of 0s) in O(1).

(* this table is computed only once *)
let table =
  let number_for_a_byte b =
    let r = ref 0 in
    for i = 0 to 7 do
      if (b lsr i) land 1 = 0 then
        decr r
      else
        incr r
    done;
    !r
  in
  let a = Array.make 256 0 in
  for i = 0 to 255 do
    a.(i) <- number_for_a_byte i
  done;
  a

Then let's abstract away from the means of reading new 0s and 1s, by assuming we'll be provided with a function f that, given (), returns 8 bits as a value of type char, and raises the exception End_of_file when it has no more bits to give.

let more_zeros_or_ones f =
  let c = ref 0 in
  begin try while true do
    c := !c + table.(int_of_char(f()))
  done with End_of_file -> () end;
  if !c = 0 then
    Equal
  else if !c > 0 then
    More_ones  (* the table adds 1 per 1-bit and subtracts 1 per 0-bit, so a positive counter means more 1s *)
  else
    More_zeros

Note that int_of_char is zero-cost (i.e., it "disappears" at compile time because int and char sort of share the same memory representation). If you want better performance, you should inline f, provided that you know what it is (you may want to check whether the compiler does the inlining itself first, just in case).
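
For example, here is a usage sketch of my own (assuming the table and more_zeros_or_ones definitions above): f can simply be a closure that walks a string byte by byte.

(* wrap a string as the byte source expected by more_zeros_or_ones;
   raises End_of_file once the string is exhausted *)
let of_string s =
  let pos = ref 0 in
  fun () ->
    if !pos >= String.length s then raise End_of_file
    else (let c = s.[!pos] in incr pos; c)

let () =
  (* "\x0f\xf0\x01" holds 9 ones and 15 zeros, so this prints "more zeros" *)
  match more_zeros_or_ones (of_string "\x0f\xf0\x01") with
  | Equal -> print_endline "equal"
  | More_zeros -> print_endline "more zeros"
  | More_ones -> print_endline "more ones"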

Also, you may want to use a table with a size larger than 256 if you have a lot of memory but I'm not so sure you'd gain performance unless you use nasty tricks to read larger integers from a string. Then again, you may not end up using a string, in which case you have to think with the whole problem in mind.

by Philippe Wang at Wed, 25 Sep 2013

Liveblogging OCaml Workshop 2013 (SRG Syslog)


I’m here bright and early at the OCaml 2013 workshop at ICFP with Heidi Howard, Leo White, Jeremy Yallop and David Sheets!  There’s a packed session today with lots of us presenting, so we’ll swap over editing this post as the day progresses.  I also need to pop out to attend a panel on the future of Haskell so I’ll be missing the bindings session, sadly!

Accessing and using weather data in OCaml

Hez Carty from MDA Information Systems on doing large-scale weather data processing.

He started with OCaml in graduate school when he was looking for something better than Python; they are building systems that need to run for many years with high stability.  He's not from a formal CS background but from the physical sciences, and found that the static typing made it easy to explore the language features.  The native code output is also important, and he finds predictability very important in day-to-day engineering.  The simple FFI also made a big difference.

They do weather data retrieval and use HTTP/zeromq as a message bus to do language interop, and a workflow management tool to handle dependency graphs.  They take weather models from all these sources, and do various analyses (principal component analysis, probability models for precipitation via Monte Carlo simulations with first-order Markov chains).  More generally, there is a bunch of processing for climatology and data extraction and insertion.

A NAM forecast plots the geo data of the USA, and overlays semitransparent layers (e.g. clouds) that are stacked up to get a fake satellite view of the earth, and then there are simulated radar returns where the model (e.g.) rains.  In addition to the structured formats, there is raw streaming data from sensor stations that send back raw binary data such as precipitation events, which is written as a small OCaml library to parses them into more structured formats.

The big data formats are HDF4 and GRIB, which are binary data formats. There are mature C libraries to parse them and lots of existing analysis tools (e.g. wgrib, which can extract lat/longs from a big file).  However, the tools aren't great for doing larger-scale analysis, as shelling out all the time for thousands of points is not ideal — therefore they bind to OCaml to do the analysis in-house.

The GRIB bindings use int, float and float arrays, and the bindings are entirely hand-written (I'm looking forward to the ctypes talk! –anil). They expose only an OCaml-friendly interface that is safe for callers.  The speaker now shows several nice graphs from CloudSat tracks from earth orbit, and how they take intersecting profiles to deduce things about what multiple satellites are seeing.

They’re mixing lots of different tools and satellite sources — derived information from teleconnections (air surface pressure and how it varies in terms of oscillations between the north and south Atlantic).  All the analysis is written in OCaml, and he had to work with several scientists who didn’t have access to their initial source code in many cases, so the experience of working with his fresh system was very positive.

Challenges:

  • standard issues when interfacing with C (type mismatches between C and OCaml in particular, and the question of sharing vs copying).  The worst failure mode is silent failure with bad data, which leads to bad science!
  • "Getting used to the functional way of life" can be tough, as is knowing when to balance elegance against the speed benefits of not copying hundreds of megabytes of data around.
  • Development environments other than Emacs and Vim are poorly supported.

The overall project has been very successful, and the HDF4/GRIB API bindings have been used to process 1TB+/day for over 5 years now.  It's based on Bigarray and mmap, and the FFI makes it quick to get up and running.

He finds OPAM with local repositories fantastic, and used GODI and odb/oasis-db before that.

Q: how did he do the cool plots?

A: plplot bindings on github

Q: Anil: did you move from Fortran

Q: Nate Foster: terabytes of data makes me wonder about GC?

A: going through and folding over data generates huge amount of garbage and slows things down considerably. If the project is one-off, he tweaks the GC or manually call the GC more often. Really depends on length of code.

Q: The data processed is huge, so what was the parallel computation approach?

A: The task distribution system doesn’t need a lot of parallelism, as he can work on one file at a time or chunk the files manually and aggregate the results.  He’s looking forward to doing some semi-shared memory parallelism.

Q: Pierre Chambart: Is the numerical code in OCaml or C?

A: LAPACK in OCaml, but the Monte Carlo is pure OCaml and interpolation is in pure OCaml too.  Unless speed is absolutely required, it’s written in OCaml.

Frenetic Network Controller

Arjun Guha on their multiyear project to program software-defined networks. When he started on this, he thought that networks were just a series of tubes! The reality is very different though, as there are numerous intermediate boxes (firewalls, routers, wireless auth). The real problem is that each of these intermediate boxes runs specialized, vendor-specific software and is configured independently via some proprietary CLI. They're difficult to configure and difficult to statically reason about.

SDN helps with this by introducing standardized programmable switches that can deploy in-network features, and a logically centralized controller (a big beefy server) that enables reasoning about whole-network behaviour. SDN is a hot topic right now (lots of commercial startups, 200+ people at HotSDN, generally very busy).

There are several Openflow controllers out there: Java (Floodlight), Python (Pox), C++ (Nox) and Haskell (Nettle), but Arjun introduces a novel OCaml programmable switch!

Openflow works via a controller “faulting” in new packets that are routed to the controller when unknown packets show up on the wire. The source code consists of a generic Openflow packet parser (its a binary protocol), and a generic packet parser (e.g. TCP/IP) and a routing engine.

First library: ocaml-packet, which deals with deserializing TCP/IP/ARP/ICMP/Ethernet/802.1Q and only depends on cstruct by Anil et al. (shows an example of a C struct for IPv4); the nice thing is that they get network byte order for free (thanks to Pierre Chambart's ocplib-endian, of course).

ocaml-openflow uses serialization for OpenFlow 1.0 and 1.3, and is based on a (cleaned up) fork of the mirage-openflow library, using ideas from Nettle about to structure the code. OpenFlow 1.3 is less complete currently but is still quite usable.

The runtime systems for OpenFlow 1.0 and 1.3 listen for TCP connections from switches, perform the OpenFlow handshake and keep connections to switches alive.

The source code uses Lwt and so exposes the monad to external users, e.g.:

type t
val connect : Lwt_unix.file_descr -> t option Lwt.t
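
As a rough sketch of how a caller might drive that interface (only the signature above is from the talk; the handler plumbing here is hypothetical):

open Lwt

(* Try to complete the handshake on an accepted socket and, if it
   succeeds, hand the switch connection to a caller-supplied handler. *)
let handle_switch handler (fd : Lwt_unix.file_descr) =
  connect fd >>= function
  | None -> Lwt.return ()         (* handshake failed; drop the socket *)
  | Some switch -> handler switch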

They have a tutorial for non-OCaml hackers, and found that teaching OCaml, OpenFlow AND Lwt all at once was too much for new users. Therefore they built a new controller called Ox that hides many of the details of Lwt threading, while still using Lwt and ocaml-openflow internally.

Therefore they can build simple OpenFlow tutorials using Ox, while advanced controllers such as Frenetic can be built directly using the low-level libraries. Good for beginners and weathered OCaml hackers alike!

The problem with the controller seeing every bit of data is that it gets flooded by packets if it doesnt insert rules into the switches itself. Every switch has a flow table that has priorities, patterns and actions, and tries to take action without bothering the controller. At start of day, the controller does see everything, but eventually flows settle down. The problem they are solving is avoiding having two copies of the program on both the controller and the switch, and figuring out how to deal with race conditions across the two. The simple, logical structure of the OCaml program gets torn apart if they try to write these two parallel logic rules by hand.

They deal with this in the Frenetic network controller. This is a DSL for programming OpenFlow networks, and it has boolean predicates, several policy composition operations and compiles to OF flow rules. It implements ideas from several papers (ICFP/POPL/SIGCOMM/PLDI) and is fully open source on GitHub. (frenetic-lang/frenetic).

They made a transition from Haskell to OCaml for a few reasons. They specified several parts of the Core compiler and runtime in Coq, and found Haskell/Coq extraction very painful (specifically, they had to abandon Coq functors, which had helped them discover key algebraic properties for their work on Kleene algebras). They found it easier for their collaborators to use OCaml than Haskell in general too.

Lwt programming is mind-boggling for beginners, although the syntax extension does help, and the lack of multicore is a future problem as the controller needs to scale up. Overall, they find that hacks at any layer are possible now and lots of projects have spun up!

Q: multicore is a common criticism, but could you give an example of how you would actually solve this if you had shared memory vs a fork model?
A: There are several networking algorithms in the literature, and it's a barrier for consumers. Nate notes that they could write one by hand but there's a time issue there, and they would rather have it happen automagically.

PHP Program analysis at Facebook

Pfff is all about dead-code removal, test coverage, checking for undefined functions and use of undeclared variables, and syntactical grep rules. They also do taint analysis via abstract interpretation, type-driven flow control and eventually (via Monoidics) a separation logic model (they use OCaml too).

This talk is about static analysis of a huge codebase (5m lines+ of PHP code is scary!). Since he had a lot of big monitors, he built the tools to do analysis over huge codebases. He demonstrates this on the OCaml compiler itself (camlp4 is huge! boo!!! -anil). They can also zoom in in realtime to each box and eventually see the content of the file itself!

The goal is NOT to build an editor, but to analyse and explore large code bases. It opens and editor if required any time. The idea is to transform semantic program analysis into a visual formt that’s more easily digestible. Just like in Google Maps where you see the big important roads and only the details when you zoom in. He uses this tool all day on his own codebase too, and gradually realised that when you want to learn a new codebase, you want to start to learn the software architecture first, and then want to figure out the dependency graph before seeing details.

He then shows an interactive grid map that explains the dependencies of whole subdirectories of the OCaml compiler, and drill down to understand what they’re for.

Audience Q: did you consider a flare graph? A: others have done edge graphs and graphviz etc, and none of these scale to quite the heights that this particular grid does. The ordering of the grid is also the transitive order of dependencies, so it starts with the simple dependencies (stdlib) and ends with the toplevel (which depends on everything). You can spot, for example, the OCaml dev rule that things in stdlib are only used by the compiler.

With good software, the top right of the grid is full, which is forbidden by OCaml (no cyclic module deps). With other software, the situation is far worse (audience laughter).

There are many other tools (CodeQuery, stags, sgrep/spatch and scheck).  There are plugins for PHP, OCaml (using the cmt options), Java (using Joust) and C/C++ (using clang), getting all the semantic information from varied source code.  There is a plan to move it to the web (from a GTK interface) by using Ocsigen.

 

Conclusion: CodeMap is a scalable semantic source code vis/searcher for huge codebases, and codeGraph visualizes dependencies.   Future work is to minimize backward deps.

All on Github at github.com/facebook/pfff.git

Q: can you insert other metadata such as Linux perf output?

A: yes! (I think)

Extension points: an alternative to camlp4

Leo White talking about alternatives to camlp4. Why? Camlp4 is very complex, and much of that complexity comes from its support for an extensible grammar. A wg-camlp4 working group was formed to figure out how to evolve common uses towards a more easily parsed standard, to reduce the day-to-day dependency on camlp4. After 400 emails, a consensus emerged.

sexplib as an example: the old version was

type t = {
  foo: int default(42);
  bar: float;
} with sexp

The annotated version is:

type t = {
  foo: int [@default 42];
  bar: float
} [@@sexp]

This is annotation data that is passed straight through to the transformer, with no extensions to the grammar. Thus the sexplib transformer can understand what [@default 42] means, and editors can easily hide or fold annotations without having to understand their details precisely.
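
For reference (my sketch, not from the talk), the transformer's job is then to generate the usual sexplib converters for t, i.e. an interface along these lines:

(* Sketch of the generated interface, following sexplib's naming
   convention; the exact generated code is not shown in the talk. *)
val t_of_sexp : Sexplib.Sexp.t -> t
val sexp_of_t : t -> Sexplib.Sexp.t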

The compiler ignores any AST extension fragments that remain after all the transformers are done.

sedlex

[%lexer
  match foo with
  (Range ('a','z') | Range(1000,1500)), 65 -> Foo
  | Star (Range('a','z')) -> Bar
]

These are genuine AST nodes and so if these aren’t handled by the transformer, an error will result. Another example is

proper foreign syntax, via COW (Caml on the Web):

let world = "world" in
let html =  [%html {|<h1>Hello $str:world$!</h1>
|}]

There’s also a shortcut convention for the syntax, such as:

match%lexer foo with
 {|

There's a shorthand form to avoid having lots of brackets due to deeply nested lets.

[%lwt let x = f () in
 [%lwt let y = g () in
  Lwt.return (x, y)]]
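
With the shorthand (as I understand the convention being described, let%ext x = e in e' abbreviates [%ext let x = e in e']), the nesting above collapses to:

(* Sketch of the shorthand form of the nested example above, assuming
   an Lwt ppx rewriter that implements this convention. *)
let%lwt x = f () in
let%lwt y = g () in
Lwt.return (x, y)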

These have all been implemented by Alain Frisch and merged into trunk (in OCaml 4.02.0dev). Thoughts and feedback are very welcome on the wg-camlp4 list.

Q: how does this affect the quality of error messages?
A: Leo -- do you mean error messages from the extension itself?
Q: what if there is an error in the extension?
A: The 42 that was in the default has an AST fragment and so the transformer will transplant the location. A sensible transformer will therefore be nonlossy.

Q: can you chain transforms?
A: yes, that's why we are going with an AST fragment rather than a string, so chaining is easier. If you've got some macro expansion transformer and you want to use it with the sedlex example, they are executed in the order they are given on the command line. The next transformer will just see the resulting AST fragment, and they should therefore compose better than camlp4 extensions have in the past.

Q: do you have ideas about hygiene as Scheme does?
A: Not particularly in this case, although Leo has some ideas for future systems.

Q: Ashish Agarwal: do existing extensions need to be reimplemented?
A: The good thing about all these existing extensions is that they already have a parser, so you can take your existing camlp4 implementation and mechanically translate it into an extension_points version!

Q: Yaron: is anyone thinking about good libraries (hygiene etc)?
A: Alain Frisch has merged ppxtools into OPAM stable (it depends on 4.02.0+).

Q: Will this improve performance over camlp4?
A: Parsing is an expensive process and this still does marshalling. The cost is often that we are shelling out to this other process, so we still need to look into improving performance in other ways.

GPGPU programming with SPOC

M Bourgoin presents his PhD work on SPOC.

GPGPU hardware is odd: several multiprocessors with dedicated memory, connected via PCI-E, with data transferred between the GPU and the host using DMA. Current hardware has 300-2000 cores vs 4-16 CPU cores (wow). However, this demands a very specific programming model to take advantage of all this hardware.

He shows an example of vector addition in OpenCL:

__kernel void vec_add(global double *c, global const double *a,
                      global const double *b, int N)
{
  int nIndex = get_global_id(0);
  if (nIndex >= N)
    return;
  c[nIndex] = a[nIndex] + b[nIndex];
}

To actually use this though, you need a massive program to set up the kernel and execute it. It does the transforms, initialization, scheduling and so forth.

OCaml and GPGPU complement each other, since we have sequential high-level OCaml and highly parallel and low-level GPGPU. So the idea with SPOC is to use OCaml to develop high level abstractions in the GPGPU and program it from OCaml.

SPOC is a stream processing engine in OCaml that emits Cuda but lets it be more hardware independent.

let dev = Devices.init ()
let n = 100000
let v1 = Vector.create Vector.float64 n
let v2 = Vector.create Vector.float64 n
let v3 = Vector.create Vector.float64 n

let k = vector_add (v1, v2, v3, n)
let block = { blockX = 1024; blockY = 1; blockZ = 1 }
let grid = { gridX = (n + 1024 - 1) / 1024; gridY = 1; gridZ = 1 }

let main () =
  random_fill v1;
  random_fill v2;
  Kernel.run k (block, grid) dev.(0)
  (* ... print the result ... *)

SPOC takes care of transferring the kernel to the GPU and back and handling the scheduling and exchange of data. Vectors are very similar to Bigarrays in that they are external memory.

We want: simple to express, predictable performance and extensible libraries, and for current high performance libraries to be usable more safely.

Two solutions: interop with Cuda/OpenCL kernels (better performance and more compatible, but less safe), and a DSL in OCaml called Sarek, which the talk focusses on.

Sarek vector addition has an ML-like syntax; the language is monomorphic, imperative, statically type checked (with type inference) and has a toplevel. It is statically compiled to OCaml code and dynamically compiled to Cuda and OpenCL.

SPOC+Sarek achieves 80% of hand-tuned Fortran performance and SPOC+external kernels is on par with Fortran (93%). It’s type safe, 30% code reduction, memory managed by the GC and there are fewer memory transfers. This is ready for the real world and has great performance! Available on OPAM.

Great performance, portable, great for GPU and for multicore CPU and nice playground for future abstractions.
Who can benefit?

OCaml programmers -> better performance
HPC programmers -> much more flexible and extensible than current state of the art.

Future work: Sarek needs custom types, function decls, recursions, exceptions, and to build parallel skeletons using SPOC and Sarek.

Why not use Sarek for multicore? Well, here’s the results!

  • ParMap: data parallel and similar to OCaml mapfold
  • OC4MC: posix threads, compatible with current OCaml code
  • SPOC : GPGPU kernels on a CPU, mainly data parallel and needs OpenCL.

Simple program to calculate power series and matrix multiplication.

Power series: OCaml 11.14s, ParMap 3.30s, SPOC <1s
Matrix multiplication: OCaml 85s, OC4MC 28s, SPOC 6.2s

Significant improvements using SPOC — 11x for the power series and 12x for the matrix multiplication. Running on a quad-core Intel i7.

SPOC is available via OPAM and compatible with x86_64 on Linux and MacOS X, and speaker would love feedback.

Q: Anil: how hard is OpenCL to install?
A: easy if you know your hardware model and go to the vendor website. Harder with multiple GPUs though.

A: if you want to target CPUs, you need to target OpenCL and not CUDA.

High level performance optimisations for OCaml

Pierre Chambart at OCamlPro (funded by Jane Street). OCaml is fast, but not an optimizing compiler (it has predictable performance, good generated code, but what can we do to make it go faster?).

A small modification to high-level code shouldn't influence the low-level bits.

let f x = 
  let cmp = x > 3 in
  if cmp then A else B

let g x =
  if x > 3 then A else B

The second bit of code is faster than the first here since ‘g’ is spotted by the compiler peephole optimizer.

let g x = 
  let f v = x + v in
  f 3

Abstract code should be compiled less abstractly. Above, we want to inline f, but a closure is still allocated per call unnecessarily. But we also don't want the compiler to be too smart and lose predictability!
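
As a hand-written sketch (not the compiler's actual output), the ideal result of inlining f and folding the constant in the example above is simply:

(* What an inlining pass would ideally reduce g to: no closure, no
   intermediate allocation. *)
let g x = x + 3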

Currently compiler pipeline has a parsetree – typed tree – lambda (untyped) – clambda (lambda+closures) – cmm (simple C like) and then mach (instruction graph similar to llvm) and then lin (assembly like).

Details: typed tree to lambda (construct elimination) – lambda (high level simplication such as reference elimination and pattern matching generation) – lambda to clambda (inlining, constrant propagation and book keeping in an intricate recursive loop) – clambda to cmm (unboxing, lots of peep hole checks for boxing/unboxing and tagging/untagging) – cmm to mach (instruction selection and some more peephole) – mach (allocation fusion, register allocation and scheduling)

Where do we do high-level optimizations then? For example, where do we do inlining (since it allows other transformations to apply)? The typedtree is too complicated; lambda is where we want inlining (simpler with closures); clambda is too difficult to improve (and Pierre did try); cmm is good for local optimization but not global; and mach is architecture-specific.

They are adding a new intermediate form between lambda and clambda known as flambda. This needs a highlevel view of values, simple manipulations, explicit closures and explicit value dependencies. lambda to flambda introduces closures, and flambda to clambda is mainly book keeping (and preparing cross-module information). The magic for optimization will be in flambda-flambda passes, and not in the flambda to clambda layers.

He shows several examples of inlining, simplification and dead code elimination, and why it's significantly easier in the flambda form for simple micro-examples. He also shows lambda lifting!

Change the performance model. Now its WYSIWYG, but we want some sort of understandable compile-time evaluation. For example, consider a map function as a metaprogramming function that you want to execute at compile time without all the intermediate closure allocation.

Add build_test to your OPAM packages to make it easier to test the compiler. Note that Obj-using code is very likely to break if you do certain mutations behind the compiler's back, and dead code elimination might happen if you use a value behind the compiler's back.

Q: SimonPJ: why is it easier to do closure conversion first, as it’s a lot easier in GHC to do the opposite. There are now two ways to get a value in scope doing it the other way.
A: In Haskell you have everything in your closure, since you get the information from the closure.
Q: Think of lambda calculus with let
A: We do flow analysis to get around this!
Q: This is quite complex. Is it whole program analysis?
A: Restricted to a single compilation unit, and a few other cases, not cross-module yet.

Q: Yoann: if the prediction is "it's always faster" then why do you care about predictability?

Replacing the format string hack with GADTs

Benoit Vaugon talks about how to improve the typing of format strings in OCaml by using GADT encodings. He starts by showing basic examples of the Printf module, and advanced examples that include the "%a" callback, date scanning and hex dumping.

The OCaml type checker has an inferred type that encodes the details of the format string in a "format6" type with lots of type parameters describing the format string contents.
During type checking, the literal string is parsed and there is manual inference of the format6 type parameters. At runtime, format strings are represented by the string from the source code.

The printing functions parse the format, count parameters and accumulate arguments. They then extract and patch subformats, and finally call the C printf function. Scanning is more complex but broadly similar.

There are quite a few problems with this. There are 5 (!) format string parsers (two in Printf, two in Scanf, one in the type checker) and there are lots of little incompatibilities as a result. For example

Printf.printf "%1.1s" "hello"

results in an Invalid_argument exception despite the static type checking, with a useless error message.

There is a weakness in the type checker too, for example

Printf.sprintf "%s.+f" 3.14

results in the format string being printed.

You can also segfault from there, such as

Format.printf "@%d%s" 42 "hello"

will segfault.

Speed is also a concern, as parsing the format at runtime is slow, and reparsing is required by another slow C printing function. Lots of memory allocation is required.

The new implementation sorts all this out by using GADTs to represent the formats. The format6 type is now concrete and not magically predefined. Formats also become statically allocated rather than being reparsed and passed around dynamically.

The new implementation (which he shows) is a fairly epic GADT but quite regular when broken down into pattern clauses and examined with respect to the old implementation.

Issues: the evaluation order changes slightly, as for printing functions the parameters are accumulated before printing. string_of_format is implemented by "%identity", and in the new implementation we need to change the representation (either regenerate the string from the GADT or implement formats as a tuple).

There's now only one format parser (down from 4-5) for the standard library and the OCaml type checker.

type ('b,'c,'e,'f) fmt_ebb = Fmt_EBB:
 ('a,'b,'c,'d,'e,'f) CamlinternalFormatBasics.fmt ->
 ('b,'c,'e,'f) fmt_ebb
val fmt_ebb_of_string : string -> ('b,'c,'e,'f) fmt_ebb
val type_format: ('x,'b,'c,'t,'u,'v) format6 ->
   ('a,'b,'c,'d,'e,'f) fmtty -> ('a,'b,'c,'d,'e,'f) format6

(This looks verbose, but it isn't really exposed outside of the compiler.)

Another issue is the “%(..%r..%)” construction, since we need to include a proof term of the number of “%r” occurrences. The encoding for this is quite complex.

Performance of the new system is significantly better across the board. A hello world goes from (230ns to 55ns) timing and (732 to 24 bytes allocated) for example.

Q: Oleg did it already in Haskell (ha ha — editor). He notes that six parameters aren't needed, as the GADT is the interpreter over the parameters.
A: yes, we can remove them if we split the Printf and Scanf formats, and there is also a format4 which removes several parameters. We also have a problem with error messages, since all the type parameters make things really complicated.
Q: Oleg: the work of interpreting the stream is pushed into the GADT, but if you write an interpreter over the GADT then we have simpler external types.
…discussion moves into offline discussion.

Dynamic types in OCaml

Gregoire Henry wants to do dynamic types in OCaml.

  • Structural type introspection: generic I/O primitives, type-safe (unlike Marshal) and typed (not written in Camlp4) — these are polytypic functions.
  • Normal type introspection, for dynamic values (a k/v store or a DSL with dynamic typing), or extensible polytypic functions for abstract types.
  • This would also give us a common type representation for (e.g.) Eliom services that will be converted to Javascript, or for FFI libraries, and to let the debugger explore the heap with exact typing information.

There probably isn't a single representation that fits all these use cases: we need to preserve abstraction when desired, but also to break abstraction when desired (e.g. by the debugger or a generic printer, but not by accident!), and it shouldn't impose a big performance hit.

Gregoire gives a description of a polytypic printing function

let rec print (type t) (ty : t ty) (v : t) =
  match head ty with
  | Int -> print_int v
  | String -> print_string v
  | List ty -> print_list (print ty) v
  | Sum desc ->
    let (name, args) = sum_get desc v in
    print_string name;
    ...

We define a predefined type "t ty" and a syntax "(type val t)" for type expressions of type "t ty". We then introduce implicit type arguments: optional arguments instantiated at the call site with the dynamic representation of the expected type.

val print : ?t:(typ val 'a) -> 'a -> string
let print ?(type val t) (v:t) = ...

type t = R of int * int | ...
let x = R (1,2)
let () = print x (* implicit arg is (type val t) *)

This is experimental syntax, but it shows how we can keep the representation abstract while the printer can still operate over it, since the implicit argument passes the printer around.

The main problem is now how we mix polytypic function and abstraction, without always printing “abstract” instead of the real underlying structure. One solution is an extensible polytypic function.

module M : sig
  type t
  val x : t
end = struct
  type t = R of int * int
  let x = R (22, 7)
  let () = register_printer (external type val t) (fun x -> ... )
end

let () = print x

We register a printer internally in the module, without exposing its internal type externally, but still let it be printed. The internal and external types are different though. We do need to figure out a canonical name for types defined outside the current compilation unit (its absolute path).

By default though, abstraction should consistently introduce new nominal types, but we're not sure how to reference all of the external names of a given type within its initial compilation unit (open problem). A pragmatic approach to solving this is the manual or semi-automatic creation of runtime "type names".

In conclusion, they have a GADT for structural introspection and a RTT representation with global names.  There is also a type-constructor index association table.  Implicit type arguments (lightweight syntax for calling polytypic function, and explicit type parameters for polymorphic functions).

Q: what’s the gain of putting this into the compiler?

A: you may miss language corners, such as firstclass modules or functors, as your “userspace” OCaml code has to pass a type witness into a functor or firstclass module.  Compiler support makes this passing much easier.  The other big benefit is clear error messages from this compiler integration in the long term.  It’s not currently easy for beginners, and this is very important to get right if this becomes a supported feature.

On variance, injectivity and abstraction

Jacques Garrigue speaks about an interesting bug in the type checker that needs some language changes. There was PR#5985 which was a serious type checker bug that lost injectivity.

module F(S: sig type 'a s end) = struct
  include S
  type _ t = T : 'a -> 'a s t
end
module M = F (struct type 'a s = int end)
let M.T x = M.T 3 in x
- : 'a = <poly>    (* the type is lost! *)

After expanding "s", the definition of "M.t" is actually "type _ t = T : 'a -> int t", but "'a" is not marked as an existential.

In order to protect against this unsoundness, all variables in type declarations must be bound, either by appearing inside the type parameters or be existentially bound (i.e. only present in GADTs). Inside type parameters, these variables must be injective.

OCaml doesn't check injectivity directly, since it relies on variance inference (since we have subtyping). The variance of a parameter is either explicit, for abstract and private types or constrained parameters, or it is inferred from its occurrences otherwise.

Constrained type parameters have been present since the very start.

type 'a t = T of 'b constraint 'a = 'b list

The rules for checking variance in this case become much more complicated, since the variance of constrained parameters must now be explicit. The variance of type variables inside constrained parameters must also be weaker than or equal to the variance inside the body of the definition.

type +'a t = T of 'b constraint 'a = 'b list (* 'b covariant *)
type 'a r = 'a -> int                        (* contravariant *)
type +'a t = T of 'b constraint 'a = 'b r    (* Fails *)
(* 'b is contravariant in parameters but covariant inside *)

In OCaml though, the variance of a parameter is allowed to be weakened through abstraction.

module M : sig
  type +'a u 
end = struct 
  type 'a u = int
end

This is correct for the types, but becomes incorrect when used for type parameters.

module F(X: sig type 'a r end) = struct
  type +'a t = T of 'b constraint 'a = 'b X.r
end
module N = F (struct type 'a r = 'a -> int end)

Even when we assume that r is invariant, ‘b is inferred as invariant from the parameter of t, which overrides the covariance of the body. Meanwhile in the module N, the variance is now wrong.

How do we fix this? We need to refine the definition of variance: traditional variance subsumption defines a lower bound on the variance, but we need to add some upper-bound information too, to be sure that parameters cannot have a stronger variance.

(technical explanation of how to compose variances omitted here)

In OCaml 4.01, full variance inference is done using 7 flags (the last is a special case required for principality). Some uses will no longer type check so programs may be rejected for safety.

In the future, Jeremy Yallop, Leo White and Jacques would like to add injectivity annotations for abstract types.

type #'a s
type _ t = T : 'a -> 'a s t

Add new types for isomorphic abbreviations, as Haskell does

module M : sig type #'a t val f : int -> ['pos] t end = struct
  type 'a t = new int
  let f x = (abs x : int :> 'a t)
end

This is similar to private, but subtyping will work in both directions — this is useful for efficiency and for encoding runtime types, and it allows coercions to be delayed until the signature rather than appearing in the body (in the example above, the coercion in "f" could be omitted thanks to the signature).

Another problem is that one cannot prove the uniqueness of abstract types, and we don't know whether an abstract type is contractive (so even if you use rectypes, the following functor can never be defined):

module Fixpoint (M: sig type 'a t end) = struct
  type fix = fix M.t 
end
Error: the type abbreviation fix is cyclic

One also never knows whether an abstract type may be a float, so the float array optimization may be mismatched.

OCamlot: lots of OCaml testing!

(will post David’s slides on the OCaml Labs website since it’s one of our talks. Will edit this post when that happens).

Merlin: editor support for OCaml

Merlin is a sophisticated editor assistant for OCaml that provides autocompletion in vim and emacs. They leap straight into an interactive demo that’s awesome: autocompletion, module introspection when writing a functor, and so on.

Unlike other tools, this doesn’t rely on a CMT file, so code doesn’t need to compile (except for source browsing). Your project can have type errors when using Merlin!

It integrates with the ecosystem and has direct support for Findlib. There is no direct support for camlp4 however, with specific support for some extensions. If you want a new shiny extension then talk directly to them

There is build system support for running either omake or Jenga in polling mode (for iterative recompilation). Omake rules can inform Merlin to get a consistent state of the project!

A typical session: parsing is done in two stages, and the OCaml parser has been modified to be much more resilient to errors. The usual typing rules are applied, but the result is passed on to a more "relaxed" typer in case of errors, to give some support for incremental parsing.

Coming soon is documentation (ha!) as well as retrieval for ocamldoc comments to further augment hints, as well as even more error recovery heuristics. Some of the patches will also be submitted upstream to improve error handling (by accumulating them instead of printing them directly).

It’s available on OPAM under an MIT license via “opam install merlin” and thanks to Jane Street for letting them work on it during an internship there, and all the bug reporters!

Q: Have you considered running Merlin on a remote system and speak the same JSON protocol?
A: They haven’t, but it really shouldn’t be a problem.

Q: Julien V: do we have to build first?
A: No, you don't have to build first.
Q: What happens when you change branches in a project
A: For project-wide information you need to compile and rebuild.

Q: in the second pass of type checking, what does "relaxed" mean?
A: When typing fails for some branch of the AST, in the simple case we introduce a fresh type variable and attempt to do unification with the result of this. This will never be sound, but it lets Merlin make progress and give SOME feedback to the user.

Q: is this a new type checker?
A: No its the original type checker with augmented error handling paths to do more unification.

Q: is the relaxed checking really worthwhile? It wasn't a lot of use in gcc, since you get a lot of follow-on errors that are meaningless.
A: lots of discussion, opinions split, it can be disabled!

Understanding memory behaviour of programs

Cagdas B is doing a PhD on memory profiling. He wants to study the behaviour of OCaml programs, work on decreasing memory footprint and generally reduce GC pressure in real OCaml programs.

Shows Why3 graph (see photograph) and they’re running nice finegrained analysis over it on real world analyses. How do they generate these graphs?

$ opam switch 4.00.1+ocp-memprof
$ opam install why3
$ OCAMLRUNPARAM=m why3replayer.opt -C why3.conf p9_16

With the patched compiler, OCAMLRUNPARAM will generate a lot of sampled snapshots of the OCaml heap. There's no need to change the code or the compilation options. Then install the analysis tools:

$ opam install ocp-memprof
$ ocp-memprof -loc -sizes PID

The final step analyses all the snapshots. But what is a snapshot? It’s a compressed version of the OCaml heap, with location ids, a graph with pointers, and has saved globals (toplevel modules). We obtain the snapshots by computing a linear scan of all the heap values.

The OCAMLRUNPARAM=m forces the program to generate a snapshot after every major GC, or request the program to generate a snapshot via a HUP signal, or via a new “dump_heap” function that’s been added to the GC module.

It works by reserving some space inside the standard OCaml value block, so there is minimal runtime performance impact except when dumping out snapshots after a GC. There's no space overhead since the header size isn't changed. However, it's only available on 64-bit platforms and it reduces the maximum block size to 64GB. They ran an experiment with OPAM; location identifiers are limited.

There’s one tool based on the snapshot identifiers and the “memprof” tool gets all identifiers from the heap and the corresponding location, and then looks up the right types from a specific location using cmt files (the typed AST).

Plan for this work is to improve the current framework (aggregate information by type and location) and to recover more types, and build more tools around it!

Q: Roshan: one common use case is when grabbing locations, it's boring to see Array.create; it would be nice to see a backtrace of the execution to see where the allocation happened.
A: this is a problem indeed (discussion about the ongoing Mark Shinwell patches too, and previous talk about backtrace analysis from last years workshop).

Benchmarking in Core

Roshan James on microbenchmarking in Jane Street's Core. Precise measurement is essential for performance-critical code. We want to measure the execution cost of functions that are relatively cheap: functions with a short execution time that run millions of times in a tight loop.

The essence of benchmarking is to run a time loop (like Time.now) a lot of times. We need to figure out how to compute a batch size to account for the time it takes to run the actual timing measurement. Similar to Criterion for Haskell, where they loop Time.now to figure out how long it takes to compute the time itself, and then run the benchmark function, and compute a full estimate. This works ok, but it requires density plots to let you visualise the form of the error (bimodal, or something else, for example).

There is a lot of noise in microbenchmarking, such as delayed costs due to GC, and variance in execution times is affected by batch sizes. So if we vary the batch size, the variance changes quite a bit since the GC is running in different intervals during the batches. The GC noise has a periodicity though!

They treat microbenchmarking as a linear regression, fitting execution time to batch size. They vary the batch size geometrically to get a better linear fit (shows a very linear graph). There's now no need to estimate the clock and other constant errors, since they are accounted for by the y-intercept, which captures the constant costs.
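
In other words (my paraphrase of the model described), each observation is fit as T(batch) ≈ c + m · batch, where the slope m is the estimated per-call cost and the intercept c absorbs the clock and other constant overheads.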

Other costs can also be predicted in the same way, such as estimate memory allocations and promotions using batch size too. This also allows measuring a goodness of fit by using R2 and bootstrapping techniques.

Example use of Core_bench is really easy: just

open Core.Std
open Core_bench.Std
let t1 = Bench.Test.create ~name:"id" (fun () -> ())

This is the identity benchmark, for example; just put something more complex in there. The output is an Ascii_table with lots of columns showing interesting results.
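
A minimal way to actually run that test (following Core_bench's documented pattern at the time; the Command.run wrapper is my addition, not from the talk):

open Core.Std
open Core_bench.Std

let t1 = Bench.Test.create ~name:"id" (fun () -> ())

(* Build a command-line runner for the test list and execute it; this
   prints the ASCII table of results mentioned above. *)
let () = Command.run (Bench.make_command [ t1 ])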

Some functions have a strange execution time, and the observation isn't a great fit. Between runs, they also directly query the GC to see how many times certain things have happened (minor collections and so on), and include these along with the batch size in the predictor. With multiple predictors, they get really close to the observed results indeed.

(lots of interesting maths about how they perform runtime cost decomposition)

It's available via "opam install core_bench", and they want to expose more predictors by measuring the effect of live words on performance, and counters for major collection work per minor GC. Accuracy of results is interesting, as least-squares is susceptible to outliers, so they can incorporate the knowledge that the measurement is heavy-tailed.

by Anil Madhavapeddy at Tue, 24 Sep 2013

On the Implementation of OPAMDOC (Philippe Wang)


Early this year, Leo White started the implementation of opamdoc. Then Vincent Botbol worked on it from late May to mid August during his stay in Cambridge.

Now, Vincent's back to studying "computer science research" in Paris, and he continues working on opam-doc when he can on his free time.

I didn't really want to get involved too deep in the implementation of opamdoc (mainly because it takes time). Eventually, since I've ended up doing most of the technical-side implementation of the new ocaml.org web site, I had to eventually get into opamdoc's source code to integrate its output in the website.

If you look at the source code of opamdoc, you'll see there's a bunch of files at the root directory. Well, some of them are inherited from the implementation of ocamldoc, and some are new. I've mostly contributed to generate.ml and opam_doc_config.ml. The former implements the HTML backend of opamdoc, and the latter contains the JavaScript engine that loads the documentation. This blog post is mostly about my experience with those two files.

generate.ml

This big file (about 1.5 kloc) contains the functions to retrieve the information from cmt, cmd, cmti, cmdi and cmi files in order to build the HTML document.

Side note:

The .cmt files were officially introduced in the standard OCaml compiler in version 4.00.0, so it's pretty recent work. Previously, they were produced by a separate tool developed at OCamlPro for TypeRex.

Two reasons why it's not super easy

Well, this can actually be explained in just a few words. To begin with, there are many cases to address. Actually, that's not quite exact: I should say that there are many cases to address and those cases are fairly poorly documented, which is "normal" given that there was probably no particular motivation to put effort into documentation. This is true for the source code of ocamldoc and for its official documentation.

For instance, if you look into info.mli, you can see that the first type definition is:

type style_kind =
  | SK_bold
  | SK_italic
  | SK_emphasize
  | SK_center
  | SK_left
  | SK_right
  | SK_superscript
  | SK_subscript
  | SK_custom of string

and the only documentation for this type is (** The differents kinds of style. *). Well, you can see that there's not really any documentation needed for those SK_... until... you see SK_custom of string. There you go! You have to guess...

It's not that hard when you have to guess a few times, it's a lot harder when you keep having to guess. That's my point.

Well, the other issue is more interesting. At the time ocamldoc was designed and implemented, I bet no one imagined what folks at JaneStreet would do with OCaml! I'm talking about the implementation of their open source library, known as Core. "Core & co" use a lot of include directives. The module Core.Std includes a lot of other modules, which also include modules. If you want to generate a single full HTML page for Core.Std, you'd end up with a huge page. And such a page would contain a lot of information coming straight from other pages, so you'd end up with hundreds of megabytes. Instead of doing so, opamdoc generates only one page per package and one per module. If a module includes another one, then the first will fetch the documentation of the second and there you go. So we only have 8.4MB of HTML for {async, async_core, async_extra, async_unix, core, core_bench, core_extended, core_kernel} in total (well, this number should increase in the future, but linearly, as people will hopefully start documenting all those packages). And that's why we have a JavaScript loader.

opam_doc_config.ml, or docs_loader.js

Since the JavaScript may have to load tens of megabytes of HTML, you have to program some nasty functions and loops... and at some point it does become big enough for your browser to stop responding while it's busy loading your documentation. So there are several solutions to that. The best would probably be to stop writing in JavaScript (and use something else that compiles to JavaScript). But that's for next step. Now, we're trying to make the JavaScript work.

The problem with JavaScript is that basically there is one "event" loop, and all events are happening sequentially or concurrently, depending on how you wrote your JS, but when one event is making the browser busy, the browser is unable to do anything else. That's why your browser may tell you at some point that you have a script making it ill and ask you whether you want it to stop executing that script. One workaround for that problem when you know you ask for a lot of computation time is to divide your big computation into smaller ones. You can use window.setTimeout for that, meaning you transform your recursive calls like f() into window.setTimeout(f, 0) so that f will be called at some point. And if you have iterative loops instead, write them recursive and use window.setTimeout. That's bad. Because then the browser is unable to tell that it's crazy busy... and it's still really busy. So you can increase the value 0 to 1 or 10 or more. But if you increase it too much, it'll become too slow...

Ok, well, don't use JavaScript. It's a nightmare.

We will probably rewrite the script using Js_of_ocaml at some point, mainly because when you write in OCaml, you have static type checking, and that saves so much time!

To be followed...

by Philippe Wang at Tue, 24 Sep 2013

Feedback requested on the OCaml.org redesign (Amir Chaudhry)


There is a work-in-progress site at ocaml-redesign.github.io, where we've been developing both the tools and design for the new ocaml.org pages. This allows us to test our tools and fix issues before we consider merging changes upstream.

There is a more detailed post coming about all the design work to date and the workflow we're using, but in the meantime, feedback on the following areas would be most welcome. Please leave feedback in the form of issues on the ocaml.org sandbox repo. You can also raise points on the infrastructure mailing list.

  1. OCaml Logo - There was some feedback on the last iteration of the logo, especially regarding the font, so there are now several options to consider. Please look at the images on the ocaml.org GitHub wiki and then leave your feedback on issue #16 on the sandbox repo.

  2. Site design - Please do give feedback on the design and any glitches you notice. Text on each of the new landing pages is still an initial draft so comments and improvements there are also welcome (specifically: Home Page, Learn, Documentation, Platform, Community). There are already a few known issues, so do add your comments to those threads first.

by Amir Chaudhry at Tue, 24 Sep 2013

OPAM-DOC with multiple OCaml packages (Philippe Wang)


OPAM-DOC

The tool ocamldoc shipped with the standard distribution of the OCaml compiler suffers from some limitations because it doesn't have access to all the information it needs. Notably, if you want to build the documentation of some package a that depends on some package b, you'll find out that the documentation of the package a will not automatically refer to the documentation of the package b. Well, opamdoc solves this problem and it's very nice.

On this site, which is work in progress, you can find a list of packages with their descriptions. This list is built using opam2web, and each package description links to its documentation. The documentation for all packages is built as a whole, which means that the documentation of a package links to all the other packages on which it depends. Isn't it nice? :)

Install and run opamdoc

opamdoc is still under active development, the current step is finding and fixing remaining issues and packaging.

If you want to use opamdoc now, here's some instructions based on Anil's instructions, which are those:

opam remote add opamdoc git://github.com/avsm/opamdoc-dev
opam switch 4.01.0beta1+opamdoc
eval `opam config env`
opam install opamdoc
opamdoc-rebuild
opamdoc-generate
open ~/.opam/4.01.0beta1+opamdoc/opamhtml/index.html
  • opamdoc-rebuild builds a database of cmt/cmd files in ~/.opam/4.01.0beta1+opamdoc/opamdoc.
  • opamdoc-generate builds a package HTML list in ~/.opam/4.01.0beta+opamdoc/opamhtml.

Well, they work quite well, except opamdoc-rebuild uses .cmd files as a base, whereas it should use .cmt files instead because sometimes .cmt files exist while .cmd don't. This is a problem for a few packages (Core is one of them).

My advice: if something doesn't work as expected, please open an issue, and if you want to customize the way the documentation is built, don't hesitate to edit opamdoc-rebuild and/or opamdoc-generate (they are Bash scripts).

A few notes about those instructions

The first line, opam remote add opamdoc git://github.com/avsm/opamdoc-dev, gives meaning to the second one, opam switch 4.01.0beta1+opamdoc. 4.01.0beta1+opamdoc is a customised compiler that produces, on the side, the .cm{t,d}{,i} files that are used by opamdoc. Using it is very convenient! If you don't use it, you need to produce those files in another way of your choice (really? you want to go through such trouble?).

Note that opamdoc is written in such a way that it would take a lot of work to make it work with versions of the compiler prior to 4.01.0, because it uses features introduced by that very version.

Just in case, here follow the customised version of opamdoc-rebuild and opamdoc-generate that I used to build the documentation available on http://ocaml-redesign.github.io/pkg/docs/.

opamdoc-rebuild

Here's my customised version of opamdoc-rebuild:

#!/usr/bin/env bash
# Rebuild the cmd/cmt archive in ~/.opam/<switch>/opamdoc
# Copies all the cmt/cmti/cmd files found in the OPAM build
# dir into a single directory structure, separated by MD5
# to keep files distinct.

set -e
#dry=echo
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
  echo Must be using a custom OPAM switch for this to work.
  exit 1
fi

function calc_md5_for_file()
{
  if builtin command -v md5 > /dev/null; then
    md5=$(cat $1 | md5)
  elif builtin command -v md5sum > /dev/null ; then
    md5=$(cat $1 | md5sum | awk '{print $1}')
  else
    echo "Neither md5 nor md5sum were found in the PATH"
    exit 1
  fi
}

BASE="$(dirname $(dirname $(ocamlc -where)))"
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
rm -rf ${DOC}
mkdir -p ${DOC}

PKGS=$(find ${BUILD}/ -mindepth 1 -maxdepth 1 -type d "$@")
echo ${PKGS}

for pkg in ${PKGS}; do
  pkgname=$(basename $pkg)
  echo $pkgname
  mkdir -p ${DOC}/$pkgname

  CMTS=$(find ${pkg} -type f -name '*.cmt')
  for cmt in ${CMTS}; do
    d=$(dirname $cmt)
    calc_md5_for_file "$cmt";

    f=$(basename $cmt .cmt)
    mkdir -p ${DOC}/$pkgname/$md5
    r=${DOC}/$pkgname/$md5/$f
    for ext in cmt cmti cmd cmdi cmi ; do
        if [ -e $d/$f.$ext ]; then
            $dry cp $d/$f.$ext $r.$ext
        fi
    done
  done
done

opamdoc-generate

Here's my customised version of opamdoc-generate.

#!/usr/bin/env bash
# Rebuild the cmd/cmt archive in ~/.opam/<switch>/opamdoc
# Copies all the cmt/cmti/cmd files found in the OPAM build
# dir into a single directory structure, separated by MD5
# to keep files distinct.

set -e
#dry=echo
SWITCH=$(opam switch show)
if [ "${SWITCH}" = "system" ]; then
  echo Must be using a custom OPAM switch for this to work.
  exit 1
fi

OPAMDOC=${OPAMDOC:-opamdoc}


BASE="$(dirname $(dirname $(ocamlc -where)))"
BUILD=${BASE}/build
DOC=${BASE}/opamdoc
HTML=${BASE}/opamhtml

rm -rf ${HTML}
mkdir -p ${HTML}

# Grab the build dir in reverse mtime order to get the
# ordering of generation "correct".
PKGS=$(ls -1tr $BUILD)

rm -rf $HTML
mkdir -p $HTML
cd $HTML
for pkg in ${PKGS}; do
  fs="$(find $DOC/$pkg -type f)"
  if [ "$fs" != "" ]; then
    name=$(echo $pkg | awk -F. '{print $1}')
    echo $pkg $name
    $dry $OPAMDOC -p $name $fs
  fi
done

Enjoy.

by Philippe Wang at Tue, 24 Sep 2013

Liveblogging CUFP 2013 (SRG Syslog)


I'm here at the Commercial Users of Functional Programming workshop at ICFP 2013 with Heidi Howard, David Sheets and Leo White.


Marius Eriksen and Michael Sperber are co-chairing this year. Functional programming is a hot topic these days, and this year’s program reflects the maturing nature of the program. Record number of submissions, and we could easily have made this a multi-day program.  Videos will be available online after the event.

The keynote is from Dave Thomas on what we can learn from the "language wars of the past".  His talk will cover the business, social and technical ends of building Smalltalk and his "objaholic" experiences with pushing language technology into businesses that never had any intention of changing in this regard.  This worked over the years because it made a material difference to these businesses.  Dave still uses the K programming language a lot at work.


Dave is responsible for the Eclipse IDE and the IBM Java virtual machines, and has delivered software on everything from mainframes to wristwatches.

There was a rush in the 70s with Symbolics and the amazingly cool hardware, followed by the rush to logic programming that led to his grants disappearing. Midway through the 80s, they discovered we wouldn't get parallelism for free with all this magic. After that came the era of machine learning and knowledge ontology crusades, but those are very specialised bits of tech. There remain sectors where ontologies are useful. Nowadays it's the multicore crusades, and FP is emerging as useful.

Every road leads to some form of FP, because the principles underlying it apply. Dave does vector programming, Haskell does this with array transformation.

Dave was talking to Simon PJ outside and claimed that “FP has won!”, at which point Simon replied “really?” (laughter). OO techniques can also lead to a lot of tangled code: witness the evolution of Smalltalk to the current generation of languages.

FP technical challenges: what happens when you have a lot of functions, and a group of consenting adults are doing it together. This is called software engineering, or at worst, IT!

Lots of good work in the community on approaching large-scale programming. Scala is a bit too complex for Dave (and the audience rumbles in agreement or disagreement?). Java and JNI should just die. Erlang and actors are successful at decoupling software. But do we really need to hide all our queues? Static analysis of queues is desired.

Want to go fast? Computers like rectangles, all the way down! Array operations and vector processing is the way to go in the long-term.

To go to industry, we need end-to-end toolchains. If you use MS Project, then you probably shouldn’t introduce any new tools.

The first commercial customer for Smalltalk in the 1980s was Tektronix, who wanted to embed reusable tech in their oscilloscopes. If you're building a serious product like this, it needs language interop between C++ (the logic sensor) and the Smalltalk. Respect the language cultures among components (anyone remember S-records?). ROM had to be replaced with Flash because no one can write code that doesn't need upgrading (groans from the audience).

With the right tools, programmers will be able to go as quickly as in Smalltalk, or at worst Java. We need the work of Simon [PJ] and colleagues on Haskell. Scala is so complex that it will become the Java (which is the Cobol of tomorrow).

When shipping products, you need to track all the versions of language components in the toolchain. Too many FP toolchains are monolithic and make it hard to track components in the field. JVMs are so slow because they don’t have a good sharing model (DLLs).

What Dave would like is version-to-version transformation in languages. Y2K was a problem not because IBM hadn’t fixed it in their OS, but because customers had modified their local installations and were pinned 6-10 versions behind! The same will happen for existing FPs that fragment into multiple implementations, so will they descend into multiple version hell or will standards emerge from the community?

Also, lets not leave the data behind. Can we alias data from existing legacy sources into our new languages, and not aim for the perfect representations? Most of our programs are dealing with problems with existing data, not code.

Onto deployment issues. Erlang works very well for the problem domain that it’s designed for, but more complex domains exist like OSGI. Hot codeswap and live updates are often viewed as “boring problems” but are vital to real world deployments. Space usage must be paid attention to: how long can a given JVM stay up without leaking? Serialization is important to persist data across versions. If we’re going to use “esoteric control structures” like CallCC, tailcalls or tags, the JVM isn’t going to get us there.

These days, if you can't stay in your local cache, you will lose several orders of magnitude of performance… you'll need directly connected CPU (or SSD!) memory, static initialization, and language interop (eager vs lazy). Functional data structures and immutable collections are still immature, and need more work to convince doubting Thomases about their scalability in real applications.

Hardware support for these is important. Intel Haswell has hardware transaction support (hence it can be used by Azul JVMs etc) but we need more for functional language performance. Some $2bn has been put into speeding up the JVM to make it as fast as C++. Google has a technology called Pinnacle which uses software write barriers in VMs that execute directly, with a register model and not a stack model (which is easy to put together but hard to make fast). The new generation of language VMs will support direct execution and support modern functional programming languages.

“We all love types” — Dave on objects, but not variables (“it's easy to be cooperative with that definition!”). Pluggable types by Gilad are really cute, because they move us towards a world where particular features (Dave likes behavioural types) can be woven in. Perhaps this can be done in a dependently typed world. Other good candidates are approximate types and dimensional types. When all these are put together though, it all needs to fit together with sound compositional semantics.

When using an interesting PL, Dave wants warnings to be emitted when the compiler makes space/time tradeoffs so that the programmer can decide between elegance and performance (he also mentioned that Haskell programmers do this for …).

How does language evolution work? Arthur Whitney maintains K, and he ensures that every version of K is incompatible with the previous one. This is a blessing (!) since it avoids a kitchen sink language and ensures a high quality set of libraries. Quite often there’s a tussle between languages and libraries because there’s not enough coevolution. For example, are closures in Java really useful for the tradeoff in the complexity of the VM? Perhaps we should be looking towards a collection of small correct languages instead of a few partly right and largely wrong languages such as Ada or Java (the Java hate is strong here — editor).

Vector languages (going back to APL) are largely write-once, read-once languages for those of us who have programmed in them, because the code is so hard to read. One trick Dave does with K is to annotate comments inside expressions and use code-folding without a point-free style to understand vector codeflow better. Also, seeing the environment and current name bindings is extremely useful for teaching new languages.

The real problem is: comprehensions, folds, monads — WTF? He has yet to see anyone from the monad community give a talk that makes monads understandable. It's typically “you don't need monads” and then the word is used multiple times in the talk. If it's not important, why mention it so often? Quite often, the FP experience makes normal programmers feel stupid because they don't break through this terminology barrier. We as a community need to help people break through the barrier of not being formally trained in mathematics for whatever reason (“category theory for breakfast?”) and *be productive* in FPs.

Smalltalk was like this in the early days. People see a blank window and aren't sure what to do next. Objects irritated people when they came out, but FP scares people. This is a danger as we go through and wow people with what we can do, but make it look like magic. People change their habits at different rates and we need to understand that. With Haskell, it was odd that he had to get the program “right” (or right/wrong) before it would work at all, as opposed to not working.

Bill Buxton (UI researcher friend) pointed out that people can’t much get beyond programming a VCR. 4GLs were pretty much their limit. Dave’s done a lot of work on evangelizing Smalltalk, but feels it was still too hard. People just didn’t understand how to separate behavior, as the best practices on how to build objects were foreign to most of them. This all settled on some mix of SQL and some glue code, which doesn’t require learning a lot of new technologies.

Given there are many different types of programmers, do they all really need the full power of FP? Some subset of features is more useful, with one way to do things that doesn't change from release to release. For example, a simple dictionary keeps changing from Hashmap to concurrent Hashmap to something else, but ultimately it is still a dictionary. How do we separate the reasoning over performance and structure, and not scare people with both at the same time? Knuth started doing this decades ago with literate code, and we *still* don't have a decent way of writing documentation and code in the same place.

We are still missing a pure FP book that is built in the same style and comprehension as SICP. We need ways to look at the code and represent things more densely (for example, why don't we use Unicode in the same style as APL symbols to increase the succinctness?). They are working on J programs that move glyphs around on the iPad as an example of this 4GL density.

We need to take all these ideas and *talk more* about metaphors in FP. Where’s the “Scala, F#, Clojure, the good parts” and a frank exposition about the antipatterns? Don’t lie about where the dragons are, but colour code them appropriately to not surprise programmers.

We need to understand how to design with types. In the vector languages, this is called “shape shifting” as they don't have types, but it's a similar concept. It's also hard to get people in the community to talk about shape shifting, and as a result the art is being lost in the new generation of languages. For FP, can the Yodas in the audience help their audience to “trust the types”?

Q: In the top 20 languages, every language has a free-or-open-source runtime. This will make a huge difference as it lets people experiment.

A: It's not been a barrier in experimentation for Dave, but it has been a barrier for commercial use. Dave would not attribute a particular significance to open source as it's never been a real problem for him (using trial versions and so forth). Open source is a double-edged sword, as it tends to produce impoverished language teams that don't have enough resources to move forward. The really good virtual machine implementations are commercially polished (c.f. the $2bn behind the JVM performance from earlier). Clojure is a great example where they appealed to the community to donate towards a production version. To build a serious production PL, you need a lot of government research money or a major commercial sponsor.

Q: Marius asks about mobile and embedded and why FP hasn’t cracked it.

A: they are often late to the game, and it's often not easy to target these platforms with a high-performance runtime. You have to be willing to understand the interface.

Q: Phil Wadler says the scariest thing is that if you become a success, then Smalltalk takes the form of C++

A: (or Java!). Being able to maintain the independence of the language creators vs a major player is crucial to conceptual integrity. (Phil) Tools that will migrate between version X and Y and a sustainable ecosystem. It’s important to ensure that the creators get paid to maintain — even the JVM only has 10-20 people in the core team and a vast ecosystem.

OCaml at Facebook via the Hack language

Julien Verlaguet is talking about how Facebook is adopting OCaml.

Why PHP is good: fast installation times, lots of libraries, easy to learn, and scales in a parallel way (multiple instances).

However, Facebook doesn’t care about any of these features. What Facebook wants is a fast feedback loop between writing code and having it appear on screen. There’s no good way to write a unit test for an experience, and instead a fast feedback loop between code and a live site is how people still work at Facebook today.

Performance at the scale of Facebook really matters. 1% can be a huge bump, but PHP is really hard to optimize due to all the dynamic features in the language. Aside from runtime performance, the other concern is development time for engineers, who need to avoid security problems and general bugs in an incredibly large codebase.

The first attempt at this was to scale the runtime via HipHop: PHP was run under an interpreter for development sandboxes, and production code was compiled to C++. Unfortunately this led to a huge divergence in performance characteristics between the interpreter and the compiled code, which led to confusion in production.

The current approach is HHVM, a combined model: a JIT runtime that patches code at runtime and helps merge these two models.

HHVM is still a PHP platform, and a new model called Hack provides a *new programming language* that targets HHVM (which is essentially a PHP runtime).

Hack is a statically typed language for HHVM. It’s compatible with PHP and interoperates with no overhead and has the same runtime representation as PHP. It has evolved from PHP, and so if you know PHP, you know Hack. It’s designed for incremental adoption via introducing gradual typing (c.f. keynote talk about not scaring programmers not used to types).

function foo(): void {
  $x = 'hello';
  return $x; // the type checker flags this: a void function cannot return a string
}

This returns a readable error message in the style of a statically typed language, since the inference algorithm keeps a witness about every value’s type and shows counter examples.

In the Hack type system, class members, function parameters and return types must be annotated. Everything else is inferred, to minimize programmer burden.

Hack types include:

  • Nullable types such as ?int, mainly because of the huge code base that had no pattern matching, so null propagation is important
  • Tuples
  • Closures
  • Collections such as Vector<int>, which have generics too
  • Constraints
  • Type Aliasing
  • Extensible Records

A demo of a simple mailbox follows. HH has a command-line mode that lets you highlight which parts of the source code are dynamically checked, by outputting the code in red if it's potentially unsafe. Awesome!

class Mailbox {
  private $data;
  public function send($data) {
    $this->data = $data;
  }
  public function fetch() {
    return $this->data;
  }
}

(very PHP like but now he annotates it with specific types)

class Mailbox {
  private string $data;
  public function send($data): void {
    $this->data = $data;
  }
  public function fetch(): string {
    return $this->data;
  }
}

As a programmer, I now want to know if this has all been statically checked (since the previous checking mode showed several red sections which were potentially unsafe). He runs the compiler with a strict mode that rejects the source if there are any unsafe bits.

He then demonstrates how to propagate null types through the code, and then introduces several subtle errors that are all caught.

An audience member asks what happens if you send both an int and a string to a mailbox, and Julien demonstrates a “mixed” type that covers both (is this some sort of row polymorphism? — editor).

The response time of the type checker is close to instantaneous (for the developer loop). Even with a thousand files, the type checker gets back in a second or less with an error message.

To get the tooling right, Hack uses js_of_ocaml to compile the entire OCaml type checker to Javascript and then run in a complete web-based IDE (demonstrates a full cloud9-like IDE they use internally at FB). To work at scale, they created a server that runs persistently in the background and type checks all the files and responds from a memory cache.

At scale, the autocomplete requires very low latency in the web-based editor, and users also use version control (e.g. switching between branches). The server also has to use a reasonable amount of RAM and have a reasonable initialization time. Fundamentally it has to be stable, as developers have to depend on their tools.

A vast amount of code is written in OCaml for the tools. It was a good choice because it is ideal for symbolic computation, has excellent performance and can be compiled to Javascript. C interop is really good too. The main challenge is the lack of multicore, although Facebook works around it with their multiprocess architecture.

Architecture: the master process spawns multiple worker processes, which use shared memory to avoid having to go through the master. The underlying buffers are lock-free (similar to our Xen shared memory ring! — editor).

Advice on OCaml at scale:

  • IPC — use lock-free structures and think about sockets/pipes instead of just TCP
  • Keep the GC heap small; shared memory is compacted by the master
  • OCaml makes you think hard about shared objects, which is a good thing.

Q: is it open source?
A: soon, this is the first time Facebook has talked about it publicly.

OpenX and Erlang ads

What is OpenX: (i) An opensource PHP ad server (ii) OpenX is a global company with an ad service featuring a big ad exchange written in Erlang. This talk is about the transition from PHP to Erlang.

1998-2007: it was a conventional PHP ad management system with MySQL. The company started using it as an EC2-based SaaS ad server and serving up clients successfully.

In 2008, they wanted to mix in an ad exchange, to add liquidity to marketplaces that have empty ads. At the same time, they moved offices to Pasadena and looked at new tech. Switched to Hadoop (Java) and Postgres (for some reason). Poor technical choices were made, such as writing to the database for every request (sometimes multiple times!).

In 2009 as the customers increased, there was a big push to scaling things up for performance (this is when the speaker joined the company). Switched Postgres to Cassandra to move from a relational database to a more NoSQL approach that had the appropriate consistency properties. At this time, the architecture was also decomposed to give them headroom to scale for a while, particularly by adding HTTP load balancers and so on.

They wanted to add even more low latency stuff, such as spot bids for immediate bids on empty slots. Anthony decides that the model was a good fit for Erlang, since one request needed to fan out to multiple backend services with a soft-realtime limit to return a response to the user. It took a few months to prototype an Erlang based solution which worked very well.

After this first experience with Erlang and evangelising it, and pointing it out to his coworkers, they came up with another good problem: analyzing the frequency of how often someone looks at an ad. This required a new database that would hold bespoke data efficiently, and also integrated Riak from Basho to interface with the database to run the query engine. This now holds on the order of billions of keys.

In 2007, they ran into problems with Cassandra due to making an incompatible upgrade which would have required OpenX to shut down their business to upgrade, which was unacceptable. Since they had a good experience with Riak core, they developed a new system to move away from Cassandra and add more Erlang for management stack.

In 2011, they decided to split their UI layer into a more Web 2.0-like UI API. The API continues to be implemented in PHP (with a Javascript UI clientside), but the Erlang evolution continued with a gateway service and marketplace service that are both pure Erlang. The basic problem is ad selection, and the logic for this is written as a DSL implemented in both Erlang and Java to enable code reuse across the tools.

However, Cassandra was still sticking around, and they wanted to move away from it. They implemented a data service layer to abstract away from the database. From the early days, they made a bad (but fast) choice to have clients directly call Cassandra via Thrift, as it was easy. They moved away to a more abstract model to get away from Thrift, and so could change clients. The new system could call both Cassandra and Riak to help evolve with both services running simultaneously and provide a seamless migration path to the new service (in stark contrast to the Cassandra upgrade story from earlier).

They also decided that the PHP use for the API was a bad idea, and so switched to Python. They engaged Erlang Solutions to write them an API router since their core team was busy (an example of external consultants coming in and being useful — editor). The current system involved RabbitMQ coordination, and Riak’s replication logic to move out to the colos for publication.

Currently they have 14-15 services in Erlang, 7-8 in Java, and a blend of Python/PHP.

How did a PHP shop end up with so much Erlang? Well, they laid off all the PHP developers when they moved offices (lots of audience laughter).

More seriously, the architectural choices are important:

  • Cloud based (generic hardware, automated bootstrap, deployment, package oriented development and fault tolerance).
  • Service based (loosely coupled, single purpose components, pools of components, polyglot so multiple languages can be used)
  • Cross language tools: Thrift (RPC between components), protobuf (RTB and Riak) and Lwes (logging and monitoring) which was open sourced by Yahoo after being originally written at Goto.com. A system called Framework that provides templates for code layout, builds RPMs/debs and integrates with autotools. It enforces versioning and reproducibility across languages.
  • Finally, evangelism is really important. If possible fix the game via architectural choices to set the scene for change. Find a project to showcase the technology, and be responsible for the project’s success. Make sure to share work, and remember that tools are really important to make it easy for others to use.

Figuring out the build toolchain is really important — it can be so arcane. They did spend time on integrating with Erlang (erlrc integrates with the packaging system, erlstart manages lifecycle, erlnode for ?).

Hiring: train people rather than trying to hire for existing knowledge. Operations have unfamiliar systems to manage, and developers have new concepts and patterns.

How did Erlang help OpenX developers? The language fit the service model very well. The let-it-crash philosophy and supervision trees led to fewer real service outages.

OpenX is doing really well as a business, too: 250+ billion monthly transactions, 12+ billion daily bids, thousands of machines in 5 colos, 300 employees and $150m+ revenue in 2012.

Q: Marius: with a service based solution, how do you library them up?
A: Their services aren’t hugely fine grained and so usually a module or two (not a lot of state)

Q: Anil: is Framework usable from other languages? RPM integration seems killer
A: It's still esoteric and the internals aren't documented, but they would welcome a chat!

BAE Systems on Redesigning the Computer for Security

Tom Hawkins speaking on DARPA funded project.

SAFE is a codesign of:

  • a new application language (Breeze)
  • a new systems programming language (Tempest)
  • a new OS
  • a new processor

with security at every level for defense in depth.

Emphasising hardware enforced security since dynamic checking is too much overhead in software, for example for fine-grained information flow control. Hardware IFC handles the most general attack model (scripting down to machine code injection).

Hardware overview: every word of data contains a tuple of (group: 5 bits, tag: 59 bits, payload: 64 bits). The Atomic Group Unit checks atom types (e.g. instructions, data pointers and streams). The Fat Pointer Unit (confusingly, FPU) checks pointer operations, and there is a Tag Management Unit (TMU).

The label model can be used for efficient hardware typing, and more recently they are using labels for Control Flow Integrity (e.g. to prevent return oriented attacks). Register file is 32 elements deep, memory is accessed through load/store, and every bit of data in the system is a well-formed atom (and so has tag/groups associated with it).

FPU deals with buffer range checks (e.g. underflow or overflows for a buffer).

The TMU is where the magic is. It combines the tags from the various operations being worked on (instructions, memory) and acts as a “database”, checking that the operation is consistent with local policy, so the machine can continue execution or raise a security exception.

At day 1, they had an outline for an ISA but nothing else. The TIARA project was a baseline (Howie Shrobe, André DeHon, Tom Knight) and they had no languages, no hardware and no toolchain.

They started by sketching out an assembly language and then building an ISA simulator. They simulated simple assembly programs, and HW researchers started coding in Bluespec (a Haskell-like FP that Lennart keynoted about at CUFP 2011 in Japan).

The PL researchers started designing Breeze. The plan was to “steal” Andrew Myers' work on Jif, and “Breeze should be done in a couple of months”! It didn't quite work out that short (laughter).

SAFE assembly is as low level as expected, but it adds some new properties:

  • frames manage fat pointer bounds
  • Atomic group declarations on data mark critical sections
  • Tags on data can be marked (e.g. bootloader private)
  • Secure closures are code/data with the authority to run the code over the data (they call them gates)

Assembly is tedious and so they started with macros. The Breeze interpreter is now running and there was pressure to start building the compiler. The OS guys wanted a language to build the higher level components.

Solution: a SAFE assembly DSL embedded in Haskell that uses Haskell as the macro language and becomes the library for the future Breeze compiler.

Breeze Language: version 7 — just the datatypes for Booleans were hard and took 4-5 weeks (“this IFC stuff is kind of tricky”). The supposed convenience and modularity of lexical authority passing and one principal per module turned out to be extremely inconvenient!

SAFE Assembly in Haskell looks like a typical EDSL. A monad captures program descriptions, and macros set up data tags and also provide better control flow (such as while loops or if/else conditionals in the assembly).

At Year 1.5, the EDSL is still assembly language, no matter how many macros there are: manual register allocation, calling conventions and data structures. Meanwhile the Breeze compiler was inching off the ground, but there is an awkward transition from the high-level CPS IR to assembly; there needs to be an intermediate form somewhere. However, the SAFE EDSL in Haskell works great as the compiler backend, plugging in seamlessly with the Breeze code generator, which was nice.

Breeze Language is now on Version 12 and continues to answer questions. “What do we do on an access violation?”. Simply stopping the machine isn’t enough, since “What if I maliciously send you data that you can’t access?”. “Simple I’ll just check the label before I attempt to read it?”. “But what if the label itself is private?”. This was the big “aha” moment in IFC, since private labels end up being a poison pill in IFC.

At Year 2.0, the Breeze compiler is going through a major overhaul since it needs improvements (the middle IR is improved but not enough). They temporarily shelve the compiler and realize it won't come to the rescue of the OS. They really needed a higher low-level language than just assembly.

Breeze language v23 has a solution to poison pills: make all labels public! To label data you must specify the label in advance (brackets), but public labels are not compatible with lexical authority passing so they had to relax their original vision a little.

Year 2.5: Tempest is started as the systems PL for SAFE. Imperative with register allocation and optimizations and control of assembly with inlining and user specified calling conventions. It uses the SAFE EDSL to generate assembly (that was a successful DSL) and nicely fills the Breeze IR gap in the compiler for when that is resurrected.

Breeze language 34: delayed exceptions with not-a-value values (Nav).

The Tempest EDSL with inline assembly has a SAFE assembly sublanguage. The SAFE flow is now the Breeze, Tempest and SAFE Assembly layers, with EDSLs for each gluing the compiler pipeline together. There is a SAFE ISA simulator and debugger, along with a Bluespec simulator to save on FPGA debugging cycles. The TMU is an elaborate piece of hardware (an exotic CAM and cache structure) that a lot of the effort has gone into.

Lessons learnt:

  • designing a higher order IFC language is very hard even with some of the most brilliant PL researchers in the world. Optimal number of researchers is 2-7 rather than any more :) Ben Pierce and Greg Morrisett arguing over very hard problems was a lot of fun in a hard problem space.
  • Day 1 should have started with Tempest, not assembly. It's hard to be productive with assembly code, and Tempest is the right level of runtime/processor codesign. The level of indirection provides insulation from a changing ISA.
  • EDSLs are great for bootstrapping a language and make excellent backend libraries. However, engineers must be comfortable with the host language and not all the engineers were comfortable with Haskell when they started. EDSLs are hard to debug (line number information apparently hard to get out of error messages vs a concrete syntax).
  • Concrete syntax is more relevant for some languages than others (e.g. Tempest vs SAFE assembly). The best transition point to a concrete syntax is after developer pressure for modular programming. Once the language has modularity, then the switch can be made without disruption.
  • Would a DSL have helped hardware design? They were forever debugging the ISS and FPGA instead of raw Bluespec. A DSL describing the ISA semantics would have kept things synchronized by generating the Bluespec, ISS, SAFE EDSL and Coq documentation.

This has been a huge project spanning a vast number of layers. It has produced a large number of interesting papers, including one at ICFP this week on testing non-interference quickly, using QuickCheck to generate programs and test non-interference properties.

FRP at Netflix by Jafar Husain

Netflix used to have tight coupling between the middle tier and the UI teams. One-size-fits-all messages led to inefficient call patterns.

The plan was to decouple this by giving UI developers the ability to decouple components (or “force!”).

Two developer personas at Netflix: Cloud and UI people. Netflix deals with a third of all US broadband traffic, so they need to think at scale but also to run on really tiny cheap devices that eke out all performance. How do they turn UI developers into effective cloud developers who think at scale?

Gave them UI comforts: Groovy, an OO API and a Reactive API. Most UI developers (or indeed any developers) can't be trusted with locks. This means that reactive programming isn't enough, which leads to “how do we make parallel programs safe for developers?”

Enter Erik Meijer, the legend himself! He asked “What's the difference between a database query and a mouse drag event?”. A database would use functional code and a mouse drag might be imperative. Erik points out that they are both collections.

New JS closure syntax in ES6 is fantastic and makes them easier to use. If you imagine a query for well-rated movies, then you need to map over video lists. He then shows a mouse drag event, but points out that the structure of the code is very similar. Instead of a “filter” operator, he uses “takeUntil”, which stops the operation after some predicate has been satisfied.

The iterator pattern is similar to the observer, except for the direction of the signalling. Iterators have next/hasNext and can throw a Throwable, but Observers don't have a similar completion semantic. What happens when we add these to the Observer? We end up with “onNext”, “onCompleted” and “onError”, so now the Observer has comparable semantics to Iterator, and the only thing that changes is the direction of the data flow.

Some platforms also have an Iterable. There is now also an Observable to which you can pass an Observer, in the same vein as passing an iterator to a Java Iterable. We also add a Disposable to both, so that when the IO stream has terminated, it can be stopped immediately without having to junk the rest of the data.
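(Editor's note: a minimal Scala sketch of the two interfaces just described. The names mirror Rx, but this is an illustration rather than the actual Rx API.)

// Pull-based: the consumer asks for the next value; errors surface as thrown exceptions.
trait PullIterator[+T] {
  def hasNext: Boolean
  def next(): T // may throw to signal an error
}

// Push-based: the producer calls back with the same three outcomes.
trait Observer[-T] {
  def onNext(item: T): Unit        // a value is pushed to the consumer
  def onError(e: Throwable): Unit  // the stream failed
  def onCompleted(): Unit          // the stream finished normally
}

// Lets the consumer stop the stream early without consuming the rest.
trait Disposable {
  def dispose(): Unit
}

trait Observable[+T] {
  def subscribe(observer: Observer[T]): Disposable
}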

Erik and his team at MSR demonstrated that these patterns are dual, so Iterable and Observable are dual! This is the Reactive Extensions (Rx) library, a combinator library for the Observable type. It's open source and has been ported to a number of languages other than C# (including Java).

The Observable monad is a vector version of the continuation monad. It has the null-propagation semantics of the Maybe monad and the error-propagation semantics of the Either monad. It is produced and consumed with side-effects and acts as a fault line between pure and impure code. It can be composed functionally and its cancellation semantics are clean. It can be synchronous or asynchronous; a common mistake is to assume that one of them must be async, but either can be sync simply by calling the functions immediately.

If you Map over Observable, then

(rapid transcript)
var map = (observable, func) => ({
  forEach: observer => {
    var subscription = observable.forEach({
      onNext: item => observer.onNext(func(item)),
      onError: error => observer.onError(error),
      onCompleted: () => observer.onCompleted()
    });
    return subscription;
  }
});

We can use this to come up with a single reactive type for cloud and UI developers! Jafar demonstrates example of social notifications in the middle tier, but using Observable to filter notifications from a composition of the socialService and messageService. Both underlying services are returning observables, and are non-blocking and completely parallel. Both can execute synchronously or asynchronously and decide which knobs to turn for their particular scheduling needs. The programmer doesn’t need to care about the details.

Another example is search autocomplete, which is surprisingly hard. Need to decide how often to send requests (throttle them to be not per-key) and also deal with concurrent responses from keypresses coming in out of order.

var res =
  keyPresses.throttle(20).flatMap(
    search =>
      getSearchResults(search).takeUntil(keyPresses));

res.forEach(resultSet => listBox.setItems(resultSet));

This code fragment does all this composably and handles all that logic. It's throttled, ordered correctly, and outstanding requests are terminated when new keypresses arrive (with respect to the throttling policy).

The full tier is a UI (huge amount of mutable state), dropping in to a distributed functional style, and heading into the data tier. Recall this is at huge scale.

Wins: Rx is open sourced from MS and ported to Java (Observable) and generally better tooled now.
Evangelism comes up again since you can’t assume the best technical solution will win. Practice public speaking and focussing on soft skills! Speaker got better in a short amount of time at distilling the essence of the benefits, and explaining to the target audience why something matters (e.g. to a UI developer vs a cloud developer as he explained earlier).

Challenges: training/hiring. Netflix didn't fire all their UI developers as the previous speaker did (audience laughter). Instead, they leveraged their skills at eking out mobile performance, and taught them alternatives: functional programming, vector programming and reactive programming. Be available to help out people (developers work weird hours!) and interactive training exercises really help too. Also help them understand when to apply each technique for a particular programming task.

Some of the Javascript programmers had some intuition for these techniques from things like Promises. It wasn't quite the right answer, as sometimes you need vector versions for performance instead (since the vector version degenerates to a scalar one by passing a vector of size 1). Netflix don't bother teaching scalar, only vector.

Although the programmers are used to mapreduce, the embedded version of this is a lot harder for them to grasp — it’s just a programming style and not just magic in Hadoop. Hiring process adapts to spot the competent functional programmer.

bind/flatMap/concatMap/mapcat/mapMany are all monadic versions with subtle semantics and so they use interactive training exercises to train developers in their runtime semantics.

Challenges: performance needs some chunking for low-end devices (a vector is a manually chunked scalar monad) and is best applied to less chatty event streams. It worked for search on low-end devices since it takes so damn long to type anything in that the UI stream is less chatty (audience laughter). “We are screwed if the hardware UIs improve”.

Type-unsafe flatMap is easier to understand and is useful on low-end devices, but needs to be used carefully.

http://github.com/Reactive-Extensions/RxJS has all the source, so give it a try!

Q: How do you do data binding in Rx?
A: No templates, but many other tiers before the view layer are bound together. We introduce an Observable with both pure and impure code (an Observable type with a value property). However, they learnt to do this differently since Observable's lazy default is the wrong fit for data binding: they implemented a different type that is strict and publishes the result directly.

Medical Device Automation using message passing concurrency in Scheme

What does a molecular diagnostic device do and what does Scheme have to do with all this?

It detects the presence of specific strands of DNA/RNA in a sample and then does nucleic acid extraction and purification. Samples are excited using lasers as they undergo PCR, which results in spectrographs.
The device contains 19 boards with temperature and motor control, with many parts that move simultaneously at very high precision. (The speaker shows demos of the motors running and playing music from the system!)

The client is written in C# and WPF and is thin/stateless. The server is written in Scheme, with Erlang-style OTP embedded. A supervision structure isolates hardware failures. A DSL provides the glue logic.

Scheme has exceptionally clear and simple syntax, and they love the hygienic syntactic abstractions for their DSL; first-class functions and continuations are key. Arbitrary precision arithmetic is important for result calculations.

The Erlang embedding in Scheme is how they interop. They embed records by referring to structured data by field name instead of field index. This allows for copying a record and changing particular fields.

There's no pattern matching built into Scheme, so they built their own, which enables matching on all the usual datatypes (strings, lists, vectors and records). It also allows matching the value against a local variable and/or binding values to local variables in the pattern itself, which is new in Scheme (but quite ML-like? — editor)

Processes are one-shot continuations with software timer interrupts. The gen-server pattern provides a functional way of serializing messages and mutating state. The init/handle-call/handle-cast/handle-info and terminate callbacks are all encoded, so it's all done except code-change (the FDA wouldn't appreciate code changing on the fly!)

The event manager is responsible for event logging and subscriptions, and supervisors monitor processes and execute a restart or clean shutdown semantic (with enough logging for a field debug cycle).

How it works in the instrument server: “let it crash” throughout and a notable lack of defensive programming due to the actor model. They manage some big hardware which fails all the time and this architecture directly addresses that reality.

Each assay is executed via a DSL which defines the pipeline, in a confirm mode to estimate running time, suggest missing resources and warn about a lack of timeslots. Once satisfied this is all taken care of, the queuing is done via a Tetris-style binpacking heuristic. Each action has its own process for fault tolerance. The DSL exposes an option to ignore action failures, which is important for test runs and debugging (also during hardware development).

Some examples of device failures are very bad: the Therac-25 was an RTOS that gave people fatal radiation doses. It was tracked down to bad use of mutable state and integer overflow (taken care of in their system with FP and arbitrary precision math). Ariane 5 was lost due to integer overflow from a lossy conversion, which also doesn't happen in their Scheme system, which has non-lossy conversion. And finally the space shuttle simulator crashed due to memory unsafety (a restart was unclean), which cannot happen in Scheme. The supervision hierarchy in Erlang/Scheme can handle crashes with a stable strategy at any layer.

Why a DSL? The assay developers need to write their own recipes such as “which recipes for this instrument, and in which order?”. The scientists shouldn’t need software engineers to help them work, so the DSL sorts that out. Mechanical engineers need to check reliability and timing/variability on the hardware, which they can do by configuring DSL parameters.

There are a few bits to the DSL: an extraction/purification language (tip loading, pipetting, moving magnets) and the protocol language (aspirating, dispensing, etc) and finally the motor language.

Shows an example of the PCR language, which is a complex recipe involving baking, thermal cycling in loops. The DSL follows the logic flow of the English language version of the recipe very closely (with more brackets).

They wrote a web server that supports HTTP requests and JSON/jQuery, and an s-expression-based HTML code generator. It can render pages with Scheme logic to web browsers. It's used by engineers for debugging and prototyping without having to touch the real UI, and non-engineers use it to map data sources from instruments into JSON (which can be fed into validation tools, for example to send to the FDA for regulation).

Lessons learnt:

  • Message passing is great but not a silver bullet. Gen servers can deadlock between each other despite being deadlock-free internally. Timeouts in gen servers and supervisor logging make this possible to catch at a higher level, which is crucial!
  • Not much on the web about a design pattern for the supervisor structure, so they had to pioneer it (they didn't really want to pioneer it though)
  • Automated unit test frameworks were also built by hand
  • Hiring remains hard for this unique environment.
  • Existing quality metrics like bug density are problematic, since Scheme is terse and so looks artificially worse than more verbose languages!
  • FDA has more scrutiny into this system since this is different from the usual C/C++/C# codebases and also the large number of open source tools such as Sqlite/Emacs/git. Unique to this regulated problem space of course.

Message passing concurrency makes reasoning about the semantics of concurrent processes easier, as well as debugging problems via fault isolation. Let-it-crash leads to shorter, less defensive code with fewer branch points and test cases. Strong typing and runtime checks provide cognizant failure instead of silent corruption (crucial to avoid in this medical space).

Mixing languages has been nice, since they can pick the good bits of Scheme (macros, arbitrary precision math) and Erlang (OTP). DSLs provide the glue here, and rapid prototyping for non-expert software developers.

Gilt and Scala microengines

Gilt is a clothing group, and they need to customize the website with per-user targeting. The customer-facing stuff is the tip of the iceberg, with a bunch of back-office features that need to be plumbed through. The traffic patterns are all about flash traffic that spikes massively during sales, and if they go down at these times they lose lots of money.

Gilt is about six years old, and it was built as a Rails app in the beginning that grew quite large. It had unclear ownership, complex dependencies, lengthy test cycles, and unexpected performance impacts from small changes. Rails encourages you to “throw everything into one directory — a thousand models in one directory”, which doesn't scale as the number of engineers grows. Some bad production issues came out of this problem.

Then they started to transition to microservices by moving a few core critical pieces, such as inventory management, into microservices. The number of services grew to over 300 in just 4 years (each a small HTTP service), all built using Scala. They developed an architecture to support this sort of growth of microservices.

Cornerstones of infrastructure (goal==cool stuff for customers!)

  • Build system: SBT is a “simple build tool” which then turned into the “Scala build tool” due to its complex mental model. Configuring everything correctly can be tricky if you have a number of dependencies. Gilt wanted to abstract developers from this so they could focus on writing application code. They solved it via a large SBT plugin that takes care of dependency management and provides a standardized configuration for everything.
    A complete build specification is shown which is very succinct: it just has a combinator for adding dependencies. Behind the scenes, the plugin provides a huge amount of support (testing, rpms, dependency, continuous builds, and many more)
  • Configuration is handled by a Zookeeper cluster and can be overridden with local files or system properties. It’s mapped into strongly-typed classes in Scala and validated via Hibernate so that configuration must match up with service expectations.
  • Testing is the most challenging part of a microservice architecture. Functional tests are the gold standard, but it's hard for developers to run 300 services on their local laptops. Unit tests can run locally, but there's no mocking service for these (and overmocking results in useless testing that is non-representative). They use the Cake pattern for dependency injection, which enables type-safe and scalable composition. It uses Scala's self-types and multiple trait inheritance. Things like object repositories or configuration providers can be expressed in this way, and the pattern lets them mix in different sources and weave in real or mocked sources (see the sketch after this list). UI testing is handled via Selenium and is built on ScalaTest. Automated browser testing with reusable components works well, and leverages Scala's type system to reject invalid tests/code/configuration.
  • Continuous Delivery lets them deliver 20 to 30 service versions every day using the aforementioned structures
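(Editor's note: a small Scala sketch of the cake pattern as described above, with hypothetical component names rather than Gilt's actual code.)

// A component trait declares a dependency and the interface it must satisfy.
trait InventoryRepositoryComponent {
  def inventoryRepository: InventoryRepository
  trait InventoryRepository {
    def unitsAvailable(sku: String): Int
  }
}

// Production wiring: stands in for a call to the real inventory service.
trait LiveInventoryRepositoryComponent extends InventoryRepositoryComponent {
  val inventoryRepository: InventoryRepository = new InventoryRepository {
    def unitsAvailable(sku: String): Int = scala.util.Random.nextInt(100)
  }
}

// Test wiring: a canned response, no network required.
trait MockInventoryRepositoryComponent extends InventoryRepositoryComponent {
  val inventoryRepository: InventoryRepository = new InventoryRepository {
    def unitsAvailable(sku: String): Int = 42
  }
}

// The service states its dependency via a self-type; the compiler rejects any
// composition that forgets to mix in a repository implementation.
trait BadgeService { self: InventoryRepositoryComponent =>
  def badgeFor(sku: String): String =
    "Only " + inventoryRepository.unitsAvailable(sku) + " left!"
}

// The same service code, composed against real or mocked sources.
object ProductionBadges extends BadgeService with LiveInventoryRepositoryComponent
object TestBadges extends BadgeService with MockInventoryRepositoryComponent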

Some cool stuff: on their website, there are badges showing inventory for each item. The inventory badges update in realtime as you look at the page, which really motivates people to buy as they spot inventory dropping. Shoppers get excited and feel like they're in a Black Friday sale and are engaged in the shopping activity! Building this stuff was quite easy using an event-driven Play application and Akka actors.

They build on this for “freefall sales”. It's essentially a Dutch auction and is also WebSocket- and Play-based. It initiates a sale that lasts for a fixed time with the price dropping regularly. The highlight is a $16,000 purse, which leads to audience laughter.

Q: how to deal with production side of microservices such as service discovery or log aggregation?
A: no service discovery, as Zookeeper tracks where everything is (which ports something is on). Log aggregation isn't great at the moment, so they're working on centralized collection.

Q: what’s the greatest win of FP here? Is type safety really important or something like that?
A: in the context of the company's development and technical stack, the nice thing about Scala is that it combines the “good parts” of Ruby (concise, expressive) with the type safety and performance they had in Java. That balance is a big win in terms of expressivity and reuse of components.

Q: how do you handle versioning of microservices?
A: each service maintains back compat in its API until they ensure all their clients are upgraded. The URI path has the version number, so service providers bump the version number for new clients and keep the old version working. They then migrate all the clients of the service as quickly as is practical.

PalletOps and infrastructure automation

(I missed a lot of this talk due to a lunch chat spilling over. I’ll fill it in later –editor)

A Clojure DSL that runs shell scripts and handles very complex infrastructure automation.

Q: Yaron: How do you handle error propagation through scripts (the equiv of set -e in sh)
A: Each combinator propagates errors and there are different run strategies that handle errors differently depending on needs.

Streaming data at Twitter with Summingbird

Speaker is @sritchie (his slides).

Summingbird is how Twitter does streaming MapReduce at the scale they operate at (the scale is “staggering”). It has been open sourced since Sep 3rd and a bunch of internal systems at Twitter run on it. It's mainly three people: @posco, @sritchie and Ashu — a small team.

Vision:

  • “Write your logic once”: Twitter has a number of 1000+ host Hadoop clusters and a lot of their data is processed via k/v processing with Hadoop. Going streaming had to be done evolutionarily, by letting logic run either on Hadoop or on some new streaming service. Programmers experimenting with data shouldn't have to make systems decisions early on in the process. Twitter's scale: 200m+ active users, 500M tweets/day, and scale is pervasive.
  • Solve systems problems once so that systems that are scaling stay fixed.
  • Make non-trivial realtime compute as accessible as Scalding.

This leads onto Summingbird: a declarative map/reduce streaming DSL. It’s a realtime platform that runs on Storm, and a batch platform that runs on Hadoop. The datastructures that come out of the DSL can be run on either platform (so you can choose between batch/streaming very late in the process).

One of the big concepts that Summingbird embraces is the idea of “associative plus” operations — the monoid.
The 4 S's of Summingbird (“you know you're onto something when they all have something in common!”):

Source[+T]
Store[-K,V]
Sink[-T]
Service[-K,+V]

Source is where data comes from, Store is a k/v store that is being aggregated into, Sink is where items are dumped as they are processed, and Service permits k/v lookups during a job.

Store: what values are allowed? What constraints can we place on a MapReduce job such that any valid value can make progress?

trait Monoid[V] {
  def zero: V
  def plus(l:V, r:V):V
}

“Less scary than a monad!” If you can take any two values and smash them together, you have a monoid — e.g. integers. Sets have a monoid (union), as do lists (concatenation) and maps (recursively pull in the monoid on the value). Others include: HyperLogLog, CMS, ExponentialMA, BloomFilter, Moments, MinHash, TopK. All you need to do is find a datastructure with a monoid and you can use Summingbird. There is an open source library called Algebird (birds birds birds!).
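(Editor's note: a hand-rolled sketch of the Map monoid just described, written against the Monoid trait from the slide above rather than Algebird itself.)

// The monoid on Long: zero and addition.
implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
  def zero: Long = 0L
  def plus(l: Long, r: Long): Long = l + r
}

// The monoid on Map[K, V]: zero is the empty map, and plus merges the two maps,
// recursively using the value type's monoid when a key appears on both sides.
implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
  new Monoid[Map[K, V]] {
    def zero: Map[K, V] = Map.empty
    def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
      r.foldLeft(l) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(vm.plus(_, v)).getOrElse(v))
      }
  }

// Associativity is what makes this safe to parallelise: partial sums can be
// combined in any grouping and the answer is the same.
val m = mapMonoid[String, Long]
val a = Map("x" -> 1L)
val b = Map("x" -> 2L, "y" -> 5L)
val c = Map("y" -> 1L)
assert(m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c)))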

It took some convincing (the words like “monoid” still scared people at Twitter, sadly).

Monoids are amazing due to Associativity. If you know that the values you are reducing are associative, then it’s easy to parallelise since they can be built “in any order”.

Even though realtime is important at Twitter, no one wanted to dip in due to the difficulty of getting started. Summingbird can pull in data from an existing static Hadoop cluster and then union it with a realtime feed. If you give the logic to Summingbird, it will run the logic on Hadoop (consuming discrete batches of data) and the client will keep track of how far it's gone. The realtime side comes from partial aggregations over fixed time buckets, which are merged together in the client using the monoid so that the result looks like a proper realtime feed.

One cool feature: when you visit a tweet, you want the reverse feed of things that have embedded the tweet. The MapReduce graph for this comes from: when you see an impression, find the key of the tweet and emit a tuple of the tweet id and a Map[URL, Long]. Since Maps have a monoid, this can be run in parallel, and it will contain a list of who has viewed it and from where. The Map has a Long since popular tweets can be embedded in millions of websites, so they use a count-min sketch, an approximate data structure, to deal with scale there. The Summingbird layer which the speaker shows on stage filters events, generates KV pairs and emits events.
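(Editor's note: a plain-Scala sketch of the embed-count logic just described, reusing the mapMonoid helper from the earlier sketch; the real job is written against Summingbird's planner, which isn't shown here.)

case class Impression(tweetId: Long, url: String)

// Each impression becomes a single-entry nested map, and the per-tweet URL counts
// are then summed with the Map monoid; associativity means the fold can be split
// across machines in any grouping.
def embedCounts(impressions: Seq[Impression]): Map[Long, Map[String, Long]] = {
  val counts = mapMonoid[Long, Map[String, Long]](mapMonoid[String, Long])
  impressions
    .map(i => Map(i.tweetId -> Map(i.url -> 1L)))
    .foldLeft(counts.zero)(counts.plus)
}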

Twitter advertising is also built on Summingbird. Various campaigns can be built by building a backend using a Monoid that expresses the needs, and then the majority of the work is on the UI work in the frontend (where it should be — remember, solve systems problems once is part of the vision).

Future plans:

  • Akka, Spark, Tez platforms
  • Pluggable graph optimizations since various planners do improved filter push
  • Metadata publishing with HCatalog — in OSS, this means better tooling for EC2 and other cloud platforms.
  • And of course, more tutorials

All work happens on Github so contributors are welcome.

Summingbird is working well for the majority of realtime apps at Twitter. It's all about the Monoid! Data scientists who aren't familiar with systems can deploy realtime systems, and 90%+ of the code is reused across the realtime and batch systems.

Q: Did Hadoop need any changes to cope with the interfacing with Summingbird?
A: Scalding (written at Twitter) handles much of the interfacing with Hadoop. Hadoop is really the classical k/v store now. The real problem is figuring out which operations can work on both the batch and realtime platforms, and only admitting those.

Q: Yaron: associativity is one nice thing about a Monoid, but commutativity is also important. Are there examples of non-commutative datastructures?
A: Good examples exist. Non-commutativity should be baked into the algebra. This helps with data skew in particular. An important non-commutative application is Twitter itself! When you want to build the List monoid, the key is (userid, time) and the value is the list of tweets over that timeline (so ordering matters here). It's not good to get a non-deterministic order when building up these lists in parallel, so that's a good example of where associativity and commutativity both matter.
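(Editor's note: a quick sketch of the point in that answer, again written against the Monoid trait from the slide: list concatenation is associative but not commutative, so grouping can vary but order cannot.)

def listMonoid[A]: Monoid[List[A]] = new Monoid[List[A]] {
  def zero: List[A] = Nil
  def plus(l: List[A], r: List[A]): List[A] = l ++ r
}

val earlier = List("tweet-1", "tweet-2")
val later = List("tweet-3")
val lm = listMonoid[String]

// Associative: partial results can be grouped however the planner likes.
assert(lm.plus(lm.plus(earlier, later), Nil) == lm.plus(earlier, lm.plus(later, Nil)))

// Not commutative: combining partials out of order scrambles the timeline.
assert(lm.plus(earlier, later) != lm.plus(later, earlier))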

Functional Probabilistic Programming

Avi Pfeffer from Charles River Analytics, introducing the Figaro language.

Problem: Suppose you have some information (“suppose Brian ate pizza last night”) and you want to answer some questions based on this (“is Brian a student or a programmer?”) and keep track of the uncertainty in the answers.

Solution: Create a joint probability distribution over the variables, assert the evidence and use probabilistic inference to get the answer back.

One common way to do this is generative models: probabilistic models in which variables are generated in order. Later values can depend on previous variables via a binding. This is a very popular approach with several implementations available over the years.

Developing a new model requires figuring out a representation, an inference algorithm and a learning algorithm. All of these are significant challenges with each of them still being considered paper-worthy.

The dominant paradigm is functional probabilistic programming. An expression describes a computation that produces a value.

let student = true in
let programmer = student in
let pizza = student && programmer in
(student, programmer, pizza)

let student = flip 0.7 in
let programmer = if student then flip 0.2 else flip 0.1 in
let pizza = if student && programmer then flip 0.9 else flip 0.3 in
(student, programmer, pizza)

How are we supposed to understand this kind of program? Sampling semantics: run this program many times, each run producing a sample outcome. In each run, each outcome has some probability of being generated. The program defines a probability distribution over all the outcomes.

This is a powerful approach since we are taking a Turing-complete language and adding probabilistic primitives, which naturally expresses a wide range of models.

Structured variable elimination, Markov chain Monte Carlo, importance sampling and so forth can all be built using functional probabilistic programming. PPLs aim to “democratize” model building — one shouldn’t need extensive training in ML or AI to build such models, or use them effectively.

IBAL (2000-2007) was the first practical PPL, implemented in OCaml but not embedded in OCaml. It did exact inference using structured variable elimination and later had intelligent importance sampling. However, it was hard to integrate with applications and data (in IBAL the data had to be hardcoded into the model, which wasn't practical). It also lacked continuous variables. Therefore the speaker built Figaro to work around the limitations.

Figaro (2009-present) is an EDSL in Scala and allows distributions over any data type. It has a highly expressive constraint system that allows it to express non-generative models. There is an extensible library of inference algorithms that contains many of the popular probabilistic inference algorithms (e.g. variable elimination).

The goals of Figaro are to implement a PPL in a widely used library: Scala is the interop language as Java compat is a prime goal. Figaro provides various good abstractions such as OO that come out of Scala (not sure, he skipped slide rapidly here? — editor)

Figaro has an Element[T], which is a class of probabilistic models over type T. Atomic elements (Constant, Flip, Uniform, Geometric) are combined with compound elements (If(Flip(0.9), Constant(0.5), ...)).

Under the hood, there is a probability monad to track the state. The Constant is the monadic unit and Chain(T,U) implements the monadic bind. Apply(T,U) implements monadic fmap. Most elements in Figaro are implemented using this monad.
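(Editor's note: a minimal sampling-based Scala sketch of that structure, just to make the monadic shape concrete. This is an illustration, not Figaro's actual implementation, which also supports exact inference algorithms such as variable elimination.)

import scala.util.Random

trait Element[T] { def sample(rng: Random): T }

case class Constant[T](value: T) extends Element[T] {      // monadic unit
  def sample(rng: Random): T = value
}
case class Flip(p: Double) extends Element[Boolean] {       // an atomic element
  def sample(rng: Random): Boolean = rng.nextDouble() < p
}
case class Apply[T, U](e: Element[T], f: T => U) extends Element[U] {           // fmap
  def sample(rng: Random): U = f(e.sample(rng))
}
case class Chain[T, U](e: Element[T], f: T => Element[U]) extends Element[U] {  // bind
  def sample(rng: Random): U = f(e.sample(rng)).sample(rng)
}

// The student/programmer/pizza model from earlier, written with this monad.
val model: Element[(Boolean, Boolean, Boolean)] =
  Chain(Flip(0.7), (student: Boolean) =>
    Chain(if (student) Flip(0.2) else Flip(0.1), (programmer: Boolean) =>
      Apply(if (student && programmer) Flip(0.9) else Flip(0.3),
            (pizza: Boolean) => (student, programmer, pizza))))

// Sampling semantics: run the program many times and tally the outcomes.
val rng = new Random(0)
val runs = Seq.fill(10000)(model.sample(rng))
val pStudentAndPizza = runs.count(r => r._1 && r._3).toDouble / runs.size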

Example 1: Probabilistic walk over a graph. In Figaro, we start by defining datastructures over the webpage graph (Edge, Node, Graph) and then we define some parameters of the random walk.

Example 2: Friends and smokers. Create a person with a smokes variable. Friends are three times as likely to have the same smoking habit as a different one, so this is a constraint variable. Finally this is applied to all pairs of friends by adding constraints to the pairs. This is then instantiated in a particular situation by building a set of people (alice, bob, clara) and running the inference algorithm (VariableElimination) to calculate the probability that Clara smokes.

Example 3: Hierarchical reasoning. We observe an object (say a vehicle on a road, or a person, or something) and we want to know what type of object it is. We have some observations about this object (its size, color or shape) and we want to know if it's something we should be interested in. It's very natural to think of these objects via an inheritance hierarchy. In Figaro, every element has a name and belongs to an element collection. A reference is a sequence of names (such as vehicle size). Starting with an element collection, we can get to the vehicle associated with an element (going through the sequence of nested element collections). However, there may be uncertainty in the identity of a reference (e.g. you don't know what vehicle1 is), and Figaro resolves the reference to the actual element.

Figaro is free and open source (cra.com/figaro), and v2 is on GitHub (like everything else today, it seems).

Q: What are the limitations of Figaro, such as non-terminating Bernoulli tasks, and what constraints can't be encoded?
A: Anything you can express can be put in constraints since any code can go in there. It's very easy to express something that you can't compute. Figaro is working on helping the programmer stick to things that can be computed over, such as via rejection sampling or simply rejecting the bad constraint.

Q: Similar to Norman Ramsey's work (such as Church), what's the relationship?
A: The languages fall into two categories: standalone languages and embedded libraries. The latter includes Figaro, Infer.NET and Factorie. Infer.NET is limited in expressivity and so can be compiled into a factor graph, which makes it very efficient. Similarly Factorie is low-level and compiles to a factor graph. Lots to discuss here.

Haskell and FPComplete

The company exists to fix Haskell development tools. All the tools are cloud-based and run in a web browser. Content can be browsed anonymously. This talk will not have the Monad word in it!

There exists a feature to create tutorials and embed code within those tutorials.

The latest product (being announced today) is a Haskell IDE: a web-based tool which requires no setup and no download. It has a number of cool features:

  • Errors are displayed immediately, as it compiles as you type
  • Git-based, so full version control

Company goals: increase Haskell adoption and make it more accessible. Offer commercial grade tools and support and simplify Haskell development. Also to support the Haskell community and leverage the community to improve FP Complete’s own products.

The FP Haskell Center (released in Sep 2013) has:

  • a web front end (no setup, access to help, integrated with Haskell)
  • a Haskell backend with project management, developer feedback with errors, a build system and a cloud-based execution and deployment platform. The cloud allows for faster product evolution!
  • soon, a personal version and a version that will work behind firewalls etc.

It is all implemented in Haskell, using the Yesod web framework. Lots of library support via conduits, and the UI uses Javascript (via Fay). All the heavy lifting is done in the backend and not in Javascript. Very few issues surfaced using Yesod and no major modifications were required to build their product.

The challenges were usually outside the scope of Haskell itself, but Haskell was used to fix them.

  • Javascript coding was significant in the frontend
  • A stable set of libraries is important
  • Compiler integration (feedback and errors) must be available as a library to let the IDE spot identifier location problems
  • LXC is used to create isolation containers — ephemeral storage, and containers can run on either shared or dedicated systems
  • The IDE uses isolated git state to separate users from each other. Each project gets a visual representation and its own git repository (stored in S3), accessed via gitlib-2. Interestingly, this offers multiple backends for git access: the C git library, GitHub, an IDE-local repo in S3, or others in the future.

For building projects, code is precompiled continuously in the backend, so bytecode can be generated quickly and run within the IDE. Bytecode runs in the GHC frontend container, and exceptions don't break the IDE. Keter and Chef provide deployment assistance to get it onto a shared application server or container, and FP Complete offers this deployment as part of their product.

The billing processor provides SOAP APIs; SOAP support was added to Haskell via C++ bindings.

Mobile gaming from Japan

They normally build mobile games using Ruby, PHP, etc., but have switched to FP in recent years for reliability, high performance and productivity. However, they've suffered from memory leaks (blaming Haskell's laziness, to audience laughter), some data loss and performance degradation.

Gree is one of the largest mobile games platforms, with 37 million users and 2000 games as of June 2013. They develop social games and a platform for 3rd party games. They also do social media and advertising and licensing, and also venture capital.

There are 2582 employees, with a significant number of engineers (they can't disclose exact numbers). On the client side it's normally Java, ObjC, Javascript and Unity/C#. The server side is usually PHP, MySQL and Flare (a KV store written in C++, developed internally at Gree). They started a Haskell project in June 2012 and they have 6 Scala and 4 Haskell engineers.

They developed a KVS management system using Haskell, to setup and destroy KVS nodes in response to hardware faults and sudden access spikes. It’s used in a large social game application. Another example is an image storage gateway, which converts WebDAV to memcached/S3 for the SNS photo uploader to store images. It uses Warp and http-conduit to do this conversion.

Some pitfalls they faced with Haskell:

  • Memory leaks via lazy evaluation: the frontend server keeps a list of active thread ids in a TVar for monitoring. When deleting from the thread list, the function didn't reduce the list to normal form and so held onto a reference. This led to a serious memory leak that took the service down. It was fixed easily by evaluating to normal form and forcing sequentialisation. They felt it is too easy to mix up TVar/MVar writes with other IO operations (which do reduce to normal form and avoid the leak). There is much confusion over the strict/lazy alternatives for non-IO functions.
  • Race conditions: data put in a queue was very rarely lost. The reason is that the timeout function can only safely be used once with an IO action (details omitted).
  • Performance degradation: from forking a lot of threads to do monitoring. This was easy to fix, but happens often since Haskell libraries are developed actively and the implications of changes are not always obvious.

How to fix all this? Better testing, to start with. Haskell has a great QuickCheck framework, and they also build on HUnit. This lets services be started in the test setup (in their example, memcached on a spare port), tests run against them, and everything torn down in a timely fashion. They used this to extensively test their KVS system and Flare (written in C++): over 150 system tests and a control server with 5000+ assertions, so this is a large-scale test.
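The setup/teardown pattern they describe is easy to sketch. Here is a hedged OCaml version (not Gree's Haskell test code; the function names and the memcached port are purely illustrative):

(* Start a real service for the duration of a test and always tear it down,
   even if the test raises. *)
let with_service ~start ~stop test =
  let service = start () in
  Fun.protect ~finally:(fun () -> stop service) (fun () -> test service)

(* Hypothetical usage: bring up memcached on a spare port, run checks
   against it, then stop it. *)
let () =
  with_service
    ~start:(fun () -> print_endline "starting memcached on port 11311"; 11311)
    ~stop:(fun port -> Printf.printf "stopping memcached on port %d\n" port)
    (fun port -> assert (port = 11311))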

Next up is documentation in Haskell. A problem report describes the details of a problem and is linked from the bug tracker. The timeline, workaround, impact and cause must all be captured, but they get mixed in with a number of other reports, so none of the Haskell people read them. They aggregated and summarised these for other Haskell products. Despite this, other Haskell programmers still don't read it (audience laughter). To sort this out, they do automated checking using hlint. They customise hlint to "grep" for their particular pitfalls, with the checks integrated into Emacs. hlint isn't quite good enough for a lot of their pitfalls though, such as the timeout issue from earlier, so they had some workarounds to deal with this. The good thing is that hlint is integrated into their text editor. Not all pitfalls can be caught by hlint: high-level design issues will remain problematic.

Education is also important. They have a brown bag (lunch? — editor) FP meeting where they “Make GREE a better place through the power of FP”. They cover both Scala and Haskell topics. They also run an education program for new graduates, and they have a Haskell code puzzle series from Project Euler to help the new graduates.

They conclude that they love FP and develop some key components of their service using it. But there are many pitfalls, and they are developing tools to help sort them out.

Haskell for distributed systems

Jeff Epstein (formerly a masters student in the Cambridge SRG, so it's good to see him again!) talks about the new distributed system that they are building in Haskell. His company is also hiring!

The HA middleware manages networked clusters, aiming for exabyte-scale clusters. It's architected by Mathieu Boespflug and Peter Braam (who has spoken about their FPGA work at past CUFPs). It manages 10000+ nodes and creates consensus in the cluster about the location of active services. It can handle events due to failures, and recovery time should be approximately 5 seconds.

Zookeeper didn't scale (audience interruption: why didn't Zookeeper scale? Jeff says he doesn't know and they should chat offline; the audience member notes that they maintain Zookeeper and wants to know more).

Events are sent by nodes, aggregated at a proxy node, and enqueued at an HA station. Events are processed by the HA coordinator, which updates a node state graph (which is also replicated — everything is replicated, a key theme in the talk). The HA coordinator is the decision maker which decides what to do on failure. Its main action is updating the node state graph to reflect the new strategy after failure.

Key components are: Haskell, Cloud Haskell and Paxos.

How they used Haskell:

  • Functional data structures result in a purely functional graph, replicated (impurely) between cluster managers. Most of the code feels imperative though, which was important.
  • Easier refactoring, thanks to strong typing, was important.
  • Laziness was yet again a cause of problems in IO, due to space leaks in low-level networking code. He notes that distribution is a natural barrier to laziness, since the act of serializing a message forces its evaluation. He also notes that processes must behave the same between processes on the same host as between hosts, so even local communication must be forced.

Phil Wadler interjects to ask "why did you use Haskell? It's a bad fit!" Jeff replies that "we like Haskell" and introduces Cloud Haskell, the topic of his MPhil thesis.

Cloud Haskell is a library for distributed concurrency in Haskell that provides an actor-style message-passing interface, similar to Erlang; it provides a "typed Erlang". Cloud Haskell is a good fit for this project in terms of interacting communicating components. Refactoring is easy, and this isn't a "big data" project in the sense of data size, though there is a huge volume of communication messages.

CH has a pluggable transport layer, with TCP provided but also transports for different underlying networks (they built Infiniband). However, CH requires ordered delivery of packets and this was a poor fit: they want out-of-order delivery, which could be built with UDP, but CH requires an ordering for some reason. Space leaks are a real problem in this layer.

Cluster bringup is tough for CH, since nodes need to find each other at the start of day, and there are some hacks in there to work around a morning storm of messages.

(Interjection from the audience: why not use Erlang? Jeff: performance wasn't up to par, and static types are important for them. Possibly the native-code/bytecode Erlang might be useful, but the speaker wasn't sure if it would work.)

Paxos is all about making systems agree (a well-known, proven algorithm for consensus). They implemented it as a general-purpose library on top of CH. The algorithm is a good match for the programming model, with the client, acceptor, proposer, learner and leader all built as CH processes. It ended up being a small, readable implementation (1.5 kLOC) that closely matched the pseudocode in the original Paxos paper (which made them feel better about correctness). It's used for synchronizing the state graph and for the distributed message queue.

Challenges: Paxos is hard, and untyped messages in CH make it worse. Getting reasonable performance from it also requires the modified variants. Liveness is also tricky: over-approximation must be avoided to prevent nodes being fenced out of the consensus unnecessarily. Multi-Paxos solves some of this, and they're working on more specific improvements too.

Debugging: the CH programming model makes it tractable to debug a distributed application on a single machine. Nevertheless, distributed code is still parallel and involves race conditions and interleaved messages. They would like a distributed ThreadScope, but instead mainly used old-school logging.

They did build a deterministic CH thread scheduler to replace CH's core message-handling primitives, and they also used PROMELA.

Q: How much is open source, or will be?
A: None, as far as the speaker knows.

Q: (Marius) How are non-deterministic sources composed with the deterministic scheduler?
A: The deterministic scheduler is for testing only, so timeouts are also made deterministic.

S&P Capital IQ: Functional Reporting

The speaker is Edward Kmett, on how they got FP in the door at S&P Capital IQ, their extensions to Haskell, and how they're working on open-sourcing their tech. S&P is huge and has a number of products (S&P Capital IQ Platform, ClariFI, AlphaWorks, Compustat).

Getting FP in the door:

  • Monoids (introduced, as in the Summingbird talk, for easy parallelisation; see the OCaml sketch after this list)
  • Reducers are another really powerful abstraction which can take a container full of values and give you back a structure, and it has a comonadic structure that lets you (e.g.) pass it many containers full of structures and generally compose things nicely. Reduction can also be done simultaneously across monoidal structures, and essentially "we don't see the wiring hanging out of our code any more".
  • Performance attribution: this is a model that breaks down the reasons why you make money (what did you do that was significant). The old and busted model (implemented in Java, full dataset in memory, hard to extend) was horrible; the new hotness is implemented in Scala, runs in constant space, is really fast, and results are flattened elegantly via combinators to a database backend. In other words: better, though some details are glossed over here (multipass algorithms, iteratees, caching layers, and handling missing data) — all this was sorted out and really sold Scala to the bigger company as a real asset.
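Monoids are worth spelling out, since they are the abstraction that makes the parallelisation easy. A hedged OCaml sketch (not S&P's Scala code; the Monoid signature and Sum module are illustrative):

(* Any associative combine with a unit can be folded over chunks of data and
   the partial results merged in any grouping, which is what makes the
   parallel/distributed reduction work. *)
module type Monoid = sig
  type t
  val empty : t
  val combine : t -> t -> t
end

let reduce (type a) (module M : Monoid with type t = a) (xs : a list) : a =
  List.fold_left M.combine M.empty xs

(* Example instance: summing, which can be split across workers and the
   per-worker totals combined afterwards. *)
module Sum = struct
  type t = int
  let empty = 0
  let combine = ( + )
end

let total = reduce (module Sum) [ 1; 2; 3; 4 ]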

They developed Ermine, which is Haskell with row types. It's not quite Haskell, but its main feature is that it's not Scala either (audience laughter). Writing functional programs in Scala is quite painful (e.g. not blowing the stack with monads requires trampolining everything and is labour-intensive just to get by day to day). They wrote Ermine in both Scala and Haskell, with a portable Core that can run on the JVM.

Ermine has a strong type system, with novel row types, polymorphic and constraint kinds ("they were just coming into GHC at the time and I wanted to understand them") and Rank-N types via a derivative of HMF. It fits their domain requirements really well, has built-in database support, can be exposed to end users, has a structured code editor that can only generate type-safe code as you edit, and gives full control over error messages.

Row types are a powerful way of describing the presence of a particular field. They are modelled via has, lacks and/or subsumes constraints. This is easily checked, but now inference flows unidirectionally through a join.
In Ermine, they have existentials in their constraint types (technical explanation follows, not shown here — editor).
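For a flavour of what a "has" constraint buys you, here is a hedged comparison using OCaml's object row polymorphism (this is not Ermine's syntax, and the field names are made up):

(* This function accepts any value that has at least a [size] field of type
   float, whatever other fields it carries. *)
let is_large (row : < size : float; .. >) = row#size > 10.0

let () =
  (* Both rows satisfy the "has size" constraint despite different shapes. *)
  assert (is_large (object method size = 12.5 method colour = "red" end));
  assert (not (is_large (object method size = 3.0 end)))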

Their reporting layer has one declarative specification for all the relational algebra they want, and pushes it into specific backends (SQL Server, etc.) to ensure queries run as close to the data as possible. They can also create extensive HTML reports that render consistently across different target platforms. Their example Ermine report was written by a programmer who had done no functional programming previously; it required a little hand-holding to work through some of the more obscure type errors, but generally it worked fine.

All the code is being stuck onto Hackage/Github etc.

Q: has it been rolled out to a lot of users?
A: being used on the main S&P website. One big client facing page for users. The next gen stuff is using Ermine more widely. The desktop app coming soon has Ermine and will be used by a huge number of users. Trend is to use it in an increasing number of their products.

Enterprise scheduling with Haskell at skedge.me

They have a cloud-based scheduling platform that takes care of complex enterprise scheduling. It has a branded customer experience (like Google Apps does), a flexible workflow and deep integration with backend systems and in-store workflow (e.g. on iPads).

An example is Sephora, who do makeup: their appointment system is integrated via skedge.me, embedded in an iframe that looks like a seamless part of their site, so end users can't tell the difference.

When the speaker started at skedge.me, there was a lot of fire-fighting. It was 43k lines of Groovy on Grails, and several major intractable bugs were affecting the business: timezones, double bookings, recurring events not firing and notifications being delayed. Performance was an issue, as interactive pages were sometimes taking minutes to load. There was also a lot of inflexibility — hilariously, they sometimes had to ask clients to change their business model to fit with the skedge.me model.

After some careful thought, they decided they had to rewrite skedge.me to cope with growth and offer better service. The speaker had done Haskell before, but not specifically for building a website.

Haskell had the perfect model for fixing their scaling problems. At the lowest level, they built a RawDB layer on top of the IO monad that does everything necessary to maintain ACID guarantees during transactions. An example is ensuring an effect only happens once the transaction commits (on their old racy platform, an email could be sent confirming an appointment that was then cancelled). This RawDB layer gives them stronger consistency guarantees, with tracked effects and automatic retries on temporary failure.

On top of RawDB, they built a DB monad which provides a higher level than SQL via algebraic data types, caching and data validation. They then layer a security monad on top of the DB monad to guarantee that all accesses have been checked. Finally there is a business logic layer to enforce local policy and customise per customer (which was a real problem in the past).
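A hedged OCaml sketch of that layering (skedge.me's stack is Haskell; every name below is illustrative, not their API). The point is that each layer only exposes operations that go through the layer beneath it, so application code cannot reach the database without passing the checks:

(* The lowest layer: a transactional action whose effects fire only on a
   successful commit, with retries on transient failure. *)
module type RAW_DB = sig
  type 'a t
  val return : 'a -> 'a t
  val bind : 'a t -> ('a -> 'b t) -> 'b t
  val run : 'a t -> 'a
end

(* The DB layer: typed queries over algebraic data types instead of raw SQL. *)
module type DB = sig
  include RAW_DB
  type appointment
  val find_appointment : id:int -> appointment option t
end

(* The security layer: every access is wrapped in a role check before it can
   touch the DB layer below. *)
module type SECURE_DB = sig
  type role = Owner | Staff | Customer
  type 'a t
  val as_role : role -> 'a t -> 'a t
  val run : 'a t -> 'a
end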

Security is complex due to all the customisation. They need roles by client (owner, staff, customer), but also verbs that clients can customise and that only make sense within a component (e.g. appointments can be joined, rescheduled or cancelled, or have notifications sent). Haskell lets all this fit within a multi-parameter type class, instead of OO-style inheritance that would rapidly get out of control. They map customisations from the customer onto a more standard schema that they can manipulate generically on a component-by-component basis, and it's all statically checked and generally easy to read (important for security-critical code).

Haskell "lets them build code for the long run". Sometimes, though, you need to write code quickly and not care about longevity. This was the case for the importer, which is a quick-and-dirty hack to get data into their system. It has a limited set of inputs, the "end user" is the developer, and it's thrown away once the job's done. (shows epic type). Often when people need to do this sort of thing, they resort to bash or dynamically typed systems. In this case, even for this "quick hack", they couldn't have achieved what they did "as badly as they did, without the help of the type system". Generally very positive!

Libraries help them avoid writing code. Haskell is generally lower in terms of package count, but their *use* of Hackage libraries has been a lot higher than of Javascript ones: they use 9 Javascript libraries, and 71 unique Hackage libraries (+87 dependencies). It's a lot harder to find a quality library that works in dynamically typed land. Only one of their Haskell libraries had to be replaced due to bugs. This is because Hackage is well organised, and it's easier to vet a library since the purely functional ones are less risky (fewer side effects threaded through other code). In contrast, Javascript libraries can side-effect (e.g. on the DOM) in ways that only show up very late in the QA process. So: fewer Haskell libraries, but the number of useful ones seems to be higher.

Problems: cyclic module dependencies with hs-boot files are unwieldy, so avoid them. Scalability beyond 4 cores was limited, but they haven't tested GHC 7.8 (which apparently fixes a lot of this). Idle GC can cause long pauses, so disable that feature — it depends on your workload, but it was fine for their server-based workload.

Debugging production problems without stack traces is really hard, so consider using profiling mode for all libraries.

Bottom line: 8200 lines of Haskell from 40k+ lines of Groovy. No recurring bugs, very good performance, and flexibility for customer needs. Used daily by thousands of people. Check out the website!

Q: what about testing (QuickCheck?)
A: They've done some QuickCheck on critical components (such as a map from times to events). Overall though, just putting Haskell together in the most obvious way led to improvements over their previous platform. Their QA people are now writing unit tests in Haskell (their Selenium bindings proved harder to use).

Integration of Mathematica with MapReduce

From Paul-Jean Letourneau, a data scientist at Wolfram Research. He started with a show of hands: most of the audience had used Mathematica and had heard of Wolfram.

The speaker has been responsible for a lot of the personal analytics posts on the Wolfram blog (the email analysis, genomics and several others). The theme of all his projects has been working with large data. Mathematica has always kept its data in-core, which is a problem with very large data sets, so this talk is about how they migrate to a more big-data model that runs out-of-core.

Fundamental principles of Mathematica: everything is an expression, expressions are transformed until they stop changing, and transformation rules are in the form of patterns. Expressions are data structures (actually m-expressions in Lisp parlance). (full explanation of these in Mathematica demonstrated but omitted here)

Expressions in Mathematica are immutable data structures (“program as data”) and so expressions can be rebound. Homoiconicity is pervasive as expressions are the data structure, even internally during evaluation (where it’s represented as a tree).

"Everything is a one-liner in Mathematica… for a sufficiently long line" (Theo Gray). (audience laughter). He demonstrates some cool one-liners that are tweetable (<140 chars) and shows how to construct an image recursively out of itself (wow).

He views it as a gateway drug to declarative programming:

y = 0
For[i=1; i<=10; i++,y+=i^2]; y
385

(and then they discover Fold and all the advanced topics such as scoping, evaluation control and the MathLink protocol for IPC).
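For comparison, here is the same sum written declaratively with a fold, the step the talk says people discover next (shown in OCaml rather than Mathematica):

let sum_of_squares =
  List.fold_left (fun acc i -> acc + i * i) 0 (List.init 10 (fun i -> i + 1))
(* sum_of_squares = 385, matching the loop above *)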

HadoopLink is how they link Mathematica to Hadoop. The example is WordCount, where he shows code to do it on a single core first (to test the output), and then an export to HDFS so that Hadoop can see it. This looks fairly straightforward via a DFSExport command from HadoopLink.

The Mapper looks something like:

Function[{k,v},
  With[{words=ToLowerCase 
    /@ StringSplit [k,RegularExpression[], ...

The reducer is similar: it uses an iterator to sum up each of the inputs. It communicates with the JVM to ensure that they don't have to load all the values into memory. Running the job is just a special HadoopMapReduceJob call with all the relevant expressions, which goes ahead and submits it to Hadoop.

He used all this to build a genome search engine, and shows how to prep the data using mitochondrial genomic data. The mapper only knows about a single base in the genome and its position, plus the query sequence: how can it possibly align the query sequence with the genome? The algorithm scales with the size of the query, not the size of the dataset. Hadoop handles the full dataset, so he can focus on getting the local algorithm right in Mathematica rather than on the bigger data problem.

Challenges: when this was scaled up to search the entire human genome, his Hadoop cluster blew up. Nodes started dying, Nagios alerts, the works! After drilling down on the cryptic errors, the cause was "garbage collector overhead exceeded", which was hard to interpret. Memory consumption on each worker node running both Mathematica and the Hadoop slave is quite high, and the heap size is a cluster-wide setting, so it is problematic to tweak just for this job. What he'd like to do is add job-specific options for things like heap size.

It's on GitHub and open source (github.com/shadanan/HadoopLink), so go for it.

Announcements

Ashish Agarwal is organizing BoFs! On Monday @ 6pm there will be a bioinformatics session with Jeff Hammerbacher with location on the CUFP website.

On Tuesday there will be a BoF about organizing local events for FP. In the same way that politics is local, this BoF will help you figure out strategies for local events.

Yaron Minsky announces that 17 organizations have sponsored CUFP this year, which helps students attend. It's really amazing to see the industry attention as FP moves into the mainstream. The industrial reception is at 18:30 on Monday, downstairs.

Our thanks to Marius Eriksen and Mike Sperber for an amazing program!

by Anil Madhavapeddy at Mon, 23 Sep 2013

OPAM 1.1 beta available, with pretty colours (Anil Madhavapeddy)


Thomas just announced the availability of the OPAM beta release. This has been a huge amount of work for him and Louis, so I’m excited to see this release! Aside from general stability, the main highlights for me are:

  • A switch to the CC0 public-domain-like license for the repository, and LGPL2+linking exception for OPAM itself. The cutover to the new license was the first non-gratuitous use of GitHub's fancy issue lists I've seen, too! As part of this, we're also beginning a transition over to hosting it at opam.ocaml.org, to underline our commitment to maintaining it as an OCaml community resource.

  • Much-improved support for package pinning and updates. This is the feature that makes OPAM work well with MirageOS, since we often need to do development work on a low-level library (such as a device driver) and recompile all the reverse dependencies.

  • Support for post-installation messages (e.g. to display licensing information or configuration hints) and better support for the external library management issues I explained in an earlier post about OCamlot testing.

  • Better library structuring to let tools like Opam2web work with the package metadata. For instance, my group’s OCaml Labs has a comprehensive list of the software packages that we work on generated directly from an OPAM remote.

  • A growing set of administration tools (via the opam-admin binary) that run health checks and compute statistics over package repositories. For example, here’s the result of running opam-admin stats over the latest package repository to show various growth curves.


Number of unique contributors to the main OPAM package repository.

Total number of unique packages (including multiple versions of the same package).

Total packages with multiple versions coalesced so you can see new package growth.

Is it a hockey stick graph? Only time will tell! See Thomas’ full release announcement and let us know how you get along with this new release…

Thu, 19 Sep 2013

OCaml Monthly Meeting – Live Blog (Heidi Howard)


Today's OCaml Labs Monthly Meeting is all about practice talks for OCaml2013, so in that spirit, I'll be practising a bit of live-blogging too.

13:53 – Today's SRG Meeting is over and it's time for some work before the OCaml Labs meeting at 4:00, see you then …

16:02 Technical difficulties delayed the start

16:02 Intro from Anil

Introducing Gabriel Scherer, who is visiting us this week, and we are going to the Maypole after this meeting. We had a cash prize from ASPLOS after winning the HiPEAC paper award, and the money will go towards SRG wine for the Christmas party. The Signpost paper was accepted to FOCI and a HotNets paper on Trevi was also just accepted.

OCL Website – Too much manual management at the moment; moving to an OCaml Planet feed of blog posts. David has been busy hacking on opam2web. OPAM has 512 packages, and opam2web takes a subset of the OPAM packages and turns the metadata into a minisite, like on the OPAM site. It doesn't require manual updates, like an ATOM feed.

Upcoming events – Tomorrow is the 2nd compiler hacking event, at Makespace. Anil will be talking at QCon on Mirage; the Mirage 1.0 release date is October 22nd, so maybe a workshop before then. We have 3 talks for OCaml2013 (Platform, OCamlOT and Ctypes), so here we go …

16:09 Anil's practice talk on OCaml Platform 1.0

Languages take many different approaches to a platform, but what does platform even mean? As a late mover in this field, we can learn from other languages. A platform is NOT a group of temporarily motivated hackers building a replacement standard library. It's hard to adopt a particular approach without a domain-specific purpose (there are too many opinions), so we need an objective way to determine what belongs in the platform, and we need a genuine community that is sustainable (even if a large party leaves). A platform is a bundle of tools that interoperate, with quantitative metrics to judge success, built-in agility, and support for developers through the whole development life cycle. Industrial partners have a range of needs, as each works in a different domain.

Tooling – Overview of 5 areas: OPAM from OCamlPro, IDE Tools, OPAM-DOC, OCaml compiler itself and Ocaml.org.

OPAM – 1.1 released today (maybe), over 100 contributors to OPAM,  500+ packages, 1500+ unique versions, external dependency solver using CUDF

IDE Support – OCaml has many intermediate files. From OCaml 4.00 onwards, we have a binary format of the abstract syntax tree with type annotations, called cmt (and cmti for interface files), so we can now create external tools, like opam-doc, to query it. ocp-index and ocp-indent from OCamlPro, and Merlin (I think this is EPIC), are also now available.

opam-doc – Now we have cmt files, we need unified documentation across packages. This is much harder than it sounds, as it touches every part of the tool stack. Not all packages can be installed at once due to conflicts, and module inclusion is tough to encode in static HTML. (Need to make a demo.) bindoc takes the typed AST (in cmt) and generates cmd files, which include the ocamldoc comments; opamdoc takes the cmt/cmd database for OPAM and outputs a single website with your universe of packages.

ocaml.org – Demo of ocaml.org at ocaml-redesign.github.io/pkg/index.html; feedback is welcome, says Amir.

Now we have the tools, what metrics can we extract to see how well our tools are doing?

Portability – Windows compatibility?

Maintainer – is there a place for docs, will people respond to issues/comments/emails, and where can issues be submitted?

Tests – code coverage, multi variant benchmarking in core-bench

Stability – OPAM supports pinning; how stable are the interfaces of libraries?

OPAM tracks compiler constraints and statically analyses the build system from logs (OCamlOT).

Agility – Building a platform is EXHAUSTING. We want to ask "WHAT IF" questions: what if let was monomorphic? What if we removed camlp4? What if the syntax precedences change?

Distributed workflow – built on git, distributing tasks between 3 actors: authors (library writers), OCamlOT workers and maintainers. As packages become more stable they move from staging to stable to inclusion in the platform.

We are building a tussle: we want to launch a game in January and let people put standard libraries into the ring, running OCamlOT to discover the winner.

No clear winner: Lwt – portability; Batteries – free of syntax extensions; Core – comprehensive.

16:36  Discussion over the battle of the standard libraries and talk feedback

C: talk is a bit long, not sure what to cut..

C: OPAM was discussed last year at the OCaml workshop; we want to update everyone and follow on without overlapping too much

Q: Haven't we already decided on Jane Street's Core?

A: No, we use all of them; e.g. Mirage uses Lwt extensively

Q: What if we don't want any of the new standard libraries? Maybe I just want to use domain-specific libraries from OPAM as and when I need them

A: We are not forcing the new standard libraries on anyone, but they are useful for beginners; it's nice to have a consistent style, interoperability and few open statements, e.g. open Core.Std

Q: What if I have already decided which standard library I want to use?

A: Again, we are not forcing standard libraries on anyone; we are just trying to focus effort more directly. OCaml tools will always be standard-library agnostic

C: the diagram of OCamlOT is confusing

C: how not to overlap with David's talk

16:41 David's talk on OCamlOT

State for the open source OCaml community

Outline: what is quality software? What is the user experience? What is the feedback loop for package authors? How do we represent the thing underneath this all? A utopian future?

Quality: works on every core (ANIL: we want multi-core :P ), consistent results: work, or die nicely with an obvious solution, no more "What have I forgotten?" questions, it should just tell you. We need addictive actions (not sure what they are), consistency, quality functions…

Universal concerns: compiler hypotheses ("what if" questions, which Anil already covered); build system hypotheses ("what strange assumptions is the build system making?"); package manager hypotheses and environment hypotheses.

Workflow: make a pull request, the curator observes the proposal, predicts the future, proposes amendments, a feedback loop runs and finally agreement is reached. Core is released weekly, for example; we are trying to work like Linux kernel patches.

New workflow: promote the health of the OCaml community, preaching compatibility, "observe, orient, decide and act". A computer-assisted curator will help a human to run the loop faster; the human can pose questions to the computer-assisted curator, e.g. "will this run on ARM?"

Repository Observation: GitHub bindings with web hooks, but we are not tied to GitHub. We merge into the world, and we need dependencies from each possible user's perspective of the world.

Dependency Orientation: capabilities with environmental dependencies, packages with constraint-based dependencies, repositories with revision dependencies and artifact dependencies. Example of the Android repo.

Triage Decisions: taking plain-text errors and parsing them into categories such as unsatisfiability (can't have these two packages together), dependencies (if my dependency is broken, then I am broken), transient (network down), system, metadata, external dependencies (you forgot to write a dependency), build errors, and combinations of the above.

State Action: commit intention, build, error analysis and build results.

Internet res: the agents negotiate over a REST API on HTTPS, independent metadata layers (not sure about this), everything's an s-exp, branch-consistent store explained, like git or Irminsule.

Current state: GitHub web hooks; we are conservative, so if one byte changes we rebuild everything; basic triage heuristics completed; no amendments are proposed by the system at the moment; we commit the evidence rather than the outcome; simple reactions to results; a website with green and red boxes in a large table.

History: we have found lots of metadata issues, many package bugs, and some tool bugs like a non-relocatable compiler and ocamlbuild ignoring PATH. We currently have 30+ x86-64 and 30+ x86-32 workers, 8 ARMs, many Linux distros, a dead Raspberry Pi, panicking *nix filesystems, and lots of people who have set warnings as errors.

Future: opamfu for DAG analysis, schema migration overhead, lower overhead for administering exotic workers contributed to OCamlOT, authenticating machines using ocaml-sodium, more advanced automation, proposed amendments ("let's have a dialogue"), better website integration, benchmarking your upgrades (how much an improvement costs), running experiments on the whole OPAM universe with differential analysis, and a VM-based test system to specify the worker finely.

What I think quality is, a vision of the future, how it's represented underneath, and what's next.

Discussions

C: that was 20 mins; feedback to David regarding content to be cut

17:23 Ctypes by Jeremy 

This is an update, not a practice talk.

An example of binding puts from C: how we can write no C at all and do the linking from OCaml.
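A minimal sketch of the kind of binding being described (assuming the ctypes and ctypes-foreign packages are installed; the exact example shown in the meeting may have differed):

(* Call C's puts from OCaml through ctypes, with no hand-written C stubs. *)
open Ctypes
open Foreign

let puts = foreign "puts" (string @-> returning int)

let () = ignore (puts "hello from ctypes")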

NEW things in Ctypes:
Pretty-printing – for C types and C values, making it much easier to examine values for debugging

Bigarray – support for lumps of C memory

More types – nullable strings, complex numbers

String conversions – much faster

Memory management – ctypes now gives the programmer more control over the lifetime of OCaml values passed to C

Finalisers – which you can attach to memory

Future

Stub generation – instead of binding dynamically, it will generate C stub code to bind to the API

Capability-style memory safety – one rogue pointer in a C library can cause havoc; loading each C library in a separate address space means a library can only kill itself, and you can then even run a C library on a foreign host or in a virtual machine

Static struct/union layout – checking the layout of structures and unions against the API

17:40 Amir demos ocaml-redesign.github.io/docs/opam (it looks great :))


by Heidi-ann at Tue, 17 Sep 2013

Inaugural compiler hackers meeting (Compiler Hacking)



The first OCaml Labs compiler hacking session brought together around twenty people from OCaml Labs, the wider Computer Lab, and various companies around Cambridge for an enjoyable few hours learning about and improving the OCaml compiler toolchain, fuelled by pizza and home-made ice cream (thanks, Philippe!).

We benefited from the presence of a few experienced compiler hackers, but for most of us it was the first attempt to modify the OCaml compiler internals.

The first surprise of the day was the discovery that work on the list of projects was underway before we even arrived! Keen collaborators from The Internet had apparently spotted our triaged bug reports and submitted patches to Mantis.

Standard library and runtime

There was an exciting moment early on when it emerged that two teams had been working independently on the same issue! When Jon Ludlam and Euan Harris submitted a patch to add a get_extension function to the Filename module they found that they had been pipped to the post by Mike McClurg. There's still the judging stage to go, though, as the patches wait on Mantis for official pronouncement from the Inria team.

Vincent Bernardoff also spent some time improving the standard library, fleshing out the interface for translating between OCaml and C error codes, starting from a patch by Goswin von Brederlow.

Stephen Dolan looked at a long-standing issue with names exported by the OCaml runtime that can clash with other libraries, and submitted a patch which hides the sole remaining offender for the runtime library. As he noted in the comments, there are still a couple of hundred global names without the caml_ prefix in the otherlibs section of the standard library.

Tools

There was a little flurry of work on new command-line options for the standard toolchain.

A Mantis issue submitted by Gabriel Scherer suggests adding options to stop the compiler at certain stages, to better support tools such as OCamlfind and to make it easier to debug the compiler itself. The Ludlam / Harris team looked at this, and submitted a patch which provoked further thoughts from Gabriel.

Vincent looked at extending ocamldep with support for suffixes other than .ml and .mli. Since the issue was originally submitted, ocamldep has acquired -ml-synonym and -mli-synonym options that serve this purpose, so Vincent looked at supporting other suffixes in the compiler, and submitted a patch as a new issue.

The OCaml top level has a simple feature for setting up the environment — when it starts up it looks for the file .ocamlinit, and executes its contents. It's sometimes useful to skip this stage and run the top level in a vanilla environment, so David Sheets submitted a patch that adds a -no-init option, due for inclusion in the next release.

Error-handling/reporting

Error handling issues saw a good deal of activity. Raphaël Proust submitted a patch to improve the reporting of error-enabled warnings; David investigated handling out-of-range integer literals and return-code checking of C functions in the runtime, leading to some discussions on Mantis. Stephen submitted a patch to improve the diagnostics for misuses of virtual. Gabriel Kerneis and Wojciech looked at some typos in ocamlbuild error messages, and Mike opened an issue to clarify the appropriate use of the compiler-libs package.

Language

The open operation on modules can make it difficult for readers of a program to see where particular names are introduced, so its use is sometimes discouraged. The basic feature of making names available without a module prefix is rather useful, though, so various new features (including local opens, warnings for shadowing, and explicit shadowing) have been introduced to tame its power. Stephen looked at adding a further feature, making it possible to open modules under a particular signature, so that open M : S will introduce only those names in M that are specified with S. There's an initial prototype already, and we're looking forward to seeing the final results.
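A sketch of the idea (the open M : S syntax is a prototype, not released OCaml; the modules here are made up), together with the closest spelling available today via a constrained rebinding:

module type S = sig
  val helpful : int -> int
end

module M = struct
  let helpful x = x + 1
  let internal_detail = "stays hidden"
end

(* Proposed: [open M : S] would bring [helpful] into scope but not
   [internal_detail]. Today the same effect needs an intermediate module: *)
module M_as_S : S = M
open M_as_S

let three = helpful 2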

The second language feature of the evening was support for infix operators (such as the List constructor, ::) for user-defined types, a feature that is definitely not in any way motivated by envy of Haskell. Mike's prototype implementation is available, and there's an additional patch that brings it closer to completion.

Next session

The next session is planned for 6pm on Wednesday 18th September 2013 at Makespace, Cambridge. If you're planning to come along it'd be helpful if you could add yourself to the Doodle Poll. Hope to see you there!

by cl-ocamllabs@lists.cam.ac.uk (OCaml Labs) at Tue, 17 Sep 2013

Camlpdf, the first good command-line PDF tool I've found (Anil Madhavapeddy)


The fine folks at O’Reilly have been proof-reading the Real World OCaml book that I’ve been working on. My heart leapt with joy when the copyeditor commented that she thought it was very well written, but my joy was short-lived when it turned out that all her comments were safely ensconced as PDF comments. A few hundred comments that required a response from us, too.

“No problem! MacOS X Preview can handle these!” was of course my first response, but it turns out it’s totally broken with PDF notes. The first note that you select will appear as the content of all the other subsequent notes. Yaron Minsky then experimented with Adobe Acrobat, which I’ve sworn never to install again after an unfortunate incident involving the uninstaller a couple of years ago. That turned out to be incredibly slow. I tried a few open-source tools such as Skim which, while otherwise an excellent bit of software, couldn’t render these particular annotations.

I literally jumped off my seat upon discovering cpdf.

Meanwhile, John Whitington just announced the release of the Coherent PDF command-line tools. Since these are all written in OCaml (and have been developed over quite a few years now), he also sent in an OPAM pull request to add it to the database. And most conveniently, this ended up solving my little PDF conundrum in less than an hour of hacking, and has almost cured me of my lifelong fear of dealing with anything in a PDF-like format. Here’s what I did:

  • I installed the tools via opam install cpdf. This installed the library but not the binary (swiftly fixed).

  • Reading the license told me that it’s for non-commercial use only, so I bought a license from the Coherent PDF website (a bargain price, given how much it does!).

  • I ran cpdf -list-annotations over the PDF, and it dumped out all the comments as a text file to stdout. This wasn’t quite enough for me, since I needed to match the annotation to a page number. But since John has released it as open-source, I forked the repository and patched the support directly into the command-line tools, and sent a pull request back over to John. Since it’s under a non-standard license, I decided to place my patch in the public domain to make it easier for him to accept it if he chooses.

  • My co-authors can just run opam pin cpdf git://github.com/avsm/cpdf-source#annotation-page-numbers to pin their local copy of CPDF to my forked branch in their own OPAM installations, and easily use my copy until John gets a chance to integrate my changes properly upstream.

Total time including this blog post: 40 minutes. Now, onto fixing the author response comments for Real World OCaml. I’m so happy to have cpdf as a simple, hackable PDF utility, as it does things like page combining and rotations that have always been a little flaky in other tools for me. It’s the Pandoc of PDFs!

Sun, 15 Sep 2013

OCaml Development in Vim (Heidi Howard)


This is a quick run-through of how I set up my development environment in vim:

Install pathogen.vim

mkdir -p ~/.vim/autoload ~/.vim/bundle; \
curl -Sso ~/.vim/autoload/pathogen.vim \
    https://raw.github.com/tpope/vim-pathogen/master/autoload/pathogen.vim

Add the following to ~/.vimrc:

execute pathogen#infect()
syntax on
filetype plugin indent on

Install Syntastic

cd ~/.vim/bundle
git clone https://github.com/scrooloose/syntastic.git

Then quit vim and use :Helptags to check that the installs so far have worked.

Install Merlin

opam switch 4.01.0dev+trunk
opam update
opam upgrade
opam install merlin

Add the following to ~/.vimrc

:set rtp+=~/.opam/4.01.0dev+trunk/share/ocamlmerlin/vim
:set rtp+=~/.opam/4.01.0dev+trunk/share/ocamlmerlin/vimbufsync
let g:syntastic_ocaml_checkers=['merlin']

:SyntasticInfo will return a list of syntax checkers available to Syntastic, check that this now includes merlin

Install OCP Indent

opam install ocp-indent

Add the following to ~/.vimrc

autocmd FileType ocaml source /home/heidi-ann/.opam/4.01.0dev+trunk/share/typerex/ocp-indent/ocp-indent.vim

by Heidi-ann at Mon, 09 Sep 2013

OCamlot--exploring the edges of OPAM packages (Anil Madhavapeddy)


The new release of OCaml (4.01.0) was just announced! The runup to a major release like this is normally a frantic time to test that your favourite applications don’t break unexpectedly due to some incompatible language feature. This release cycle has been a little different though, as we’ve been working hard on using the OPAM package database to build an online regression testing infrastructure to mechanize much of this process.

I wanted to share some of what OCamlot does today, some of the results from about 3 months worth of runs that may help OCaml package maintainers, and finally where we’re going with future developments. This work has been done in collaboration with David Sheets (who built the OCamlot daemon) and the OPAM team ably led by Thomas Gazagnaire at OCamlPro. We’ve also had a lot of help from Jeremie Dimino and Yury Sulsky from Jane Street and Dave Scott and Jon Ludlam from Citrix acting as guinea pigs for their respective regular releases of the Core and Xen/XAPI releases to OPAM.

Towards a truly portable OCaml

The upstream OCaml toolchain is built on very UNIX-like principles, with a number of command-line tools that form a build pipeline. This process usually ends with linking the intermediate object files with a runtime library that provides the garbage collector and other intrinsic OS functions.

Given these raw compiler tools, it’s very easy to compile OCaml into all sorts of weird and wonderful architectures. We’ve seen it run on 8-bit PICs, several efficient Javascript backends (originally ocamljs and more recently js_of_ocaml), and of course my own Mirage Xen unikernel.

While the compiler tools themselves are quite portable and work on almost any UNIX-like system, the build system scaffolding around third-party packages is less portable. Some features such as C bindings often contribute to build breakage on some less-used operating systems such as FreeBSD or OpenBSD, as they usually require probing for header file locations or adding custom CFLAGS before building.


Every Internet of Things starts with a tangled pile of ARM Dreamplugs.

And in our server room, venerable Sparc and PowerPC G4 OpenBSD boxen still live.

Finding older machines is getting tough, but here's Dave's old iMac G5 running Linux.

Most OCaml developers use x86-based machines and so foreign architectures also get less day-to-day testing (OCaml has superb support for fast native code compilation on ARM, PowerPC, Sparc32, and we’re working on MIPS64 here as part of the CHERI project).

We want to make sure that OCaml and its package ecosystem works just as well in the embedded ecosystem as well as it does on vanilla x86 Linux. This includes running on my Linux iMac G5, my FreeBSD Raspberry Pi, my OpenBSD Pandaboard, or even on a bytecode-only architecture like an OpenBSD/Sparc64.

In the longer term, this paves the way to reliable cross-compiled packages for Windows, Android and iPhone (all of which have OCaml ports, but aren’t heavily tested with the full package database). The only practical way to get started on this is by building an automated test infrastructure for OCaml that explores the feature matrix (which eerily for me, happened in the early days of Xen too, via XenRT to stabilize the hypervisor).

Why not Jenkins or Travis?

When we first started hacking on the OPAM testing infrastructure earlier this year, I maintained a local Jenkins installation that monitored the repository. While Jenkins is a superb tool for many continuous integration tasks, it fell over badly when trying to use it on the non-x86 and non-Linux (or Windows) operating systems. Jenkins requires a full Java runtime stack to be available on each of the client machines, which was taking up more time to get up and running than a simple OCaml-based client and server that could compile to both portable bytecode or fast native code.

The other difficulty with OPAM is selecting which packages actually need to be tested, as it has a constraint-based package solver that supports expressive forwards and backwards version restrictions. While basic tests of the latest packages worked with Jenkins, we needed to increasingly customize it to automate interfacing directly with the OPAM libraries and calculating test schedules based on incoming change requests.

Another factor that ruled out depending on hosted services such as Travis is that they tend to support x86-only architectures, whereas we really want to test the full spectrum of CPU variants supported by OCaml. This doesn’t mean that there’s no place for Travis of course, and in fact Mike Lin has already made this work with OPAM.

For our full testing needs though, OCamlot was born: an OCaml client and server system which coordinates different versions of the compiler, architectures and OPAM versions and records the results for triage and fixing issues.

Running OCamlot

The latest alpha release of OCamlot is pretty straightforward to run locally, if you are so inclined. First start a server process:

$ git clone git://github.com/ocamllabs/ocamlot
$ cd ocamlot
$ ./install_deps.sh
$ oasis setup
# (edit lib/config.ml if you need to change ports)
$ make
$ ./_build/lib/ocamlot_cmd.native --help
$ ./_build/lib/ocamlot_cmd.native serve

The server listens on localhost only by default, and normally an SSL-to-TCP proxy is deployed to listen for external connections (I use stud, which is fast and easy to configure).

The OCamlot clients require a local compilation of OCaml, and they autodetect their local CPU architecture and operating system. I’ve put a gist script together that automates this on most Linux and BSD variants. Just customize the top variables (set MAKE to gmake under BSD) and set the hostname to your server process.

The results repository and auto-triage

Once the client is running, the server dispatches tasks from its test matrix, which is calculated from the OPAM package repository. The server maintains a results repository, which is a Git filesystem database that records the build results and logs via an s-expression per task. It also uses Cohttp to serve up the results for a web browser.

It’s very convenient using Git repositories like this since we can use GitHub (or any other Git host) to coordinate and record results, and restart the server without needing any local state. So convenient, in fact, that Thomas and I have been working on a more formal engine for this called Irminsule (more on that in a later post, though).

It’s almost unheard of to have a full OCamlot run go by without some errors, and so David put together the ocamlot triage command. This takes the state repository and runs a set of regular expressions over it to classify them into common errors. The full file is here, but an excerpt should give you an idea of what we look for:

	(* ordered minor to severe *)
	type analysis =
	  | Solver of solver_error option
	  | Dep of string * analysis
	  | Transient of transient_error
	  | System of system_error
	  | Meta of meta_error
	  | Ext_dep of ext_dep_error
	  | Build of build_error
	  | Multiple of analysis list
	    with sexp
	

The errors are ordered by severity to aid in color highlighting. They start with OPAM solver failures and dependency failures (e.g. due to trying to build a package that requires a specific OCaml version that isn’t available), and move onto missing package dependencies or system libraries.

Testing the run up to OCaml 4.01

Of course, running all these tests is useless without taking action on the results. I’ve been keeping track of them in issue #1029. The nice thing about GitHub issues is that when this bug is referenced in commits (even in other repositories) a cross-reference shows up on that webpage and lets everything be tracked nicely.

So what were the most common failures in the runup to 4.01, and what should you avoid when writing your own code?

Different standard library module signatures

There have been a few changes to some of the functor signatures in the standard library, such as adding a find function to Set (mantis #5864). A third-party library that tries to match the functor signature will fail to compile with a type error, such as this one below for zipperposition.0.2:

Error: The implementation src/ptset.ml
       does not match the interface src/ptset.cmi:
       ...
       In module Big:
       The field `find' is required but not provided

The relevant code in zipperposition makes it clear what the problem is:

	(* Big-endian Patricia trees *)
	module Big : sig
	  include Set.S with type elt = int
	  val intersect : t -> t -> bool
	end
	

This particular bug was reported upstream, and the fix requires implementing the find function for the Patricia-tree-based Set. Since the OPAM package will always be broken on OCaml 4.01, it was marked with a compiler version constraint to prevent it being selected for installation under that compiler. When a new version with the fix is uploaded to OPAM, it will always be selected in preference to this broken one.
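The kind of fix needed is small. Here is a hedged OCaml sketch (not zipperposition's actual patch; the module is simplified to wrap the stdlib Set rather than their Patricia trees): for an integer set whose ordering agrees with structural equality, the find required by the 4.01 Set.S signature can be defined via mem, since the stored element equal to the query is the query itself.

module Big : sig
  include Set.S with type elt = int
  val intersect : t -> t -> bool
end = struct
  include Set.Make (struct type t = int let compare = compare end)
  let intersect s1 s2 = not (is_empty (inter s1 s2))
  (* The new 4.01 requirement; a hand-rolled set implementation would write
     this against its own tree structure instead. *)
  let find x s = if mem x s then x else raise Not_found
end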

One other 4.01 change that temporarily broke most of the bigger networking libraries such as Core, Lwt and Batteries was the addition of the close-on-exec flag to the Unix module. This change only affects upstream packages that redefine the UNIX module for their own purposes (such as adding an asynchronous I/O monad as Lwt does), hence it affects the standard library replacement packages.

The fix here was to locally add patches into the relevant OPAM packages to immediately unbreak things when the fix went into the 4.01 branch of the compiler, and to notify upstream maintainers to release new versions of their projects. There’s a subtle problem here: when a patch such as this goes into an unreleased branch of the compiler (such as 4.01.0dev), it’s hard to reliably detect whether the user has the very latest version of the compiler or not. If you do have problems like this in the future, try recompiling via opam switch reinstall <version> to get the latest branch.

It’s very useful to be able to drop in bleeding-edge compiler tools into the OPAM repository using compiler constraints like this. For an example, see Alain Frisch’s ppx_tools, that require the very latest 4.02dev trunk release to compile his new extension-points feature.

Multiple object definitions

OCaml 4.01 also restricts multiple method definitions with the same name in the same object. This leaves only inheritance as the way to override method names, but some packages such as OCamlnet and Mlorg had minor uses of the old mechanism.

You can see this by using opam switch:

$ opam switch 4.00.1
$ eval `opam config env`
$ ocaml
# object method x = 1 method x = 2 end;;
- : < x : int > = <obj>
$ opam switch 4.01.0
$ eval `opam config env`
$ ocaml
# object method x = 1 method x = 2 end;;
Error: The method `x' has multiple definitions in this object

New warnings, and the dreaded warnings-as-errors

After a decade of being deprecated, the (or) and (&) operators finally had a warning turned on by default.

$ ocaml
# true or false;;
Warning 3: deprecated feature: operator (or); you should use (||) instead
- : bool = true

This wouldn’t normally be so bad, except that a surprising number of released packages also turn warnings into fatal errors (by using the -w @ flags explained in the manual). Warnings-as-errors is extremely useful when developing code but is rather harmful in released code, since a future compiler can choose to emit new warnings that aren’t necessarily fatal bugs.

Packages that failed like this include ocamlgraph, spotlib, quickcheck, OPA, Lablgtk-extras and many more. Please do make an effort to not leave this option turned on in your packages, or else it makes life more difficult for testing your code on bleeding edge versions of the compiler in the future.

It’s worth noting here that OCaml 4.01 has introduced a fair number of new and very useful warnings across a number of areas, mainly to do with detecting unexpected ambiguation or shadowing of values. I’ll cover more on these in a future post about the new 4.01 goodies.

External system dependencies

While there are many packages in OPAM that are pure OCaml, there are also a substantial number that require other system libraries or tools to be installed. The Lablgtk GUI library obviously requires the GTK+ C libraries to be installed.

Determining if these libraries are installed on a particular OS is well beyond the scope of OPAM, as there are almost as many package managers as there are operating systems. However, it’s important for automated testing and user-friendly error messages to have some notion of detecting if the environment is ready for the OCaml package or not.

We’re solving this by using a depexts field in OPAM that consists of a set of tags that identify OS-specific packages that need to be present. A separate script can query these tags from OPAM and do the OS-specific tests or installation.

For example, here’s the sqlite3-ocaml OPAM description:

opam-version: "1"
maintainer: "markus.mottl@gmail.com"
build: [
  ["ocaml" "setup.ml" "-configure"]
  ["ocaml" "setup.ml" "-build"]
  ["ocaml" "setup.ml" "-install"]
]
remove: [ ["ocamlfind" "remove" "sqlite3"] ]
depends: ["ocamlfind"]
depexts: [
  [ ["debian"  ] [ "libsqlite3-dev"]    ]
  [ ["freebsd" ] [ "database/sqlite3"]  ]
  [ ["openbsd" ] [ "database/sqlite3"]  ]
]

The depexts field here lists the APT package for Debian, and the ports tree locations for FreeBSD and OpenBSD. It could also list more specialised tags for particular versions of an OS. You can query this from OPAM as follows:

$ opam install -e debian sqlite3-ocaml
libsqlite3-dev
$ opam install -e openbsd sqlite3-ocaml
database/sqlite3

OCamlot therefore needs to query the depexts field from the package and run the right apt or pkg_add commands. I’ll write about this in more detail when it’s fully baked, as we’ve modified the semantics of the tag querying between OPAM 1.0 and OPAM 1.1 to make it easier to use in OCamlot.
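
As a very rough sketch of what such a driver can do today, reusing the Debian query shown above (the real OCamlot logic handles more operating systems and error cases):

#!/bin/sh
# Ask OPAM which external Debian packages this OCaml package needs,
# then install them with APT before attempting the OPAM build.
deps=$(opam install -e debian sqlite3-ocaml)
if [ -n "$deps" ]; then
  sudo apt-get install -y $deps
fi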

Portable shell scripts

Once we’ve gotten past the hurdle of the compiler version causing failures, there is the small matter of testing non-Linux operating systems, as well as non-x86 CPU architectures. The #1029 overview lists many of these failures under the Portability section.

Damien Doligez made some excellent points about how to write portable Makefiles that work across both GNU and BSD makes. This is why the OCaml compiler’s carefully crafted Makefiles do not require GNU make to be installed when compiling on FreeBSD or OpenBSD (MacOS X gave up the fight a long time ago and installs GNU make as its default make command).

OPAM tries to help the BSDs out by providing a make macro in opam files that is substituted with either "make" (by default) or "gmake" (for the BSDs). While this works for the toplevel invocation of the Makefile, it fails if the Makefile recursively invokes further targets by calling the make command directly instead of going through the $(MAKE) variable. Patching these sorts of things is easy but tedious: see the patchfile for the Facile constraint programming library for an example.
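
The shape of such a patch is usually a one-line change to the recursive invocation; schematically (this is not the actual Facile patch, and the target names are made up):

# before: hard-codes the command name, so running "gmake" at the top level
# still spawns the system (BSD) make for the subdirectory
all:
	cd src && make byte

# after: $(MAKE) expands to whichever make invoked this Makefile
all:
	cd src && $(MAKE) byte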

The real problem here, of course, is that package maintainers cannot reasonably be expected to test their code on systems that they don’t normally use. If we demanded perfect portability before accepting packages into the main OPAM repository, we wouldn’t get any submissions!

OCamlot automates this nicely, though: it finds lots of portability bugs automatically, and maintainers are by and large very responsive when we report the problems upstream.

The emerging distributed workflow

The big drawback to OCamlot in its current form is the amount of triage effort it puts on the OPAM maintainers. The package database has now exceeded 500 packages in just a few short months, and has over 1500 unique versions that all need build testing and more accurate constraints. The wider community has been really keen to help with triage (just look at all the other people who leapt in on bug #1029), so it’s our immediate priority to make OCamlot more transparent for people who want to use it to improve their own packages, and in the future also to use it to test various hypotheses about all the available open-source OCaml code (see Jacques’ experiment with monomorphic let as an example of something that can benefit from wider automated compilation).

I’ll talk more about how we’re solving this in my upcoming OCaml 2013 Workshop talk about the Platform. I don’t want to spoil it too much, but it involves a lovely distributed Git workflow, an enhanced opam2web, and a brand new metadata overlay system for OPAM that lets us enhance the package database with extra information such as statistics, portability and test results, without polluting the main Git repository with all this extra non-essential data.

If you’re really curious to know right now, then you can see the outline of the new system at Amir’s new ocaml.org wireframes blog post, where Part III contains the continuous integration workflow. A lot of infrastructure work has gone into building all of this over the summer, and now it’s all starting to be deployed in a very satisfying way…

Sun, 08 Sep 2013
