What I want to build
What will be the VisiCalc—the killer app—for big data? It will be a tool that makes it intuitive to dig into richly-structured data, to ask questions and test scenarios, and to navigate the vast real-world datasets that are now available. It will become as widespread as Microsoft Excel, and it will be the tool of choice for people in business, tech, science, journalism, and government, whenever they want to communicate quantitative ideas.
Data is being generated globally at a rate which roughly tracks Moore's Law. I think many companies don't know what to do with the bulk of their data, but they're scared of being left behind, so they simply store whatever fits in the latest generation of hardware. Only a high priesthood of data scientists, in Google and Facebook and some universities, can extract deep value from what they collect.
From 2011 to 2016 I worked as the chief data scientist at Urban Engines, a Silicon Valley startup, now part of Google. We built a platform for visualising and analysing big data about things that move: commuters, trains, buses, taxis, delivery fleets, etc. We learnt a lot about how ordinary users want to interact with their data, how to help users understand richly structured data, and what sorts of interaction make sense for what sorts of data.
I returned to academic life in order to turn these fledgling ideas into a systematic theory, and to build a new tool for interacting with big data that any Excel-savvy user can use.
What's the big idea?
Behind all the dashboards we built at Urban Engines, I believe there is a unifying principle: a universal grammar of interactive data visualisation.
I've worked with database gurus and coders, with practical statisticians, and with physics-style modellers in engineering and mathematics departments. Each community of data workers has its special tricks and conceptual tools, and they also have plenty in common. I believe that most of their tools can be presented as ways to interact with a data visualisation. In other words, I want to make data tangible, so that users can uncover its secrets by touching what they see.
There are many lovely interactive visualisations all over the web. But I haven't seen one that unlocks the full richness of a dataset together with all the tools I might want to apply: mashups with external datasets, custom comparisons, and so on. All of that is restricted to command-line power users running R or Python. Instead, current designs for interacting with data are based on what the designer wants to show, rather than what the viewer wants to find out. When we rethink data interaction, it will have to be more "honest": I call it true interaction.
In 1786, William Playfair invented the bar chart and the line graph in his book The Commercial and Political Atlas (the pie chart followed in his 1801 Statistical Breviary). Two centuries later, in 1999, Leland Wilkinson proposed a universal grammar for plotting data in his landmark book The Grammar of Graphics. His remarkable idea, that nearly all data graphics can be expressed by composing a small number of data operations, has shaped modern graphics libraries. This is the intellectual backdrop for what I want to build, and it gives hope that the goal is achievable.
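To make Wilkinson's idea concrete, here is a minimal sketch of my own (a toy illustration, not Wilkinson's notation or any real library's API): a chart is a composition of a few small parts, in this case data, an aesthetic mapping, a statistical transform, and a geometric mark, and different chart types fall out of swapping just one part.

```python
# Toy grammar-of-graphics sketch: a chart as a composition of small parts.
from dataclasses import dataclass
from collections import Counter

@dataclass
class Chart:
    data: list        # rows as dicts
    mapping: dict     # aesthetic name -> column name
    stat: str = "identity"  # statistical transform to apply before drawing
    geom: str = "point"     # geometric mark used to render the result

    def resolve(self):
        """Apply the stat, returning the (x, y) values the geom would draw."""
        if self.stat == "identity":
            return [(row[self.mapping["x"]], row[self.mapping["y"]])
                    for row in self.data]
        if self.stat == "count":  # e.g. a bar chart of category counts
            counts = Counter(row[self.mapping["x"]] for row in self.data)
            return sorted(counts.items())
        raise ValueError(f"unknown stat: {self.stat}")

rows = [{"city": "SF", "trips": 3},
        {"city": "NY", "trips": 5},
        {"city": "SF", "trips": 2}]

# Same grammar, two charts: swap the stat and geom, keep everything else.
scatter = Chart(rows, {"x": "city", "y": "trips"}, stat="identity", geom="point")
bars = Chart(rows, {"x": "city"}, stat="count", geom="bar")
```

The point of the sketch is that "scatter plot" and "bar chart" are not separate chart types baked into the tool; they are two compositions over the same small vocabulary, which is exactly what makes the grammar a plausible foundation for interaction as well as for drawing.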
Advances in data-driven modelling have always followed advances in computing power. Now we are in the era of big data, cloud computing, and machine learning, and these will let us model and test and depict our insights directly from data, rather than through clever mathematical tricks and expert statistical theory. This is how I believe data science will be opened up to a wide community. We need to refocus today's research algorithms and systems, and build them into robust tools oriented around interaction, visualisation, and the suggestion of insight.