My take on what the ATI Data Science mission is: to create a unified approach to evidence-based modeling. i.e. data science, like the physical sciences, life sciences etc., is in the business of meta-models - once you have a specific model, you have an engineering problem. So finding Newtonian Mechanics, Relativity, Thermodynamics, Maxwell's equations, Quantum Mechanics, circulatory and metabolic models and so on - that's natural sciences - finding things like graph theory, relations, Bayesian model inference, neural nets, PCA, k-means, GAs, generalized stochastic hill climbers, topological data models, etc., that's the business we are in. Along the way, intermediate tricks like anomaly detection, dimensionality reduction, etc., are also needed.
Sometimes the new general model emerges from studying a particular instance, i.e. some data where current models aren't good enough, and sometimes figuring out an algorithm to fit the model efficiently to real data is a real challenge in itself too.
Computing systems at scale are the basis for much of the excitement over Data Science, but many challenges remain: to keep addressing ever larger amounts of data, and also to provide tools and techniques, implemented in robust software, that are usable by statisticians and machine learning experts without them having to become experts in cloud computing themselves. This vision of distributed computing only really works for "embarrassingly parallel" scenarios. The challenge for the research community is to build systems to support more complex models and algorithms that do not so easily partition into independent chunks, and to give answers in near real-time, efficiently, on a data centre of a given size.
Users want to integrate different tools (for example, R on Spark); they don't want to have to program for fault tolerance, yet as their tasks & data grow, they will have to manage this; meanwhile, data science workloads don't resemble traditional computer science batch or single-user interactive models. These systems put novel requirements on data centre networking, operating systems, storage systems, databases, and programming languages and runtimes. As a communications systems researcher for 30 years, I am also interested in specific areas that involve networks, whether as technologies (the Internet, Transportation etc.), or as observed phenomena (Social Media), or in the abstract (graphs).
See blog entry for two simple student project ideas
Inferencing, using probabilistic programming (e.g. Anglican), over temporal encounter graphs.
Talking to a guy who studies mosquitos, Prof. Austin Burt at Imperial, who said it is possible quite easily to identify the particular mosquito that carries Zika from its characteristic sound (wing whine etc.) - I think (but we'd need to do some quick test) that most smart phone microphones would very easily be good enough to do this... (copying colleagues who know about this) and we could potentially build an app for people to monitor for the presence of said mosquitos - as with the flu tracking app, we could also measure the encounters between people (and mosquitos, with some likely error) and see if we can figure out some of the parameters of the Zika spreading process... so that's the encounter rates sorted. (new data:)
so there's two populations with different SIR parameters (in the simplest possible model) - actually, of course, all members of even a single population have varying values for S, I and R (and there are models with more parameters) so it isn't so simple - for example, we don't even know if recovery imparts immunity (in humans), plus there's a relatively low human->human infection rate, as well as the human->mosquito and, obviously, mosquito->human infection rates...
we also don't know what the distribution of encounters/bites/infections is like (it might need multiple exposures to get infected) so the simple chain of encounter=bite=infected (if susceptible) isn't enough. Nice problem space - noisy, but probably useful things can be extracted that would indicate (for example) what the impact of reducing the mosquito population by some fraction would be (even to the point where we break the epidemic process).
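To make the "simplest possible model" above concrete, here is a rough sketch in Python: humans follow S->I->R, mosquitos follow S->I with infected ones dying and being replaced by susceptible ones. Everything here is an assumption for illustration, not something measured: the function name simulate_vector_sir, the rate parameters (beta_mh, beta_hm, gamma_h, mu_m), the mosquito_ratio knob standing in for "reduce the mosquito population by some fraction", and the simplifications that recovery gives immunity and there is no human->human infection.

```python
def simulate_vector_sir(beta_mh, beta_hm, gamma_h, mu_m,
                        mosquito_ratio=1.0, days=365, dt=0.1, i_h0=0.001):
    """Euler-integrate a two-population model; all state values are fractions.

    beta_mh: mosquito->human infection rate (hypothetical value, per day)
    beta_hm: human->mosquito infection rate (hypothetical value, per day)
    gamma_h: human recovery rate; mu_m: infected mosquito death rate
    mosquito_ratio: scales mosquito->human pressure (1.0 = no reduction)
    """
    s_h, i_h, r_h = 1.0 - i_h0, i_h0, 0.0   # humans: S, I, R
    s_m, i_m = 1.0, 0.0                     # mosquitos: S, I only
    for _ in range(round(days / dt)):
        new_h = mosquito_ratio * beta_mh * s_h * i_m  # newly infected humans
        new_m = beta_hm * s_m * i_h                   # newly infected mosquitos
        rec_h = gamma_h * i_h                         # humans recovering
        die_m = mu_m * i_m                            # infected mosquitos replaced
        s_h -= dt * new_h
        i_h += dt * (new_h - rec_h)
        r_h += dt * rec_h
        s_m += dt * (die_m - new_m)
        i_m += dt * (new_m - die_m)
    return {"s_h": s_h, "i_h": i_h, "r_h": r_h, "s_m": s_m, "i_m": i_m}


if __name__ == "__main__":
    # Made-up rates, chosen only so that the baseline run has an epidemic.
    baseline = simulate_vector_sir(0.3, 0.3, 0.1, 0.1)
    reduced = simulate_vector_sir(0.3, 0.3, 0.1, 0.1, mosquito_ratio=0.1)
    print("fraction ever infected, baseline:", baseline["r_h"])
    print("fraction ever infected, 90% fewer mosquitos:", reduced["r_h"])
```

In this toy version the epidemic only takes off when mosquito_ratio * beta_mh * beta_hm exceeds gamma_h * mu_m, which is the sort of threshold the app data might help locate. Multiple-exposure infection and per-individual variation would need a stochastic, individual-level model instead.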
Background - Analytics and Targeted Adverts with privacy
We'd like to infer things about peoples' preferences for goods & services in an environment where the data is no longer in the cloud at some large data center, but is kept on personal cloud systems (we call these databoxes, and have an EU and an EPSRC project that have built prototypes of them, c.f. UCN), plus the Hat startup/project is trying to create this new version of the old two-sided market hat exchange.
so there's work on parallelising (hierarchical) Bayes inferencing - e.g. see Newcastle work, but this is an extreme version where we decentralise (not just multicore, or distributed in a low-latency data center, but spread over wide area) - so this is pushing the bounds of performance for such systems (and exchange of data between components) - so we'll need to look at hierarchical aggregation, and compression of interim data exchanges in the model update phases....and maybe figure out how to do the initial training in a similar way...
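A very small sketch of the hierarchical aggregation idea, in Python. It assumes the easiest possible case, a conjugate Beta-Bernoulli model (say, "did this person respond to this kind of advert?"), where each databox only needs to ship two counts and the counts can be summed up a tree. The names Counts, local_summary, aggregate and posterior_mean are made up for illustration, and this is nowhere near the full hierarchical Bayes setting, where the interim messages would be richer and would need the compression mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Sufficient statistics for a Bernoulli model: the only data leaving a box."""
    successes: int = 0
    trials: int = 0

    def __add__(self, other):
        return Counts(self.successes + other.successes,
                      self.trials + other.trials)


def local_summary(observations):
    """Runs on one databox: reduce raw 0/1 observations to two counts."""
    return Counts(sum(observations), len(observations))


def aggregate(summaries, fan_in=2):
    """Combine summaries level by level, as a tree of aggregators would."""
    level = list(summaries)
    if not level:
        return Counts()
    while len(level) > 1:
        level = [sum(level[i:i + fan_in], Counts())
                 for i in range(0, len(level), fan_in)]
    return level[0]


def posterior_mean(counts, prior_a=1.0, prior_b=1.0):
    """Mean of the Beta posterior given a Beta(prior_a, prior_b) prior."""
    return (prior_a + counts.successes) / (prior_a + prior_b + counts.trials)


if __name__ == "__main__":
    boxes = [[1, 0, 1], [0, 0], [1, 1, 1, 0]]   # raw data never leaves a box
    total = aggregate(local_summary(b) for b in boxes)
    print(total, posterior_mean(total))
```

Because addition is associative, the tree shape (fan_in) changes the number of wide-area messages per aggregator but not the answer, which is what makes this case easy; non-conjugate models lose that property.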
Peer review and metrics like citation counts
Take longitudinal data from conference review systems such as HotCRP, running for a few years, and use it to do:
Result would be a profile of reviewers and a calibration of them against each other, over time.
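One naive way to get a first "profile and calibration" is sketched below in Python: per-reviewer mean and spread of scores, then each score re-expressed relative to that reviewer's own habits. This is only an assumed starting point, not the analysis intended above; the (reviewer, paper, score) tuple layout and the name calibrate are hypothetical and are not HotCRP's export format.

```python
from collections import defaultdict
from statistics import mean, pstdev


def calibrate(reviews):
    """reviews: iterable of (reviewer, paper, score) tuples.

    Returns (profile, calibrated) where profile maps each reviewer to
    (mean score, population std dev) and calibrated lists
    (reviewer, paper, z-score relative to that reviewer's own scores).
    """
    reviews = list(reviews)
    scores_by_reviewer = defaultdict(list)
    for reviewer, _paper, score in reviews:
        scores_by_reviewer[reviewer].append(score)

    profile = {r: (mean(s), pstdev(s)) for r, s in scores_by_reviewer.items()}

    calibrated = []
    for reviewer, paper, score in reviews:
        mu, sigma = profile[reviewer]
        # A reviewer who always gives the same score carries no ranking signal.
        z = (score - mu) / sigma if sigma > 0 else 0.0
        calibrated.append((reviewer, paper, z))
    return profile, calibrated


if __name__ == "__main__":
    toy = [("a", "p1", 1), ("a", "p2", 3), ("b", "p1", 4), ("b", "p2", 5)]
    profile, calibrated = calibrate(toy)
    print(profile)
    print(calibrated)
```

In the toy data reviewer "a" is harsh and "b" is generous, but after calibration both rank p2 above p1 by the same margin. A real version would need to account for reviewers seeing different papers, and for drift over the years.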
Multigraph/type/dimensional partition edges but with -ve weights too
most graphs are naively viewed as homogeneous. most could easily be partitioned by edge types - in fact this is often done in an ad hoc way (e.g. on Facebook, thresholding edges between "friends" who haven't communicated for some period), but also for k-clique algorithms, based on some notion of connectedness or centrality or many other things
In reality we can also look at more details of info flow on edges, and at declared node data (over time) and see if we can extract things like sentiment (e.g. comments on a node might indicate enmity, not friendship), and also common gender, age etc etc....
so we have a multigraph, and we have a (likely) Zipf distribution of degree in each sub-graph (and overall) - how do the overall graph stats reflect those of its constituent parts as the graph is grown (or shrunk, going back through time)? if we include negative edge weights the right way, do elements of the subgraphs (e.g. typical friendship group size) stabilize (to Dunbar's number in this case)? can we make a nice model for all this?
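As a starting point for playing with this, a minimal Python sketch of the bookkeeping: split a typed, signed edge list into one layer per edge type, get the degree histogram of each layer, and sum signed weights per node. The (u, v, kind, weight) tuple layout and the function names are assumptions for illustration; it says nothing yet about what "the right way" to include negative weights is.

```python
from collections import Counter, defaultdict


def partition_by_type(edges):
    """Split (u, v, kind, weight) edges into one edge list per kind."""
    layers = defaultdict(list)
    for u, v, kind, weight in edges:
        layers[kind].append((u, v, weight))
    return dict(layers)


def degree_counts(layer_edges):
    """Unweighted degree of every node within one layer."""
    degree = Counter()
    for u, v, _weight in layer_edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def degree_histogram(layer_edges):
    """Map degree -> number of nodes with that degree (to eyeball Zipf-ness)."""
    return Counter(degree_counts(layer_edges).values())


def signed_strength(edges):
    """Sum of signed edge weights per node across all layers."""
    strength = Counter()
    for u, v, _kind, weight in edges:
        strength[u] += weight
        strength[v] += weight
    return strength


if __name__ == "__main__":
    edges = [("a", "b", "friend", 1.0), ("a", "c", "friend", 1.0),
             ("b", "c", "comment", -0.5), ("a", "b", "comment", 0.5)]
    for kind, layer in partition_by_type(edges).items():
        print(kind, dict(degree_histogram(layer)))
    print(dict(signed_strength(edges)))
```

Running the same histograms on snapshots of the edge list at different times would give the grown/shrunk comparison asked about above.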
A longer term topic of interest to me & I hope to the FCA would be to figure out how to make a network of companies resilient to failures of parts of the network - e.g. how does the current Brazilian turbulence impact on UK business and how would you add capacity (relationships/edges in the graph) to make the system more resilient - a topical version of this (too late, I suspect) would be to actually have a model of Brexit....
we had work on resilience in communications networks (see "Finding critical regions and region-disjoint paths in a network" in ACM/IEEE joint transactions on networks - Critical Regions and Resilience) which looked at geographic failure impact on topologies....same technique might, curiously, apply directly on the company data if the FCA had it in the right form....but the challenge would be that the scale would be several orders of magnitude larger, and the data a lot noisier...and there are dynamic versions of the problem that would be interesting to tackle
See also Sanjeev Goyal's work on contagion in nets from a financial economist viewpoint.
can we analyze a lot of traces of internet activity and determine what is machine-to-machine, and what is human-to-machine or human-to-human?
One assumes there are simple signatures, like timing (aka "the essence of all true comedy")....so web browse/read/think/browse cycles ought to be obvious (as is even simple typing, although masked by modern tools like ssh)....
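A toy version of the timing signature, sketched in Python: take the gaps between events in a trace and look at how variable they are. Periodic machine-to-machine polling should have nearly constant gaps, while browse/read/think cycles should be bursty. The use of the coefficient of variation, the function names and the 0.1 threshold are all assumptions for illustration, not validated against real traces.

```python
from statistics import mean, pstdev


def interarrival_cv(timestamps):
    """Coefficient of variation (std dev / mean) of gaps between events."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three events to measure variability")
    average_gap = mean(gaps)
    return pstdev(gaps) / average_gap if average_gap > 0 else 0.0


def looks_machine_driven(timestamps, cv_threshold=0.1):
    """Guess 'machine' when the gaps are almost perfectly regular.

    cv_threshold is an arbitrary illustrative cut-off.
    """
    return interarrival_cv(timestamps) < cv_threshold


if __name__ == "__main__":
    polling = [0, 10, 20, 30, 40]        # e.g. a device checking in every 10s
    browsing = [0, 1, 2, 40, 41, 120]    # bursts of clicks, then reading time
    print(looks_machine_driven(polling), looks_machine_driven(browsing))
```

Machines that add jitter, or humans driving a very regular tool, would defeat this, which is one reason to look for other signatures as well.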
are there other signatures?
are there anti-sigs? (someone's streaming to a TV/tablet, but not in the room, or the window's closed, etc.)
of course, on the internet, no-one knows if you're a god....but maybe we can do better