Jon's very drafty ATI list of Systems Problems

My take on what the ATI Data Science mission is: to create a unified approach to evidence-based modeling. That is, data science, like the physical sciences, life sciences, etc., is in the business of meta-models: once you have a specific model, you have an engineering problem. So finding Newtonian mechanics, relativity, thermodynamics, Maxwell's equations, quantum mechanics, circulatory and metabolic models and so on, that's the natural sciences; finding things like graph theory, relations, Bayesian model inference, neural nets, PCA, k-means, GAs, generalized stochastic hill climbers, topological data models, etc., that's the business we are in. Along the way, intermediate tricks like anomaly detection, dimensionality reduction, etc., are also needed.

Sometimes the new general model emerges from studying a particular instance, i.e. some data where current models aren't good enough; and sometimes figuring out an algorithm to fit the model efficiently to real data is a real challenge in itself.

Computing systems at scale are the basis for much of the excitement over Data Science, but many challenges remain: to keep addressing ever larger amounts of data, and also to provide tools and techniques, implemented in robust software, that are usable by statisticians and machine learning experts without their having to become experts in cloud computing. This vision of distributed computing only really works for "embarrassingly parallel" scenarios. The challenge for the research community is to build systems that support more complex models and algorithms that do not partition so easily into independent chunks, and that give answers in near real-time, efficiently, on a data centre of a given size.
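To make the "embarrassingly parallel" distinction concrete, here is a minimal sketch in Python. Everything in it (the function names, the toy data, the use of a thread pool as a stand-in for a cluster) is an assumption made for illustration, not a description of any particular system. The mean partitions cleanly, because per-chunk sums and counts can be combined afterwards; the median does not, because combining per-chunk medians generally gives the wrong answer.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def chunks(data, n):
    """Split data into at most n contiguous chunks of roughly equal size."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_count(chunk):
    """Work done independently on one chunk: no communication needed."""
    return sum(chunk), len(chunk)


def parallel_mean(data, workers=4):
    """Embarrassingly parallel: combine independent per-chunk results."""
    parts = chunks(data, workers)
    # The thread pool is only a stand-in for independent workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(partial_sum_count, parts))
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    return total / count


def naive_chunked_median(data, workers=4):
    """NOT correct in general: a median of per-chunk medians."""
    parts = chunks(data, workers)
    return median(median(p) for p in parts)


# Hypothetical toy data, chosen so the chunks look very different.
data = [1, 2, 3, 4, 100, 101, 102, 103, 5, 6, 7, 8]

print(parallel_mean(data), sum(data) / len(data))  # the two agree
print(naive_chunked_median(data), median(data))    # the two differ
```

Getting the median right needs either all the data in one place or several rounds of coordination between workers, which is the kind of non-partitionable structure the paragraph above has in mind.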

Users want to integrate different tools (for example, R on Spark). They don't want to have to program for fault tolerance, yet as their tasks and data grow, they will have to manage it. Meanwhile, data science workloads don't resemble traditional computer science batch or single-user interactive models. These workloads put novel requirements on data centre networking, operating systems, storage systems, databases, and programming languages and runtimes. As a communications systems researcher for 30 years, I am also interested in specific areas that involve networks, whether as technologies (the Internet, transportation, etc.), as observed phenomena (social media), or in the abstract (graphs).