Infrastructure, programming models, frameworks
1- Introduction
In the context of Data-Science (https://en.wikipedia.org/wiki/Data_science , http://statweb.stanford.edu/~tibs/ElemStatLearn/), large-scale systems must address numerous issues, among them data provenance, analysis, and preservation. Computing models are also numerous, among them the cloud family (private, public, hybrid), the cluster family (including federations of clusters, which we call grids), and, soon, computing models for Internet of Things (IoT) platforms.
2- A service-oriented view of Data-Science
To introduce the different issues of Data-Science from a systems perspective, we can draw an analogy with the following categories of cloud computing services:
- SaaS (Software as a Service): sometimes referred to as "on-demand software". In the context of Data-Science and data analysis software, it may consist in providing end users with data mining tools, algorithms, analytics suites… All these tools are available through a Web browser (a sketch of a SaaS-style call is given after this list).
- PaaS (Platform as a Service): it allows users to develop, run, and manage applications through a Web browser, without the complexity of building and maintaining the infrastructure. In the context of Data-Science, it provides end users with platforms to build their own data analytics applications, or to extend an existing suite, without any knowledge of the underlying physical architecture;
- IaaS (Infrastructure as a Service): in the context of Data-Science, but not only there, it provides a set of virtualized resources (servers, processors…) that developers can assemble to run analytics applications or to store data.
- NaaS (Network as a Service): it describes services for network transport connectivity. In the context of Data-Science, it may concern the provision of Virtual Private Networks (VPNs) that enable a host computer to send and receive data across shared or public networks with the functionalities and policies of a private network.
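To make the SaaS level concrete, the sketch below submits a small dataset to a hosted clustering service over HTTP. It is a minimal Python sketch: the endpoint URL, the payload schema, and the response format are all hypothetical stand-ins for whatever API a real offering exposes.

    # Minimal sketch of consuming a Data-Science SaaS offering.
    # Endpoint, payload schema and response format are hypothetical.
    import requests

    DATA = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]

    # Submit the dataset to a (hypothetical) hosted clustering service.
    response = requests.post(
        "https://analytics.example.com/api/v1/cluster",  # hypothetical URL
        json={"algorithm": "kmeans", "n_clusters": 2, "data": DATA},
        timeout=30,
    )
    response.raise_for_status()

    # The service is assumed to return one cluster label per input point.
    print(response.json()["labels"])  # e.g. [0, 0, 1, 1]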
Typical examples of frameworks that can be offered as a service (with some effort to cloudify them) are Apache Hadoop and Mahout, SciDB, ClowdFlows, Spark, Flink, TensorFlow, BigML, Splunk Hunk… (add hyperlinks to each of these terms)
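As an illustration of such a framework, and assuming a working Spark installation with the pyspark package, a minimal job computing word frequencies could look as follows ("input.txt" is a placeholder path):

    # Minimal PySpark sketch: word frequencies in a text file.
    # Assumes pyspark is installed; "input.txt" is a placeholder path.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    counts = (
        spark.sparkContext.textFile("input.txt")   # read the lines
        .flatMap(lambda line: line.split())        # split lines into words
        .map(lambda word: (word, 1))               # emit (word, 1) pairs
        .reduceByKey(lambda a, b: a + b)           # sum the counts per word
    )
    print(counts.take(10))  # first ten (word, count) pairs
    spark.stop()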
3- Models of tasks and jobs
In the different disciplines of e-Science we find some generic vocabulary to talk about tasks (an elementary unit of work) and jobs (the composition of many tasks):
- Tasks:
- Single-task application. For Data-Science it may concern supervised or unsupervised classification, clustering, association rule discovery…
- Parameter-sweeping application. For Data-Science it may concern the analysis of a dataset over multiple instances of the same classification algorithm, each run with different parameter values;
- Workflow-based application. For Data-Science it may concern the discovery of certain knowledge, where the discovery application is specified as a graph linking data sources, discovery tools, and data outputs (a sketch of these three task models is given after this list).
- Jobs:
- HPC (High-Performance Computing): using many computing resources over short periods of time to complete a single, tightly coupled job;
- HTC (High-Throughput Computing, https://en.wikipedia.org/wiki/High-throughput_computing): using many computing resources over long periods of time to complete large numbers of independent tasks;
- MTC (Many-Task Computing, https://en.wikipedia.org/wiki/Many-task_computing): bridging HPC and HTC by completing very large numbers of tasks, possibly coupled and data-intensive, over short periods of time.
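To make the three task models concrete, the sketch below uses scikit-learn (an arbitrary choice; any analytics library would do): a single classification task, a parameter sweep over several instances of the same algorithm, and a small linear workflow linking a preprocessing step to a discovery tool.

    # Sketch of the three task models, using scikit-learn (assumed available).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    # 1. Single-task application: one supervised classification run.
    single = SVC(C=1.0).fit(X, y)

    # 2. Parameter-sweeping application: the same algorithm analyzed
    #    over multiple parameter instances, here through a grid search.
    sweep = GridSearchCV(SVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=3).fit(X, y)
    print(sweep.best_params_)

    # 3. Workflow-based application: a (linear) graph linking a data
    #    transformation step to a discovery tool.
    workflow = Pipeline([("scale", StandardScaler()), ("svm", SVC())]).fit(X, y)
    print(workflow.score(X, y))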
According to the types of tasks and jobs, computer scientists design appropriate programming models.
4- Programming models; software engineering issues
New abstract programming models need to be considered for the Data-Science field. A research effort is needed to develop scalable, adaptive, general-purpose models, as well as models for the coordination of codes and for data integration. Standardized formats for data and data exchange are also required, and APIs are needed to support cooperation between data producers. A scalable programming model must include mechanisms for the following (a toy illustration follows the list):
- parallel data access
- data processing and exchange on limited groups of cores
- near-data synchronization in order to avoid overheads caused by the synchronization protocol
- in-memory querying to boost performance
- locality-based data selection and classification to boost performance
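As a toy illustration of the first and last of these mechanisms, the sketch below partitions an in-memory dataset and lets each worker process access only its own contiguous partition; it relies only on the Python standard library and is a didactic model, not a scalable runtime.

    # Toy illustration: parallel access to partitioned, in-memory data.
    # Standard library only; a real runtime would add near-data
    # synchronization and locality-aware data placement.
    from multiprocessing import Pool

    N_WORKERS = 4

    def process_partition(partition):
        # Each worker touches only its own partition and computes a
        # partial result without any global synchronization.
        return sum(x * x for x in partition)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        # One contiguous in-memory partition per worker (locality).
        size = len(data) // N_WORKERS
        partitions = [data[i * size:(i + 1) * size] for i in range(N_WORKERS)]
        with Pool(N_WORKERS) as pool:
            partials = pool.map(process_partition, partitions)  # parallel access
        print(sum(partials))  # single synchronization point, at the very end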
The landscape of dedicated programming models and languages for Data-Science is narrow compared to the landscape of innovative programming languages for high-performance computing. It is not easy to explain why 'HPC programming' models like X10, SHMEM, ECL, Swift (add hyperlinks to each of these terms) are not really used for Data-Science. Evolutionary models are promising (https://en.wikipedia.org/wiki/Evolutionary_programming). Data-Science software is probably developed with more traditional approaches to solve the above-mentioned issues, rather than according to a new model created especially for the identified problems.
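To give an idea of what an evolutionary model looks like in practice, the toy sketch below evolves a real-valued vector towards a target through Gaussian mutation and truncation selection; the population size, mutation scale, and fitness function are arbitrary illustrative choices.

    # Toy evolutionary programming loop: mutate, evaluate, select.
    # Population size, mutation scale and generation count are arbitrary.
    import random

    TARGET = [3.0, -1.0, 2.5]

    def fitness(individual):
        # Negative squared error to the target: higher is better.
        return -sum((a - b) ** 2 for a, b in zip(individual, TARGET))

    population = [[random.uniform(-5, 5) for _ in TARGET] for _ in range(20)]

    for generation in range(100):
        # Each parent produces one offspring by Gaussian mutation.
        offspring = [[g + random.gauss(0, 0.3) for g in ind] for ind in population]
        # Keep the fittest half of parents + offspring ((mu + lambda) selection).
        population = sorted(population + offspring, key=fitness, reverse=True)[:20]

    print(population[0], fitness(population[0]))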
5- Conclusion
All these considerations about Data-Science (task and job models, architectural and programming models) concern distributed models. New ways to efficiently compose different distributed models and paradigms are required, in such a way that interactions between hardware resources and programming levels are addressed. For instance, one interesting question is to analyze why a given big-data framework does or does not need a specific programming model. Another is to 'measure' how difficult the task of deploying a framework inside another framework is. A third is to study why a certain framework scales. Scalability is a key feature for data analysis and knowledge discovery; otherwise the framework will not exploit all the available resources, which are becoming ever more important and more widely used by professionals (RightScale 2015 STATE OF THE CLOUD REPORT <ref name="scale">http://www.rightscale.com/lp/2015-state-of-the-cloud-report</ref>).
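One simple tool for reasoning about why a certain framework scales, or stops scaling, is Amdahl's law; the sketch below computes the theoretical speedup of an application whose parallel fraction is assumed, purely for illustration, to be 95%.

    # Amdahl's law: theoretical speedup for a given parallel fraction.
    # The 0.95 parallel fraction is an illustrative assumption.
    def amdahl_speedup(parallel_fraction, n_workers):
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, round(amdahl_speedup(0.95, n), 2))
    # The speedup flattens near 1 / (1 - 0.95) = 20, no matter how many
    # resources are added: scaling requires shrinking the serial fraction.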