D1.1 Technology survey: Prospective and challenges - Revised version (2018)
4 ICT based systems for monitoring, control and decision support
4.4 Cloud Architecture – Datacenters and in-memory computing
Another challenging issue is to provide real-time analysis of shared and distributed data. While most real-time processing engines including Spark [Zaharia, 2012] and S4 [Neumeyer, 2010] can efficiently benefit of the un-debatable performance of in-memory processing, they don’t consider the data management during data processing (i.e. where to store the intermediate temporary data) or dependencies in-between processed data, which are common in environmental applications. We aim to explore the trade-off between fast in-memory processing, data management in-between applications, and the network latency.
Map Reduce paradigm, widely used programming model for handling large data sets in distributed systems, has some shortcomings when thinking about iterative processes: machine learning iterative steps or iterative queries on the same dataset loaded from multiple sources [Engle, 2012]. Spark has been developed to solve the problems that Map Reduce has with working sets, while providing similar capabilities in terms of scalability and fault tolerance. Spark is based on the resilient distributed datasets (RDDs) abstraction. RDDs represent a collection of records, read-only and partitioned, built using deterministic steps or other RDDs and they "know" the information on how they were derived. The users may specify the partitioning of the RDDs and the persistence (if the RDD should be kept in-memory). The benefits of the RDDs are:
- RDDs are read-only: allow fault tolerance without the overhead of check pointing and roll- back (the state may be restored on nodes in parallel without reverting the program execution);
- Are appropriate to run backup tasks: don’t have to handle concurrent updates;
- Use data locality for scheduling in bulk operations (doesn’t handle conflicts - no updates);
- If there is not enough memory, the RDDs are stored on the disk. In [Engle, 2012] it is described an implementation of Spark on Hive - Shark. Shark is a data warehouse implementation based on RDDs, with the following improvements: exploit inter-query and intra- query temporal locality (machine learning algorithms), exploit the cluster’s main memory using RDDs.
GraphX computation system [Xin, 2015] extends the Spark RDDs abstraction to RDG (Resilient Distributed Graph) to distribute efficiently graph data in a database system and uses Scala integration with Spark to allow users to process massive graphs. RDGs support graph operations on top of a fault-tolerant, interactive platform, provided by Spark; represent RDGs in an efficient tabular model. RDGs were used as a base to implement PowerGraph and Pregel frameworks.
The main idea for in-memory computing is to keep data in a distributed main memory near to the application code, ready to be processed. This approach appeared over 20 years ago, but the main memory was very expensive, and also there was no motivation to implement an in-memory computing framework. The drop in RAM costs and increasing need for real-time processing of big-data represented an incentive for this model to be developed. The data is stored in an in-memory database, and the processing is performed in the platform layer, a distributed in-memory database system.
In-memory storage and query solutions are in-memory databases (IMDBs) and in-memory data grids (IMDGs). IMDBs move the data to be queried in the main memory. There are native IMDBs (HANA or Altibase) or traditional DBs with in-memory extensions (Oracle). For IMDGs, the data may be processed in a distributed system of commodity servers, using the Map Reduce framework. An important point is the difference between in-memory computing and in-memory databases and data grids. In-memory computing is a paradigm that deals with computing, too and takes into account scheduling tasks and deciding whether to move the data near the code or the code near the data; in contrast to the data solutions that deal only with data. In-memory data solutions can be used as building blocks for an in-memory computing solution.
Cloud-based Applications
In water management information systems a very important challenge is to be able to provide reliable real time estimation of the degree of water pollution. Sometimes professional software that simulates pollutant transport (such as DHI’s MIKE11) is not available for various reasons. The focus of article [Ciolofan, 2017] was the design and implementation of system able to accurately assess the concentration of the pollutant at any points along a river in respect to a given pollution scenario. This system reuses historical offline data resulted from previous executions of MIKE11 software. The pollution scenario is determined by a set of user specified input variables (chainage, pollutant concentration, discharged volume, type of pollutant, etc). In order to compute the result the authors used multivariate interpolation. The validation of the system was done using data from a real use case on Dîmboviţa river. The obtained results have a mean percentage error less than 1.3%. To efficiently cope with millions of records, the computing intensive application was deployed on Jelastic Cloud in order to take the advantage of on demand elastic RAM and CPU resources.