Project Proposal

Processing And Storage of Time series (PAST)

MOTIVATION & OVERVIEW

A time series is a collection of data points obtained from sequential measurements over time. Nowadays, we are swimming in time series data as the number of digital sources of information grows rapidly. Time series analysis is a complex problem, mainly due to the high dimensionality of time series data, but it is essential for a wide range of real-life problems in various fields of research. A particular example is the spatiotemporal data collected from instruments and simulations, which is extremely valuable for a variety of scientific domains. With the increased computational power of today's high-performance machines, the resolution of simulations over both temporal and spatial scales can be very high. At the same time, scientific instruments are also collecting data with increasing granularity. As a result, scientists face massive datasets that are hard to manage, process and analyze. Therefore, there is a need for tools that support efficient time series storage and retrieval, and for mining algorithms that scale to these massively increasing datasets.

Although there is a variety of algorithms for mining time series in the literature, most of them cannot handle time series datasets of terabyte-plus sizes. At the same time, there is a lack of tools that can efficiently address the needs of time series data and queries. Analyzing time series in an ad hoc fashion, e.g. via text files that are read into MATLAB or R, is an inefficient solution that severely limits scalability. Traditional relational databases are not an appropriate framework either, as the data mining operations commonly applied to time series do not resemble traditional database operations. Some examples of more appropriate tools designed specifically for time series handling are RRDtool [1], OpenTSDB [2], and iSAX [3].

The goal of this project is to create specialised open-source solutions for storing, retrieving and mining large sets of time series data in an efficient and systematic manner. Each data entry of a time series consists of a timestamp associated with one or multiple values. In other words, we aim at building a system optimized for storing and querying numeric time series data. In order to do so, we will examine the characteristic properties of time series data as well as the key requirements of the queries and the mining algorithms that are applied on top of this data, focusing on the use case of scientific spatiotemporal simulation data.

One specific characteristic of time series data is that it is append-only: there are no arbitrary updates. Therefore, we plan to design an offline framework that supports append-only modifications. We also do not need sophisticated concurrency control, as new values are appended to each time series independently.

Regarding the operations applied on time series data, the simplest and most typical query is a time range query that retrieves all the values between two timestamps. Preprocessing (e.g. resampling to a fixed rate, interpolating missing data, normalisation) is also commonly applied to time series. It may also be desirable to store data at multiple resolutions. In many cases, raw data may be too large or too messy for many of the analysis techniques to work. There is a variety of representations that can be extracted from time series to facilitate the methods that work on top of them and make analysis more tractable. Some of these representations can also serve as a foundation for indexing the time series. Lastly, there is the need to compress the data so that it is stored as compactly as possible.
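To make this more concrete, the sketch below shows one of the preprocessing steps mentioned above, z-normalisation, for a single in-memory series. The object and method names are illustrative only and do not reflect a final design.

// Minimal sketch of z-normalisation for one time series, assuming the series
// fits in memory as an array of (timestamp, value) pairs.
object Preprocessing {
  // Rescales the values to zero mean and unit standard deviation,
  // leaving the timestamps untouched.
  def zNormalize(series: Array[(Long, Double)]): Array[(Long, Double)] = {
    val values = series.map(_._2)
    val mean = values.sum / values.length
    val std = math.sqrt(values.map(v => (v - mean) * (v - mean)).sum / values.length)
    if (std == 0.0) series.map { case (t, _) => (t, 0.0) }
    else series.map { case (t, v) => (t, (v - mean) / std) }
  }
}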

Our framework will make it possible to implement a variety of data mining tasks on top of it, such as querying time series by content, clustering, classification, segmentation, prediction, or anomaly detection. We envision that, within our framework, time series analysis will be more efficient and also more easily expressed, so that a practitioner can use the framework with little effort.

In the following subsections we define the different aspects of this project in more detail.

TECHNICAL GOALS & METHODS

FRAMEWORK AND TIME SERIES DATASTORE

The goal is to design an efficient way to arrange and process time series data. This task is mainly about designing the high-level architecture of the system; compression and indexing discussed subsequently will also be incorporated in the framework eventually.

API

The main contribution of our project will be an intuitive and effective way to work with time series. For this reason, a well-thought-out and well-designed API is needed. The API will interact with the storage back end and take advantage of the existing representations seamlessly, in order to provide the computation primitives needed to write time series related algorithms.

For example, one will be able to express queries like "what is the Fourier transform for the time range 0 to 3 seconds". In this case, if a Fourier transform representation already exists for the queried time series, it will be used; otherwise, the framework will retrieve all values in the range and apply the transform.
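To give a flavour of what such an API could look like in Scala, here is a hypothetical sketch; the names (TimeSeries, range, fourierTransform) are placeholders and not part of any concrete design yet.

// Hypothetical API sketch; all names are placeholders, not a final design.
trait TimeSeries {
  // Retrieve all (timestamp, value) pairs between two timestamps (inclusive).
  def range(from: Long, to: Long): Seq[(Long, Double)]

  // Return the Fourier transform of the values in the given time range.
  // If a precomputed Fourier representation exists for this series it is
  // reused; otherwise the raw values are fetched and transformed on the fly.
  def fourierTransform(from: Long, to: Long): Seq[Double]
}

// Example usage: "what is the Fourier transform for the time range 0 to 3 seconds"
// (assuming millisecond timestamps):
//   val spectrum = series.fourierTransform(0L, 3000L)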

Frameworks

In addition to the data representation, the core part of the system is the processing framework. It is very important to use a framework that efficiently processes, stores and manages the data. GraphLab, Hadoop and Spark are only some of the latest frameworks specifically designed to work with big data. Before deciding on the concrete framework implementation, it is worth doing extensive research on existing platforms.

Even though Hadoop is most commonly associated with MapReduce and big data, we found it not suitable for our application due to its long batch processing times and lack of support for real-time queries.

GraphLab is another state-of-the-art distributed big data processing system. It is not the most suitable choice in our case, as it is graph-based and scientific time series data do not necessarily have a natural graph representation.

In particular, we believe that Spark might be the right framework to build our application upon. Compared to Hadoop, Spark has a far better response time and is designed for iterative machine learning algorithms and for providing results in real time. Spark also has a well-defined programming interface that simplifies most operations over a dataset.

Spark already provides predefined methods for iterating over data, for applying a function to each element (mapping), and for collecting all elements back to the driver program. It will be up to us to write the driver program that defines the workflow of our application and efficiently launches operations over our dataset.

Even though the concrete entry points to our system are yet to be defined, it is most convenient for the user to provide an existing file containing the data to be processed by our framework. This is another good reason to use Spark, because it has built-in support for loading data from a single file and creating RDD objects out of the contained data. Of course, the programmer has control over the exact construction of the RDD object, which has to be defined according to the desired representation of the data.
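As a rough sketch of what this could look like with Spark's existing API (assuming a plain text file with one "timestamp,value" pair per line; the file path, application name and parsing logic are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch of loading a time series from a text file into an RDD.
object LoadSeries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PAST-load"))

    val series = sc.textFile("hdfs:///data/simulation/temperature.csv")
      .map { line =>
        val Array(ts, value) = line.split(",")   // assumes well-formed lines
        (ts.toLong, value.toDouble)
      }

    // Example: a simple time range query expressed directly on the RDD.
    val window = series.filter { case (ts, _) => ts >= 0L && ts <= 3000L }
    println(window.count())

    sc.stop()
  }
}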

Spark's iterative processing model also makes it well suited for running machine learning algorithms over time series.

Since Spark is built in Scala, Scala might be the language in which to proceed with further development.

Time series datastore

Due to the specific and known characteristics of time series, we can greatly optimize their storage. Those characteristics are the following: each entry is composed of a single numeric key (the timestamp) and one or more associated values (e.g. (timestamp, heat), (timestamp, price), ...). Moreover, the timestamps are ordered, and this order will not change.

We also know what kind of operations will be performed on those time series. After insertion there will be no deletion and no updates (this would make little sense: an entry in a time series datastore represents something in the past, which we neither can nor want to change), but information retrieval and insertion need to be optimized. We also know that new data is more likely to be appended at the end of a time series rather than inserted in the middle. We therefore consider the data to be static once inserted in the system.

For our application it makes little sense to use a standard row store architecture where each entry is stored as a tuple. Instead, we can use a column store [6], with one column representing the key (the timestamps of the time series) and one or more columns representing the value(s). With this technique, each kind of data is isolated from the others, which helps with compression and other optimizations. For example, the time series keys, which are ordered numeric values, could be represented by a start and an end timestamp, with each value in between stored as a delta relative to the previous value.
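A minimal sketch of this idea for the timestamp column (names are illustrative):

// Sketch of a delta-encoded timestamp column: the first timestamp is stored
// explicitly and each subsequent timestamp as the difference to its predecessor.
// Assumes a non-empty column sorted in increasing order.
object TimestampColumn {
  def encode(timestamps: Array[Long]): (Long, Array[Long]) = {
    val deltas = timestamps.tail.zip(timestamps.init).map { case (next, prev) => next - prev }
    (timestamps.head, deltas)
  }

  def decode(start: Long, deltas: Array[Long]): Array[Long] =
    deltas.scanLeft(start)(_ + _)
}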

An important operation for a time series mining system is the ability to perform range queries between two points in time. With the aforementioned representation these types of queries would be efficient, since we only have to locate the positions of the lower and upper ends of the range and then read all the data in between in one batch.
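A minimal sketch of such a range query over sorted, uncompressed timestamp and value columns, using binary search to locate the two ends of the range (names are illustrative):

import java.util.Arrays

// Sketch of a range query: binary search locates the ends of the range,
// then the matching slice of the columns is read in one batch.
object RangeQuery {
  // Index of the first timestamp >= key (timestamps are sorted and unique).
  private def lowerBound(timestamps: Array[Long], key: Long): Int = {
    val i = Arrays.binarySearch(timestamps, key)
    if (i >= 0) i else -(i + 1)
  }

  def query(timestamps: Array[Long], values: Array[Double],
            from: Long, to: Long): Array[(Long, Double)] = {
    val lo = lowerBound(timestamps, from)
    val hi = lowerBound(timestamps, to + 1)   // first index strictly after 'to'
    timestamps.slice(lo, hi).zip(values.slice(lo, hi))
  }
}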

If we need to preprocess the data (sampling, for example) we could still use this technique, but we would need compression if we aim at keeping multiple copies of the time series with different levels of information on disk.

COMPRESSION

Compressing time series data is fundamental not only for the obvious reduction in storage size but also for improving performance, because less data needs to be read from and written to disk. The compression scheme should allow for high decompression speeds, as the data will be frequently accessed for answering queries, and it should not compromise the ability to mine the data. As it is hard to choose an aggregation upfront without knowing what kind of analysis will be applied to the time series, lossless compression appears to be the better option, since it eliminates the risk of dropping data points that are essential for a particular analysis. The goal is to develop a compression methodology that exploits the domain-specific characteristics of simulation datasets (spatial and/or temporal neighbors have very similar values), in order to achieve a better trade-off between compression overhead and space, and to preserve the fundamental characteristics of the series.
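As an illustration, one possible lossless building block that exploits the similarity of temporal neighbours is to delta-encode an integer-valued column (e.g. counts, or the timestamps themselves) and pack the resulting small deltas into variable-length bytes. The sketch below shows only one candidate scheme, not a final design; decoding would simply reverse the two steps.

import scala.collection.mutable.ArrayBuffer

// Sketch of delta + zig-zag + variable-length byte encoding for an
// integer-valued column whose neighbouring values are very similar.
object DeltaVarint {
  private def zigZag(v: Long): Long = (v << 1) ^ (v >> 63)   // small signed -> small unsigned

  def encode(column: Array[Long]): Array[Byte] = {
    val out = new ArrayBuffer[Byte]()
    var prev = 0L
    for (v <- column) {
      var u = zigZag(v - prev)   // small deltas become small non-negative numbers
      prev = v
      // varint: 7 payload bits per byte, high bit marks continuation
      while ((u & ~0x7FL) != 0L) {
        out += ((u & 0x7FL) | 0x80L).toByte
        u >>>= 7
      }
      out += u.toByte
    }
    out.toArray
  }
}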

TIME SERIES REPRESENTATION

We are considering storing the time series at multiple resolutions. A lower resolution can be useful for approximate queries and visualization. Also, many time series data mining approaches do not operate on raw time series data; to keep analyses computationally tractable, they rely on higher-level representations of the data. The main motivation for representations is thus to emphasize the essential characteristics of the data in a concise way.

There are many time series representations; most of them fall into the following categories:

In non-data adaptive representations, the parameters of the transformation remain the same for every time series regardless of its nature. [5]

In data adaptive representations, the parameters of a transformation are modified depending on the data available. By adding a data-sensitive selection step, almost all non-data adaptive methods can become data adaptive. [5]

Model based representations are based on the assumption that the observed time series has been produced by an underlying model. The goal is thus to find the parameters of such a model and use them as the representation. [5]

Our goal is to review the existing representation schemes and then choose some of them for which we will provide an implementation. Based on a specific representation, specialized indexes can be built to accelerate queries. 
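As an example of a non-data adaptive scheme we may implement, the Piecewise Aggregate Approximation (PAA) reduces a series to the means of equal-sized segments. A minimal sketch, assuming the series length is divisible by the number of segments:

// Sketch of Piecewise Aggregate Approximation (PAA): the series is split into
// equal-sized segments and each segment is replaced by its mean value.
object PAA {
  def transform(series: Array[Double], segments: Int): Array[Double] = {
    val segmentLength = series.length / segments   // assumes divisibility
    series.grouped(segmentLength)
          .map(segment => segment.sum / segment.length)
          .toArray
  }
}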

INDEXING

An indexing scheme that imposes minimal space overhead should be employed in order to efficiently manage and quickly query ever-growing massive datasets. An index can be created either on top of the raw time series or after a dimensionality reduction technique (like the ones mentioned in the previous paragraph) has been applied.

A state-of-the-art indexing approach for time series is iSAX (indexable Symbolic Aggregate approXimation) [3]. It is an index based on a symbolic representation that essentially quantises portions of the time series. iSAX allows fast exact search or ultra-fast approximate search, and can support indexing of massive datasets.
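To illustrate the symbolic representation that iSAX builds upon, the sketch below maps a z-normalised series to a SAX word over a 4-letter alphabet, using the standard breakpoints that split the standard normal distribution into equiprobable regions. It repeats the PAA step inline for self-containment and is a simplification, not the full iSAX index.

// Simplified sketch of SAX symbolisation: a z-normalised series is reduced
// with PAA (segment means) and each mean is mapped to a symbol according to
// the standard breakpoints for an alphabet of size 4.
object SAX {
  private val breakpoints = Array(-0.6745, 0.0, 0.6745)
  private val alphabet = "abcd"

  def toWord(normalizedSeries: Array[Double], segments: Int): String = {
    val segmentLength = normalizedSeries.length / segments   // assumes divisibility
    normalizedSeries.grouped(segmentLength)
      .map(segment => segment.sum / segment.length)          // PAA step
      .map(mean => alphabet(breakpoints.count(_ <= mean)))   // quantise to a symbol
      .mkString
  }
}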

Our goal is to build upon existing indexing structures and develop a new indexing technique based on the domain-specific data characteristics which we will incorporate in our framework.

CLUSTERING

Clustering, i.e. discovering meaningful partitions of the data based on a measure of similarity, is a critical step in scientific data analysis. We plan to implement a clustering algorithm on top of our framework to demonstrate how a classic data mining task can be expressed within it.

Given the original raw time series or a representation scheme, the first step is to define an appropriate distance between time series. Then, we propose to cluster the time series using the affinity propagation algorithm [4], which views each data point as a node in a network and finds a good set of exemplars, as well as the corresponding clusters, by passing messages between the nodes. We plan to explore whether it is possible to implement a parallel version of affinity propagation on top of our framework.
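As an illustration of the first step, a simple Euclidean distance between two equal-length series (raw or in a reduced representation) could look as follows; other measures such as Dynamic Time Warping could be substituted.

// Sketch of a simple Euclidean distance between two equal-length series;
// in practice this could be computed on raw values or on a reduced
// representation such as PAA.
object Distance {
  def euclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "series must have equal length")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }
}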

Lastly, clustering can even be used as a preprocessing step within the framework to improve the compression ratio, as it discovers similarities that can be exploited, while the exemplars themselves store compressed information.

SOFTWARE

As we are almost certain that we are going to use Spark, we will implement our framework in JVM languages, such as Scala and Java.

RESOURCES

During the first phase of development we will develop the framework on local machines. Once we have a stable system, we will require a small cluster in order to test the framework on a distributed system. Since we are developing a framework meant to process large amounts of data, it should work in a distributed setting. It will be important to test our clustering and indexing algorithms with respect to such a system.

The exact number of machines and the type of hardware needed will be defined in the following days, once the details of the system are clear.

DATA ACQUISITION

We need time series datasets for the experimental study of the performance of our system. We have access to some time series data from EPFL's Brain Project that we can possibly use. We have also found some data sources online (e.g. http://www.cs.ucr.edu/~eamonn/time_series_data/) and we are still investigating other potential sources of time series data. We could also collaborate with other teams that are dealing with time series data in their project. If we do not manage to obtain real datasets that are big enough, we could synthetically produce big datasets from the smaller real ones in a systematic way (e.g. by interpolating the values in the time series).
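A possible way to do such a systematic enlargement (a sketch, assuming sorted timestamps and series with at least two points; names are illustrative) is to upsample each real series by linear interpolation:

// Sketch of synthetic upsampling by linear interpolation: between each pair of
// consecutive points, 'factor - 1' additional points are generated, enlarging a
// real series while preserving its overall shape.
object Upsample {
  def linear(series: Array[(Long, Double)], factor: Int): Array[(Long, Double)] = {
    val filled = series.sliding(2).flatMap { case Array((t0, v0), (t1, v1)) =>
      (0 until factor).map { i =>
        val frac = i.toDouble / factor
        (t0 + ((t1 - t0) * frac).toLong, v0 + (v1 - v0) * frac)
      }
    }.toArray
    filled :+ series.last
  }
}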

RISKS

As this is an open-ended research project, the main risk is that coming up with viable and successful ideas may be more complicated than anticipated. As a result, it may be hard to follow the proposed timeline.

TEAM COMPOSITION 

The team consists of the following 9 people, who are experienced in programming and systems design but have varying levels of knowledge in machine learning: Eric Beguet, Saurabh Jain, Nikolaos Kokolakis, Jasmina Malicevic, Thomas Mühlematter, Puneet Sharma, Nicolas Tran, Mihaela Turcu, and Eleni Tzirita Zacharatou.

TIMELINE

Deliverable: (Semi-)stable API for the framework; preprocessing and transformations on time series; command line utility in order to be able to use and test the framework.
Involved team members: Nicolas Tran, Thomas Mühlematter, Mihaela Turcu
Deadline: 11/04/2014

Deliverable: Time series storage using dummy compression & clustering interfaces; time series storage - Spark integration; time series querying without indexes.
Involved team members: Nikolaos Kokolakis, Eric Beguet
Deadline: 11/04/2014

Deliverable: Indexes
Involved team members: Eleni Tzirita Zacharatou, Nikolaos Kokolakis, Jasmina Malicevic
Deadline: 02/05/2014

Deliverable: Compression
Involved team members: Puneet Sharma, Saurabh Jain
Deadline: 02/05/2014

Deliverable: Clustering
Involved team members: Mihaela Turcu, Jasmina Malicevic
Deadline: 12/05/2014

Deliverable: Implementation of an application on top of our framework as well as on top of an existing library; determining parameters for comparison & evaluation experiments.
Involved team members: Nicolas Tran, Thomas Mühlematter, Jasmina Malicevic

The first two milestones are on the critical path of our project. If the involved members are making slow progress, we will assign more members to take over part of the work.

If this happens, we may ultimately drop the clustering milestone so that the overall load is evenly distributed among the members of the team.

TASK ASSIGNMENTS

Implementation also includes writing tests and documentation.

 

ADMINISTRATIVE

There is an EPFL group (S13478) that contains all the project’s members and a wiki (https://wiki.epfl.ch/bigdata-project-timeseries).

There is a GitHub organization called 'PASTdb' and a Git repository at https://github.com/pastdb/PAST.

REFERENCES

[1] RRDtool: http://www.mrtg.org/rrdtool/

[2] OpenTSDB: http://opentsdb.net/

[3] Shieh, J., Keogh, E.: iSAX: Indexing and Mining Terabyte Sized Time Series. In: Proc. of ACM SIGKDD (2008)

[4] Affinity propagation: http://en.wikipedia.org/wiki/Affinity_propagation

[5] https://s3-us-west-2.amazonaws.com/mlsurveys/98.pdf

[6] Stonebraker, M., et al.: C-Store: A Column-oriented DBMS. In: Proc. of VLDB (2005). http://people.csail.mit.edu/tdanford/6830papers/stonebraker-cstore.pdf