Project description

Bitcoin price prediction and trading simulation through time series and sentiment analysis

 

 

The volume of Bitcoin transactions has grown sharply over the last few months, drawing considerable interest in this cryptocurrency. We propose to build a framework capable of predicting the evolution of the Bitcoin market and simulating different trading strategies. The main differences between a Bitcoin market and conventional stock exchanges are its high volatility and the fact that transactions are not instantaneous (a transaction can take up to 5 minutes to complete).

 

Time series analysis

 

Financial models for price prediction have been widely studied on “conventional” markets and could be applied to the Bitcoin market.

The main goal of time series analysis (in all fields, not only financial markets) is forecasting. This is done mainly by separating the series into components such as seasonality, trend, and slow/fast variation, and by spotting anomalies and "removing" them.

 

We plan to try different models, such as single/double moving averages, ARMA models, and Kalman filters, apply them to past market values, and simulate the prediction of future values.
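
As an illustration of the simplest of these models, the sketch below forecasts the next value with a single moving average and with a double moving average (a level plus a linear trend estimated from two successive averages). The function names, the window size, and the sample prices are hypothetical and chosen only for the example; this is not the project's implementation.

```python
def moving_average(values, window):
    """Trailing moving average: each entry averages the last `window` values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def single_ma_forecast(prices, window):
    """Forecast the next price as the average of the last `window` prices."""
    return moving_average(prices, window)[-1]


def double_ma_forecast(prices, window, horizon=1):
    """Forecast `horizon` steps ahead from a level and a linear trend.

    Needs window >= 2 and at least 2 * window - 1 prices.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    first = moving_average(prices, window)    # average of the prices
    second = moving_average(first, window)    # average of the averages
    level = 2 * first[-1] - second[-1]
    trend = 2 / (window - 1) * (first[-1] - second[-1])
    return level + trend * horizon


# Hypothetical, perfectly linear prices: 0, 1, ..., 9.
sample_prices = [float(t) for t in range(10)]
single_next = single_ma_forecast(sample_prices, 3)
double_next = double_ma_forecast(sample_prices, 3)
```

On a perfectly linear series the double moving average recovers the trend exactly, whereas the single moving average lags behind it, which is one reason to compare both on past market values.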

 

We will first train the model on historical data, for which we have a complete history of transactions since August 14th, 2011 (~500MB file, stored on HDFS as a CSV file). Once the model has been chosen and trained, we will periodically re-train the system on the newly crawled values, at a frequency we have not defined yet. To maintain scalability, we will automatically prune the oldest transactions so that the data remains of manageable size. For the live crawling of Bitcoin transactions, we have already written a Scala program that automatically crawls new transactions (the transaction rate can vary, but on average we expect fewer than 1 transaction per second).
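
The pruning step could look like the following minimal sketch, which keeps only the transactions that fall inside a fixed time window. The class name, the tuple layout, and the window length are assumptions made for the example, and the sketch assumes transactions arrive in non-decreasing timestamp order.

```python
from collections import deque


class TransactionWindow:
    """Keeps only the transactions newer than `max_age_seconds`."""

    def __init__(self, max_age_seconds):
        self.max_age_seconds = max_age_seconds
        self.transactions = deque()  # (timestamp, price, amount), oldest first

    def add(self, timestamp, price, amount):
        """Append a new transaction, then drop everything that is too old."""
        self.transactions.append((timestamp, price, amount))
        cutoff = timestamp - self.max_age_seconds
        while self.transactions and self.transactions[0][0] < cutoff:
            self.transactions.popleft()

    def timestamps(self):
        return [t for t, _, _ in self.transactions]


# Hypothetical usage with a 10-second window.
window = TransactionWindow(max_age_seconds=10)
for ts in (0, 5, 11, 16):
    window.add(ts, price=100.0, amount=1.0)
```

Because both ends of a deque are cheap to update, each new transaction is handled in amortized constant time, which is comfortable at the expected rate of less than one transaction per second.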

 

Sentiment analysis

 

In order to enhance our predictions, we plan to use sentiment analysis, whose aim is to determine the attitude or polarity of a document.

 

We plan to apply natural language processing and text analysis to tweets, which are currently very active about the mood of the Bitcoin market. We can filter tweets by major newspapers (The New York Times, The Guardian, etc.) so that we get only relevant information rather than individual opinions about the Bitcoin market. Sentiment analysis applications usually compute a polarity score (from -1: very negative to +1: very positive) for an input sentence, so what we want to produce is a simple file on HDFS in which each entry contains 1) the day and 2) the averaged polarity for that day. Scalability is not a big concern here because we need only 2 bytes (a short) for the day column (which can cover more than 60'000 days when read as an unsigned value) and only 1 byte for the polarity (scaled to the signed byte range, -128 to 127). Each entry is therefore 3 bytes; if we compute the polarity for each day over 3 years, the result fits in a file of less than 10KB.
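
The 3-byte entry described above can be sketched with fixed-width binary records. The record layout (big-endian, unsigned 16-bit day index followed by a signed byte), the choice of scaling the polarity to [-127, 127] so that the range is symmetric, and the function names are all assumptions made for this example.

```python
import struct

# ">Hb": big-endian, unsigned 16-bit day index, signed 8-bit polarity.
RECORD = struct.Struct(">Hb")


def encode_entry(day_index, polarity):
    """Pack a day index and a polarity in [-1, 1] into 3 bytes."""
    if not 0 <= day_index <= 0xFFFF:
        raise ValueError("day_index must fit in an unsigned 16-bit integer")
    clamped = max(-1.0, min(1.0, polarity))
    return RECORD.pack(day_index, round(clamped * 127))


def decode_entry(data):
    """Unpack 3 bytes into (day_index, polarity); polarity is approximate."""
    day_index, scaled = RECORD.unpack(data)
    return day_index, scaled / 127


# Size of three years of daily entries, in bytes.
three_years_size = RECORD.size * 3 * 365
```

With this layout the quantization error on the polarity is at most 1/254, and three years of daily entries take 3285 bytes, well under the 10KB bound.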

 

The website Archive.org provides monthly archives of tweets. For each month, we can get all the tweets (which weigh about 30 to 50 GB on average). We intend to download these files, upload them one by one to the cluster, and preprocess them (remove any line that does not mention Bitcoin). For this task, we are going to use a very simple MapReduce job.
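
The map phase of such a job can be sketched as a line filter. The sketch assumes that each input line is one JSON object carrying the tweet body in a "text" field, and that matching the keywords "bitcoin", "bitcoins" and "btc" is an acceptable definition of "mentions Bitcoin"; both are assumptions, and the function names are hypothetical.

```python
import json
import re

# Assumed keyword list; matches "Bitcoin", "#bitcoin", "bitcoins", "BTC", ...
KEYWORDS = re.compile(r"\b(bitcoins?|btc)\b", re.IGNORECASE)


def mentions_bitcoin(line):
    """Return True if the line is a JSON tweet whose text mentions Bitcoin."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return False  # skip malformed lines instead of failing the job
    if not isinstance(tweet, dict):
        return False
    text = tweet.get("text")
    return isinstance(text, str) and KEYWORDS.search(text) is not None


def filter_bitcoin_lines(lines):
    """Map step: yield, unchanged, only the lines that mention Bitcoin."""
    for line in lines:
        if mentions_bitcoin(line):
            yield line.rstrip("\n")
```

No reduce step is needed for this preprocessing: feeding the standard input through filter_bitcoin_lines and printing each result would be enough for a streaming-style mapper.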

 

As sentiment analysis frameworks are already widely deployed, we will first try some open-source implementations of sentiment algorithms (e.g. TextBlob, GATE, …). If these frameworks do not give good results (either in terms of accuracy or running time), we could try implementing our own version of a model.

 

The main question we want to answer with sentiment analysis is whether it is a valid method of evaluating/predicting the evolution of Bitcoin's value. There is no doubt a correlation between the "mood" of the documents and the prices, but we would like to show that it is not only the market driving the news (for example, a drop in price would generate negative media coverage afterwards). In order to predict the trend of the market's mood, we will also apply time-series models to the sentiment data.
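
One rough way to look at which of the two series leads the other is to compare correlations at different lags. The sketch below computes the Pearson correlation between the sentiment on day t and the price return on day t + lag; the function names and the toy data are hypothetical, and a high lagged correlation is only an indication, not a proof, that sentiment drives the market.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlation(sentiment, returns, lag):
    """Correlate sentiment[t] with returns[t + lag].

    lag > 0: sentiment is compared with later returns (sentiment leads).
    lag < 0: sentiment is compared with earlier returns (market leads).
    """
    if lag >= 0:
        xs = sentiment[:len(sentiment) - lag]
        ys = returns[lag:]
    else:
        xs = sentiment[-lag:]
        ys = returns[:len(returns) + lag]
    n = min(len(xs), len(ys))
    return pearson(xs[:n], ys[:n])


# Toy data in which returns simply repeat the sentiment one day later.
toy_sentiment = [1, -1, 1, 1, -1, -1, 1, -1]
toy_returns = [0] + toy_sentiment[:-1]
sentiment_leads = lagged_correlation(toy_sentiment, toy_returns, 1)
market_leads = lagged_correlation(toy_sentiment, toy_returns, -1)
```

If the correlation at positive lags is clearly higher than at negative lags, as in the toy data, that would support the claim that the news is not merely following the market.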

 

Web Front-end

 

A web front-end will show real-time Bitcoin exchange trends in a graph that illustrates our predictions and the current market mood. It will also make it possible to visualize the gain over a chosen time span for a fictitious initial investment.
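
The gain computation behind such a view can be sketched as a replay of trading signals over past prices. The signal vocabulary ("buy", "sell", "hold"), the all-in/all-out behaviour, and the absence of fees and of transaction delays are simplifying assumptions made for this example.

```python
def simulate_gain(prices, signals, initial_cash):
    """Replay signals over prices and return the final gain (or loss).

    "buy" converts all cash into coins, "sell" converts all coins into cash,
    and any other signal leaves the position unchanged.
    """
    if not prices or len(prices) != len(signals):
        raise ValueError("prices and signals must be non-empty and equally long")
    cash, coins = float(initial_cash), 0.0
    for price, signal in zip(prices, signals):
        if signal == "buy" and cash > 0:
            coins, cash = cash / price, 0.0
        elif signal == "sell" and coins > 0:
            cash, coins = coins * price, 0.0
    # Value any coins still held at the last known price.
    final_value = cash + coins * prices[-1]
    return final_value - initial_cash


# Hypothetical prices and signals for a three-day span.
example_gain = simulate_gain([100.0, 200.0, 150.0], ["buy", "sell", "hold"], 1000.0)
```

Running the same price series through several signal lists gives a direct comparison of trading strategies for the chosen time span.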

 

Several data visualization tools are accessible on the web for free (D3, Visual.ly, Quadrigram, etc.).

 

We will need to crawl 2 different kinds of data: Bitcoin transaction data and news website data. The crawlers will be stored on, and launched from, the cluster provided by the course staff.

We will have 500GB of storage, which is more than enough for our project.

 

Since many models have been widely studied for price prediction on financial data, we think that these models should work reasonably well, even if Bitcoin markets might not have exactly the same properties as “conventional” markets.

 

The use of sentiment analysis in this field is more recent, but it has already been studied for more than 10 years. It has shown positive results on conventional financial markets, so we expect it to work on cryptocurrencies as well. (ref.: http://people.csail.mit.edu/azar/wp-content/uploads/2011/09/thesis.pdf)

 

Bitcoin live data :

The website https://www.firebase.com/docs/data/real-time-bitcoin-exchange-rate.html shows the real-time evolution of the Bitcoin price, and the results can conveniently be retrieved in JSON format.

 

Bitcoin historic data :

The website http://api.bitcoincharts.com/v1/csv/ offers historical Bitcoin transaction data for download. We found a complete file containing Bitcoin exchange data from August 2011 until now. This file weighs less than 500MB.
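
As a sketch of how such a file could be reduced to one price per day before model training, the code below computes a daily volume-weighted average price. It assumes that the CSV has no header and three columns in the order Unix timestamp, price, amount; this column layout, the use of UTC days, and the function name are assumptions to be checked against the actual file.

```python
import csv
from collections import defaultdict
from datetime import datetime, timezone


def daily_vwap(lines):
    """Return {ISO day: volume-weighted average price} from CSV text lines.

    Assumed columns (no header): unix timestamp, price, amount.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # day -> [sum(price*amount), sum(amount)]
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        timestamp, price, amount = int(row[0]), float(row[1]), float(row[2])
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
        totals[day][0] += price * amount
        totals[day][1] += amount
    return {
        day: value / volume
        for day, (value, volume) in sorted(totals.items())
        if volume > 0
    }


# Hypothetical rows: two trades on 2011-08-14 and one on 2011-08-15 (UTC).
sample_rows = [
    "1313280000,10.0,1.0",
    "1313283600,12.0,3.0",
    "1313366400,11.0,2.0",
]
sample_vwap = daily_vwap(sample_rows)
```

Since csv.reader accepts any iterable of lines, the same function can be fed an open file object, so the ~500MB file never has to be loaded into memory at once.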

 

Twitter live data :

Twitter4J is a powerful library; we use it in a Scala crawler, which is notified each time a new tweet is created. Filters (author, hashtags, etc.) can of course be applied so that we are only notified about relevant news.

 

Twitter historic data :

As explained, we can find a huge database covering at least 1 year of tweets on Archive.org :

https://archive.org/details/twitterstream (thanks to Maxime Augier for pointing this out)

 

See https://wiki.epfl.ch/bitcoin/timeline

 

We are 2 students in Communication Systems and 3 in Computer Science, with solid skills in optimization, statistics, machine learning and software engineering. We intend to implement our solution in Java, in which we are all proficient, and to use CSS, HTML and JS for the web front-end.

 

Jonathan Cheseaux (Team leader) : sentiment analysis tool - Twitter/Google/newspaper data crawling and preprocessing - GitHub repo setup and project organization - Web front-end (Tweet list, statistics, sentiment correction and retraining based on user interaction)

Ilia Kebets : Web front-end (Graph)

Fabien Schmitt : Implementation of the time series predicting models, algorithm optimization, testing

Igor Vokatch: Data visualization (Graph)

Marzell Camenzind (time series lead) : Bitcoin transaction data fetching, model backtesting - quantitative analysis - responsible for the programming architecture