Motivation
Many countries in the global south are affected by high and especially volatile prices of staple foods and other commodities. These circumstances most heavily impact low-income consumers, small producers and food traders. These groups generally have no direct access to price information, let alone price predictions, that would allow them to plan ahead and thereby at least mitigate the impacts of price volatility. Furthermore, many governments lack sophisticated models to coordinate their interventions in the commodity markets and to distribute their commodities optimally.
Project Goal
The main idea of this project is to find indicators present in social media that help in predicting the prices of basic commodities. In this context we would like to conceive a commodity price and supply prediction framework for developing countries by combining commodity price data from official sources (governments, international organizations such as the World Bank and the IMF) with indicators derived from social media data.
The aim of this project is to pilot the approach for a specific country and thereby, hopefully, to achieve prediction of commodity prices on a regional level. We are currently contemplating either India or Indonesia as the target country, depending on where we can obtain the most concrete data.
Twitter statistics strongly favor Indonesia, where 11.7% of the population are users, compared to only 1.3% in India.
However, it seems that no price information is available for Indonesia, either from the government or from the World Bank. In consequence we will start building the system using data made available by the Ministry of Agriculture of the government of India, which provides monthly and partly even weekly price recordings dating back to 2002 through its website.
An interesting additional feature would be to derive food supply indicators from social media and to find out whether it is possible to detect underlying patterns in a country's food supply that are actually caused by policy decisions or inaction. This could aid the respective government in allocating resources for its substantial interventions in the food market.
Technical Goals and Challenges
To achieve a regional prediction of commodity prices, paired with an additional commodity supply monitoring and prediction tool, we have to address multiple technical challenges.
The first task is to look at the input. For the moment we have decided on two "parameters" to include in the prediction: the data made directly available by governments and international organizations, and the data gathered from the web. Aspects such as models of long-term drivers of food prices and policy decisions by a government can be added once the system is running. For the moment we have to discuss how to access the online data and what exactly we are looking for.
Online data gathering
The data gathered from official sources will probably be used to train a neural network for basic predictions. As this data is temporally sparse, it is essential to collect additional data from the web, e.g. Twitter, Facebook, newspapers and other sites that potentially harbor interesting data. The key is to find out how changes in the online data, for example the number of tweets containing certain keywords, relate to the volatility of the prices of these commodities in the past. An additional challenge could be to find out whether detecting changes in sentiment actually provides more accurate information than an approach based solely on keywords.
In order to get a more complete picture of the online ecosystem, relevant social media sites in the country of choice, India, and their representativeness have to be researched beforehand. On top of that, there may be a language barrier, given that there are 20 major languages spoken in India and not all online content can be expected to be in English. In case the volume of English content, for example tweets, is not sufficient, we will use the free Bing translate API or cheaply crowdsource translations through Amazon's Mechanical Turk.
Relevant types of information
To begin with, we plan to build a system that can handle weekly analysis cycles and extracts different "semantic" kinds of data from the textual input. The data relevant to our project can be subdivided into several categories:
Price information
A way to gather additional data to append to the sequence of price data provided by the Indian government would be to filter content that contains explicit information about prices of a basic commodity, e.g. rice, wheat, eggs, poultry.
To collect this information we would have to select content/posts that contain explicit price information. Possible examples include Instagram pictures of food with a tagged price and the commodity name in the title, or Twitter posts that contain the name and price of a food or dish. (Additionally: process government, NGO and international organization reports to get access to data.) It should be possible to integrate this data with the data made available by governments.
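As a first illustration, the following minimal Python sketch filters posts that explicitly mention both a commodity and a price. The commodity list and the rupee price pattern are illustrative assumptions, not a finished extraction rule.

```python
import re

# Illustrative commodity keywords; the real list would be curated per region.
COMMODITIES = ["rice", "wheat", "egg", "poultry", "onion", "potato"]

# Rough pattern for prices quoted in rupees, e.g. "Rs. 45", "45 INR", "Rs 45/kg".
PRICE_PATTERN = re.compile(r"(?:rs\.?|inr|₹)\s*(\d+(?:\.\d+)?)", re.IGNORECASE)

def extract_price_mentions(post_text):
    """Return (commodity, price) pairs found in a social media post, if any."""
    text = post_text.lower()
    prices = [float(p) for p in PRICE_PATTERN.findall(text)]
    commodities = [c for c in COMMODITIES if c in text]
    if prices and commodities:
        return [(c, p) for c in commodities for p in prices]
    return []

# Example usage
print(extract_price_mentions("Bought 1kg rice for Rs. 52 at the local market today"))
# -> [('rice', 52.0)]
```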
Crowdsourced information
Although crowdsourcing is an effective way to collect data, the technical challenges of ensuring quality (correctness) are immense and very likely beyond the scope of this project. Workers could, for instance, record the price of a different variant or packet size of a product by mistake. Solutions in image processing have been proposed where an algorithm checks the image and automatically extracts the price tag. There already exist companies that use incentives such as airtime top-ups to encourage people on the ground to post and share price information through their system. Premise is a company which essentially does just that. Rather than building our own system, we could integrate their API and use Premise's data as an additional data source. They have trained workers and have made great efforts to ensure the correctness of their data. We believe their data could greatly improve the quality of our predictions. A request has been sent for a trial membership which should give us access to their APIs and data collections.
Sentiment information analysis
Sentiment analysis, or opinion mining, is the use of computer science techniques (mainly machine learning) to extract sentiment information and subjective opinions from data. In our case this will help us to find out how circumstances related to commodity prices affect the overall mood in the population. For example, the sentence "I'm disappointed, because I can't buy any eggs to bake a cake", taken from a tweet or forum post, can give valuable information about the current situation on the food market in general.
Building on a research project by UN Global Pulse on the classification of tweets in Indonesia, it may be possible to link price levels to sentiments in the online community, further deduce whether a region has a surplus of a commodity or is in need of it, and provide corresponding indicators. Challenges include defining what constitutes a change in sentiment and accounting for the bias in online communities; after all, the upper echelons of society still dominate social media, especially in countries where social media penetration is not very high, as is the case in India.
We don't want to build our own sentiment analysis system since the proper implementation, testing and evaluation would require a considerable effort compared with the total project workload.
Instead, we are planning to use some of the already developed solutions and tune them to our needs. One of the most promising projects at the moment is SentiStrength, which was produced as part of the CyberEmotions project. It allows estimating the strength of both positive and negative emotions in short texts, even for the informal language extensively used on social networks. The project description and some initial testing showed that the system is indeed able to detect positive and negative emotions correctly (using two corresponding scales) in relation to different keywords like "rice" and "price", as in the following example: "I really like rice but hate the current price". One of the possible solutions we are considering right now is to select specific keywords, split them into several groups (like "products", "assets"), filter the data for these keywords and then measure and evaluate the sentiment results, taking the keyword group into account. This will help us to distinguish between emotions related to a specific type of commodity and those that relate to current circumstances like surplus or shortage, which usually result in price changes.
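The sketch below illustrates the keyword-group idea. The sentiment scorer is a deliberately trivial stand-in (a tiny word list) so the example runs on its own; in the real pipeline the positive/negative strengths would come from SentiStrength or a comparable tool, and the keyword groups would be curated for India.

```python
# Illustrative keyword groups; the actual lists would be curated for the target country.
KEYWORD_GROUPS = {
    "products": ["rice", "wheat", "egg", "onion"],
    "market":   ["price", "expensive", "cheap", "shortage", "surplus"],
}

# Tiny stand-in lexicon, only to keep the sketch self-contained; in practice the
# scores would come from SentiStrength's positive/negative strength scales.
POSITIVE = {"like", "love", "good", "cheap"}
NEGATIVE = {"hate", "disappointed", "expensive", "shortage"}

def sentiment_scores(text):
    """Very rough (positive, negative) counts; placeholder for a real sentiment tool."""
    words = text.split()
    return sum(w in POSITIVE for w in words), sum(w in NEGATIVE for w in words)

def group_sentiment(tweets):
    """Aggregate positive/negative strength per keyword group over a batch of tweets."""
    totals = {g: {"pos": 0, "neg": 0, "count": 0} for g in KEYWORD_GROUPS}
    for tweet in tweets:
        text = tweet.lower()
        for group, words in KEYWORD_GROUPS.items():
            if any(w in text for w in words):
                pos, neg = sentiment_scores(text)
                totals[group]["pos"] += pos
                totals[group]["neg"] += neg
                totals[group]["count"] += 1
    return totals

print(group_sentiment(["I really like rice but hate the current price"]))
```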
Data sources
As stated in previous paragraphs, this project focuses on India. In a country where only a minority of the population accesses the internet, it is very difficult to find representative data. The data sources were selected based on Alexa's statistics. Alexa is a subsidiary of Amazon.com providing a large range of web traffic statistics and, in particular for this project, a ranking of India's most popular websites.
As expected, the most popular social platforms in India are Facebook, Youtube and Twitter. The surprise near the top of the ranking is Google's blogging platform Blogspot. To diversify the origin of the data, an Indian online supermarket has been selected to collect actual commodity prices (although a supermarket also offers more elaborate products that we won't consider). It is accessible at naturebasket.
Another criterion for selecting these websites is the existence of an appropriate crawling tool; indeed, time won't permit developing our own. Facebook offers FQL (Facebook Query Language), a complete set of queries including fetching public comments. Youtube also provides its own data API. Twitter, specifically, provides various APIs for access to historical and streaming tweets (both discussed in detail below). For other sources, a possibility is to use Common Crawl, a non-profit web crawl whose data is publicly available on Amazon's infrastructure.
Querying Twitter
Geotagged information
Based on those requirements mentioned above, there are two approaches that could be used to collect data from Twitter:
Historical tweets: This kind of data is accessible via the REST API. The querying mechanism includes a keyword (related to the purpose of food price prediction) and a geo-location (latitude, longitude, radius).
Example:
- Keyword: rice
- Location: 37.781157,-122.398720,1mi
The disadvantage of this approach mainly comes from the limit on the number of requests per 15-minute window, which is 180 for a single user token or 450 for an application token.
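A minimal sketch of such a historical query, assuming the tweepy client with placeholder credentials (in tweepy 4.x the call is api.search_tweets instead of api.search), might look like this:

```python
import tweepy

# Placeholder credentials; real keys come from a registered Twitter application.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # back off when hitting the 15-minute window limit

# Search historical tweets about rice near a given city (keyword + geocode).
# The coordinates below are for Delhi and are illustrative only.
results = api.search(q="rice", geocode="28.6139,77.2090,50km", count=100)
for tweet in results:
    print(tweet.created_at, tweet.text)
```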
Streaming tweets: Current tweets, as they happen, can be retrieved using the Streaming API. Unlike the REST API above, the Streaming API does not require any specific keyword in the request, but can be restricted to location parameters alone (a longitude/latitude bounding box).
Example:
- Location: -122.75,36.8,-121.75,37.8, which refers to the San Francisco area
Spark Streaming has an API for using Twitter as a data source in the streaming pipeline.
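Since Spark Streaming's Twitter connector primarily targets Scala/Java, a quick Python-side prototype could instead use tweepy's streaming interface (shown here in its pre-4.x form, with placeholder credentials and an illustrative bounding box roughly covering the Delhi area):

```python
import tweepy

class GeoTweetListener(tweepy.StreamListener):
    """Collect live tweets falling inside a bounding box."""
    def on_status(self, status):
        print(status.created_at, status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. on rate limiting (420).
        return status_code != 420

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

stream = tweepy.Stream(auth=auth, listener=GeoTweetListener())
# Bounding box given as [west_lon, south_lat, east_lon, north_lat];
# the values below roughly cover the Delhi area and are illustrative only.
stream.filter(locations=[76.8, 28.4, 77.4, 28.9])
```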
User data as the bridge
Tweets cannot easily be retrieved from Twitter based on geo-data, especially in developing countries like India where modern digital devices with locating capabilities are not yet widespread. This leads to the idea of gathering data on users in the relevant location (here India), which can then be used to investigate the ecosystem around them, including their tweets, favorites, followings, followers, etc. This would considerably facilitate the collection of tweets that lack geotagged information.
Twitter Data grant
On March 15th we applied for a Twitter data grant and are currently waiting for feedback. If the application is successful, we would receive all data related to a set of keywords dating back to 2010 in one batch to store on our machines, which would greatly simplify the task of finding correlations between changes in tweet patterns and prices.
Linking content to a location
Problem: we have to get tweets by country.
Approaches for extracting information from Twitter:
- In the first step we record the set of relevant coordinates as those of the cities for which price information is recorded in India and create a list of relevant keywords.
- In the second step we repeatedly query for all pairs of coordinates and keywords in order to extract tweets
- Then we analyze the resulting dataset to see how much relevant information can actually be found in tweets, e.g. how many people actually tweet about food and basic commodities
- Since only about 2% of tweets are actually geotagged, we will afterwards try a second approach: we will compile a list of people likely to be located in India by crawling the "followers graph" (starting with a list of local celebrities who are likely followed by many people). For these people we record their hometowns in order to get a list of people by city (see the sketch after this list).
- For all those people, we also query for all their tweets that contain at least one of our keywords of interest
- As we expect to get more tweets by including the second approach, we can again perform some analysis to see how much data of value there is
- Should we come to the conclusion that the amount of valuable data is low, or that the predictors are not good enough, we will move on to other sources of data: crawling news websites, reading reddit comments, etc.
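The sketch below outlines the second approach using tweepy's pre-4.x API: crawl a seed account's followers, look up their profiles in batches, and bucket them by the free-text hometown field. The seed account and city names are placeholders.

```python
import tweepy

def users_by_city(api, seed_screen_name, city_names):
    """Crawl a celebrity's followers and bucket them by self-reported hometown.

    The profile 'location' field is free text, so this is only a rough filter;
    seed_screen_name and city_names are placeholders to be chosen for India.
    """
    buckets = {city: [] for city in city_names}
    follower_ids = api.followers_ids(screen_name=seed_screen_name)
    # users/lookup accepts at most 100 ids per request.
    for i in range(0, len(follower_ids), 100):
        for user in api.lookup_users(user_ids=follower_ids[i:i + 100]):
            location = (user.location or "").lower()
            for city in city_names:
                if city.lower() in location:
                    buckets[city].append(user.screen_name)
    return buckets

# Example usage (assumes an authenticated `api` object as in the earlier sketch):
# print(users_by_city(api, "some_indian_celebrity", ["Delhi", "Mumbai", "Chennai"]))
```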
Processing & analysis
The data collected from the web should first stay separated by source and be clustered according to predefined keywords. We will mainly focus on extracting information from Twitter feeds. In order to do that, the system should further be able to recognize relevant clusters according to the types of information and keywords defined above.
Example Keywords
"can't afford food", "afford rice", "afford food", "can't afford rice", "rice expensive", "buy food", "food expensive", "rice expensive", "rice cheap", "India".
Clustering posts/tweets
Tweets will be classified (via an unsupervised technique such as k-means or GMMs) into the classes corresponding to the types of information defined above, and then into subclasses related to the specific information content of the messages. Example tweets will be used to teach the method what to look for; tagging tweets in this way should be more effective than a plain Twitter keyword search.
"
Spark MLlib supports k-means clustering, one of the most commonly used clustering algorithms that clusters the data points into predfined number of clusters."
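The following is a minimal PySpark sketch of this clustering step. It assumes tweets are available one per line at a hypothetical HDFS path, uses a small illustrative keyword vocabulary as features, and picks an arbitrary number of clusters that would have to be tuned.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

# Hypothetical keyword vocabulary; tweets are turned into binary
# keyword-presence vectors before clustering.
KEYWORDS = ["rice", "wheat", "price", "expensive", "cheap", "shortage", "afford"]

def to_vector(tweet_text):
    text = tweet_text.lower()
    return [1.0 if kw in text else 0.0 for kw in KEYWORDS]

sc = SparkContext(appName="humanitas-clustering")
tweets = sc.textFile("hdfs:///humanitas/tweets.txt")   # assumed input path, one tweet per line
vectors = tweets.map(to_vector).cache()

# Cluster into an (illustrative) number of groups; k must be tuned.
model = KMeans.train(vectors, k=5, maxIterations=20)
for center in model.clusterCenters:
    print(center)
```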
Prediction techniques
A substantial part of the project is to evaluate different modelling approaches to the prediction of food prices and to conceive a model that allows for the inclusion of quantified insights from the online data analysis. We will try several prediction models and algorithms from time series analysis and machine learning and compare their performance, flexibility and scalability to large datasets.
Prediction Problem Formulation
For our prediction problem, we need a clear formulation of the input data and the expected output once we acquire the complete dataset. Input features will be extracted from time series analysis and sentiment metrics. They need to be selected carefully and extensively to form a meaningful representation of the original data. Prediction objectives can be divided into two categories under several time scales (daily, weekly, monthly, or annually): binary classification and regression. The former means predicting future trends (up or down) in the food prices; the latter associates a real number with each time step as a prediction of the food price.
Time Series analysis
The food price data is by nature a temporal sequence of data points, from which time series analysis is able to uncover important characteristics such as autocorrelation or seasonality. One classical time series forecasting approach is to use the ARMA (Auto-Regressive Moving Average) model to predict the target variable as a linear function consisting of an auto-regressive part (lag variables) and a moving average part (effects from recent random shocks). Fitting the model to the historical data can be accomplished by maximum likelihood estimation. Implementations of the ARMA model are found in various statistical libraries; for Python we have StatsModels.
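As a sketch of the time series route, the snippet below fits an ARMA model with StatsModels. The CSV file name, column names and the order (2, 1) are assumptions, and newer statsmodels releases replace the ARMA class used here with statsmodels.tsa.arima.model.ARIMA.

```python
import pandas as pd
from statsmodels.tsa.arima_model import ARMA

# Assumed CSV layout: one column of weekly retail prices indexed by date.
prices = pd.read_csv("rice_prices_delhi.csv", parse_dates=["date"], index_col="date")["price"]

# Fit an ARMA(p, q) model; the order (2, 1) is only a starting point and
# would be selected via AIC/BIC or cross-validation in practice.
model = ARMA(prices, order=(2, 1)).fit(disp=0)
print(model.summary())

# Forecast the next 4 weeks.
forecast, stderr, conf_int = model.forecast(steps=4)
print(forecast)
```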
Machine learning techniques
Support Vector Machine (SVM) is a powerful kernel-based supervised learning algorithm which was first applied to classification problems and later adapted for regression. SVM constructs a maximum-margin hyperplane to separate the data into two groups. The data points used to determine the optimal hyperplane are called support vectors. If the original data set is not linearly separable, SVM can use kernel tricks to project the data into a higher-dimensional feature space without incurring the computation of the explicit mapping. Support Vector Regression (SVR) is very similar to SVM but subject to a different constraint. The time complexity of training a nonlinear SVM is between O(n^2) and O(n^3).
One potential difficulty in applying SVM to our task is the size of our food price and Twitter datasets. As an alternative, we can apply Stochastic Gradient Descent (SGD), which is very efficient in practice on large datasets. Compared to SVM, the primary advantage of SGD is that it iteratively minimizes the objective function by choosing data points randomly instead of taking the whole dataset into account at each step. As a result, the training time is dramatically reduced.
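The snippet below contrasts the two estimators using scikit-learn on random stand-in data. The feature layout (lagged prices plus social media indicators) is only indicative, and some parameter names differ between scikit-learn versions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: each row could hold lagged prices plus social media
# indicators (e.g. tweet volume, sentiment score) for one week. Random data
# here stands in for the real dataset.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 1.5]) + 0.1 * rng.randn(200)

X_scaled = StandardScaler().fit_transform(X)

# Kernel SVR: accurate on small/medium data but O(n^2)-O(n^3) to train.
svr = SVR(kernel="rbf", C=1.0).fit(X_scaled, y)

# SGD-based linear regression: scales to large datasets by updating on
# one sample at a time.
sgd = SGDRegressor(max_iter=1000, tol=1e-3).fit(X_scaled, y)

print(svr.predict(X_scaled[:3]), sgd.predict(X_scaled[:3]))
```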
Artificial Neural Networks
Price prediction using Artificial Neural Networks (ANN) is an emerging field with promising prospects. ANNs are made of layers of interconnected neurons. The first layer is the input layer, followed by one or more hidden layers, and the last is the output layer, which produces the results. Input signals to a neuron are multiplied by the weights on the inter-neuron connections. Each neuron receives signals from the outside or from other neurons, processes them with an activation function, and sends the result to the next neurons. A neural network learns to minimize the output error by modifying the weights on the connections.
Different kinds of neural networks have been successfully used for forecasting financial data series due to their ability to model non-linear dynamical environments. We consider implementing the two most studied of them: feedforward and recurrent neural networks. Feedforward neural networks (a.k.a. multi-layer perceptrons) are the first and simplest type of artificial neural network and are widely applied. They belong to a class of neural networks which propagates all signals toward the output and uses supervised learning. Recurrent neural networks (RNN), on the other hand, have directed cycles in their network structure. They are advantageous for modeling temporal data series because of their internal memory of historical input signals, which makes them very suitable for our food price prediction task. However, the difficulty of training RNNs will be a major technical challenge.
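To make the feedforward mechanics concrete, here is a deliberately small NumPy sketch of a one-hidden-layer network trained by gradient descent on a toy lagged series. A real implementation would rely on an established library and proper validation; the layer size, learning rate and toy data are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(42)

def train_ffnn(X, y, hidden=8, lr=0.01, epochs=2000):
    """Train a minimal one-hidden-layer feedforward network with squared-error loss."""
    W1 = rng.randn(X.shape[1], hidden) * 0.1
    W2 = rng.randn(hidden, 1) * 0.1
    for _ in range(epochs):
        h = np.tanh(X @ W1)          # hidden activations
        pred = h @ W2                # linear output layer
        err = pred - y               # prediction error
        # Backpropagate the squared-error gradient through both layers.
        grad_W2 = h.T @ err / len(X)
        grad_W1 = X.T @ ((err @ W2.T) * (1 - h ** 2)) / len(X)
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1
    return W1, W2

# Toy task: predict the next value of a noisy sine "price" series from 4 lags.
series = np.sin(np.linspace(0, 20, 300)) + 0.05 * rng.randn(300)
X = np.array([series[i:i + 4] for i in range(len(series) - 4)])
y = series[4:].reshape(-1, 1)
W1, W2 = train_ffnn(X, y)
print(np.tanh(X[:3] @ W1) @ W2)      # predictions for the first windows
```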
Computation and storage resources
We need a cluster with Spark support to do data analysis and prediction on the fetched data. India has approximately 33 million Twitter users. If each user has about one MB of data, we need to process about 33,000 GB of tweets. If we also take other sources into account, this could be even more data. Since we are currently only planning on calculating predictions on a weekly/monthly basis, processing doesn't need to be immensely fast. We strive for a scalable solution, so performance will depend on the resources available to us.
Choice of software
Tech stack in brief
We use Python to handle the communication with the Twitter API as well as to process the collected data locally. Python is a handy and well-supported language for data processing and analytics due to its simplicity, good performance, and most importantly the availability of library ecosystems such as Python-ETL, SciPy, Scrapy, etc.; in addition, it supports the current tech stack of the Twitter API well, such as OAuth2 for authentication and the JSON data format. Besides that, parallel data processing is empowered by MapReduce (Hadoop/Spark) in addition to a DBMS (SQL/NoSQL). Cloud computing, with its scalability, reliability, and high performance, would ideally satisfy our need for hardware infrastructure at low cost. Specifically, our server is expected to run Ubuntu on a medium-size Amazon Web Services EC2 instance, which is fully capable of handling most processing and computational tasks.
Web Crawling With Scrapy
Scrapy is an open-source web crawling package written in Python. It allows developers to define 'spiders' - classes representing instructions for the crawling engine on what content to 'scrape' (extract) from a web page as well as how to browse from page to page via hyperlinks. In principle, this platform offers the ideal functionality for exploring Twitter. Consider the following methodology: 1) identify several popular Indian celebrities as root nodes; 2) use Scrapy to interrogate the followers list of each of these celebrities and store the user ID of each follower; 3) use the stored user IDs to scrape the Twitter activity of these users and prepare it for analysis.
The Scrapy framework allows for the scraping of dynamic web content, hence it should be possible to traverse the entirety of a given followers list, a page element which is accessed through repeated requests for further information. As celebrities are typically followed overwhelmingly more frequently than other users, this technique could be exploited to very quickly obtain the IDs of many millions of Twitter users for further scraping of tweets. A user's tweets appear as page elements which can be easily harvested with Scrapy.
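A skeleton of such a spider is sketched below. The start URL and CSS selectors are purely hypothetical placeholders for Twitter's actual page structure (which, along with the terms of service, would need to be checked), and the sketch uses the selector API of recent Scrapy releases; in practice the REST API route described earlier is the safer option.

```python
import scrapy

class FollowerSpider(scrapy.Spider):
    """Sketch of a spider that walks from seed celebrity pages to follower lists."""
    name = "follower_spider"
    start_urls = [
        "https://twitter.com/example_celebrity/followers",  # hypothetical seed page
    ]

    def parse(self, response):
        # Hypothetical selector for follower profile links on the page.
        for href in response.css("a.profile-link::attr(href)").getall():
            yield {"profile": response.urljoin(href)}
        # Hypothetical selector for the 'more results' pagination link.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

# Run with: scrapy runspider follower_spider.py -o followers.json
```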
Nevertheless, as is true of this project as a whole, the effectiveness of this approach for supplying the prediction models with useful and workable information depends crucially on the availability of highly specialised geolocation data. Though we may be confident of collating a large volume of Twitter accounts belonging to Indian citizens, fine-grained geolocation data of tweets is not public by default, and many users simply choose never to opt in to revealing this information.
Why Spark?
Comparison to Hadoop
Mark Grover, a Hadoop engineer at Cloudera, on Hadoop: "You have map and reduce tasks and after that there's a synchronisation barrier and you persist all of the data to disc." While this feature was designed to allow a job to be recovered in case of failure, "the side effect of that is that we weren't leveraging the memory of the cluster to the fullest", he said. He then highlights the advantage of Spark: "What Spark really does really well is this concept of an Resilient Distributed Dataset (RDD), which allows you to transparently store data on memory and persist it to disc if it's needed. [..] there's no synchronisation barrier that's slowing you down. The usage of memory makes the system and the execution engine really fast."
http://www.zdnet.com/faster-more-capable-what-apache-spark-brings-to-hadoop-7000026149/
Hadoop integration
Spark integrates with Hadoop and can read from HDFS. This is a very useful feature for making humanitas available to more people, since a lot of existing infrastructure has HDFS support.
Spark interactive shell
The shell can be used to test small code snippets fast or do some quick analysis.
Many use Spark
Spark is already used a lot in production by companies like Yahoo!, Nokia, IBM, Intel, etc. There is also a large and diverse community contributing to the project.
Useful Extensions
Spark Streaming is built on top of the core Spark API and makes high-throughput, fault-tolerant stream processing of live data streams possible. Data can also be integrated from Twitter via a provided API.
Spark MLlib is a scalable machine learning library which supports a variety of machine learning algorithms that could be applied in humanitas.
Shark is a distributed SQL query engine which is compatible with Apache Hive. Since it also provides an interface for converting SQL queries into RDDs, it can be used to integrate query processing with machine learning to do efficient data analysis.
OpenDL is a deep learning training library built on top of Spark.
Part of our project might be contributing to these libraries, because we may have to extend their functionality to integrate our approach with parallel data processing.
Team organization
Project Members (9)
Alexander John Büsser, Anton Ovchinnikov, Ching-Chia Wang, Duy Nguyen, Gabriel Grill, Julien Graisse, Joseph Boyd, Fabian Brix, Stefan Mihaila.
Assigned TA: Aleksandar Vitorovic.
Team leader
The team leader is Fabian Brix (Sciper ID: 236334). The role of the team leader is to organize the project and drive its completion on time, as well as to monitor its progress together with the assigned teaching assistant, Aleksandar Vitorovic. He is also responsible for making sure every team member implements a fair share of the project.
Tentative assignment of tasks
Gabriel Grill: Online information processing and clustering with Spark, parallelization of ML algorithms
Ching-Chia Wang: Price prediction with Time Series Analysis
Joseph Boyd: Crawling, anomaly detection in online content and price sequences
Stefan Mihaila: Querying Twitter, price prediction with Recurrent Neural Networks
Duy Nguyen: Organizing documentation, Interacting with APIs on social media, Visualizing results, Designing of web services and interfaces
Anton Ovchinnikov: Data crunching & preprocessing (e.g. for PDFs), sentiment extraction from tweets, checking possible links between changes in sentiment and changes in price sequences.
Alexander Busser: Data gathering, crowdsourcing possibilities, association rule learning
Julien Graisse: Research of relevant data sources, price prediction with Machine learning methods (different types of regression, e.g. SVR)
Fabian Brix: Project management, defining properties for clustering and help with analysis, price prediction with Recurrent Neural Networks
Collaboration
In order to make sure that there are no overlaps between teams and work is not unnecessarily repeated, project teams in the Big Data course are obliged to assign a team member responsible for collaboration with teams that have related subjects. For our project an overlap exists in possible approaches to sentiment analysis with the project team that wants to predict bitcoin prices through time series analysis and social media analysis.
Documentation
The team is going to record its project in a LaTeX document which is created online in a team effort using the collaborative service writeLaTeX. Progress on the report can be followed by the professor and the TAs via the link to the document. Additionally, we are going to provide information on the project wiki page.
Content specification
There will be roughly three types of content contained in the documentation:
- Theory: Theoretical foundation of the deployed machine learning methods, foundation of deployed infrastructure models and architecture model of the implemented system.
- Implementation details: Translation of specific math & CS paradigms into practice, either on our own or with the use of freely available software packages.
- Evaluation: Here we evaluate different approaches towards the machine learning problems we face as well as different architecture models for the final system.
Documenting the project in this manner will allow us to put our final results into context and to estimate their relevance.
Milestones
In the following we define the project milestones. However, we emphasize the tentative nature of these milestones, as we cannot yet fully grasp the problems and possibilities that may arise during the implementation of the project.
Furthermore, we will not assign members to a certain piece of work in advance for the whole project; rather, we are going to have fluid assignments of members to sub-teams that implement parts of the project at different stages. For the assignment of team members to tasks of the respective parts we are already using Trello, an online organization tool. The work on the different parts of this project proposal was assigned in this way. Trello makes it easy to set deadlines and to check whether members have done their work, and we will report people who do not do their work according to the requirements.
April 6th - data collection scheme & ML research due
By March 31st the system for weekly updates from Twitter should be set up, together with clustering of the gathered data. A large portion of the group, about 5 people, will at first work on this part. Deliverables: set of keywords for the clustering, database of Twitter users in India, optimal querying for posts related to basic commodities, identification and crawling/querying of other relevant sites, clustering of tweets and other online content using Spark's MLlib.
A team of 4 people will work on researching the most suitable machine learning techniques, e.g. recurrent neural networks or support vector regression, and time series analysis techniques that allow sequence prediction as well as incorporation of the additional indicators from the social media analysis. Further research should go into noise models. Deliverables: short literature review of state-of-the-art methods used in price/numeric sequence prediction, detailed description of training the chosen methods, including a "how to" for incorporating sequence information taken from the web as well as the additional social media indicators into the model.
April 13th - ML models & architecture & theory part due
For this milestone the machine learning models should be chosen, specified and implemented. Furthermore, the architecture for the basic machine learning part is due, as well as the architecture used for gathering online information. On top of that, the theory part as defined in the content specification should be documented.
April 20th - Social media indicators due
At this milestone the social media indicators have to be devised for all large cities for which price data is provided by the Indian government. For this deadline to be met, any number of available members will be called on to work. The key aspect of this deadline is to get the anomaly detection right, so that we can actually detect changes in what is posted online about commodity prices and the supply level. For this we also have to account for the growth in the number of users of Twitter and other sites. The clusterings and the change indicators should be visualized in order to provide an easier basis for analysis.
April 27th - Integration of price sequence prediction and social media indicators due
Towards this milestone the whole team will work jointly on the validation of indicators in the data gathered online against patterns in the price sequence data, under the supervision of the members assigned to these tasks. The gained insights should allow us to refine the predictions given by the machine learning models. At this stage we will see whether it is possible not only to make the predictions more accurate by incorporating the information gained from online sources, but also to expand the models to predict prices for shorter, preferably weekly, periods of time.
May 11th - Evaluation & documentation & presentation due
For this milestone the evaluation of the different mathematical models for the price prediction is due, and all steps of the project have to be thoroughly documented as well.
May 13th - Project due date
At this milestone we will hand in the project, with the code from the GitHub repository humanitas, together with the documentation and the presentation.