Weather project

Weather forecast for the USA based on meteorological data

Team

Abstract

Weather is known to be a chaotic system. Nevertheless, it is possible to make weather forecasts with some level of confidence. Our goal is to take a large amount of historical data about weather conditions in the USA, covering at most the last 30 years, and apply a number of large-scale machine learning algorithms to it in order to obtain short-term and long-term predictions. Based on the weather data, we will extend the web application further to make both predictions and recommendations. By combining weather data with Foursquare data and data sets from music sources, the application will offer a variety of functions that fit into the user's everyday life.

Project goals

The final goal of the project is a web service that provides recommendations based on the weather forecast for a chosen region of the USA, indicating the level of confidence of the prediction. There are four groups of recommendations:

1. Recommend places for sports (skiing, hiking, etc.) using data from Foursquare and weather data.
2. Recommend places for vacations, also using data from Foursquare, together with weekly forecasts.
3. Recommend places for photographers, romantic couples, etc. (clear sky full of stars, sunny winter...).
4. Recommend music depending on the weather, using different public music datasets.
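To make the idea of confidence-aware recommendations concrete, here is a minimal sketch. The `Forecast` fields, the activity names, and all thresholds are assumptions chosen for illustration; the real rules would come out of the classification stage described later.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    # Hypothetical forecast record; field names are illustrative only.
    region: str
    temp_c: float
    cloud_cover: float    # 0.0 = clear sky, 1.0 = fully overcast
    snow_depth_cm: float
    confidence: float     # 0.0 .. 1.0, confidence of the prediction


def recommend(forecast, min_confidence=0.6):
    """Return activity suggestions, or nothing if the forecast is too uncertain."""
    if forecast.confidence < min_confidence:
        return []
    suggestions = []
    # Thresholds below are made-up placeholders, not tuned values.
    if forecast.snow_depth_cm >= 30 and forecast.temp_c <= 0:
        suggestions.append("skiing")
    if forecast.cloud_cover <= 0.2:
        suggestions.append("stargazing")
    if 10 <= forecast.temp_c <= 25 and forecast.cloud_cover <= 0.5:
        suggestions.append("hiking")
    return suggestions


cold_clear = Forecast("Colorado", -5.0, 0.1, 80.0, 0.8)
uncertain = Forecast("Colorado", -5.0, 0.1, 80.0, 0.3)
```

Returning an empty list for low-confidence forecasts is one possible design; the service could instead return the suggestions together with the confidence value and let the user decide.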

In order to achieve this goal, we will need to pass 3 important milestones:

I. Data preprocessing

We will use data from multiple sources. Therefore, the data will come in different formats and precisions and will contain different types of information. In the data preprocessing stage of the project we will need to bring all of this data to a common format so that we can work with it independently of its source.
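As a rough illustration of what "a common format" could mean, the sketch below maps rows from two made-up sources onto one record type. The source layouts, the field names, and the `Observation` type are all assumptions; the actual format is to be fixed in the format definition step.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Observation:
    # Hypothetical common format shared by all datasets.
    station: str
    day: date
    temp_c: float
    source: str


def from_source_a(row):
    # Assumed layout of source A: US-style dates, temperature in Fahrenheit.
    day = datetime.strptime(row["date"], "%m/%d/%Y").date()
    temp_c = (float(row["temp_f"]) - 32.0) * 5.0 / 9.0
    return Observation(row["station_id"], day, round(temp_c, 2), "source_a")


def from_source_b(row):
    # Assumed layout of source B: ISO dates, temperature already in Celsius.
    day = date.fromisoformat(row["day"])
    return Observation(row["station"], day, round(float(row["t_celsius"]), 2), "source_b")


a = from_source_a({"station_id": "KSFO", "date": "03/09/2015", "temp_f": "50"})
b = from_source_b({"station": "KSFO", "day": "2015-03-10", "t_celsius": "11.5"})
```

With one converter per source, everything downstream only needs to know about `Observation`, which is what makes it possible to work with the data independently of its source.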

After bringing the data to a common format, we will need to perform feature processing: create new features that are useful for the prediction and remove those that we do not need (if any).
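A small sketch of such a feature processing step is shown below. The input fields (`temp_max_c`, `temp_min_c`, `source_note`) and the derived features are hypothetical examples, not a list of features the project has settled on.

```python
from datetime import date


def build_features(record, drop=("source_note",)):
    """Derive new features from one daily record and drop unneeded fields."""
    # Remove fields that are assumed to be useless for prediction.
    features = {key: value for key, value in record.items() if key not in drop}
    # New feature: daily temperature range.
    features["temp_range_c"] = record["temp_max_c"] - record["temp_min_c"]
    # New feature: day of the year, a simple way to expose seasonality.
    features["day_of_year"] = record["day"].timetuple().tm_yday
    return features


record = {
    "day": date(2015, 3, 26),
    "temp_max_c": 12.0,
    "temp_min_c": 4.0,
    "source_note": "imported from archive",
}
features = build_features(record)
```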

At the end of this milestone we will have a large dataset combined from different sources and a number of scripts that allow us to insert new data entries easily.

Deadline for this milestone: 26.03.2015

Deliverables: 

Each small team should deliver the code that cleans its data and performs the feature processing.

The teams will be split by dataset, one dataset per team. One person will start with the data format definition, then join one of the teams to help, and near the deadline merge all deliverables and test whether all the needed infrastructure works (and fix it if not).

Deadline for this milestone deliverables: 23.03.2015

Format definition and final merge:

Deadline for the format definition: 09.03.2015

Teams:

Technologies

Spark and HDFS to store and process the datasets. Bash, Perl, or Python scripts may also be used to prepare the data.

Resources

All 5 datasets together should take less than 1 TB of storage unzipped.

Plan B

We have 5 different datasets here. If some teams do not succeed, we will still have data to work with.

If none of the teams succeed, the obvious question will be "are we really working on this project?"

II. MapReduce, datasets processing, prediction and data classification

The main approach to weather forecasting is to create a model and apply it to an existing dataset. That is a huge computational and theoretical task.

That said, the goal of this stage of the project is to merge and process the data from the different sources (music, places from Foursquare, weather data), and then to use a number of machine learning algorithms implemented in Spark (described a little later) to predict the weather without a mathematical model of the whole system, showing the magnitude of the error of such predictions. Later, the data used for prediction will be classified (different classes for the different recommendations). Some people are sure that with enough data we can find a good model that fits this data - this stage will prove the contrary.

Deadline for this milestone: 16.04.2015

Deliverables:

As deliverables for this milestone, we will have implementations of the following ML algorithms:

We will have a system that compares the results of the different predictions and chooses the best one. We will also have the data classified for use in the different parts of the application (recommendations for sports, music, vacations and photographers).
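One simple way to compare predictions is to score each algorithm against held-out observations and keep the one with the smallest error. The sketch below uses root-mean-square error as the metric; the metric, the model names, and the numbers are assumptions for illustration only.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(observed))


def choose_best(predictions, observed):
    """Return the name of the lowest-error model and all error values."""
    errors = {name: rmse(values, observed) for name, values in predictions.items()}
    best = min(errors, key=errors.get)
    return best, errors


# Made-up temperatures (in Celsius) for three held-out days.
observed = [10.0, 12.0, 14.0]
predictions = {
    "model_a": [11.0, 12.0, 13.0],
    "model_b": [12.0, 14.0, 16.0],
}
best, errors = choose_best(predictions, observed)
```

The per-model error values can also be reported to the user as the magnitude of the prediction error.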

Teams:

Team members will be discussed after the first milestone.

Technologies

All these algorithms are implemented in Spark.

Resources

No additional resources are needed. After this stage, we should have less data (it will already be cleaned and optimized for our purposes).

Plan B

At least one prediction algorithm and one classification algorithm should be implemented in any case.

Otherwise - no plan B.

III. Web-service

At this stage we already have a system that makes weather forecasts. The goal here is to create a web application that presents this information in a clean way.

Deadline for this milestone: 08.05.2015

Deliverables

All small teams will merge their work, and the final deliverable will be a web application.

Teams:

Team members will be discussed after the second milestone.

Technologies

Backend: Scala, Play Framework and/or Python

Frontend: Bootstrap, AngularJS

Additional: we will probably use a "middleware" database such as MongoDB or MySQL to cache our forecast data. This will be decided after the second milestone.
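The caching idea can be sketched independently of the database choice. The example below uses an in-memory dictionary as a stand-in for MongoDB or MySQL; the class name, the time-to-live policy, and the `compute` callback are assumptions made for illustration.

```python
import time


class ForecastCache:
    """In-memory stand-in for the cache database (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the expiry logic can be tested
        self._store = {}         # region -> (stored_at, forecast)

    def get(self, region, compute):
        """Return the cached forecast for a region, recomputing it if expired."""
        now = self._clock()
        entry = self._store.get(region)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        forecast = compute(region)  # expensive call into the prediction system
        self._store[region] = (now, forecast)
        return forecast
```

The web application would then call `cache.get(region, run_prediction)` instead of invoking the prediction system on every request, where `run_prediction` is whatever function produces a forecast.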

Resources

No additional resources needed

Plan B

If creating the web application turns out to be too difficult, or if we run out of time, plan B is to create a simple API so that everyone can access our predictions and use them for their own purposes.
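Such a fallback API could be very small. The sketch below uses only the Python standard library; the `/forecast` endpoint, the `region` query parameter, and the contents of `FORECASTS` are hypothetical and stand in for the real prediction output.

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs

# Placeholder data; in the project this would come from the prediction stage.
FORECASTS = {"CA": {"temp_c": 18.0, "confidence": 0.7}}


def forecast_response(path, forecasts=FORECASTS):
    """Map a request path such as /forecast?region=CA to (status, JSON body)."""
    url = urlparse(path)
    if url.path != "/forecast":
        return 404, json.dumps({"error": "unknown endpoint"})
    region = parse_qs(url.query).get("region", [None])[0]
    if region not in forecasts:
        return 404, json.dumps({"error": "unknown region"})
    return 200, json.dumps({"region": region, **forecasts[region]})


class ForecastHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around forecast_response."""

    def do_GET(self):
        status, body = forecast_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))
```

For local testing, `HTTPServer(("", 8000), ForecastHandler).serve_forever()` from `http.server` would be enough to serve it; a production deployment would need a proper web server in front.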