Weather project
Weather forecast for the USA based on meteorological data
Team
- Roman Shirochenko (team leader)
- Baptiste Vinh Mau
- Frédéric Bonnand
- <anonymous>
- Ivan Maslov
- Nemanja Drobnjak
- Victor Constantin
Abstract
Weather is known to be a chaotic system. Nevertheless, it is possible to make weather forecasts with some level of confidence. Our goal is to take a large amount of historical data about weather conditions in the USA for the last 30 years (at most) and apply a number of large-scale machine learning algorithms to obtain short-term and long-term predictions. Building on the weather data, we extend the web application to make recommendations as well as predictions. By combining weather data with Foursquare data and data sets from music sources, the application offers a variety of functions that fit into the user's everyday life.
Project goals
The final goal of the project is a web service that provides recommendations based on the weather forecast for a chosen region of the USA, indicating the level of confidence of the prediction. There are four groups of recommendations:
1. Recommend places for sports (skiing, hiking, etc.) using data from Foursquare and weather data.
2. Recommend places for vacations, also using data from Foursquare, together with weekly forecasts.
3. Recommend places for photographers, romantic couples, etc. (clear sky full of stars, sunny winter...).
4. Recommend music depending on the weather, using different public music datasets.
In order to achieve this goal, we need to pass 3 important milestones:
I. Data preprocessing
We will use data from multiple sources. The data will therefore differ in format, precision, etc., and contain different types of information. In the data preprocessing stage of the project we need to bring all these data to a common format so that we can work with them independently of their source.
After bringing the data to a common format, we need to perform a feature processing task: create new features that are useful for the prediction and remove those that we do not need (if any).
At the end of this milestone we will have one large dataset built from the different sources and a number of scripts that allow us to insert new data entries easily.
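The conversion to a common format and the feature processing step could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `CommonRecord` schema, the two source layouts, and the field names (`station`, `temp_tenths_c`, `temp_f`, ...) are hypothetical and do not describe the real formats of the datasets listed in this page.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CommonRecord:
    """Hypothetical common format shared by all sources."""
    source: str
    station_id: str
    day: date
    temperature_c: float


def from_source_a(row: dict) -> CommonRecord:
    # Assumed layout: date as a YYYYMMDD string, temperature in tenths of a degree Celsius.
    d = row["date"]
    return CommonRecord(
        source="a",
        station_id=row["station"],
        day=date(int(d[:4]), int(d[4:6]), int(d[6:8])),
        temperature_c=row["temp_tenths_c"] / 10.0,
    )


def from_source_b(row: dict) -> CommonRecord:
    # Assumed layout: ISO date string, temperature in degrees Fahrenheit.
    return CommonRecord(
        source="b",
        station_id=row["id"],
        day=date.fromisoformat(row["day"]),
        temperature_c=(row["temp_f"] - 32.0) * 5.0 / 9.0,
    )


def add_features(rec: CommonRecord) -> dict:
    # Example derived features: day of year and a below-freezing flag.
    return {
        "station_id": rec.station_id,
        "day_of_year": rec.day.timetuple().tm_yday,
        "temperature_c": rec.temperature_c,
        "freezing": rec.temperature_c < 0.0,
    }
```

With one converter per source, the rest of the pipeline only ever sees `CommonRecord` values, so adding a new dataset means writing one more converter function.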
Deadline for this milestone: 26.03.2015
Deliverables:
Each small team should deliver the code that cleans the data and performs the feature processing task.
The teams will be split by dataset - one dataset per team. One person will start with the data format definition, then join one of the teams to help them, and near the deadline that same person will merge all deliverables and test whether all the needed infrastructure works (and fix it if not).
Deadline for this milestone deliverables: 23.03.2015
Format definition and final merge:
Deadline for the format definition: 09.03.2015
Teams:
- Team 1:
- Dataset: http://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn
- Members: Bat' Vinh Mau, Vincent de Marignac
- Team 2:
- Dataset: http://www.ncdc.noaa.gov/data-access/radar-data
- Members: Santtu Saijets, Azfar Bassir
- Team 3:
- Dataset: https://developer.foursquare.com/
- Members: Ivan Maslov, Nemanja Drobnjak
- Team 4:
- Dataset: http://www.ncdc.noaa.gov/data-access/satellite-data/satellite-data-access-datasets
- Members: Roman Shirochenko, Frédéric Bonnand
- Team 5:
- Dataset: http://berkeleyearth.org/about-data-set?/dataset/
- Members: Victor Constantin, <anonymous>
Technologies
Spark and HDFS to store and process the datasets. Scripting languages (bash, Perl, Python) may be used to prepare the data.
Resources
All 5 datasets together should take no more than 1 TB of storage unzipped.
Plan B
We have 5 different datasets here. If some teams do not succeed, we still have data to work with.
If all teams fail, the obvious question will be "are we really working on this project?"
II. MapReduce, datasets processing, prediction and data classification
The main approach to weather forecasting is to create a model and apply it to an existing dataset. That is a huge computational and theoretical task.
That said, the goal of this stage is to merge and process data from the different sources (music, places from Foursquare, weather data), and then to use a number of machine learning algorithms implemented in Spark (described below) to predict the weather without a mathematical model of the whole system, while showing the magnitude of the error of such predictions. Later, the data used for prediction will be classified (different classes for the different recommendations). Some people are sure that, given enough data, we can find a good model that fits this data - this stage will prove the contrary.
Deadline for this milestone: 16.04.2015
Deliverables:
As deliverables for this milestone, we will have implementations of the following ML algorithms:
- Ridge Regression for linear predictions
- Kernel Ridge Regression for nonlinear predictions
- Principal component decomposition for feature space reduction
- K-means and neural networks for classifying weather into a more human-readable format (good or bad weather, good for skiing, bad for hiking, etc.)
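As an illustration of the first algorithm in the list, here is a minimal plain-Python sketch of ridge regression via the normal equations, (XᵀX + λI) w = Xᵀy. It is not the Spark implementation planned for the project; the function names (`solve`, `ridge_fit`, `predict`) are hypothetical, and for simplicity the intercept column is penalized together with the other weights.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest pivot candidate into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ridge_fit(xs, ys, lam):
    """Return weights w minimizing ||Xw - y||^2 + lam * ||w||^2."""
    d = len(xs[0])
    # Normal equations: (X^T X + lam * I) w = X^T y
    xtx = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(xtx, xty)


def predict(w, x):
    """Linear prediction: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

Each input row is expected to start with a constant 1.0 if an intercept is wanted, for example `ridge_fit([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0], lam=0.1)`. Larger values of `lam` shrink the weights and trade some fit quality for stability.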
We will have a system that compares the results of the different predictions and chooses the best one, and we will also have the data classified for use in the different parts of the application (recommendations for sports, for music, for vacations and for photographers).
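One simple way to compare predictions is to score each model on held-out data and keep the one with the lowest error. The sketch below assumes root-mean-square error as the metric and represents each model as a plain callable; both choices, and the names `rmse` and `choose_best`, are illustrative assumptions rather than the project's decided design.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def choose_best(models, inputs, targets):
    """Return (name, error) of the model with the lowest RMSE on held-out data.

    `models` maps a name to a callable that takes one input and returns a prediction.
    """
    scores = {name: rmse([m(x) for x in inputs], targets) for name, m in models.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]
```

For example, `choose_best({"persistence": lambda x: x, "constant": lambda x: 10.0}, inputs, targets)` compares a "tomorrow equals today" baseline against a constant guess. The returned error can also serve as the magnitude of error reported alongside a prediction.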
Teams:
Team members will be decided after the first milestone.
- Team 1
- Task: Merge and process data from different sources
- Team size: 2
- Team 2
- Task: Prediction
- Team size: 5
- Team 3
- Task: Data classification
- Team size: 3
Technologies
All these algorithms are implemented in Spark.
Resources
No additional resources are needed. After this stage we should have less data (it will already be cleaned and optimized for our purposes).
Plan B
At least one prediction algorithm and one classification algorithm must be implemented in any case.
Otherwise - no plan B.
III. Web-service
At this stage we already have a system that makes weather forecasts. The goal here is to create a web application that presents this information in a clean way.
Deadline for this milestone: 08.05.2015
Deliverables
All small teams will merge their work, and the final deliverable will be a web application.
Teams:
Team members will be decided after the second milestone.
- Team 1
- Task: create a system that will provide the data to the web application server
- Team size: 4
- Team 2
- Task: create a web application backend
- Team size: 4
- Team 3
- Task: create a web application frontend
- Team size: 3
Technologies
Backend: Scala, Play Framework and/or Python
Frontend: Bootstrap, AngularJS
Additional: we will probably use a "middleware" database such as MongoDB or MySQL to cache our forecast data. This will be decided after the second milestone.
Resources
No additional resources needed
Plan B
If creating the web application turns out to be too difficult, or if we run out of time, plan B is to create a simple API so that everyone can access our predictions and use them for their own purposes.
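A plan B API could be as small as one read-only JSON endpoint. The sketch below uses only the Python standard library; the `/forecast?region=<code>` route, the `FORECASTS` table and its values, and the response fields are all hypothetical placeholders, not the project's actual interface or data.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Hypothetical cached predictions; a real service would read them from the pipeline output.
FORECASTS = {
    "CA": {"temperature_c": 21.5, "confidence": 0.8},
}


def forecast_response(path):
    """Map a request path like /forecast?region=CA to (status code, JSON body)."""
    url = urlparse(path)
    region = parse_qs(url.query).get("region", [None])[0]
    if url.path != "/forecast" or region is None:
        return 400, json.dumps({"error": "use /forecast?region=<code>"})
    if region not in FORECASTS:
        return 404, json.dumps({"error": "unknown region"})
    return 200, json.dumps({"region": region, **FORECASTS[region]})


class ForecastHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Delegate routing to the pure function so it can be tested without a socket.
        status, body = forecast_response(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8000):
    """Create (but do not start) the HTTP server."""
    return HTTPServer(("127.0.0.1", port), ForecastHandler)
```

Calling `make_server().serve_forever()` starts the service locally. Keeping the routing in `forecast_response` separates the logic from the transport, so the same function could later sit behind the Play Framework or Python backend mentioned above.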