Meeting Reports
Date: 06.03.2015
Time: 15.00 - 17.00
Attendees: Christian, Matthieu, Nicolas V., Nicolas H., Pascal
Author: Christian
Agenda: Create proposal
Done since the last meeting
- Read proposal specifications
Items discussed
- All items in the project proposal
Date: 13.03.2015
Time: 15.00 - 17.00
Attendees: All except Bastien, Mateusz, Nicolas V., Pierre and Raphael
Author: Nicolas Hubacher
Agenda: Define more precisely the first tasks
Items discussed
- DevSearch's architecture:
Most of this meeting's time was spent on this subject. The result can be seen in the project proposal.
- Definition and distribution of tasks:
We have defined 7 tasks, of which 3 need to be done by next Tuesday. These tasks are the following:
- Find out how to connect DevMine with our project (DB, API, architecture of DevMine)
- Find out how a search engine works
- Identify potential features
- Find ML algorithms that could be used by DevSearch
- Write a function that takes an AST and produces a feature vector*
- Write a Spark script to generate features given a feature function*
- First UI version (web page with search function)*
* needs to be done by the first milestone (23.02.15)
The assignment of the tasks is done on DevSearch's Trello board.
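For the "function that takes an AST and produces a feature vector" task, here is a minimal sketch of the idea in plain Python, using the standard library's ast module. The feature names ("import:…", "inherits:…") are invented for illustration; the real pipeline works on DevSearch's own parser output, not Python's ast.

```python
import ast

def extract_features(tree):
    """Walk an AST and emit a flat list of feature strings.
    Feature name scheme ("import:", "inherits:") is illustrative only."""
    features = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                features.append(f"import:{alias.name}")
        elif isinstance(node, ast.ImportFrom):
            features.append(f"import:{node.module}")
        elif isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name):
                    features.append(f"inherits:{base.id}")
    return features

source = "import os\nclass Reader(Parser):\n    pass\n"
print(extract_features(ast.parse(source)))  # ['import:os', 'inherits:Parser']
```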
Date: 17.03.2015
Time: 14.30 - 16.00
Attendees: Everybody but Julien Graisse
Author: Damien Engels
Agenda: Summary of research tasks & planning for the week
Done since the last meeting
- Set up website
Items discussed
With the knowledge gained over the past few days, we decided to move the first milestone to Friday, March 27th.
A minimum viable product doesn't need machine learning, so we will first focus on getting search working with a simple PageRank and simple feature intersection, and add machine learning afterwards.
PageRank: Matthieu and Bastien
Look at what the devmine team did, and find a way to get a way to rank repositories and files.
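The "simple PageRank" mentioned above can be sketched as plain power iteration, independent of the DevMine data. The graph, node names and damping factor below are illustrative, and dangling mass is simply dropped in this sketch:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbours]}.
    Dangling nodes (no outgoing arcs) lose their mass in this simplified version."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank

# Toy graph: user -> starred repo, repo -> contributor (names invented).
graph = {"alice": ["repo/a"], "bob": ["repo/a", "repo/b"], "repo/a": ["alice"], "repo/b": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # repo/a: it has the most inbound rank
```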
Parsing java: Nicolas V.
The parsers from DevMine are too simple to do anything useful, so we will write our own to start with.
Distributed tree parsing: Nicolas H., Damien
Implement a Spark app to parse all repositories. We will need to ask for some space on HDFS to store the parsed trees.
Feature extraction & build index: Mateusz, Pierre, Raphael, Julien
Given the AST, implement a Spark app to extract all features.
Set up datastore for index: Pascal, Christian
HDFS has no index, which makes it impractical to search. See how we can use MongoDB to store the index; we will need computing resources for this (Azure).
Date: 27.03.2015
Time: 15.00 - 17.00
Attendees: Nicolas H, Nicolas V, Pierre, Damien, Julien, Matthieu, Pascal
Author: Pascal Lau
Agenda: Milestone I
Items discussed
- Team Julien, Pierre, Mateusz
- Done
- Extracted features from files (import and inheritance)
- Next to do
- Extract keys which are unique for each feature
- Extract list of identifiers and their type as new features
- Make it work on HDFS, integrate it with Spark
- Identified problems
- How to identify a repository?
- Team Nicolas V
- Done
- Java parser
- Next to do
- Parser for Python
- Parser for Ruby
- Parser for the queries we will submit to the search engine
- Identified problems
- How to parse only a snippet of code?
- How to handle code that does not get parsed?
- Team Damien, Nicolas H
- Done
- Script for aggregating DevMine data is almost finished; we should have the data by the end of next week
- Next to do
- Finalize script and fix bugs
- Compress the data and send it to HDFS
- Identified problems
- Symbolic links are annoying, and it is quite hard to decide whether a file is text.
- Team Pascal, Christian
- Done
- Code to connect to icdataportal2 via Play. Can start a script from there.
- Next to do
- Add the query parser to Play such that we can use it to parse queries
- Try to start some Spark job from Play with a Spark script
- Set up SSH tunnel on Heroku
- Identified problems
- Define a way to match the features
- Team Matthieu, Bastien
- Done
- Scrape starring events from GitHub with BigQuery.
- Next to do
- Get all the commits from users with BigQuery
- Identified problems
- Information only available from 2011.
- DevMine does not have the information about who contributed to which repository.
- DevMine has information like "email X pushed to repository Y", but we cannot match users to email addresses
Date: 02.04.2015
Time: 11.00 - 13.00
Attendees: Bastien, Christian, Matthieu, Nicolas H, Nicolas V, Julien, Pierre
Author: Nicolas Hubacher
Agenda: Discuss architecture, Define and distribute tasks for Milestone II
Done since the last meeting
The devsearch-concat script has finished its job. The output folder is of size 386.6GB.
Items discussed
Architecture: The discussion about the architecture of our project resulted in a more detailed diagram. Some of the most important points:
- Feature Representation: Features will be represented as a key-value pair, where the key is the feature's name and the value its location (repo owner, repo name, file path, line position).
- Index representation: The key of a file consists of its features. The ranking of files is done after getting the query request.
- PageRank: The algorithm runs on a graph with 'repository' and 'user' nodes. The nodes are connected as follows: if a user stars a repo, there is an arc from the user to the repo; if a user contributes to a repo, there is an arc from the repo to the user.
- Online Part: Our plan is to do the searching of code (lookup, matching, groupBy and sorting) with Spark for M2. For the third milestone we want to move it to Azure.
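The feature representation agreed on above can be sketched as a key-value store in a few lines. The Location fields mirror the list above (repo owner, repo name, file path, line position); the concrete feature names and repositories are invented:

```python
from collections import namedtuple

# Value part of the pair: where a feature occurs.
Location = namedtuple("Location", ["repo_owner", "repo_name", "file_path", "line"])

index = {}

def add_feature(name, loc):
    """Store one (feature name -> locations) entry, i.e. the key-value pair
    described in the architecture notes."""
    index.setdefault(name, []).append(loc)

add_feature("import:os", Location("alice", "tools", "src/main.py", 1))
add_feature("import:os", Location("bob", "scripts", "run.py", 3))

# Lookup gives every place a feature occurs.
print(index["import:os"])
```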
Tasks for Milestone II (due by 15.04.15):
Our goal for Milestone II is to get everything integrated. The actual tasks and the responsible team members can be found on our Trello board.
Our slogan for this milestone: Don't make your code perfect from the beginning! Create something that works and make it better later...
Date: 21.04.2015
Time: 13.00 - 14.30
Attendees: Everybody
Author: Nicolas Hubacher
Agenda: Machine Learning, Implementation of Feature Matching, Inverted Index, Task Review, Task Assignment, Code Formatting
Done since the last meeting
We have managed to integrate the whole system. During a meeting with Amir we were able to demonstrate that the code search basically works.
Items discussed
- How are we going to use machine learning now?
The big question was whether we can use machine learning at all, and how. The problem is that applying ML to all the features would lead to overfitting because there are too many factors. Therefore we agreed to apply it to "super features" such as the number of matches, and the locality and diversity of matching features. The trained model will tell us which kind of match is the best one and can be used for sorting the search results.
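A small sketch of what such "super features" could look like, computed per candidate file from its matched features. The exact definitions of locality and diversity here are illustrative guesses, not the agreed metrics:

```python
def super_features(matches):
    """matches: list of (feature_name, line) pairs matched in one candidate file.
    Returns coarse aggregate features suitable for training a ranking model."""
    lines = [line for _, line in matches]
    names = {name for name, _ in matches}
    return {
        "n_matches": len(matches),
        "locality": max(lines) - min(lines) if matches else 0,  # line span of matches
        "diversity": len(names),  # number of distinct feature kinds
    }

print(super_features([("import:os", 1), ("call:open", 10), ("import:os", 12)]))
# {'n_matches': 3, 'locality': 11, 'diversity': 2}
```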
- How are we going to implement the feature matching operations? (Akka)
We still have one Azure account. Running the Play application would cost around $20. We discussed whether we should do the sorting with Akka, and how.
- Inverted Index
The main point of this part of the discussion was whether we should create an inverted index ourselves or let MongoDB create an index for us. We concluded that we first need to check what kinds of indexes MongoDB provides.
So we will start by working with a provided index and then look further.
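Whichever way the index is built, the lookup side works the same. A toy sketch of the query path, where a dict plays the role a MongoDB index would play in the real system (feature names and file names are invented):

```python
from collections import Counter

# Toy inverted index: feature -> files containing it.
inverted = {
    "import:os": ["a.py", "b.py"],
    "inherits:Parser": ["a.py"],
    "call:open": ["a.py", "c.py"],
}

def search(query_features):
    """Count how many query features each file matches and rank by that count,
    i.e. the lookup / groupBy / sorting steps in miniature."""
    hits = Counter()
    for feature in query_features:
        for file in inverted.get(feature, []):
            hits[file] += 1
    return [file for file, _ in hits.most_common()]

print(search(["import:os", "inherits:Parser"]))  # ['a.py', 'b.py']: a.py matches both
```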
- Assigning Tasks
We have identified critical tasks such as finishing or fixing the offline Spark job or getting into Akka. The actual tasks and their assignment can as usual be seen on our Trello board.
Remember: M3 is due by 12.05!