Proposal
DevSearch - Big Data Project Proposal
Revision 3
Project name
We named our project DevSearch. We assume that we are building a layer on top of DevMine that will be external to it.
The team
Access the member list here.
Short description of the project
DevSearch will be a tool that helps developers find relevant code based on short snippets, thus providing context and examples for API and language usage. The query space will be language-agnostic to leverage the full extent of open source software repositories.
Detailed description of the project goals
We aim to provide a user-friendly web interface where users can enter code snippets. Our system will then build a feature vector from the input and use our trained model to search a database of GitHub projects. Our feature vector will be based on the generic ASTs provided by the DevMine project. The search results will contain entries in different programming languages. We will extend DevMine to include other popular languages such as Python and JavaScript.
Description of methods
We need to create two systems for this project: the online search system and the offline model training system.
For the training system we will perform feature extraction on all ASTs from the DevMine database. We will define and generate a large list of features such as "does the code contain a loop" or "is there a variable named x". The full list of features per repository will be stored on HDFS. Then we will perform learning on the features × repositories matrix. Key components of this system will be the feature templates as well as the learning algorithm.
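The feature templates described above could look roughly like the following sketch. It assumes a tiny, hypothetical AST (`Block`, `Loop`, `VarDecl`, `Call`) that stands in for the DevMine AST format, and it encodes each binary feature as a string. None of these names come from DevMine.

```scala
// Minimal sketch of feature extraction over a simplified, hypothetical AST.
// The node types below are assumptions for illustration only; they are not
// the DevMine AST format.
object FeatureSketch {
  sealed trait Node
  final case class Block(children: List[Node]) extends Node
  final case class Loop(body: Node) extends Node
  final case class VarDecl(name: String) extends Node
  final case class Call(function: String) extends Node

  // Walks the tree and collects string-encoded binary features such as
  // "does the code contain a loop" or "is there a variable named x".
  def extract(node: Node): Set[String] = node match {
    case Block(children) => children.flatMap(extract).toSet
    case Loop(body)      => extract(body) + "containsLoop"
    case VarDecl(name)   => Set(s"variableName=$name")
    case Call(function)  => Set(s"calls=$function")
  }

  // A small example tree: declare x, then call println inside a loop.
  val sample: Node = Block(List(
    VarDecl("x"),
    Loop(Block(List(Call("println"))))
  ))
}
```

Calling `FeatureSketch.extract(FeatureSketch.sample)` yields the feature set for one snippet. The same function would be applied to every repository to fill one column of the features × repositories matrix.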
For the online search system we will extract the same features as used in the model and then send the feature vector to a distributed search system that will multiply the input vector with the trained matrix. The best matches will be selected and then the content of the matched repository will be fetched and sent back to the client.
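As a rough illustration of the multiplication step, the sketch below scores every repository against a query vector and keeps the best matches. It assumes a small dense in-memory matrix indexed as `model(feature)(repository)`; the names `scores` and `topK` are hypothetical, and a real deployment would distribute this computation.

```scala
// Minimal sketch of scoring by vector-matrix multiplication, assuming a
// dense features x repositories matrix held in memory (an assumption; the
// actual system is meant to be distributed).
object SearchSketch {
  // model(f)(r) is the learned weight of feature f for repository r.
  def scores(model: Vector[Vector[Double]], query: Vector[Double]): Vector[Double] = {
    val repoCount = model.headOption.map(_.size).getOrElse(0)
    Vector.tabulate(repoCount) { r =>
      model.indices.map(f => query(f) * model(f)(r)).sum
    }
  }

  // Pairs each repository name with its score and keeps the k highest.
  def topK(
      names: Vector[String],
      model: Vector[Vector[Double]],
      query: Vector[Double],
      k: Int
  ): Vector[(String, Double)] =
    names.zip(scores(model, query)).sortBy { case (_, s) => -s }.take(k)
}
```

The repositories returned by `topK` are the ones whose content would then be fetched and sent back to the client.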
We plan to use Scala for all sub-systems. We will use Spark for extracting features and MLlib for training the model. The front-end system will use the Play framework.
Required resources
We will need to index all of DevMine’s data and store the indexed data in a specialized data store. The data is already available through the DevMine database.
Risks to the success of the project
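We see three main risks. First, we need to define a key that provides flexible yet precise matching. Second, we need a data store that is suitable for structured data. Third, index-lookup performance must be good enough for interactive search.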
Milestones, deliverables, work packages and assignment
We define three milestones and set the following tentative deadlines:
- March 23: Milestone 1 (M1)
- April 15: Milestone 2 (M2)
- May 12: Project delivery (M3)
We have split the project into three separate layers. For each layer we define the following milestones:
Language parsers
Assignment: Julien, Nicolas V, Mateusz
- M1:
  - create AST format
  - write parser for Java
  - initial exploration of potential program features
  - write feature extractor prototype
- M2:
  - integrate / implement Python with DevMine
  - more and better feature extraction
  - query language parser
- M3: JavaScript + more
Search layer
Assignment: Pascal, Pierre, Nicolas H, Raph, Bastien, Matthieu, Christian, Damien
- M1:
  - write page rank algorithm for repositories
  - set up database or Spark job with all features for matching
  - implement simple search based on number of matching features and page rank
- M2:
  - implement learning algorithm
  - write matrix multiplication for selection
  - integrate with parser team features
- M3:
  - refine keys, optimize
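The M1 search tasks could be prototyped along the lines of the sketch below. It is an in-memory illustration under stated assumptions, not the planned Spark job. The link graph between repositories (for example "depends on") is an assumption, since the proposal does not say which edges page rank would use, and `pageRank` and `search` are hypothetical names.

```scala
// Minimal sketch of PageRank by power iteration plus a simple ranked search.
// The link graph and all names here are assumptions for illustration.
object RankSketch {
  // links(a) is the set of repositories that a points to.
  def pageRank(
      links: Map[String, Set[String]],
      damping: Double = 0.85,
      iterations: Int = 50
  ): Map[String, Double] = {
    val nodes = (links.keySet ++ links.values.flatten).toVector
    val n = nodes.size
    var rank: Map[String, Double] = nodes.map(v => v -> (1.0 / n)).toMap
    for (_ <- 1 to iterations) {
      // Rank held by nodes without outgoing links is spread evenly.
      val dangling = nodes
        .filter(v => links.getOrElse(v, Set.empty[String]).isEmpty)
        .map(rank)
        .sum
      // Each node splits its current rank equally among its targets.
      val contributions = for {
        (src, targets) <- links.toVector
        dst <- targets.toVector
      } yield dst -> (rank(src) / targets.size)
      val received = contributions
        .groupBy(_._1)
        .map { case (dst, xs) => dst -> xs.map(_._2).sum }
      rank = nodes.map { v =>
        v -> ((1 - damping) / n + damping * (received.getOrElse(v, 0.0) + dangling / n))
      }.toMap
    }
    rank
  }

  // Orders repositories by number of matching features, then by page rank.
  def search(
      query: Set[String],
      index: Map[String, Set[String]],
      rank: Map[String, Double]
  ): Vector[String] =
    index.toVector
      .map { case (repo, features) =>
        (repo, (features intersect query).size, rank.getOrElse(repo, 0.0))
      }
      .filter(_._2 > 0)
      .sortBy { case (_, matches, pr) => (-matches, -pr) }
      .map(_._1)
}
```

Using page rank only as a tie-breaker keeps the ordering easy to reason about. A weighted combination of both signals would be an alternative once a trained model is available.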
UI
Assignment: Pascal, Pierre, Raph, Matthieu, Damien
- M1:
  - define exact functionality of product (mockups)
  - define interface between UI and search layer
  - first look & feel implementation
- M2:
  - implement core UI features
- M3:
  - implement more advanced features
Our Trello board contains a more detailed breakdown of tasks.