DevSearch - Big Data Project Proposal

Revision 3

Project name

We named our project DevSearch. We assume that DevSearch will be a layer built on top of DevMine while remaining external to it.

The team

Access the member list here.

Short description of the project

DevSearch will be a tool that helps developers find relevant code from short snippets, thus providing context and examples of API and language usage. Queries will be language-agnostic so that the search can draw on the full range of open source software repositories.

Detailed description of the project goals

We aim to provide a user-friendly web interface where users can enter code snippets. Our system will build a feature vector from the input and use our trained model to search a database of GitHub projects. The feature vector will be based on the generic ASTs provided by the DevMine project, so search results can contain entries in different programming languages. We will also extend DevMine to support other popular languages such as Python and JavaScript.

Description of methods

We need to create two systems for this project: the online search system and the offline model training system.

For the training system, we will extract features from all ASTs in the DevMine database. We will define feature templates and use them to generate a large list of features such as "does the code contain a loop" or "is there a variable named x". The full list of features per repository will be stored on HDFS. We will then train a model on the resulting features × repositories matrix. The key components of this system will be the feature templates and the learning algorithm.
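The two example features above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python and its standard `ast` module as a stand-in for DevMine's generic ASTs, whereas the project itself plans to use Scala and Spark. The function name `extract_features` and the feature names `contains_loop` and `var_named:<name>` are hypothetical.

```python
import ast


def extract_features(source):
    """Return the set of feature names that fire on a Python snippet.

    Hypothetical sketch: Python's own AST stands in for a generic AST.
    """
    tree = ast.parse(source)
    features = set()
    for node in ast.walk(tree):
        # Template 1: "does the code contain a loop?"
        if isinstance(node, (ast.For, ast.While)):
            features.add("contains_loop")
        # Template 2: "is there a variable named <name>?"
        # Every identifier counts here, including names such as `range`.
        elif isinstance(node, ast.Name):
            features.add("var_named:" + node.id)
    return features


snippet = "for i in range(3):\n    x = i * 2\n"
print(sorted(extract_features(snippet)))
```

Each template yields zero or more named features per snippet, so one such set per repository gives one column of the features × repositories matrix.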

For the online search system, we will extract from the query snippet the same features that were used to train the model, then send the resulting feature vector to a distributed search system that multiplies it with the trained matrix. The best-scoring repositories will be selected, and their content will be fetched and sent back to the client.
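The scoring step can be sketched as a vector-matrix product followed by a top-k selection. This is a minimal single-machine illustration in Python, not the distributed system described above, which is planned in Scala. The feature names, repository names, weights, and the helpers `to_vector`, `score`, and `top_k` are all hypothetical.

```python
# Hypothetical feature and repository indices.
FEATURES = ["contains_loop", "var_named:x", "var_named:i"]
REPOS = ["repo_a", "repo_b", "repo_c"]

# WEIGHTS[f][r]: made-up trained weight of feature f for repository r,
# i.e. a features x repositories matrix.
WEIGHTS = [
    [0.75, 0.25, 0.0],
    [0.25, 0.75, 0.0],
    [0.5, 0.0, 0.5],
]


def to_vector(active_features):
    """Encode a set of active feature names as a 0/1 vector."""
    return [1.0 if f in active_features else 0.0 for f in FEATURES]


def score(query):
    """Multiply the query row vector by the features x repositories matrix."""
    return [
        sum(q * row[r] for q, row in zip(query, WEIGHTS))
        for r in range(len(REPOS))
    ]


def top_k(active_features, k=2):
    """Return the k best-scoring (repository, score) pairs, best first."""
    scores = score(to_vector(active_features))
    ranked = sorted(zip(REPOS, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


print(top_k({"contains_loop", "var_named:i"}))
```

In this sketch a query with a loop and a variable named `i` ranks `repo_a` first because both active features carry weight for it. A distributed version would presumably partition the matrix by repository and merge the partial top-k lists.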

We plan to use Scala for all sub-systems. We will use Spark for extracting features and MLlib for training the model. The front-end system will use the Play framework.

Required resources

We will need to index all of DevMine’s data and store the indexed data in a specialized data store. The data is already available through the DevMine database.

Risks to the success of the project

We see three main risks: defining a key that provides flexible yet precise matching, finding a data store that is suitable for structured data, and achieving adequate index-lookup performance.

Milestones, deliverables, work packages and assignment

We define three milestones and set the following tentative deadlines:

We have split the project into three separate layers. For each layer we define the following milestones:

Language parsers

Assignment: Julien, Nicolas V, Mateusz

Search layer

Assignment: Pascal, Pierre, Nicolas H, Raph, Bastien, Matthieu, Christian, Damien

UI

Assignment: Pascal, Pierre, Raph, Matthieu, Damien

Our Trello board contains a more detailed breakdown of tasks.