Proposal

DevSearch - Big Data Project Proposal

Revision 3

Project name

We named our project DevSearch. We assume that it will be built as a separate layer on top of DevMine, external to DevMine itself.

The team

Access the member list here.

Short description of the project

DevSearch will be a tool that helps developers find relevant code based on short snippets, thus providing context and examples of API and language usage. The query space will be language-agnostic in order to leverage the full extent of open source software repositories.

Detailed description of the project goals

We aim to provide a user-friendly web interface where users can enter code snippets. Our system will then build a feature vector from the input and use our trained model to search a database of GitHub projects. The feature vector will be based on the generic ASTs provided by the DevMine project. Because these ASTs are generic, the search results can contain entries in different programming languages. We will extend DevMine to include other popular languages such as Python and JavaScript.

Description of methods

We need to create two systems for this project: the online search system and the offline model training system.

For the training system we will perform feature extraction on all ASTs from the DevMine database. We will define and generate a large list of features such as "does the code contain a loop" or "is there a variable named x". The full list of features per repository will be stored on HDFS. Then we will perform learning on the features × repositories matrix. Key components of this system will be the feature templates as well as the learning algorithm.
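To make the idea of feature templates concrete, here is a minimal sketch of extracting binary features from a tree. It is written in Python for brevity, whereas the project plans to use Scala. The node layout (a dict with "kind", "name" and "children"), the loop node kinds and the feature names are all assumptions made for illustration. The actual DevMine AST format is not described in this proposal.

```python
from typing import Iterator, Set

# Hypothetical, simplified AST node: a dict with a "kind", an optional "name"
# and an optional list of "children". This is NOT the real DevMine format.
LOOP_KINDS = {"for", "while", "do_while"}  # assumed node kinds for loops


def walk(node: dict) -> Iterator[dict]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def extract_features(ast: dict) -> Set[str]:
    """Return the set of binary features that hold for this AST."""
    features = set()
    for node in walk(ast):
        if node["kind"] in LOOP_KINDS:
            # Template: "does the code contain a loop"
            features.add("contains_loop")
        if node["kind"] == "variable" and "name" in node:
            # Template: "is there a variable named x"
            features.add("variable_name=" + node["name"])
    return features


example_ast = {
    "kind": "function",
    "name": "sum",
    "children": [
        {"kind": "variable", "name": "x"},
        {"kind": "for", "children": [{"kind": "variable", "name": "i"}]},
    ],
}

print(sorted(extract_features(example_ast)))
# ['contains_loop', 'variable_name=i', 'variable_name=x']
```

Representing each repository as a set of feature strings keeps the templates independent of each other. A row of the features × repositories matrix can then be obtained by mapping each feature string to a column index.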

For the online search system we will extract from the query the same features that were used to train the model, and then send the resulting feature vector to a distributed search system that multiplies it with the trained matrix. The best matches will be selected, and the content of each matched repository will then be fetched and sent back to the client.
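The matching step can be sketched as follows, in Python for brevity and on a single machine rather than a distributed system. The repository names, feature columns and weights are invented for illustration. The real matrix would come from the offline training system.

```python
from typing import Dict, List

# Assumed column order of the feature vector (hypothetical feature names).
FEATURES = ["contains_loop", "variable_name=x", "uses_recursion"]

# Hypothetical trained matrix: one row of feature weights per repository.
TRAINED_MATRIX: Dict[str, List[float]] = {
    "repo/a": [1.0, 0.0, 0.5],
    "repo/b": [0.25, 1.0, 0.0],
    "repo/c": [0.0, 0.25, 1.0],
}


def score_repositories(query: List[float],
                       matrix: Dict[str, List[float]]) -> Dict[str, float]:
    """Multiply the query vector with the matrix: one dot product per row."""
    return {
        repo: sum(q * w for q, w in zip(query, row))
        for repo, row in matrix.items()
    }


def top_matches(query: List[float],
                matrix: Dict[str, List[float]],
                k: int = 2) -> List[str]:
    """Return the k repositories with the highest score, best first."""
    scores = score_repositories(query, matrix)
    # Sort by descending score; break ties by name for a stable result.
    return sorted(scores, key=lambda repo: (-scores[repo], repo))[:k]


# Query snippet that contains a loop and a variable named x.
query_vector = [1.0, 1.0, 0.0]
print(score_repositories(query_vector, TRAINED_MATRIX))
# {'repo/a': 1.0, 'repo/b': 1.25, 'repo/c': 0.25}
print(top_matches(query_vector, TRAINED_MATRIX))
# ['repo/b', 'repo/a']
```

In a distributed setting the rows of the matrix could be partitioned across workers, each returning its local top k, since the per-repository dot products are independent of each other.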

We plan to use Scala for all sub-systems. We will use Spark for extracting features and MLlib for training the model. The front-end system will use the Play framework.

Required resources

We will need to index all of DevMine’s data and store the indexed data in a specialized data store. The data is already available through the DevMine database.

Risks to the success of the project

We see three main risks:
- defining a key that provides flexible yet precise matching
- finding a data store that is suitable for structured data
- index-lookup performance

Milestones, deliverables, work packages and assignment

We define three milestones and set the following tentative deadlines:

March 23: Milestone 1 (M1)
April 15: Milestone 2 (M2)
May 12: Project delivery (M3)

We have split the project into three separate layers. For each layer we define the following milestones:

Language parsers

Assignment: Julien, Nicolas V, Mateusz

M1:
- create AST format
- write parser for Java
- initial exploration of potential program features
  - write feature extractor prototype
M2:
- implement Python support and integrate it with DevMine
- more and better feature extraction
- query language parser
M3:
- JavaScript + more

Search layer

Assignment: Pascal, Pierre, Nicolas H, Raph, Bastien, Matthieu, Christian, Damien

M1:
- write PageRank algorithm for repositories
- set up database or Spark job with all features for matching
- implement simple search based on number of matching features and PageRank
M2:
- implement learning algorithm
- write matrix multiplication for selection
- integrate with parser team features
M3:
- refine keys, optimize
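
The M1 tasks of the search layer can be illustrated with a small sketch, in Python for brevity. It assumes that repositories form a directed graph (for example, an edge from a repository to one it depends on) and that search results are ordered by number of matching features, with the rank score used as a tie-breaker. Both the edge definition and the way the two criteria are combined are assumptions, as the proposal does not fix them.

```python
from typing import Dict, List, Set


def page_rank(links: Dict[str, List[str]],
              damping: float = 0.85,
              iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a directed graph of repositories."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def simple_search(query: Set[str],
                  repo_features: Dict[str, Set[str]],
                  rank: Dict[str, float]) -> List[str]:
    """Order repositories by matching feature count, then by rank score."""
    candidates = [r for r, feats in repo_features.items() if query & feats]
    return sorted(
        candidates,
        key=lambda r: (-len(query & repo_features[r]), -rank.get(r, 0.0), r),
    )


# Hypothetical link graph: repo/a and repo/c both point to repo/b.
links = {"repo/a": ["repo/b"], "repo/c": ["repo/b"], "repo/b": []}
ranks = page_rank(links)

repo_features = {
    "repo/a": {"contains_loop", "variable_name=x"},
    "repo/b": {"contains_loop", "variable_name=x"},
    "repo/c": {"contains_loop"},
}
print(simple_search({"contains_loop", "variable_name=x"}, repo_features, ranks))
# ['repo/b', 'repo/a', 'repo/c']
```

Here repo/a and repo/b match the same number of features, so the higher rank score of repo/b decides the order.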

UI

Assignment: Pascal, Pierre, Raph, Matthieu, Damien

M1:
- define exact functionality of product (mockups)
- define interface between UI and search layer
- first look & feel implementation
M2:
- implement core UI features
M3:
- implement more advanced features

Our Trello board contains a more detailed breakdown of tasks.
