Goals and ways to achieve them

The main goal is to build a database of developers, along with an evaluation of their programming skills, and make it available through a freely accessible API. This database could then be used by company recruiters or members of open-source projects to find developers with the specific skill set they are looking for. To build this database, we will first use GitHub data. Of course, thinking about this project in a larger scope than the course itself means that we will need to consider other sources of data (BitBucket, SourceForge, Gitorious, etc.) and maybe even other kinds of data (for instance, data from LinkedIn profiles to provide a developer's academic background, or data from StackOverflow such as user reputation).

To demonstrate the possibilities offered by the information exposed through this API, a web frontend will be implemented, although it is not at the core of this project.

The main idea behind this project is that someone's contributions to open-source projects can be a valuable source of information for recruiters. A website that highlights a developer's skills reasonably well, by taking into consideration not only their academic background or past employers but also their actual code contributions and interactions within open-source communities, would probably be a valuable source of information.

Social coding platforms, such as GitHub, have become quite popular and can provide valuable information. For instance, developers on GitHub can follow other developers or projects, propose pull requests, commit to projects, open issues and so on. All of this data can be retrieved through the GitHub API and even queried through the GitHub Archive project. Moreover, a developer's programming skills can be estimated from the languages of the projects they commit to the most, the complexity of these projects in terms of size and number of people involved, the number of contributions to high-profile projects and so on.

To achieve these goals, i.e., classifying developers by their skill set and ranking them, we will need to create a feature vector. This feature vector should be easy to extend with new features. Features should be derived from data analysis.
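
One possible way to keep the feature vector extensible is a small registry in which each feature is a function of a developer record. The sketch below is only an illustration under assumptions: the record layout (a dict with "followers" and "commits_by_language" keys) and the two example features are hypothetical placeholders, not features selected by the project.

```python
from typing import Any, Callable, Dict, List

# Hypothetical developer record: a plain dict of data already collected.
Developer = Dict[str, Any]
FeatureFn = Callable[[Developer], float]

# Registry mapping a feature name to the function that computes it.
FEATURES: Dict[str, FeatureFn] = {}


def feature(name: str) -> Callable[[FeatureFn], FeatureFn]:
    """Decorator that registers a function as a named feature."""
    def register(fn: FeatureFn) -> FeatureFn:
        FEATURES[name] = fn
        return fn
    return register


@feature("follower_count")
def follower_count(dev: Developer) -> float:
    # Assumed field: number of users following this developer.
    return float(dev.get("followers", 0))


@feature("python_commit_share")
def python_commit_share(dev: Developer) -> float:
    # Assumed field: commit counts keyed by project language.
    commits = dev.get("commits_by_language", {})
    total = sum(commits.values())
    return commits.get("Python", 0) / total if total else 0.0


def feature_names() -> List[str]:
    """Feature names in the (sorted) order used by feature_vector."""
    return sorted(FEATURES)


def feature_vector(dev: Developer) -> List[float]:
    """Evaluate every registered feature for one developer."""
    return [FEATURES[name](dev) for name in feature_names()]
```

With this layout, adding a new feature only requires writing one more decorated function; the vector grows without changes to the code that consumes it.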

Some features, for instance the programming languages known by a developer, will be pretty straightforward since this information can be retrieved almost directly through the GitHub API. However, other features will require a lot of data processing. For instance, we thought about a "developer pagerank" feature. The idea behind this is that, on GitHub, users can follow other users, watch projects, star projects or follow projects. These relations among developers and projects may be used as a feature.
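
To make the "developer pagerank" idea more concrete, here is a minimal sketch of the standard PageRank power iteration applied to a "who follows whom" graph. It is an assumption about how such a feature could be computed, not a settled design: the input format (a dict mapping each user to the list of users they follow), the damping factor of 0.85 and the fixed iteration count are illustrative choices, and project relations (watch, star) are left out.

```python
from typing import Dict, List


def developer_pagerank(
    follows: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Rank users by PageRank over a follow graph.

    follows[u] is the list of users that u follows (hypothetical input
    format). An edge u -> v passes part of u's rank on to v.
    """
    # Collect every user that appears as a follower or as a followed user.
    nodes = set(follows)
    for targets in follows.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over all users.
        dangling = sum(rank[u] for u in nodes if not follows.get(u))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {u: base for u in nodes}
        for user, targets in follows.items():
            if targets:
                share = damping * rank[user] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

The resulting score could then be registered as one more entry of the feature vector. On a graph the size of GitHub, a plain in-memory loop like this one would likely need to be replaced by a distributed or sparse-matrix implementation.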