Data Syndrome

In our last two posts, we covered Downloading and Processing the Github Archive and Github’s 18 event types.

In this post we’re going to create an implied rating system from the GitHub Archive data. Implied rating systems, as opposed to literal or direct rating systems, are inferred from records of user actions or gestures. You’re familiar with implied ratings already: your FICO score is an implied rating. Just like your credit record, the GitHub Archive is a rich data set to infer ratings from: it contains over 100GB of user actions, from which we can infer ratings for GitHub repositories. Inferred ratings are great for building recommender systems, which we’ll talk more about in the next post.

I reviewed the 18 GitHub Archive event types (covered yesterday) and came up with the following implied rating system:

ViewEvent (missing) - 0.0 Rating

Missing from this analysis are GitHub’s web traffic logs, which are not public. Were we able to access those, we would rate a project 0.0 when a user views it but does not interact with it. This would dramatically widen the scope of between-user comparisons we end up performing, which in turn would improve the performance of our recommender system.
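To make the point about between-user comparisons concrete, here is a minimal sketch. It assumes two users are compared on the projects they have both rated; the users, repositories, view records and the co_rated helper are all hypothetical and exist only for illustration, since we have no real view data.

```python
def co_rated(ratings_a, ratings_b):
    """Return the repos both users have a rating for (the basis for comparing them)."""
    return set(ratings_a) & set(ratings_b)

# Implied ratings from public events only (hypothetical users and repos).
alice = {"repo1": 3.0, "repo2": 1.0}
bob = {"repo2": 2.0, "repo3": 4.0}

# Hypothetical view-only records: viewed but never interacted with, so rated 0.0.
alice_views = {"repo3": 0.0}
bob_views = {"repo1": 0.0}

without_views = co_rated(alice, bob)
# Merge the views in first so a real interaction always wins over a 0.0 view.
with_views = co_rated({**alice_views, **alice}, {**bob_views, **bob})

print(len(without_views), len(with_views))
```

In this toy case the two users share one rated project without view data and three with it, which is the kind of widening described above.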

WatchEvent - 1.0 Rating

A WatchEvent is generated when a user clicks ‘watch project’ on a GitHub project. This indicates interest in the project, so we assign it an implied rating of 1.0.

https://gist.github.com/rjurney/5686720

IssuesEvent - 2.0 Rating

An IssuesEvent is generated when a user files an issue with a GitHub project. This implies the user has acquired and attempted to use the project, so we give it an implied rating of 2.0.

https://gist.github.com/rjurney/5687106

ForkEvent - 3.0 Rating

A ForkEvent occurs when a user forks a project, which means they are not only interested in using it but potentially in modifying it and contributing code back. Therefore we give this an implied rating of 3.0.

https://gist.github.com/rjurney/5687232

CreateEvent - 4.0 Rating

A CreateEvent occurs when a user creates a new project. We treat this as the strongest signal of all and give it the highest implied rating, 4.0.

https://gist.github.com/rjurney/5687269
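Taken together, the four rules can be sketched in a few lines of Python. This is not the code from the gists, just an illustration under stated assumptions: the flat "user" and "repo" fields are a simplification (real archive records nest this information), the implied_ratings function and the sample events are made up, and keeping the highest rating per user/project pair is my own choice for handling users who trigger several events on one project.

```python
# Implied rating per event type, as listed in this post.
EVENT_RATINGS = {
    "WatchEvent": 1.0,
    "IssuesEvent": 2.0,
    "ForkEvent": 3.0,
    "CreateEvent": 4.0,
}

def implied_ratings(events):
    """Return {(user, repo): rating}, keeping the highest rating seen per pair."""
    ratings = {}
    for event in events:
        rating = EVENT_RATINGS.get(event.get("type"))
        if rating is None:
            continue  # this event type carries no implied rating in our scheme
        key = (event["user"], event["repo"])  # assumed, simplified field names
        ratings[key] = max(rating, ratings.get(key, 0.0))
    return ratings

# Made-up sample events for illustration only.
sample_events = [
    {"type": "WatchEvent", "user": "alice", "repo": "apache/pig"},
    {"type": "ForkEvent", "user": "alice", "repo": "apache/pig"},
    {"type": "IssuesEvent", "user": "bob", "repo": "apache/pig"},
    {"type": "PushEvent", "user": "bob", "repo": "apache/pig"},
]

for (user, repo), rating in sorted(implied_ratings(sample_events).items()):
    print(user, repo, rating)
```

Here alice both watched and forked the project, so she ends up with the fork's 3.0, while bob's PushEvent is ignored and he keeps the 2.0 from his issue. Summing or averaging the ratings would be an equally reasonable alternative to taking the maximum.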

As you can see, we’ve created implied ratings ranging from 0.0 to 4.0, which will enable us to create a recommender system tomorrow!