Introduction
In our last three posts, we covered Downloading and Processing the Github Archive, Github’s 18 event types and creating an implied rating system.
In this post, we measure the distance between Github projects (items) with the Pearson product-moment correlation coefficient, computed using Pig and Jython.
Loading the Data
We addressed loading the data in our last post. Briefly, this code loads select Github Archive events in the form of ratings:
Calculating Pearson’s
Once the ratings are loaded, we make the links bidirectional, which increases their density and lets links between projects flow both ways. Next, we filter out the most heavily cross-linked projects: they are not relevant, and leaving them in keeps the next step from ever finishing. Then we emit all co-ratings per project: each time a user has rated two projects, we emit the project pair along with both ratings. This is the meat of our recommendation data. Finally, we compute Pearson's correlation between all pairs of projects with co-ratings.
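To make the first three steps concrete, here is a minimal sketch in plain Python. It is not the Pig script from this series. The function name `emit_co_ratings`, the `max_raters` cutoff, and the choice to measure "cross-linked" by the number of users who rated a project are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations

def emit_co_ratings(ratings, max_raters=1000):
    """Group co-ratings by ordered project pair.

    ratings: iterable of (user, project, rating) tuples.
    max_raters: hypothetical cutoff; projects rated by more users are dropped.
    """
    by_user = defaultdict(dict)   # user -> {project: rating}
    raters = defaultdict(set)     # project -> users who rated it
    for user, project, rating in ratings:
        by_user[user][project] = float(rating)
        raters[project].add(user)

    # Assumption: "most cross-linked" means "rated by the most users".
    too_popular = {p for p, users in raters.items() if len(users) > max_raters}

    pairs = defaultdict(list)     # (project_a, project_b) -> [(rating_a, rating_b), ...]
    for rated in by_user.values():
        kept = [p for p in rated if p not in too_popular]
        # permutations yields both (a, b) and (b, a), so links flow both ways.
        for a, b in permutations(kept, 2):
            pairs[(a, b)].append((rated[a], rated[b]))
    return dict(pairs)

# Made-up ratings, for illustration only.
ratings = [
    ("alice", "apache/pig", 3.0),
    ("alice", "apache/hive", 2.0),
    ("bob", "apache/pig", 1.0),
    ("bob", "apache/hive", 1.0),
    ("bob", "rails/rails", 4.0),
]
pairs = emit_co_ratings(ratings)
```

Each user contributes one entry per ordered pair of projects they rated, so the work per user grows roughly with the square of the number of projects they rated. In this sketch, dropping the most popular projects shrinks those per-user lists, which is one plausible reason the filter makes the pairing step tractable.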
Pearson's correlation coefficient tells us how similar different projects are, based on their ratings. This data can be used to drive recommendations, which we'll look at tomorrow. For now, a sample of the Pearson's scores looks like this:
apache/pig apache/avro 0.6914285714285715
apache/pig apache/bval 1.0
apache/pig apache/gora 1.0
apache/pig apache/hive 0.6410256410256406
apache/pig apache/isis 1.0
apache/pig apache/jena 1.0
apache/pig apache/lucy 1.0
apache/pig apache/mina 1.0
apache/pig apache/oodt 1.0
apache/pig apache/qpid 1.0
apache/pig apache/rave 1.0
apache/pig apache/solr 1.0
apache/pig apache/tika 1.0
apache/pig apache/wink 1.0
apache/pig enyojs/enyo 1.0
apache/pig rails/rails 0.9486832980505138
apache/pig scala/scala 1.0
apache/pig zohmg/zohmg 1.0
apache/pig andrew/split 0.8485281374238569
apache/pig apache/camel 1.0
apache/pig apache/derby 1.0
apache/pig apache/flume 0.7352941176470589
apache/pig apache/hbase 0.644736842105263
apache/pig apache/httpd 0.7352941176470589
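For reference, the coefficient behind these scores can be sketched in a few lines of plain Python. This is an illustrative version, not the Jython UDF used in this series. The name `pearson` and the choice to return `None` for undefined cases are assumptions.

```python
from math import sqrt

def pearson(pairs):
    """Pearson correlation of a list of (rating_a, rating_b) co-ratings.

    Returns None when the score is undefined: fewer than two pairs,
    or one side with zero variance.
    """
    n = len(pairs)
    if n < 2:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    return cov / sqrt(var_a * var_b)

# Made-up co-ratings, for illustration only.
linear = pearson([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])  # perfectly linear
sparse = pearson([(1.0, 5.0), (4.0, 2.0)])              # only two co-ratings
```

When only two co-ratings exist and the score is defined, the coefficient is always exactly 1.0 or -1.0, because a line fits any two points. Some of the 1.0 scores in the sample above may therefore reflect sparse co-ratings rather than strong similarity.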