Cosine Similarity between GitHub Projects


In our last four posts, we covered Downloading and Processing the GitHub Archive, GitHub’s 18 event types, creating an implied rating system, and calculating Pearson’s correlation between projects.

To create an item-based nearest-neighbor recommender, we need a measure of similarity between projects, and cosine similarity is a better fit for that job than the Pearson’s correlation from the last post. Below, we implement that measure between GitHub projects using Pig and Python.
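
As a quick reminder of what the measure computes, here is a minimal Python sketch of cosine similarity between two projects. It assumes each project is represented as a dict mapping a user to that user's implied rating; the function name, the dict layout, and the sample ratings are our own illustration, not the code used in the rest of this post.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two sparse rating vectors.

    Each argument is assumed to be a dict of user -> implied rating.
    A user missing from a dict counts as a rating of zero.
    """
    # Only users who rated both projects contribute to the dot product.
    shared_users = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in shared_users)

    # The norms use every rating for the project, shared or not.
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))

    # A project with no ratings has no direction; treat it as dissimilar.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical implied ratings, invented for illustration only.
project_x = {"alice": 1.0, "bob": 1.0}
project_y = {"alice": 1.0}

score = cosine_similarity(project_x, project_y)
```

The score is 1.0 when two projects are rated in the same proportions by the same users and 0.0 when no user has rated both. Higher means more similar, so it is a similarity rather than a distance.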

Once we have a similarity score between projects, we can immediately display the most similar projects: