Downloading and Processing the GitHub Data


The GitHub Archive is a rich dataset, available to all via githubarchive.org. In this post we will download the archive and split it into relations.

Instructions for downloading it are available on that site, but they don’t work well with the version of Bash included with some versions of Mac OS X. To download the GitHub data, I created a Ruby script called get_all_data.rb that does the job:

# Zero-pad single-digit months and days: 1 becomes "01"
def prepend(number)
  return number <= 9 ? ("0" + number.to_s) : number.to_s
end

# Fetch every hourly archive for 2011 through 2013 into data/
for year in ['11', '12', '13'] do
  for month in (1..12) do
    month = prepend(month)
    for day in (1..31) do
      day = prepend(day)
      # If the final hour of the day is already present, assume the whole day was fetched
      unless File.exist?("data/20#{year}-#{month}-#{day}-23.json.gz")
        # The shell expands {0..23} into one URL per hour of the day
        system "wget -P data/ http://data.githubarchive.org/20#{year}-#{month}-#{day}-{0..23}.json.gz"
      else
        puts "Skipped file...\n\n\n"
      end
    end
  end
end

That script will produce 8,760 gzipped files per year: one for each of the 24 hours of each of 365 days. In my own experiments I work with the data from 2012 to the present day, which comes to about 90GB of uncompressed JSON. Unzipped, the files contain JSON objects describing 18 event types, distinguished by a common field, ‘type.’ This field can be used to split the data into 18 relations of like-formatted records. Splitting records this way is a common step when processing this data, so it is best done once in a pre-processing pass for efficiency’s sake. We’ll use Apache Pig to split our data into relations.
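Before reaching for Pig, a quick way to see the ‘type’ field in action is to tally it across a sample of the downloaded hours. This is only a sketch: the data/ path matches the download script above, the 24-file sample size is arbitrary, and it uses the same zlib and yajl gems as the reformatting script later in this post.

require 'rubygems'
require 'zlib'
require 'yajl'

# Tally the 'type' field across a small sample of the hourly archives
counts = Hash.new(0)

Dir.glob('data/*.json.gz').first(24).each do |f|
  begin
    js = Zlib::GzipReader.new(open(f)).read
    Yajl::Parser.parse(js) { |event| counts[event['type']] += 1 }
  rescue
    # ignore truncated or partially downloaded files
  end
end

counts.sort_by { |_, n| -n }.each { |type, n| puts "#{type}\t#{n}" }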

From the Apache Pig site: “Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.”

Pig lets us define dataflows to query any data, large or small. We can load JSON records into Pig maps using the elephant-bird project’s JsonLoader. To get Pig to parse these records, though, we need a newline after each one instead of a JavaScript array of objects. To achieve this, I created a simple script called newline_format.rb, based on the githubarchive example:

require 'rubygems'
require 'zlib'
require 'yajl'

# Re-emit every event as one JSON object per line on stdout
Dir.glob('data/*.json.gz').each do |f|
  begin
    gz = open(f)
    js = Zlib::GzipReader.new(gz).read

    # Yajl calls the block once per JSON object it finds in the stream
    Yajl::Parser.parse(js) do |event|
      puts Yajl::Encoder.encode(event)
    end
  rescue
    # skip corrupt or truncated archives
  ensure
    gz.close if gz
  end
end
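The script prints the reformatted events to standard output, so one way to run it, using the output path the Pig script below expects, is:

ruby newline_format.rb > /tmp/newline.json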
This can be run locally in hours, or as a Hadoop streaming job in minutes, and produces one large newline-delimited JSON file. Next, we split that file using split_events.pig:
github_events = load '/tmp/newline.json' using com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];

SPLIT github_events INTO CommitCommentEvent IF $0#'type' == 'CommitCommentEvent',
                         CreateEvent IF $0#'type'        == 'CreateEvent',
                         DeleteEvent IF $0#'type'        == 'DeleteEvent',
                         DownloadEvent IF $0#'type'      == 'DownloadEvent',
                         FollowEvent IF $0#'type'        == 'FollowEvent',
                         ForkEvent IF $0#'type'          == 'ForkEvent',
                         ForkApplyEvent IF $0#'type'     == 'ForkApplyEvent',
                         GistEvent IF $0#'type'          == 'GistEvent',
                         GollumEvent IF $0#'type'        == 'GollumEvent',
                         IssueCommentEvent IF $0#'type'  == 'IssueCommentEvent',
                         IssuesEvent IF $0#'type'        == 'IssuesEvent',
                         MemberEvent IF $0#'type'        == 'MemberEvent',
                         PublicEvent IF $0#'type'        == 'PublicEvent',
                         PullRequestEvent IF $0#'type'   == 'PullRequestEvent',
                         PullRequestReviewCommentEvent IF $0#'type' == 'PullRequestReviewCommentEvent',
                         PushEvent IF $0#'type'          == 'PushEvent',
                         TeamAddEvent IF $0#'type'       == 'TeamAddEvent',
                         WatchEvent IF $0#'type'         == 'WatchEvent';
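One detail worth noting: SPLIT only defines the 18 relations. To actually write them out, split_events.pig also needs a STORE statement per relation. A sketch of what those look like, with an illustrative output directory and Pig’s default PigStorage serialization:

STORE PushEvent INTO '/tmp/github_split/PushEvent';
STORE IssuesEvent INTO '/tmp/github_split/IssuesEvent';
-- ...and so on for the remaining sixteen relations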

This data is now split into 18 different files, ranging from megabytes to gigabytes, which can be joined together as needed. At this point the data is still within the capabilities of Pig’s local mode (pig -l /tmp -x local), so you can begin experimenting on your own machine. In my own case, the data quickly ballooned to nearly 1TB, which meant uploading the split data to S3 and continuing my analysis with Hadoop via Elastic MapReduce.
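For reference, a sketch of that upload step using s3cmd; the bucket name and local path are placeholders, and s3cmd must already be configured with your AWS credentials:

s3cmd sync /tmp/github_split/ s3://your-bucket/github_split/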

In the next post, we’ll look at what these 18 different event types have to offer.