It took a month of platform work, but I just used Amazon’s Elastic MapReduce to read data from Amazon S3 in Avro format, process it in Pig on a three-node Hadoop cluster, and store it in MongoHQ.
I did not anticipate such difficulties in doing this, but as I labored to make it work… and as friends of mine labored with similar Hadoop I/O problems, it reminded me:
Every data science team needs an embedded platform engineer who spends a good deal of her time responding to the issues data scientists and developers have. And resolving them.
