mongodb IS web scale: hadoop-mongodb


MongoDB is Web Scale. Lulz, right? Turns out Mongo is the first NoSQL database to nail painless Hadoop and Pig integration, which makes it the first ‘web scale’ database.


Install Mongo & run it:

Install Mongo’s hadoop integration:

git clone
cd mongo-hadoop
mvn install
cd examples
mvn install # Then check out pigtutorial/
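cd ..

Then run a Pig script like this one, which counts the emails sent between each pair of addresses and stores the totals in Mongo (adjust the /me/ paths to your own checkouts):
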
REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar

REGISTER /me/mongo-hadoop/mongo-2.3.jar
REGISTER /me/mongo-hadoop/core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar
REGISTER /me/mongo-hadoop/pig/target/mongo-pig-1.0-SNAPSHOT.jar

DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
sh rm -rf '/tmp/sent_counts.avro' /* Workaround for PIG-2441 */

messages = LOAD '/tmp/10000_emails.avro' USING AvroStorage();
messages = FILTER messages BY from IS NOT NULL AND to IS NOT NULL;
smaller = FOREACH messages GENERATE from, to;
pairs = FOREACH smaller GENERATE from, FLATTEN(to) AS to:chararray;
pairs = FOREACH pairs GENERATE LOWER(from) AS from, LOWER(to) AS to;

froms = GROUP pairs BY (from, to);
sent_counts = FOREACH froms GENERATE FLATTEN(group) AS (from, to), SIZE(pairs) AS total;
-- STORE sent_counts INTO '/tmp/sent_counts.avro' USING AvroStorage();
STORE sent_counts INTO 'mongodb://localhost/test.pig' USING com.mongodb.hadoop.pig.MongoStorage;

Note that Avro is optional here, but it is fun too.
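
If you don't read Pig, here is roughly what the script above computes, sketched in plain Python. This is only an illustration: the function name `sent_counts` and the input shape (a list of dicts whose "to" field is a list of address strings) are my assumptions, not part of mongo-hadoop or Pig.

```python
from collections import Counter


def sent_counts(messages):
    """Count emails per lowercased (from, to) pair, mirroring the Pig script.

    Assumes each message is a dict with a "from" string and a "to" list of
    strings (a hypothetical stand-in for the Avro records).
    """
    counts = Counter()
    for msg in messages:
        sender, recipients = msg.get("from"), msg.get("to")
        if sender is None or recipients is None:
            continue  # FILTER messages BY from IS NOT NULL AND to IS NOT NULL
        for recipient in recipients:  # FLATTEN(to): one row per recipient
            counts[(sender.lower(), recipient.lower())] += 1  # LOWER + GROUP + SIZE
    # Same shape as the documents MongoStorage writes: from, to, total
    return [{"from": f, "to": t, "total": n} for (f, t), n in counts.items()]


if __name__ == "__main__":
    sample = [
        {"from": "A@example.com", "to": ["b@example.com", "c@example.com"]},
        {"from": "a@example.com", "to": ["B@example.com"]},
    ]
    for row in sent_counts(sample):
        print(row)
```

Each dict it returns has the same three fields as the documents that end up in the Mongo collection.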

Check out your data:

bash$ mongo test

> show collections

> db.pig.find()

{ "_id" : ObjectId("4ef2dc29f37d4e414133e522"), "from" : "", "to" : "", "total" : NumberLong(1) }
{ "_id" : ObjectId("4ef2dc29f37d4e414233e522"), "from" : "", "to" : "", "total" : NumberLong(1) }
{ "_id" : ObjectId("4ef2dc29f37d4e414333e522"), "from" : "", "to" : "", "total" : NumberLong(1) }
{ "_id" : ObjectId("4ef2dc29f37d4e414433e522"), "from" : "", "to" : "", "total" : NumberLong(2) }
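
Once the counts are in Mongo you can query them from anywhere. A minimal sketch in Python, assuming the third-party pymongo driver is installed and that the data landed in the `pig` collection of the `test` database as in the STORE line above; `top_pairs` and `open_collection` are hypothetical helper names of mine.

```python
def top_pairs(collection, limit=10):
    """Return the highest-volume (from, to, total) tuples, busiest first."""
    cursor = collection.find({}, {"_id": 0}).sort("total", -1).limit(limit)
    return [(doc["from"], doc["to"], doc["total"]) for doc in cursor]


def open_collection(uri="mongodb://localhost", db="test", coll="pig"):
    """Connect to Mongo and return the collection written by the Pig script.

    Assumes a mongod listening on localhost and the pymongo package.
    """
    from pymongo import MongoClient  # imported lazily: third-party dependency

    return MongoClient(uri)[db][coll]
```

With a local mongod running, `top_pairs(open_collection(), limit=5)` should list the five busiest sender/recipient pairs.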

MongoDB is web scale. Who would have thought? Usability matters. Getting 90% of the way there is not enough. ;)