ON ERROR: Error Processing in Large Data


“How do you QA Big Data Pipelines?”

The harsh reality is that, to some extent, you can’t: many errors are emergent, appearing only as data volumes increase. In a large enough collection of records, some are bound to be corrupt. For what it’s worth, this is what I have on the topic:

I started a discussion about Big Data ETL on Quora, and there were some great answers: http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss

In general, I would think that you only care about record loss during ETL if it meaningfully impacts the answer to some question asked of the transformed data.

Pig could pioneer a sane approach via the ON ERROR proposal, which would let you split errant records off into their own relation and handle them as needed with a custom user-defined function (UDF). This would enable you to alter your pipelines to accommodate the errant records, or would at least let you understand what kinds of records are going missing. In addition, PIG-2620 would allow you to set thresholds for errors: up to some percentage and/or some raw number of records lost.

https://issues.apache.org/jira/browse/PIG-2620
http://wiki.apache.org/pig/PigErrorHandlingInScripts
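Until something like ON ERROR lands, you can approximate the pattern by hand: route bad records into their own relation with SPLIT and a validation UDF, keep them for inspection, and emit counts so a driver script can enforce a threshold. A minimal Pig Latin sketch, where the input path, schema, jar, and IsValidEmail FilterFunc are all hypothetical placeholders:

    -- Hypothetical validation UDF; any FilterFunc returning a boolean works here.
    REGISTER 'validators.jar';
    DEFINE is_valid udfs.IsValidEmail();

    raw = LOAD 'emails.tsv' AS (message_id:chararray, sender:chararray, body:chararray);

    -- Route records into separate relations instead of silently dropping them.
    SPLIT raw INTO good IF is_valid(sender), bad IF NOT is_valid(sender);

    -- Keep the errant records around so you can see what is going missing.
    STORE bad INTO 'errant_records';

    -- Emit counts so a driver can enforce a threshold: fail the job if errant
    -- records exceed some percentage or some absolute number.
    good_count = FOREACH (GROUP good ALL) GENERATE COUNT(good) AS n;
    bad_count  = FOREACH (GROUP bad ALL) GENERATE COUNT(bad) AS n;
    STORE good_count INTO 'counts/good';
    STORE bad_count  INTO 'counts/bad';

    -- Downstream processing continues on the clean relation.
    STORE good INTO 'clean_emails';

The difference with ON ERROR is that the split would happen wherever a load or a UDF actually throws, rather than only where you remembered to validate, but the end state is the same: errant records land somewhere you can look at them instead of disappearing.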