Data Syndrome

I am vectorizing some features in sklearn, and I have run into a problem. DictVectorizer works well if your data can be encoded into one dict per item, with one value per key. But what if an item can have two values for the same column? For instance, DictVectorizer works fine on an item like this one:

{'a': 'b', 'b': 'c'}

But what about something like this, with more than one value per column?

{'a': ['b', 'c'], 'b': 'd'}

The strategy of one-hot encoding still applies: you simply want two columns for a, namely a=b and a=c. As far as I can tell, no such vectorizer exists! What is one supposed to do in this situation? Do I need to create my own MultiDictVectorizer?
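To make the idea concrete, here is a minimal sketch of the expansion step in plain Python. The helper name expand_multi_valued is hypothetical (it is not part of sklearn), and the "key=value" naming simply mirrors the a=b and a=c columns described above. It is one possible workaround under those assumptions, not the final design of any MultiDictVectorizer.

```python
def expand_multi_valued(item):
    """Flatten one item into a dict of one-hot style features.

    Hypothetical helper, not part of sklearn. A string value becomes a
    single "key=value" feature; a list, tuple, or set becomes one
    "key=value" feature per element. Any other value passes through as-is.
    """
    features = {}
    for key, value in item.items():
        if isinstance(value, str):
            features[f"{key}={value}"] = 1
        elif isinstance(value, (list, tuple, set)):
            for element in value:
                features[f"{key}={element}"] = 1
        else:
            # Numeric (or other) values are left untouched.
            features[key] = value
    return features


items = [
    {'a': 'b', 'b': 'c'},
    {'a': ['b', 'c'], 'b': 'd'},
]
expanded = [expand_multi_valued(item) for item in items]
print(expanded)
# [{'a=b': 1, 'b=c': 1}, {'a=b': 1, 'a=c': 1, 'b=d': 1}]
```

Because every value in the expanded dicts is numeric, they should be usable as ordinary input to DictVectorizer, which would then see a=b, a=c, b=c and b=d as four separate columns.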

I just posted this to StackOverflow.

To answer my own question, I am adding support for lists to DictVectorizer, or creating a new class MultiDictVectorizer that does so. From StackOverflow:

DictVectorizer can’t handle multiple values per key, so I am adding this ability to it. If the pull request is accepted, this will become part of sklearn. If not, I will subclass DictVectorizer as MultiDictVectorizer and release a package for that class.

Pull request at GitHub
Issue in sklearn GitHub project