Data Syndrome

Today I experimented with unsupervised learning by clustering the Yelp Academic Dataset. I chose location-based clustering so that I could check the results against a map. I started by pre-processing the data in Pig into business_id/longitude/latitude records:

REGISTER /me/Software/elephant-bird/pig/target/elephant-bird-pig-3.0.6-SNAPSHOT.jar
REGISTER /me/Software/pig/build/ivy/lib/Pig/json-simple-1.1.jar
SET elephantbird.jsonloader.nestedLoad 'true'

Register 'udfs.py' using jython as udfs;

SET default_parallel 10

rmf yelp_phoenix_academic_dataset/locations.tsv

businesses = LOAD 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];

/* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar, business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback Rd
Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty & Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business, city=Phoenix} */
locations = FOREACH businesses GENERATE $0#'business_id' AS business_id:chararray, 
                                      $0#'longitude' AS longitude:double, 
                                      $0#'latitude' AS latitude:double;
                                      
STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';

I began with K-means, which gave me this unsatisfactory result:
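
The K-means code is not shown in this post. As a rough sketch of that kind of run, assuming scikit-learn's KMeans, a handful of made-up coordinates in place of the Pig output, and an arbitrary choice of k, it might look like the following. Note that K-means needs the number of clusters up front and favors compact, roughly round groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up (longitude, latitude) points standing in for locations.tsv:
# two tight groups, not real Yelp businesses.
points = np.array([
    [-112.07, 33.45], [-112.08, 33.46], [-112.06, 33.44],
    [-111.83, 33.31], [-111.84, 33.30], [-111.82, 33.32],
])

# Same scaling step as the DBSCAN script further down.
X = StandardScaler().fit_transform(points)

# n_clusters=2 is an assumption that only suits this toy data;
# on the real dataset k would have to be guessed or tuned.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
kmeans_labels = km.labels_

print(kmeans_labels)
```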

Then I tried DBSCAN:

# From example at http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import pylab as pl
import numpy as np

X = []
# The Pig script stored the columns as business_id, longitude, latitude.
with open('yelp_phoenix_academic_dataset/locations.tsv/part-m-00000') as f:
    for line in f:
        business_id, longitude, latitude = line.rstrip().split('\t')
        X.append([float(longitude), float(latitude)])

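# Standardize each column, then cluster; eps is measured in standardized units.
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples = set(db.core_sample_indices_)  # set for fast membership tests
labels = db.labels_  # -1 marks noise
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
unique_labels = set(labels)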

colors = pl.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_members = [index[0] for index in np.argwhere(labels == k)]
    for index in class_members:
        x = X[index]
        # Core samples are drawn larger than border and noise points.
        if index in core_samples and k != -1:
            markersize = 14
        else:
            markersize = 6
        pl.plot(x[0], x[1], 'o', markerfacecolor=col,
                markeredgecolor='k', markersize=markersize)

pl.show()

This is the result:

The results look pretty good: there is a large contiguous ‘downtown’ cluster, as well as highways, townships, etc. Compared against a map of Phoenix, Arizona, the shapes look right, and when the clusters are overlaid onto the map, individual towns show up as separate clusters!
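
The DBSCAN script computes n_clusters_ but never reports it. As a small sketch of how the labels could be summarized, here is the same eps and min_samples applied to made-up points (not the Yelp data); the groups, the stray points and the names n_noise and sizes are all assumptions for illustration. DBSCAN labels noise as -1, so that label is excluded when counting clusters.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up (longitude, latitude) points: two dense groups of 50 points
# plus three isolated points that should come out as noise.
group_a = np.array([-112.07, 33.45]) + 0.003 * rng.standard_normal((50, 2))
group_b = np.array([-111.83, 33.31]) + 0.003 * rng.standard_normal((50, 2))
strays = np.array([[-112.50, 33.90], [-111.40, 33.00], [-112.50, 33.00]])
points = np.vstack([group_a, group_b, strays])

X = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.3, min_samples=10).fit(X).labels_

# Count clusters without the noise label, then count noise separately.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
sizes = {int(k): int(np.sum(labels == k)) for k in set(labels) if k != -1}

print("clusters:", n_clusters)
print("noise points:", n_noise)
print("cluster sizes:", sizes)
```

On the real data, the same three numbers give a quick check on whether eps and min_samples are too strict (mostly noise) or too loose (one giant cluster) before plotting anything.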