Today I experimented with unsupervised learning by clustering the
Yelp Academic Dataset. I chose location-based clustering so that I could check my results against a map.
I started by pre-processing the data in Pig into business_id/longitude/latitude records:
REGISTER /me/Software/elephant-bird/pig/target/elephant-bird-pig-3.0.6-SNAPSHOT.jar
REGISTER /me/Software/pig/build/ivy/lib/Pig/json-simple-1.1.jar
SET elephantbird.jsonloader.nestedLoad 'true'
Register 'udfs.py' using jython as udfs;
SET default_parallel 10
rmf yelp_phoenix_academic_dataset/locations.tsv
businesses = LOAD 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json' using com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
/* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar, business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback Rd
Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty & Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business, city=Phoenix} */
locations = FOREACH businesses GENERATE $0#'business_id' AS business_id:chararray,
                                        $0#'longitude' AS longitude:double,
                                        $0#'latitude' AS latitude:double;
STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
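Pig writes the STORE output as a directory of part files rather than a single file, and the number of part files depends on the job. Here is a minimal Python sketch for reading all of them back, assuming the tab-separated column order the script generates (business_id, longitude, latitude). The function name `load_locations` is hypothetical, and skipping rows with a missing coordinate is my assumption rather than something the dataset is known to need:

```python
import glob
import os


def load_locations(output_dir):
    """Read all Pig part files in output_dir into (business_id, lon, lat) tuples."""
    records = []
    # Pig names its output files part-m-NNNNN (map-only) or part-r-NNNNN (with a reduce).
    for path in sorted(glob.glob(os.path.join(output_dir, 'part-*'))):
        with open(path) as f:
            for line in f:
                fields = line.rstrip('\n').split('\t')
                # Skip rows that are malformed or have a missing coordinate.
                if len(fields) != 3 or not fields[1] or not fields[2]:
                    continue
                business_id, longitude, latitude = fields
                records.append((business_id, float(longitude), float(latitude)))
    return records
```

Calling `load_locations('yelp_phoenix_academic_dataset/locations.tsv')` would then return one tuple per business, regardless of how many part files the job produced.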
I began with K-means, which gave me this unsatisfactory result:

Then I tried DBSCAN:
# From example at http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import pylab as pl
import numpy as np
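# The Pig script stored the columns as business_id, longitude, latitude.
X = []
with open('yelp_phoenix_academic_dataset/locations.tsv/part-m-00000') as f:
    for line in f:
        business_id, longitude, latitude = line.rstrip().split('\t')
        # x = longitude, y = latitude, so the plot is oriented like a map.
        X.append([float(longitude), float(latitude)])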
# Standardize both coordinates so that eps is measured in standard deviations, not degrees.
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples = db.core_sample_indices_
labels = db.labels_
# Label -1 marks noise, so it does not count as a cluster.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
unique_labels = set(labels)
colors = pl.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_members = [index[0] for index in np.argwhere(labels == k)]
    for index in class_members:
        x = X[index]
        # Draw core samples larger than border and noise points.
        if index in core_samples and k != -1:
            markersize = 14
        else:
            markersize = 6
        pl.plot(x[0], x[1], 'o', markerfacecolor=col,
                markeredgecolor='k', markersize=markersize)
# Display the figure when running outside an interactive session.
pl.show()
Which results in this:

The results look pretty good: there is a large contiguous ‘downtown’ cluster, and you can make out highways, townships, etc. Compared against a map of Phoenix, Arizona, the shapes look right, and when the plot is overlaid onto the map, the surrounding towns show up as separate clusters!
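To put numbers behind a visual check like this, it can help to count how many points fall in each cluster and how many were left as noise. Here is a minimal sketch, assuming `labels` is a sequence of integer cluster labels such as DBSCAN's `labels_`, where -1 marks noise. The helper name `summarize_clusters` is hypothetical, and the sample labels are made up for illustration:

```python
from collections import Counter


def summarize_clusters(labels):
    """Return (number of clusters, noise count, cluster sizes largest first)."""
    counts = Counter(int(label) for label in labels)
    # DBSCAN labels noise points -1; they do not belong to any cluster.
    noise = counts.pop(-1, 0)
    sizes = counts.most_common()
    return len(sizes), noise, sizes


# Toy labels, not real output from the Yelp data.
n_clusters, n_noise, sizes = summarize_clusters([0, 0, 0, 1, 1, -1])
print(n_clusters, n_noise, sizes)  # prints: 2 1 [(0, 3), (1, 2)]
```

Applied to the DBSCAN run, `summarize_clusters(db.labels_)` would list the largest cluster first, which presumably corresponds to the ‘downtown’ blob in the plot.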