Data Syndrome

Realtime Predictive Analytics

With Kafka, PySpark, Spark MLlib and Spark Streaming

This two-hour course covers the construction of an entire predictive analytics web application. We use Kafka, PySpark, Spark MLlib and Spark Streaming on the back end and complete the predictive system with a Python/Flask/jQuery front end. The course builds on chapters 7 and 8 of the book Agile Data Science 2.0 (O'Reilly, 2017).

After a brief overview in Lesson Zero, Lesson One goes over the concepts behind the architecture we'll be building with. In Lesson Two, Project Setup, we quickly set up our "big data" environment using a local Vagrant/VirtualBox virtual machine or a boot script for Amazon EC2. In Lesson Three, we use PySpark and Spark MLlib in batch mode on flight delay records for the year 2015 to train a classifier model that predicts flight delays in four categories: Very Early, Early, Late, Very Late. In Lesson Four, we build a web application front end to our predictive system, which submits prediction requests to a Kafka queue and polls the server, awaiting a prediction. In Lesson Five, we use Spark MLlib with Spark Streaming to respond to those prediction requests by making predictions and storing them in MongoDB, where the web application can access them. In Lesson Six, we demonstrate and analyze the entire system together. It is extremely fun to watch data flow through the system on multiple consoles.
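To make the request/poll loop in Lessons Four and Five concrete, here is a minimal sketch in plain Python. It is not the course's code: an in-memory queue stands in for the Kafka queue, a dict stands in for MongoDB, and a trivial rule stands in for the trained classifier. All function names, the `dep_delay` field, and the cut points used to assign the four categories are hypothetical and chosen only for illustration.

```python
import queue
import uuid

# Hypothetical stand-ins: a Queue plays the Kafka queue and a dict plays
# the MongoDB collection. The real system uses the actual services.
request_topic = queue.Queue()
prediction_store = {}

# The four categories named in the course description. The cut points
# (minutes of delay) are assumptions for illustration only.
LABELS = ["Very Early", "Early", "Late", "Very Late"]
ASSUMED_SPLITS = [-15.0, 0.0, 30.0]


def bucketize(delay_minutes):
    """Map a delay in minutes to one of the four categories."""
    for label, upper in zip(LABELS, ASSUMED_SPLITS):
        if delay_minutes < upper:
            return label
    return LABELS[-1]


def fake_model(features):
    """Placeholder for the trained classifier: buckets the departure delay."""
    return bucketize(features["dep_delay"])


def submit_request(features):
    """Web application side: enqueue a request and return its id for polling."""
    request_id = str(uuid.uuid4())
    request_topic.put({"id": request_id, "features": features})
    return request_id


def process_pending():
    """Streaming side: drain queued requests, predict, store results by id."""
    processed = 0
    while not request_topic.empty():
        request = request_topic.get()
        prediction_store[request["id"]] = fake_model(request["features"])
        processed += 1
    return processed


def poll(request_id):
    """Web application side: return the prediction, or None if not ready."""
    return prediction_store.get(request_id)
```

In this sketch, `poll` returns `None` until `process_pending` has run, which mirrors the web application waiting on the streaming job. The request id is what ties the asynchronous request to its stored prediction.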

Once you've completed this course, you will have a base of operations for your own predictive analytics applications. You'll have working code and a working system that you can alter and extend to build your own applications. In this way, the course will empower you by making you a full-stack app dev in just a few hours!

Code for the video and screenshots of the application are freely available at http://github.com/rjurney/Agile_Data_Code_2. Look in the ch08 directory.

Lesson 0: Introduction
Lesson 1: Architecture
Lesson 2: Project Setup
Lesson 3: Training a Model
Lesson 4: Building a Web Application
Lesson 5: Realtime Prediction with Spark Streaming
Lesson 6: Closing the Loop