Corporate Training

  • Agile Data Science
  • Introduction to PySpark
  • Realtime Predictive Analytics

Agile Data Science

Agile Data Science: Building Full-Stack Analytics Applications

Course Length

Three Days, 8 Hours Per Day = 24 Hours

Required Enrollment

6 students

Course Overview

This is a professional development class that teaches how to iteratively craft entire analytics web applications using Python, Flask, Spark (SQL, Streaming, MLlib), Kafka, MongoDB, Elasticsearch, Bootstrap and D3.js. This popular stack is an example of what is needed to process and refine data at scale in real-world applications of data science. During this course, students will use airline flight data to build an entire analytics web application from the ground up.

The course serves as a tutorial in which the student learns basic skills in all the categories needed to ship an entire analytics application. In each section, the student adds a layer to their application, creating the start of something they can really use in their own domain. While students will not learn the tools in detail, they will practice the kind of active learning-as-you-go that analytics requires. The goal of the course is to establish a foundation of working end-to-end code that the student can extend as they learn going forward, because working code makes learning much easier.

The organizational principle behind the course is the 'data value pyramid', pictured below. Students will climb the data-value pyramid, refining data at one step to reach the next.

[Figure: the data-value pyramid]

This class is a chance for a practicing data scientist or web developer to learn to turn their data and analyses into full-blown, actionable web applications. We will put the student in a position where she can independently improve on the foundation we have given her and go on to build great analytics applications.

Learning Promise:

The course appeals to people who would describe themselves in one of the following ways:

  • I am a programmer who wants to get a foothold in the emerging field of data science.
  • I am a practicing data scientist who wants to learn to build full-stack applications or apply agile principles to data science.
  • I am an entry-level data scientist who wants to learn how analytics applications are crafted.

Roles:

Programmers who want to learn introductory data science.

Practicing statisticians who want to learn to build entire applications.

Entry-level data scientists who want to learn how to craft full-stack applications.

Audience Level:

The difficulty of the material is intermediate. While all the content we cover is introductory, the breadth of the material we cover makes this an intermediate challenge.

Teams:

The course can be tailored to meet the needs of data science teams. Teams will learn to work together to create full-stack applications. Each student will learn their individual role, as well as how to collaborate using agile development to do data science.

Prerequisites:

Students should be fluent programmers in at least one language, preferably Python, and should have some experience with JavaScript. Some exposure to data analysis is required, but that might be limited to SQL. We can provide a pre-test which students should pass to benefit optimally from the course.

A virtual machine image containing the environment for the course will be provided for use with either Vagrant or Amazon EC2. Data for the course will be downloadable. Students wishing to install the tools on their own computers can refer to Appendix A of Agile Data Science 2.0 (O'Reilly, 2017), which contains detailed installation instructions. We won't cover a custom install in the course directly.

Program:

Students will learn the theory and practice of employing agile development principles to build entire analytics applications. Students will gain real-world experience building all aspects of a real analytics application.

The course begins with a lecture on theory, followed by a lecture with examples illustrating how the tools form a complete data platform. Next comes a lecture explaining the dataset used in the course. The remainder of the course consists of guided exercises.

We will iteratively extract increasing amounts of value from raw data as we refine it in stages that correspond to the levels of the data-value pyramid.

Exercises throughout the course will ensure that students learn to apply what they are learning. Students will end up with a simple web application and data processing scripts they can alter and customize to fit their own problem domain and their own dataset.

Big Ideas:

  1. Agile development applies to data science.
  2. Data mining is approachable if done in stages.
  3. Analytics applications can begin simply and grow in complexity.

Expected Outcomes

By the end of this course...

Participants will understand:

  1. How to use PySpark and web visualization to refine data
  2. How to build analytics applications
  3. How to approach predictive analytics
  4. How to deploy predictive systems in batch and realtime

Participants will be able to:

  1. Build analytics applications from the ground up
  2. Make visualizations and predictions
  3. Use Python, Spark, Spark SQL, Spark MLlib and Spark Streaming to build entire applications

Common Misunderstandings:

Students new to this material most often struggle with the following:

  1. How to get started making predictions with real world data.
  2. How to get experience munging data into visualizations and predictive algorithms.
  3. How to actually do data science in an agile, iterative manner.
  4. How to get started with realtime analytics.
  5. The role of batch versus realtime computing.

Learning Activities and Assessments:

Throughout the course, students will work with source code, provided on a virtual machine/Amazon EC2 image and on GitHub, which they can run, tweak and modify to learn. In addition, some portions of the course will include Jupyter or Zeppelin notebooks. Each section includes an assessment in which the student takes what they've learned and extends an example to do something new.

Introduction to PySpark

Course Length

Three Hours

Required Enrollment

3 students

Course Overview

This is a three-hour professional development class that teaches introductory PySpark. PySpark is the Python interface to Apache Spark, the leading general-purpose distributed data processing platform. With PySpark you get the power of Python together with the power of Spark, Spark SQL and Spark MLlib. The course gets you past installation and setup quickly by providing a virtual machine with the environment already set up, so you become productive processing data with PySpark in local mode right away. No Spark cluster needed!

A short theory section introduces the concepts behind Spark. Environment setup is easy with a virtual machine running locally in Vagrant/VirtualBox or on Amazon EC2. The basic PySpark section introduces both the RDD and DataFrame APIs; we'll compute the same metric twice, once using dataflow programming with RDDs and once using Spark SQL with DataFrames. Next, we flex our Spark muscles with exploratory data analysis on airline flight delay data. We'll discover how often flights are late, and why.
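As a rough illustration of computing one metric in two styles, the sketch below uses only the Python standard library as a stand-in for Spark: a map/reduce pipeline plays the role of RDD-style dataflow programming, and an in-memory SQLite table plays the role of Spark SQL. It is not PySpark code. The sample rows, the column names and the rule that "late" means an arrival delay above zero are all assumptions made for illustration, not course material.

```python
import sqlite3
from functools import reduce

# Hypothetical sample: (carrier, arrival delay in minutes); not real course data.
FLIGHTS = [("AA", 12.0), ("AA", -5.0), ("DL", 0.0), ("DL", 45.0), ("WN", -2.0)]


def late_fraction_dataflow(rows):
    """Dataflow style: map each row to (is_late, 1), then reduce by summing."""
    pairs = map(lambda row: (1 if row[1] > 0 else 0, 1), rows)
    late, total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), pairs, (0, 0))
    return late / total


def late_fraction_sql(rows):
    """Declarative style: the same metric expressed as a single SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE flights (carrier TEXT, arr_delay REAL)")
    conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)
    (fraction,) = conn.execute(
        "SELECT AVG(CASE WHEN arr_delay > 0 THEN 1.0 ELSE 0.0 END) FROM flights"
    ).fetchone()
    conn.close()
    return fraction
```

Both functions should agree on the same input, which is the point of the exercise: the dataflow version spells out each transformation, while the SQL version states only the result wanted.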

The final section of the course introduces machine learning and predictive analytics with PySpark by creating a classifier model that predicts which of four categories a flight's delay falls into: Very Early, Early, Late and Very Late. We'll show how to create an experimental setup and measure the performance of our model.
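To make the four categories concrete, here is a minimal plain-Python sketch of bucketing a delay in minutes into a label, plus a simple accuracy measure for comparing predicted labels with actual ones. The boundary values in `SPLITS` are assumptions chosen for illustration; the course may draw the lines elsewhere, and the names `delay_category` and `accuracy` are hypothetical.

```python
from bisect import bisect_right

# Assumed bucket boundaries in minutes; the course may use different splits.
SPLITS = [-15.0, 0.0, 30.0]
LABELS = ["Very Early", "Early", "Late", "Very Late"]


def delay_category(arr_delay_minutes):
    """Return the label of the bucket containing the delay.

    Each bucket includes its lower boundary, so a delay of exactly
    0.0 minutes falls into "Late" under these assumed splits.
    """
    return LABELS[bisect_right(SPLITS, arr_delay_minutes)]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)
```

For example, `delay_category(-20.0)` falls in the first bucket and `delay_category(45.0)` in the last. Accuracy is only one possible performance measure; with unevenly sized categories, per-category measures may be more informative.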

[Figure: the PySpark console]

Realtime Predictive Analytics

Course Length

Six Hours

Required Enrollment

3 students

The course covers the construction of an entire predictive analytics web application using Python/Flask/jQuery, Kafka, PySpark, Spark MLlib and Spark Streaming. The material is similar to Chapters 7 and 8 of the book Agile Data Science 2.0 (O'Reilly, 2017).

First we go over the architecture used. After project setup using a local virtual machine or Amazon EC2, we use PySpark in batch mode to train a classifier model that predicts flight delays in terms of four categories. Next we build a web application front end to our predictive system, which submits prediction requests to a Kafka queue. We then use Spark MLlib with Spark Streaming to respond to those prediction requests. Finally, we demonstrate and analyze the entire system together.
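The request flow described above can be sketched with the Python standard library alone. In this sketch an in-process `queue.Queue` stands in for the Kafka queue, a plain dictionary stands in for wherever the front end reads results, and `toy_model` is a made-up rule standing in for the trained classifier. None of these names come from the course; they are hypothetical and exist only to show the shape of the architecture.

```python
import json
import queue

# Stand-in for a Kafka topic; a real deployment would use a Kafka client.
prediction_requests = queue.Queue()
# Stand-in for the store the web front end would poll for results.
prediction_results = {}


def submit_request(request_id, features):
    """Web-tier side: serialize a prediction request and enqueue it."""
    prediction_requests.put(json.dumps({"id": request_id, "features": features}))


def toy_model(features):
    """Placeholder for the trained classifier; this rule is made up."""
    return "Late" if features.get("dep_delay", 0.0) > 0 else "Early"


def process_batch(max_messages=100):
    """Streaming side: drain one micro-batch and store a prediction per request."""
    handled = 0
    while handled < max_messages:
        try:
            message = json.loads(prediction_requests.get_nowait())
        except queue.Empty:
            break
        prediction_results[message["id"]] = toy_model(message["features"])
        handled += 1
    return handled
```

Calling `submit_request` a few times and then `process_batch` once mimics a burst of web requests followed by one streaming micro-batch. The queue decouples the two sides, so the web tier never waits on the model directly.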

This course is also available on video.