Three Days, 8 Hours Per Day = 24 Hours
This is a professional development class that teaches how to iteratively craft entire analytics web applications using Python, Flask, Spark (SQL, Streaming, MLlib), Kafka, MongoDB, Elasticsearch, Bootstrap and D3.js. This popular stack exemplifies what is needed to process and refine data at scale in real-world data science applications. During this course, students will use airline flight data to produce an entire analytic web application from the ground up.
The course will serve as a tutorial in which the student learns basic skills in all the categories needed to ship an entire analytics application. In each section, the student will add a layer to their application, creating the start of something they can really use in their own domain. While students will not learn every tool in detail, they will establish a working foundation and practice the kind of active learn-as-you-go that analytics requires. The goal of the course is to leave the student with working end-to-end code that they can extend as they learn going forward; working end-to-end code makes learning much easier.
The organizational principle behind the course is the 'data-value pyramid', pictured below. Students will climb the data-value pyramid, refining data at one step to reach the next.
This class will be a chance for a practicing data scientist or web developer to learn to turn their data and analyses into full-blown actionable web applications. We will put the student in a position where she can independently improve on the foundation we have given her and go on to build great analytics applications.
The course will appeal to any of the following:
Programmers who want to learn introductory data science.
Practicing statisticians who want to learn to build entire applications.
Entry-level data scientists who want to learn how to craft full-stack applications.
The difficulty of the material is intermediate: while all the content we cover is introductory, its breadth makes this an intermediate challenge.
The course can be tailored to meet the needs of data science teams. Teams will learn to work together to create full-stack applications. Each student will learn their individual role, as well as how to collaborate using agile development to do data science.
A virtual machine image for use with either Vagrant or Amazon EC2 will be provided, containing the complete environment for the course. Data for the course will be downloadable. Students wishing to install the tools on their own computers can refer to Appendix A of Agile Data Science 2.0 (O'Reilly, 2017), which contains detailed installation instructions; we won't cover a custom install in the course directly.
Students will learn the theory and practice of employing agile development principles to build entire analytics applications, gaining hands-on experience building all aspects of a real analytics application.
A lecture on theory will begin the course, followed by a lecture and examples illustrating how the tools form a complete data platform. Next comes a lecture explaining the dataset used in the course. After this, the remainder of the course consists of guided exercises.
We will iteratively extract increasing amounts of value from raw data as we refine it in stages that correspond to the levels of the data-value pyramid.
Exercises throughout the course will ensure that students learn to apply what they are learning. Students will end up with a simple web application and data processing scripts they can alter and customize to fit their own problem domain and their own dataset.
Throughout the course, students will work with source code on a virtual machine/Amazon EC2 image and in GitHub, which they can run, tweak and modify as they learn. In addition, some portions of the course will include Jupyter or Zeppelin notebooks. There will be an assessment in each section, in which the student takes what they've learned and extends an example to do something new.
This is a three-hour professional development class that teaches introductory PySpark. PySpark is the Python interface to Apache Spark, the leading general-purpose distributed data processing platform. With PySpark you get the power of Python along with the power of Spark, Spark SQL and Spark MLlib. This course breezes you through installation and setup by using a virtual machine with the environment already set up, and gets you rapidly productive processing data with PySpark in local mode. No Spark cluster needed!
A short theory section introduces the concepts behind Spark. Environment setup is easy with a virtual machine running locally in Vagrant/Virtualbox or on Amazon EC2. Basic PySpark introduces both the RDD and DataFrame APIs and we'll compute the same metric using dataflow programming with RDDs and Spark SQL using DataFrames. Next, we flex our Spark muscles with exploratory data analysis on airline flight delay data. We'll discover how often flights are late, and why.
The final section of the course introduces machine learning and predictive analytics with PySpark by creating a classifier model to predict flight delays in one of four categories: Very Early, Early, Late and Very Late. We'll show how to create an experimental setup and determine the performance of our model.
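Before a classifier can be trained, the continuous arrival delay must be bucketed into those four labels. A minimal sketch of that labeling step, in plain Python; the minute cutoffs here are assumptions for illustration, not the course's exact values:

```python
def delay_bucket(arr_delay_minutes: float) -> str:
    """Map a continuous arrival delay (in minutes) to one of the four
    class labels used by the course's classifier. Cutoffs are illustrative."""
    if arr_delay_minutes <= -15:
        return "Very Early"
    elif arr_delay_minutes <= 0:
        return "Early"
    elif arr_delay_minutes <= 30:
        return "Late"
    else:
        return "Very Late"

# Label a few example delays.
print([delay_bucket(d) for d in (-20, -5, 10, 45)])
# → ['Very Early', 'Early', 'Late', 'Very Late']
```

In the course this kind of labeling would be applied to the training data before fitting a Spark MLlib classifier, turning a regression-shaped problem into a four-class classification one.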
The course covers the construction of an entire predictive analytics web application using Python/Flask/JQuery, Kafka, PySpark, Spark MLlib and Spark Streaming. This is similar to chapters 7 and 8 in the book Agile Data Science 2.0 (O'Reilly, 2017).
First, we go over the architecture used. After project setup using a local virtual machine or Amazon EC2, we use PySpark in batch mode to train a classifier model that predicts flight delays in terms of four categories. Next, we build a web application front end for our predictive system, which submits prediction requests to a Kafka queue. We then use Spark MLlib with Spark Streaming to respond to those prediction requests. Finally, we demonstrate and analyze the entire system together.
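The request/response shape of this architecture can be sketched broker-free, with an in-memory queue standing in for the Kafka topic. Everything here is illustrative: the message fields, function names, and stand-in "model" are assumptions, not the course's code (the course uses a real Kafka broker, a Flask front end, and a Spark Streaming consumer).

```python
import json
import queue

# In-memory stand-in for the Kafka prediction-request topic.
prediction_requests = queue.Queue()

def submit_prediction_request(origin: str, dest: str, dep_delay: float) -> None:
    """Front end: the Flask endpoint would serialize a request onto the queue."""
    msg = json.dumps({"origin": origin, "dest": dest, "dep_delay": dep_delay})
    prediction_requests.put(msg)

def serve_one(model) -> str:
    """Back end: the streaming job would consume a request and score it."""
    request = json.loads(prediction_requests.get())
    return model(request)

# A trivial stand-in "model": late if the flight departed late.
toy_model = lambda req: "Late" if req["dep_delay"] > 0 else "Early"

submit_prediction_request("JFK", "LAX", 12.0)
label = serve_one(toy_model)
print(label)  # → Late
```

The point of the queue in the middle is decoupling: the web front end stays responsive while the streaming back end scores requests at its own pace, which is exactly the role Kafka plays between Flask and Spark Streaming in the course.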
This course is also available on video.