Corporate Training

  • Practicing Agile Data Science
  • Full Stack "Big Data" App Dev
  • Realtime Predictive Analytics

Practicing Agile Data Science

Course Length

1/2 to 3 Days

Course Overview

Practicing Agile Data Science is an interactive course for data science and analytics teams that covers the methods for applying agile strategies outlined in the book Agile Data Science 2.0. It teaches agile data science team members, including product/project managers, data scientists, software engineers and designers, how to set up, structure and manage analytics projects. We'll go over the history of the waterfall method and the emergence of agile methods, cover agile software engineering, and give an introduction to Big Data and what it means beyond the hype. Finally, we'll show the differences between software engineering and data science that require changes to make agile methods effective.

The course trains students to adapt to the dynamic nature of data science, which is both engineering and science. It puts the focus of the team on creating an application for their problem domain that describes their datasets and guides them on their critical path to unlock value. This application serves to enable customer interaction, the bedrock of agile methods. This interaction helps to avoid common pitfalls in agile data science and can deliver consistent, predictable results.

The half-day course consists of a three-hour interactive presentation followed by a guided discussion. Longer courses add a pre-staged hackathon phase, where the lessons presented are applied to create new prototypes using your own data resources. This jumpstarts the practice of agile methods in your data science team. Training periods can be split, with a half- or full-day presentation followed by a one- or two-day hackathon. Hackathons include a virtual environment set up for easy productivity with your own datasets, so your team can apply their creativity and skill and be productive immediately. This combination of traditional training with a customized hackathon delivers unparalleled value and benefit for your organization.

Full Stack "Big Data" App Dev

Agile Data Science: Building Full-Stack Analytics Applications

Course Length

Three Days, 8 Hours Per Day = 24 Hours

Required Enrollment

6 students

Course Overview

This is a professional development class that teaches how to iteratively craft entire analytics web applications using Python, Flask, Spark (SQL, Streaming, MLlib), Kafka, MongoDB, Elasticsearch, Bootstrap and D3.js. This popular stack is an example of the kind needed to process and refine data at scale in real-world applications of data science. During this course, students will use airline flight data to produce an entire analytics web application from the ground up.

The course will serve as a tutorial in which the student learns basic skills in all the categories needed to ship an entire analytics application. In each section, the student will add a layer to their application, creating the start of something they can really use in their respective domain. While students will not learn the tools in detail, they will establish the working foundation needed, and will practice the kind of active learning-as-you-go that analytics requires. The goal of the course is to establish a working foundation with working code that the student can extend as they learn going forward. Working end-to-end code makes learning much easier.

The organizational principle behind the course is the data-value pyramid, pictured below. Students will climb the data-value pyramid, refining data at one step to reach the next.

[Figure: the data-value pyramid]

This class will be a chance for a practicing data scientist or web developer to learn to turn their data and analyses into full-blown, actionable web applications. We will put the student in a position where she can independently improve on the foundation we have given her and go on to build great analytics applications.

Learning Promise:

The course appeals to anyone who identifies with one of the following statements:

  • I am a programmer who wants to get a foothold in the emerging field of data science.
  • I am a practicing data scientist who wants to learn to build full-stack applications or apply agile principles to data science.
  • I am an entry-level data scientist who wants to learn how analytics applications are crafted.

Roles:

Programmers who want to learn introductory data science.

Practicing statisticians who want to learn to build entire applications.

Entry-level data scientists who want to learn how to craft full-stack applications.

Audience Level:

The difficulty of the material is intermediate. While all the content we cover is introductory, the breadth of the material we cover makes this an intermediate challenge.

Teams:

The course can be tailored to meet the needs of data science teams. Teams will learn to work together to create full-stack applications. Each student will learn their individual role, as well as how to collaborate using agile development to do data science.

Prerequisites:

Students should be fluent programmers in at least one language, preferably Python, with some experience in JavaScript. Exposure to data analysis on some level is required, but that might be limited to SQL. We can provide a pre-test which students should pass to benefit optimally from the course.

A virtual machine image for use with either Vagrant or Amazon EC2 will be provided, which will contain the environment for the course. Data for the course will be downloadable. Students wishing to install the tools on their own computers can refer to Appendix A of Agile Data Science 2.0 (O'Reilly, 2017), which contains detailed installation instructions. We won't cover a custom install in the course directly.

Program:

Students will learn the theory and practice of employing agile development principles to build entire analytics applications. Students will gain real-world experience building all aspects of a real analytics application.

A lecture on theory will begin the course, followed by a lecture and examples illustrating how the tools form a complete data platform. Next comes a lecture explaining the dataset used in the course. After this, the remainder of the course will be guided exercises.

We will iteratively extract increasing amounts of value from raw data as we refine it in stages that correspond to the levels of the data-value pyramid.
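As a rough illustration of refining data in stages, here is a minimal sketch in plain Python. It is not the course's code: the level names (records, charts, reports, predictions), the input lines, the field layout and every function name are assumptions made for this example, and the course itself uses Spark rather than in-memory lists.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw input: carrier, origin, destination, delay in minutes.
# The values and layout are invented for illustration only.
RAW_LINES = [
    "AA,SFO,JFK,12",
    "AA,SFO,JFK,48",
    "DL,ATL,SEA,-5",
    "DL,ATL,SEA,7",
    "bad line",
]


def to_records(lines):
    """Records level: parse raw lines, skipping any that are malformed."""
    records = []
    for line in lines:
        parts = line.split(",")
        if len(parts) != 4:
            continue
        carrier, origin, dest, delay = parts
        records.append(
            {"carrier": carrier, "origin": origin, "dest": dest, "delay": float(delay)}
        )
    return records


def to_chart_series(records):
    """Charts level: average delay per carrier, ready to hand to a plot."""
    by_carrier = defaultdict(list)
    for record in records:
        by_carrier[record["carrier"]].append(record["delay"])
    return {carrier: mean(delays) for carrier, delays in sorted(by_carrier.items())}


def to_report(series):
    """Reports level: carriers ranked from worst to best average delay."""
    return sorted(series, key=series.get, reverse=True)


def predict_delay(series, carrier):
    """Predictions level: naive forecast using the carrier's historical mean,
    falling back to the mean across carriers for an unseen carrier."""
    overall = mean(list(series.values()))
    return series.get(carrier, overall)


if __name__ == "__main__":
    series = to_chart_series(to_records(RAW_LINES))
    print(series)
    print(to_report(series))
    print(predict_delay(series, "AA"))
```

Each function consumes the output of the level below it, so a change at one level (for example, a stricter parser) flows upward without rewriting the later stages.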

Exercises throughout the course will ensure that students learn to apply what they are learning. Students will end up with a simple web application and data processing scripts they can alter and customize to fit their own problem domain and their own dataset.

Big Ideas:

  1. Agile development applies to data science.
  2. Data mining is approachable if done in stages.
  3. Analytics applications can begin simply and grow in complexity.

Expected Outcomes

By the end of this course...

Participants will understand:

  1. How to use PySpark and web visualization to refine data
  2. How to build analytics applications
  3. How to approach predictive analytics
  4. How to deploy predictive systems in batch and realtime

Participants will be able to:

  1. Build analytics applications from the ground up
  2. Make visualizations and predictions
  3. Use Python, Spark, Spark SQL, Spark MLlib and Spark Streaming to build entire applications

Common Misunderstandings:

Students new to this content most commonly struggle with the following ideas, skills, or performance abilities:

  1. How to get started making predictions with real-world data
  2. How to get experience munging data into visualizations and predictive algorithms
  3. How to actually do data science in an agile, iterative manner
  4. How to get started with realtime analytics
  5. The role of batch versus realtime computing

Learning Activities and Assessments:

Throughout the course, students will work with source code, hosted on GitHub and available on a virtual machine or Amazon EC2 image, which they can run, tweak and modify to learn. In addition, some portions of the course will include Jupyter or Zeppelin notebooks. There will be an assessment in each section, where the student takes what they've learned and extends an example to do something new.

Realtime Predictive Analytics

Course Length

Six Hours

Required Enrollment

3 students

The course covers the construction of an entire predictive analytics web application using Python/Flask/jQuery, Kafka, PySpark, Spark MLlib and Spark Streaming. This is similar to chapters 7 and 8 in the book Agile Data Science 2.0 (O'Reilly, 2017).

First we go over the architecture used. After project setup using a local virtual machine or Amazon EC2, we use PySpark in batch mode to train a classifier model that predicts flight delays in terms of four categories. Next we build a web application front-end to our predictive system, which submits prediction requests to a Kafka queue. Next we use Spark MLlib with Spark Streaming to respond to those prediction requests. Finally, we demonstrate and analyze the entire system together.
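The flow described above (train in batch, queue prediction requests, answer them as they arrive) can be sketched with the Python standard library alone. This is a simplified stand-in, not the course's implementation: an in-memory `queue.Queue` plays the part of Kafka, a per-carrier majority vote plays the part of the Spark MLlib classifier, and the bucket edges, labels and all function names are assumptions made for illustration.

```python
import bisect
import json
import queue
from collections import Counter, defaultdict

# Assumed bucket edges in minutes and labels for the four delay categories.
DELAY_EDGES = [-15.0, 0.0, 30.0]
DELAY_LABELS = ["early", "on time", "slightly late", "very late"]


def delay_bucket(delay_minutes):
    """Map a delay in minutes to a category index from 0 to 3."""
    return bisect.bisect_right(DELAY_EDGES, delay_minutes)


def train_batch(history):
    """Batch step: learn the most common delay bucket for each carrier."""
    counts = defaultdict(Counter)
    for row in history:
        counts[row["carrier"]][delay_bucket(row["delay"])] += 1
    return {carrier: c.most_common(1)[0][0] for carrier, c in counts.items()}


def submit_request(requests, request_id, carrier):
    """Web front-end step: enqueue a prediction request as a JSON message."""
    requests.put(json.dumps({"id": request_id, "carrier": carrier}))


def serve_pending(requests, model, responses, default_bucket=1):
    """Streaming step: drain queued requests and store a label for each one."""
    while True:
        try:
            message = requests.get_nowait()
        except queue.Empty:
            break
        request = json.loads(message)
        bucket = model.get(request["carrier"], default_bucket)
        responses[request["id"]] = DELAY_LABELS[bucket]


if __name__ == "__main__":
    # Invented historical data, used only to exercise the sketch.
    history = [
        {"carrier": "AA", "delay": 45},
        {"carrier": "AA", "delay": 50},
        {"carrier": "AA", "delay": -3},
        {"carrier": "DL", "delay": -20},
    ]
    model = train_batch(history)
    requests, responses = queue.Queue(), {}
    submit_request(requests, "r1", "AA")
    serve_pending(requests, model, responses)
    print(responses)
```

Keeping training and serving in separate functions mirrors the batch versus realtime split: the model is built once from historical data, while requests are answered one message at a time.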

This course is also available on video.