Building Full Stack Data Analytics Applications with Kafka and Spark
Agile Data Science 2.0 (O’Reilly, 2017) defines a methodology and a software stack with which to apply its methods. The methodology seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. The stack is one example of a stack meeting two requirements: that it scale to arbitrarily large data, and that it be efficient for application developers and data engineers alike to work with. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Airflow (incubating), MongoDB, Elasticsearch, Apache Parquet, Python/Flask, and jQuery. This talk will cover the full lifecycle of large-scale data application development and will show how to apply lessons from agile software engineering to data science, using this full stack to build better analytics applications.
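To make the seam between Kafka and Spark concrete, here is a minimal sketch of a PySpark Structured Streaming job that consumes events from a Kafka topic. The broker address and topic name are hypothetical placeholders, and the spark-sql-kafka connector package must be on the classpath:

```python
from pyspark.sql import SparkSession

# Sketch: a PySpark Structured Streaming job reading from Kafka.
# The broker address and topic name below are hypothetical placeholders.
spark = SparkSession.builder.appName("kafka_events").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "flight_delay_events")
    .load()
)

# Kafka delivers binary key/value pairs; cast the payload to a string
payloads = events.selectExpr("CAST(value AS STRING) AS json")

# Echo the stream to the console while developing
query = payloads.writeStream.format("console").start()
query.awaitTermination()
```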
Spark has emerged as the leading general-purpose distributed data processing platform. PySpark offers a Python interface to Spark, bringing the full power of Python's data processing ecosystem to bear when computing with Spark. Working with airline flight delay data, the tutorial will start by covering basic operations in PySpark: loading and storing data, filtering, mapping, grouping, and SQL operations. We'll go on to tour the RDD and DataFrame APIs, showing how and when to use each. We'll learn how to prepare data and store it in different kinds of databases. The class will show how to combine data flow programming and Spark SQL to slice and dice data of any size. Finally, we'll show how to use machine learning via Spark MLlib to build a model that predicts flight delays.
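As a taste of the basic operations, the sketch below loads a CSV of flight records and computes the average departure delay per origin airport, first with the DataFrame API and then with Spark SQL, before storing the result as Parquet. The file path and column names (Origin, DepDelay) are assumptions about the dataset's layout:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flight_delays").getOrCreate()

# Loading: read a CSV of on-time performance records into a DataFrame
# (path and schema are hypothetical)
flights = spark.read.csv("data/flights.csv", header=True, inferSchema=True)

# Filtering and grouping: average departure delay per origin airport
delays = (
    flights
    .filter(flights.DepDelay > 0)
    .groupBy("Origin")
    .avg("DepDelay")
)

# SQL operations: the same query expressed in Spark SQL
flights.createOrReplaceTempView("flights")
delays_sql = spark.sql(
    "SELECT Origin, AVG(DepDelay) FROM flights "
    "WHERE DepDelay > 0 GROUP BY Origin"
)

# Storing: write the result back out as Parquet
delays.write.mode("overwrite").parquet("data/delays.parquet")
```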
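Where the DataFrame API encourages declarative expressions, the RDD API is handy when arbitrary per-record Python logic is easier to write. A small illustration of mapping with RDDs, reusing the hypothetical flights DataFrame from the previous sketch:

```python
# Mapping with the lower-level RDD API: total departure delay per origin.
# Assumes the `flights` DataFrame from the previous sketch.
total_delay = (
    flights.rdd
    .map(lambda row: (row.Origin, float(row.DepDelay or 0.0)))
    .reduceByKey(lambda a, b: a + b)
)
print(total_delay.take(5))
```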
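Finally, a sketch of the kind of MLlib pipeline the tutorial builds up to. The 15-minute lateness threshold, the feature columns, and the choice of a random forest are illustrative assumptions, not the tutorial's exact model:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.sql import functions as F

# Label each flight: did it arrive 15+ minutes late? (threshold is illustrative)
labeled = flights.na.drop(subset=["ArrDelay", "DepDelay"]).withColumn(
    "Late", (F.col("ArrDelay") >= 15).cast("double")
)

# Encode the categorical origin airport and assemble a feature vector
indexer = StringIndexer(inputCol="Origin", outputCol="OriginIndex",
                        handleInvalid="skip")
assembler = VectorAssembler(inputCols=["OriginIndex", "DepDelay"],
                            outputCol="features")
forest = RandomForestClassifier(labelCol="Late", featuresCol="features")

# Train on 80% of the data, then score the held-out 20%
train, test = labeled.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[indexer, assembler, forest]).fit(train)
predictions = model.transform(test)
predictions.select("Origin", "DepDelay", "Late", "prediction").show(5)
```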