Cloudera Data Engineering: Developing Applications with Apache Spark

Price
$3,520.00 USD

Duration
4 Days

 

Delivery Methods
Virtual Instructor Led
Private Group

Course Objectives

  • Distribute, store, and process data in a CDP cluster
  • Write, configure, and deploy Apache Spark applications
  • Use the Spark interpreters and Spark applications to explore, process, and analyze distributed data
  • Query data using Spark SQL, DataFrames, and Hive tables
  • Use Spark Streaming together with Kafka to process a data stream

Who Should Attend?

This course is designed for developers and data engineers. All students are expected to have basic Linux experience and basic proficiency in either Python or Scala. Basic knowledge of SQL is helpful. Prior knowledge of Spark and Hadoop is not required.
  • Top-rated instructors: Our subject matter experts have an average instructor rating of 4.8 out of 5 across thousands of reviews.
  • Authorized content: We maintain more than 35 Authorized Training Partnerships with the top players in tech, ensuring your course materials contain the most relevant and up-to-date information.
  • Interactive classroom participation: Our virtual training includes live lectures, demonstrations and virtual labs that allow you to participate in discussions with your instructor and fellow classmates to get real-time feedback.
  • Post-class resources: Review your class content, catch up on any material you may have missed, or perfect your new skills with access to resources after your course is complete.
  • Private Group Training: Let our world-class instructors deliver exclusive training courses just for your employees. Our private group training is designed to promote your team’s shared growth and skill development.
  • Tailored Training Solutions: Our subject matter experts can customize the class to specifically address the unique goals of your team.

Agenda

Introduction to Zeppelin

  • Why Notebooks?
  • Zeppelin Notes
  • Demo: Apache Spark In 5 Minutes

HDFS Introduction

  • HDFS Overview
  • HDFS Components and Interactions
  • Additional HDFS Interactions
  • Ozone Overview
  • Exercise: Working with HDFS

YARN Introduction

  • YARN Overview
  • YARN Components and Interaction
  • Working with YARN
  • Exercise: Working with YARN

Distributed Processing History

  • The Disk Years: 2000 -> 2010
  • The Memory Years: 2010 -> 2020
  • The GPU Years: 2020 ->

Working with DataFrames

  • Introduction to DataFrames
  • Exercise: Introducing DataFrames
  • Exercise: Reading and Writing DataFrames
  • Exercise: Working with Columns
  • Exercise: Working with Complex Types
  • Exercise: Combining and Splitting DataFrames
  • Exercise: Summarizing and Grouping DataFrames
  • Exercise: Working with UDFs
  • Exercise: Working with Windows

Introduction to Apache Hive

  • About Hive

Hive and Spark Integration

  • Hive and Spark Integration
  • Exercise: Spark Integration with Hive

Data Visualization with Zeppelin

  • Introduction to Data Visualization with Zeppelin
  • Zeppelin Analytics
  • Zeppelin Collaboration
  • Exercise: AdventureWorks

Distributed Processing Challenges

  • Shuffle
  • Skew
  • Order

Spark Distributed Processing

  • DataFrame and Dataset
  • Persistence
  • Persistence Storage Levels
  • Viewing Persisted RDDs
  • Exercise: Persisting DataFrames

Writing, Configuring, and Running Spark Applications

  • Writing a Spark Application
  • Building and Running an Application
  • Application Deployment Mode
  • The Spark Application Web UI
  • Configuring Application Properties
  • Exercise: Writing, Configuring, and Running a Spark Application

Introduction to Structured Streaming

  • Introduction to Structured Streaming
  • Exercise: Processing Streaming Data

Message Processing with Apache Kafka

  • What is Apache Kafka?
  • Apache Kafka Overview
  • Scaling Apache Kafka
  • Apache Kafka Cluster Architecture
  • Apache Kafka Command Line Tools

Structured Streaming with Apache Kafka

  • Receiving Kafka Messages
  • Sending Kafka Messages
  • Exercise: Working with Kafka
  • Streaming Messages

Aggregating and Joining Streaming DataFrames

  • Streaming Aggregation
  • Joining Streaming DataFrames
  • Exercise: Aggregating and Joining Streaming DataFrames

Appendix: Working with Datasets in Scala

  • Working with Datasets in Scala
  • Exercise: Using Datasets in Scala
 

Get in touch to schedule training for your team
We can enroll multiple students in an upcoming class or schedule a dedicated private training event designed to meet your organization’s needs.

 


