Cloudera Data Engineering: Developing Applications with Apache Spark

Price
$3,520.00 USD

Duration
4 Days

Delivery Methods
Virtual Instructor Led
Private Group

Course Objectives

Distribute, store, and process data in a CDP cluster
Write, configure, and deploy Apache Spark applications
Use the Spark interpreters and Spark applications to explore, process, and analyze distributed data
Query data using Spark SQL, DataFrames, and Hive tables
Use Spark Streaming together with Kafka to process a data stream

Who Should Attend?

This course is designed for developers and data engineers. All students are expected to have basic Linux experience and basic proficiency in either the Python or Scala programming language. Basic knowledge of SQL is helpful. Prior knowledge of Spark and Hadoop is not required.

Top-rated instructors: Our crew of subject matter experts has an average instructor rating of 4.8 out of 5 across thousands of reviews.
Authorized content: We maintain more than 35 Authorized Training Partnerships with the top players in tech, ensuring your course materials contain the most relevant and up-to-date information.
Interactive classroom participation: Our virtual training includes live lectures, demonstrations and virtual labs that allow you to participate in discussions with your instructor and fellow classmates to get real-time feedback.
Post Class Resources: Review your class content, catch up on any material you may have missed or perfect your new skills with access to resources after your course is complete.
Private Group Training: Let our world-class instructors deliver exclusive training courses just for your employees. Our private group training is designed to promote your team’s shared growth and skill development.
Tailored Training Solutions: Our subject matter experts can customize the class to specifically address the unique goals of your team.

Agenda

Introduction to Zeppelin

Why Notebooks?
Zeppelin Notes
Demo: Apache Spark In 5 Minutes

HDFS Introduction

HDFS Overview
HDFS Components and Interactions
Additional HDFS Interactions
Ozone Overview
Exercise: Working with HDFS

YARN Introduction

YARN Overview
YARN Components and Interaction
Working with YARN
Exercise: Working with YARN

Distributed Processing History

The Disk Years: 2000 -> 2010
The Memory Years: 2010 -> 2020
The GPU Years: 2020 ->

Working with DataFrames

Introduction to DataFrames
Exercise: Introducing DataFrames
Exercise: Reading and Writing DataFrames
Exercise: Working with Columns
Exercise: Working with Complex Types
Exercise: Combining and Splitting DataFrames
Exercise: Summarizing and Grouping DataFrames
Exercise: Working with UDFs
Exercise: Working with Windows

Introduction to Apache Hive

About Hive

Hive and Spark Integration

Hive and Spark Integration
Exercise: Spark Integration with Hive

Data Visualization with Zeppelin

Introduction to Data Visualization with Zeppelin
Zeppelin Analytics
Zeppelin Collaboration
Exercise: AdventureWorks

Distributed Processing Challenges

Shuffle
Skew
Order

Spark Distributed Processing

DataFrame and Dataset
Persistence
Persistence Storage Levels
Viewing Persisted RDDs
Exercise: Persisting DataFrames

Writing, Configuring, and Running Spark Applications

Writing a Spark Application
Building and Running an Application
Application Deployment Mode
The Spark Application Web UI
Configuring Application Properties
Exercise: Writing, Configuring, and Running a Spark Application

Introduction to Structured Streaming

Introduction to Structured Streaming
Exercise: Processing Streaming Data

Message Processing with Apache Kafka

What is Apache Kafka?
Apache Kafka Overview
Scaling Apache Kafka
Apache Kafka Cluster Architecture
Apache Kafka Command Line Tools

Structured Streaming with Apache Kafka

Receiving Kafka Messages
Sending Kafka Messages
Exercise: Working with Kafka Streaming Messages

Aggregating and Joining Streaming DataFrames

Streaming Aggregation
Joining Streaming DataFrames
Exercise: Aggregating and Joining Streaming DataFrames

Appendix: Working with Datasets in Scala

Working with Datasets in Scala
Exercise: Using Datasets in Scala

Get in touch to schedule training for your team
We can enroll multiple students in an upcoming class or schedule a dedicated private training event designed to meet your organization’s needs.