Optimizing Apache Spark on Databricks Training Course

Price
$1,500.00 USD

Duration
2 Days

Delivery Methods
Virtual Instructor Led
Private Group

Course Overview

Apache Spark is powerful, but only when optimized. Most Spark performance issues boil down to a handful of root causes: shuffle, skew, spill, serialization, and storage. In this two-day course, you’ll learn to diagnose and resolve these issues using the Spark UI, targeted optimization techniques, and tools in Spark 3.0.

This course also explores how to optimize query execution, manage shuffle partition issues, and structure data using Delta Lake, partition strategies, and data skipping. You’ll apply hands-on skills to improve the performance of real-world workloads and design better clusters. Whether you’re tuning SQL queries or preparing large-scale machine learning pipelines, this course will help you get the most out of Spark and Databricks.

Course Objectives

By the end of this course, you’ll be able to identify and fix common Spark performance bottlenecks. You’ll also understand how to apply Spark 3.x features and cluster design strategies to improve efficiency.

  • Diagnose skew, spill, shuffle, storage, and serialization issues
  • Use the Spark UI to investigate performance bottlenecks
  • Apply performance tuning techniques during data ingestion
  • Use features like Z-ordering, bucketing, and Adaptive Query Execution (AQE)
  • Configure a Databricks cluster for optimal Spark performance

Training Benefits

  • Top-rated instructors: Our crew of subject matter experts has an average instructor rating of 4.8 out of 5 across thousands of reviews.
  • Authorized content: We maintain more than 35 Authorized Training Partnerships with the top players in tech, ensuring your course materials contain the most relevant and up-to-date information.
  • Interactive classroom participation: Our virtual training includes live lectures, demonstrations, and virtual labs that let you join discussions with your instructor and fellow classmates and get real-time feedback.
  • Post-class resources: Review your class content, catch up on any material you may have missed, or perfect your new skills with access to resources after your course is complete.
  • Private group training: Let our world-class instructors deliver exclusive training courses just for your employees. Our private group training is designed to promote your team’s shared growth and skill development.
  • Tailored training solutions: Our subject matter experts can customize the class to address the unique goals of your team.

Course Prerequisites

  • Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.
  • Intermediate experience in Python or Scala

Agenda

Day 1: Understanding and Diagnosing Performance Issues

  • Spark architecture and Spark UI
  • Skew and data imbalance
  • Spill and memory issues
  • Shuffle mechanics
  • Storage formats and tuning
  • Serialization performance
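
To give a feel for the skew topic on Day 1, here is a minimal sketch in plain Python (no Spark required). It is not taken from the course materials. It assumes simple hash partitioning, and the helper names `partition_sizes` and `skew_ratio` are hypothetical, chosen only for illustration. The idea is that one "hot" key sends most records to a single partition, which shows up as a large ratio between the biggest partition and the median one.

```python
from collections import Counter
from statistics import median


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Ratio of the largest partition to the median partition size."""
    med = median(sizes)
    return max(sizes) / med if med else float("inf")


# Illustrative data: key 0 is "hot" (900 records), keys 1..100 appear once each.
# Integer keys are used so that hashing is deterministic across runs.
keys = [0] * 900 + list(range(1, 101))
sizes = partition_sizes(keys, 4)
print(sizes, skew_ratio(sizes))
```

A ratio near 1 suggests evenly sized partitions. A much larger ratio is the kind of imbalance that typically appears in the Spark UI as one task running far longer than the rest of its stage.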

Day 2: Optimizing and Scaling Spark Workloads

  • Data ingestion: partitioning, predicate pushdown
  • Z-ordering and bucketing strategies
  • Adaptive Query Execution (AQE)
  • Designing clusters for specific workloads
  • Hands-on optimization labs using Databricks
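
As a rough illustration of the Adaptive Query Execution topic, the sketch below collects a few standard Spark SQL configuration keys in a plain Python dictionary. It is an assumption about what a starting point might look like, not the course's recommended configuration. The function name `aqe_settings` is hypothetical, and the default of 200 shuffle partitions simply mirrors Spark's own default.

```python
def aqe_settings(shuffle_partitions=200):
    """Return Spark SQL settings that turn on AQE and its main sub-features.

    The values are illustrative starting points, not tuned recommendations.
    """
    return {
        # Re-optimize query plans at runtime using shuffle statistics.
        "spark.sql.adaptive.enabled": "true",
        # Merge small shuffle partitions after a shuffle stage completes.
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Split oversized partitions in sort-merge joins to reduce skew.
        "spark.sql.adaptive.skewJoin.enabled": "true",
        # Initial number of shuffle partitions before AQE adjusts them.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }
```

In a notebook where a SparkSession named `spark` already exists, these could be applied with `for key, value in aqe_settings().items(): spark.conf.set(key, value)`.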
 

Get in touch to schedule training for your team
We can enroll multiple students in an upcoming class or schedule a dedicated private training event designed to meet your organization’s needs.

CourseID: 3605021E

Do You Have Additional Questions?

Contact Us about Starting Your Business Training Strategy with New Horizons