Optimizing Apache Spark on Databricks

Price
$1,500.00 USD

Duration
2 Days

 

Delivery Methods
Virtual Instructor Led
Private Group

Course Overview

In this course, you will explore the five key problems that account for the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on 100 GB to 1+ TB datasets, you will use the Spark UI to investigate and diagnose bottlenecks and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.

Course Objectives

  • Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance
  • Summarize the most common performance problems associated with data ingestion and how to mitigate them
  • Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
  • Configure a Spark cluster for maximum performance given specific job requirements

Course Prerequisites

  • Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.
  • Intermediate experience in Python or Scala

Agenda

Day 1

  • Review of Spark architecture and Spark UI
  • Skew
  • Spill
  • Shuffle
  • Storage
  • Serialization

Day 2

  • Ingestion basics
  • Predicate pushdown
  • Disk partitioning
  • Z-ordering
  • Bucketing
  • Optimization with Adaptive Query Execution (AQE)
  • Designing and configuring clusters for high performance
 

Get in touch to schedule training for your team
We can enroll multiple students in an upcoming class or schedule a dedicated private training event designed to meet your organization’s needs.

 



Contact Us about Starting Your Business Training Strategy with New Horizons