Course Overview
In this course, you will explore the five key problems that represent the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on 100 GB to 1+ TB datasets, you will investigate and diagnose sources of bottlenecks with the Spark UI and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.
Course Objectives
- Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance
- Summarize the most common performance problems associated with data ingestion and how to mitigate them
- Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
- Configure a Spark cluster for maximum performance given specific job requirements
Agenda
Day 1
- Review of Spark architecture and Spark UI
- Skew
- Spill
- Shuffle
- Storage
- Serialization
Day 2
- Ingestion basics
- Predicate pushdowns
- Disk partitioning
- Z-ordering
- Bucketing
- Optimization with Adaptive Query Execution (AQE)
- Designing and configuring clusters for high performance