Who Should Attend?
This course is designed for data analysts, business intelligence specialists, developers, system architects, and database administrators. Some knowledge of SQL is assumed, as is basic Linux command-line familiarity. Prior knowledge of Apache Hadoop is not required.
- Top-rated instructors: Our crew of subject matter experts has an average instructor rating of 4.8 out of 5 across thousands of reviews.
- Authorized content: We maintain more than 35 Authorized Training Partnerships with the top players in tech, ensuring your course materials contain the most relevant and up-to-date information.
- Interactive classroom participation: Our virtual training includes live lectures, demonstrations, and virtual labs, so you can take part in discussions with your instructor and classmates and get real-time feedback.
- Post-Class Resources: Review your class content, catch up on any material you may have missed, or perfect your new skills with access to resources after your course is complete.
- Private Group Training: Let our world-class instructors deliver exclusive training courses just for your employees. Our private group training is designed to promote your team’s shared growth and skill development.
- Tailored Training Solutions: Our subject matter experts can customize the class to specifically address the unique goals of your team.
Agenda
1 - Introduction
2 - Apache Hadoop Fundamentals
- The Motivation for Hadoop
- Hadoop Overview
- Data Storage: HDFS
- Distributed Data Processing: YARN, MapReduce, and Spark
- Data Processing and Analysis: Hive and Impala
- Database Integration: Sqoop
- Other Hadoop Data Tools
- Exercise Scenario Explanation
3 - Introduction to Apache Hive and Impala
- What Is Hive?
- What Is Impala?
- Why Use Hive and Impala?
- Schema and Data Storage
- Comparing Hive and Impala to Traditional Databases
- Use Cases
4 - Querying with Apache Hive and Impala
- Databases and Tables
- Basic Hive and Impala Query Language Syntax
- Data Types
- Using Hue to Execute Queries
- Using Beeline (Hive's Shell)
- Using the Impala Shell
5 - Common Operators and Built-In Functions
- Operators
- Scalar Functions
- Aggregate Functions
6 - Data Management
- Data Storage
- Creating Databases and Tables
- Loading Data
- Altering Databases and Tables
- Simplifying Queries with Views
- Storing Query Results
7 - Data Storage and Performance
- Partitioning Tables
- Loading Data into Partitioned Tables
- When to Use Partitioning
- Choosing a File Format
- Using Avro and Parquet File Formats
8 - Working with Multiple Datasets
- UNION and Joins
- Handling NULL Values in Joins
- Advanced Joins
9 - Analytic Functions and Windowing
- Using Analytic Functions
- Other Analytic Functions
- Sliding Windows
10 - Complex Data
- Complex Data with Hive
- Complex Data with Impala
11 - Analyzing Text
- Using Regular Expressions with Hive and Impala
- Processing Text Data with SerDes in Hive
- Sentiment Analysis and n-grams in Hive
12 - Apache Hive Optimization
- Understanding Query Performance
- Cost-Based Optimization and Statistics
- Bucketing
- ORC File Optimizations
13 - Apache Impala Optimization
- How Impala Executes Queries
- Improving Impala Performance
14 - Extending Apache Hive and Impala
- Custom SerDes and File Formats in Hive
- Data Transformation with Custom Scripts in Hive
- User-Defined Functions
- Parameterized Queries
15 - Choosing the Best Tool for the Job
- Comparing Hive, Impala, and Relational Databases
- Which to Choose?
16 - Conclusion
Apache Kudu
- What Is Kudu?
- Kudu Tables
- Using Impala with Kudu