Big Data Processing
Distributed data processing with Apache Spark: core APIs, Spark SQL, and performance tuning.
Core Concepts
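RDD (resilient distributed dataset): an immutable, partitioned collection of records, the low-level API. DataFrame: structured data organized into named columns. Dataset: a typed counterpart to the DataFrame, available in Scala and Java.
Transformations (map, filter, groupBy) are lazy: they describe a new dataset without computing it. Actions (collect, count, and writes such as saveAsTextFile) trigger execution and return or store a result.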
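The split between transformations and actions is easier to see in a small model. The sketch below is plain Python, not Spark: LazyDataset and square are hypothetical names invented for illustration, and the class only mimics how a transformation records work while an action runs it.

```python
# Toy model of lazy transformations and eager actions in plain Python.
# LazyDataset is a hypothetical class for illustration; it is not Spark's API.
from typing import Any, Callable, Iterable, List


class LazyDataset:
    def __init__(self, source: Callable[[], Iterable[Any]]):
        # Store a recipe for producing records, not the records themselves.
        self._source = source

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the recorded pipeline and produce a result.
    def collect(self) -> List[Any]:
        return list(self._source())

    def count(self) -> int:
        return sum(1 for _ in self._source())


calls: List[int] = []  # records every time square() actually runs


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset(lambda: range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)

calls_before_action = len(calls)  # no work has happened yet
result = pipeline.collect()       # the action runs the whole pipeline
calls_after_action = len(calls)   # square() ran once per input record
```

In this model, calling pipeline.count() afterwards runs square again for every record, because nothing was stored between actions. That recomputation is what caching is meant to avoid.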
Spark SQL
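Spark SQL runs SQL queries over DataFrames. Temporary views: register a DataFrame under a name so SQL can query it; the view lasts for the session. UDFs (user-defined functions): custom functions callable from SQL and DataFrame expressions.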
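The idea of a temporary view plus a user-defined function can be tried without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it is an analogy and not Spark code; the table, the sample rows, and the initial function are made up for illustration. In PySpark the corresponding calls are DataFrame.createOrReplaceTempView and spark.udf.register.

```python
# Temporary view and UDF, sketched with sqlite3 as a stand-in for Spark SQL.
# The table, rows, and function below are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ada", 36), ("Linus", 17), ("Grace", 45)],
)

# Temporary view: a named query that exists only for this connection.
conn.execute(
    "CREATE TEMP VIEW adults AS SELECT name, age FROM people WHERE age >= 18"
)


def initial(name: str) -> str:
    """User-defined function: first letter of a name, upper-cased."""
    return name[0].upper()


# Register the Python function so SQL can call it by name (one argument).
conn.create_function("initial", 1, initial)

rows = conn.execute(
    "SELECT initial(name), age FROM adults ORDER BY age"
).fetchall()
conn.close()
```

The query reads from the view as if it were a table and calls the registered function per row, which is the same shape a Spark SQL query over a temporary view takes.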
Optimization
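Partitions: the number of partitions sets how many tasks can run in parallel. Caching: persist a dataset that is reused across actions so it is not recomputed. Broadcast joins: send a small table to every executor so the large table can be joined without a shuffle.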
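Partitioning and broadcast joins can be modelled in a few lines. This is a plain-Python sketch, not Spark: hash_partition and broadcast_join are hypothetical helpers, the sample orders and customers are made up, and integer keys with a modulo stand in for a real hash partitioner. In PySpark the real counterparts are DataFrame.repartition and pyspark.sql.functions.broadcast.

```python
# Toy model of hash partitioning and a broadcast (map-side) join.
# Helper names and sample data are illustrative assumptions, not Spark's API.
from typing import Dict, List, Tuple

Record = Tuple[int, str]


def hash_partition(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Assign each record to a partition by key; each partition is one unit of parallel work."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[key % num_partitions].append((key, value))
    return partitions


def broadcast_join(
    partitions: List[List[Record]], small_table: Dict[int, str]
) -> List[List[Tuple[int, str, str]]]:
    """Inner join: every partition looks keys up in its own copy of the small table."""
    joined = []
    for part in partitions:
        lookup = dict(small_table)  # models the copy shipped to each worker
        # The large side never moves between partitions, so no shuffle is needed.
        joined.append([(k, v, lookup[k]) for k, v in part if k in lookup])
    return joined


orders = [(1, "book"), (2, "pen"), (3, "lamp"), (4, "desk"), (1, "ink")]
customers = {1: "Ada", 2: "Grace", 3: "Linus"}  # the small table

partitions = hash_partition(orders, 2)
joined = broadcast_join(partitions, customers)
```

The trade-off is memory for network traffic: each worker holds a full copy of the small table, which only pays off while that table stays small.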
Key Takeaways
- RDD, DataFrame, Dataset APIs
- Spark SQL provides a SQL interface over DataFrames
- Tune partitioning, cache reused data, and broadcast small tables for performance