spark

Apache Spark

Apache Spark patterns for large-scale data processing with typed Datasets and Spark SQL in Scala

Details

Language / Topic
Scala
Category
framework
Compatible Frameworks
spark

Rules

- Use `Dataset[T]` with case class encoders instead of untyped RDDs — the structured API enables Catalyst query optimization and type safety.
- Read data with `spark.read.parquet("s3://bucket/path")` — avoid `sc.textFile` except for raw log ingestion in exploratory pipelines.
- Cache with `ds.cache()` only when a Dataset is reused in multiple downstream transformations — uncache with `ds.unpersist()` when done.
- Inspect the query plan with `ds.explain(extended = true)` before deploying — look for missing predicates and unnecessary full scans.
- Never call `ds.collect()` on large datasets — write results with `ds.write.parquet("output")` or `ds.write.mode("overwrite").jdbc(...)`.
- Tune partitioning: `repartition(n)` triggers a full shuffle and can increase or decrease the partition count; `coalesce(n)` only reduces partitions and avoids a shuffle.
- Enable AQE: `spark.conf.set("spark.sql.adaptive.enabled", "true")` — optimizes join strategies and coalesces small output partitions.
- Broadcast small tables: `df.join(broadcast(lookup), "key")` converts expensive sort-merge joins to broadcast hash joins.
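The typed-Dataset and plan-inspection rules above can be sketched together. This is a minimal sketch, not a definitive pipeline: the `Event` case class, its fields, and the S3 path are hypothetical placeholders for your own schema and storage.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical event schema — adjust the fields to match your data.
case class Event(userId: Long, action: String, ts: java.sql.Timestamp)

object TypedReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-read")
      .getOrCreate()
    import spark.implicits._ // brings case class Encoders into scope

    // Structured read keeps the Catalyst optimizer in play; .as[Event]
    // fails at analysis time if the Parquet schema does not line up.
    val events: Dataset[Event] = spark.read
      .parquet("s3://bucket/path") // placeholder path from the rule above
      .as[Event]

    // Typed filter: the lambda is checked by the Scala compiler.
    val clicks = events.filter(_.action == "click")

    // Print parsed, analyzed, optimized, and physical plans before deploying.
    clicks.explain(extended = true)

    spark.stop()
  }
}
```

The `.as[Event]` conversion is what moves you from an untyped `DataFrame` to a `Dataset[Event]`; compilation catches field-name typos in the lambdas that a string-based column expression would only surface at runtime.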
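The cache-reuse-unpersist rule can be illustrated as follows. This is a hedged sketch assuming an `events` Dataset with `userId` and `action` columns and hypothetical output paths; the point is the shape of the pattern, not the specific aggregations.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Sketch: cache a Dataset only because it feeds TWO downstream writes,
// then release executor memory once both actions have run.
def writeAggregates(spark: SparkSession, events: DataFrame): Unit = {
  import spark.implicits._

  val enriched = events.filter($"action".isNotNull).cache() // reused twice below

  // Both actions read the cached data instead of re-scanning the source.
  val byUser   = enriched.groupBy($"userId").count()
  val byAction = enriched.groupBy($"action").count()

  byUser.write.mode(SaveMode.Overwrite).parquet("output/by_user")     // hypothetical path
  byAction.write.mode(SaveMode.Overwrite).parquet("output/by_action") // hypothetical path

  enriched.unpersist() // free cached blocks once both writes finish
}
```

Had `enriched` fed only one write, the `cache()` would be pure overhead: Spark would materialize the data into memory for a single pass that could have streamed straight from the source.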
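The AQE, broadcast-join, and coalesce rules combine naturally in one write path. A minimal sketch, assuming hypothetical `facts` and `lookup` Parquet inputs joined on a shared `key` column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-tuning").getOrCreate()

// AQE re-plans at runtime: it can coalesce tiny shuffle partitions and
// switch join strategies based on observed sizes.
spark.conf.set("spark.sql.adaptive.enabled", "true")

val facts  = spark.read.parquet("s3://bucket/facts")  // hypothetical large table
val lookup = spark.read.parquet("s3://bucket/lookup") // hypothetical small dimension table

// broadcast() hints the planner to ship `lookup` to every executor,
// turning a sort-merge join into a broadcast hash join.
val joined = facts.join(broadcast(lookup), "key")

// coalesce reduces output file count without triggering a shuffle.
joined.coalesce(8).write.mode("overwrite").parquet("s3://bucket/joined")
```

The broadcast hint is only safe when `lookup` fits comfortably in each executor's memory; past that point, let AQE or the default sort-merge strategy decide.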