spark

Apache Spark

Apache Spark patterns for large-scale data processing with typed Datasets and Spark SQL in Scala

Details

Language / Topic
Scala
Category
framework
Compatible Frameworks
spark

Rules

- Use `Dataset[T]` with case class encoders instead of untyped RDDs — the structured API enables Catalyst query optimization and type safety.
- Read data with `spark.read.parquet("s3://bucket/path")` — avoid `sc.textFile` except for raw log ingestion in exploratory pipelines.
- Cache with `ds.cache()` only when a Dataset is reused in multiple downstream transformations — uncache with `ds.unpersist()` when done.
- Inspect the query plan with `ds.explain(extended = true)` before deploying — look for missing predicates and unnecessary full scans.
- Never call `ds.collect()` on large datasets — write results with `ds.write.parquet("output")` or `ds.write.mode("overwrite").jdbc(...)`.
- Tune partitioning: `repartition(n)` triggers a full shuffle and can increase or decrease the partition count; `coalesce(n)` only reduces partitions and avoids a shuffle.
- Enable AQE: `spark.conf.set("spark.sql.adaptive.enabled", "true")` — optimizes join strategies and coalesces small output partitions.
- Broadcast small tables: `df.join(broadcast(lookup), "key")` converts expensive sort-merge joins to broadcast hash joins.
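The typed-Dataset and plan-inspection rules above can be sketched together. This is a minimal sketch, not a definitive pipeline: the `Event` case class, its fields, and the S3 path are hypothetical placeholders for your own schema and storage.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical event schema — adjust the fields to match your data.
case class Event(userId: Long, action: String, ts: java.sql.Timestamp)

object TypedReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-read")
      .getOrCreate()
    import spark.implicits._ // brings case class Encoders into scope

    // Structured read keeps the Catalyst optimizer in play; .as[Event]
    // fails at analysis time if the Parquet schema does not line up.
    val events: Dataset[Event] = spark.read
      .parquet("s3://bucket/path") // placeholder path from the rule above
      .as[Event]

    // Typed filter: the lambda is checked by the Scala compiler.
    val clicks = events.filter(_.action == "click")

    // Print parsed, analyzed, optimized, and physical plans before deploying.
    clicks.explain(extended = true)

    spark.stop()
  }
}
```

The `.as[Event]` conversion is what moves you from an untyped `DataFrame` to a `Dataset[Event]`; compilation catches field-name typos in the lambdas that a string-based column expression would only surface at runtime.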
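The cache-reuse-unpersist rule can be illustrated as follows. This is a hedged sketch assuming an `events` Dataset with `userId` and `action` columns and hypothetical output paths; the point is the shape of the pattern, not the specific aggregations.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Sketch: cache a Dataset only because it feeds TWO downstream writes,
// then release executor memory once both actions have run.
def writeAggregates(spark: SparkSession, events: DataFrame): Unit = {
  import spark.implicits._

  val enriched = events.filter($"action".isNotNull).cache() // reused twice below

  // Both actions read the cached data instead of re-scanning the source.
  val byUser   = enriched.groupBy($"userId").count()
  val byAction = enriched.groupBy($"action").count()

  byUser.write.mode(SaveMode.Overwrite).parquet("output/by_user")     // hypothetical path
  byAction.write.mode(SaveMode.Overwrite).parquet("output/by_action") // hypothetical path

  enriched.unpersist() // free cached blocks once both writes finish
}
```

Had `enriched` fed only one write, the `cache()` would be pure overhead: Spark would materialize the data into memory for a single pass that could have streamed straight from the source.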
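The AQE, broadcast-join, and coalesce rules combine naturally in one write path. A minimal sketch, assuming hypothetical `facts` and `lookup` Parquet inputs joined on a shared `key` column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("join-tuning").getOrCreate()

// AQE re-plans at runtime: it can coalesce tiny shuffle partitions and
// switch join strategies based on observed sizes.
spark.conf.set("spark.sql.adaptive.enabled", "true")

val facts  = spark.read.parquet("s3://bucket/facts")  // hypothetical large table
val lookup = spark.read.parquet("s3://bucket/lookup") // hypothetical small dimension table

// broadcast() hints the planner to ship `lookup` to every executor,
// turning a sort-merge join into a broadcast hash join.
val joined = facts.join(broadcast(lookup), "key")

// coalesce reduces output file count without triggering a shuffle.
joined.coalesce(8).write.mode("overwrite").parquet("s3://bucket/joined")
```

The broadcast hint is only safe when `lookup` fits comfortably in each executor's memory; past that point, let AQE or the default sort-merge strategy decide.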