Apache Spark Reference
About Apache Spark Reference
The Apache Spark Reference is a searchable quick-reference covering the full Spark ecosystem for distributed data processing. It includes RDD operations like parallelize, map, filter, and reduceByKey, along with the higher-level DataFrame API for reading CSV and Parquet files, selecting columns, filtering rows, performing joins, and writing partitioned output. Each entry provides practical Scala code examples that you can adapt directly to your Spark applications.
Beyond core data processing, this reference covers Spark SQL with createOrReplaceTempView, Hive table creation, window functions such as RANK() OVER (PARTITION BY ...), and user-defined function (UDF) registration. The Structured Streaming section demonstrates reading from Kafka sources, configuring triggers and output modes, applying watermarks for late data handling, and using foreachBatch for custom micro-batch processing logic.
The reference also includes MLlib machine learning entries such as VectorAssembler for feature engineering, LogisticRegression for classification, and Pipeline for composing multi-stage ML workflows. The Config section covers essential tuning parameters including shuffle partitions, executor memory settings, cache and persist strategies with StorageLevel, repartition and coalesce for partition management, and the spark-submit command for deploying applications to YARN clusters.
Key Features
- RDD creation and transformation operations including parallelize, textFile, map, filter, and reduceByKey
- DataFrame API for CSV/Parquet I/O with schema inference, column selection, filtering, groupBy, and joins
- Spark SQL with temp views, Hive table DDL, window functions (e.g., RANK() OVER (PARTITION BY ...)), and UDF registration
- Structured Streaming with Kafka source, watermark-based late data handling, and foreachBatch processing
- MLlib machine learning pipeline with VectorAssembler, LogisticRegression, and multi-stage Pipeline
- Performance tuning with cache/persist, repartition/coalesce, and shuffle partition configuration
- spark-submit command examples with YARN cluster mode, executor memory, and core allocation
- Practical Scala code examples for each API with options, modes, and partition strategies
Frequently Asked Questions
What is the difference between RDD and DataFrame in Spark?
RDDs (Resilient Distributed Datasets) are the low-level API providing fine-grained control over data distribution and transformations. DataFrames are a higher-level abstraction organized into named columns, similar to database tables. DataFrames benefit from the Catalyst optimizer and the Tungsten execution engine, making them significantly faster than RDDs for most workloads. Use DataFrames unless you need low-level control over partitioning or are working with unstructured data.
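The contrast can be sketched with the same transformation expressed in both APIs. This is a minimal, illustrative example; the app name and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: a local SparkSession for experimentation.
val spark = SparkSession.builder()
  .appName("rdd-vs-dataframe")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// RDD API: low-level, fine-grained control over each record.
val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
val doubledRdd = rdd.map { case (k, v) => (k, v * 2) }

// DataFrame API: named columns; Catalyst can optimize this plan.
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
val doubledDf = df.selectExpr("key", "value * 2 AS value")
```

Both produce the same logical result, but only the DataFrame version goes through the Catalyst optimizer.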
How does reduceByKey differ from groupByKey?
reduceByKey combines values with the same key locally on each partition before shuffling, reducing the amount of data transferred across the network. groupByKey shuffles all key-value pairs first and then groups them, which can cause memory issues with large datasets. Always prefer reduceByKey or aggregateByKey over groupByKey for aggregation operations; the per-partition pre-aggregation often makes them several times faster, especially on large or skewed keyed data.
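A side-by-side sketch of the two approaches, assuming an existing SparkContext named `sc`:

```scala
// Sample keyed data; values here are illustrative.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

// reduceByKey: sums values per key locally on each partition,
// then shuffles only the partial sums across the network.
val sums = pairs.reduceByKey(_ + _)

// groupByKey: ships every individual pair across the network
// first, then aggregates; avoid this pattern for large data.
val sumsViaGroup = pairs.groupByKey().mapValues(_.sum)
```

Both yield ("a", 3) and ("b", 3), but reduceByKey moves far less data during the shuffle.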
What is a watermark in Spark Structured Streaming?
A watermark defines the threshold for how late data can arrive and still be included in window aggregations. For example, withWatermark("timestamp", "10 minutes") tells Spark to wait up to 10 minutes for late events. Events arriving later than the watermark threshold may be dropped from the aggregation. This enables Spark to clean up old state and prevent unbounded memory growth in long-running streaming queries.
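A windowed count with a watermark might look like the following sketch, assuming a streaming DataFrame `events` with `timestamp` and `word` columns (both names are illustrative):

```scala
import org.apache.spark.sql.functions.{col, window}

// Tolerate up to 10 minutes of event lateness; state for windows
// older than the watermark can then be dropped by Spark.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
  .count()
```

The watermark column must be the same event-time column used in the window for state cleanup to apply.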
When should I use cache() vs persist() in Spark?
cache() uses the default storage level: for RDDs this is MEMORY_ONLY, while for Datasets and DataFrames it is MEMORY_AND_DISK. persist() allows you to choose the storage level explicitly: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, or serialized variants. Use cache() for data that fits in memory and is reused multiple times. Use persist() with an explicit level when you need control, for example MEMORY_AND_DISK_SER to trade CPU for a smaller memory footprint, so overflow spills to disk instead of being recomputed.
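A short sketch of both calls, assuming two existing DataFrames (`lookupDf` and `largeDf` are illustrative names):

```scala
import org.apache.spark.storage.StorageLevel

// Dataset/DataFrame cache() defaults to MEMORY_AND_DISK;
// RDD cache() defaults to MEMORY_ONLY.
lookupDf.cache()

// Explicit storage level: serialized, spilling to disk on overflow.
largeDf.persist(StorageLevel.MEMORY_AND_DISK_SER)

largeDf.count()       // an action materializes the cached data
largeDf.unpersist()   // release the cache when no longer needed
```

Caching is lazy: nothing is stored until an action forces evaluation.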
What is the difference between repartition() and coalesce()?
repartition(n) performs a full shuffle to create exactly n partitions and can increase or decrease the partition count. coalesce(n) only decreases partitions by combining existing ones without a full shuffle, making it much more efficient. Use coalesce when reducing partitions (e.g., before writing a single output file) and repartition when increasing partitions or redistributing data by a column.
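Both calls in context, as a sketch assuming an existing DataFrame `df` (the column name and output path are illustrative):

```scala
import org.apache.spark.sql.functions.col

// Full shuffle: 200 partitions, co-locating rows by customer_id.
val redistributed = df.repartition(200, col("customer_id"))

// No full shuffle: merge existing partitions down to one,
// e.g. to produce a single output file.
val compacted = df.coalesce(1)
compacted.write.mode("overwrite").parquet("output/")
```

Note that coalesce(1) also reduces write parallelism to a single task, so use it only when a single file is genuinely required.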
How do I register and use a UDF in Spark SQL?
Register a UDF with spark.udf.register("funcName", (input: Type) => transformation). Once registered, use it in SQL queries: spark.sql("SELECT funcName(column) FROM table"). For the DataFrame API, use udf() to create a Column function. Note that UDFs are black boxes to the Catalyst optimizer and prevent certain optimizations, so prefer built-in functions when possible.
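Both registration styles in one sketch, assuming a SparkSession `spark` and a DataFrame `df`; the function name, table, and column are illustrative:

```scala
import org.apache.spark.sql.functions.{col, udf}

// SQL-side registration: usable in spark.sql() queries.
spark.udf.register("normalize", (s: String) => s.trim.toLowerCase)
spark.sql("SELECT normalize(name) FROM people")

// DataFrame-side: wrap the same function as a Column expression.
val normalizeUdf = udf((s: String) => s.trim.toLowerCase)
df.select(normalizeUdf(col("name")).as("name"))
```

Because Catalyst cannot see inside either form, check functions like lower() and trim() in org.apache.spark.sql.functions before reaching for a UDF.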
What output modes are available in Structured Streaming?
Spark offers three output modes: "append" writes only new rows since the last trigger and works with queries without aggregations or with watermarked aggregations. "complete" rewrites the entire result table each trigger and is supported only for aggregation queries; it is the usual choice for non-watermarked aggregations. "update" writes only rows that changed since the last trigger. Choose based on your query type and downstream sink requirements.
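Selecting a mode is a one-line choice on the writer. A sketch, assuming a streaming aggregation DataFrame `counts`:

```scala
import org.apache.spark.sql.streaming.Trigger

// Emit only changed rows every 30 seconds to the console sink.
val query = counts.writeStream
  .outputMode("update")                          // or "append" / "complete"
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()
```

Not every sink accepts every mode; for example, file sinks support only append.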
How do I configure spark-submit for a YARN cluster?
Use --master yarn --deploy-mode cluster for production deployments. Key parameters include --num-executors (number of executor containers), --executor-memory (RAM per executor, e.g., 4g), --executor-cores (CPU cores per executor), and --driver-memory. For dynamic allocation, set spark.dynamicAllocation.enabled=true instead of fixing executor count. Always test with --deploy-mode client first for easier debugging.