Spark is now available on LetsData
Today, we are announcing the availability of Apache Spark on LetsData. You can now run Spark on LetsData, leveraging LetsData's Serverless task infrastructure built on AWS Lambda. You don't need to create and manage clusters, job run schedules, or any dedicated infrastructure. Your Spark code will just work out of the box - no JAR issues, classpath problems, or elaborate session and cluster configurations. LetsData Spark jobs create Lambda compute on demand and scale elastically.
Here are the docs to get you started with Spark on LetsData:
Spark Compute Engine Docs, S3 Spark Read Connector Docs and S3 Spark Write Connector Docs
LetsData Spark Interface Implementation Examples: Java, Python
How It Works
LetsData's Spark interfaces are inspired by the original Google MapReduce paper. We have defined a MAPPER interface and a REDUCER interface.
Here is the Spark dataset architecture diagram, showing how the mapper and reducer tasks process a dataset using Spark.
Recall that a dataset's amount of work is defined by a manifest file. For example, for an S3 read destination, this is essentially the list of files that need to be processed.
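For illustration, such a manifest might look something like the sketch below (the file names and exact manifest format here are placeholders, not the actual schema; see the S3 Spark Read Connector docs for the real format):

```
# Illustrative only - placeholder file names and format
s3://<read-bucket>/crawl-archives/crawl-archive-file-1.warc.gz
s3://<read-bucket>/crawl-archives/crawl-archive-file-2.warc.gz
s3://<read-bucket>/crawl-archives/crawl-archive-file-3.warc.gz
s3://<read-bucket>/crawl-archives/crawl-archive-file-4.warc.gz
```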
In this example, the manifest specifies 4 files that need to be processed.
Mapper Tasks
Each file in the manifest becomes a separate Mapper Task in LetsData.
And each Mapper Task runs the mapper interface code.
The mapper interface implements single partition operations (narrow transformations).
The Dataframe returned by the Mapper Task is written to S3 as an intermediate file by LetsData.
In this example:
4 files in the manifest -> 4 Mapper Tasks -> 4 intermediate files.
The interfaces themselves are quite simple; the exact definitions are in the interface docs linked above.
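As a rough sketch of the mapper interface's shape (the interface and parameter names below are assumptions for illustration, not the actual LetsData API):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of the mapper contract (illustrative names, not the real API).
public interface SparkMapperInterface {
    // One call per file in the manifest; narrow (single partition) transformations only.
    // The returned Dataset is written by LetsData to S3 as this task's intermediate file.
    Dataset<Row> mapPartitions(SparkSession spark, String manifestFileUri);
}
```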
A sample implementation reads a web crawl archive file, extracts the documents, and returns them to be written to S3 as an intermediate file.
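A simplified sketch of such a mapper (using the hypothetical interface above; the WARC parsing here is deliberately minimal and not production grade):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

// Hypothetical sketch: reads the web crawl archive file for this task line by line,
// pulls the language and content length out of the record headers, and returns the
// extracted documents. LetsData writes the returned Dataset to S3 as the intermediate
// file for this mapper task.
public class WebCrawlArchiveMapper implements SparkMapperInterface {
    @Override
    public Dataset<Row> mapPartitions(SparkSession spark, String manifestFileUri) {
        // One file per mapper task; .gz archives are decompressed by Spark on read.
        Dataset<String> lines = spark.read().textFile(manifestFileUri);

        // Narrow transformation: walk the lines of each partition, track the current
        // record's header fields, and emit a (language, contentLength) pair per record.
        // A real implementation would parse full WARC records (headers + payload).
        Dataset<Tuple2<String, Long>> docs = lines.mapPartitions(
                (MapPartitionsFunction<String, Tuple2<String, Long>>) iterator -> {
                    List<Tuple2<String, Long>> records = new ArrayList<>();
                    String language = null;
                    Long contentLength = null;
                    while (iterator.hasNext()) {
                        String line = iterator.next();
                        if (line.startsWith("WARC/")) {
                            // A new record starts; emit the previous one if complete.
                            if (language != null && contentLength != null) {
                                records.add(new Tuple2<>(language, contentLength));
                            }
                            language = null;
                            contentLength = null;
                        } else if (line.startsWith("Content-Length:")) {
                            contentLength = Long.parseLong(
                                    line.substring("Content-Length:".length()).trim());
                        } else if (line.startsWith("Content-Language:")) {
                            language = line.substring("Content-Language:".length()).trim();
                        }
                    }
                    if (language != null && contentLength != null) {
                        records.add(new Tuple2<>(language, contentLength));
                    }
                    return records.iterator();
                },
                Encoders.tuple(Encoders.STRING(), Encoders.LONG()));

        return docs.toDF("language", "contentLength");
    }
}
```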
Reducer Task
LetsData creates a Reducer Task to run any reduce operations for the dataset.
The reducer reads the intermediate files from the mapper phase and performs any multi-partition operations (shuffles, aggregates, joins, or similar wide transformations) on them.
The dataset returned by the Reducer Task is written to the dataset's write destination.
In this example:
4 intermediate files from the mapper tasks -> 1 output file from the reducer task.
The full reducer interface is defined in the interface docs linked above.
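A rough sketch of its shape (again, the interface and parameter names are assumptions for illustration, not the actual LetsData API):

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of the reducer contract (illustrative names, not the real API).
public interface SparkReducerInterface {
    // Called once with the locations of all intermediate files from the mapper tasks.
    // The returned Dataset is written by LetsData to the dataset's write destination.
    Dataset<Row> reducePartitions(SparkSession spark, List<String> intermediateFileUris);
}
```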
A sample implementation reads the intermediate files from S3, computes the reduce operations, and returns the results to be written to S3.
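A simplified sketch of the reducer for the web crawl example (using the hypothetical interface above, and assuming the intermediate files are stored in a columnar format such as Parquet):

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.percentile_approx;

// Hypothetical sketch: reads the mapper tasks' intermediate files from S3, runs the
// wide transformation (group by language, 90th percentile of contentLength), and
// returns the result. LetsData writes the returned Dataset to the write destination.
public class WebCrawlArchiveReducer implements SparkReducerInterface {
    @Override
    public Dataset<Row> reducePartitions(SparkSession spark, List<String> intermediateFileUris) {
        // Read all intermediate files into a single Dataset (the format is assumed here).
        Dataset<Row> docs = spark.read()
                .parquet(intermediateFileUris.toArray(new String[0]));

        // Wide transformation: shuffle by language and compute the approximate
        // 90th percentile of contentLength for each language.
        return docs.groupBy(col("language"))
                   .agg(percentile_approx(col("contentLength"), lit(0.9), lit(10000))
                           .alias("p90ContentLength"));
    }
}
```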
Config
You can configure LetsData datasets to run MAPPER_AND_REDUCER tasks, MAPPER_ONLY tasks, or REDUCER_ONLY tasks. The full config schema is in the Spark Compute Engine docs.
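For a sense of the shape, the Spark compute engine portion of a dataset configuration might look something like the sketch below (illustrative only; these field names are assumptions, not the actual schema):

```
// Illustrative shape only - field names are assumptions, see the Spark Compute Engine docs
{
  "computeEngine": {
    "computeEngineType": "Spark",
    "runConfiguration": "MAPPER_AND_REDUCER",
    "mapperInterface": "com.example.WebCrawlArchiveMapper",
    "reducerInterface": "com.example.WebCrawlArchiveReducer"
  }
}
```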
High-level Overview of How Spark works on Clusters
At a high level, Spark is usually run on a cluster of machines where the cluster resources are divided into spark master, driver, and worker processes. Each of these processes is allocated resources (memory, CPU, etc.) according to its expected workload.
A Spark cluster can run many applications; each application is scheduled by the master to get a driver and some workers. The cluster can run as many applications as the driver / worker node divisions allow. For example, in a 4-machine cluster where each machine's memory is divided to host 4 workers, the cluster will have 16 workers (assuming that drivers and masters have already been accounted for). If the applications being run request 8 workers each, we can run 2 applications concurrently on such a cluster. Additional applications will wait for resources to become available.
The application's driver coordinates the application run and worker management. It sends the tasks that need to be accomplished to the workers. For narrow transformations (single partition), a task is contained to a single worker. For wide transformations (shuffle, repartition), the driver coordinates data from different partitions and assigns these reduce tasks to workers. Essentially, the driver is the brain that coordinates the data processing and efficiently manages the workers.
How LetsData Does It
LetsData uses the same primitive constructs, but stitches the dataset workflow a little differently. Here is what we do:
Scope Work: Since the files that are to be read are specified in the manifest, at a high level LetsData knows the number of readers (tasks) that it needs to initialize. We currently map each file to a single mapper task.
Mapper Task: Each of these tasks (mappers) runs as a DataTask function on AWS Lambda. We install Spark on each Lambda function in standalone mode. This essentially means that each function (10 GB memory, 10 GB disk) has its own master, driver, and workers for this task (which maps to a single file). Since each task is a self-contained Spark instance, it can only perform narrow transformations (and will produce incomplete results for any cross-partition actions such as repartition etc.). The user's Spark code is run to compute the intermediate result for the file / partition. This intermediate result is stored in S3 and will be read by the reducer task.
Reducer Task: We create a reducer task that runs the user's wide transformations (cross-partition code). This task is again a DataTask on AWS Lambda where we've installed Spark (with 10 GB memory, 10 GB disk), and which has its own master, driver, and workers for this reducer task. It reads all the intermediate files and computes the final result using the user's code. This final result is written to the write destination.
For now, we've traded the smarts Spark has built for computing an efficient query execution plan across a dataset for task-level efficiencies. We believe that with elastically available workers, the parallelism can be sufficient in terms of performance. With usage-based costing, we don't pay for idle clusters, we don't wait on a fully saturated cluster, and on the surface the cost efficiencies seem beneficial. Theoretically this seems like a good architecture (and we've seen good results in the tests we have run); however, we'd like to run it with larger datasets to see how the system performs and what we can improve. In the future, the standalone Spark instances could be replaced with workers, and LetsData could become the brain (driver) for the dataset, sending work to the workers (a bigger project for another day).
Defaults: We've built in default Spark sessions, default read and write code, and task management that should make getting started really easy. JAR incompatibilities, CLASSPATH issues, and other setup and runtime issues have been standardized in our images, so users should not have to deal with them.
Errors, Logs, Metrics and Checkpointing: Spark tasks do not yet support the LetsData record errors infrastructure (where individual record errors are sent to the error destination as JSON error detail files) or the task checkpointing infrastructure (where tasks periodically checkpoint their progress and support resume-from-last-checkpoint semantics). So, to answer the general questions about errors, logs, metrics and checkpointing in Spark tasks:
Logging: A logger is available; logs are sent to CloudWatch and made available there.
Metrics: There is no metrics support within the mapper interface yet. We generate some high-level metrics, and we hope to enable the Spark metrics soon.
Errors: The LetsData dataset errors are currently not available for Spark interfaces. You decide how you want to deal with errors - for example, write separate error files (using the same write credentials) or log them to the log file (see the sketch after this list). Any unhandled / terminal failures can be thrown as exceptions; the task will record these and transition to the error state.
Checkpointing: LetsData checkpointing and restart-from-checkpoint are not available yet for Spark interfaces. Interfaces either run to completion or fail, in which case the intermediate progress isn't used for the reduce. If rerun, the tasks will run from the beginning and overwrite any intermediate progress.
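As an illustration of the do-it-yourself error handling mentioned above (a sketch only - this is ordinary Spark code, not a LetsData error API; the column names follow the hypothetical mapper example earlier):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;

// Sketch of do-it-yourself error handling inside a mapper or reducer: write the
// records that fail validation to a separate error location (any S3 prefix the
// task's write credentials can reach) and continue with the records that passed.
public final class ErrorHandlingSketch {
    public static Dataset<Row> splitAndSaveErrors(Dataset<Row> docs, String errorOutputUri) {
        Dataset<Row> badRecords = docs.filter(
                col("contentLength").isNull().or(col("language").isNull()));
        if (!badRecords.isEmpty()) {
            // errorOutputUri is a hypothetical, caller-chosen error prefix.
            badRecords.write().mode("overwrite").json(errorOutputUri);
        }
        // Keep only the records that passed validation.
        return docs.filter(col("contentLength").isNotNull().and(col("language").isNotNull()));
    }
}
```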
Example
A working, step-by-step example for Spark on LetsData is available on the LetsData docs website. Here is the problem description that the example solves:
“we'll read files (web crawl archive files) from S3 using Spark code and extract the web crawl header and the web page content as a LetsData Document. We'll then map reduce these documents using Spark to compute the 90th percentile contentLength grouped by language and write the results as a json document to S3”
Conclusions
We are amazed by Spark and the scale, ease and facilities it provides for data processing. With the simplifications and Serverless support that we’ve added, we are actually fairly excited about what we have built.
We’d love for customers to try it out and let us know what works and what doesn’t, so we can improve. We’d also like to know how folks have productized Spark for their data processing, and any pros and cons from their work with Spark.
Thoughts / comments? Let us know.