Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2012)

A distributed memory abstraction for performing in-memory computations on a large cluster. It only allows coarse-grained transformations (batch operations applied to the whole dataset), but in exchange gains efficient fault tolerance: lost partitions are recomputed from their lineage rather than recovered from replicas.

RDD Abstraction

An RDD is a read-only, partitioned collection of records, created only through deterministic operations (transformations) on data in stable storage or on other RDDs.

Spark Programming Interface

Spark lets programmers create RDDs from data in stable storage, then apply methods on them: transformations (e.g. map, filter), which lazily define a new RDD, and actions (e.g. count, collect, save), which return a value or export data and trigger actual computation.

lines = spark.textFile("hdfs://...")          // defines the RDD; nothing is loaded into RAM yet
errors = lines.filter(_.startsWith("ERROR"))  // derives a new RDD from lines; still nothing in RAM
errors.persist()                              // the filtered dataset is small, so keep it in RAM to greatly speed up future computations on it
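
Nothing actually runs until an action is invoked. The paper's console-log example continues roughly as below (building on the lines and errors RDDs defined above):

errors.count()                       // action: triggers computation, returns the number of error lines

errors.filter(_.contains("MySQL"))   // further transformations stay lazy...
      .map(_.split('\t')(3))         // extract the 4th tab-separated field from each line
      .collect()                     // ...until this action computes and returns the results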

Fine-Grained vs Coarse-Grained

A fine-grained interface (reads and writes to arbitrary memory locations, as in DSM) is flexible but expensive to make fault-tolerant; a coarse-grained interface applies the same bulk operation (map, filter, join, ...) to the whole dataset, which is what lets Spark log cheap lineage instead of replicating data.

RDD vs DSM

Distributed Shared Memory is a much more general abstraction, with fine-grained reads and writes to arbitrary locations, which makes it harder to implement efficiently and to make fault-tolerant: DSM typically needs checkpointing and rollback, whereas RDDs recover via lineage.

Spark

Spark exposes RDDs through a language-integrated API in Scala, chosen for its combination of static typing (efficiency) and conciseness (convenient for interactive use).

Representation of Resilient Distributed Datasets

The key is to capture an RDD's lineage graph through a formalized definition. The paper does this with a simple, common graph-based interface for every RDD: a set of partitions, a set of dependencies on parent RDDs (narrow, where each parent partition feeds at most one child partition, vs. wide, where it feeds many), a function for computing a partition from its parents, and metadata about the partitioning scheme and preferred data locations. The scheduler then needs no transformation-specific logic for each new RDD type.
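
A minimal Scala sketch of that common interface, following Table 3 of the paper (the Partition, Dependency and Partitioner types here are simplified stand-ins, not the real Spark classes):

trait Partition { def index: Int }   // one atomic piece of the dataset
trait Dependency                     // an edge to a parent RDD (narrow or wide)
trait Partitioner                    // hash- or range-partitioning metadata

abstract class RDD[T] {
  def partitions: Seq[Partition]     // list of partitions
  def dependencies: Seq[Dependency]  // lineage edges to parent RDDs
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]
                                     // compute a partition from its parents
  def preferredLocations(p: Partition): Seq[String]
                                     // data-locality hints (e.g. HDFS block hosts)
  def partitioner: Option[Partitioner]
                                     // whether/how records are partitioned
}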

Implementation

Spark is implemented in roughly 14,000 lines of Scala and runs on the Mesos cluster manager, a thin "distributed kernel" sitting at a lower layer of abstraction, which lets Spark share cluster resources with other frameworks such as Hadoop and MPI.

Job Scheduler

When an action runs, the scheduler examines the RDD's lineage graph and builds a DAG of stages: narrow dependencies are pipelined within a stage, while wide (shuffle) dependencies form the stage boundaries. Tasks are assigned to machines based on data locality (via delay scheduling), which works especially well for narrow dependencies, since each task needs only a single parent partition.
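
A sketch of where stage boundaries fall, using a hypothetical word-count pipeline over the lines RDD from earlier (flatMap/map are narrow and get pipelined; reduceByKey is wide and starts a new stage):

val words  = lines.flatMap(_.split(" "))  // narrow: pipelined within the stage
val pairs  = words.map(w => (w, 1))       // narrow: same stage, no data movement
val counts = pairs.reduceByKey(_ + _)     // wide: shuffle, so the scheduler cuts a stage here
counts.count()                            // action: builds the stage DAG and runs it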

Interpreter Integration

Scala's interactive interpreter works by compiling a class for each line the user types and invoking a function on it; the class for, e.g., Line1 includes a singleton object containing that line's variables and functions. Spark changes the interpreter in two ways: generated classes are served to workers over HTTP, and code generation is modified to reference each line's singleton instance directly, so closures shipped to workers capture the right state.
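
Roughly what the generated wrappers look like, simplified from the paper's example (top-level objects here stand in for the real class-plus-singleton structure, and rdd is assumed to be defined on an earlier line):

// user types:  var query = "hello"
object Line1 { var query = "hello" }   // singleton holding this line's state

// user types:  rdd.filter(_.contains(query))
object Line2 {
  // Spark's modified code generation references Line1's singleton directly, so the
  // closure shipped to workers (which fetch these classes over HTTP) sees query's value
  val result = rdd.filter(_.contains(Line1.query))
}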

Memory Management

Persistent RDDs can be stored three ways: as deserialized Java objects in memory (fastest), as serialized data in memory (more memory-efficient), or serialized on disk. When memory runs low, Spark evicts partitions with an LRU policy at the RDD level.
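
In the released Spark API (which postdates the paper and may differ in detail), these options surface as storage levels passed to persist(); a sketch:

import org.apache.spark.storage.StorageLevel

// the storage level is fixed per RDD; pick one of:
errors.persist(StorageLevel.MEMORY_ONLY)  // deserialized Java objects in memory (default, fastest)
// StorageLevel.MEMORY_ONLY_SER           // serialized bytes in memory: slower access, less space
// StorageLevel.DISK_ONLY                 // serialized partitions kept on disk only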

Checkpointing

Lineage graphs can always be used to recompute lost partitions, but recovery gets expensive for jobs with long lineage chains, particularly across wide dependencies, where one lost partition may require recomputing inputs scattered over the whole cluster. Spark therefore provides an API for checkpointing RDDs to stable storage (a REPLICATE flag to persist), but leaves the choice of which RDDs to checkpoint to the user.
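
In released Spark this surfaces as checkpoint() on the RDD plus a checkpoint directory on the context (API names from released Spark rather than the paper; the HDFS path stays elided as in the notes above):

sc.setCheckpointDir("hdfs://...")  // stable storage for checkpoint files
errors.checkpoint()                // mark errors for checkpointing to that directory
errors.count()                     // next action computes the RDD and writes the checkpoint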