Introduction to Technologies


Spark is a very fast in-memory data processing framework, advertised as up to ~100x faster than Hadoop MapReduce for in-memory workloads

Spark and the Resilient Distributed Dataset (RDD)

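Transformations and actions are the two types of RDD operations: transformations lazily define a new RDD, while actions trigger the computation and return a result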
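
The difference between the two operation types can be sketched without a Spark installation. The snippet below is a toy model in plain Python, not the Spark API: the `LazyDataset` class is hypothetical and only borrows the method names `map`, `filter`, `collect` and `count` from RDDs to show that transformations merely record work, while actions run it.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = list(source)
        self._steps = tuple(steps)  # pending transformations, not yet executed

    # Transformations: return a new dataset and compute nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Chain the recorded steps into one lazy iterator pipeline.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded pipeline and return a plain result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every invocation of square() to make laziness visible


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = len(calls)  # square() has not run yet
result = evens_squared.collect()  # the action triggers the computation
```

In this toy model, as in Spark when no caching is used, each action re-runs the whole pipeline.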

Background: OLTP and OLAP

Transactional (OLTP) and analytical (OLAP) describe the intended use of a data system

Spark Task Distribution Optimization

Partitioning the data well is important for performance, since partitions determine how work is distributed across tasks
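
As a rough illustration of why partitioning matters, the sketch below assigns keyed records to partitions by hashing the key. It is plain Python rather than Spark code, and the helper names `partition_for` and `partition_records`, the sample records, and the use of CRC32 as the hash are all assumptions made for the example. All records sharing a key end up in the same partition, so per-key work can proceed without moving data between partitions.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # CRC32 is used only because it is stable across runs (an assumption
    # of this sketch, not a statement about Spark's own partitioners).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, num_partitions):
    # Group (key, value) pairs by the partition their key hashes to.
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


records = [("us", 1), ("de", 2), ("us", 3), ("fr", 4), ("de", 5)]
parts = partition_records(records, num_partitions=3)

# Check where each key ended up: every key maps to exactly one partition.
key_locations = defaultdict(set)
for partition_id, rows in parts.items():
    for key, _ in rows:
        key_locations[key].add(partition_id)
```

The same mechanism also shows the risk: if one key dominates the data, its partition becomes much larger than the others and the task that processes it becomes the bottleneck.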

Setup

```
curl -o ~/Downloads/spark-2.1.0-bin-hadoop2.7.tgz http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz