Hadoop

Back to Data-Science

A framework for storing and processing big data, mainly through batch processing. It is useful for creating and managing distributed systems over a network of computers, and provides several useful layers, such as a distributed file system (HDFS) and the MapReduce programming model. Written in Java. The logo is a baby elephant.


Big Data Analysis Approaches

Simple

A single computer stores all the data, for example in a relational database, and users connect to that server to access the information. This either limits the volume of data that can be handled or drives up costs as the data grows.

Distributed System

Google was the first to implement a MapReduce algorithm. It divides the data into small parts and assigns them to many computers in a network, all coordinated by a centralised system. This is what Hadoop originates from. By using MapReduce, Hadoop applies parallel computing to process data, running "clusters" of computers that perform complex tasks on large volumes of data.

Introduction to Hadoop

There are four core modules in Hadoop, which provide its main components: MapReduce (distributed computation), HDFS (distributed storage), YARN (resource management and job scheduling), and Hadoop Common (shared utilities).

More on the MapReduce Algorithm

A MapReduce program is composed of a Map() method, which reformats data into key-value pairs (tuples), and a Reduce() method, which combines tuples into smaller sets of tuples. The MapReduce framework has a single master JobTracker and one slave TaskTracker per cluster node.

Map applies a function to each element in a list. Reduce performs a fold on the list, combining all elements with a binary function.
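As a minimal illustration of these two operations (plain Python, not Hadoop code — Python's built-in map and functools.reduce mirror the same idea):

```python
from functools import reduce

# Map: apply a function to each element of a list.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# Reduce: fold the list into one value with a binary function.
total = reduce(lambda acc, x: acc + x, squares, 0)  # 30
```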

Hadoop Distributed File System

Provides fault-tolerant file management over a distributed system of hundreds of computers, by splitting files into large blocks and replicating each block across several nodes.
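A toy sketch of that block-placement idea (the 128 MB block size and replication factor of 3 match HDFS defaults, but the round-robin placement here is purely illustrative, not real HDFS logic):

```python
def place_blocks(file_size, nodes, block_size=128, replication=3):
    """Split a file (sizes in MB) into blocks and assign each block
    to `replication` distinct nodes, round-robin for simplicity."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

# A 300 MB file becomes 3 blocks, each stored on 3 of the 4 nodes,
# so losing any single node never loses a block.
plan = place_blocks(300, ["node1", "node2", "node3", "node4"])
```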

How Hadoop Works

Stage 1: Specify a job, with the locations of the input and output files, the map and reduce functions, and the job configuration.

Stage 2: The Hadoop job client submits the job (typically a JAR containing the compiled map and reduce code) and its configuration to the JobTracker, which distributes the work to the slaves while scheduling tasks and monitoring their status.

Stage 3: The TaskTrackers on the different nodes execute the map and reduce tasks and store the output data.
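The three stages can be sketched end-to-end with word count, the canonical MapReduce example (a single-process simulation — the "nodes" here are just list partitions, not a real cluster, and the shuffle step is the grouping-by-key that Hadoop performs between map and reduce):

```python
from collections import defaultdict

def map_fn(line):
    # Each mapper emits a (word, 1) tuple for every word in its input split.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Each reducer combines all the tuples that share a key.
    return (word, sum(counts))

splits = ["big data big", "data data"]      # input splits, one per "node"
mapped = [pair for split in splits for pair in map_fn(split)]

shuffled = defaultdict(list)                # shuffle: group values by key
for word, count in mapped:
    shuffled[word].append(count)

result = dict(reduce_fn(w, c) for w, c in shuffled.items())
# result == {"big": 2, "data": 3}
```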

Advantages of Hadoop