A framework for storing and processing big data, mainly for batch processing. Useful for creating and managing distributed systems over a network of computers. Provides several useful layers, such as a distributed file system (HDFS) and the MapReduce programming model. Written in Java. The logo is a baby elephant.
In the traditional approach, a single computer stores all the data, for example in a relational database, and users connect to that server to access the information. This limits the volume of data that can be handled, or drives costs up as the data grows.
Google was the first to implement the MapReduce algorithm. It divides the data into small parts and assigns them to many computers in a network, all coordinated by a centralised system; this is what Hadoop originates from. By using MapReduce, Hadoop applies parallel computing to data processing, running "clusters" of computers that perform complex tasks on large volumes of data.
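The divide-and-combine idea can be sketched in plain Java (a conceptual illustration only, not Hadoop code): split the input into small parts, let several workers process them in parallel, and merge the partial results in one central place.

```java
import java.util.*;
import java.util.concurrent.*;

// Toy illustration of the divide-and-combine idea behind MapReduce:
// hand small parts of the data to parallel workers, then merge the
// partial results centrally.
public class ParallelWordCount {
    public static void main(String[] args) throws Exception {
        List<String> lines = List.of(
                "to be or not to be",
                "that is the question",
                "to be is to do");

        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<Future<Map<String, Integer>>> partials = new ArrayList<>();

        // Divide: each line (a "small part" of the data) goes to a worker.
        for (String line : lines) {
            Callable<Map<String, Integer>> task = () -> countWords(line);
            partials.add(pool.submit(task));
        }

        // Combine: the centralised step merges the partial counts.
        Map<String, Integer> total = new HashMap<>();
        for (Future<Map<String, Integer>> f : partials) {
            f.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        pool.shutdown();
        System.out.println(total);
    }

    private static Map<String, Integer> countWords(String line) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : line.split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```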
There are four core modules in Hadoop, which provide its main components: MapReduce (distributed computation), HDFS (distributed storage), YARN (resource management and job scheduling), and Hadoop Common (shared utilities).
A MapReduce program is composed of a Map() method, which reformats data into key-value pairs (tuples), and a Reduce() method, which combines those tuples into smaller sets of tuples. The MapReduce framework has a single master JobTracker and one slave TaskTracker per cluster node.
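As a sketch of what those two methods look like in practice, here is a minimal word-count example against the public org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are illustrative, and both classes are shown in one block for brevity.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): turns each input line into (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // emit a (word, 1) tuple
        }
    }
}

// Reduce(): combines all tuples that share a key into a smaller set of tuples.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));   // emit (word, total count)
    }
}
```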
Map applies a function to each element in a list. Reduce applies a fold to the list, combining all elements with a binary function.
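The names come from functional programming, and the same two operations exist on ordinary Java streams, which is a convenient way to see the idea on a single machine:

```java
import java.util.List;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<String> words = List.of("big", "data", "hadoop");

        // Map: apply a function to each element of the list.
        List<Integer> lengths = words.stream()
                .map(String::length)
                .toList();                        // [3, 4, 6]

        // Reduce (fold): combine all elements with a binary function.
        int totalLength = lengths.stream()
                .reduce(0, Integer::sum);         // 0 + 3 + 4 + 6 = 13

        System.out.println(lengths + " -> " + totalLength);
    }
}
```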
HDFS provides a system for file management over a distributed system of hundreds of computers.
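A sketch of how a client program might use that file system through Hadoop's org.apache.hadoop.fs.FileSystem API; the namenode address and paths are made-up placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point fs.defaultFS at the cluster's namenode.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Create a directory and copy a local file into the distributed file system.
        Path dir = new Path("/user/demo/input");
        fs.mkdirs(dir);
        fs.copyFromLocalFile(new Path("data.txt"), new Path(dir, "data.txt"));

        // List what is stored under the directory; the blocks themselves are
        // spread across the cluster's datanodes, but the API hides that.
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```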
Stage 1: Specify a job, with the locations of the input and output files, the map and reduce functions, and the job configuration (see the driver sketch after Stage 3).
Stage 2: The Hadoop job client submits the job (e.g. a JAR or executable) and its configuration to the JobTracker, which distributes the work to the slaves while monitoring their status.
Stage 3: The TaskTrackers on the different nodes execute the map and reduce tasks and store the output data.
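A minimal driver class illustrating these stages with the modern org.apache.hadoop.mapreduce API. It assumes the WordCountMapper and WordCountReducer sketched earlier; current Hadoop versions replace JobTracker/TaskTracker with YARN, but the flow of configuring and submitting a job is the same.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // Stage 1: specify the job - input/output locations, map and reduce
        // classes, and the configuration.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /user/demo/output

        // Stages 2 and 3: submit the packaged job to the cluster, which schedules
        // the map and reduce tasks on the worker nodes and writes the output.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, it would be launched with something like hadoop jar wordcount.jar WordCountDriver /user/demo/input /user/demo/output, where the paths are placeholders for the HDFS input and output directories.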