Big Data, MapReduce and Hadoop

Big Data

  • There is hardly a domain today where data is not being collected and stored in a warehouse
    • a lot of it is e-commerce and web data
    • e.g. shopping at a grocery store and paying by credit card generates transaction data
  • Data needs to be managed, analysed, summarised and visualised so that we can discover knowledge from it → important for business, e.g. trends on Facebook used for advertising
  • And you want to do it in a timely manner, so scalability is key – you are expected to deal with this data at massive scale

Typical large-data problem

  1. Take massive data and iterate over a large number of records
  2. Extract something of interest → MAP
  3. Shuffle intermediate results
  4. Aggregate intermediate results → REDUCE
  5. Generate final output
    • Key idea – extract in the map step, then aggregate the results in the reduce step (see the sketch after this list)
    • Provide a functional abstraction for these two operations
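
As a rough sketch of that functional abstraction (types and names here are illustrative, not any particular framework's API), the two user-supplied operations look like this:

```python
from typing import Callable, Iterable, Tuple, TypeVar

# Illustrative type parameters for the classic MapReduce signatures.
K1, V1 = TypeVar("K1"), TypeVar("V1")   # input key/value
K2, V2 = TypeVar("K2"), TypeVar("V2")   # intermediate key/value
V3 = TypeVar("V3")                      # output value

# Map: one input record -> a list of intermediate (key, value) pairs.
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

# Reduce: one intermediate key plus all of its values -> output values.
Reducer = Callable[[K2, Iterable[V2]], Iterable[V3]]
```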

What is MapReduce?

  • Data comes at a large scale, and MapReduce is a programming model for expressing distributed computations at massive scale. You want to process the data in parallel, hence you need this programming model.
  • Not only does it do the processing, it also organizes and schedules the computations so that the overall process takes minimum time.
  • Introduced by Google in 2004, and there’s an open-source implementation called Hadoop

Parallelization challenges

  • How do we assign work units to workers?
  • What if we have more work than workers?
  • What if workers need to share partial results? → May need to synchronize
  • How do we aggregate partial results?
  • How do we know if they’ve finished? → some form of synchronization
  • What if workers die?
    • Fault tolerance – need to detect the failure AND need a mechanism to get the work restarted
  • MapReduce has to deal with these problems

MapReduce

[Figure: MapReduce data flow – map, shuffle/sort and reduce]

  • Your code needs exactly 2 functions - Map and Reduce
  • Map – takes an input key/value pair and emits intermediate key/value pairs. All values with the same key are then sent to the same worker for reduction
  • You end up with intermediate results of the form: mapping key k1 with value v1 has produced pairs like a 1, b 2, etc.
  • As you can see, keys like a, b and c appear a number of times in the intermediate results. Hence, we shuffle and sort, aggregating values by key: a with [1,5], b with [2,7] and c with [2,3,6,8].
  • So that they can go into the final reduce operation
  • All of this is done by MapReduce – we only need to write Map and Reduce (see the sketch below)
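
A minimal, single-process Python sketch of this whole flow (purely illustrative – no cluster, no scheduling, no fault tolerance; `map_fn` and `reduce_fn` stand for the two user-written functions):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each input (key, value) record emits intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle/sort phase: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: aggregate each key's values into the final output.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Intermediate pairs like the ones above, grouping to a:[1,5], b:[2,7], c:[2,3,6,8]:
pairs = [("a", 1), ("b", 2), ("c", 2), ("c", 3), ("a", 5), ("c", 6), ("b", 7), ("c", 8)]
print(run_mapreduce(pairs, lambda k, v: [(k, v)], lambda k, vs: sum(vs)))
# {'a': 6, 'b': 9, 'c': 19}
```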

MapReduce “Runtime”

  • Runtime handles scheduling – assigns workers to map and reduce tasks
  • Handles data distribution – moves the process to the data
  • Handles synchronization – gathers, sorts and shuffles intermediate data
  • Handles errors and faults – detects failures and restarts tasks if needed

MapReduce: Partition and Combine

  • Can add 2 more functions - Partition and Combine
  • If you know the data you are dealing with, the number of partitions becomes an important parameter in your code: you can divide up the key space for the parallel reduce operations, because you know which keys to divide by
  • And, for performance reasons, you can run mini-reduces in memory after the map phase. Once the map starts generating data, you can already start the reduction and combine intermediate results
    • An optimization – it reduces network traffic when sharing intermediate results
  • → 2 extra functions you can supply if you know the data (sketched below)
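
A sketch of what these two optional functions might look like for a word-count job (illustrative signatures, not the exact Hadoop API):

```python
def partition(key, num_reducers):
    # Decide which reduce worker receives this key. A hash partitioner is the
    # common default; with knowledge of the key space you can write a custom
    # one that balances the reduce work better.
    return hash(key) % num_reducers

def combine(key, values):
    # "Mini reduce" run in memory on each mapper's local output, so less
    # intermediate data has to travel over the network to the reducers.
    # For word count it does the same thing as the reducer: sum partial counts.
    return [(key, sum(values))]
```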

Logical data flow in 5 processing steps in MapReduce

  • Input data comes as key/value pairs.
  • The Map function generates intermediate key/value pairs, which are then sorted by key. Then you group them, so that key 1 has its n values, …, key x has its n values.
  • We reduce, and we get the final output
  • → Sort and group (see the sketch below)
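
A small illustration of why sorting comes before grouping (using Python's `itertools.groupby`, which only groups adjacent equal keys):

```python
from itertools import groupby
from operator import itemgetter

pairs = [("most", 1), ("ppl", 1), ("ignore", 1), ("most", 1), ("poetry", 1)]

# Sort by key first, otherwise the two "most" pairs would not end up grouped together.
pairs.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group] for key, group in groupby(pairs, key=itemgetter(0))}
print(grouped)  # {'ignore': [1], 'most': [1, 1], 'poetry': [1], 'ppl': [1]}
```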

A word counting example on <Key, Count> distribution

  • [Figure: word counting example with <Key, Count> pairs]
  • Use MapReduce to count the number of words in a book
  • As input, you have ‘most ppl ignore most poetry’ and ‘most poetry ignores most ppl’
  • The Map function emits the individual words as keys – all of these keys are individual words
  • You sort them and group them
    • you realize ignore has 1 instance, ignores has 1 instance
  • By reducing, you count the number of instances of each unique key to find out how many times each word has been mentioned (see the sketch below)
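
Reusing the hypothetical `run_mapreduce` sketch from earlier, this word-count example could be expressed as:

```python
docs = [
    (1, "most ppl ignore most poetry"),
    (2, "most poetry ignores most ppl"),
]

def map_fn(doc_id, text):
    # Emit (word, 1) for every word occurrence in the line.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Sum the 1s to get the total count for each unique word.
    return sum(counts)

print(run_mapreduce(docs, map_fn, reduce_fn))
# {'ignore': 1, 'ignores': 1, 'most': 4, 'poetry': 2, 'ppl': 2}
```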

MapReduce Implementations

  • Google had a proprietary implementation in C++ and provided bindings in Java and Python
  • Hadoop – open-source implementation written in Java, developed at Yahoo and now an Apache project
    • Rapidly expanding ecosystem with tons of variants, e.g. Spark, which claims to be 100 times faster than Hadoop as it runs in memory
  • Lots of custom research implementations
    • e.g. custom implementations of Hadoop that use GPUs as accelerators

Distributed file system

  • MapReduce will work, but underneath it you need a distributed file system – you need to arrange the data so that Hadoop can access and work with it, and make that data available to the framework
  • The data itself can be stored on the local disks of the nodes in the cluster
    • The data is first stored on data nodes and is spread across the cluster (in the school test bed)
    • You then start the workers on the nodes that hold the local data, i.e. the nodes closest to the data (see the scheduling sketch below)
  • Why?
    • Not enough RAM to hold all the data in memory, even though memory would be better for latency
    • Disk access is slow, but disk throughput is reasonable
  • Distributed file system is the answer!
    • GFS (Google File System) – the distributed file system underlying Google’s MapReduce
    • HDFS (Hadoop Distributed File System) – prepares data for parallel processing
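
A toy sketch of the “move the process to the data” idea (names are made up for illustration; real schedulers like Hadoop’s are far more involved):

```python
def pick_worker(replica_nodes, free_workers):
    # Prefer a free worker on a node that already stores a replica of the
    # input block, so the map task reads from local disk; otherwise fall
    # back to any free worker, which then pulls the block over the network.
    local = [w for w in free_workers if w in replica_nodes]
    return local[0] if local else (free_workers[0] if free_workers else None)

# Block replicated on node1, node2 and node7; node2 happens to have a free slot.
print(pick_worker({"node1", "node2", "node7"}, ["node5", "node2"]))  # node2
```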

GFS

  • Assumptions
    • Google always relied on commodity hardware
    • Expected failure to be common.
    • A modest number of huge files; multi-gigabyte files are common
    • Files are write-once, but may be appended
  • How did they design GFS?
    • [Figure: GFS architecture – client, master (metadata) and chunk servers]
    • Picture a GFS client looking for some data. It contacts the GFS master, which holds the metadata, i.e. where the data is. The client then contacts a GFS chunk server. Interestingly, for fault tolerance the data is replicated at least 3 times. Once the data is found, the chunk is passed back to the client (see the read-path sketch below)
    • There’s a single master to coordinate access
    • There is no caching – Google claims there’s no point with such large datasets
    • But this means there’s a single point of failure – the master node. Google says it can detect that the master has failed, and it has a mechanism to assign a new master very quickly
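
A rough sketch of that read path (the `master`/`chunkserver` objects and method names are invented for illustration, not the real GFS interface):

```python
def gfs_read(master, filename, chunk_index):
    # 1. Ask the single master for metadata only: which chunk, and which
    #    chunk servers hold its (at least 3) replicas.
    chunk_id, replica_servers = master.lookup(filename, chunk_index)

    # 2. Fetch the actual bytes directly from one of the chunk servers,
    #    falling over to another replica if one has failed.
    for server in replica_servers:
        try:
            return server.read_chunk(chunk_id)
        except ConnectionError:
            continue
    raise IOError("no reachable replica for chunk %s" % chunk_id)
```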

Summary

  • Cloud computing and large data
  • General idea of MapReduce and large-data processing
  • Importance of underlying distributed file systems
  • GFS as a case study
