Big Data, MapReduce and Hadoop

Big Data

  • There is hardly a domain today where data is not being collected and stored in a warehouse
    • a lot of it is e-commerce and web data
    • e.g. shopping at a grocery store and paying by credit card generates transaction data
  • Data needs to be managed, analysed, summarised and visualised so that we can discover knowledge from it → important for business, e.g. trends on Facebook used for advertising
  • And you want to do it in a timely manner, so scalability is key – you are expected to deal with this data at massive scale

Typical large-data problem

  1. Take massive data and iterate over a large number of records
  2. Extract something of interest → MAP
  3. Shuffle intermediate results
  4. Aggregate intermediate results → REDUCE
  5. Generate final output
    • Key idea – extract in the map step, then aggregate the results in the reduce step (see the sketch after this list)
    • Provide a functional abstraction for these two operations
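
As a rough sketch of that functional abstraction (types and names here are illustrative, not any particular framework's API), the two user-supplied operations look like this:

```python
from typing import Callable, Iterable, Tuple, TypeVar

# Illustrative type parameters for the classic MapReduce signatures.
K1, V1 = TypeVar("K1"), TypeVar("V1")   # input key/value
K2, V2 = TypeVar("K2"), TypeVar("V2")   # intermediate key/value
V3 = TypeVar("V3")                      # output value

# Map: one input record -> a list of intermediate (key, value) pairs.
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]

# Reduce: one intermediate key plus all of its values -> output values.
Reducer = Callable[[K2, Iterable[V2]], Iterable[V3]]
```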

What is MapReduce?

  • Data comes at a large scale, and MapReduce is a programming model for expressing distributed computations at massive scale. You want to process the data in parallel, hence you need this programming model.
  • Not only does it do the processing, it also organizes and schedules the computations so that the overall process takes minimum time.
  • Introduced by Google in 2004, and there’s an open-source implementation called Hadoop

Parallelization challenges

  • How do we assign work units to workers?
  • What if we have more work than workers?
  • What if workers need to share partial results? → May need to synchronize
  • How do we aggregate partial results?
  • How do we know if they’ve finished? → some form of synchronization
  • What if workers die?
    • Fault tolerance – need to detect the failure AND need a mechanism to get the work restarted
  • MapReduce has to deal with these problems

MapReduce

[Figure: MapReduce data flow – map, shuffle/sort and reduce]

  • Your code needs exactly 2 functions - Map and Reduce
  • Map – takes an input key/value pair and emits intermediate key/value pairs. All values with the same key are then sent to the same worker for reduction
  • You end up with intermediate results of the form: mapping key k1 with value v1 has produced pairs like a 1, b 2, etc.
  • As you can see, keys like a, b and c appear a number of times in the intermediate results. Hence, we shuffle and sort, aggregating values by key: a with [1,5], b with [2,7] and c with [2,3,6,8].
  • So that they can go into the final reduce operation
  • All of this is done by MapReduce – we only need to write Map and Reduce (see the sketch below)
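
A minimal, single-process Python sketch of this whole flow (purely illustrative – no cluster, no scheduling, no fault tolerance; `map_fn` and `reduce_fn` stand for the two user-written functions):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: each input (key, value) record emits intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle/sort phase: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: aggregate each key's values into the final output.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Intermediate pairs like the ones above, grouping to a:[1,5], b:[2,7], c:[2,3,6,8]:
pairs = [("a", 1), ("b", 2), ("c", 2), ("c", 3), ("a", 5), ("c", 6), ("b", 7), ("c", 8)]
print(run_mapreduce(pairs, lambda k, v: [(k, v)], lambda k, vs: sum(vs)))
# {'a': 6, 'b': 9, 'c': 19}
```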

MapReduce “Runtime”

  • Runtime handles scheduling – assigns workers to map and reduce tasks
  • Handles data distribution – moves the process to the data
  • Handles synchronization – gathers, sorts and shuffles intermediate data
  • Handles errors and faults – detects failures and restarts tasks if needed

MapReduce: Partition and Combine

  • Can add 2 more functions - Partition and Combine
  • If you know the data you are dealing with, the number of partitions becomes an important parameter in your code: you can divide up the key space for the parallel reduce operations, because you know which keys to divide by
  • And, for performance reasons, you can run mini-reduces in memory after the map phase. Once the map starts generating data, you can already start the reduction and combine intermediate results
    • An optimization – it reduces network traffic when sharing intermediate results
  • → 2 extra functions you can supply if you know the data (sketched below)
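
A sketch of what these two optional functions might look like for a word-count job (illustrative signatures, not the exact Hadoop API):

```python
def partition(key, num_reducers):
    # Decide which reduce worker receives this key. A hash partitioner is the
    # common default; with knowledge of the key space you can write a custom
    # one that balances the reduce work better.
    return hash(key) % num_reducers

def combine(key, values):
    # "Mini reduce" run in memory on each mapper's local output, so less
    # intermediate data has to travel over the network to the reducers.
    # For word count it does the same thing as the reducer: sum partial counts.
    return [(key, sum(values))]
```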

Logical data flow in 5 processing steps in MapReduce

  • Input data comes as key/value pairs.
  • The Map function generates intermediate key/value pairs, which are then sorted by key. Then you group them, so that key 1 has its n values, …, key x has its n values.
  • We reduce, and we get the final output
  • → Sort and group (see the sketch below)
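
A small illustration of why sorting comes before grouping (using Python's `itertools.groupby`, which only groups adjacent equal keys):

```python
from itertools import groupby
from operator import itemgetter

pairs = [("most", 1), ("ppl", 1), ("ignore", 1), ("most", 1), ("poetry", 1)]

# Sort by key first, otherwise the two "most" pairs would not end up grouped together.
pairs.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group] for key, group in groupby(pairs, key=itemgetter(0))}
print(grouped)  # {'ignore': [1], 'most': [1, 1], 'poetry': [1], 'ppl': [1]}
```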

A word counting example on <Key, Count> distribution

  • [Figure: word counting example with <Key, Count> pairs]
  • Use MapReduce to count the number of words in a book
  • As input, you have ‘most ppl ignore most poetry’ and ‘most poetry ignores most ppl’
  • The Map function emits the individual words as keys – all of these keys are individual words
  • You sort them and group them
    • you realize ignore has 1 instance, ignores has 1 instance
  • By reducing, you count the number of instances of each unique key to find out how many times each word has been mentioned (see the sketch below)
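
Reusing the hypothetical `run_mapreduce` sketch from earlier, this word-count example could be expressed as:

```python
docs = [
    (1, "most ppl ignore most poetry"),
    (2, "most poetry ignores most ppl"),
]

def map_fn(doc_id, text):
    # Emit (word, 1) for every word occurrence in the line.
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # Sum the 1s to get the total count for each unique word.
    return sum(counts)

print(run_mapreduce(docs, map_fn, reduce_fn))
# {'ignore': 1, 'ignores': 1, 'most': 4, 'poetry': 2, 'ppl': 2}
```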

MapReduce Implementations

  • Google had a proprietary implementation in C++ and provided bindings in Java and Python
  • Hadoop – open-source implementation written in Java, developed at Yahoo and now an Apache project
    • Rapidly expanding ecosystem with tons of variants, e.g. Spark, which claims to be 100 times faster than Hadoop as it runs in memory
  • Lots of custom research implementations
    • e.g. custom implementations of Hadoop that use GPUs as accelerators

Distributed file system

  • MapReduce will work, but underneath it you need a distributed file system – you need to arrange the data so that Hadoop can access and work with it, and make that data available to the framework
  • The data itself can be stored on the local disks of the nodes in the cluster
    • The data is first stored on data nodes and is spread across the cluster (in the school test bed)
    • You then start the workers on the nodes that hold the local data, i.e. the nodes closest to the data (see the scheduling sketch below)
  • Why?
    • Not enough RAM to hold all the data in memory, even though memory would be better for latency
    • Disk access is slow, but disk throughput is reasonable
  • Distributed file system is the answer!
    • GFS (Google File System) – the distributed file system underlying Google’s MapReduce
    • HDFS (Hadoop Distributed File System) – prepares data for parallel processing
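
A toy sketch of the “move the process to the data” idea (names are made up for illustration; real schedulers like Hadoop’s are far more involved):

```python
def pick_worker(replica_nodes, free_workers):
    # Prefer a free worker on a node that already stores a replica of the
    # input block, so the map task reads from local disk; otherwise fall
    # back to any free worker, which then pulls the block over the network.
    local = [w for w in free_workers if w in replica_nodes]
    return local[0] if local else (free_workers[0] if free_workers else None)

# Block replicated on node1, node2 and node7; node2 happens to have a free slot.
print(pick_worker({"node1", "node2", "node7"}, ["node5", "node2"]))  # node2
```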

GFS

  • Assumptions
    • Google always relied on commodity hardware
    • Expected failure to be common.
    • A modest number of huge files; multi-gigabyte files are common
    • Files are write-once, but may be appended
  • How did they design GFS?
    • [Figure: GFS architecture – client, master (metadata) and chunk servers]
    • Picture a GFS client looking for some data. It contacts the GFS master, which holds the metadata, i.e. where the data is. The client then contacts a GFS chunk server. Interestingly, for fault tolerance the data is replicated at least 3 times. Once the data is found, the chunk is passed back to the client (see the read-path sketch below)
    • There’s a single master to coordinate access
    • There is no caching – Google claims there’s no point with such large datasets
    • But this means there’s a single point of failure – the master node. Google says it can detect that the master has failed, and it has a mechanism to assign a new master very quickly
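
A rough sketch of that read path (the `master`/`chunkserver` objects and method names are invented for illustration, not the real GFS interface):

```python
def gfs_read(master, filename, chunk_index):
    # 1. Ask the single master for metadata only: which chunk, and which
    #    chunk servers hold its (at least 3) replicas.
    chunk_id, replica_servers = master.lookup(filename, chunk_index)

    # 2. Fetch the actual bytes directly from one of the chunk servers,
    #    falling over to another replica if one has failed.
    for server in replica_servers:
        try:
            return server.read_chunk(chunk_id)
        except ConnectionError:
            continue
    raise IOError("no reachable replica for chunk %s" % chunk_id)
```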

Summary

  • Cloud computing and large data
  • General idea of MapReduce and large-data processing
  • Importance of underlying distributed file systems
  • GFS as a case study
