Wednesday, January 28, 2009

MapReduce: An Exclusive Software Framework for Distributed Systems by Google

MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers. The framework is inspired by map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as their original forms. MapReduce libraries have been written in C++, Java, Python and other programming languages.


MapReduce Motivation:

To motivate MapReduce, we will first talk about the functional programming ideas that MapReduce carries over into a distributed setting. MapReduce drew its inspiration from functional languages such as LISP, Haskell, and ML; many features are common across these languages, though with different syntaxes. So, we will first discuss the LISP-style functions that became the basis for MapReduce, along with a few things to remember about functional programming and how it differs from imperative languages such as C and Java.


Functional Programming Basics:

So, the very basic thing to remember is that functional programming operations don't modify data structures: they always create new ones. The data is copied to represent the updated form, which means the original data remains unmodified. So, if multiple components of a program operate on the same data, they don't need to synchronize when one of them updates it, because the update is reflected in a new copy. Data flow is implicit in the program design: every time we have new data, we assign a new name to it.
For a more concrete example, consider a function over a list of integers. Such a function returns a value, which might be the sum of the integers, their product, or the length of the list, as shown below:

fun foo(l : int list) =
  sum(l) + mul(l) + length(l)

The order in which sum(), mul(), and length() are evaluated doesn't matter, because none of them modifies the list. What is the key property of these functions? None of them has side effects; they are referred to as pure functions. Side effects are things such as printing to the screen, user interaction, or writing data to disk or over the network. Pure functions simply express mathematical computations, like the ones we see in algebra. So, knowing that sum(), mul(), and length() are pure functions, we can evaluate them either from left to right or from right to left, or, if we are clever, we can run them in separate threads and simply join the results at the end of the computation. On the other hand, there are functions that appear to modify the list, such as append(), shown below:

fun append(x, list) =
  let list' = reverse list in
    reverse(x :: list')

The element x is appended to the end of the list, but the original list is not modified: as the code above shows, the reversed list is bound to a new name, list', which is then used for the rest of the computation. So, by creating a new name for the data, we make the data flow explicit in the program.
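
The same idea in a short Python sketch (the names here are illustrative, not from the ML code above): the updated list is a brand-new value, and the original is untouched.

def append(x, lst):
    # Build a new list; the caller's list is left untouched,
    # mirroring the functional append above.
    return lst + [x]

original = [1, 2, 3]
updated = append(4, original)   # new name for the new data
print(original)  # [1, 2, 3] -- unchanged
print(updated)   # [1, 2, 3, 4]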

Functions themselves can also be passed as arguments to other functions, an idea that MapReduce applies directly.

fun DoDouble(f, x) = f(f(x))
(it doesn't matter what 'f' does to 'x'; DoDouble will do it twice, e.g. f = fn z => z * 2)

The return type of DoDouble() depends on its argument: if the argument is an integer, the return type will be an integer (or double, float, etc.); if it is a string, used for some concatenation purpose, then the return type will be a string. Those are the basics of functional programming. Functional programming also comes with standard libraries, though the names of the functions vary from language to language. Functional programming in particular operates on lists, so most of these functions deal with lists, and two of the most commonly used are Map and Fold, sketched below.
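
A rough Python equivalent of DoDouble(), together with the standard-library map and fold (functools.reduce), might look like this; the names and sample values are illustrative.

from functools import reduce

def do_double(f, x):
    # Apply f twice, whatever f does to x.
    return f(f(x))

print(do_double(lambda z: z * 2, 5))        # 20 (doubled twice)
print(do_double(lambda s: s + "!", "hi"))   # hi!! -- return type follows the argument

nums = [1, 2, 3, 4]
print(list(map(lambda n: n * n, nums)))     # map: [1, 4, 9, 16]
print(reduce(lambda a, b: a + b, nums, 0))  # fold: 10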


What does Map do?

Map applies a function 'f' to every element of a list and returns a new, independent list of the values 'f' produced; the element type of the resulting list may differ from that of the input list. For example, if a Map function is applied to a document, it might return all the integers that appear in the document, or perhaps the words or characters, varying from situation to situation.
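
For instance, here is a small Python sketch of using map to pull the integers out of a document (the sample text is made up):

document = "room 12 holds 3 servers and 45 disks"
tokens = document.split()

# Keep only the tokens that look like integers, then map them to ints.
numeric = filter(str.isdigit, tokens)
integers = list(map(int, numeric))
print(integers)  # [12, 3, 45]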


Google's motivation for MapReduce:

So, the motivation for designing MapReduce was that Google routinely faces very large data-processing problems, sometimes with more than a petabyte of data as input, whose processing must run on several thousand processors to finish in a reasonable time. Google wanted to make this whole process easier for programmers. So MapReduce was developed as a framework that provides automatic parallelization and distribution, and programmers use these functions to save time.


Programming Model of MapReduce:

MapReduce consists of two functions, map() and reduce(), borrowed from functional programming. Details of these functions are given below:

map(in_key, in_value) => (out_key, intermediate_values) list

"Map" step: The master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes. (A worker node may do this again in turn, leading to a multi-level tree structure.) The worker node processes that smaller problem, and passes the answer back to its master node.


reduce(out_key, intermediate_values: list) => out_value_list

"Reduce" step: The master node then takes the answers to all the sub-problems and combines them in a way to get the output - the answer to the problem it was originally trying to solve.


The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel - though in practice this is limited by the data source and/or the number of CPUs near that data. Similarly, a set of 'reducers' can perform the reduction phase - all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled, assuming the input data is still available.
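
As a toy illustration of the independent-mapper idea (not Google's implementation), the following Python sketch runs a word-count map function over several documents in parallel with a process pool; the documents and function names are invented for the example.

from multiprocessing import Pool

def map_word_count(document):
    # Each mapper works on its own document, with no shared state.
    pairs = []
    for word in document.split():
        pairs.append((word, 1))
    return pairs

if __name__ == "__main__":
    documents = ["the cat sat", "the dog ran", "the cat ran"]
    with Pool(processes=3) as pool:
        # Because the mappers are independent, they can run in parallel.
        results = pool.map(map_word_count, documents)
    print(results)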


Logical view:

The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type on a data domain, and returns a list of pairs in a different domain:

Map(k1,v1) -> list(k2,v2)

The map function is applied in parallel to every item in the input dataset. This produces a list of (k2,v2) pairs for each call. After that, the MapReduce framework collects all pairs with the same key from all lists and groups them together, thus creating one group for each one of the different generated keys.
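
In Python, that grouping (shuffle) step can be sketched as follows, assuming the mappers' combined output is a flat list of (key, value) pairs:

from collections import defaultdict

def group_by_key(pairs):
    # Collect every value emitted under the same key into one group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

pairs = [("cat", 1), ("dog", 1), ("cat", 1)]
print(dict(group_by_key(pairs)))  # {'cat': [1, 1], 'dog': [1]}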

The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list (v2)) -> list(v2)

Each Reduce call typically produces either one value v2 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list.

Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.
It is necessary but not sufficient to have implementations of the map and reduce abstractions in order to implement MapReduce. Furthermore, effective implementations of MapReduce require a distributed file system to connect the processes performing the Map and Reduce phases.


Example:

The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:

map(String name, String document):
  // key: document name
  // value: document contents
  for each word w in document:
    EmitIntermediate(w, 1);

reduce(String word, Iterator partialCounts):
  // key: a word
  // values: a list of aggregated partial counts
  int result = 0;
  for each v in partialCounts:
    result += ParseInt(v);
  Emit(result);

Here, each document is split into words, and each word is counted initially with a "1" value by the Map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to Reduce, so this function just needs to sum all of its input values to find the total appearances of that word.
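
Putting these pieces together, here is a minimal, runnable single-machine version of the word count in Python. It is a toy stand-in for the distributed framework: map_fn, reduce_fn, and map_reduce are illustrative names, not part of any real MapReduce library.

from collections import defaultdict

def map_fn(name, document):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_fn(word, partial_counts):
    # Sum the partial counts for one word.
    return word, sum(partial_counts)

def map_reduce(documents):
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):  # "Map" step
            intermediate[key].append(value)    # shuffle/group by key
    # "Reduce" step, applied once per distinct key
    return dict(reduce_fn(w, counts) for w, counts in intermediate.items())

docs = {"a.txt": "the cat sat", "b.txt": "the cat ran"}
print(map_reduce(docs))  # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}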



Execution Process:

Graphically, the execution proceeds as follows. We have a set of mappers (worker nodes) connected to some data source (files, database tables, etc.). Each map process creates a set of intermediate values tagged with intermediate keys. The barrier is the synchronization mechanism: it simply waits for all the map processes to complete. If there are 'n' processes, the master machine keeps a counter that starts at 'n' and decrements to '0'; when it hits '0', all mappers are done and we know the intermediate values are complete. The intermediate pairs are then shuffled so that all values with the same key go to a particular reducer ('1', '2', etc.). The parallelism that MapReduce creates comes from the fact that there is no synchronization among the mappers; the running tasks are totally independent of each other, and the same holds for the reducers.
Each worker node has its own buffer memory to store intermediate keys and values. So if one of the nodes goes down, the master reschedules its assigned tasks on any other free node. But if the master itself or the data source goes down, the whole process is aborted and must be restarted at some later time.


Optimization during Execution Process:

No reduce function starts until all map functions have finished their tasks. In the meantime, one machine may be much slower than the other nodes, which can limit the overall process. The master node takes care of this situation too, by scheduling the slow node's tasks on other nodes that are free.
At the same time, another function, the combiner(), takes care of gathering identical intermediate values produced by each mapper. The combiner() acts as a mini reduce phase on the mapper's local output, in order to save bandwidth, as sketched below.
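
For the word-count case, such a combiner might look like the following Python sketch: each mapper pre-sums its own (word, 1) pairs before anything crosses the network. The function name combine() is illustrative.

from collections import Counter

def combine(pairs):
    # Mini-reduce on the mapper's local output: collapse repeated
    # keys into partial sums so fewer pairs are sent to the reducers.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

local_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combine(local_output))  # [('the', 3), ('cat', 1)]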


Fault Tolerance:

MapReduce also provides fault tolerance by reassigning tasks to another free node (machine) if one of the nodes fails, whether from a network failure or a system crash. This matters because, when thousands of interconnected computers are performing a job, a single failed computer would otherwise stall the whole process.


MapReduce: A Major Step Backwards (from the database community):

A giant step backward in the programming paradigm for large-scale data intensive applications
A sub-optimal implementation, in that it uses brute force instead of indexing
Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
Missing most of the features that are routinely included in current DBMS
Incompatible with all of the tools DBMS users have come to depend on


My point of View:

Unfortunately, I don't work much in the field of databases, but one thing is sure: people always tend to adopt the approach that is easy to learn and has fewer complications. Handling data using SQL requires more specialized knowledge than using the MapReduce functions, because the syntax of map() and reduce() is that of C or Java, which everyone is familiar with. The MapReduce technique can be applied to either structured or unstructured data, whereas SQL always requires data in structured form for further analysis or computation. So, I think MapReduce has proven to be a useful abstraction for simplifying large-scale computation on distributed systems.

