Wednesday, 11 September 2013

How is load balancing achieved while sending data to the reducers in Hadoop

As we know, during the copy phase of a Hadoop job, each of the reduce
worker processes reads data from all of the mapper nodes, merges the
already sorted data (it was sorted in memory on the mapper side), and
works on its share of keys and their values.
We also know that all the data corresponding to a particular key will go
to only one reducer.
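
For reference, this per-key routing is the job of the partitioner. Hadoop's
default HashPartitioner boils down to the logic below; the partition (and
hence the target reducer) is a pure function of the key. The class name and
the Text/IntWritable key/value types here are my own arbitrary choices for
the sketch:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Equivalent of Hadoop's default HashPartitioner: because the
    // partition depends only on the key, every record with the same
    // key reaches the same reducer, regardless of which mapper
    // produced it.
    public class HashStylePartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit so the result is non-negative,
            // then take it modulo the number of reducers for the job.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }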
My question is: how is the data to be transferred to the reducers split,
i.e. how is the partition size decided, and by what process, given that
the data is transferred using a pull mechanism rather than a push
mechanism? An interesting challenge here would be determining the overall
size of the data, since it resides on multiple nodes (I am guessing that
the job tracker/master process may be aware of the size and location of
the data on all the nodes, but I am not sure of that either).
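
If my understanding is right, the number of partitions is not computed from
the data size at all: it is fixed up front by the number of reduce tasks
configured for the job, and each mapper partitions its own output locally.
A hypothetical driver snippet (class name and job name are made up; paths
and mapper/reducer setup omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class DriverSketch {
        public static void main(String[] args) throws Exception {
            // The number of reduce tasks -- and therefore the number
            // of partitions each mapper writes -- is decided here at
            // job-setup time, not derived from the overall data size.
            Job job = Job.getInstance(new Configuration(), "driver-sketch");
            job.setNumReduceTasks(10);
            job.setPartitionerClass(HashStylePartitioner.class);
            // Input/output paths and mapper/reducer classes omitted;
            // this only shows where the partition count comes from.
        }
    }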
Wouldn't there be a performance penalty, in terms of parallel processing,
if the data is highly skewed and most of it belongs to a single key, even
when there are 10 or more reducers? In that case, a single reducer process
would end up processing most of the data sequentially. Is this kind of
situation handled in Hadoop? If yes, how?
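
As far as I know, Hadoop does not rebalance a hot key automatically. One
common workaround is to "salt" known hot keys on the map side so their
records spread across several reducers, and then re-aggregate the salted
sub-totals under the original key in a second pass. A rough sketch, where
the hot key "the", the salt fan-out, and the class name are all made up
for illustration:

    import java.util.Random;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical skew workaround: append a random salt to a known
    // hot key so its records are spread over SALT_BUCKETS reducers.
    // A follow-up job (or the reducer itself) must strip the salt and
    // combine the partial results for the original key.
    public class SaltingMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final int SALT_BUCKETS = 10;  // assumed fan-out
        private final Random random = new Random();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void map(Object offset, Text line, Context context)
                throws java.io.IOException, InterruptedException {
            for (String word : line.toString().split("\\s+")) {
                // "the" stands in for a key known to dominate the data.
                String key = word.equals("the")
                        ? word + "#" + random.nextInt(SALT_BUCKETS)
                        : word;
                context.write(new Text(key), one);
            }
        }
    }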
