`combineByKey`, PySpark
Solution 1:
In short, combineByKey lets you explicitly specify the three stages of aggregating (or reducing) your RDD (a minimal sketch of all three stages follows the list below).
1. What is done to a single row when it is FIRST met?
In the example you have provided, the row is put in a list.
2. What is done to a single row when it meets a previously reduced row?
In the example, a previously reduced row is already a list; we append the new row to it and return the new, extended list.
3. What is done to two previously reduced rows?
In the example above, both rows are already lists and we return a new list with the items from both of them.
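The original example is not reproduced in the question, but a minimal sketch of the list-building pattern described by these three stages might look like this (the sample data and variable names are made up for illustration):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (key, value) pairs
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)])

lists_by_key = rdd.combineByKey(
    # 1. createCombiner: the first time a key is seen in a partition,
    #    wrap its value in a new list
    lambda value: [value],
    # 2. mergeValue: a later value for the same key in the same partition
    #    is appended to the existing list
    lambda acc, value: acc + [value],
    # 3. mergeCombiners: lists built independently on different partitions
    #    are concatenated when the partitions are combined
    lambda acc1, acc2: acc1 + acc2,
)

print(lists_by_key.collect())  # e.g. [('a', [1, 3, 5]), ('b', [2, 4])]
```

The result is equivalent to groupByKey, but combineByKey makes each of the three steps explicit and lets you substitute cheaper accumulators when you do not need the full list.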
More well-explained, step-by-step examples are available in these links:
A key explanation from the second link is:
Let us see how combineByKey works in our use case. As combineByKey navigates through each element, for partition 1 it first reaches (Messi, 45), a key it has not seen before; when it moves to the next element, (Messi, 48), it gets a key it has already seen. When it sees an element for the first time, combineByKey() uses a function called createCombiner to create an initial value for the accumulator for that key, i.e. it uses Messi as the key and 45 as the value, so the current value of the accumulator for that key (Messi) becomes 45. The next time combineByKey() sees the same key in the same partition, it does not use createCombiner; instead it uses the second function, mergeValue, with the current accumulator value (45) and the new value (48).
Since all of this happens in parallel across different partitions, the same key may also exist on other partitions with its own accumulators. When the results from different partitions have to be merged, the mergeCombiners function is used.
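The linked use case is not reproduced here, but a small sketch of the data flow it describes (assuming we simply sum goals per player; the records and partition split are invented for illustration) could look like this:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (player, goals) records spread over two partitions, so the
# same key shows up both within a partition and across partitions.
scores = sc.parallelize(
    [("Messi", 45), ("Messi", 48), ("Ronaldo", 34),
     ("Messi", 50), ("Ronaldo", 44)],
    numSlices=2,
)

total_goals = scores.combineByKey(
    # createCombiner: first value for a key within a partition
    lambda goals: goals,
    # mergeValue: same key seen again within that partition
    lambda acc, goals: acc + goals,
    # mergeCombiners: merge accumulators for the same key across partitions
    lambda acc1, acc2: acc1 + acc2,
)

print(total_goals.collect())  # e.g. [('Messi', 143), ('Ronaldo', 78)]
```

Within partition 1, createCombiner turns (Messi, 45) into the accumulator 45, and mergeValue folds 48 into it; mergeCombiners then combines that partition's accumulator with whatever accumulator the other partition built for Messi.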