How Can I Reduce Key-Value Pairs to a Key and a List of Values?

April 21, 2024

Let us assume I have key-value pairs in Spark, such as the following: [ (Key1, Value1), (Key1, Value2), (Key1, Vaue3), (Key2, Value4), (Key2, Value5) ]. Now I want to reduce this

Solution 1:

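collections.defaultdict can do this: with list as the default factory, every new key starts out with an empty list that values can be appended to. See https://docs.python.org/2/library/collections.html#collections.defaultdict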

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for key, value in [('Key1', 'Value1'), ('Key1', 'Value2'), ('Key1', 'Vaue3'), ('Key2', 'Value4'), ('Key2', 'Value5')]:
...     d[key].append(value)
...
>>> print(list(d.items()))
[('Key1', ['Value1', 'Value2', 'Vaue3']), ('Key2', ['Value4', 'Value5'])]

Solution 2:

val data = Seq(("Key1", "Value1"), ("Key1", "Value2"), ("Key1", "Vaue3"), ("Key2", "Value4"), ("Key2", "Value5"))

data
  .groupBy(_._1)
  .mapValues(_.map(_._2))

res0: scala.collection.immutable.Map[String,Seq[String]] =
     Map(
        Key2 -> List(Value4, Value5), 
        Key1 -> List(Value1, Value2, Vaue3))

Solution 3:

I'm sure there's a more readable way to do this, but the first thing that comes to mind is using itertools.groupby. Sort the list by the first element of the tuple (the key). Then use a list comprehension to iterate over the groups.

from itertools import groupby

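l = [('key1', 1), ('key1', 2), ('key1', 3), ('key2', 4), ('key2', 5)]
# groupby only merges adjacent items, so sort by key first
l.sort(key=lambda i: i[0])

# one (key, [values]) tuple per run of equal keys
[(key, [i[1] for i in values]) for key, values in groupby(l, lambda i: i[0])]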

Output

[('key1', [1, 2, 3]), ('key2', [4, 5])]
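
The sort matters because itertools.groupby only merges adjacent items with equal keys. Here is a small sketch of what happens when the sort is skipped; the list `unsorted` and the helper `group_pairs` are made up for illustration and are not part of the original answer:

```python
from itertools import groupby
from operator import itemgetter


def group_pairs(pairs):
    """Group consecutive (key, value) pairs; hypothetical helper for illustration."""
    return [(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]


# Illustrative data (not from the question): keys are interleaved, not sorted.
unsorted = [('key1', 1), ('key2', 4), ('key1', 2), ('key2', 5), ('key1', 3)]

# Without sorting, each run of equal keys becomes its own group,
# so the same key shows up several times in the result.
without_sort = group_pairs(unsorted)

# sorted() is stable, so values keep their original relative order per key.
with_sort = group_pairs(sorted(unsorted, key=itemgetter(0)))

print(without_sort)
print(with_sort)
```

Sorting costs O(n log n), so for large unsorted inputs a dictionary-based approach may be the cheaper choice.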

Solution 4:

Something like this:

# l is a list of (key, value) tuples
newlist = dict()
for x in l:
    if x[0] not in newlist:
        newlist[x[0]] = list()
    newlist[x[0]].append(x[1])
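
A slightly shorter variant of the same loop can be sketched with dict.setdefault, which returns the list already stored under a key, or stores and returns the given default if the key is new. The sample list `pairs` and the name `grouped` are placeholders chosen for this example:

```python
# Sample input, assumed to have the same (key, value) shape as above.
pairs = [('key1', 1), ('key1', 2), ('key1', 3), ('key2', 4), ('key2', 5)]

grouped = {}
for key, value in pairs:
    # setdefault inserts an empty list the first time a key is seen,
    # then returns the list stored under that key either way.
    grouped.setdefault(key, []).append(value)

print(grouped)
```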

Solution 5:

The shortest version, using defaultdict with reduce, is the following; the input does not need to be sorted.

>>> from collections import defaultdict
>>> from functools import reduce  # reduce is not a builtin in Python 3
>>> collect = lambda tuplist: reduce(lambda acc, kv: acc[kv[0]].append(kv[1]) or acc,
...                                  tuplist, defaultdict(list))
>>> collect([(1, 0), (2, 0), (1, 2), (2, 3)])
defaultdict(<class 'list'>, {1: [0, 2], 2: [0, 3]})
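
The `or acc` in the lambda is there because list.append returns None, so the expression falls through to `acc` and hands the accumulator back to reduce. For comparison, here is a sketch of the same fold written as a plain loop; the function name `collect_pairs` is made up for this example:

```python
from collections import defaultdict


def collect_pairs(tuplist):
    """Loop equivalent of the reduce-based one-liner (hypothetical name)."""
    acc = defaultdict(list)
    for k, v in tuplist:
        acc[k].append(v)  # append mutates the list in place and returns None
    return acc


result = collect_pairs([(1, 0), (2, 0), (1, 2), (2, 3)])
print(dict(result))
```

The loop is longer but avoids relying on a side effect inside a lambda, which some readers may find easier to follow.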
