
Turn RDD into Broadcast Dictionary for Lookup

What I have so far is:

lookup = sc.textFile('/user/myuser/lookup.asv')
lookup.map(lambda r: r.split(chr(1)))

And now I have an RDD that looks like [[filename1, category1], [f…

Solution 1:

Broadcast variables are read-only on the workers. Spark also provides accumulators, which are write-only from the workers' perspective, but those are intended for things like counters. Here you can simply collect the RDD and broadcast a plain Python dictionary:

lookup_bd = sc.broadcast({
  k: v for (k, v) in lookup.map(lambda r: r.split(chr(1))).collect()
})
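As a small sketch that goes beyond the original answer, the broadcast dictionary can then be used for lookups inside transformations running on the workers, e.g. to attach a category to each filename. The data RDD and its contents here are hypothetical:

# Hypothetical RDD of filenames to be enriched with their category
data = sc.parallelize(["filename1", "filename2"])

# Access the broadcast value on the workers; .get() returns None for unknown keys
categorized = data.map(lambda fn: (fn, lookup_bd.value.get(fn)))

categorized.collect()
## e.g. [('filename1', 'category1'), ...]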

It is not realistic to do it in any sort of SQL, because you would have to create a table (table1) with 25K columns, and the CREATE TABLE syntax would be ridiculously long.

Creation shouldn't be a problem. As long as you know the names, you can easily create a table like this programmatically:

from pyspark.sql import Row
from random import randint

colnames = ["x{0}".format(i) for i in range(25000)]  # Replace with actual names
row = Row(*colnames)

df = sc.parallelize([
    row(*[randint(0, 100) for _ in range(25000)]) for x in range(10)
]).toDF()

## len(df.columns)
## 25000

There is another, much more serious problem here, and it applies even if you use plain RDDs: very wide rows are, generally speaking, hard to handle in any row-wise format.

One thing you can do is use a sparse representation such as SparseVector or SparseMatrix. Another is to compress the pixel information, for example with run-length encoding (RLE).
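For illustration only (not part of the original answer), here is a minimal sketch of both ideas; the dimension, the non-zero positions, and the sample pixel values are made up:

from pyspark.mllib.linalg import Vectors
from itertools import groupby

# A 25000-dimensional row with only three non-zero entries,
# stored sparsely as {index: value}
sparse_row = Vectors.sparse(25000, {10: 3.0, 512: 1.0, 24000: 7.0})

sparse_row.size           ## 25000
sparse_row.numNonzeros()  ## 3

# Toy run-length encoding of a row of pixel values
def rle(row):
    return [(value, sum(1 for _ in group)) for value, group in groupby(row)]

rle([0, 0, 0, 255, 255, 0])
## [(0, 3), (255, 2), (0, 1)]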
