
Turn RDD into Broadcast Dictionary for Lookup

What I have so far is:

lookup = sc.textFile('/user/myuser/lookup.asv')
lookup.map(lambda r: r.split(chr(1)))

And now I have an RDD that looks like [[filename1, category1], [f…

Solution 1:

Broadcast variables are read-only on the workers. Spark also provides accumulators, which are write-only from the workers' perspective, but those are intended for things like counters. Here you can simply collect the RDD and broadcast a plain Python dictionary:

lookup_bd = sc.broadcast({
  k: v for (k, v) in lookup.map(lambda r: r.split(chr(1))).collect()
})
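As a small sketch that goes beyond the original answer, the broadcast dictionary can then be used for lookups inside transformations running on the workers, e.g. to attach a category to each filename. The data RDD and its contents here are hypothetical:

# Hypothetical RDD of filenames to be enriched with their category
data = sc.parallelize(["filename1", "filename2"])

# Access the broadcast value on the workers; .get() returns None for unknown keys
categorized = data.map(lambda fn: (fn, lookup_bd.value.get(fn)))

categorized.collect()
## e.g. [('filename1', 'category1'), ...]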

It is not realistic to do it in any sort of SQL, because you would have to create a table (table1) with 25K columns, and the CREATE TABLE syntax would be ridiculously long.

Creation shouldn't be a problem. As long as you know the names, you can easily create a table like this programmatically:

from pyspark.sql import Row
from random import randint

colnames = ["x{0}".format(i) for i in range(25000)]  # Replace with actual names
row = Row(*colnames)

df = sc.parallelize([
    row(*[randint(0, 100) for _ in range(25000)]) for x in range(10)
]).toDF()

## len(df.columns)
## 25000

There is another, much more serious problem here, and it applies even if you use plain RDDs: very wide rows are, generally speaking, hard to handle in any row-wise format.

One thing you can do is use a sparse representation such as SparseVector or SparseMatrix. Another is to compress the pixel information, for example with run-length encoding (RLE).
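For illustration only (not part of the original answer), here is a minimal sketch of both ideas; the dimension, the non-zero positions, and the sample pixel values are made up:

from pyspark.mllib.linalg import Vectors
from itertools import groupby

# A 25000-dimensional row with only three non-zero entries,
# stored sparsely as {index: value}
sparse_row = Vectors.sparse(25000, {10: 3.0, 512: 1.0, 24000: 7.0})

sparse_row.size           ## 25000
sparse_row.numNonzeros()  ## 3

# Toy run-length encoding of a row of pixel values
def rle(row):
    return [(value, sum(1 for _ in group)) for value, group in groupby(row)]

rle([0, 0, 0, 255, 255, 0])
## [(0, 3), (255, 2), (0, 1)]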
