Turn RDD Into Broadcast Dictionary For Lookup
Solution 1:
Broadcast variables are read-only on workers. Spark also provides accumulators, which are write-only, but those are intended for things like counters. Here you can simply collect the RDD and create a Python dictionary:
# Split each record on the \x01 delimiter into (key, value) pairs,
# collect them to the driver, and broadcast the resulting dict to all workers
lookup_bd = sc.broadcast({
    k: v for (k, v) in lookup.map(lambda r: r.split(chr(1))).collect()
})
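On the workers the dictionary is then accessed through .value. A minimal usage sketch, assuming a hypothetical data_rdd whose elements are lookup keys:

# The broadcast value is read locally on each worker,
# without reshipping the dict with every task (data_rdd is hypothetical)
resolved = data_rdd.map(lambda key: (key, lookup_bd.value.get(key)))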
It is not realistic to do it in any sort of SQL, because you would have to create a table (table1) with 25K columns, and the CREATE TABLE syntax would be ridiculously long.
Creation shouldn't be a problem. As long as you know the names, you can easily create a table like this programmatically:
from random import randint
from pyspark.sql import Row

colnames = ["x{0}".format(i) for i in range(25000)]  # Replace with actual names
row = Row(*colnames)  # Row factory with 25K named fields

df = sc.parallelize([
    row(*[randint(0, 100) for _ in range(25000)]) for x in range(10)
]).toDF()

## len(df.columns)
## 25000
There is another, much more serious problem here, even when you use plain RDDs: very wide rows are, generally speaking, hard to handle in any row-wise format.
One thing you can do is use a sparse representation like SparseVector or SparseMatrix. Another is to encode the pixel info, for example using RLE (run-length encoding).
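To illustrate both ideas, here is a minimal sketch. SparseVector comes from pyspark.mllib.linalg; the rle_encode helper and the sample pixel row are made up for the example:

from itertools import groupby
from pyspark.mllib.linalg import SparseVector

# Hypothetical row of 25K pixels, almost all zeros
pixels = [0] * 25000
pixels[10], pixels[9000] = 255, 17

# Sparse representation: store only the non-zero entries
sv = SparseVector(25000, {10: 255.0, 9000: 17.0})

# Run-length encoding: collapse runs of equal values into (value, count) pairs
def rle_encode(values):
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

encoded = rle_encode(pixels)
## [(0, 10), (255, 1), (0, 8989), (17, 1), (0, 15999)]

Either approach keeps the per-row footprint proportional to the non-zero (or non-repeating) content instead of the full 25K width.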