Spark: How To Parse Json String Of Nested Lists To Spark Data Frame?
How to parse a JSON string of nested lists to a Spark data frame in PySpark?

Input data frame:

+-------------+-----------------------------------------------+
|url          |json                                           |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+
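To make the answers below reproducible, here is a minimal sketch that rebuilds this input (the row values are inferred from the outputs shown further down), assuming an active SparkSession named spark:

df = spark.createDataFrame(
    [
        ("https://url.a", "[[1572393600000, 1.000],[1572480000000, 1.007]]"),
        ("https://url.b", "[[1572825600000, 1.002],[1572912000000, 1.000]]"),
    ],
    ["url", "json"],
)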
Solution 1:
With some string replacements and splitting, you can get the desired result:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
            split(trim(both '][' from json), '\\\],\\\['),
            x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
        )""")
    )
).selectExpr("col_1", "col_2_3.*")
df1.show(truncate=False)
#+-----+-------------+------+
#|col_1|col_2        |col_3 |
#+-----+-------------+------+
#|a    |1572393600000| 1.000|
#|a    |1572480000000| 1.007|
#|b    |1572825600000| 1.002|
#|b    |1572912000000| 1.000|
#+-----+-------------+------+
Explanation:

- trim(both '][' from json) removes the leading and trailing characters [ and ], leaving something like: 1572393600000, 1.000],[1572480000000, 1.007
- Now you can split by ],[ (the \\\ is there to escape the brackets in the regex).
- transform takes the array from the split and, for each element, splits it by comma and creates a struct with fields col_2 and col_3.
- explode flattens the array of structs from the transform, and the final selectExpr star-expands the struct column. A step-by-step sketch of these intermediate results follows this list.
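A minimal sketch of those intermediate steps, assuming the sample df built above:

from pyspark.sql import functions as F

# 1. Strip the outer brackets from the JSON string.
step1 = df.select("url", F.expr("trim(both '][' from json)").alias("trimmed"))
# trimmed: 1572393600000, 1.000],[1572480000000, 1.007

# 2. Split on ],[ with the brackets escaped for the regex.
step2 = step1.select("url", F.expr("split(trimmed, '\\\],\\\[')").alias("pairs"))
# pairs: an array of two strings, e.g. ["1572393600000, 1.000", "1572480000000, 1.007"]

# 3. Turn each "timestamp, value" string into a struct of two fields.
step3 = step2.select("url", F.expr(
    "transform(pairs, x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3))"
).alias("structs"))
step3.show(truncate=False)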
Solution 2:
from pyspark.sql import functions as F

df.select(df.url, F.explode(F.from_json(df.json, "array<string>"))) \
    .select("url", F.from_json(F.col("col"), "array<string>").alias("col")) \
    .select("url", F.col("col").getItem(0), F.col("col").getItem(1)) \
    .show(truncate=False)
+-------------+-------------+------+
|url |col[0] |col[1]|
+-------------+-------------+------+
|https://url.a|1572393600000|1.0 |
|https://url.a|1572480000000|1.007 |
|https://url.b|1572825600000|1.002 |
|https://url.b|1572912000000|1.0 |
+-------------+-------------+------+
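This solution calls from_json twice: the first pass with schema array<string> turns the outer JSON array into an array of strings (one string per inner list), which explode flattens into rows; the second pass parses each inner string into an array whose items are read with getItem. A variation not taken from the original answers is to parse the whole string in a single pass with a typed schema, assuming every inner list is a numeric [timestamp, value] pair:

from pyspark.sql import functions as F

# One-pass sketch with a typed schema (assumes inner lists are numeric pairs).
df.select(
    df.url,
    F.explode(F.from_json(df.json, "array<array<double>>")).alias("pair")
).select(
    "url",
    F.col("pair").getItem(0).cast("long").alias("timestamp"),
    F.col("pair").getItem(1).alias("value")
).show(truncate=False)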