
Spark: How To Parse Json String Of Nested Lists To Spark Data Frame?

How can I parse a JSON string of nested lists into a Spark data frame in PySpark? Input data frame: +-------------+-----------------------------------------------+ |url |json

Solution 1:

With some replacements in the strings and by splitting you can get the desired result:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
            split(trim(both '][' from json), '\\\],\\\['), 
            x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
        )""")
    )
).selectExpr("col_1", "col_2_3.*")

df1.show(truncate=False)

#+-----+-------------+------+
#|col_1|col_2        |col_3 |
#+-----+-------------+------+
#|a    |1572393600000| 1.000|
#|a    |1572480000000| 1.007|
#|b    |1572825600000| 1.002|
#|b    |1572912000000| 1.000|
#+-----+-------------+------+

Explanation:

  1. trim(both '][' from json): removes the leading and trailing [ and ] characters, leaving something like: 1572393600000, 1.000],[1572480000000, 1.007

  2. Now you can split by ],[ (split takes a regular expression, so the \\\ escapes the brackets)

  3. transform takes the array from the split and, for each element, splits it by comma and creates a struct with fields col_2 and col_3

  4. explode turns the array of structs from the transform into one row per struct, and the star (col_2_3.*) expands the struct into separate columns

Solution 2:

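from pyspark.sql import functions as F

# Parse the outer array, explode to one row per inner list, then parse each inner list.
(
    df.select(df.url, F.explode(F.from_json(df.json, "array<string>")))
    .select("url", F.from_json(F.col("col"), "array<string>").alias("col"))
    .select("url", F.col("col").getItem(0), F.col("col").getItem(1))
    .show(truncate=False)
)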

+-------------+-------------+------+
|url          |col[0]       |col[1]|
+-------------+-------------+------+
|https://url.a|1572393600000|1.0   |
|https://url.a|1572480000000|1.007 |
|https://url.b|1572825600000|1.002 |
|https://url.b|1572912000000|1.0   |
+-------------+-------------+------+
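To sanity-check the expected rows outside Spark, the same string can be parsed with Python's json module. This is a minimal sketch rather than a Spark equivalent: the sample value and the url are assumptions reconstructed from the outputs above.

```python
import json

# Assumed sample row, reconstructed from the outputs shown above.
url = "https://url.a"
sample = "[[1572393600000, 1.000],[1572480000000, 1.007]]"

# json.loads returns the nested list directly; each inner list becomes one row.
rows = [(url, str(ts), str(val)) for ts, val in json.loads(sample)]

for row in rows:
    print(row)
```

Because the values are parsed as JSON numbers rather than kept as raw text, 1.000 comes back as 1.0 here. That is presumably also why Solution 2 prints 1.0 where the string-splitting approach in Solution 1 keeps 1.000.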
