The json
格式错误。的json
api of sqlContext
正在将其读取为损坏的记录。正确的形式是
{"group":{},"lang":[[1,"scala","functional"],[2,"java","object"],[3,"py","interpreted"]]}
假设你把它放在一个文件(“/home/test.json”)中,那么你可以使用以下方法来获取dataframe
你要
import org.apache.spark.sql.functions._
import sqlContext.implicits._
val df = sqlContext.read.json("/home/test.json")
val df2 = df.withColumn("lang", explode($"lang"))
.withColumn("id", $"lang"(0))
.withColumn("langs", $"lang"(1))
.withColumn("type", $"lang"(2))
.drop("lang")
.withColumnRenamed("langs", "lang")
.show(false)
你应该有
+---+-----+-----------+
|id |lang |type |
+---+-----+-----------+
|1 |scala|functional |
|2 |java |object |
|3 |py |interpreted|
+---+-----+-----------+
Updated
如果您不想更改下面评论中提到的输入 json 格式,您可以使用wholeTextFiles
阅读json
文件和parse
如下
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val readJSON = sc.wholeTextFiles("/home/test.json")
.map(x => x._2)
.map(data => data.replaceAll("\n", ""))
val df = sqlContext.read.json(readJSON)
val df2 = df.withColumn("lang", explode($"lang"))
.withColumn("id", $"lang"(0).cast(IntegerType))
.withColumn("langs", $"lang"(1))
.withColumn("type", $"lang"(2))
.drop("lang")
.withColumnRenamed("langs", "lang")
df2.show(false)
df2.printSchema
它应该给你dataframe
如上所述和schema
as
root
|-- id: integer (nullable = true)
|-- lang: string (nullable = true)
|-- type: string (nullable = true)