Spark SQL: parsing JSON containing an array of nested objects

The JSON data looks like this:

{"id": 11, "data": [{"package": "com.browser1", "activetime": 60000}, {"package": "com.browser6", "activetime": 1205000}, {"package": "com.browser7", "activetime": 1205000}]}

{"id": 12, "data": [{"package": "com.browser1", "activetime": 60000}, {"package": "com.browser6", "activetime": 1205000}]}
I want to analyze the activetime field of each app in the JSON to compute the total activation time of each app. I use Spark SQL to parse the JSON:

    val sqlContext = spark.sqlContext   // spark is the SparkSession (predefined in spark-shell)

    val behavior = sqlContext.read.json("behavior-json.log")
    behavior.cache()

    behavior.createOrReplaceTempView("behavior")

    val appActiveTime = sqlContext.sql("SELECT data FROM behavior")  // select the nested array column

    appActiveTime.show(100, 100)         // print as a DataFrame
    appActiveTime.rdd.foreach(println)   // print as an RDD

But the printed DataFrame looks like this:

+----------------------------------------------------------------------+
|                                                                  data|
+----------------------------------------------------------------------+
|                        [[60000,com.browser1], [12870000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                         [[60000,com.browser1], [1207000,com.browser]]|
|                                                [[120000,com.browser]]|
|                        [[60000,com.browser1], [1204000,com.browser5]]|
|                        [[60000,com.browser1], [12075000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                         [[60000,com.browser1], [1204000,com.browser]]|
|                          [[60000,com.browser1], [120000,com.browser]]|
|                         [[60000,com.browser1], [1201000,com.browser]]|
|                                              [[1200400,com.browser5]]|
|                         [[60000,com.browser1], [1200400,com.browser]]|
|[[60000,com.browser1], [1205000,com.browser6], [1205000,com.browser7]]|
+----------------------------------------------------------------------+

And the RDD looks like this:

[WrappedArray([60000,com.browser1], [60000,com.browser1])]
[WrappedArray([120000,com.browser])]
[WrappedArray([60000,com.browser1], [1204000,com.browser5])]
[WrappedArray([12075000,com.browser], [12075000,com.browser])]

I want to convert the data into:

com.browser1 60000 
com.browser1 60000 
com.browser 12075000 
com.browser 12075000 
.......

That is, I want to turn each element of the array in each RDD row into a row of its own. Of course, any other structure that is easy to analyze would also be fine.

Because I am a beginner in Spark and Scala, I have tried for a long time without success, so I hope you can guide me.


Do you have a solution for this?


No problem.
According to your original JSON, if you convert it to a table structure, this format is indeed what you get, because the JSON has two attributes: an id, and data, which is a list. If you don't map it yourself, Spark has no way of knowing how the structure of your list should be mapped into a table:
+--------------------+---+
|                data| id|
+--------------------+---+
|[[60000, com.brow...| 11|
+--------------------+---+

So if you want to present your JSON in a flat table structure, you need to map it yourself:

    import java.util

    import org.apache.spark.api.java.function.FlatMapFunction
    import org.apache.spark.sql.catalyst.encoders.RowEncoder
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
    import org.apache.spark.sql.{Row, SparkSession}

    import scala.collection.mutable

    val session = SparkSession.builder().config(sc.getConf).getOrCreate()

    // Target schema of the flattened table: one row per (id, package, activetime).
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("package", StringType),
      StructField("activetime", LongType)
    ))

    val encoder = RowEncoder(schema)

    val df = session.read.json("file:///E:/anc.json")
      .flatMap(new FlatMapFunction[Row, Row] {
        override def call(r: Row): util.Iterator[Row] = {
          val list = new util.ArrayList[Row]()
          // "data" is an array of structs; emit one flat row per element.
          val datas = r.getAs[mutable.WrappedArray.ofRef[Row]]("data")
          datas.foreach { data =>
            list.add(Row(
              r.getAs[Long]("id"),
              data.getAs[String]("package"),
              data.getAs[Long]("activetime")
            ))
          }
          list.iterator()
        }
      }, encoder)

    df.show()

The final table structure is:

+---+------------+----------+
| id|     package|activetime|
+---+------------+----------+
| 11|com.browser1|     60000|
| 11|com.browser6|   1205000|
| 11|com.browser7|   1205000|
+---+------------+----------+
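
Once the data is flat like this, the total activation time per app that you asked about is a simple aggregation. A minimal sketch, using the `df` built above:

    import org.apache.spark.sql.functions.sum

    // Sum activetime per package to get each app's total activation time.
    df.groupBy("package")
      .agg(sum("activetime").as("total_activetime"))
      .show()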

You can also apply a map after your `sqlContext.sql("SELECT data FROM behavior")`; the key point is how you want to present the contents of that list.
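
For example, a minimal sketch (assuming the `sqlContext` and the `behavior` view from your question) that flattens each row into (package, activetime) pairs, matching the output you asked for:

    import org.apache.spark.sql.Row

    // Flatten each row's "data" array into (package, activetime) pairs.
    val pairs = sqlContext.sql("SELECT data FROM behavior")
      .rdd
      .flatMap { row =>
        row.getSeq[Row](0).map { app =>
          (app.getAs[String]("package"), app.getAs[Long]("activetime"))
        }
      }

    pairs.foreach(println)   // e.g. (com.browser1,60000)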


Use the EXPLODE function to flatten the array:

    SELECT EXPLODE(data) FROM behavior
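
Going one step further, a sketch of the full aggregation the question asks for (total activation time per app); LATERAL VIEW is used here so the exploded struct's fields can be referenced by name (assuming the `sqlContext` and `behavior` view from the question):

    // Explode the array, then sum activetime per package.
    sqlContext.sql(
      """SELECT app.package, SUM(app.activetime) AS total_activetime
        |FROM behavior
        |LATERAL VIEW EXPLODE(data) exploded AS app
        |GROUP BY app.package""".stripMargin
    ).show()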