Why does the sortBy function of Spark generate 4 MapPartitionsRDD?

execute two programs in spark-shell:
first paragraph sortBy:

val list1: List[(String, Int)] = List(("the", 12), ("they", 2), ("do", 4), ("wild", 1), ("and", 5), ("into", 4))
val listRDD1: RDD[(String, Int)] = sc.parallelize(list1)
val result1: RDD[(String, Int)] = listRDD1.sortBy(_._2, false)
result1.collect()

look at the DAG of the program in webui, resulting in three Stage:

clipboard.png

clipboard.png

clipboard.png

4MapPartitionsRDD3ShuffledRDDshuffledRDD
sortBy:


keyByshuffleMapPartitionsRDD, valuesshuffledMapPartitionsRDDMapPartitionsRDDsortByKey

:


DAGStage:

clipboard.png

clipboard.png

clipboard.png

look at the fact that DAG does generate two MapPartitionsRDD, but how are both MapPartitionsRDD generated? And why is there another parallelize phase in the middle? Ask the boss for an answer.

Jul.11,2022
Menu