How can a join of two large datasets in Spark avoid a shuffle?

Purpose: I have two large datasets in Spark that need to be joined. Both contain the field userid, and I need to associate them by userid. I want to avoid a shuffle.

Completed so far: I pre-processed each dataset into 10,000 files partitioned by userid, which ensures that records with the same userid land in the file with the same partition number, and that the records within each file are sorted by userid.

For example, the raw data of the first dataset is

{"userid": 10001, "value": ""}
{"userid": 1, "value": ""}
{"userid": 21, "value": ""}

The raw data of the second dataset is

{"userid": 10001, "value": ""}
{"userid": 1, "value": ""}
{"userid": 92, "value": ""}

After processing, the files of the first dataset look like this:
file1/part-00001

{"userid": 1, "value": ""}
{"userid": 10001, "value": ""}

file1/part-00021

{"userid": 21, "value": ""}

After processing, the files of the second dataset look like this:
file2/part-00001

{"userid": 1, "value": ""}
{"userid": 10001, "value": ""}

file2/part-00092

{"userid": 92, "value": ""}

Problem encountered in Spark: how can I get the two files with the same partition number to be processed by the same task, in a way that lets me operate on both partition files together? Since both partitions are sorted by userid, a single linear merge pass over the two files, i.e. O(n) in the number of records, is enough to complete the join.
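
If both sides are available as pair RDDs with the same number of partitions, and partition i of one side lines up with partition i of the other, RDD.zipPartitions runs each pair of same-numbered partitions in a single task and hands you both iterators, which is exactly what a streaming merge join needs. A sketch of the merge I have in mind, reusing part1/part2 from above and assuming at most one record per userid on each side:

// O(n) merge join over two co-partitioned, sorted pair RDDs.
val joined = part1.zipPartitions(part2) { (leftIt, rightIt) =>
  val left = leftIt.buffered
  val right = rightIt.buffered
  new Iterator[(Long, (String, String))] {
    // Advance whichever side has the smaller key until the heads match
    // or one side runs out.
    private def align(): Unit =
      while (left.hasNext && right.hasNext && left.head._1 != right.head._1) {
        if (left.head._1 < right.head._1) left.next() else right.next()
      }
    def hasNext: Boolean = { align(); left.hasNext && right.hasNext }
    def next(): (Long, (String, String)) = {
      align()
      val (k, lv) = left.next()
      val (_, rv) = right.next()
      (k, (lv, rv))
    }
  }
}

One caveat when reloading the pre-processed files from disk: sc.textFile does not guarantee one partition per part file (large splittable files can be split further), so either keep the RDDs from the pre-processing step alive or read the files in a way that preserves the file-to-partition mapping.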

If it is not possible to operate on two partition files at once, can Spark at least merge the two files with the same partition number into one task? In that case I would need to do an extra sort within the partition.
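
From what I can tell, when two RDDs report the same partitioner, union is partitioner-aware and places same-numbered partitions into a single task without a shuffle. An RDD freshly loaded from files carries no partitioner, so this works most naturally on the RDDs as they come out of the pre-processing step. A sketch, again reusing the hypothetical partitioner and part1/part2 from above:

// Tag each record with its source so the two sides stay distinguishable.
// mapValues preserves the partitioner, so the union below is
// partitioner-aware and co-locates partition i of both inputs.
val tagged1 = part1.mapValues(v => (1, v))
val tagged2 = part2.mapValues(v => (2, v))
val merged = tagged1.union(tagged2).mapPartitions { it =>
  // The concatenated partition is no longer globally sorted, hence the
  // extra in-partition sort mentioned above.
  it.toArray.sortBy(_._1).iterator
  // ...pairing logic over the sorted run would then complete the join
}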


If userid is unique within each dataset, performance should be much better: the two datasets can be unioned together and then reduced by userid.
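
A sketch of that union-then-reduce variant, assuming the pair RDDs still carry the same partitioner and that same partitioner is passed to reduceByKey, so that neither step shuffles:

// userid is unique per side, so each key sees at most two records:
// one contributing the left slot and one contributing the right slot.
val optioned1 = part1.mapValues(v => (Option(v), Option.empty[String]))
val optioned2 = part2.mapValues(v => (Option.empty[String], Option(v)))
val joined = optioned1.union(optioned2).reduceByKey(
  partitioner,
  (a, b) => (a._1.orElse(b._1), a._2.orElse(b._2))
)
// Keys present in both datasets now map to (Some(leftValue), Some(rightValue)).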
