Use Spark or Hadoop to deduplicate two-way relational data

I have a batch of data (10 billion rows) like the following:

ID FROM TO
1   A    B
2   A    C
3   B    A
4   C    A

I want to delete the duplicate two-way (bidirectional) relations, leaving only:

ID FROM TO
1   A    B
2   A    C

1. Because the amount of data is so large, a Bloom filter is not suitable.
2. Deduplicating with database queries is too inefficient.
3. Is Spark or Hadoop more appropriate for processing this much data? All the deduplication solutions I found online group by a single field, which doesn't fit my data.


You can sort the FROM and TO fields within each row with Spark, so that the data above becomes:

ID FROM TO
1   A    B
2   A    C
3   A    B
4   A    C

and then deduplicate the normalized rows with distinct or a reduce, as sketched below.
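A minimal sketch of that approach in Scala. The inline Seq sample and the local master are illustrative assumptions; in practice you would read the 10-billion-row dataset from HDFS or Parquet. least and greatest from org.apache.spark.sql.functions perform the per-row sort of the two endpoints:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DedupTwoWay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-two-way-relations")
      .master("local[*]") // for local testing; drop on a cluster
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question; replace with the real source.
    val df = Seq(
      (1L, "A", "B"),
      (2L, "A", "C"),
      (3L, "B", "A"),
      (4L, "C", "A")
    ).toDF("ID", "FROM", "TO")

    // Normalize each row so the smaller endpoint always comes first;
    // (B, A) and (A, B) then collide on the same key.
    val normalized = df
      .withColumn("k1", least($"FROM", $"TO"))
      .withColumn("k2", greatest($"FROM", $"TO"))

    // Keep one row per undirected pair, here the one with the smallest ID.
    val deduped = normalized
      .groupBy($"k1", $"k2")
      .agg(min($"ID").as("ID"))
      .select($"ID", $"k1".as("FROM"), $"k2".as("TO"))

    deduped.show()
    spark.stop()
  }
}

The groupBy plus min keeps exactly one representative per undirected pair; if any representative will do, dropDuplicates("k1", "k2") on the normalized DataFrame works as well.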
