Use Spark or Hadoop to deduplicate two-way relational data

I have a batch of data (10 billion rows) like the following:

ID FROM TO
1   A    B
2   A    C
3   B    A
4   C    A

I want to delete the duplicate two-way (bidirectional) relations, leaving only:

ID FROM TO
1   A    B
2   A    C

1. Because the amount of data is so large, a Bloom filter is not suitable.
2. Deduplicating with database queries is too inefficient.
3. Is Spark or Hadoop more appropriate for processing this much data? All the deduplication solutions I found online group by a single field, which doesn't fit my data.


You can sort the FROM and TO fields within each row with Spark, so that the data above becomes:

ID FROM TO
1   A    B
2   A    C
3   A    B
4   A    C

and then deduplicate the normalized rows with distinct or a reduce, as sketched below.
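A minimal sketch of that approach in Scala. The inline Seq sample and the local master are illustrative assumptions; in practice you would read the 10-billion-row dataset from HDFS or Parquet. least and greatest from org.apache.spark.sql.functions perform the per-row sort of the two endpoints:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DedupTwoWay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-two-way-relations")
      .master("local[*]") // for local testing; drop on a cluster
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question; replace with the real source.
    val df = Seq(
      (1L, "A", "B"),
      (2L, "A", "C"),
      (3L, "B", "A"),
      (4L, "C", "A")
    ).toDF("ID", "FROM", "TO")

    // Normalize each row so the smaller endpoint always comes first;
    // (B, A) and (A, B) then collide on the same key.
    val normalized = df
      .withColumn("k1", least($"FROM", $"TO"))
      .withColumn("k2", greatest($"FROM", $"TO"))

    // Keep one row per undirected pair, here the one with the smallest ID.
    val deduped = normalized
      .groupBy($"k1", $"k2")
      .agg(min($"ID").as("ID"))
      .select($"ID", $"k1".as("FROM"), $"k2".as("TO"))

    deduped.show()
    spark.stop()
  }
}

The groupBy plus min keeps exactly one representative per undirected pair; if any representative will do, dropDuplicates("k1", "k2") on the normalized DataFrame works as well.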
