Baidu interview questions, how to quickly find out the duplicates in the file (large files can not be read at one time)?

Baidu interview questions, roughly means that there is a file, the file is very large can not be read at one time (may not be loaded into memory), the file is stored in the IP address, how to quickly find the duplicate IP address? Ask for advice.

The

file is so large that it can be read line by line. Is it feasible to append to list, take set, and then take the difference set?


not feasible.

  1. append to list is no different from direct one-time reading, which takes up the memory of all data.
  2. the difference set can only be set-list , not list-set .

conditions are not sufficient. If there are 10 million recorded addresses and there are only a few duplicates, you can sort them first and then map-reduce what comes to mind. If there are 10 million records, of which 9 million are duplicates, it can be solved with hashTable.


IPv4? There are only 4Gi addresses in all. Dig a hole in memory and wait for IP to jump. Waste point, use int8 to save that is 4GB memory, save point, use bit to save as long as 500MB. The idea can be more lively, in fact, I think the restriction of giving IP address is to reduce the difficulty .

IPv6, you may have to divide and divide. The basic idea is to check the first few bits of the address according to the length that the memory can bear, throw the collided ones into the same bucket, and then one by one bucket to see if there are any heavy ones, and you can also divide them down. in fact, DBMS does this all day.


first convert ip into int (a bunch of algorithms on the Internet please Baidu)
then make a large array
such as 192.168.3.4 converted to 3232236292
the corresponding array is int array [3232236292-1] to do the count
can be simplified again, because the array has to be pre-allocated, here use map


Mr. into a map that contains all ip, such as map < ip,1 >, which is only 10 million. (this can be optimized) and then read one by one from the file. After reading, take v=map.get (ip) from map. If vroom1, ip,0, and console.log (ip)

, if there is already a repetition of this, the code can be read from the file one by one, and then read one by one from the file. After reading, fetch it from the file.
Menu