Please help me check whether my processing method is reasonable [a large number of JSON files are read, parsed, and imported into Elasticsearch].

<h2>Business scenario</h2>

A large number of JSON files need to be read, re-parsed, and imported into Elasticsearch. The JSON files are saved in different date folders; a single folder is about 80 GB. I haven't actually counted the number of JSON files per folder, but there should be more than 10,000.

<h2>The way I handle it</h2>

I open a thread pool, and multiple threads read different JSON files. The data in each file is an array [data1, data2, data3, data4, ...]. I parse the JSON file, traverse each element, reshape it into the desired JSON format, and import it into Elasticsearch, using Elasticsearch's bulk processing to import every thousand documents.
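
In case it helps to have the flow spelled out, here is a minimal sketch of that approach, assuming Jackson for parsing and the 7.x-era Elasticsearch High Level REST Client (import paths vary between client versions); the index name "my-index", the folder path, and the pool size are placeholders, and the per-record reshaping is only hinted at in a comment:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JsonImporter {
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
        ExecutorService pool = Executors.newFixedThreadPool(8);
        ObjectMapper mapper = new ObjectMapper();

        // each thread handles whole files, so no coordination is needed inside a file
        for (File file : new File("/path/to/date-folder")
                .listFiles((dir, name) -> name.endsWith(".json"))) {
            pool.submit(() -> importFile(client, mapper, file));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        client.close();
    }

    static void importFile(RestHighLevelClient client, ObjectMapper mapper, File file) {
        try {
            JsonNode array = mapper.readTree(file); // the file is one array: [data1, data2, ...]
            BulkRequest bulk = new BulkRequest();
            for (JsonNode data : array) {
                // reshape "data" into the desired document here before indexing
                bulk.add(new IndexRequest("my-index")
                        .source(mapper.writeValueAsString(data), XContentType.JSON));
                if (bulk.numberOfActions() >= BATCH_SIZE) { // flush every thousand docs
                    client.bulk(bulk, RequestOptions.DEFAULT);
                    bulk = new BulkRequest();
                }
            }
            if (bulk.numberOfActions() > 0) { // flush the remainder
                client.bulk(bulk, RequestOptions.DEFAULT);
            }
        } catch (Exception e) {
            e.printStackTrace(); // real code should record the failed file for retry
        }
    }
}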

<h2>What I want to improve</h2>

I want to add something like resumable transfer (breakpoint continuation): when my program dies halfway through, I don't want to start from scratch, but rather resume the import from the point where it stopped last time. I haven't figured out how to do this. I am working in Java.


Implementing breakpoint resumption yourself may be troublesome. You can use the split command to first cut these large files into small files, and then submit each of them directly with curl through the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html" rel="nofollow noreferrer">bulk API</a> (the bulk payload is newline-delimited, with an action line and a source line per document, so split by line count, keeping the count even, rather than by size):

curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_1.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_2.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_3.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_4.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_5.json

This way you don't even need multithreading.


Copy them all, and delete each file after it has been transferred. The files that remain are exactly the ones still to be imported, so if the process is interrupted you simply restart it on whatever is left.
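
If you would rather keep this loop in Java than in shell, here is a minimal sketch of the same "delete after transfer" idea using the JDK 11+ HttpClient; the chunk directory and the endpoint are placeholders, and bear in mind that _bulk can return 200 while still reporting per-item failures in the response body:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResumableBulkUpload {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient http = HttpClient.newHttpClient();
        // only chunks that were never deleted are still here, so a restart
        // automatically resumes exactly where the previous run stopped
        try (DirectoryStream<Path> chunks =
                Files.newDirectoryStream(Paths.get("/path/to/chunks"))) {
            for (Path chunk : chunks) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:9200/_bulk"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofFile(chunk))
                        .build();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    Files.delete(chunk); // the "checkpoint": finished chunks disappear
                } else {
                    throw new IOException("bulk upload failed for " + chunk
                            + ": " + response.body());
                }
            }
        }
    }
}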
