Please help me check whether my processing method is reasonable [a large number of JSON files are read, parsed, and imported into Elasticsearch].

<h2>Business scenario</h2>

A large number of JSON files need to be read, re-parsed, and imported into Elasticsearch. The JSON files are saved in different date folders; a single folder is about 80 GB. I haven't actually counted the number of JSON files per folder, but there should be more than 10,000.

<h2>The way I handle it</h2>

I open a thread pool, and multiple threads read different JSON files. The data in each file is an array [data1, data2, data3, data4, ...]. I parse the JSON file, traverse each element, reshape it into the desired JSON format, and import it into Elasticsearch, using Elasticsearch's bulk processing to import every thousand documents.
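
In case it helps to have the flow spelled out, here is a minimal sketch of that approach, assuming Jackson for parsing and the 7.x-era Elasticsearch High Level REST Client (import paths vary between client versions); the index name "my-index", the folder path, and the pool size are placeholders, and the per-record reshaping is only hinted at in a comment:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JsonImporter {
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) throws Exception {
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")));
        ExecutorService pool = Executors.newFixedThreadPool(8);
        ObjectMapper mapper = new ObjectMapper();

        // each thread handles whole files, so no coordination is needed inside a file
        for (File file : new File("/path/to/date-folder")
                .listFiles((dir, name) -> name.endsWith(".json"))) {
            pool.submit(() -> importFile(client, mapper, file));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        client.close();
    }

    static void importFile(RestHighLevelClient client, ObjectMapper mapper, File file) {
        try {
            JsonNode array = mapper.readTree(file); // the file is one array: [data1, data2, ...]
            BulkRequest bulk = new BulkRequest();
            for (JsonNode data : array) {
                // reshape "data" into the desired document here before indexing
                bulk.add(new IndexRequest("my-index")
                        .source(mapper.writeValueAsString(data), XContentType.JSON));
                if (bulk.numberOfActions() >= BATCH_SIZE) { // flush every thousand docs
                    client.bulk(bulk, RequestOptions.DEFAULT);
                    bulk = new BulkRequest();
                }
            }
            if (bulk.numberOfActions() > 0) { // flush the remainder
                client.bulk(bulk, RequestOptions.DEFAULT);
            }
        } catch (Exception e) {
            e.printStackTrace(); // real code should record the failed file for retry
        }
    }
}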

<h2>What I want to improve</h2>

I want to add something like resumable transfer (breakpoint continuation): when my program dies halfway through, I don't want to start from scratch, but rather resume the import from the point where it stopped last time. I haven't figured out how to do this. I am working in Java.


Implementing breakpoint resumption yourself may be troublesome. You can use the split command to first cut these large files into small files, and then submit each of them directly with curl through the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html" rel="nofollow noreferrer">bulk API</a> (the bulk payload is newline-delimited, with an action line and a source line per document, so split by line count, keeping the count even, rather than by size):

curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_1.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_2.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_3.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_4.json
curl -XPOST -H 'Content-Type: application/json' localhost:9200/_bulk --data-binary @/path/to/your/file_5.json

This way you don't even need multithreading.


Copy them all, and delete each file after it has been transferred. The files that remain are exactly the ones still to be imported, so if the process is interrupted you simply restart it on whatever is left.
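
If you would rather keep this loop in Java than in shell, here is a minimal sketch of the same "delete after transfer" idea using the JDK 11+ HttpClient; the chunk directory and the endpoint are placeholders, and bear in mind that _bulk can return 200 while still reporting per-item failures in the response body:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResumableBulkUpload {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient http = HttpClient.newHttpClient();
        // only chunks that were never deleted are still here, so a restart
        // automatically resumes exactly where the previous run stopped
        try (DirectoryStream<Path> chunks =
                Files.newDirectoryStream(Paths.get("/path/to/chunks"))) {
            for (Path chunk : chunks) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:9200/_bulk"))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofFile(chunk))
                        .build();
                HttpResponse<String> response =
                        http.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() == 200) {
                    Files.delete(chunk); // the "checkpoint": finished chunks disappear
                } else {
                    throw new IOException("bulk upload failed for " + chunk
                            + ": " + response.body());
                }
            }
        }
    }
}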
