The problem of cleaning and storing big data

As shown in the picture, the company handed over several hundred gigabytes of Word documents. I looked through the contents and they are very messy, but they roughly look like the picture: company-related information.
However, the typesetting and the field names are very inconsistent.

The company wants this information stored in a database, and I don't know where to start.

Do you have any ideas or suggestions?


1. First, use a regular expression to split the text into individual records at each number-plus-colon prefix (such as "1:").
This gives a result like:
[
'1: company name: A\nwebsite: www\nphone: 123456789\n',
'2: company name: A\nwebsite: www\nphone: 123456789\n',
'3: company name: A\nwebsite: www\nTel: 123456789 address: abb\n',
'4: company name: A\nwebsite: www\nTel: 123456789 ...'
]
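The splitting step could be sketched roughly as below. This is only a minimal illustration, assuming the text has already been extracted from the Word files into a plain string and that every record starts on its own line with digits followed by a colon. The names `raw`, `RECORD_START` and `split_records` are made up for this example, and the sample text is invented. Splitting on a zero-width pattern needs Python 3.7 or later.

```python
import re

# Hypothetical sample text; the real documents will be much messier.
raw = (
    "1: company name: A\nwebsite: www\nphone: 123456789\n"
    "2: company name: A\nwebsite: www\nphone: 123456789\n"
    "3: company name: A\nwebsite: www\nTel: 123456789 address: abb\n"
)

# Zero-width match at the start of any line that begins with "<digits>:".
# Assumption: no ordinary value line starts with digits and a colon
# (a line such as "12:30 ..." would be treated as a new record).
RECORD_START = re.compile(r"(?m)^(?=\d+:)")


def split_records(text):
    """Split the text into records, keeping the leading number in each one."""
    return [part for part in RECORD_START.split(text) if part.strip()]


records = split_records(raw)
for record in records:
    print(repr(record))
```

Keeping the number inside each record at this stage makes it easier to trace a bad record back to its place in the source document.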

2. Take each item and replace the leading number and colon ("1:") with an empty string.
For example, start from: '1: company name: A\nwebsite: www\nTel: 123456789\n'
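Removing the leading number might look like the sketch below. It assumes the number always sits at the very start of the item; `LEADING_NUMBER` and `strip_number` are names invented for this example.

```python
import re

# Optional whitespace, digits, a colon, then optional whitespace,
# anchored at the start of the item only.
LEADING_NUMBER = re.compile(r"^\s*\d+:\s*")


def strip_number(item):
    """Remove the leading '<digits>:' prefix; the rest of the item is untouched."""
    return LEADING_NUMBER.sub("", item, count=1)


item = "1: company name: A\nwebsite: www\nTel: 123456789\n"
print(repr(strip_number(item)))
```

Because the pattern is anchored with `^` and has no multiline flag, digits and colons further inside the item (for example in a phone field) are left alone.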

3. Split each item on \n, then separate each line into a key-value pair at the colon.
For example: 'company name: A\nwebsite: www\nphone: 123456789\n'
gives the result: [{'company name': 'A'}, {'website': 'www'}, {'phone': '123456789'}]
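A possible sketch of the key-value step follows. It assumes one field per line and an ASCII colon as the separator; `parse_fields` and `merge_fields` are hypothetical helper names, and merging the pairs into one dict per record is an extra convenience not described in the steps above.

```python
def parse_fields(item):
    """Turn 'key: value' lines into a list of single-entry dicts."""
    pairs = []
    for line in item.splitlines():
        if ":" not in line:
            continue  # skip blank lines and lines without a separator
        # Split on the first colon only, so a value such as
        # "http://example.com" keeps its own colon.
        key, value = line.split(":", 1)
        pairs.append({key.strip(): value.strip()})
    return pairs


def merge_fields(pairs):
    """Collapse the list of pairs into one dict per record (later keys win)."""
    merged = {}
    for pair in pairs:
        merged.update(pair)
    return merged


item = "company name: A\nwebsite: www\nphone: 123456789\n"
pairs = parse_fields(item)
print(pairs)
print(merge_fields(pairs))
```

A line that packs two fields together, such as 'Tel: 123456789 address: abb', comes out as a single pair with key 'Tel', so lines like that would still need separate handling. The inconsistent field names ('phone', 'Tel') would also need to be mapped to one name before loading.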
