Some questions about scrapy-redis

I want to crawl a website with about 1 billion records. Each URL looks like http://xxx.com/id=xx; I fetch the page, extract the data, and store it in the database.

The id parameter in the URL is predictable, ranging from 0 to 1000000000, so I can generate all 1 billion URLs directly:

for i in range(0, 1000000000):
    yield Request(f"http://xxx.com/id={i}", callback=self.parse)
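
(For context, the complete single-machine spider looks roughly like this; the spider name is just a placeholder:)

import scrapy
from scrapy import Request

class IdSpider(scrapy.Spider):
    name = "id_spider"  # placeholder name

    def start_requests(self):
        # enumerate every predictable id and request it
        for i in range(0, 1000000000):
            yield Request(f"http://xxx.com/id={i}", callback=self.parse)

    def parse(self, response):
        # extract the fields here and yield an item for the database pipeline
        yield {"id": response.url.rsplit("=", 1)[-1]}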

But this way it can only run on one machine, and the throughput is far too low.

I intend to switch to scrapy-redis, but I have the following questions:

Suppose I use scrapy-redis with 20 machines:

1. Do the Master and the Slaves have separate duties, i.e. the Master is responsible for generating URLs and pushing them into Redis, while the Slaves pull URLs from Redis and consume them? If not, how does it actually work?

2. Since I want to store the crawled results in a database, does every Slave have to connect to the database? Can each Slave write its results to the database itself? I feel it would waste a lot of time and bandwidth to send the data back to the Master and let the Master write it to the database.

3. According to the articles I have read online, there seems to be no strict distinction between Master and Slave: before crawling each id, every machine just asks Redis whether someone else has already handled it, which feels like a big waste of time. Doesn't scrapy-redis support the division of duties I described in question 1?
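
For reference, my current understanding from those articles is that all 20 machines simply share one scheduler and dupefilter by pointing at the same Redis, with settings roughly like these (the Redis address is a placeholder):

# settings.py on every machine
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                 # keep the queue and the seen set across restarts
REDIS_URL = "redis://redis-host:6379"    # placeholder address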

That is all for now; thank you in advance for answering these three questions.

Let me give my opinion for reference only:

  1. You can write a script that stores the generated URLs in a Redis set. Each scrapy-redis machine then pops a URL from the same Redis instance and crawls it (see the first sketch after this list). In my opinion, there is no Master/Slave distinction.
  2. If the database is on the local network, writing to it is not that slow. If the database is on the public Internet, each server can save its results to local files and merge them after the crawl (see the second sketch after this list).
  3. As for deduplication: if you follow point 1, the URLs live in a Redis set, so they will not be repeated.
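
A rough sketch of point 1, just to illustrate; the key name and Redis address are placeholders, and I am assuming REDIS_START_URLS_AS_SET = True in settings.py so the spiders read start URLs from a set instead of a list:

# seed_urls.py: one-off script that fills the shared Redis set
import redis

r = redis.Redis(host="redis-host", port=6379)  # placeholder address
pipe = r.pipeline()
for i in range(0, 1000000000):
    pipe.sadd("idspider:start_urls", f"http://xxx.com/id={i}")
    if i % 10000 == 0:
        pipe.execute()  # flush in batches so the pipeline does not grow unbounded
pipe.execute()

The spider run on each of the 20 machines is then just a RedisSpider reading from the same key:

# every machine runs this same spider; they all pop from one shared key
from scrapy_redis.spiders import RedisSpider

class IdSpider(RedisSpider):
    name = "idspider"
    redis_key = "idspider:start_urls"  # same key the seed script fills

    def parse(self, response):
        yield {"id": response.url.rsplit("=", 1)[-1]}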
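
And for point 2, a minimal sketch of "write locally on each server", assuming a throwaway SQLite file as the local store (any real database client would plug in the same way, and the id/data columns stand in for whatever fields you extract). Enable it on every worker through the ITEM_PIPELINES setting; the per-server files can then be merged into the central database after the crawl finishes.

import sqlite3

class LocalStorePipeline:
    """Runs on every worker, so each machine stores its own results."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect("results.db")  # placeholder local file
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results (id TEXT PRIMARY KEY, data TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT OR IGNORE INTO results (id, data) VALUES (?, ?)",
            (item.get("id"), item.get("data")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()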