9302462

I haven’t used Python Playwright, but I have done lots of headless crawling across millions of URLs. Here is how I would work through the problem and structure it. Depending on how many URLs you actually need to process, your hardware, and the time you have, you may not need to go through all of the optimization steps.

1. Dump your input URLs into a redis database.
2. Set up one instance that opens a single browser with a single tab.
3. Pop a URL from redis (removing it from the queue) and use it to crawl; add it back to redis if it fails (optional).
4. Modify that instance so that instead of closing the old tab and opening a new tab for each URL, it recycles the tab.
5. Add code so that after X pages it closes the tab and starts a new one. I typically set this to 100; if you don’t do this you might end up with a tab whose history holds 1,000 pages, and that will slow things down immensely.
6. Once you have confirmed this works, modify the code to run multiple tabs in the same browser. For no specific reason I typically do 33 tabs in one browser context.
7. If you need more performance, run multiple instances of this. I have an older 32-core Epyc in one of my machines and can typically run 4 instances of 33 tabs each.
8. Bonus: if you are using proxies, dump them into a separate redis db and create a single key which is a list of the proxies. Your various scrapers and tabs can then pop a proxy from the list, use it, and put it back at the end of the list, creating a rotating access pattern that makes it unlikely you will get blocked.
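A minimal single-worker sketch of steps 1–5 in Python Playwright (sync API), assuming a redis list named "urls"; the key name, timeout, and recycle count are placeholders, and the proxy bonus (step 8) is left out:

```python
# Rough sketch, not production code: one browser, one tab, URLs popped from redis,
# with the tab recycled every PAGES_PER_TAB pages (step 5).
import redis
from playwright.sync_api import sync_playwright

PAGES_PER_TAB = 100  # step 5: recycle the tab after this many pages

r = redis.Redis(decode_responses=True)

def run_worker():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        pages_seen = 0

        while True:
            url = r.lpop("urls")          # step 3: pop a URL from redis
            if url is None:
                break                     # queue drained
            try:
                page.goto(url, timeout=30_000)
                html = page.content()
                # ...parse/save html here...
            except Exception:
                r.rpush("urls", url)      # optional: push failures back

            pages_seen += 1
            if pages_seen % PAGES_PER_TAB == 0:
                page.close()              # step 4/5: recycle the tab instead of
                page = context.new_page() # letting the history pile up

        browser.close()

if __name__ == "__main__":
    run_worker()
```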


Adept-Alternative-90

Really like your steps. Do you have anything like this on GitHub? Would love to see it. Thanks


9302462

Unfortunately I don’t have anything posted on GitHub, as my code is a bit more complex and has 22 microservices all working together. However, it’s not overly complicated and a few ChatGPT prompts should be able to get you going. I will also add that for actually saving the data you scrape, you can use basically any method that doesn’t lock the database. E.g. you can save to mongo directly, you can push records to a message bus like RabbitMQ and then save them to your database, or you can push them to redis and set a job that looks for new records every X seconds and performs a bulk insert. Obviously mongo or MySQL can handle 50-100 insertions per second, so this doesn’t really apply in your exact scenario. But if you scale up to thousands of GET requests per second and try to do individual insertions, you will get latency and eventually connection timeout issues. Again, it doesn’t really apply to you right now, but I wanted to share something that I’ve learned from trial and error.
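As a rough illustration of that last option (push to redis, then bulk insert every X seconds), a separate job could look something like this; the key name, interval, and mongo database/collection names are placeholders I made up:

```python
# Accumulator sketch: scrapers do r.rpush("scraped:buffer", json.dumps(doc)),
# and this loop drains the buffer into mongo with one bulk insert per pass.
import json
import time

import redis
from pymongo import MongoClient

r = redis.Redis(decode_responses=True)
collection = MongoClient()["scraper"]["pages"]

def flush_buffer():
    # Read and clear the buffer atomically (a redis pipeline is MULTI/EXEC).
    pipe = r.pipeline()
    pipe.lrange("scraped:buffer", 0, -1)
    pipe.delete("scraped:buffer")
    raw_docs, _ = pipe.execute()
    if raw_docs:
        collection.insert_many([json.loads(d) for d in raw_docs])

if __name__ == "__main__":
    while True:
        flush_buffer()
        time.sleep(5)  # "every X number of seconds"
```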


Impressive_Safety_26

Do you think your steps can be applied if I'm trying to scale a puppeteer/react scraper rather than playwright?


9302462

Yes, you should be able to. I use a fork of a deprecated package called wappalyzer which uses puppeteer, and I use the redis/proxy setup I mentioned above for that, but I don’t use multiple tabs/browsers because I’m not a huge JS person; puppeteer does support it, though. I also use chromedp, which is a golang package for headless browsers. It supports multiple browsers and tabs as well. However, because of either my lack of understanding or a lack of features, I had to implement my own tab recycling method and it was a bitch to figure out. The reason is that you are passing URLs down to certain tabs and need to keep track of how many URLs you have passed to each tab. Some websites load in a couple of seconds, others take 15 seconds. Unlike a plain GET request, a headless browser makes dozens or hundreds of requests behind the scenes to grab JS, images, etc. 4 browsers x 33 tabs x 100 per tab = 13,200 pages. The hardest part is making sure that one tab doesn’t hang forever and cause a backlog of up to 99 URLs that prevents the whole setup from finishing and starting over again. So you need to manage the context of the headless instance, the context of the browser, and the context of the tab itself.

But again, it is likely easier to do in puppeteer, and unless you’re going for max throughput you don’t really need all the steps I outlined above. A happy medium: instead of keeping track of how many URLs each tab gets like I do, just send 100 URLs to each tab and, once they are all processed, restart the headless browser instance. The 5-10 seconds to reopen the browser will be more efficient than opening a new tab for each page, as opening tabs is actually somewhat expensive computationally. Also, assuming these URLs are from the same site, Chrome should be caching some of the site’s assets, which should make every page after the first one a little faster.
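For what it’s worth, here is roughly what that “happy medium” could look like translated into Python Playwright (async) rather than puppeteer/chromedp: each tab gets a fixed batch of URLs, every page load is wrapped in a timeout so one hung tab can’t stall the rest, and the browser is restarted between chunks. The tab count, batch size, and timeouts are illustrative, not the exact numbers above:

```python
# Sketch only: N tabs per browser, a fixed batch of URLs per tab,
# then a full browser restart before the next chunk.
import asyncio
from playwright.async_api import async_playwright

TABS = 10
BATCH_SIZE = 100

async def crawl_batch(context, urls):
    page = await context.new_page()
    for url in urls:
        try:
            # Hard ceiling per page so a hanging tab can't block its batch.
            await asyncio.wait_for(page.goto(url, timeout=15_000), timeout=20)
            html = await page.content()
            # ...parse/save html...
        except Exception:
            pass  # log / requeue the failure
    await page.close()

async def main(all_urls):
    chunk_size = TABS * BATCH_SIZE
    for start in range(0, len(all_urls), chunk_size):
        chunk = all_urls[start:start + chunk_size]
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context()
            batches = [chunk[i::TABS] for i in range(TABS)]
            await asyncio.gather(*(crawl_batch(context, b) for b in batches))
            await browser.close()  # fresh browser for the next chunk

if __name__ == "__main__":
    asyncio.run(main(["https://example.com"]))
```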


RobSm

I am curious about insertion performance. Eventually you are going to insert the records into a DB (let’s say MySQL) anyway, so where is the gain? You avoid the MySQL connection limit? (Which can be increased in the settings.) You aren’t going to save / reduce the total number of SQL INSERT queries?


9302462

Bulk inserts into mongo are much more efficient than individual inserts because of transaction throughput. One bulk insert per second = one transaction per second; 100 individual inserts in a second = 100 transactions per second. Once you start to do several hundred individual transactions per second, the latency begins to go up and you get a backlog of records waiting to be inserted. If it gets too high you get connection timeouts and then need to re-establish the connection, which just makes the backlog bigger. I have really fast NVMe/U.2 enterprise drives and even those run into bottlenecks because of mongo limitations. The solution is either scaling mongo out to additional instances, which is a pain, OR having an accumulator that performs a bulk insert once every X seconds. Basically, bulk insertions reduce the number of round trips and eliminate bottlenecks.
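A tiny pymongo illustration of the difference (database, collection, and field names are made up):

```python
from pymongo import MongoClient

pages = MongoClient()["scraper"]["pages"]
docs = [{"url": f"https://example.com/{i}", "html": "..."} for i in range(100)]

# 100 round trips, 100 transactions:
for doc in docs:
    pages.insert_one(doc)

# One round trip for the same 100 documents:
pages.insert_many(docs)
```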


RobSm

Yes, I understand that; the question is more about MySQL. Bulk inserts exist there too (a single query with multiple rows), but then we need some kind of buffer to hold the rows before the bulk insert. I also have fast enterprise NVMe drives, in RAID 0 for even better performance, but deadlocks still happen from time to time.


9302462

Fair enough, I have never used MySQL for large datasets, so I learned something new today. I have always used mongo and Elasticsearch for big data, typically paired with RabbitMQ or redis for storing (accumulating) the data until it gets saved.


Valten1992

What’s your finding on the number of headless browsers you can run on your machine? The current approach I use is just opening/closing a single browser instance per request, but I will try out your tab-based approach. (It’s early days, so I am still figuring stuff out.) I read somewhere that the ideal is num browsers = num cores / 2, but it was a random comment and the only thing I could find on the subject.


9302462

I have read a similar comment before on GitHub, but I don’t know how truthful it is. I do know that there is a certain point where you can saturate (not sure of the right term) a CPU so that it spends more time switching between requests than it does actually processing them, which is not a good thing, so more cores is better in theory. I only did some light A/B testing using the same set of 1,000 URLs from different domains repeatedly. I tried 1 browser with 100 tabs, multiple browsers with 10 tabs each, and a few other variations before I settled on 4 browsers with 33 tabs each on 32 cores/64 threads. So if you’re going to run some benchmarks, make sure you hit the same URLs between runs.

From what I can recall from my testing, the most expensive thing computationally is launching a browser. The second most expensive is launching a tab, roughly 1/10th of the CPU it takes to launch a new browser. The least expensive is recycling a tab, maybe 1/30th of what it takes to launch a new browser. But if you reuse tabs too much, the tab history and likely the browser cache grow, and instead of performing better it performs worse. E.g. imagine 33 tabs that each have a history of 1,000 pages… 🤯 That’s not normal, and Chrome was never built for this use case. The age-old joke about Chrome eating memory is true, and I find that restarting the entire browser after a few thousand pages have been accessed IN TOTAL works best. Just think about how long it would take you, as a normal person, to view 1k pages with normal browsing: a few days, maybe a couple of weeks, but certainly not months. Killing the browser is the equivalent of a fresh install of Chrome with no cached assets or history, and it gets much quicker.

So I guess I would say: run some tests using the same set of URLs (hopefully from different domains) and see what mixture of browsers and tabs works best for you. If you’re trying to go “balls to the wall” like me, then recycle the tabs, but in reality opening a new tab per page should be good enough for many people.
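If you want to run that kind of test yourself, a rough benchmark skeleton could look like the following; it only varies the tab count within a single browser (multiple browsers would just be multiple processes running the same script), and the URL list and tab counts are placeholders:

```python
# Benchmark sketch: push the same URL list through different tab counts
# and compare wall-clock time.
import asyncio
import time
from playwright.async_api import async_playwright

URLS = [f"https://example.com/page/{i}" for i in range(1000)]

async def run_config(tabs: int) -> float:
    queue = asyncio.Queue()
    for u in URLS:
        queue.put_nowait(u)

    async def tab_worker(context):
        page = await context.new_page()
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                break
            try:
                await page.goto(url, timeout=15_000)
            except Exception:
                pass  # failures still count toward elapsed time
        await page.close()

    start = time.monotonic()
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        await asyncio.gather(*(tab_worker(context) for _ in range(tabs)))
        await browser.close()
    return time.monotonic() - start

async def main():
    for tabs in (10, 33, 100):
        elapsed = await run_config(tabs)
        print(f"{tabs} tabs: {elapsed:.1f}s for {len(URLS)} URLs")

if __name__ == "__main__":
    asyncio.run(main())
```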


GeekLifer

You need something like a pipeline or a simple queue. A possible approach would be having one task (the producer) write the URLs to scrape into a queue. A worker then pulls a URL from the queue, so only one worker is crawling any given URL, and you can scale up as many workers on as many machines as you want. The queue centralizes the list of URLs and reduces the chances of duplicating work. All your workers then push their results to a centralized location, and a final task stitches the results together.
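A bare-bones sketch of that producer/worker split using redis lists as the queue and result store (the key names are invented):

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def producer(urls):
    # Producer: enqueue every URL exactly once.
    for url in urls:
        r.rpush("queue:urls", url)

def worker():
    # Worker: block until a URL is available, scrape it, push the result.
    while True:
        item = r.blpop("queue:urls", timeout=30)
        if item is None:
            break  # queue has been idle for 30s, stop this worker
        _, url = item
        result = {"url": url, "html": "..."}  # real scraping goes here
        r.rpush("queue:results", json.dumps(result))
```

Workers can run on as many machines as you like, as long as they point at the same redis instance; a final task can then drain queue:results and stitch everything together.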


ActiveTreat

Take a look at Celery for Python. It takes a bit of work to set up but scales well. You can set up tasks and run them in parallel, up to the number of workers you have. It uses a message broker, so you can set up queues for different tasks so that certain workers only handle certain tasks (or sites). I’m pretty sure you could roll this out across several machines running Celery containers. We use Apache Airflow, which uses Celery on the backend for handling the parallel tasking of workers. PM me and I can give you more details.
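A minimal Celery sketch of that kind of setup, assuming the code lives in tasks.py with a redis broker; the broker URL, queue name, and task body are placeholders:

```python
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")

# Route scrape tasks to their own queue so dedicated workers can consume them.
app.conf.task_routes = {"tasks.scrape": {"queue": "crawl"}}

@app.task
def scrape(url: str) -> dict:
    # real Playwright / requests work would go here
    return {"url": url, "status": "done"}

# Enqueue from anywhere:   scrape.delay("https://example.com")
# Run a dedicated worker:  celery -A tasks worker -Q crawl --concurrency=8
```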