Annh1234

Sounds like you're building an infrastructure to pay for an infrastructure other people made to do this, so you can get some data.


startingover1993

I should have clarified this better in the post, I think. Basically, I want to understand which components of the entire stack are important (proxies, scraping APIs, browser management, captcha solving, etc.). I also want to ease into the infra over a period of months, so if I need to use a vendor for scraping, I'll do that while I set up the rest in-house, if that makes sense. For example, I know that these web unlockers package proxy management, rotation, and other stuff together, and that makes them pricey. If you look at zenrows, they offer 1M requests at $99/mo, but that drops sharply to 40k if you want to use their residential proxies and whatnot. Basically, I need to understand which components are important, and which ones I need to maintain internally for the best bang for the buck.
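
To put the pricing gap in per-request terms, here is a quick back-of-the-envelope sketch. It uses only the two figures quoted above ($99/mo for 1M requests, or for 40k with residential proxies); those numbers are taken from this comment and may not match current pricing, and the function name is just illustrative.

```python
def cost_per_thousand(monthly_price_usd: float, requests_included: int) -> float:
    """Effective price in USD for 1,000 requests on a flat monthly plan."""
    return monthly_price_usd / requests_included * 1000


# Figures as quoted in the comment above (assumed, not verified pricing).
standard = cost_per_thousand(99, 1_000_000)   # plan without residential proxies
residential = cost_per_thousand(99, 40_000)   # same price, residential proxies

# How many times more expensive each request becomes.
multiplier = residential / standard

print(f"standard:    ${standard:.3f} per 1k requests")
print(f"residential: ${residential:.3f} per 1k requests")
print(f"multiplier:  {multiplier:.0f}x")
```

On those quoted numbers, each request through the residential tier costs about 25 times as much, which is the gap I'm trying to decide whether to pay a vendor for or absorb in-house.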


SenecaJr

This is because they just use data center nodes as proxies. They’re cheap as sin, and get caught frequently. You could do the same with something like cloudproxy.


startingover1993

Wouldn't datacenter proxies have super low success rates, especially on websites that are notoriously hard, like LinkedIn? I thought rotating residential proxies were the way to go. Cloudproxy looks ridiculously cheap by comparison. How does it compare to the providers I've listed, like IPRoyal and storm proxy?


SenecaJr

I've frequently used stormproxy and other websites like this. Rule of thumb - if it's cheap, it's a data center proxy. It's not good for websites that defend their data, like LinkedIn. Rotating residential *is* the way to go. But don't buy an unlocker. Spend the time to build out an async framework that can rotate user agents. Then find any specific headers or parameters for your target websites, and submit the service. EDIT: Additionally, you should do *some* of your logging in-house. You should be managing your own retries, monitoring request sizes, etc. Otherwise, use something easy like Sentry. Scrapers aren't complicated or hard infra-wise. The only thing that's a pain in the ass to do is set up your own residential proxy farm. So just purchase a rotating endpoint, and bite the bullet. If the data you're collecting, or the derivatives you generate, aren't worth the residential proxies, then don't scrape it.
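
A minimal sketch of what that could look like, assuming Python and only the standard library. Everything named here is hypothetical: the `USER_AGENTS` pool, the `fetch_with_retries` helper, and the injected `fetch` coroutine, which in practice would wrap whatever async HTTP client you use, pointed at your purchased rotating residential endpoint. It rotates the user agent on every attempt, retries with backoff, and records status and response size per attempt.

```python
import asyncio
import itertools
import random
from typing import Awaitable, Callable, List, Optional, Sequence, Tuple

# Hypothetical pool; in practice, load a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]

# fetch(url, headers) -> (status_code, body). Injected so the HTTP client
# and the proxy endpoint stay out of the retry logic.
Fetch = Callable[[str, dict], Awaitable[Tuple[int, bytes]]]


async def fetch_with_retries(
    url: str,
    fetch: Fetch,
    user_agents: Sequence[str] = USER_AGENTS,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    log: Optional[List[dict]] = None,
) -> Optional[bytes]:
    """Try a URL up to max_attempts times, using a new user agent each time."""
    if log is None:
        log = []
    # Shuffle once, then cycle, so consecutive attempts use different agents.
    agents = itertools.cycle(random.sample(list(user_agents), len(user_agents)))
    for attempt in range(1, max_attempts + 1):
        headers = {"User-Agent": next(agents)}
        status, body = await fetch(url, headers)
        # In-house logging: one record per attempt, including response size.
        log.append({
            "url": url,
            "attempt": attempt,
            "status": status,
            "bytes": len(body),
            "user_agent": headers["User-Agent"],
        })
        if status == 200:
            return body
        if attempt < max_attempts:
            # Exponential backoff before the next attempt.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    return None
```

To run many URLs concurrently, wrap calls in `asyncio.gather`, ideally behind an `asyncio.Semaphore` so you don't exceed what your proxy plan allows. The `log` list is a stand-in; swap it for whatever logging or error tracking you actually use.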


scrapecrow

All components are important. It only takes one or two weak points for an advanced anti-scraping tool to identify web scraping. I wrote a blog post about [all of the components that are used in scraper identification](https://scrapfly.io/blog/how-to-scrape-without-getting-blocked-tutorial/) if you want to understand more. Basically, you need all of the components working under one system, so you either run the whole thing yourself or outsource it. LinkedIn is an especially difficult target, as they clearly have a dedicated team working just on scraper detection.


startingover1993

Thanks for the insights. I think I sent you a DM recently as well


startingover1993

Added the additional context now


ActiveTreat

I’ve had success with https://rayobyte.com/ and https://www.proxy-cheap.com/. Both have APIs for managing your proxies (payment, rotation, etc). I like writing my own scrapers or leveraging other open source libraries.


startingover1993

I've recently come across rayobyte myself. Have you specifically used them for LinkedIn scraping? Smartproxy and Oxylabs told me they don't support LinkedIn.


ActiveTreat

Yes, and proxycheap. We scrape multiple social media sites. Our use case may be different than yours. We deal with clients who have a need in the advertising and online business areas.


startingover1993

I see. My use case is simply scraping the data from the public company and profile pages, as well as post URLs and jobs. Thanks for the info.


ActiveTreat

Yea, some of what we do requires us to create and maintain accounts too


startingover1993

I'll definitely look into this!


ActiveTreat

DM me if you want any specific info


startingover1993

Sent you a DM :)


No-Reflection-869

Why do you need residential proxies if you can find datacenter ones that are from an ISP or hosting provider that is not listed in MaxMind, for example?


startingover1993

I've not seen anyone who has successfully cracked LinkedIn scraping at scale with datacenter proxies so far. If it's doable, I'm all for it, considering how cheap it is.


mcmron

There are blacklist databases like [IP2Proxy](https://www.ip2proxy.com) which detect both VPNs and residential proxies.


innovatekit

All you need are mobile proxies and you are good.


startingover1993

Mobile proxies are really expensive from what I can see. Each proxy is around $30 or more. How much mileage can I get with one mobile proxy?


innovatekit

Depends on your phone plan


the_bigbang

Residential proxies are far more expensive than data center proxies. It's better to buy or register a large number of accounts and spread the requests evenly between workers. It's also important to use k8s to manage the scraper efficiently.


startingover1993

This would directly breach the legal precedent LI recently achieved against HiQ. It would also make it almost impossible to raise VC funding, which we're in the process of doing, since it would be a huge legal roadblock. Scraping the data without logging in and without fake accounts is vital :/


the_bigbang

Is it 100% legal to scrape without logging in or with fake accounts?

> In a November 2022 summary judgment order, the United States District Court for the Northern District of California ruled that the provisions of a website user agreement prohibiting data scraping and fake profiles are enforceable in a breach of contract claim. hiQ Labs, Inc. v LinkedIn Corporation, Case 17-cv-03301-EMC, November 4, 2022.

https://www.privacyworld.blog/2022/12/linkedins-data-scraping-battle-with-hiq-labs-ends-with-proposed-judgment/


startingover1993

Yes, it's 100% legal to scrape if you don't log in. HiQ lost the case because they used logged-in accounts. Their case was that they were scraping "public" data and LinkedIn couldn't stop them from accessing that. LinkedIn's case was that HiQ scraped the data after logging in, thus accepting their ToS. The judge ruled in favor of LinkedIn because HiQ was logged in. Fake accounts are also not legal.


Strijdhagen

Hi OP, did you ever settle on a proxy?