Before we delve into the depths of what residential proxies are & how to manage them, it is best first to address why proxies are a requirement for web data extraction at a commercial scale.
It is easy to fall for the fallacy "Because I got data from 10 to 1,000 pages per day up until now, I could get 1,000,000 pages per day..." The reality is that getting a few data points from the web is easy, and there are tons of solutions for it. Getting data from the web consistently at scale is much harder, with many more aspects to consider: the scalability and reliability of the solution, infrastructure, maintenance, and data quality.
Websites nowadays use many technologies that target the proxy and browser layers, mainly to protect themselves against bad actors: geofencing, TCP/IP fingerprinting, browser fingerprinting, etc. This means that extracting data from the open web at scale requires a technology layer that returns as much data as possible to your system and does not hinder the process because, let's say, you are based outside the US and the target site only displays certain (if any) information to US-based visitors.
If you want to keep on reading about data extraction, I suggest having a quick read of my colleague Sarah's recent article - How to extract data from a website. For all those interested in proxies specifically, let's get into the thick of it.
Are there different types of proxies? Data center, ISP, residential, IPv4/IPv6, SOCKS4, SOCKS5, SOCKS5s...
The short answer is yes. Not all of them target the same OSI layer, not all cost the same, etc. The truth is that, depending on what you need and where you need to gather the information from, one type or another, or indeed a combination of them, might suit you best.
A proxy, or proxy server, is, in its simplest form, a computer that sits between you and the target server. It acts as a gateway between your local environment and a large-scale network such as the internet.
A proxy essentially works as a middleman, intercepting connections between sender and receiver. All incoming data enters through one port and is forwarded to the rest of the network via another port.
Aside from forwarding traffic, proxy servers provide security by hiding the actual IP address of a server. They also have caching mechanisms that store requested resources to improve performance. A proxy server may encrypt your data so that it is unreadable in transit, and it may block or allow access to certain webpages based on IP address, for example.
Data center, residential, and mobile proxies are IP addresses that replace your own IP in the eyes of websites and servers. You can use all of these different types of proxies to browse anonymously and change your perceived location. But they're not the same, whether we're talking about price, features, or performance. So how should you choose between them if you want to embark on a web data extraction project at scale? Read on...
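To make the "middleman" idea concrete, here is a minimal sketch using only Python's standard library. It builds an opener whose HTTP and HTTPS requests are forwarded through a proxy. The proxy address and credentials are placeholders (an assumption for illustration, not a real endpoint), and the helper name `build_proxy_opener` is ours.

```python
import urllib.request


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address; replace it with one supplied by your provider.
opener = build_proxy_opener("http://user:pass@proxy.example.com:8000")

# From here on, opener.open("https://example.com", timeout=10) would reach the
# target site via the proxy, so the site sees the proxy's IP instead of yours.
```

Nothing is sent over the network until `opener.open(...)` is called, so the snippet is safe to run as is.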
Data Center Proxies – Fast and affordable, but a good deal of work to manage and easier to block
What is a data center proxy?
Data center proxies are IP addresses hosted on the infrastructure owned by data center vendors. They can be of four types:
Public - Your typical free proxy. These are IPs you can find online free of charge but are useless for any sort of data harvesting project. Bear in mind that because of their public nature, they are a security risk as you cannot tell whether your data is being intercepted by someone while in transit (TL;DR = Never use to transmit sensitive information)
Shared - These are IP addresses that can be used by multiple users at the same time. They offer the best bang for your buck for simple web data extraction tasks, but they are prone to banning because the same resource is being used by several people at once. This holds true unless you have come up with some serious proxy management logic to keep your IP pool health in check, which includes IP rotation, request throttling, and much more (if you want to relieve yourself of all this DevOps work, why not try Smart Proxy Manager?).
Private - Data center IPs for which you hold exclusive usage rights within a certain time window and for a specified domain.
Dedicated - You have purchased the usage rights for that IP from a data center vendor (AWS, Azure, Equinix, Digital Realty, etc.) or, if you are not strapped for cash, gone down the bare-metal route and own the infrastructure yourself.
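The "proxy management logic" mentioned for shared IPs can be sketched in a few lines. The class below is a simplified illustration, not how any particular product works: it rotates through a pool round-robin, enforces a minimum rest period per proxy (a crude form of request throttling), and skips proxies you have flagged as banned. The class name, the parameters, and the injectable `clock` are our own assumptions.

```python
import itertools
import time
from typing import Callable, Dict, List, Optional, Set


class ProxyPool:
    """Round-robin proxy rotation with a per-proxy minimum delay and a ban list."""

    def __init__(self, proxies: List[str], min_delay: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if not proxies:
            raise ValueError("proxy list is empty")
        self._proxies = list(proxies)
        self._cycle = itertools.cycle(self._proxies)
        self._min_delay = min_delay      # seconds a proxy must rest between uses
        self._clock = clock              # injectable so the logic can be tested
        self._last_used: Dict[str, float] = {}
        self._banned: Set[str] = set()

    def mark_banned(self, proxy: str) -> None:
        """Take a proxy out of rotation, e.g. after repeated 403 or 429 responses."""
        self._banned.add(proxy)

    def acquire(self) -> Optional[str]:
        """Return the next proxy that is neither banned nor resting, else None."""
        now = self._clock()
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy in self._banned:
                continue
            last = self._last_used.get(proxy)
            if last is not None and now - last < self._min_delay:
                continue
            self._last_used[proxy] = now
            return proxy
        return None
```

A caller would ask `pool.acquire()` before each request, wait briefly when it returns `None`, and call `pool.mark_banned(proxy)` when a target starts rejecting that IP. Real-world logic usually adds per-domain state, ban expiry, and health scoring on top of this.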
Most of the market charges you per IP address, although here at Zyte, we do things a little bit differently with our data center offering. We charge per successful request sent back to you.
Fast yet stable – because these nodes are housed within an enterprise-grade infrastructure, data center proxies typically have extremely high uptime (99.9% or more) with high bandwidth capacity. Using this type of proxy is a reliable and scalable approach to web scraping, especially if used with some advanced logic to get the most out of it.
Shared or private – you can share a data center proxy with others to save costs or buy one for your exclusive use. Exclusive use ensures that no one else can abuse the IP address.
Affordable – typically, a private data center proxy costs around $2. Shared proxies can be bought for cents apiece. This has made them almost a commodity, albeit at the cost of a more hand-held approach: you get what you pay for, and off you go (not our way, though).
Unlimited traffic – Most vendors in the market charge per IP address and not by volume of data being transferred. Here at Zyte, we charge per successful request.
Few locations – To create IP nodes, you need bare-metal infrastructure, which means a physical presence. As you might have guessed, this makes building your own data centers across various locations a capital-intensive enterprise; renting is less so. Still, it is hard to come by a data center company that can provide true worldwide coverage. That's where vendors such as ourselves come into play, making sure our IP pool is curated and purchased from multiple vendors across the globe so as to make sure that there is no real restriction on locale.
Simple to detect – data center IPs are not assigned to a residential ASN (the autonomous system number, which identifies the network operator that announces the IP range), and the subnet diversity will likely be quite low. As a result, websites that care will see that you're using a proxy, even if it's otherwise fully anonymous. This might or might not be an issue depending on the target domain you wish to extract information from. To be able to take full advantage of data center IPs for enterprise-grade web scraping jobs and reap the benefits of their overall lower TCO, lots of DevOps man-hours and know-how must be put in play to avoid this common pitfall.
A hassle to use - A typical data center vendor will provide their customer with a proxy list with the unique IPs of all the purchased nodes in a txt file...putting it politely, they are inconvenient to use at the best of times. To actually derive value from them, you'll need to spend a considerable amount of time just on the proxy management side of things, let alone figuring out how to effectively extract the data you are after from the individual URLs.
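As a rough illustration of the last two points, the sketch below reads a vendor-style proxy list and counts how many proxies fall into each /24 subnet. It assumes one IPv4 `ip:port` entry per line, which is a common but not universal format; the function names and the sample addresses (taken from the ranges reserved for documentation) are ours.

```python
import ipaddress
from collections import Counter
from typing import Iterable, List


def parse_proxy_list(lines: Iterable[str]) -> List[str]:
    """Extract the IP from 'ip:port' lines, skipping blank lines and comments."""
    hosts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hosts.append(line.rsplit(":", 1)[0])
    return hosts


def subnet_counts(hosts: Iterable[str], prefix: int = 24) -> Counter:
    """Count how many proxies share each subnet of the given prefix length."""
    counts: Counter = Counter()
    for host in hosts:
        network = ipaddress.ip_network(f"{host}/{prefix}", strict=False)
        counts[str(network)] += 1
    return counts


# Sample contents of a hypothetical proxies.txt file.
sample = ["192.0.2.10:8080", "192.0.2.11:8080", "", "198.51.100.7:3128"]
counts = subnet_counts(parse_proxy_list(sample))
```

Here `counts` maps `192.0.2.0/24` to 2 and `198.51.100.0/24` to 1. A pool where most IPs pile up in a handful of subnets is easier for a website to block in one go.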
Residential Proxies – The best of the best, but at a price
What is a Residential Proxy?
Residential proxies are IP addresses borrowed from real users: their laptops, phones, and other smart devices on Wi-Fi. This makes them much harder to detect by target websites while also supporting a wider variety of locations and more precise targeting options.
Traffic is usually metered, and some plans also limit concurrency (how many parallel connections and/or requests you can make).
Highly anonymous – because they connect via a real device, residential proxies are very hard to distinguish from regular users. Websites tend to give them the benefit of the doubt, even if the user is performing irregular, bot-like activities.
Big proxy pool – major providers have millions of IPs under their belt, so you can make a huge number of requests without encountering the same IP twice. This translates into two further benefits:
Many locations – those IPs are usually spread throughout the world. There are some dominant countries that take the lion's share, but you can find proxies in the most exotic locations.
High subnet diversity – another natural benefit is that residential IPs rarely share a subnet. So, you don't have to worry about a whole range of your IP addresses getting blocked at once.
Easy to manage – residential proxies use backconnect servers. You receive an address that resembles a URL, it connects you to a proxy server, and the server selects an IP from the provider's proxy pool. This IP changes after a while, but your server's address remains the same. This is very convenient for use cases like web scraping.
IP rotation – backconnect servers also allow you to automatically rotate IP addresses without any effort from your end. You can simply select a rotation frequency, and the provider will do the rotation on its end.
Potentially slower – residential proxies add another element to the connection chain, which is the residential end-point (the actual computer at someone's home or business). What's more, you're never sure if the end-user has good internet. All else being equal, these proxies tend to be slower than data center IPs.
The connection can be patchy – the end-user can disconnect at any time, and your connection will be lost. So, even if a provider lets you keep the same IP for 10 or even 30 minutes, it has no way to ensure that you'll actually be able to do so.
Only shared IPs – backconnect servers give all users access to the same pool, meaning that you'll have to share IPs with others.
They cost a lot more – because they're harder to get and maintain than data center proxies, residential IPs cost more. They also tend to have a different pricing model than data center proxies: charged per traffic volume and not per individual IP.
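Because a residential exit node can vanish mid-request, scraping code that runs through a backconnect gateway usually needs retry logic. The sketch below is one simple way to do it, under the assumption that a retry through the same gateway address is likely to land on a different exit IP (this depends on the provider's rotation settings). The `fetch` callable is a stand-in for whatever HTTP client call you use, and all names are illustrative.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(fetch: Callable[[str], bytes], url: str,
                       max_attempts: int = 3, backoff: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bytes:
    """Call fetch(url), retrying with a growing pause on network-level errors."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:  # covers ConnectionError, TimeoutError, URLError
            last_error = exc
            if attempt < max_attempts:
                sleep(backoff * attempt)  # linear backoff: 0.5 s, 1.0 s, ...
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

With the standard library, `fetch` could be something like `lambda u: opener.open(u, timeout=10).read()`, where `opener` is configured to use the gateway address your provider gave you.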
Where do residential proxies shine?
As a rule of thumb, residential proxies are better at gathering data from domains with stricter thresholds towards bots and at tasks that require location-specific IPs. Think highly targeted merchant or marketplace websites, where retrieving data at scale is more difficult and the content might change based on location (country/city-level)...or a use case such as ad verification.
The bottom line: there is no right or wrong kind of proxy. It all depends on your use case, the target websites/domains you intend to retrieve the data from, and of course, your budget.
A data center proxy pool with enough variety and advanced proxy management logic might very well provide the same level of access to difficult sites as residential proxies, but at a fraction of the cost. Yet, in some specific cases, residential is a must, and if the data being retrieved is valuable enough, then, well...it's a no-brainer.
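To put rough numbers on the budget question, the sketch below compares the two pricing models described earlier: per IP for data center proxies and per traffic volume for residential ones. Only the $2 per private IP figure comes from this article; the pool size, page weight, and per-GB price are made-up inputs that you should replace with real quotes.

```python
def per_ip_cost(num_ips: int, price_per_ip: float) -> float:
    """Cost of a plan billed per IP address, regardless of traffic."""
    return num_ips * price_per_ip


def per_traffic_cost(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """Cost of a plan billed per gigabyte transferred (1 GB = 1,000,000 KB)."""
    gigabytes = pages * avg_page_kb / 1_000_000
    return gigabytes * price_per_gb


# Hypothetical workload: 1,000,000 pages at an assumed 200 KB per page.
datacenter = per_ip_cost(num_ips=100, price_per_ip=2.0)        # 100 private IPs
residential = per_traffic_cost(pages=1_000_000, avg_page_kb=200,
                               price_per_gb=10.0)              # assumed $10/GB
```

Under these assumptions the data center pool costs $200 and the residential traffic costs $2,000, which is the kind of gap that makes proxy management effort worth weighing against the price of residential IPs.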
1/3 Marketer, 1/3 Ops, 1/3 Techie.
Currently Demand Generation Manager at Zyte.
Data analytics, knowledge graph enthusiast with a particular taste for its applications in financial services, cybersecurity, law enforcement & intelligence sectors.