

Stop using Python requests for web scraping: Use these modern modules instead


By Ayan Pahwa, Developer Advocate, Zyte



It has been developers' HTTP library of choice for years. But, when it comes to web data extraction, there are alternatives worth considering.



While the 'Requests' library remains the default choice for many Python developers due to its reliability and extensive documentation, the Python HTTP landscape has evolved considerably.



Modern alternatives now offer significant advantages, including built-in asynchronous support, HTTP/2 compatibility, enhanced performance, and up-to-date TLS handling.



This article introduces and compares three such contemporary clients: HTTPX, curl_cffi, and rnet, detailing their unique features and practical applications.


The problem with Requests for web scraping


Before going further, it's worth being fair to Requests: for simple API interactions with well-behaved endpoints, it remains the de facto standard.



However, a major drawback of the Requests library for web scraping is its predictable HTTP client fingerprint. This fingerprint, a combination of TLS version, cipher suites, HTTP headers, and connection characteristics, accompanies every request and is well known and cataloged by anti-bot systems.



Consequently, if you're interacting with any endpoint, including APIs or services protected by anti-bot vendors, your request can be blocked purely on the basis of how the Requests library identifies itself. This happens before your credentials or payload are even scrutinized, a significant limitation when targeting systems that perform client-side validation.
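One of those identifying signals is visible without any network call: Requests ships a default User-Agent header that names the library and its version. A quick check:

```python
import requests

# The default User-Agent openly identifies the client library and
# its version, one of several signals (alongside the TLS handshake)
# that anti-bot systems can match against.
session = requests.Session()
print(session.headers["User-Agent"])  # e.g. "python-requests/2.32.3"
```

The TLS-level portion of the fingerprint is harder to inspect, but it is equally static: it's fixed by the ssl/urllib3 stack that Requests is built on, and you can't change it from the Requests API.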



In addition to fingerprinting, a major limitation of the Requests library is its lack of native asynchronous support. This is particularly problematic for workloads involving numerous HTTP requests: without async, the calls execute sequentially, and the program's thread remains blocked for the entire duration of each individual request.
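To see what that blocking costs, here is a stdlib-only sketch that simulates three 0.2-second requests with a sleep (no real HTTP involved). Run sequentially, as Requests would, they'd take roughly 0.6 seconds; overlapped with asyncio.gather, the pattern an async client enables for real calls, the waits run concurrently:

```python
import asyncio
import time

# Stand-in for a network call; with Requests, each of these would
# block the thread for its full 0.2s duration.
async def fake_request() -> None:
    await asyncio.sleep(0.2)

async def main() -> float:
    start = time.perf_counter()
    # Three simulated requests in flight at once.
    await asyncio.gather(*[fake_request() for _ in range(3)])
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.2f}s for 3 concurrent calls")  # ~0.2s instead of ~0.6s
```

The same shape, awaiting a gather of coroutines, carries over directly to the HTTPX async example later in this article.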



For straightforward scenarios, the standard Requests API call remains perfectly functional, as this quick example demonstrates:


import requests

response = requests.get(
    "https://jsonplaceholder.typicode.com/posts/1",
    timeout=10,
)
response.raise_for_status()
data = response.json()
print(data["title"])

Clean and simple. For a one-off call to a standard REST API, this is fine. The gaps start showing when you need concurrency, HTTP/2, or when the target endpoint does any kind of client validation.


Install the alternatives

pip install httpx          or    uv add httpx
pip install curl-cffi      or    uv add curl-cffi
pip install rnet           or    uv add rnet

1. HTTPX


HTTPX is the most direct upgrade from Requests as the API is nearly identical. If you know Requests, you already know most of HTTPX. What it adds is first-class async support, HTTP/2, and a more modern internal architecture.



Where it differs from Requests is the explicit use of a Client context manager (strongly recommended over module-level function calls) and the AsyncClient for async usage. This gives you connection pooling and proper resource cleanup by default.



HTTPX is the right starting point if you're looking for a migration that requires minimal code changes.


Example: Sync

import httpx

with httpx.Client(timeout=10.0) as client:
    response = client.get("https://jsonplaceholder.typicode.com/posts/1")
    response.raise_for_status()
    data = response.json()

print(data["title"])

Example: Async (calling the Zyte API)


Async is where HTTPX really earns its keep. Here it's used to fire multiple requests to the Zyte API concurrently. Each request blocks on the server side until extraction is complete, but your event loop stays free to send the others in parallel:


import os
import asyncio
import httpx

API_KEY = os.environ["ZYTE_API_KEY"]
ENDPOINT = "https://api.zyte.com/v1/extract"

urls = [
    "https://example.com",
    "https://httpbin.org",
]

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    response = await client.post(
        ENDPOINT,
        json={"url": url, "browserHtml": True},
        auth=(API_KEY, ""),
    )
    response.raise_for_status()
    return response.json()

async def main():
    async with httpx.AsyncClient(timeout=60.0) as client:
        results = await asyncio.gather(*[fetch(client, url) for url in urls])
    for result in results:
        print(result["url"], "—", len(result["browserHtml"]), "chars")

asyncio.run(main())

Notes:


  • raise_for_status() raises httpx.HTTPStatusError on 4xx/5xx responses.
  • HTTP/2 support requires pip install httpx[http2] and passing http2=True to the client.
  • The 60-second timeout accounts for the Zyte API's server-side blocking behavior — it holds the connection open until extraction completes.

2. curl_cffi


curl_cffi wraps libcurl with Python bindings and adds something HTTPX doesn't have: TLS fingerprint impersonation. It can mimic the TLS handshake of Chrome, Firefox, Safari, and other browsers. For API calls hitting endpoints protected by anti-bot or similar systems, this can be the difference between getting a response and getting a 403.



The interface closely mirrors Requests, with the addition of the impersonate parameter. It supports both sync and async usage. For most API calls where fingerprinting isn't a concern, curl_cffi behaves just like Requests; the impersonate parameter is opt-in.


Example: Sync

from curl_cffi import requests

response = requests.get(
    "https://jsonplaceholder.typicode.com/posts/1",
    impersonate="chrome",
    timeout=10,
)
response.raise_for_status()
data = response.json()
print(data["title"])

Example: Async (calling the Zyte API)

import os
import asyncio
from curl_cffi.requests import AsyncSession

API_KEY = os.environ["ZYTE_API_KEY"]
ENDPOINT = "https://api.zyte.com/v1/extract"

payload = {
    "url": "https://example.com",
    "browserHtml": True,
}

async def call_zyte_api():
    async with AsyncSession(impersonate="chrome") as session:
        response = await session.post(
            ENDPOINT,
            json=payload,
            auth=(API_KEY, ""),
            timeout=60,
        )
        response.raise_for_status()
        data = response.json()
        print(data["url"], "—", len(data["browserHtml"]), "chars")

asyncio.run(call_zyte_api())

Notes:


  • impersonate="chrome" sends Chrome's TLS fingerprint on every request made through this session.
  • Other supported values include "firefox", "safari", "chrome110", and more — check the curl-cffi docs for the full list.
  • The sync interface (from curl_cffi import requests) is nearly identical to the requests module, making it the easiest drop-in if you only need sync.

3. rnet


rnet is the newest of the three. Like a lot of modern Python tooling, it's built in Rust, making it async-first and performance-oriented. Like curl_cffi, it supports TLS impersonation, but its primary differentiator is throughput: it is designed for high-concurrency workloads where you're firing many requests simultaneously.



The API surface is different from Requests, so it's not a drop-in replacement. But the patterns are clean and modern, and for async-heavy workloads it's worth the minor adjustment.


Example: Sample library code

import asyncio
from rnet import Impersonate, Client


async def main():
    # Build a client
    client = Client(impersonate=Impersonate.Firefox139)

    # Use the API you're already familiar with
    resp = await client.get("https://tls.peet.ws/api/all")
    
    # Print the response
    print(await resp.text())


if __name__ == "__main__":
    asyncio.run(main())

Notes:


  • rnet is async-first; sync support is limited.
  • Response body methods like .json() and .text() are awaitable.
  • The Rust core makes it particularly well-suited for high-throughput concurrent workloads.

Comparison table

Feature           | Requests  | HTTPX                     | curl_cffi      | rnet
Sync support      | ✅ Yes    | ✅ Yes                    | ✅ Yes         | ⚠️ Limited
Async support     | ❌ No     | ✅ Yes                    | ✅ Yes         | ✅ Yes (primary)
HTTP/2            | ❌ No     | ✅ With extra dependency  | ✅ Via libcurl | ✅ Built-in
Performance       | Baseline  | Good                      | Good–High      | High
TLS impersonation | ❌ No     | ❌ No                     | ✅ Yes         | ✅ Yes

When to use which


Use Requests for simple, one-off scripts, internal tooling, or any situation where you're hitting a cooperative API endpoint and don't need concurrency. Nothing wrong with it in that context.



Use HTTPX when you need async, want the closest migration path from Requests, or need HTTP/2. It's the safest default upgrade for most projects.



Use curl_cffi when TLS fingerprint control matters, whether that's because you're hitting an anti-bot wall, an API with strict client validation, or any service that checks how a client identifies itself at the TLS layer.



Use rnet when raw async performance is the priority. Its Rust foundation makes it the strongest choice for high-concurrency workloads where you're firing many requests simultaneously and need low overhead.



The optimal choice is determined by several factors: your concurrency requirements, the target endpoint's sensitivity to client identification, and the desired similarity between the new code and your existing requests implementation.