Why your API responses look like gibberish: the gzip decompression trap
The script was working. Requests were going out, responses were coming back with HTTP 200. But the response body was unreadable noise, a wall of binary characters that crashed the JSON parser and reported "no data found." No error code, no timeout, no network failure; just garbage where structured data should be.
The culprit was gzip compression. Specifically, the mismatch between what the HTTP client promised it could handle and what it actually did with the compressed bytes it received.
This is a common trap in web scraping and API clients, and it tends to waste an hour because nothing looks obviously wrong. Here is what is happening, why Python's standard library makes it worse, and how to fix it for good.
What gzip compression is and why APIs use it
gzip is a lossless compression format based on the DEFLATE algorithm. Originally built for compressing files on Unix systems, it became the web's dominant response compression method because the trade-off is excellent: a typical JSON API response compresses to 20% to 30% of its original size, with negligible CPU cost on modern hardware.
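You can see the trade-off locally with nothing but the standard library. This sketch builds a synthetic, repetitive product-listing payload (so it compresses better than the 20% to 30% typical of real-world JSON) and measures the ratio:

```python
import gzip
import json

# Synthetic JSON payload resembling a product-listing API response.
# Repetitive keys and values compress extremely well.
records = [
    {"id": i, "name": f"Product {i}", "price": 19.99, "in_stock": True}
    for i in range(500)
]
raw = json.dumps(records).encode("utf-8")
compressed = gzip.compress(raw)

ratio = len(compressed) / len(raw)
print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes, ratio: {ratio:.0%}")
```

Real API responses are less repetitive than this synthetic example, but the direction of the result is the same: structured JSON is highly compressible.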
For web scraping workloads where you are fetching dozens or hundreds of pages, that bandwidth reduction is meaningful. Compressed responses arrive faster, consume less egress on the server, and allow more concurrent connections to run without hitting network limits. In one real-world parallel-fetching scenario, keeping gzip enabled cut total wall-clock time by roughly 60% compared to uncompressed sequential fetches.
How HTTP compression negotiation works
HTTP compression uses a two-header handshake:
Accept-Encoding is sent by the client in the request. It declares which compression formats the client supports:
```http
Accept-Encoding: gzip, deflate
```

Content-Encoding is sent by the server in the response. It declares which compression format was actually applied to the response body:

```http
Content-Encoding: gzip
```

The contract is: the client advertises its capabilities, the server compresses and labels the response, and the client is responsible for decompressing before reading. The phrase "responsible for decompressing" is where things break.
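The contract can be simulated end to end without a live server. In this sketch, a plain dict stands in for real HTTP response headers: the "server" compresses the body and labels it, and the "client" honors the label before parsing:

```python
import gzip
import json

# --- server side: compress the body and label it ---
body = json.dumps({"status": "ok"}).encode("utf-8")
response_headers = {"Content-Encoding": "gzip"}  # the server's label
response_body = gzip.compress(body)              # the compressed payload

# --- client side: honor the label before parsing ---
if response_headers.get("Content-Encoding") == "gzip":
    response_body = gzip.decompress(response_body)

data = json.loads(response_body)
print(data)  # {'status': 'ok'}
```

Skip the client-side step and json.loads receives the compressed bytes instead, which is exactly the failure described above.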
The urllib problem
Most HTTP clients abstract away this responsibility. curl --compressed handles decompression transparently. Python's requests library decompresses automatically. You never see the compressed bytes.
Python's urllib, however, is lower-level. When you manually set an Accept-Encoding header in a urllib.request call, you are signaling to the library: "I know what I am doing; give me the raw bytes."
And it does exactly that. It sends your header, receives the compressed response, and hands you the compressed binary blob without touching it. The Content-Encoding: gzip header is right there in the response, but urllib never acts on it: the library does not decompress response bodies at all. When you do not send Accept-Encoding, servers simply reply uncompressed, which is why the problem only surfaces once you add the header yourself.
The result: your JSON parser receives data starting with the gzip magic bytes \x1f\x8b instead of the { it expects. It fails. You see gibberish, or a json.JSONDecodeError, or a silent "no data found" if your error handling swallows the exception.
This is not a urllib bug; it is intentional behavior. The library assumes that if you set the header yourself, you own the decompression step. The problem is that many scrapers copy request headers from curl or browser dev tools, which include Accept-Encoding: gzip, deflate by default, without realizing they have opted into manual decompression handling.
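The symptom is easy to reproduce locally without any network call. Compressing a small JSON document stands in for what urllib hands you when the server gzips the response:

```python
import gzip
import json

# What urllib hands you when the server compresses the response:
payload = gzip.compress(b'{"price": 42}')

print(payload[:2] == b"\x1f\x8b")  # True: gzip magic bytes, not the expected '{'

try:
    json.loads(payload)
    parsed = True
except ValueError:  # UnicodeDecodeError and JSONDecodeError both subclass ValueError
    parsed = False

print(parsed)  # False: the parser cannot read compressed bytes
```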
Why this happens with web scraping APIs
Zyte API is standards-compliant. When your client sends Accept-Encoding: gzip, deflate, Zyte API returns compressed responses as it should. The data is there, fully extracted and structured, just wrapped in gzip. The API is doing nothing wrong. The issue is entirely in the client-side handling.
This is not specific to Zyte API. Any well-implemented HTTP API or web server that supports compression will exhibit this behavior. The same trap appears when scraping any site that enables gzip, calling any REST API that respects Accept-Encoding, or consuming any streaming response from a CDN.
Detecting gzip compression reliably
gzip data always begins with the two-byte sequence 0x1f 0x8b. This magic number gives you a format-level check that is more reliable than parsing the Content-Encoding header, because some servers compress the body but omit or misconfigure the header.
The detection pattern is simple:
```python
import gzip

# response is the object returned by urllib.request.urlopen()
raw_bytes = response.read()
if raw_bytes[:2] == b"\x1f\x8b":
    raw_bytes = gzip.decompress(raw_bytes)
body = raw_bytes.decode("utf-8", errors="replace")
```

Both gzip and zlib are part of Python's standard library, so no additional dependencies are needed.
The complete fix with Zyte API
Here is a minimal, working example of a Zyte API call with proper compression handling:
```python
import base64
import gzip
import json
import urllib.request


def fetch_from_zyte(url: str, api_key: str) -> dict:
    auth_string = base64.b64encode(f"{api_key}:".encode()).decode()

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Basic {auth_string}",
        "Accept-Encoding": "gzip, deflate",
    }

    payload = json.dumps({"url": url, "product": True}).encode()

    req = urllib.request.Request(
        "https://api.zyte.com/v1/extract",
        data=payload,
        headers=headers,
        method="POST",
    )

    with urllib.request.urlopen(req) as resp:
        raw_bytes = resp.read()

    # Detect and decompress gzip
    if raw_bytes[:2] == b"\x1f\x8b":
        raw_bytes = gzip.decompress(raw_bytes)

    return json.loads(raw_bytes.decode("utf-8", errors="replace"))
```

Two lines added after resp.read(): that is the entire fix.
Handling deflate, too
If you want a reusable utility that covers both common encodings:
```python
import gzip
import zlib


def decompress_response(raw_bytes: bytes) -> bytes:
    # gzip: magic number 0x1f 0x8b
    if raw_bytes[:2] == b"\x1f\x8b":
        return gzip.decompress(raw_bytes)

    # zlib/deflate: common header byte 0x78
    if raw_bytes[:1] == b"\x78":
        return zlib.decompress(raw_bytes)

    return raw_bytes
```

Call this on any raw response body and it returns decompressed bytes, or the original bytes unchanged if no compression is detected. One caveat: some servers send raw deflate streams with no zlib header at all; those will not match the 0x78 check, and decompressing them requires zlib.decompress(data, -zlib.MAX_WBITS).
When to use requests instead
If you are using the requests library, this problem does not arise. Decompression is handled transparently:
```python
import requests

response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(api_key, ""),
    json={"url": url, "product": True},
)

data = response.json()  # already decompressed
```

The case for urllib is zero external dependencies: useful when you are packaging a lightweight script or running in an environment where you cannot install packages. The case for requests is that it handles this (and many other edge cases) for you. Choose based on your constraints, but if you go the urllib route, keep the two-line decompression check in mind.
The diagnostic checklist
If your scraper or API client is returning what looks like binary garbage:
- Check whether your response starts with the bytes \x1f\x8b; that is gzip-compressed data
- Check whether you are manually setting Accept-Encoding in a low-level HTTP client
- Check the response's Content-Encoding header: gzip confirms what happened
- Add the two-line magic-byte check and gzip.decompress() call
- Do not remove Accept-Encoding from your headers; keep compression enabled for the bandwidth savings
The issue surfaces in any language or framework where you are working close to the HTTP layer: Go's net/http when you set the Accept-Encoding header yourself (which turns off its transparent gzip decompression), Rust's reqwest with automatic decompression disabled, Node.js's http module, which never decompresses for you. The diagnostic is always the same: check the first two bytes.
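The checklist above can be folded into a tiny sniffing helper. sniff_body is a hypothetical name introduced here for illustration; it guesses what a raw response body contains from its leading bytes:

```python
import gzip
import zlib


def sniff_body(raw: bytes) -> str:
    """Guess what a raw HTTP body contains from its first bytes (illustrative helper)."""
    if raw[:2] == b"\x1f\x8b":
        return "gzip"
    if raw[:1] == b"\x78":
        return "zlib/deflate"
    if raw[:1] in (b"{", b"["):
        return "json"
    return "unknown"


print(sniff_body(gzip.compress(b"{}")))  # gzip
print(sniff_body(zlib.compress(b"{}")))  # zlib/deflate
print(sniff_body(b'{"ok": true}'))       # json
```

Logging the sniff result on parse failure turns an hour of head-scratching into a one-line diagnosis.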
Summary
gzip compression cuts HTTP response sizes by 70% to 80%, which makes it worth keeping enabled in any high-volume scraping workload.
The catch is that low-level HTTP clients like Python's urllib hand you the raw compressed bytes when you set Accept-Encoding yourself, and do not decompress automatically.
The fix is to check for the gzip magic number after reading the response body and decompress with gzip.decompress() when it is present.
Two lines of code, no extra dependencies, and your responses go from unreadable noise back to clean, parseable JSON.
Learn more: Zyte API documentation | Zyte API automatic extraction