Web scraping is a powerful tool in a developer's toolkit — whether it's used for pricing intelligence, aggregating search data, collecting real estate listings, or something else. In this guide, we’ll start with the fundamentals, using tools like wget and simple Python scripts. From there, we’ll level up to more advanced techniques, including scalable solutions like Zyte’s web scraping API.
But before diving in, it’s crucial to understand the ethics and legality of scraping. While web scraping is not itself illegal, the manner in which you scrape, what you scrape, and what you do with the data that you scrape are key considerations in determining the ethics and legality of your project. We recommend that at the outset of all projects, you consult Zyte’s Compliant Web Scraping Checklist. In particular, when scraping or downloading files, you need to ensure that you comply with intellectual property and copyright laws. Responsible scraping isn’t just good practice — it’s essential for keeping your projects sustainable and trustworthy.
What is wget?
wget (pronounced "web-get") is a free command-line utility for downloading files from the web. It can run in the background without user intervention, which makes it well suited to automation.
Wget supports HTTP, HTTPS, FTP, and proxies. Its simplicity, efficiency, and support for recursive, resumable downloads make it well suited to scraping. It can follow links on HTML pages and download entire webpages, retaining the directory structure for offline use, and it automatically respects the Robot Exclusion Standard (robots.txt) when crawling.
Wget also works well on unreliable connections: if the network fails, it retries and resumes the download from where it left off, provided the server supports it. That makes it a good choice for downloading large or slow files.
How to install wget
Wget is usually already installed on Linux. If not, you can install it with your package manager (for example, on Debian/Ubuntu, use "sudo apt install wget," and on CentOS/RHEL, use "sudo yum install wget"). On a Mac, install Homebrew first if you don't already have it, then run brew install wget to get the most recent version. On Windows, wget isn't built in, but you can use a Windows package manager such as Winget or Chocolatey.
For example, type "winget install GnuWin32.Wget" or "choco install wget" at a command line. You can also download wget.exe for Windows and put it in a location that is on your PATH, such as C:\Windows\System32\. Once it is installed, you can verify it by running wget --version in your terminal.
Basic wget Commands
It's easy to use wget. The basic syntax is:
wget [options] [URL]
Here are some popular commands and settings for scraping:
Download a single file:
Just provide a URL:
wget https://example.com/data/file.csv
Saves file.csv to the current directory with its original name by default.
Displays download progress in the terminal (use -q for quiet mode).
Rename the downloaded file:
Use the -O (output document) option:
wget -O latest_data.csv https://example.com/data/file.csv
Downloads the content and saves it as latest_data.csv, regardless of the original name.
Recursive download (entire sites or multiple files):
Use the -r option:
wget -r https://example.com/photos/
Downloads all files and pages in the /photos/ directory.
Follows robots.txt by default.
Defaults to a recursion depth of 5 levels (can be adjusted).
To control recursion:
Limit depth to 3:
wget -r -l 3 https://example.com/photos/
Stay in the initial directory:
wget -r --no-parent https://example.com/photos/
For offline viewing, add:
-p to download page requisites (graphics, CSS).
-k to convert links for local use.
User-agent and custom headers:
Some servers block unrecognized user agents.
wget identifies itself as Wget/<version> by default.
To look like a browser:
wget --user-agent="Mozilla/5.0" https://example.com
Send custom headers:
wget --header "Name: Value" https://example.com
Rate limiting and delays:
Add delay between requests:
wget -r --wait=2 --random-wait https://example.com/large-directory/
--wait=2: wait 2 seconds.
--random-wait: adds unpredictability to delay.
Limit download speed:
wget --limit-rate=100k -r https://example.com/large-directory/
Limits bandwidth to 100 KB/s.
Helps avoid server overload or network congestion.
Resume interrupted downloads:
Use the -c (continue) option:
wget -c https://example.com/bigfile.zip
Resumes the download from where it left off.
These are only a few examples; wget provides many more options for:
Controlling recursion depth
Accepting or rejecting specific file types (-A or -R patterns)
Logging output
Spanning domains
And more
It's a powerful tool in its own right.
Using Python for Web Scraping
Wget is great for downloading files directly, but Python gives you far more flexibility for web scraping. With Python, you can fetch web pages, parse HTML content, choose which links or data to download, and fold scraping into larger data pipelines. Python has excellent modules for making HTTP requests and parsing data, which is why many developers use it for scraping tasks. Some important modules and libraries are:
requests: for easily sending HTTP requests (GET, POST, etc.) and handling the responses.
urllib is a component of Python's standard library that lets you work with URLs and downloads at a lower level (for example, urllib.request.urlretrieve). However, requests is usually easier to use.
BeautifulSoup, which comes from the bs4 package, is an HTML/XML parser that makes it easy to go through and search the parse tree. For example, you can use it to locate all the links on a page or scrape content from certain tags.
os and os.path let you work with files and paths and create folders to keep your downloads organized.
Instead of downloading everything, you can write Python code that selects exactly what to scrape or download. Let's look at an example workflow: using requests and BeautifulSoup to get files from a page.
Using Python requests to download files
To download a file with Python's requests module, call requests.get(url); the returned response object gives you access to the content. Here is a simple example that downloads a single file and handles errors:
import requests

url = "https://example.com/files/report.pdf"

try:
    res = requests.get(url)
    res.raise_for_status()  # Raises an HTTPError if the status is 4xx/5xx
    with open("report.pdf", "wb") as f:
        f.write(res.content)  # Write response content (bytes) to a local file
    print("Download complete:", "report.pdf")
except requests.exceptions.HTTPError as http_err:
    print(f"HTTP error: {http_err}")
except Exception as err:
    print(f"Error: {err}")
In this snippet, requests.get fetches the file. We call raise_for_status() to check that the download worked (status code 200 OK); if not, we catch the HTTPError. If it did, we open a file on our computer in binary write mode and write res.content to it. This method loads the entire file into memory before writing. For big files, you can stream the response in chunks instead: call res = requests.get(url, stream=True) and then use res.iter_content(chunk_size=8192) to write chunks to a file, so you don't use too much RAM. The requests library follows redirects by default (up to a limit), so you don't normally need to do anything special for HTTP 301/302. If you do need to check for a redirect, you can inspect res.history.
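Here is a minimal sketch of that streaming approach; the URL, filename, and chunk size are only illustrative:
import requests

# Hypothetical URL used only for illustration
url = "https://example.com/files/large-dataset.zip"

# stream=True defers downloading the body until we iterate over it
with requests.get(url, stream=True) as res:
    res.raise_for_status()
    with open("large-dataset.zip", "wb") as f:
        # Write the file to disk in 8 KB chunks instead of holding it all in memory
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)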
Finding Links by Parsing Web Pages
One of the best things about Python is that it can parse HTML and find exactly what you need, such as all the download links on a page. This is easy with BeautifulSoup. Say you have a webpage with many links to resources like PDFs and photos, and you want to download all of the PDFs linked on that page. You can:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "https://example.com/resources.html"
res = requests.get(page_url)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")
pdf_links = []
for tag in soup.find_all("a", href=True):
    href = tag['href']
    if href.lower().endswith(".pdf"):
        # If the href is relative (e.g. "/files/doc.pdf"), combine it with
        # the page URL to make it absolute; absolute URLs pass through unchanged
        href = urljoin(page_url, href)
        pdf_links.append(href)

print(f"Found {len(pdf_links)} PDFs:", pdf_links)
We fetched the page, parsed it, and looked for all the anchor tags with an href attribute. We checked whether each link ends with ".pdf" (a simple filter for PDF files), then used urllib.parse.urljoin to turn relative URLs into absolute ones (absolute URLs pass through unchanged). At the end, pdf_links holds all the direct links to PDF files on that page.
You could also filter for other file types (like .zip, .csv, images like .png or .jpg, etc.), or use more complicated logic (for example, find links under a certain section of the page by CSS class or DOM structure).
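As a rough sketch, and continuing from the soup and page_url of the example above, you could restrict the search to a hypothetical "downloads" container and a different file type (the CSS class and the .zip filter are assumptions for illustration, not part of the page used earlier):
from urllib.parse import urljoin

# Only collect .zip links inside a container with class "downloads" (hypothetical class name)
zip_links = [
    urljoin(page_url, a["href"])
    for a in soup.select("div.downloads a[href]")
    if a["href"].lower().endswith(".zip")
]
print(f"Found {len(zip_links)} ZIP files:", zip_links)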
You can batch download a list of file URLs by using requests in a loop (the same way as above). Just be careful not to overload the server; if you're downloading a lot of files, add a short delay (time.sleep() for one or two seconds between downloads) so that you don't send too many requests at once.
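A minimal sketch of such a loop, reusing the pdf_links list built above (the two-second pause is just a polite default, not a fixed rule):
import os
import time

import requests

# pdf_links comes from the parsing example above
for link in pdf_links:
    filename = os.path.basename(link)  # derive a local filename from the URL
    res = requests.get(link)
    res.raise_for_status()
    with open(filename, "wb") as f:
        f.write(res.content)
    print("Saved", filename)
    time.sleep(2)  # pause between requests so we don't overload the server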
Combining Python with wget
You don't have to choose between Python and wget; you can often use them together to get the best of both worlds. Python can handle the logic, parsing, and decision-making, and then call wget for heavy-duty downloading. This is helpful if you want to use wget's features (like auto-resume, recursive downloading, or robust retrying) in a Python script.
You can use the subprocess module (or os.system, which is simpler but offers less control) to call wget from Python. For example:
import subprocess

url = "https://example.com/big-dataset.zip"
result = subprocess.run(["wget", "-c", url])

if result.returncode == 0:
    print("Download success")
else:
    print("wget returned an error:", result.returncode)
This code runs the wget command just as if you had typed it in a shell. The -c option resumes the download if a partial file already exists; otherwise it starts a fresh download. It's usually better to use subprocess.run (or subprocess.call) than os.system because you get more control, but both can run external commands. If you want to log or parse the output from the subprocess, you can also capture stdout and stderr.
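For example, here is a small sketch that captures wget's output for logging (wget writes its progress and error messages to stderr):
import subprocess

url = "https://example.com/big-dataset.zip"

# capture_output=True collects stdout and stderr; text=True decodes them to strings
result = subprocess.run(["wget", "-c", url], capture_output=True, text=True)

if result.returncode != 0:
    print("wget failed with code", result.returncode)
    print(result.stderr)  # wget reports errors on stderr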
This approach lets Python decide which URLs to download, or build lists of URLs that change over time, while wget handles each download quickly and reliably. For instance, you could use BeautifulSoup to scrape a page for 100 image links and then loop over them in Python, calling wget on each URL (possibly even in parallel with threads).
You could also use Python's logic to process data before or after wget. For example, Python could read a list of URLs from a CSV file or an API and then call wget to get the content for each one. Or you could have Python call wget -r on a site, and then use Python to sort or analyze the files it downloaded. In short, using Python with wget gives you a lot of options, and it's common to use this pattern to automate batch downloads.
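A rough sketch of that pattern, reusing the pdf_links list from the parsing example (the flags and the one-second delay are illustrative choices):
import subprocess
import time

# pdf_links was built with BeautifulSoup in the earlier example
for link in pdf_links:
    # -c resumes partial downloads; -q keeps wget quiet inside the script
    result = subprocess.run(["wget", "-c", "-q", link])
    if result.returncode != 0:
        print("Failed to download:", link)
    time.sleep(1)  # small delay between downloads to stay polite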
Conclusion
When you need to download a lot of files quickly without inspecting their content in detail, wget is the best choice. It's simple and battle-tested. Python, on the other hand, is great when you need fine-grained control, like choosing which links to follow, parsing data from pages, interacting with web forms or APIs, and integrating with other data processing. In many cases, the best approach is to use Python and wget together: Python can handle logging in, submitting forms, parsing HTML, and so on, and wget can handle downloading large files or many files reliably.
As your scraping needs grow, managing your own scripts and infrastructure can become a real challenge. That’s where Zyte’s tools come in to help you streamline and scale your scraping projects beyond what basic scripts can do. For example, instead of managing proxy lists and solving IP bans yourself, you could use Zyte API to automatically rotate proxies, detect bans and geolocate your requests.
If your project outgrows a simple script, you might consider using a framework like Scrapy – an open-source Python scraping framework that provides a robust structure for large crawls. Scrapy makes it easier to manage parsing logic, throttling and data export, and it can be deployed on Zyte’s Scrapy Cloud for scaling. For JavaScript-heavy sites, Zyte API includes a browser automation API (essentially a browser SDK). There are also specialized tools like Zyte’s Automatic Extraction (AI-driven parsing of common data types) and more that can save you a significant amount of development time.
To sum up, start with the basics: use wget for simple file downloads and Python when you need more control. As you combine them, you'll create powerful scraping pipelines. And if you ever find yourself having to reinvent the wheel for things like managing proxies, rendering browsers, or crawling large amounts of data, remember that there are solutions like Zyte API and the Zyte ecosystem that can do the heavy lifting for you so you can focus on getting the data you need in a reliable way.