How to parse HTML tables into structured data (CSV/Excel)

You found the data you need. It's sitting right there in a neat HTML table. The problem is, it's stuck in a webpage — and you need it in a spreadsheet.

Most developers assume this is a five-minute job. Then they open the page source, see a tangle of <table>, <thead>, <tbody>, <tr>, and <td> tags — some with colspan attributes, some with nested tables inside them — and suddenly the five minutes becomes an afternoon.

In this guide, you'll learn three things: how HTML tables are actually structured (so the parsing makes sense), how to extract clean tabular data using Python, and how to export it to CSV or Excel without losing formatting or running into encoding nightmares.

Why HTML tables are harder to parse than they look

Here is a question worth answering before you write a single line of code: why do so many developers get tripped up by HTML table parsing when the data looks so simple on screen?

The answer is that what you see in the browser and what's in the markup are often very different things. Browsers do a lot of silent repair work — closing unclosed tags, inferring missing elements, normalising malformed attributes. Your parser does not.

A table that renders beautifully in Chrome might have no <tbody> element at all (browsers add it implicitly), or it might have colspan and rowspan attributes that merge cells visually but leave gaps in the underlying data structure.

There are three common structural issues you'll encounter:

Merged cells. When a cell spans multiple columns or rows using colspan or rowspan, the raw HTML contains fewer cells than the rendered table suggests. A naive row-by-row parser will produce rows of unequal length, and your resulting CSV will be misaligned.

Nested tables. Some websites use tables for layout purposes — there may be a table inside a table cell. You need to make sure you're targeting the right table and not accidentally extracting layout scaffolding alongside your data.

Missing or inconsistent headers. Not every HTML table uses a <thead> element. Some use the first <tr> of the <tbody> as a header row. Some have no headers at all. Your code needs to handle all three.

Understanding these issues upfront means you can choose the right tool for the job, rather than debugging cryptic misalignment errors halfway through.
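To see the first issue concretely, here is a minimal sketch using Beautiful Soup on a made-up table: the header cell spans two columns, so the markup's first row contains fewer cells than the rendered table shows.

```python
from bs4 import BeautifulSoup

# A made-up table: the header cell spans two columns, so the raw
# markup's first row has fewer cells than the browser renders.
html = """
<table>
  <tr><th colspan="2">Period</th><th>Total</th></tr>
  <tr><td>Q1</td><td>Q2</td><td>100</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
for tr in soup.find_all("tr"):
    cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
    print(len(cells), cells)
# 2 ['Period', 'Total']
# 3 ['Q1', 'Q2', '100']
```

A naive row-by-row parser would emit these two rows as-is, producing a misaligned CSV with two columns in one row and three in the next.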

Setting up your environment

Before writing any parsing code, install the two libraries you'll need:

pip install requests beautifulsoup4 pandas openpyxl lxml

A quick note on what each does:

  • requests fetches the raw HTML from a URL
  • beautifulsoup4 parses the HTML into a navigable tree
  • pandas gives you a DataFrame, which handles table manipulation and export
  • openpyxl is the engine pandas uses to write .xlsx files
  • lxml is a fast HTML parser that Beautiful Soup can use under the hood

If you're working with a page that requires JavaScript to render its tables, requests alone won't be enough — we'll cover that case later.

The simplest approach: pandas read_html

For straightforward tables on static pages, pandas has a built-in function that does most of the work for you.

import pandas as pd

url = "https://example.com/data-table"
tables = pd.read_html(url)

# read_html returns a list of all tables found on the page
print(f"Found {len(tables)} table(s)")

# Inspect the first one
print(tables[0].head())

pd.read_html() returns a list of DataFrames — one for each <table> element it finds on the page. If the page has three tables, you'll get a list of three DataFrames.
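Rather than guessing list indices, you can let read_html filter for you: its match parameter keeps only tables whose text matches a regex. A self-contained sketch on made-up HTML:

```python
from io import StringIO
import pandas as pd

# Two made-up tables; only the first mentions "Revenue"
html = """
<table><tr><th>City</th><th>Revenue</th></tr>
       <tr><td>Oslo</td><td>10</td></tr></table>
<table><tr><th>Link</th></tr><tr><td>Home</td></tr></table>
"""
# match= filters tables by a regex applied to their text content
tables = pd.read_html(StringIO(html), match="Revenue")
print(len(tables))  # 1
```

This is often enough to skip navigation or layout tables without writing any Beautiful Soup code.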

To export the one you want to CSV:

tables[0].to_csv("output.csv", index=False)

To export to Excel:

tables[0].to_excel("output.xlsx", index=False)

This approach works well when the table is simple and the page is static. When it doesn't work — because the table is dynamically rendered, you need more control over which table to target, or the structure is malformed — you'll need to go a level deeper.

Targeting a specific table with Beautiful Soup

pd.read_html() is convenient but not always precise. If the page has multiple tables and you need a specific one, Beautiful Soup lets you target it by its attributes.

import requests
from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd

url = "https://example.com/data-table"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Target a table by its ID
table = soup.find("table", {"id": "results-table"})

# Or by a class name
table = soup.find("table", {"class": "data-grid"})

# Parse it with pandas (wrapped in StringIO, since newer pandas
# versions deprecate passing literal HTML strings to read_html)
df = pd.read_html(StringIO(str(table)))[0]
df.to_csv("output.csv", index=False)

By converting the specific <table> element back to a string and passing it to pd.read_html(), you get the precision of Beautiful Soup with the convenience of pandas.
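If neither an id nor a class uniquely identifies the table, one fallback is to inspect each table's headers and pick the one containing a column you expect. A sketch on made-up HTML (the header names are hypothetical):

```python
from bs4 import BeautifulSoup

# Stand-in for response.text: a layout table followed by the data table
html = """
<table><tr><th>Nav</th></tr></table>
<table><tr><th>Product</th><th>Price</th></tr>
<tr><td>Widget</td><td>9.99</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Pick the first table whose headers include "Price"
table = next(
    t for t in soup.find_all("table")
    if "Price" in [th.get_text(strip=True) for th in t.find_all("th")]
)
print(table.find("td").get_text())  # Widget
```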

Parsing tables manually when the structure is complex

Sometimes you need full control — for example, when a table has colspan or rowspan attributes that pandas doesn't handle well by default, or when you want to map each cell to a specific header.

Here is a function that handles merged cells correctly by tracking which cells have already been "filled" by a spanning cell:

import requests
from bs4 import BeautifulSoup
import csv

def parse_table_with_spans(table):
    rows = []
    # Maps (row, col) -> text for positions covered by a rowspan cell
    occupied = {}

    for row_idx, tr in enumerate(table.find_all("tr")):
        row = []
        col_idx = 0

        for cell in tr.find_all(["td", "th"]):
            # Consume columns already occupied by a cell spanning down
            # from an earlier row (membership test, so empty-text
            # cells are handled correctly too)
            while (row_idx, col_idx) in occupied:
                row.append(occupied.pop((row_idx, col_idx)))
                col_idx += 1

            cell_text = cell.get_text(strip=True)
            colspan = int(cell.get("colspan", 1))
            rowspan = int(cell.get("rowspan", 1))

            # Repeat the value into every column the cell spans in
            # this row, so row lengths stay equal
            for _ in range(colspan):
                row.append(cell_text)

            # Reserve the positions the cell covers in later rows
            for r in range(1, rowspan):
                for c in range(colspan):
                    occupied[(row_idx + r, col_idx + c)] = cell_text

            col_idx += colspan

        # Consume spanning cells that fall after this row's last cell
        while (row_idx, col_idx) in occupied:
            row.append(occupied.pop((row_idx, col_idx)))
            col_idx += 1

        rows.append(row)

    return rows

url = "https://example.com/complex-table"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
table = soup.find("table")

rows = parse_table_with_spans(table)

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

This approach fills in the gaps left by spanning cells, so every row in your output has the same number of columns.

Handling pages that require JavaScript

If the table you need is rendered dynamically — meaning it doesn't appear in the raw HTML response but only after JavaScript executes — then requests will return an empty or incomplete page.

You have two options here.

The first is to check whether the data comes from an API. Open your browser's developer tools, go to the Network tab, filter by XHR or Fetch requests, and reload the page. If the table data is loaded via an API call, you may be able to hit that endpoint directly with requests and get the data as JSON — which is even easier to work with than HTML.
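If you do find such an endpoint, the JSON usually maps straight into a DataFrame. The payload shape below is hypothetical — inspect the real response in the Network tab to see the actual structure:

```python
import pandas as pd

# Hypothetical payload shape; real APIs vary, so check the Network tab
payload = {"rows": [{"name": "Widget", "price": 9.99},
                    {"name": "Gadget", "price": 14.50}]}

# json_normalize flattens a list of dicts into tabular form
df = pd.json_normalize(payload["rows"])
df.to_csv("output.csv", index=False)
print(df.shape)  # (2, 2)
```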

The second option is to use a browser automation tool like Playwright or Selenium to render the page before parsing it:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/dynamic-table")
    page.wait_for_selector("table")  # Wait until the table is in the DOM
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "lxml")
table = soup.find("table")
df = pd.read_html(StringIO(str(table)))[0]
df.to_csv("output.csv", index=False)

For sites with robust anti-bot protection, Playwright alone may not be enough. Zyte API handles browser rendering and unblocking as part of the same request, removing the need to manage a browser instance yourself.

Exporting cleanly to CSV and Excel

Once you have your DataFrame, a few export settings are worth knowing.

For CSV:

df.to_csv(
    "output.csv",
    index=False,           # Don't write row numbers
    encoding="utf-8-sig",  # utf-8-sig adds a BOM so Excel opens it correctly
)

The utf-8-sig encoding is worth noting: plain utf-8 CSVs often display garbled characters when opened in Excel on Windows because Excel expects a byte order mark (BOM) to know the file is UTF-8. Using utf-8-sig adds that marker automatically.
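You can see the difference directly: utf-8-sig simply prepends the three BOM bytes to an otherwise identical UTF-8 encoding.

```python
text = "Ærø,100"
print(text.encode("utf-8")[:3])      # b'\xc3\x86r'
print(text.encode("utf-8-sig")[:3])  # b'\xef\xbb\xbf' — the BOM
```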

For Excel:

df.to_excel(
    "output.xlsx",
    index=False,
    sheet_name="Data",  # Name the sheet something meaningful
)

If you're writing multiple tables to the same Excel file, use an ExcelWriter object:

with pd.ExcelWriter("output.xlsx") as writer:
    tables[0].to_excel(writer, sheet_name="Table 1", index=False)
    tables[1].to_excel(writer, sheet_name="Table 2", index=False)

Handling common problems

Duplicate column names. If a table has two columns with the same header, pandas will automatically append suffixes (.1, .2) to disambiguate. You can rename columns afterwards with df.columns = ["col_a", "col_b", ...].

Mixed data types in columns. HTML tables store everything as strings. A column of numbers will be read as object type in pandas. Use pd.to_numeric(df["column"], errors="coerce") to convert, which will turn any non-numeric values into NaN rather than raising an error.
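For example, a scraped price column with a stray placeholder converts like this:

```python
import pandas as pd

# HTML tables arrive as strings, including non-numeric placeholders
df = pd.DataFrame({"price": ["9.99", "14.50", "N/A"]})

# errors="coerce" turns "N/A" into NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].tolist())  # [9.99, 14.5, nan]
```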

Empty rows or header rows in the body. Some tables repeat header rows partway through (common in long tables that span multiple printed pages). You can drop rows that duplicate your header with:

df = df[df["Column Name"] != "Column Name"]

Rate limiting and bot detection. If you're scraping multiple pages of tables, sending too many requests too quickly will often get you blocked. Add delays between requests with time.sleep(), rotate user agents, or use Zyte API's proxy capabilities to handle this transparently.
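A minimal sketch of spacing out requests (the helper name and delay values are illustrative, not a fixed recipe):

```python
import random
import time

def polite_get(session, url, min_delay=1.0, max_delay=3.0):
    """Fetch url, then pause for a randomised interval so consecutive
    requests aren't perfectly regular."""
    response = session.get(url)
    time.sleep(random.uniform(min_delay, max_delay))
    return response
```

Pass a requests.Session rather than calling requests.get directly, so TCP connections are reused across pages.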

Putting it all together

Here is a complete script that fetches a page, finds the right table, handles encoding, and exports to both CSV and Excel:

import requests
from bs4 import BeautifulSoup
from io import StringIO
import pandas as pd

def scrape_table(url, table_id=None, table_class=None):
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")

    if table_id:
        table = soup.find("table", {"id": table_id})
    elif table_class:
        table = soup.find("table", {"class": table_class})
    else:
        table = soup.find("table")

    if not table:
        raise ValueError("No matching table found on page")

    # StringIO avoids pandas' deprecation warning for literal HTML input
    df = pd.read_html(StringIO(str(table)))[0]
    return df

df = scrape_table(
    url="https://example.com/data",
    table_id="results-table",
)

df.to_csv("output.csv", index=False, encoding="utf-8-sig")
df.to_excel("output.xlsx", index=False, sheet_name="Data")

print(f"Exported {len(df)} rows and {len(df.columns)} columns")

What to do when the table is too complex to parse reliably

Some tables are genuinely difficult: deeply nested structures, cells that contain embedded HTML, tables that are actually images, or pages where the same data appears across dozens of paginated tables.

At that point, the question becomes whether the cost of maintaining a custom parser is worth it. Zyte API's automatic extraction can extract structured data from pages without you having to write or maintain selectors — which is particularly useful when table structures change between scraping runs or across different pages of the same site.

For most standard use cases, though, the combination of Beautiful Soup and pandas described here will get you where you need to go.


Next steps

Parsing a table is often just the first step. Once you have clean structured data, you'll likely want to store it somewhere durable, schedule the scrape to run regularly, or combine it with data from multiple pages.

If the website you're scraping uses JavaScript-heavy rendering or has anti-bot measures in place, take a look at our guides on web scraping dynamic websites with Zyte API and managing bans in web scraping. If you want to run this at scale, our guide to large-scale web scraping with Python covers the infrastructure decisions worth making before you scale up.