Today, we’re opening access to pageContent, a brand-new Zyte API automatic extraction data type that delivers the content of any web page as plain text minus the noise.
Everyone who carries out web data extraction is ultimately focused on the same end goal – obtaining the main content of a destination page. But the need to target underlying CSS or XPath selectors for that content complicates the life of every data engineer.
With pageContent, instead of targeting fragile selectors that are prone to change, you just send a URL and get back the content from the main body of the page. It’s the fastest route to clean, LLM-friendly plain text from web pages.
But pageContent returns more than a page’s main content. It also automatically returns smart navigation links and pagination links, without you having to write any parsing logic.
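To make the "send a URL, get back content" flow concrete, here is a minimal Python sketch using only the standard library. It assumes that pageContent is requested like other Zyte API automatic extraction types, by setting `"pageContent": true` in a POST to the `https://api.zyte.com/v1/extract` endpoint with your API key as the HTTP Basic auth username. The helper names (`build_request`, `fetch_page_content`) are hypothetical, and because the exact field names inside the response are not covered here, the sketch simply returns the whole `pageContent` object for you to inspect.

```python
import base64
import json
import urllib.request

# Zyte API extraction endpoint.
ZYTE_API_ENDPOINT = "https://api.zyte.com/v1/extract"


def build_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for pageContent."""
    # Assumption: pageContent is enabled with a boolean flag, like other
    # automatic extraction data types.
    payload = {"url": url, "pageContent": True}
    # HTTP Basic auth: API key as the username, empty password.
    token = base64.b64encode(f"{api_key}:".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        ZYTE_API_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_page_content(url: str, api_key: str) -> dict:
    """Send the request and return the 'pageContent' object (hypothetical helper)."""
    request = build_request(url, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Return an empty dict if the key is absent rather than guessing its shape.
    return body.get("pageContent", {})
```

Calling `fetch_page_content("https://example.com/some-article", "YOUR_API_KEY")` should, under the assumptions above, return the extracted object without a single selector being written. Check the Zyte API reference for the authoritative request and response schema before relying on specific fields.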