Introducing Zyte API Enterprise – Technology + Expertise to supercharge your in-house data extraction team
https://www.zyte.com/blog/introducing-zyte-api-enterprise/
Mon, 22 May 2023

Written by:

Iain Lennon
Chief Product Officer

Today we’re excited to announce to the Zyte and Web Scraping communities our new offering: Zyte API Enterprise.

Zyte API Enterprise combines the power and automation of Zyte API with the industry-leading expertise of Zyte’s compliance and development teams. The Technology and Expertise one-two punch will equip you with the tech you need to scale your in-house scraping, and help you achieve peace of mind as you navigate increasingly opaque scraping laws.

The shift in how companies extract data

The demand for web data has exploded over the past 36 months. Enterprises that rely on complex web data workflows have recognized the need for continuity, scalability and compliance as they navigate whether to outsource data collection or build the capability in house.

This market evolution has forced businesses that run on data to seek organizational data maturity, and, perhaps more importantly, define what data maturity is. And on the latter point, no two companies have the same answer.

We created Zyte API Enterprise for companies that do web scraping in house. More often than not, running an in-house scraping operation can be daunting in an ever-changing scraping landscape. Furthermore, collecting web data is complex, challenging and shrouded in mystery. Building a functional system is difficult enough, never mind building one that's cost-effective and has staying power.

Challenges and risks abound

As we’ve witnessed this market evolution, we asked ourselves the following:

What are today’s most common pain points for web data teams?

The answer was three-fold and pretty clear:

  1. Every scraping script you write is pointed at a moving target; it will break or get blocked at some point.
  2. Maintenance is a bottleneck that makes it hard to be nimble and scale.
  3. Scraping laws are complex and continually evolving, making them next to impossible to navigate, which can pose a critical risk to companies that run on data.

These three issues are attached to dozens of sub-issues that impact resource allotment, cost, compliance, scalability, and simply staying organized.

Enter Zyte API Enterprise

Our global leadership in web data extraction – for almost as long as web scraping has existed – has revealed the solution to the three pain points above, and, as you’ve probably gleaned so far, the answer is:

  1. The right technology.
  2. The right expertise. 

With Zyte API Enterprise, you can effectively break the painful cycle of build, break, fix, ban, and unblock, while maintaining global compliance as you scrape even the most complex websites. It’s the one-two punch that will supercharge your data scraping team.


What is Zyte API Enterprise? The Catalyst for Success with Web Data

Zyte API

A single API to solve the 'build, break, fix, ban' problem. Automate the fixing of scrapers and free up your developers’ time.

  • One powerful API to access all websites
  • Reach, reliability and simplicity built in
  • Per-site pricing that just makes sense

Don’t take our word for it…

“I’m really impressed by the Zyte API, when I first heard about it at the 2022 Extract Summit I was just curious about it but I thought it would have been difficult for it to tackle all the anti-bot around. But for the moment, the mission is accomplished! … the real score is 100/100.”

Pierluigi Vinciguerra

Cofounder of Databoutique.com, Author of The Web Scraping Club, and web scraper with 10+ years of experience.

Zyte Expertise

Developer consulting and compliance guidelines to address top challenges.
On-demand packages delivered by a team of Zyte web scraping experts with 12+ years of experience delivering 13bn pages a month.

  • Building your web scraping tech stack
  • Optimizing your web scraping
  • Scaling your web scraping operations
  • Best in class approach to data quality


Want to supercharge your web scraping team? Talk to us about Zyte API Enterprise and how Zyte API and our team of experts can give you the one-two punch for scraping success.

Here are a couple of resources to get you started:

  • Upcoming webinar - June 8th, 2023 - “How to navigate compliance, bans and maintenance to supercharge your web data extraction team” - Register Now
  • Zyte API Enterprise overview - Learn More
An Introduction to Web Scraping with Python lxml library
https://www.zyte.com/blog/web-scraping-python-lxml/
Thu, 18 May 2023

Whether you're trying to analyze market trends or gather data for research, web scraping can be a useful skill to have. This technique allows you to extract specific pieces of data from websites automatically and process them for further analysis or use.

In this blog post, we'll introduce the concept of web scraping and the lxml library for parsing and extracting data from XML and HTML documents using Python.

Additionally, we'll touch upon Parsel, an extension of lxml that is a key component of the Scrapy web scraping framework, offering even more advanced capabilities for handling complex web tasks.

What is Web Scraping?

Web scraping extracts structured data from websites by simulating user interactions. It involves navigating pages, selecting elements, and capturing desired information for various purposes like data mining, data harvesting, competitor analysis, market research, social media monitoring, and more.

While web scraping can be done manually by copying and pasting information from a website, this approach is often time-consuming and error-prone. Automating the process using programming languages like Python allows for faster, more accurate, and more efficient data collection with a web scraper.

What is lxml?

Python offers a wide range of libraries and tools for web scraping, such as Scrapy, Beautiful Soup, and Selenium. Each library has its own strengths and weaknesses, depending on the specific use case and requirements. lxml stands out due to its simplicity, efficiency, and flexibility when it comes to processing XML and HTML. lxml is designed for high-performance parsing and easy integration with other libraries. It combines the best of two worlds: the simplicity of Python's standard module xml.etree.ElementTree and the speed and flexibility of the C libraries libxml2 and libxslt.

HTML and XML files

HTML (HyperText Markup Language) is the standard markup language for creating web pages and web applications. Like XML (covered below), it is a hierarchical markup language, but its primary purpose is to structure and display content on the web.

HTML data consists of elements that browsers use to render the content on web pages. These elements, also referred to as HTML tags, have opening and closing parts (e.g., <tagname> and </tagname>) that enclose the content they represent. Each HTML tag has a specific purpose, such as defining headings, paragraphs, lists, links, or images, and they work together to create the structure and appearance of a web page.

Here's a simple HTML document example:

<!DOCTYPE html>
<html>
<head>
  <title>Bookstore</title>
</head>
<body>
  <h1>Bookstore</h1>
  <ul>
    <li>
      <h2>A Light in the Attic</h2>
      <p>Author: Shel Silverstein</p>
      <p>Price: 51.77</p>
    </li>
    <li>
      <h2>Tipping the Velvet</h2>
      <p>Author: Sarah Waters</p>
      <p>Price: 53.74</p>
    </li>
  </ul>
</body>
</html>

XML (eXtensible Markup Language) is a markup language designed to store and transport data in a structured, readable format. It uses a hierarchical structure, with elements defined by opening and closing tags. Each element can have attributes, which provide additional information about the element, and can contain other elements or text.

Here's a simple XML document example:

<?xml version="1.0" encoding="UTF-8"?>
<books>
  <book id="1">
    <title>A Light in the Attic</title>
    <author>Shel Silverstein</author>
    <price>51.77</price>
  </book>
  <book id="2">
    <title>Tipping the Velvet</title>
    <author>Sarah Waters</author>
    <price>53.74</price>
  </book>
</books>

Both XML and HTML documents are structured in a tree-like format, often referred to as the Document Object Model (DOM). This hierarchical organization allows for a clear and logical representation of data, where elements (nodes) are nested within parent nodes, creating branches and sub-branches.

The topmost element, called the root, contains all other elements in the document. Each element can have child elements, attributes, and text content.

The tree structure enables efficient navigation, manipulation, and extraction of data, making it particularly suitable for web scraping and other data processing tasks.

XPath vs. CSS Selectors

XPath and CSS selectors are two popular methods for selecting elements within an HTML or XML document. While both methods can be used with lxml, they have their own advantages and drawbacks.

XPath is a powerful language for selecting nodes in an XML or HTML document based on their hierarchical structure, attributes, or content. XPath can be considered more powerful than CSS selectors for navigating HTML markup, especially when dealing with complex document structures. However, it may have a steeper learning curve for those not familiar with its syntax.

CSS selectors, on the other hand, are a simpler and more familiar method for selecting elements, especially for those with experience in web development. They are based on CSS rules used to style HTML elements, which makes them more intuitive for web developers. While they may not be as powerful as XPath, they are often sufficient for most web scraping tasks.

Ultimately, the choice between XPath and CSS selectors depends on your personal preference, familiarity with each method, and the complexity of your web scraping project.
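
As a quick illustration, here is the same selection written both ways with lxml. This is a minimal sketch; the CSS version assumes the optional cssselect package is installed alongside lxml.

from lxml import html

doc = html.fromstring('<ul><li class="item">First</li><li class="item">Second</li></ul>')

# XPath: text of every <li> with class "item"
print(doc.xpath('//li[@class="item"]/text()'))        # ['First', 'Second']

# CSS selector: the same elements (requires the cssselect package)
print([li.text for li in doc.cssselect('li.item')])   # ['First', 'Second']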

Using lxml for web scraping

Let's look at an example of how to web scrape with Python lxml. Suppose we want to extract data about the title and price of books on the Books to Scrape web page, a sandbox website created by Zyte for you to test your web scraping projects.

First, we need to install the Python lxml module by running the following command:

pip install lxml
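
The example below also uses the requests library to download the page. If it is not installed already, it can be added the same way:

pip install requests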

To perform web scraping using Python and lxml, create a Python file for your web scraping script. Save the file with a ".py" extension, like "web_scraping_example.py". You can write and execute the script using a text editor and a terminal, or an integrated development environment (IDE).

Next, we can use the requests module to retrieve the HTML content of the page from the website:

import requests

url = "https://books.toscrape.com" 
response = requests.get(url)
content = response.content

After retrieving the HTML content, use the html submodule from lxml to parse it:

from lxml import html
parsed_content = html.fromstring(content)

Then, employ lxml's xpath method to extract the desired data from the web page:

# Parsing the HTML to gather all books
books_raw = parsed_content.xpath('//article[@class="product_pod"]')

books_raw is a list of article elements, each of which we can parse individually. Although we could extract the data directly by querying the titles and prices across the whole page, parsing each book element separately ensures greater consistency in more advanced data extraction cases.
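
For comparison, the direct approach mentioned above might look roughly like this, querying titles and prices as two parallel lists and relying on their order to pair them up (a sketch built from the same XPath expressions used later in this post):

# Direct extraction: two parallel lists that must be matched up by position
titles = parsed_content.xpath('//article[@class="product_pod"]//a/img/@alt')
prices = parsed_content.xpath('//article[@class="product_pod"]//p[@class="price_color"]/text()')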

Before proceeding, create a NamedTuple to store book information for improved readability with the following code:

from typing import NamedTuple

class Book(NamedTuple):
  title: str
  price: str

Using NamedTuple is not necessary, but it can be a good approach for organizing and managing the extracted data. NamedTuples are lightweight, easy to read, and can make the code more maintainable. By using NamedTuple in this example, we provide a clear structure for the book data, which can be especially helpful when dealing with more complex data extraction tasks.

With the NamedTuple Book defined, iterate through books_raw and create a list of Book instances:

books = []
for book_raw in books_raw:
  title = book_raw.xpath('.//a/img/@alt')
  price = book_raw.xpath('.//p[@class="price_color"]/text()')
  book = Book(title=title, price=price)
  books.append(book)

The books list will display the following output:

[Book(title=['A Light in the Attic'], price=['£51.77']),
 Book(title=['Tipping the Velvet'], price=['£53.74']),
 Book(title=['Soumission'], price=['£50.10']),
 Book(title=['Sharp Objects'], price=['£47.82']),
 Book(title=['Sapiens: A Brief History of Humankind'], price=['£54.23']),
 Book(title=['The Requiem Red'], price=['£22.65']),
 ...
]
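
If you prefer plain strings rather than single-element lists, one small variation is to take the first match of each query. This assumes every book element contains exactly one title and one price, so an IndexError would signal an unexpected page structure:

books = []
for book_raw in books_raw:
  title = book_raw.xpath('.//a/img/@alt')[0]
  price = book_raw.xpath('.//p[@class="price_color"]/text()')[0]
  books.append(Book(title=title, price=price))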

You can execute your web scraping script from the same Python console or terminal where you installed the lxml library. This way, you can run the script and observe the output directly in the console, or store the scraped data in a file or a database, depending on your project requirements.

Extending lxml with Parsel/Scrapy

While lxml is a popular and powerful library for data extraction in Python, Parsel, a part of the Scrapy framework, can be an excellent addition to your toolkit.

Parsel allows you to parse HTML and XML documents, extract information, and traverse the parsed structure. It is built on top of the lxml library and provides additional functionality, like handling character encoding and convenient methods for working with CSS and XPath selectors.

The following code is an example using Parsel with the CSS selector method:

from parsel import Selector
sel = Selector(text=u"""<html>
    <body>
        <h1>Hello, Parsel!</h1>
        <ul>
            <li><a href="http://example.com">Link 1</a></li>
            <li><a href="http://scrapy.org">Link 2</a></li>
        </ul>
    </body>
    </html>""")
sel.css('h1::text').get()  # Output: 'Hello, Parsel!'

It is also possible to apply regular expressions to Parsel's selectors after the CSS or XPath extraction:

sel.css('h1::text').re(r'\w+')  # Output: ['Hello', 'Parsel']
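
The same Selector object also supports XPath queries; a short sketch using the sel defined above:

sel.xpath('//li/a/@href').getall()  # Output: ['http://example.com', 'http://scrapy.org']
sel.xpath('//h1/text()').get()      # Output: 'Hello, Parsel!'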

Conclusion

Web scraping is a powerful technique that enables users to collect valuable data from websites for various purposes. By understanding the fundamentals of HTML and XML documents and leveraging the Python lxml library, users can efficiently parse and extract data from web pages for simple data extraction tasks.

However, it's important to note that Python’s lxml may not be suitable for handling more complex projects. In those cases, Parsel, a key component of Scrapy, offers a superior solution. Scrapy comes with numerous benefits, including built-in support for handling cookies, redirects, and concurrency, as well as advanced data processing and storage capabilities. By utilizing Parsel for parsing both HTML and XML documents, Scrapy delivers a powerful and efficient way to traverse the parsed structure and extract the necessary information. This comprehensive library, combined with the robust and feature-rich capabilities of Scrapy, enables users to confidently tackle even the most complex web scraping projects.

By understanding the principles and techniques discussed in this blog post, you'll be prepared to tackle web scraping projects using either lxml or a comprehensive solution like Scrapy, harnessing data to achieve your objectives.

A Practical Guide to JSON Parsing with Python
https://www.zyte.com/blog/json-parsing-with-python/
Thu, 06 Apr 2023

JSON (JavaScript Object Notation) is a text-based data format used for exchanging and storing data between web applications. It simplifies the data transmission process between different programming languages and platforms.

The JSON standard has become increasingly popular in recent years. It’s a simple and flexible way of representing data that can be easily understood and parsed by both humans and machines. A JSON object consists of key-value pairs enclosed in curly braces, with each key separated from its value by a colon and the pairs separated by commas.

Python provides various tools, libraries and methods for parsing and manipulating JSON data, making it a popular choice for data analysts, web developers, and data scientists.

In this guide, we’ll explore the syntax and data types of JSON, as well as the Python libraries and methods used for parsing JSON data, including more advanced options like JMESPath and ChompJS, which are very useful for web scraping data.

Reading JSON

One of the most common tasks when working with JSON data is to read its contents. Python provides several built-in libraries for reading JSON from files, APIs, and web applications. To read JSON data, you can use the built-in json module (JSON Encoder and Decoder) in Python.

The json module provides two methods, loads and load, that allow you to parse JSON strings and JSON files, respectively, to convert JSON into Python objects such as lists and dictionaries. Next is an example of how to convert a JSON string to a Python object with the loads method.

import json

json_input = '{ "make": "Tesla", "model": "Model 3", "year": 2022, "color": "Red" }'
json_data = json.loads(json_input)
print(json_data) # Output: {'make': 'Tesla', 'model': 'Model 3', 'year': 2022, 'color': 'Red'}

The following example uses the load method. Given a JSON file named data.json:

{
    "make": "Tesla",
    "model": "Model 3",
    "year": 2022,
    "color": "Red"
}

We use the with open() context manager and json.load() to read the contents of the JSON file into a Python dictionary.

import json

with open('data.json') as f: 
    json_data = json.load(f) 

print(json_data)  # Output: {'make': 'Tesla', 'model': 'Model 3', 'year': 2022, 'color': 'Red'}

Parse JSON data

After loading JSON data into Python, you can access specific data elements using the keys provided in the JSON structure. In JSON, data is typically stored in either an array or an object. To access data within a JSON array, you can use array indexing, while to access data within an object, you can use key-value pairs.

import json

json_string ='{"numbers": [1, 2, 3], "car": {"model": "Model X", "year": 2022}}'
json_data = json.loads(json_string) 

# Accessing JSON array elements using array indexing 
print(json_data['numbers'][0])  # Output: 1 

# Accessing JSON elements using keys 
print(json_data['car']['model'])  # Output: Model X

In the example above, there is an object 'car' inside the JSON structure that contains two mappings ('model' and 'year'). This is an example of a nested JSON structure where an object is contained within another object. Accessing elements within nested JSON structures requires using multiple keys or indices to traverse through the structure.
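
When a key may be missing, the dictionary's get() method is a convenient way to avoid a KeyError while traversing nested structures; a small sketch based on the same data:

import json

json_string = '{"numbers": [1, 2, 3], "car": {"model": "Model X", "year": 2022}}'
json_data = json.loads(json_string)

# get() returns a default (an empty dict, then None) instead of raising KeyError
model = json_data.get('car', {}).get('model')  # 'Model X'
color = json_data.get('car', {}).get('color')  # None, no exception raised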

JSON and Python objects Interchangeability

JSON is a string format used for data interchange that shares similar syntax with Python dictionary literals. However, it is essential to remember that JSON is not the same as a Python dictionary. When loading JSON data into Python, it is converted into a Python object, typically a dictionary or list, and can be manipulated using the standard methods of Python objects. When you are ready to save the data, you need to convert it back to JSON format using the json.dumps() function.
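
A small sketch highlights the difference: JSON text uses true, false and null with double-quoted keys, while the loaded Python object uses True, False and None:

import json

json_text = '{"in_stock": true, "discount": null}'  # JSON text (a str)
data = json.loads(json_text)                        # Python dict

print(data)              # {'in_stock': True, 'discount': None}
print(json.dumps(data))  # {"in_stock": true, "discount": null}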

Modifying JSON data

Working with JSON in Python also involves modifying the data by adding, updating or deleting elements. In this post we will focus on the basics, so we will be using the json built-in package, as it provides all basic functions we require to accomplish these tasks.

Adding an element

To add an element, you can assign a value to a new key in the loaded dictionary using standard dictionary syntax. For example:

import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)
json_data['color'] = 'red'

print(json_data)  # Output: {'model': 'Model X', 'year': 2022, 'color': 'red'}

Updating an element

Updating an element follows the same logic as the previous snippet, but instead of creating a new key, it replaces the value of an existing key.

import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)
json_data['year'] = 2023

print(json_data)  # Output: {'model': 'Model X', 'year': 2023}

Another approach to adding or updating values in a Python dictionary is the update() method. It adds or updates elements in the dictionary using the values from another dictionary, or from an iterable containing key-value pairs.

import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string) 

more_json_string = '{"model": "Model S", "color": "Red"}'
more_json_data = json.loads(more_json_string)

json_data.update(more_json_data)
print(json_data)  # Output: {'model': 'Model S', 'year': 2022, 'color': 'Red'}

Deleting an element

To remove an element from a JSON object, you can use the del keyword to delete the corresponding value.

import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string) 
del json_data['year']

Another approach to removing an element from a dictionary with JSON data is to use the pop method, which allows you to retrieve the value and use it at the same time it is removed.

import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)
year = json_data.pop('year')

print(year)  # Output: 2022
print(json_data)  # Output: {'model': 'Model X'}

Beware: trying to remove an element with del when the key is not present raises a KeyError exception. The same happens with pop when it is called with just the key; however, pop accepts an optional second argument that is returned as a default when the key is missing, which avoids the exception. If you need to use del and are not sure whether the key exists, you can either check for the key first or catch the exception, as shown below.

import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)
if 'year' in json_data: 
    del json_data['year'] 
else: 
    print('Key not found') 

# or wrapping the del operation in a try/except block
json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string) 
try: 
    del json_data['year']
except KeyError: 
    print('Key not found')
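
As mentioned above, pop can sidestep the exception entirely when given a default value as its second argument:

import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

year = json_data.pop('year', None)   # 2022
color = json_data.pop('color', None) # key is missing, so the default None is returned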

Python Error Handling: Check or Ask?

When it comes to error handling in Python, there are two methods: "check before you leap" and "ask for forgiveness." The former involves checking the program state before executing each operation, while the latter tries an operation and catches any exceptions if it fails.

The "ask for forgiveness" approach is more commonly used in Python and assumes that errors are a regular part of program flow. This approach provides a graceful way of handling errors, making the code easier to read and write. Although it can be less efficient than the "check before you leap" approach, Python's exception handling is optimized for it, and the performance difference is usually insignificant.

Saving JSON

After modifying a JSON file or JSON string, you may want to save the data back to a JSON file or export it as a JSON string. The json.dump() method allows you to save a JSON object to a file, while json.dumps() returns a JSON string representation of an object.

Saving JSON to a file using json.dump() and the with open() context manager in write mode ("w"):

import json

data = '{"model": "Model X", "year": 2022}'

# Saves the dictionary named data as a JSON object to the file data.json 
with open("data.json", "w") as f:
    json.dump(data, f)

Converting a Python object to a JSON string using json.dumps():

import json

data = {"model": "Model X", "year": 2022}

# Converts the data dictionary to a JSON string representation
json_string = json.dumps(data)

print(json_string)  # Output: {"model": "Model X", "year": 2022}
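
json.dumps() and json.dump() also accept formatting options that are handy when the output is meant to be read by people; a brief sketch:

import json

data = {"year": 2022, "model": "Model X"}

# indent pretty-prints the output; sort_keys makes the key order deterministic
print(json.dumps(data, indent=4, sort_keys=True))
# {
#     "model": "Model X",
#     "year": 2022
# }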

Advanced JSON Parsing Techniques

When traversing JSON data in Python, depending on the complexity of the object, there are more advanced libraries to help you get to the data with less code.

JMESPath

JMESPath is a query language designed to work with JSON data. It allows you to extract specific parts of a JSON structure based on a search query. JMESPath is well-suited for advanced JSON parsing tasks because it can handle complex, nested JSON structures with ease. At the same time, it is easy to use at beginner level, making it an accessible tool for anyone working with JSON data.

Here's an example using the jmespath library in Python to extract data:

import json
import jmespath

json_string = '{"numbers": [1, 2, 3], "car": {"model": "Model X", "year": 2022}}'
json_data = json.loads(json_string) 

# Accessing nested JSON 
model = jmespath.search('car.model', json_data)  # Result: Model X

# Taking the first number from numbers 
first_number = jmespath.search('numbers[0]', json_data)  # Result: 1

Those examples only display the basics of what JMESPath can do. JMESPath queries can also filter and transform JSON data. For example, you can use JMESPath to filter a list of objects based on a specific value or to extract specific parts of an object and transform them into a new structure.

Let's say we have a JSON array of car objects, each containing information such as the car's make, model, year and price:

cars = [
    {"make": "Toyota", "model": "Corolla", "year": 2018, "price": 15000},
    {"make": "Honda", "model": "Civic", "year": 2020, "price": 20000},
    {"make": "Ford", "model": "Mustang", "year": 2015, "price": 25000},
    {"make": "Tesla", "model": "Model S", "year": 2021, "price": 50000}
]

We can use JMESPath to filter this list and return only the cars that are within a certain price range, and transform the result into a new structure that only contains the make, model, and year of the car:

import jmespath

result = jmespath.search("""
    [?price <= `25000`].{
        Make: make, 
        Model: model, 
        Year: year
    }
""", cars)

The output of this code will be:

[
    {'Make': 'Toyota', 'Model': 'Corolla', 'Year': 2018},
    {'Make': 'Honda', 'Model': 'Civic', 'Year': 2020},
    {'Make': 'Ford', 'Model': 'Mustang', 'Year': 2015}
]

Mastering JMESPath is a sure way to never have a headache when dealing with JSON parsing in Python. Even complex JSON structures, like those often encountered in web scraping when dealing with a JSON document found on websites, can be easily handled with JMESPath's extensive features.

JMESPath is not only available for Python, but also for many other programming languages, such as Java and Ruby. To learn more about JMESPath and its features, check out the official website.

ChompJS

Web scraping involves collecting data from websites, and that data may be embedded in JavaScript objects that initialize the page. While the standard library function json.loads() extracts data from JSON objects, it is limited to valid JSON. The problem is that not all valid JavaScript objects are also valid JSON. For example, all of the following strings are valid JavaScript objects but not valid JSON:

    • "{'a': 'b'}" is not a valid JSON because it uses ' character to quote

    • '{a: "b"}' is not a valid JSON because property name is not quoted at all

    • '{"a": [1, 2, 3,]}' is not a valid JSON because there is an extra "," character at the end of the array

    • '{"a": .99}' is not a valid JSON because float value lacks a leading 0

The chompjs library was designed to get around this limitation, and it allows you to parse such JavaScript objects into proper Python dictionaries:

import chompjs

chompjs.parse_js_object("{'a': 'b'}")  # Output: {u'a': u'b'}
chompjs.parse_js_object('{a: "b"}')  # Output: {u'a': u'b'}
chompjs.parse_js_object('{"a": [1, 2, 3,]}')  # Output: {u'a': [1, 2, 3]

chompjs works by parsing the JavaScript object and converting it into a valid Python dictionary. In addition to parsing simple objects, it can also handle objects containing embedded methods by storing their code in a string.

One of the benefits of using chompjs over json.loads is that it can handle a wider range of JavaScript objects. For example, chompjs can handle objects that use single quotes instead of double quotes for property names and values. It can also handle objects that have extra commas at the end of arrays or objects.
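
Putting this together in a typical web scraping scenario, here is a minimal sketch; the window.__DATA__ script tag and its contents are hypothetical, and the regular expression is only one way to isolate the object literal before handing it to chompjs:

import re
import chompjs

html = '<script>window.__DATA__ = {"products": [{"name": "Laptop", "price": 999.99,},],};</script>'

# Pull the JavaScript object literal out of the script tag, then parse it;
# the trailing commas would make json.loads() fail, but chompjs accepts them
raw = re.search(r'window\.__DATA__ = (.*);</script>', html).group(1)
data = chompjs.parse_js_object(raw)
print(data)  # Expected: {'products': [{'name': 'Laptop', 'price': 999.99}]}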

Dealing with Custom Python objects

Almost all programming languages support custom objects, which are created using object-oriented programming concepts. However, while the basic principles of object-oriented programming are the same across different programming languages, the syntax, features, and use cases of custom objects can vary depending on the language. Custom Python objects are typically created using classes, which can encapsulate data and behavior.

One example of a custom Python object is the Car class:

class Car:
    def __init__(self, make, model, year, price):
        self.make = make
        self.model = model 
        self.year = year
        self.price = price

To create a new Car object, we can simply call the Car constructor with the appropriate arguments:

car = Car("Toyota", "Camry", 2022, 25000)

If we try to serialize the Car object as-is, we will get a TypeError:

car_json = json.dumps(car)

TypeError: Object of type 'Car' is not JSON serializable

This error occurs because json.dumps() doesn't know how to serialize our Car object. By default, the json module in Python can only serialize certain types of objects, like strings, numbers, and lists/dictionaries. To serialize our Car object to a JSON string, we need to create a custom encoding class.

Encoding

We can create a custom encoder by inheriting from json.JSONEncoder and overriding the default method. This allows us to convert python objects into JSON strings. The default method is called by the JSON encoder for objects that are not serializable by default.

import json

class CarEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Car):
            return {"make": obj.make, "model": obj.model, "year": obj.year, "price": obj.price}
        return super().default(obj)

Inside the default method, we check if the object being encoded is an instance of the Car class. If it is, we return a dictionary with the attributes. If it is not an instance of the Car class, we call the default method of the parent class to handle the encoding.

car = Car("Toyota", "Camry", 2022, 25000)
car_json = CarEncoder().encode(car)

print(car_json)  # Output: {"make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}

By using a custom encoding class, we can customize how our objects are serialized to JSON and handle any special cases that may not be covered by the default encoding behavior.
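
For simple classes whose attributes are already JSON-friendly, a lighter-weight alternative is to pass a default callable directly to json.dumps(); a minimal sketch using the Car class defined above:

import json

car = Car("Toyota", "Camry", 2022, 25000)

# __dict__ exposes the instance attributes as a plain dictionary
car_json = json.dumps(car, default=lambda obj: obj.__dict__)
print(car_json)  # {"make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}

A full encoder class remains the better choice when you need to handle several types or attach extra metadata, as the decoding section below does with a __type__ field.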

Decoding

Just as we can use custom encoding classes to serialize custom objects to JSON, we can also use custom decoding classes to decode JSON strings back into our custom objects.

Our CarEncoder only handles encoding; it does not yet deal with turning JSON back into the custom object. If we simply call json.loads() on the encoded string, we receive a plain dictionary with the values, not a Car object.

car_json = '{"make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}'
car_dict = json.loads(car_json)

print(car_dict)  # Output: {"make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}

As you can see, the output is just a dictionary with the attributes of the Car object. If we want to turn this dictionary back into a Car object, we need to create a custom decoder class to be used with the json.loads() method.

Adding metadata

One way to let the decoder know which object type it should reconstruct is to add type metadata to the output when the object is encoded.

if isinstance(obj, Car):
    return {"make": obj.make, "model": obj.model, "year": obj.year, "price": obj.price}

Adding a __type__ metadata field to our previous CarEncoder:

if isinstance(obj, Car):
    return {"__type__": "Car", "make": obj.make, "model": obj.model, "year": obj.year, "price": obj.price}

We can use this with a custom decoding class to determine which object to create.

car = Car("Toyota", "Camry", 2022, 25000)
car_json = json.dumps(car, cls=CarEncoder)

print(car_json)  # Output: {"__type__": "Car", "make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}

Here is the CarDecoder class, which will allow us to pass data as JSON string and return the custom python object.

class CarDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, dct):
        if '__type__' in dct and dct['__type__'] == 'Car':
            return Car(dct['make'], dct['model'], dct['year'], dct['price'])
        return dct

Then we can use CarDecoder in the json.loads() method as the cls parameter.

car_json = '{"__type__": "Car", "make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}'
car = json.loads(car_json, cls=CarDecoder)

print(car.make)   # Output: "Toyota"
print(car.model)  # Output: "Camry" 
print(car.year)   # Output: 2022
print(car.price)  # Output: 25000

Conclusion

In this guide, we've covered the basics of reading and parsing JSON data with Python, as well as how to access and modify JSON data using Python's built-in json package. We've also discussed more advanced JSON parsing options, such as JMESPath and ChompJS, which are useful for web scraping data. With the knowledge gained from this guide, you should be able to efficiently work with JSON data in Python and integrate it into your developer workflow.

 

How to (safely) extract data from social media platforms and news sites
https://www.zyte.com/blog/how-to-safely-extract-data-from-social-media-platforms-and-news-sites/
Fri, 31 Mar 2023

Data extraction from news sites and social media platforms is becoming an increasingly common practice. Popular use cases range from ensuring more informed investment decisions to protecting brand reputation.

However, if your core business isn’t focused on news aggregation or analysis, it can be difficult to know how to scrape news articles and social posts effectively, without breaking the law or unintentionally disrupting websites. While web scrapers can make it possible to manage anti-ban restrictions, this doesn’t remove the need to stay legally compliant.

To help you overcome the common dilemmas faced when developing a data feed, our team here at Zyte hosted a webinar on how to build news and social media data schemas successfully (and what to avoid!). 

With guest speakers including Sanaea Daruwalla (Chief Legal Officer), and Konstantin Lopukhin (Head of Data Science) the webinar includes helpful advice on improving data coverage, the best data fields to scrape, and abiding by key regulatory considerations.

If you haven’t watched the webinar yet, here’s a breakdown of what to expect.

Disclaimer: The recommendations in this guide do not constitute legal advice. Our Chief Legal Officer is a lawyer, but she’s not your lawyer, so none of her opinions or recommendations in this guide constitute legal advice from her to you. The commentary and recommendations outlined below are based on Zyte’s experience helping our clients (startups to Fortune 100’s) maintain GDPR compliance whilst scraping 7 billion web pages per month. If you want assistance with your specific situation then you should consult a lawyer.

Which data fields should you extract for news schemas (aka article schemas) and why?

Extracting data from the right fields is crucial to ensure what you’ve collected is relevant and reliable. Zyte’s Head of Data Science (Konstantin Lopukhin) provides a detailed overview of which data fields you should include in your article schema. These include:

  • The URL - Great for identifying the source of the data and tracking changes over time.
  • Headline - Very beneficial for sentiment analysis, or to identify the topic of the article.
  • Author and publication date - Good for understanding when the article was published and avoiding old news.
  • Article body - This is the main and, arguably, the most important field in most use cases, but also the one with the strongest legal protections. Konstantin says: “This field provides you with the text of the article, and the quality of this field must be really high for you to trust the rest of the data feed.”

What are the legal implications of news data extraction?

Extracting data from news articles requires that businesses remain vigilant to ensure compliance. Zyte’s Chief Legal Officer, Sanaea Daruwalla, provided some expert insight on the legal factors to consider when developing a news data extraction schema.

One data field which many assume is legally compromising, but can be a bit more innocuous, is the author’s name. While the author’s name is personal data, many jurisdictions, such as the EU, have an exemption for journalistic use. If you're not sure whether your use case falls under the exemption, you need to do a data protection impact assessment (DPIA).

The article body, however, is a field that you need to be much more careful with, due to copyright law. Sanaea says: “The article body is always going to hold copyright protection, so your use of it has to fall under an exception to copyright law. Use cases are important. If you want to gain an exception, you can't be republishing. If using the content for investment decisions or sentiment analysis, you’re generally safe.” If you want to scrape the full article body, you should check with your lawyer to see if your use case qualifies as an exception under the applicable copyright laws.

Sanaea also explains, as a general rule, you should avoid scraping articles that require a subscription or login to access them. That’s because when you sign up for that particular service, you need to abide by their terms and conditions which usually involve a ban on web scraping. “Once you accept the terms and conditions, you're bound by them,” said Sanaea.

What are the legal implications of social media data extraction?

You might think ‘well social media is public, so it must be legal to scrape social content’, but this isn’t always the case. 

While social media pages owned by businesses are generally safe, you need to be far more cautious when scraping content from individuals. Sanaea says: “You need to be really cautious when you're taking personal data. Even if it's out there in the public, GDPR still applies.”

There are some exceptions, such as anonymizing personal data. For example, Zyte can help you anonymize the user tag and remove all personal indicators while retaining data such as the number of followers, likes, and reposts. Anonymized data is not considered personal data under the GDPR.

Similar to news scraping, businesses should never scrape social media content behind a login. Actions like this could be considered a breach of contract, and you would be putting yourself in extremely dangerous territory. 

How to design and develop a news schema that works

For those looking to design and develop their article schema, there are a lot of considerations to ensure efficacy and compliance.  Zyte has done extensive work in this field, and we’re happy to share our expertise to ensure you don’t fall into any common pitfalls. Here are a few of the key pointers mentioned in the webinar:

  • Consistency is everything. You need to be able to do your longitudinal analysis, especially when it comes to making investment decisions. The last thing you want to do is to change your schema in a few months' time.
  • Be cost-effective. Scraping news data can get expensive, so make sure you set expectations beforehand. 
  • Consider copyright and data protection implications. We’ve covered this already, but it’s absolutely imperative that you’re smart about navigating legal matters.
  • You need robust monitoring in place. If you're building your business or your products around this data, you need to know that your design is reliable. 
  • Ensure you have the right resources. Ask yourself whether you have the right team with the right skills in place to manage the constantly shifting environment. 

Extract news and social media data safely with Zyte

News data extraction is an intricate and legally complex practice, which shouldn’t be managed alone. Without the right resources or the right legal team in place, you could be vulnerable to collecting unreliable or unlawful data.

Zyte is here to help. As the world’s leading web scraping service, we are experts at finding, extracting, and formatting datasets so you don't have to.

We provide reliable and legally compliant web data, with competitive upfront pricing structures and ongoing support at every stage of the process. Try us for free today. 

To watch the full webinar, with all insights and advice on news and social media schemas, click here.

Zyte API – a single solution for web data extraction
https://www.zyte.com/blog/introducing-zyte-api/
Wed, 21 Dec 2022

We are excited to announce our latest web data extraction solution – Zyte API.

Our vision was to create a single API that takes care of all your web data extraction needs. So we built a web scraping infrastructure that can handle the most demanding use cases for the most sophisticated users.

This allows developers to focus on data – not on extraction – and forget about proxies, bans and maintenance.

Enter a URL, get data. It’s that simple.

What is Zyte API

All-in-one solution

Zyte API is a single automated solution for dependable web data extraction that uses the leanest setup to reliably return HTML from any website at the lowest cost. 

It provides the necessary tools to extract data from the most sophisticated sites applying state-of-the-art techniques using an automated “all-in-one” solution, getting rid of time-consuming configuration and anti-scraping workarounds. This replaces a previous set of disparate tools, along with the trial and error process (and related expense) of using these tools to find the right solution when extracting data.


"Zyte is committed to providing powerful tools that empower people and organizations, both large and small, to collect this valuable, publicly available data to unlock new solutions, build intelligence, and create new opportunities in the easiest, most reliable, cost-effective way possible."

Shane Evans, CEO at Zyte

Why use Zyte API

Lean user experience

Users experience a simple, seamless and predictable collection of data without the need to juggle multiple tools and configurations to battle anti-data collection measures, while automated monitoring and a team of specialists build new intelligence into the platform to stay on top of the shifting web environment and anti-scraping measures.

Data collection engineers in large-scale professional environments can quickly build a scraping stack in a fraction of the time previously required, all while using exactly the right features and resources needed for each domain.


Powerful API

Zyte API consolidates virtually every known web scraping technology and technique into a simple, yet powerful API to collect web data at scale. What's more, it will automatically adapt to any site changes, ensuring you’re never banned and always get your data even if your target site changes its code.

We built this to be the ultimate reliable web data collection for both large and small-scale operations.

"Data scientists and professionals do not get into data to create endless configs across multiple tools that will ultimately break and need constant, specialized attention and supervision.

Zyte API is a genuine breakthrough for these professionals as it counters virtually every anti-scraping method in current use, freeing these data engineers to focus on creating value from the data rather than herding algorithms and proxies. We have automated the tedious, repeatable tasks so that our users can focus on collecting the data and putting it to use."

Iain Lennon, Chief Product Officer at Zyte.

Built for purpose

Users can simply and effectively collect any publicly available data on the internet, avoiding virtually all anti-web scraping measures by automatically using the best, most cost-effective techniques at all scales.

At Zyte, our customers are at the core of everything we do. We want them to get data quickly and efficiently. Through a single API that automates all the repeatable and difficult tasks of web scraping, our customers can focus on driving insights and impact in their organizations.


Zero downtime

Zyte offers per-site pricing, providing the most cost-effective solution for reliable web data collection. Cost per site is determined by the specific tools needed to solve each respective site’s anti-web scraping measures and bans.

Our scriptable headless browser more closely mimics human behavior, and increases success rates in returning data while offering you full control of its actions. You also get the most common browser actions for scraping, to save you even more time and money.

Always get the data you need – at a price that makes sense.

What's included

zyte api features

How do I start?

  1. Sign up here and get $5 in free credits.
  2. Use your web browser to send your first request to a website so you can check what the cost is.
  3. Follow the Zyte API tutorial to learn how to get more data.

Summary

Websites have become more difficult to scrape over recent years, and our newest innovation represents a big step forward in the sophistication of web scraping utility.

Zyte API automatically finds the right-sized features and configures itself to retrieve data from any website, building your scraping stack for you in a fraction of the time previously required and using only the features and resources each site requires.

This all saves money by only using more complex or expensive features when absolutely required, and by reducing valuable developer hours on maintenance and tasks better suited to algorithms and specialist API developers.

"The collection of web data is used every day to solve real-world problems, including providing insights on everything from business challenges, economic indicators, the spread of diseases, and even combating human trafficking.

We are unequivocal believers in the immense value that Internet data has for creating value, enriching society, and unlocking social and economic benefit."

Shane Evans, CEO at Zyte

We look forward to seeing how our customers change the world through data using Zyte API.

Check our Zyte API tutorial and get started in minutes.

Black Friday 2022 – an analysis of web scraping patterns
https://www.zyte.com/blog/blackfriday-web-data-extraction/
Fri, 16 Dec 2022

We're just coming off an intense Black Friday season, and being such a significant date for ecommerce web data, we have some great news to share!

Our team used Zyte products for an in-depth analysis to compare market trends with data demand requests received during this period – and the results were pretty impressive.

There was a clear correlation between web data requests and market trends, our performance was off the charts, and we can say that the “Black Friday Creep” is real.

Read on to see what we uncovered during this Black Friday season. 

Backstory

Black Friday and Cyber Monday may seem like 2 days of shopping mania following Thanksgiving, but there is more to it than that.

Although Thanksgiving day and Black Friday were historically the top selling days for the ecommerce industry, the biggest day now for ecommerce business is Cyber Monday.

According to Adobe Analytics, Cyber Monday 2022 had US$11.3 billion in total spending online.

Source: Adobe Analytics, November 2022

A new phenomenon called “Black Friday Creep” has turned the Black Friday markets upside down, forcing retailers to offer more deals earlier in the week and for a longer period of time.

And it’s precisely what sparked our curiosity to take a deep dive into our web data requests.

Increased need for ecommerce web data

Let me show you what I mean with the patterns we found during the Black Friday season.

We tracked and analyzed the requests received by Zyte Smart Proxy Manager for a popular ecommerce site – starting on Tuesday (11/22) and throughout the Black Friday (11/25) and Cyber Monday (11/28) period. 

The chart below represents a per-hour aggregation of the traffic handled by our Smart Proxy Manager infrastructure when users are crawling this particular website, where each request typically allows a customer to extract data from a single webpage.

Smart Proxy Manager infrastructure: traffic handled (per hour aggregation)

Web scraping patterns in ecommerce web data

  • Cyber Monday surpassed Black Friday in volume of requests, with 61 million requests processed on this site alone, compared to 46.7 million requests for Black Friday.
  • We noticed a sharp increase in data extraction demand on the days leading up to Thanksgiving, which is when early deals get released.
  • This was followed by a small 2-day valley before landing on Black Friday.
  • That was again followed by a small dip, before finishing with a high on Cyber Monday, the strongest day in 2022.
  • Looking at traffic numbers registered by our Smart Proxy Manager, we saw a 7.88% increase when comparing just Black Friday.
  • Comparing Tuesday to Monday (end of Cyber Monday), we saw a 6.34% increase year-over-year.

Performance

We are pleased to report over 99% extraction performance across the entire week. We delivered an increase of 3.34% on Black Friday and 3.95% higher performance across the entire week when compared to 2021.


Conclusion

Our analysis showed a clear correlation between volume of data requests vs sales reports for the Black Friday season.

  • Cyber Monday surpassed Black Friday in Zyte’s data demand requests. 
  • 99% performance average across the entire week.
  • Black Friday Creep is real with high data demand requests across the entire week leading up to Black Friday and Cyber Monday.

Interest in ecommerce web data is greater than ever – and the results prove we can deliver during the busiest of seasons. Even with high volume requests, we still provided excellent performance to our customers. 

This drives us to constantly seek new ways to improve our data extraction process and provide data of the highest quality possible. 

Zyte Smart Proxy Manager will not only help you get ecommerce product data, but it's a powerful tool to handle Cyber Monday & Black Friday ecommerce web data. 


Zyte is the market leader in product data extraction – with services that provide data on 3 billion products per month – so you can extract web data from any ecommerce site. 

We understand the importance of web data for ecommerce and always ensure that results are accurate and up-to-date. 

Contact us with your needs for ecommerce web data extraction and our team of experts will take care of it.

Definitions

Below are some of the definitions of the terms used within this article. 

  • Black Friday week period: From Tuesday (during Black Friday week) until the end of Monday (Cyber Monday). 
  • Web data extraction performance: It represents the rate at which Smart Proxy Manager effectively extracts content from the target websites. As this is a rate, it is measured as a percentage which is the total number of successful responses divided by total responses during a given period of time.
How web scraping can be used for digital transformation
https://www.zyte.com/blog/web-scraping-digital-transformation/
Wed, 30 Nov 2022

Digital transformation has become an increasingly popular term these days. Regardless of the industry you work in, you have probably already heard of it.

Digital transformation (DX) is the adoption of digital technologies to enhance an organization's products, services and operations. The successful deployment of a digital transformation strategy can help improve overall business efficiency. 

Similarly, web scraping has also seen a surge in popularity lately within the business world.

How does web scraping work and what is it used for? 

Web scraping is defined as the automatic process of collecting web data from any public website in a structured format. Scraping tools and data extraction software help collect information quickly and accurately from various sources to make the best business decisions.

Furthermore, both the digital transformation and web scraping industries show no signs of slowing down.

A study by IDC shows global digital transformation investments will reach US$3.4 trillion in 2026, with a five-year CAGR of 16.3%.

Every major organization is taking the leap to incorporate digital transformation in their operations.

And one key component at the core of a successful digital transformation – data. 

This article shows how web scraping and data extraction can help successfully navigate your digital transformation journey. 

Automated web data scraping & digital business transformation

DX vendors that automate data extraction and web scraping gain a better understanding of their customers’ industries and can tailor better solutions to empower their clients.

A solid digital transformation strategy is crucial for companies to leverage digital technologies (and data), and turn them into a key part of their business. 

This is precisely where web scraping and digital transformation intertwine.

Web scraping is a powerful tool to support digital transformation, as it helps organizations collect and use data more efficiently: it provides valuable data and market insights, automates processes, and improves the customer experience.

It helps address key pain points and improve digital transformation efforts. But it's not that simple.

Digital transformation challenges

The global economic landscape is constantly changing, and companies need to be prepared for change. 

Enterprises must identify and assess the risks associated with global economic changes, such as increases in interest rates in the different countries they operate. 

They should also plan for potential legislation changes or geopolitical events and ensure their operations are aligned with local standards. 

The challenges:

  • Understand the continuous evolution of customer needs
  • Supply Chain Management 
  • How to achieve economies of scale
  • Scaling outsourcing demands quickly and efficiently

As you can see, there are many setbacks and they are not easy to overcome.

It’s imperative to understand the impact of data and how to be a data-driven organization to address digital transformation pain-points and improve operational efficiency. From there, data extraction and web scraping are a starting point to implement a sustainable digital transformation strategy.

Let's look into how web data scraping can help.

How does web scraping work in digital transformation?

Through the use of different tools and web scraping software, organizations can gather data and scale up their web scraping efforts. This helps better understand customer behavior, assess market trends, and track competitor activity, among other uses. 

Digital transformation providers play a major role in helping a myriad of organizations integrate and simplify their daily operations to improve performance through advanced digital technologies. These include cloud services, big data, data analysis, predictive analytics, quantum computing, machine learning (ML), and artificial intelligence (AI), among other technologies.

By working with valuable and reliable data from public websites, you get new insights that enhance your service offerings, uncover new business opportunities, solve internal pain points, and improve your overall decision making.

Enterprise data extraction plays a key role in driving the success of a digital business transformation, as it helps organizations better understand how to utilize data, automate internal business processes, and gain a competitive edge. 

Digital transformation business intelligence

When looking at digital transformation companies, we are often talking about major enterprises with thousands of employees spread across the globe. 

Companies that make use of business intelligence and aim to be data-driven organizations will most likely focus on one key aspect.

Data.

Lots of it...and from a variety of sources.  

For example, when it comes to effectively managing large global organizations – in order to reduce employee turnover, improve internal operations, and enhance supply chain management – businesses need to anticipate demand in specific regions if they want to remain relevant.

Key points to consider in digital transformation business intelligence:

  • Supply chain management 
  • Internal data management
  • Consumer price index and global price changes 

If you are still asking yourself – what is web scraping used for in a digital transformation strategy – the following will help provide more clarity.

Supply Chain Management

Hardware, laptops and work equipment must be taken into consideration, as many times these are provided by an internal supply chain management department and distributed to employees in multiple countries – each with their own unique pricing and regulations. 

One of the key advantages of web scraping in supply chain management is that it can help companies reduce costs and improve delivery times. 

By monitoring online prices and scraping data from websites, companies can get real-time pricing information and make informed decisions about where to source products. 

Consumer price index (CPI)

Web scraping can help companies keep track of consumer price indices in order to better understand global market trends. 

Understanding how the consumer price index (CPI) and inflation rates impact business can completely change the way you view your business. Budget considerations based on fluctuating foreign exchange rates are crucial for companies that work with teams distributed across the globe. 

Each country has its own type of employee benefits, retirement and healthcare systems, which can affect the cost of hiring employees and lead to variations in salaries.

With a proper understanding of the Consumer Price Index (CPI) and inflation rates in different parts of the world, companies can ensure that their employees receive adequate salaries and benefits. 

Global pricing and exchange rates

Having updated and correct data related to local salary standards is critical when dealing with salaries or bonuses in foreign currency. This helps companies approach changes in exchange rates more efficiently. 

Taking all these factors into consideration – leveraging web scraping tools and data extraction software – helps companies make sure they're making the best decisions for their employees, and as a result, sustain a healthy (global) organization.

Summary

Why the sudden acceleration in demand for digital transformation?

As a response to COVID-19, consumers and businesses completely changed the way they behaved overnight. A majority of organizations were forced to immediately realign their business models. Many launched digital initiatives with limited time and resources, and with infrastructures unprepared to deal with the effects of the pandemic. 

This sudden change made way for a new economy in which a digital transformation strategy was necessary for organizations to remain relevant – a strategy that is now at the center of operational efficiency. 

And how does web scraping work within the world of digital transformation?

Web scraping can be a powerful tool to help DX vendors, as well as organizations of almost any industry, improve their digital transformation strategy – through the use of web data. It enables businesses to have quick access to data and insights needed in order to make better decisions and remain competitive. 

Digital transformation has accelerated from being just a buzzword to a business reality that has reshaped entire industries. It's an invaluable asset to any modern enterprise – and together with web scraping – organizations can leverage data extraction to their advantage.

The business value of adopting web scraping in a digital transformation strategy

  1. Automate processes

Web scraping software automates repetitive tasks. This saves time and resources and improves efficiency.

Automate data extraction at scale and instantly access web data to get quality data back in a structured format.

  2. Enhance customer experience, generate leads and boost sales

Improve your customer experience by providing real-time information and personalized content. Monitor behavior and preferences so you target your marketing efforts more efficiently.

Extracted data provides insight to improve your products, services, and marketing efforts. Understand what customers want and need, target new markets, and develop more effective sales strategies.

  3. Improve decision making

Gather data from multiple sources for more informed decision making. You can use web scraping to monitor trends, track competitors, and identify new business opportunities. 

Web scraping and data extraction give your business a competitive edge and enable you to stand out from the competition. 

With over 12 years of experience in data extraction projects, we know what data suits your business needs. 

Zyte delivers with speed, accuracy, and reliability, no matter what format you need it in. 

Talk to our experts today to see how web data scraping can help with your digital transformation strategy.

Zyte vs import.io: Which is the best alternative?
https://www.zyte.com/blog/importio-alternative/
Mon, 28 Nov 2022 14:41:59 +0000

Do you need e-commerce web data for your site or project, and are you looking for an import.io alternative?

Or maybe you’re unsure of how an ecommerce web scraper crawls websites to get web data.  

Most professionals in the e-commerce industry have probably already heard of web scraping and the benefits of extracting e-commerce web data. However, many don't know where to start.

If you’re reading this, you must be curious about what Zyte and import.io have to offer when it comes to e-commerce web data extraction – or you are just looking for an import.io web scraping alternative.

Zyte and import.io are two of the most popular e-commerce data extraction providers in the market. 

Both are different platforms with features and capabilities that help you get web data, each in their own unique way. Whether you want an import.io alternative or not – the best option for you depends on many factors, such as the size of your business and the type and amount of data you need to extract.

Don’t worry, this article will walk you through all the details. 

We’ll discuss individual features, pros & cons – and how e-commerce web data is obtained – so you can make your own decision on your search for an import.io alternative.

Read on to find out which one is right for you. 

Why e-commerce web data is important

Running an e-commerce business can be very time-consuming and challenging, we get that. And that's probably why you are still reading this and figuring out which is the best import.io alternative.

You may want to start by asking yourself "how".

Web scraping is important to simplify your overly complex workflow or business operations.

Leveraging web scraping tools and web data extraction allows you to gather data of the highest quality from all sorts of websites. Knowing how to scrape website data allows you to meet your specific needs, manage your business more efficiently and, ultimately, understand what to look for when analyzing data. 

E-commerce web data can be used, for example, to optimize pricing, better understand your competition, and position your company as the primary source for the products you sell. It helps you make better decisions about which products to sell and at what price, how to increase client satisfaction, and how to be more efficient.

Beware though, it’s not as simple as it seems. 

There are limitations and problems that arise when web scraping at scale, which is why companies with vetted industry experts tend to be the go-to option when scaling e-commerce web data extraction.

E-commerce web data use cases

Product sales volume is often the common success metric that e-commerce sites measure against. 

Hence the demand for e-commerce data, which helps businesses identify process improvement ideas.

Examples include:

  • Price intelligence
  • Competitor intelligence
  • Market analysis
  • Vendor management
  • Compliance
  • Improve seller experience 
  • Remove internal data barriers. 

While you're trying to decide on the best import.io alternative, keep in mind that it's very important for e-commerce sites and marketplaces to extract e-commerce product data and learn how to become more efficient.

Zyte and e-commerce web data extraction

Zyte is a proven industry leader in building products that facilitate how web data is obtained, with established full-service solutions and patented technology. The company offers customized solutions to meet individual business needs, using intuitive APIs that accelerate data acquisition while minimizing time and budget impacts for its customers.

For example, you must have landed here after browsing sites, searching for an import.io alternative or other similar options.

Well, customized e-commerce product data extraction and scraping e-commerce websites are among the company’s specialties. Zyte can also help you through web scraping tools, automatic data extraction APIs, proxy management, and enterprise solutions.

Zyte has built the best web data extraction offering to serve over 2,000 companies and 1 million developers worldwide with reliable data that creates valuable insights, so they can make smarter business decisions.

Zyte e-commerce web data extraction services

You can get e-commerce product data from any website with Zyte’s end-to-end extraction services.

Zyte has collected e-commerce data for countless marketplace websites and retailers over 12 years of industry experience. It currently provides data on 3 billion products per month, covering over 30 product data fields such as product category, search keywords, URLs, location and more. 

The standard offering already covers most use cases, but Zyte also provides custom solutions to fit your exact e-commerce data extraction needs. 

Zyte facilitates the process of scraping large e-commerce websites and leveraging product data at scale. All of which are necessary for an e-commerce business to gain a competitive advantage.

Web scraping and automated data extraction play a crucial role for e-commerce businesses, and Zyte is the ideal solution for those who need these types of solutions. It’s easy to use, accurate and provides reliable data. 

Additionally, its user-friendly interface is easy for just about anyone to understand.

For those of you familiar with import.io, you may already be noticing some similarities, and even potential advantages of Zyte's services over import.io's alternative offerings.

Let's keep going to look into what import.io has to offer when it comes to e-commerce web data extraction.

import.io and e-commerce web data extraction

Import.io is a web data provider that delivers e-commerce data at scale through the conversion of semi-structured information in web pages into structured data. 

Having invested years of development into creating technology and capabilities, Import.io provides web data to help businesses gain insights to understand their customers.

These solutions help build dynamic pricing to ensure e-commerce retailers receive correct product pricing. 

Furthermore, the company is capable of delivering data at enterprise scale. 

import.io e-commerce web data extraction services

The e-commerce data extraction offerings by import.io help build insights for content integrity and brand protection, price tracking, customer sentiment, product ratings analysis, and retailer stock availability, among other features. 

This ensures that the products of an e-commerce business are visible to the right customers, that retailers present them in the right way, and that customer product reviews are monitored. 

The enterprise offering is capable of collecting, quality-assuring and delivering data so organizations can improve business decisions through e-commerce data. This is achieved by offering real-time data retrieval, streaming APIs, and integration with many common programming languages and data analysis tools. 

As a result, web page content is identified and transformed into a structured format in a very efficient way. This is a strong reason why its customers may never have considered an import.io alternative before.

Comparative analysis

Both Zyte and import.io offer e-commerce web data extraction services – so how do you decide?  

To help with your final decision for an import.io alternative, we've put together a comparative analysis of both platforms.

Overall, Zyte offers a more comprehensive suite of features, including web scraping, data mining, automation capabilities, and e-commerce data extraction.

Zyte's web data extraction helps e-commerce businesses work more efficiently with the technical capability to extract any website data. A vast array of products deliver end-to-end solutions, which are crucial to boost web content extraction and are considered the best way to obtain reliable data. 

Although import.io is known for specializing in e-commerce data extraction, it lacks the full service solutions offered by Zyte. 

Import.io's price point and expertise make it a good option for small businesses looking for e-commerce data extraction. However, its feature set is more limited than Zyte's, which not only provides e-commerce data extraction but also more advanced features. 

The following charts should give you a better understanding for your import.io alternative research. These are based on information from each website, as well as Zyte and import.io reviews online.

Zyte Pricing

  • Starting price: $25 per month
  • Free trial: Available
  • Splash Headless browser: starting at $25/month
  • Smart Proxy Manager: starting at $29/month
  • Data Extraction Services: starting at $450/month
  • Smart Browser: starting at $100/month
  • Automatic Extraction: starting at $60 per month
  • Residential proxies: starting at $300/month

import.io Pricing

  • Starting price: $99 per month
  • Free trial: Available
  • Essential Services: $99/month or $1999/year. 
  • Premium & Enterprise: Available upon request.

Clearly both platforms have their pros and cons – making it difficult to decide which one best suits your needs, and whether Zyte is the right import.io alternative.

Still feeling uncertain about which one you should choose? 

Summary

So which one is better than the other... or perhaps it's best to simply shed light on the best import.io alternative.

If you need intuitive e-commerce web data extraction services capable of obtaining product data from any site at scale – along with plenty of additional features – then Zyte is the one for you. And that's regardless of whether you're reading this looking for an import.io alternative or not.

On the flipside – import.io is also a good choice for e-commerce web data – but its features may not necessarily fit the individual needs of businesses, organizations and developers. 

If you are on a tight budget with regards to e-commerce web data – and do not need as robust capabilities – import.io may be the choice for you. It provides added value through app integrations, open API functionality, and unlimited data storage on their plans. 

However, looking at this from a broader perspective – if you need reliable and easy-to-use data extraction software for your e-commerce business – Zyte is the perfect fit. 

Zyte can easily extract data from any website or online store, and the results are always accurate and up-to-date. 

Why consider Zyte as your import.io alternative

Zyte is the market leader in product data extraction services, so you can extract data from any e-commerce website – a key factor when deciding which import.io alternative to choose.

We provide data on 3 billion products per month. 

You can also choose from a vast array of solutions, such as an Automatic Data Extraction API and other tools suitable for businesses of all sizes – including Enterprise Solutions.

Lean & intuitive setup

import.io alternative: extract product data with Zyte

Powerful solutions for ecommerce web data extraction

Check this video to see how automatic data extraction helps you get e-commerce web data. Yet another reason to consider Zyte as a solid import.io alternative.

Contact us here with your needs for e-commerce web data extraction and our team of experts will take care of it.

Web scraping e-commerce: 5 ways to help you succeed
https://www.zyte.com/blog/web-scraping-e-commerce/
Wed, 23 Nov 2022 21:49:44 +0000

If you ask people whether they browse e-commerce sites and marketplaces every single day, the majority will most likely answer yes. 

However, few know that e-commerce websites fail 90% of the time…did you? A pretty shocking figure considering the popularity of these sites. 

If you’re an e-commerce owner and want to avoid being just another failure statistic, then it is crucial for you to understand how to leverage open web data by web scraping e-commerce sites and extracting e-commerce web data.

An estimated 2.14 billion people made online purchases in 2021, and e-commerce sales surpassed $5.2 trillion worldwide. With such staggering figures, it's no wonder that e-commerce and online retail is a highly competitive and contested landscape. If you don't pay close attention to your business and marketing strategy, product content and pricing, your ship can sink quickly.

In this article, we’ll look at why it’s important to scrape e-commerce sites and use open web data to your advantage. 

Most importantly, we'll show 5 ways data extraction and e-commerce web scraping can help you gain key insights to improve your products and services and move your business forward.

What is e-commerce web scraping?

Web scraping e-commerce sites is the automated process of collecting and extracting structured web data from e-commerce web pages. This is performed with the help of tools and applications known as web scrapers and data extraction software.  
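To make that concrete, here is a minimal, hedged sketch of a scraper pulling a couple of fields from a single product page into a structured record. The URL and CSS selectors are placeholders rather than a real site's markup, and real projects would add error handling, politeness delays and scaling logic.

```python
# pip install requests parsel
import requests
from parsel import Selector

# Hypothetical product page and selectors: adjust to the real site's markup.
url = "https://example.com/product/123"
html = requests.get(url, timeout=10).text
sel = Selector(text=html)

product = {
    "name": sel.css("h1.product-title::text").get(),
    "price": sel.css("span.price::text").get(),
    "availability": sel.css("div.stock::text").get(),
    "url": url,
}
print(product)  # structured record ready to be stored or analyzed
```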

It essentially helps you get the right data and information, which leads to better-informed decisions. As a result, web scraping can help you sell more products to your potential customers and maintain communication and interest with them through targeted outreach.

Marketplaces and retailers use e-commerce scraping to gather data on pricing, competitors, and market research from specific web pages when they scrape an e-commerce website.

E-commerce data examples: 

  • Product details (such as name, price, description, features, stock availability, etc.) 
  • Product reviews (ratings, date posted, reviewer, etc.) 
  • Product lists (category on a search page with an overview of name, price, delivery information, etc.) 
  • Product seller details 

E-commerce web scraping use cases

When you have insights from an accurate, reliable, and up-to-date e-commerce data feed, you’ll have better visibility into pricing intelligence and competitor intelligence. 

You may still be a bit confused and wonder: what is an example of e-commerce web scraping and data extraction? 

Let’s look into three main examples. 

Use case examples in e-commerce web scraping

Web scraping e-commerce websites and extracting data allows you to answer questions such as what’s trending on the market, customer preferences, customer sentiment around your brand and products, and so on. 

Here are some common use cases of e-commerce data extraction:

Pricing Intelligence

Studies show that price is a major factor influencing a buyer’s purchasing decision. It is therefore crucial to have accurate, up-to-date data to analyze and compare prices from major e-commerce sites – so you can price your own products competitively, boost sales, and ensure you’re not selling at a loss. 

Competitor Intelligence and Brand Monitoring

An integral part of business is knowing how your competitors are performing. For example, data on how competitors are pricing their products will be useful for you to price your own. 

This is a perspective that web scraping e-commerce sites can provide, through brand monitoring and price intelligence with web data extraction.

Similarly, when web scraping e-commerce sites you can also monitor competitor inventory levels, or keep an eye on any promotions or deals they might be offering. 

Market Research and Analysis

To market and sell your products effectively means to understand what your customers want. Once again, it is imperative to understand how the process of web scraping e-commerce works together with market research analysis.

Outsmart the competition

We’ve established that extracting e-commerce data and web scraping e-commerce websites can be beneficial for businesses in a variety of ways. 

Let’s dig deeper into ways that help you get ahead. 

  1. Supercharge your marketing strategies

Customers today want personalization - a whopping 54% of consumers allegedly consider ending their loyalty relationships if the company doesn’t provide tailor-made, relevant content and offers. 

As such, your marketing campaigns will need to be on point – not only to establish your brand and capture prospective customers, but also to maintain the loyalty of your existing customer base. 

The e-commerce data you scrape, both historical and current, can be used to form targeted marketing campaigns that match what your audience prefers, determine which channels work best for your campaigns, and uncover sales opportunities.

  2. Increase brand visibility and conversions 

According to a survey by SerpWatch, 67% of all clicks on search engine result pages go to the top five results. 

You can use web scraping to organize data, glean insights, and formulate a dynamic SEO strategy, using keywords that can help you boost your rankings in search engines. This is imperative to stay ahead of competitors and increase your brand’s visibility.

Additionally, by gathering data on product features and review ratings, you can optimize and enrich your listings for SEO, make them more informative, and lead potential customers towards the conversion journey. 

  3. Improve your products, services and pricing

Aside from product data, web scraping e-commerce customer reviews from third-party sites can provide an extensive look into how your customers perceive your product, how it stands compared to the competition, and so on. 

Retailers and manufacturers can use this data to fine-tune their products, push preferred products on the market, enhance the customer experience, and more. 

  4. Manage operations efficiently 

There are several ways that web scraping e-commerce sites can help companies manage their operations smoothly. 

For example, it can be used to gather data on your supply chain, including information on supplier prices, lead times, and product availability. You can use this data to ensure you always have products your customers need in stock. 

Additionally, web scraping e-commerce sites can help you to track competitor pricing and identify potential sources of supply chain disruptions, so you can avoid costly delays or disruptions to your business. 

Another way is through product enrichment. 

  5. Automate e-commerce data extraction

Traditionally, companies would hire an employee to extract product details from a manufacturer’s site manually. 

Nowadays, this process is automated, minimizing human error and freeing up your resources for more important tasks. 

Here is an example of what it looks like if you use the Zyte Automatic Data Extraction API:

Zyte Automatic Extraction API
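To make this concrete in code, here is a minimal, hedged sketch of such a call from Python. The endpoint URL, payload fields and response shape are assumptions based on the AutoExtract-style API; check the current Zyte documentation for the exact details before relying on it.

```python
import requests

# Hypothetical AutoExtract-style endpoint and API key; verify against the
# current Zyte docs for the exact URL, authentication and payload format.
AUTOEXTRACT_URL = "https://autoextract.scrapinghub.com/v1/extract"
API_KEY = "YOUR_API_KEY"

query = [{
    "url": "https://example.com/product/123",  # hypothetical product page
    "pageType": "product",                     # ask for structured product data
}]

response = requests.post(AUTOEXTRACT_URL, json=query, auth=(API_KEY, ""))
response.raise_for_status()

for result in response.json():
    product = result.get("product", {})
    print(product.get("name"), product.get("price"), product.get("availability"))
```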

Get a reliable data feed

Now that you don’t need anyone to explain what web scraping is and you understand how web scraping e-commerce data can benefit your business, you’ll want to implement it. 

Depending on your needs, the use of data extraction software to get web data for e-commerce projects can be pretty straightforward.

However, in many scenarios getting a reliable data feed is challenging, especially since many sites deliberately make it difficult to scrape data – whether it’s by employing sophisticated anti-bot or banning measures. 

Yes that’s right, you can get banned even if you’re web scraping legitimately. And there’s also volume to contend with when web scraping e-commerce.

Overcoming challenges to obtain e-commerce data

E-commerce sites often deal with massive volumes of data, and as you scale, it would be impossible to manually scrape 20,000 products, every hour, every day. 

A task not easy to accomplish with many factors to consider – from managing proxies to avoiding bans and more. 

To overcome these challenges, you’ll need flawless web data extraction expertise. 

A straightforward way to fully leverage the power of this data is through automatic data extraction and a robust infrastructure when web scraping e-commerce websites. 

Here is a brief video on how Zyte’s automatic data extraction helps you get e-commerce data. 

With that said, it’s crucial to decide whether it’s worth it for you to do it in-house in the long run, or to outsource to a web scraping provider.

Conclusion

Web scraping e-commerce and leveraging e-commerce data is not just a ‘nice-to-have’ for retailers and businesses.

It’s necessary to give your business a competitive edge and enable you to stand out from the competition – especially if you look to automate data extraction.

If you don’t have that expertise in-house, then your best bet is to outsource to a data extraction expert, such as Zyte, to take care of your needs when it comes to web scraping e-commerce.

With over 12 years of experience in data extraction projects in various fields, including e-commerce and retail, we know what data is best for you to reach your business goals.

We deliver with speed, accuracy, and reliability, no matter what format you need it in. 

Talk to our experts today to see how we can help. 

The Scraper’s System Part 2: Explorer’s Compass to analyze websites
https://www.zyte.com/blog/scrapers-system-compass/
Mon, 21 Nov 2022 20:32:03 +0000

Welcome to part two of "The Scraper’s System" series. 

If you haven’t read the introductory part yet, you can do so here.

In the first part, we discussed a template to define the clear purpose of your web scraping system that can help you design your crawlers better and prepare you for the uncertainty involved in a large scale web scraping project. 

Step 1 clarifies the three W’s:

Why, What, Where – of a large scale Web Scraping project – which will be the guiding North-Star throughout the development process. 

Step 2 of the framework helps you answer: “How do we extract the data?” 

I also like to call this phase The Explorer’s Compass: you must understand this critical navigational tool and some best practices before you set out to sail through the target websites. 

At Zyte, developers spend days analyzing target websites using four parameters that will help design the high-level crawl logic and choose the most suitable technology stack for your project. 

  1. API availability
  2. Dynamic content
  3. Antibot mechanisms
  4. Pages and pagination

API availability

Web Scraping vs API – Which is the better option?

My answer to this question… It's always subjective and depends primarily on your business goals.

If you need to collect data from the same website all the time, an API is a suitable choice. 

A good idea is to always check for API availability and note all those data fields that can be extracted from the website APIs, rather than jumping straight away to scraping them. 

Benefits of checking API availability:

  1. Respect the websites by not burdening them. 
  2. Save a lot of development time. 
  3. Avoid blocks/bans.
  4. Data may be available to developers for free.

Exceptions:

Keep in mind that this may not hold true in certain scenarios, which involve:

  1. Collecting real-time data.
  2. Websites with heavy Anti-ban protection.
  3. Data nested in JavaScript, which requires browser emulation. 

For example, if the target websites are e-commerce aggregators, then API would make sense. If the target websites are flight aggregators, then maybe not. 

When deciding whether to choose an API over web scraping or vice versa, it’s also important to check how dynamic the target websites are and to look at each website's history, in addition to trends in that specific industry, to determine the likelihood of disruptive website changes that could break the crawlers. 

Once again, clarify the business goal and make a list of all data-fields that can be extracted using the APIs provided by the target websites. 
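As a quick illustration, a hedged sketch of what this check can look like in practice: open the browser's network tab while interacting with the target site, note any JSON/XHR endpoints, and confirm which of your required fields they already expose. The endpoint and field names below are hypothetical.

```python
import requests

# Hypothetical JSON endpoint spotted in the browser's network tab while
# paging through a category on the target site; not a real API.
API_URL = "https://example.com/api/v2/products"

resp = requests.get(API_URL, params={"category": "laptops", "page": 1}, timeout=10)
resp.raise_for_status()
payload = resp.json()

# Compare the fields the API already exposes against the fields the project needs,
# so only the remaining fields have to be scraped from HTML.
available_fields = set(payload["items"][0].keys()) if payload.get("items") else set()
required_fields = {"name", "price", "currency", "availability", "sku"}
print("covered by the API:", required_fields & available_fields)
print("still needs scraping:", required_fields - available_fields)
```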

Dynamic Content

Let me give you two examples as a quick introduction to interactive elements on websites. 

Once you see this, you will notice little red markers lit up in your head every time you interact with any favorite web or mobile applications.

If you feel like eating your favorite cake… open Google Maps and type in “Bakery near me”. 

Did you notice those little red markers appear?

Open your favorite e-commerce website, check the exact availability of any product you want to buy –  select the delivery location and enter the pin code. 

Did you notice this entire interaction happened without loading the entire page? 

After this, you cannot unsee such interactions across many applications over the web. 

The use of JavaScript can vary from simple form events to Single Page Applications (SPAs), where data is displayed to the user on request. As a result, for many web pages the content that is displayed in our web browser is not available in the original HTML. 

Therefore, the regular approach of scraping data will fail when it comes to scraping dynamic websites. 

Two alternative approaches:

  • Reverse engineering JavaScript - we can reverse engineer a website's behavior and replicate it in our code!
  • Rendering JavaScript using browser automation/simulation tools such as headless browser libraries (Puppeteer, Playwright, Selenium), or using Zyte API - check `actions` (a minimal sketch follows below).
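To make the second approach concrete, here is a minimal, hedged sketch using Playwright's sync API to render a JavaScript-heavy page before extracting data. The URL and selector are placeholders, not a real site's markup.

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # Hypothetical product page whose price is injected by JavaScript.
    page.goto("https://example.com/product/123", wait_until="networkidle")
    # Wait for the dynamically rendered element before reading it.
    page.wait_for_selector(".price")
    price = page.inner_text(".price")
    print("rendered price:", price)
    browser.close()
```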

To do – make a list of all data fields that require reverse-engineering JavaScript or browser simulation tools.

Antibot Mechanisms

This part of the analysis is boolean: either your requests go through, or you start experiencing HTTP 429s. 

It’s hard to pinpoint what goes on behind the scenes to block your requests. 

Read this blog post, where Akshay Philar covers the most common measures used by websites and shows how to overcome them with Zyte Data API Smart Browser.

To summarize, these are some of the defensive measures that get you blocked:

  • IP rate limitation: a lot of crawling happens from datacenter IP addresses. If the website owner recognizes that there are a lot of non-human requests coming from this set of IPs, they can just block all the requests coming from that specific datacenter, so the scrapers will not be able to access the site. To overcome this, you need to use other datacenter proxies or residential proxies, or just use a service that handles proxy management.
  • Detailed browser fingerprinting: a combination of browser properties/attributes derived from the JavaScript API and used in concert with each other to detect inconsistencies. It contains information about the OS, devices, accelerometer, WebGL, canvas, etc.
  • Captchas and other ‘humanity’ tests: back in the day, captchas used HIP (Human Interactive Proof) with the premise that humans are better at solving visual puzzles than machines, and machine learning algorithms weren’t developed enough to solve them. As machine learning technologies evolved, a machine can now solve that type of captcha easily, so more sophisticated image-based tests were introduced, which gave machines a bigger challenge.
  • TCP/IP fingerprinting, geofencing, and IP blocking.
  • Behavioral patterns: human observation methods, such as analysis of mouse movements and detailed event logging, used to differentiate automated crawlers from the behavior patterns of real-life visitors. 
  • Request patterns: the amount and frequency of requests you make. The more frequent your requests (from the same IP) are, the more chance your scraper will be recognized.

In this process, try to figure out the level of protection used by the target website. This helps answer whether a rotating proxy solution will be enough, or whether you need an advanced anti-ban solution like Zyte API that takes care of bans of all types. 
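A simple way to probe the level of protection is to send a handful of polite requests and watch the status codes. Below is a hedged sketch: the proxy URL and target page are placeholders, and real anti-bot systems involve far more than a 429 check.

```python
import time
import requests

TARGET = "https://example.com/category?page=1"  # hypothetical target page
PROXIES = {"https": "http://user:pass@proxy.example.com:8011"}  # placeholder proxy

def fetch_with_backoff(url, max_retries=3):
    """Retry on 403/429 with exponential backoff through a (placeholder) proxy."""
    for attempt in range(max_retries):
        resp = requests.get(url, proxies=PROXIES, timeout=15)
        if resp.status_code == 200:
            return resp
        if resp.status_code in (403, 429):
            wait = 2 ** attempt  # back off: 1s, 2s, 4s ...
            print(f"blocked with {resp.status_code}, retrying in {wait}s")
            time.sleep(wait)
            continue
        resp.raise_for_status()
    raise RuntimeError("still blocked after retries; consider an advanced anti-ban solution")

print(fetch_with_backoff(TARGET).status_code)
```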

Pages and Pagination

Lastly, consider the number of steps required to extract the data. In some cases, all the target data isn’t available on a single page; instead, the crawler has to make multiple requests to obtain it. 

In these cases, estimate the number of requests that will need to be made, which in turn determines the amount of infrastructure the project will require.

Also consider the complexity of iterating through records: certain sites have more complex pagination (infinite scrolling pages, etc.) or formatting structures that can require a headless browser or complex crawl logic. This analysis helps answer what type of pagination and crawl logic is required to access all the available records. 
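For simple "next page" pagination, the crawl logic can be as small as the hedged Scrapy sketch below. The start URL and CSS selectors are placeholders; infinite-scroll sites would instead need browser rendering or reverse-engineered XHR calls.

```python
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    # Hypothetical category listing; replace with the real target.
    start_urls = ["https://example.com/category/laptops"]

    def parse(self, response):
        # Yield one record per product card (placeholder selectors).
        for card in response.css("div.product-card"):
            yield {
                "name": card.css("h2.title::text").get(),
                "price": card.css("span.price::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        # Follow the "next page" link until pagination runs out.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```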

Conclusion

To conclude, try the following exercise first. 

Answer the questions below to ensure you fully understood "The Explorer’s Compass" and are ready to move forward with The Scraper's System.
