A Practical Guide to JSON Parsing with Python
JSON (JavaScript Object Notation) is a text-based data format used for exchanging and storing data between web applications. It simplifies the data transmission process between different programming languages and platforms.
The JSON standard has become increasingly popular in recent years. It’s a simple and flexible way of representing data that both humans and machines can read and parse easily. A JSON object consists of key-value pairs enclosed in curly braces; each key is separated from its value by a colon, and the pairs are separated by commas.
Python provides various tools, libraries and methods for parsing and manipulating JSON data, making it a popular choice for data analysts, web developers, and data scientists.
In this guide, we’ll explore the syntax and data types of JSON, as well as the Python libraries and methods used for parsing JSON data, including more advanced options like JMESPath and ChompJS, which are very useful for web scraping data.
Reading JSON
One of the most common tasks when working with JSON data is reading its contents. To read JSON from files, APIs, and web applications, you can use Python's built-in json module (JSON encoder and decoder).
The json module provides two functions, loads and load, which parse JSON strings and JSON files, respectively, and convert the JSON into Python objects such as lists and dictionaries. The following example converts a JSON string to a Python object with the loads function.
import json

json_input = '{ "make": "Tesla", "model": "Model 3", "year": 2022, "color": "Red" }'
json_data = json.loads(json_input)

print(json_data)  # Output: {'make': 'Tesla', 'model': 'Model 3', 'year': 2022, 'color': 'Red'}
Next is an example that uses the load function. Given a JSON file:
{ "make": "Tesla", "model": "Model 3", "year": 2022, "color": "Red" }
We open the file with the with open() context manager and pass the file object to json.load(), which reads the contents of the JSON file into a Python dictionary.
import json

with open('data.json') as f:
    json_data = json.load(f)

print(json_data)  # Output: {'make': 'Tesla', 'model': 'Model 3', 'year': 2022, 'color': 'Red'}
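Both loads and load raise json.JSONDecodeError when the input is not valid JSON. The sketch below shows one way to handle that; the helper name parse_json_or_none and the sample strings are made up for illustration and are not part of the json module.

```python
import json

def parse_json_or_none(text):
    """Return the parsed object, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno report where the parser gave up
        print(f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None

good = parse_json_or_none('{"make": "Tesla", "year": 2022}')
bad = parse_json_or_none('{"make": "Tesla", "year": }')  # value is missing

print(good)
print(bad)
```

Returning None is only one possible policy; depending on the application, it may be better to let the exception propagate.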
Parse JSON data
After loading JSON data into Python, you can access specific data elements using the keys provided in the JSON structure. In JSON, data is typically stored in either an array or an object. To access data within a JSON array, you use an index; to access data within an object, you look the value up by its key.
import json

json_string = '{"numbers": [1, 2, 3], "car": {"model": "Model X", "year": 2022}}'
json_data = json.loads(json_string)

# Accessing JSON array elements using array indexing
print(json_data['numbers'][0])  # Output: 1

# Accessing JSON elements using keys
print(json_data['car']['model'])  # Output: Model X
In the example above, there is an object 'car' inside the JSON structure that contains two mappings ('model' and 'year'). This is an example of a nested JSON structure where an object is contained within another object. Accessing elements within nested JSON structures requires using multiple keys or indices to traverse through the structure.
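When the shape of a nested structure is not known in advance, a small recursive helper can list every value together with the path of keys and indices that leads to it. This is a minimal sketch; the function name iter_leaves and the path notation are assumptions chosen for illustration.

```python
import json

def iter_leaves(node, path=""):
    """Yield (path, value) pairs for every non-container value in parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_leaves(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{path}[{index}]")
    else:
        yield path, node

json_data = json.loads('{"numbers": [1, 2, 3], "car": {"model": "Model X", "year": 2022}}')
leaves = list(iter_leaves(json_data))

for path, value in leaves:
    print(path, "=", value)
# numbers[0] = 1
# numbers[1] = 2
# numbers[2] = 3
# car.model = Model X
# car.year = 2022
```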
JSON and Python objects Interchangeability
JSON is a string format used for data interchange whose syntax resembles Python's dictionary literal syntax. However, JSON is not the same as a Python dictionary. When JSON data is loaded into Python, it is converted into a Python object, typically a dictionary or list, which can be manipulated with the standard methods of those objects. When you are ready to save the data, you need to convert it back to JSON format with the json.dumps() function.
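The difference is easy to see in a round trip. In the sketch below (the sample document is made up), JSON's true and null become Python's True and None on loading and are converted back on dumping.

```python
import json

json_text = '{"electric": true, "owner": null, "seats": 5, "tags": ["sedan", "awd"]}'

data = json.loads(json_text)
print(data)
# {'electric': True, 'owner': None, 'seats': 5, 'tags': ['sedan', 'awd']}

# Converting back produces JSON literals again
back = json.dumps(data)
print(back)
# {"electric": true, "owner": null, "seats": 5, "tags": ["sedan", "awd"]}

# JSON has no tuple type, so a tuple is written as an array
print(json.dumps((1, 2)))  # [1, 2]
```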
Modifying JSON data
Working with JSON in Python also involves modifying the data by adding, updating or deleting elements. In this post we will focus on the basics, so we will be using the json built-in package, as it provides all basic functions we require to accomplish these tasks.
Adding an element
To add an element, you can modify the corresponding mapping in the JSON object using standard dictionary syntax. For example:
import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

json_data['color'] = 'red'

print(json_data)  # Output: {'model': 'Model X', 'year': 2022, 'color': 'red'}
Updating an element
Updating an element follows the same logic as the previous snippet, but instead of creating a new key, it will be replacing the value of an existing key.
import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

json_data['year'] = 2023

print(json_data)  # Output: {'model': 'Model X', 'year': 2023}
Another way to add or update values in a Python dictionary is the update() method. It adds or updates elements in the dictionary using the values from another dictionary or from an iterable of key-value pairs.
import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

more_json_string = '{"model": "Model S", "color": "Red"}'
more_json_data = json.loads(more_json_string)

json_data.update(more_json_data)

print(json_data)  # Output: {'model': 'Model S', 'year': 2022, 'color': 'Red'}
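update() also accepts an iterable of key-value pairs, as well as keyword arguments. A short sketch, where the price value is invented purely for illustration:

```python
import json

json_data = json.loads('{"model": "Model X", "year": 2022}')

# An iterable of (key, value) pairs: updates 'year', adds 'color'
json_data.update([("year", 2023), ("color", "Red")])

# Keyword arguments work too when the keys are valid Python identifiers
json_data.update(price=90000)

print(json_data)
# {'model': 'Model X', 'year': 2023, 'color': 'Red', 'price': 90000}
```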
Deleting an element
To remove an element from a JSON object, you can use the del keyword to delete the corresponding key and its value.
import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

del json_data['year']
Another approach to removing an element from a dictionary with JSON data is to use the pop method, which allows you to retrieve the value and use it at the same time it is removed.
import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

year = json_data.pop('year')

print(year)  # Output: 2022
print(json_data)  # Output: {'model': 'Model X'}
Beware: trying to remove an element with del when the key is not present raises a KeyError exception. The pop method also raises KeyError for a missing key unless you pass a default value as its second argument, for example json_data.pop('year', None), in which case it returns that default. If you are not sure whether the key is present, you can either check that the key exists before using del or wrap the del operation in a try/except block.
import json

json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

if 'year' in json_data:
    del json_data['year']
else:
    print('Key not found')

# or wrapping the del operation in a try/except
json_string = '{"model": "Model X", "year": 2022}'
json_data = json.loads(json_string)

try:
    del json_data['year']
except KeyError:
    print('Key not found')
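A third option is to give pop a default value. Called with only the key, dict.pop raises KeyError when the key is missing; with a second argument it returns that default instead. A minimal sketch:

```python
import json

json_data = json.loads('{"model": "Model X", "year": 2022}')

# 'color' is not present, so the default (None) is returned and nothing is raised
color = json_data.pop('color', None)

# 'year' is present, so its value is returned and the key is removed
year = json_data.pop('year', None)

print(color)      # None
print(year)       # 2022
print(json_data)  # {'model': 'Model X'}
```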
Python Error Handling: Check or Ask?
When it comes to error handling in Python, there are two common styles: "look before you leap" (LBYL) and "easier to ask forgiveness than permission" (EAFP). The former checks the program state before executing each operation, while the latter attempts the operation and catches any exception if it fails.
The "ask for forgiveness" approach is more commonly used in Python and treats errors as a regular part of program flow. It often makes the code easier to read and write, because the main path is not cluttered with checks. It can be slower than "look before you leap" when exceptions are raised frequently, but in most programs the performance difference is insignificant.
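The two styles can be compared on a nested lookup. This is a sketch under assumed names: get_color_lbyl and get_color_eafp are hypothetical helpers, and the sample data deliberately has no color key.

```python
import json

json_data = json.loads('{"car": {"model": "Model X", "year": 2022}}')

def get_color_lbyl(data):
    """Check first: verify every key exists before touching it."""
    if "car" in data and "color" in data["car"]:
        return data["car"]["color"]
    return None

def get_color_eafp(data):
    """Ask for forgiveness: just try the lookup and handle the failure."""
    try:
        return data["car"]["color"]
    except (KeyError, TypeError):
        # KeyError: a key is missing; TypeError: "car" is not a mapping
        return None

print(get_color_lbyl(json_data))  # None
print(get_color_eafp(json_data))  # None
```

The checking version grows with every level of nesting, while the try/except version stays the same length, which is one reason the second style is common for deeply nested JSON.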
Saving JSON
After modifying data loaded from a JSON file or JSON string, you may want to save it back to a JSON file or export it as a JSON string. The json.dump() function writes a Python object to a file as JSON, while json.dumps() returns a JSON string representation of the object.
Saving JSON to a file using json.dump() and the with open() context manager in write mode ("w"):
import json

data = {"model": "Model X", "year": 2022}

# Saves the dictionary named data as a JSON object to the file data.json
with open("data.json", "w") as f:
    json.dump(data, f)
Converting a Python object to a JSON string using json.dumps():
import json

data = {"model": "Model X", "year": 2022}

# Converts the data dictionary to a JSON string representation
json_string = json.dumps(data)

print(json_string)  # Output: {"model": "Model X", "year": 2022}
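Both json.dump() and json.dumps() accept optional formatting parameters. The sketch below uses indent and sort_keys for human-readable output and separators for compact output; the colors list is sample data added for illustration.

```python
import json

data = {"model": "Model X", "year": 2022, "colors": ["Red", "Blue"]}

# Human-readable: two-space indentation, keys in alphabetical order
pretty = json.dumps(data, indent=2, sort_keys=True)
print(pretty)
# {
#   "colors": [
#     "Red",
#     "Blue"
#   ],
#   "model": "Model X",
#   "year": 2022
# }

# Compact: no whitespace after the item and key separators
compact = json.dumps(data, separators=(",", ":"))
print(compact)
# {"model":"Model X","year":2022,"colors":["Red","Blue"]}
```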
Advanced JSON Parsing Techniques
When traversing JSON data in Python, depending on the complexity of the object, there are more advanced libraries to help you get to the data with less code.
JMESPath
JMESPath is a query language designed to work with JSON data. It allows you to extract specific parts of a JSON structure based on a search query. JMESPath is well-suited for advanced JSON parsing tasks because it can handle complex, nested JSON structures with ease. At the same time, it is easy to use at beginner level, making it an accessible tool for anyone working with JSON data.
Here's an example using the jmespath library in Python to extract data:
import json
import jmespath

json_string = '{"numbers": [1, 2, 3], "car": {"model": "Model X", "year": 2022}}'
json_data = json.loads(json_string)

# Accessing nested JSON
name = jmespath.search('car.model', json_data)  # Result: Model X

# Taking the first number from numbers
first_number = jmespath.search('numbers[0]', json_data)  # Result: 1
Those examples only display the basics of what JMESPath can do. JMESPath queries can also filter and transform JSON data. For example, you can use JMESPath to filter a list of objects based on a specific value or to extract specific parts of an object and transform them into a new structure.
Let's say we have a JSON array of car objects, each containing information such as the car's make, model, year and price:
cars = [
    {"make": "Toyota", "model": "Corolla", "year": 2018, "price": 15000},
    {"make": "Honda", "model": "Civic", "year": 2020, "price": 20000},
    {"make": "Ford", "model": "Mustang", "year": 2015, "price": 25000},
    {"make": "Tesla", "model": "Model S", "year": 2021, "price": 50000}
]
We can use JMESPath to filter this list and return only the cars that are within a certain price range, and transform the result into a new structure that only contains the make, model, and year of the car:
import jmespath

result = jmespath.search("""
    [?price <= `25000`].{
        Make: make,
        Model: model,
        Year: year
    }
""", cars)
The output of this code will be:
[
    {'Make': 'Toyota', 'Model': 'Corolla', 'Year': 2018},
    {'Make': 'Honda', 'Model': 'Civic', 'Year': 2020},
    {'Make': 'Ford', 'Model': 'Mustang', 'Year': 2015}
]
Mastering JMESPath takes much of the pain out of JSON parsing with Python. Even complex JSON structures, such as the JSON documents embedded in websites that you often encounter in web scraping, can be handled with JMESPath's extensive features.
JMESPath is not only available for Python, but also for many other programming languages, such as Java and Ruby. To learn more about JMESPath and its features, check out the official website.
ChompJS
Web scraping involves collecting data from websites, and that data is often embedded in the JavaScript objects that initialize the page. The standard library function json.loads() only accepts valid JSON, and not every valid JavaScript object is also valid JSON. For example, all of the following strings are valid JavaScript objects but not valid JSON:
- "{'a': 'b'}" is not valid JSON because it uses the ' character for quoting
- '{a: "b"}' is not valid JSON because the property name is not quoted at all
- '{"a": [1, 2, 3,]}' is not valid JSON because there is an extra "," character at the end of the array
- '{"a": .99}' is not valid JSON because the float value lacks a leading 0
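The strings listed above can be checked against the standard library directly. In this short sketch, json.loads() raises json.JSONDecodeError for each of them; the exact error messages vary between Python versions, so they are printed rather than listed here.

```python
import json

candidates = [
    "{'a': 'b'}",         # single quotes
    '{a: "b"}',           # unquoted property name
    '{"a": [1, 2, 3,]}',  # trailing comma
    '{"a": .99}',         # missing leading zero
]

rejected = []
for text in candidates:
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        rejected.append(text)
        print(f"{text!r} rejected: {err.msg}")

print(len(rejected), "of", len(candidates), "strings rejected")
```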
The chompjs library was designed to work around this limitation; it parses such JavaScript objects into proper Python dictionaries:
import chompjs

chompjs.parse_js_object("{'a': 'b'}")  # Output: {'a': 'b'}
chompjs.parse_js_object('{a: "b"}')  # Output: {'a': 'b'}
chompjs.parse_js_object('{"a": [1, 2, 3,]}')  # Output: {'a': [1, 2, 3]}
chompjs works by parsing the JavaScript object and converting it into a valid Python dictionary. In addition to parsing simple objects, it can also handle objects containing embedded methods by storing their code in a string.
One of the benefits of using chompjs over json.loads is that it can handle a wider range of JavaScript objects. For example, chompjs can handle objects that use single quotes instead of double quotes for property names and values. It can also handle objects that have extra commas at the end of arrays or objects.
Dealing with Custom Python objects
Almost all programming languages support custom objects, which are created using object-oriented programming concepts. However, while the basic principles of object-oriented programming are the same across different programming languages, the syntax, features, and use cases of custom objects can vary depending on the language.
Custom Python objects are typically created using classes, which can encapsulate data and behavior.
One example of a custom Python object is the Car class:
class Car:
    def __init__(self, make, model, year, price):
        self.make = make
        self.model = model
        self.year = year
        self.price = price
To create a new Car object, we can simply call the Car constructor with the appropriate arguments:
car = Car("Toyota", "Camry", 2022, 25000)
If we try to serialize the Car object as-is, we will get a TypeError:
car_json = json.dumps(car)
# TypeError: Object of type Car is not JSON serializable
This error occurs because json.dumps() doesn't know how to serialize our Car object. By default, the json module in Python can only serialize certain types of objects, like strings, numbers, and lists/dictionaries. To serialize our Car object to a JSON string, we need to create a custom encoding class.
Encoding
We can create a custom encoder by inheriting from json.JSONEncoder and overriding its default method, which the encoder calls for any object it cannot serialize on its own. This lets us convert custom Python objects into JSON strings.
import json

class CarEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Car):
            return {"make": obj.make, "model": obj.model, "year": obj.year, "price": obj.price}
        return super().default(obj)
Inside the default method, we check if the object being encoded is an instance of the Car class. If it is, we return a dictionary with the attributes. If it is not an instance of the Car class, we call the default method of the parent class to handle the encoding.
car = Car("Toyota", "Camry", 2022, 25000)
car_json = CarEncoder().encode(car)

print(car_json)  # Output: {"make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}
By using a custom encoding class, we can customize how our objects are serialized to JSON and handle any special cases that may not be covered by the default encoding behavior.
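For simple cases, a lighter alternative to subclassing is the default parameter of json.dumps(), which takes a function that is called for any object the encoder cannot serialize. The sketch below repeats the Car class so that it runs on its own; the function name encode_car is an assumption made for illustration.

```python
import json

class Car:
    def __init__(self, make, model, year, price):
        self.make = make
        self.model = model
        self.year = year
        self.price = price

def encode_car(obj):
    """Called by json.dumps() only for objects it cannot serialize itself."""
    if isinstance(obj, Car):
        return {"make": obj.make, "model": obj.model, "year": obj.year, "price": obj.price}
    # Mirror the encoder's behavior for any other unsupported type
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

car = Car("Toyota", "Camry", 2022, 25000)
car_json = json.dumps(car, default=encode_car)

print(car_json)
# {"make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}
```

A subclass is still the better fit when the same encoding logic is reused in many places or combined with other encoder options.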
Decoding
Just as we can use custom encoding classes to serialize custom objects to JSON, we can also use custom decoding classes to decode JSON strings back into our custom objects.
So far we have only handled encoding. If we decode the JSON string with json.loads(), we get back a dictionary of the values, not a Car object.
car_json = '{"make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}'
car_dict = json.loads(car_json)

print(car_dict)  # Output: {'make': 'Toyota', 'model': 'Camry', 'year': 2022, 'price': 25000}
As you can see, the output is just a dictionary with the attributes of the Car object. If we want to turn this dictionary back into a Car object, we need to create a custom decoder class and pass it to json.loads().
Adding metadata
Metadata here refers to additional information about the data. This can include information about the structure, format, or other descriptive details that help to understand or process the data.
One way to let the decoder know which type of object to create is to add metadata identifying the object type when encoding it. Our CarEncoder currently returns:
if isinstance(obj, Car):
    return {"make": obj.make, "model": obj.model, "year": obj.year, "price": obj.price}
With a __type__ entry added as metadata, it becomes:
if isinstance(obj, Car):
    return {"__type__": "Car", "make": obj.make, "model": obj.model, "year": obj.year, "price": obj.price}
We can use this with a custom decoding class to determine which object to create.
car = Car("Toyota", "Camry", 2022, 25000)
car_json = json.dumps(car, cls=CarEncoder)

print(car_json)  # Output: {"__type__": "Car", "make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}
Here is the CarDecoder class, which lets us pass data as a JSON string and get back the custom Python object.
class CarDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=self.object_hook, *args, **kwargs)

    def object_hook(self, dct):
        if '__type__' in dct and dct['__type__'] == 'Car':
            return Car(dct['make'], dct['model'], dct['year'], dct['price'])
        return dct
Then we can use CarDecoder in the json.loads() method as the cls parameter.
car_json = '{"__type__": "Car", "make": "Toyota", "model": "Camry", "year": 2022, "price": 25000}'
car = json.loads(car_json, cls=CarDecoder)

print(car.make)  # Output: Toyota
print(car.model)  # Output: Camry
print(car.year)  # Output: 2022
print(car.price)  # Output: 25000
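The same hook logic can also be passed to json.loads() as a plain function through the object_hook parameter, without defining a decoder class. Because the hook is called for every JSON object in the document, nested cars are converted as well. In this sketch, the function name dict_to_car and the garage document (including the owner field) are hypothetical sample data.

```python
import json

class Car:
    def __init__(self, make, model, year, price):
        self.make = make
        self.model = model
        self.year = year
        self.price = price

def dict_to_car(dct):
    """Turn dictionaries tagged with __type__ == 'Car' into Car objects."""
    if dct.get("__type__") == "Car":
        return Car(dct["make"], dct["model"], dct["year"], dct["price"])
    return dct  # leave every other object as a plain dictionary

garage_json = """
{"owner": "Alex", "cars": [
    {"__type__": "Car", "make": "Toyota", "model": "Camry", "year": 2022, "price": 25000},
    {"__type__": "Car", "make": "Honda", "model": "Civic", "year": 2020, "price": 20000}
]}
"""

garage = json.loads(garage_json, object_hook=dict_to_car)

print(type(garage).__name__)                   # dict
print([car.model for car in garage["cars"]])   # ['Camry', 'Civic']
```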
Conclusion
In this guide, we've covered the basics of reading and parsing JSON data with Python, as well as how to access and modify JSON data using Python's built-in json package. We've also discussed more advanced JSON parsing options, such as JMESPath and ChompJS, which are useful for web scraping. With the knowledge gained from this guide, you should be able to work efficiently with JSON data in Python and integrate it into your developer workflow.