JSON is ubiquitous, particularly when working with APIs and logs. Its unstructured nature makes it highly flexible for handling anything from a simple array to a complex nested structure. However, this can also make it challenging for data analysis. When parsing JSON, it's crucial to understand its structure so you can flatten it and convert it into a tabular format for analysis. Once the structure is identified, you can use pandas or PySpark to explode or normalize it into the desired shape. In this article, I will explain the method I use. While this approach is applicable to any notebook, there is a specific trick to make it work in a Fabric notebook.
I will use the following example. Just looking at it, we can see that it's about employees, their contact details and skills.
If you use
.json_normalize() method in pandas to read this json, you will get a dataframe with just one row and column.
To unnest this, we will need to identify the structure and pass extra arguments to
.json_normalize() . I like using panel library for that. With panel, you can identify the structure easily and also interact with it to find the nested objects. You just need one trick to make it work in Fabric notebook.
First, install panel and instantiate the extension
!pip install panel --q import panel as pn pn.extension()
Next, use the
.pane.JSON()method to load the json
pn.pane.JSON(json_obj, name='JSON', height = 300, width = 300)
After executing the code above, you would expect to see the JSON structure, but it doesn't appear. This is the trick. Panel uses Bokeh renderer and hence you need to import Bokeh to output the visual. I wrote about it in one of my blogs.
Fabric runtime already has Bokeh installed so you just need to import it.
import bokeh bokeh.io.output_notebook() pn.pane.JSON(json_obj, name='JSON', height = 300, width = 300)
With this, you should now see the panel visual:
This is fully interactive and you can see all levels of the json and the its nested structure.
From this we can see two things, we have employee records under the key "employees". Each employee record is a dictionary with keys for 'ids', 'name', 'role' and nested dictionaries for 'contact' and a list for 'skills'. To flatten this to a tabular format , we need to flatten the list under 'skills' and extract attributes under the 'contact' dictionary. Lists can be flattened by
record_path argument and nested dictionary can be flattened by
meta . This will transform the nested and hierarchical json into a table.
df = (pd.json_normalize(json_obj['employees'], record_path='skills', meta=['id', 'name', 'role', ['contact', 'email'], ['contact', 'phone']], record_prefix='skill_') ) df
This was rather a simple example, but you can use a similar approach to plot the structure, interact with it and then flatten it as needed. To recap, the trick to make it work in Fabric notebook was to import bokeh and including
bokeh.io.output_notebook() in the code block.
If you are using the Synapse VSCode extension to work with Fabric notebooks, try the JSON crack extension. It plots the json into a fully interactive tree.