Querying nested JSON can be frustrating and finding easy to use examples can be even more challenging. It is one of the most common sources of questions that I see on Stack Overflow and Snowflake Community forums. The syntax is not at all intuitive, and it will unhelpfully return NULLs instead of throwing errors when you get it wrong.
Even after years of practice, I know that it’s theoretically a matter of traversing the JSON and maybe adding a LATERAL FLATTEN clause, but I still always have to look up the syntax and tweak it until I get it just right.
To make the process a little easier, here is a sample JSON blob and SQL query to retrieve some data from nested JSON. Note the color-coding at each level of nesting, and how those attribute names are used in the query both for flattening and producing a final output.
Once it’s all working, most of the final query is boiler plate syntax and aliases. I hope this example makes it a little easier to get through the syntax and get on with querying your data.
Read on past the example for more advice and troubleshooting tips.
A sample JSON blob with a couple layers of nested arrays:
CREATE OR REPLACE TEMPORARY TABLE PEOPLE AS
SELECT
1 AS ID,
PARSE_JSON('
{
"name": {
"first": "John",
"last": "Smith"
},
"projects": [
{
"project_name":"foo",
"tasks": [
{ "type":"story", "name":"more bar", "estimate":5 },
{ "type":"bug", "name":"less foo", "estimate":8 }
]
},
{
"project_name":"Hello",
"tasks": [
{ "type":"spike", "name":"World", "estimate":1 }
]
}
]
}') AS JSON_DATA;
A query to dig through the JSON for the highlighted values
SELECT ID,
JSON_DATA:name.first AS FIRST_NAME,
ARRAY1.VALUE:project_name AS PROJECT,
ARRAY2.VALUE:estimate AS ESTIMATE
FROM PEOPLE P,
LATERAL FLATTEN (INPUT => P.JSON_DATA:projects) ARRAY1,
LATERAL FLATTEN (INPUT => ARRAY1.VALUE:tasks) ARRAY2;
This will give the result:
ID | FIRST_NAME | PROJECT | ESTIMATE |
1 | John | foo | 5 |
1 | John | foo | 8 |
1 | John | Hello | 1 |
From there, you can wrap this query in a CTE (Common Table Expression) and use those results to aggregate Estimates per projects, for example.
General Advice
If you’re still having trouble, here’s my attempt at distilling a few years of Snowflake JSON headaches into bullet point format.
- Simple nested attributes are queried differently from FLATTENing an array
- The FIRST_NAME column above requires querying nested data, but there is no array around the attribute, so this can be queried directly from the JSON_DATA column. No FLATTEN necessary. There are a few notation options though. Dot Notation, as used above, is my personal preference.
- The LATERAL FLATTEN syntax is needed any time you want a JSON array to translate to multiple rows in your query output
- You can avoid Flattening if you know that you only need the first (or second, or third) element from an array using this bracket notation
- If you need a value in an array that is in nested JSON that is inside another array, the query starts to get uglier, but at its core it’s a matter of combining these two notations, which leads to the next piece of advice…
- Build the query one step at a time
- Don’t try to build out your whole SELECT clause in one go. Keep adding output for every step (extracting an array, inspecting the value from the flattened array, extracting the attributes within, etc. etc. etc.)
- Making a small mistake in a big query can lead to NULLs in the output. Snowflake will happily FLATTEN a NULL array, retrieve an attribute from a NULL value, or an index from an empty array. If this happens, you’ll get NULL in your output with no error to indicate that something is not working as intended
- Work backwards, if needed, to a point where you have something non-null to spot where things are going wrong
- Clean up your output AFTER your query is working. Snowflake’s Exclude and Rename syntax can be helpful for building up the query in a CTE and then manipulating the output.
- JSON attributes are case sensitive
- All the lower-case text in the example query above is required to match the casing in the JSON blob
- If you ever have something returning NULL, double check your casing
- Make sure your JSON has consistent casing of attribute names if possible. GET_IGNORE_CASE provides alternate syntax for parsing JSON if needed
- Use helpful aliases at every step
- In the example above I used array1, and array2 to keep track of the nesting level while I was building the query, but meaningful names like “projects” and “tasks” might make it easier to follow in a real example
- I always go back and clean up my aliases on a second pass for code clarity
- Use a JSON formatter to make it easy to visualize the levels of nesting in your JSON blob. This is especially helpful if your JSON is large and complex. I use the “JSON formatter” extension in VSCode
- Extracting JSON attributes into table columns
- Many queries and views can perform fine while processing and returning values from JSON, but as a table gets bigger it can become a performance hit
- Any attribute that’s regularly used in a where clause, join clause, order by, group by, data mask, or cluster, should generally be extracted into its own column in a transformation layer
- JSON does not have native data types outside of numbers, text, and booleans. If you have date fields, they will be more efficient, both in space and query performance, in a table column
This blog is written by David Garrison.
0 Comments