Parsing Unstructured JSON

"Loosing" control the right way


Introduction

JSON, or JavaScript Object Notation, is a lightweight data-interchange format that is easy for both humans and machines to read and write. It is based on a subset of the JavaScript programming language, and is quickly becoming the de facto standard for data interchange.

Java, on the other hand, is known for its robustness, security, and portability. The other side of the coin is that these come at the cost of a stringent type system. In Java, every variable has a specific type, and this type cannot be changed at runtime. This helps to prevent errors and makes code more reliable, but it also means that the language itself acts like a control freak.

So, can a JSON document be represented in Java successfully? The answer is yes, but there are a few things to keep in mind. First, the document must be in a format that Java can understand, which means it must be well-formed JSON. Second, the document must be compatible with Java's type system, which means that the types of the data in the document must match the types of the variables in Java.

For example,

{
    "name": "ahrooran",
    "dob": 10151,
    "pronouns": [
        "he",
        "him"
    ],
    "isBachelor": true,
    "address": {
        "country": "Sri Lanka",
        "city": "Colombo",
        "street": null
    }
}

The above JSON can be easily mapped to the following records:

record Person(String name, int dob, String[] pronouns, boolean isBachelor, Address address) {}

record Address(String country, String city, String street) {}

But what about the following JSON snippet?

{
    "chicken": false,
    "chapter": [
        "plain",
        true,
        {
            "pleasure": false,
            "sail": true,
            "quickly": "drink",
            "rhythm": true,
            "process": -1701991737.8426595,
            "shape": 985992238.642838
        },
        1059826615
    ],
    "shop": 1874937219.5652575,
    "different": "height"
}

As you can see, there is no sensible way to map unstructured JSON like this to a fixed Java class, because a class cannot be modified at runtime (let's omit the black magic stuff for now). Since JSON is used by developers across practically every language, Java has to be able to parse it anyway, and herein lies the issue.
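
One way to picture the problem is to hold part of such a document in plain collections, which is roughly what the parsers without intermediate types hand back. The sketch below is illustrative only: it builds the "chapter" array from the snippet above by hand out of java.util types (no parser is involved), and the names LooseJson, chapter and typeOf are made up for this example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any JSON library.
final class LooseJson {

    // The "chapter" array from the snippet above: four elements, four different types.
    static List<Object> chapter() {
        return List.of(
                "plain",
                true,
                Map.<String, Object>of(
                        "pleasure", false,
                        "sail", true,
                        "quickly", "drink",
                        "rhythm", true,
                        "process", -1701991737.8426595,
                        "shape", 985992238.642838),
                1059826615);
    }

    // Names the JSON type that a loosely typed Java value stands for.
    static String typeOf(Object value) {
        if (value == null) return "null";
        if (value instanceof String) return "string";
        if (value instanceof Boolean) return "boolean";
        if (value instanceof Number) return "number";
        if (value instanceof List<?>) return "array";
        if (value instanceof Map<?, ?>) return "object";
        throw new IllegalArgumentException("Not a JSON value: " + value.getClass());
    }

    public static void main(String[] args) {
        // No single element type fits, so the list has to be a List<Object>.
        chapter().forEach(element -> System.out.println(typeOf(element)));
    }
}
```

Because the only common supertype is Object, every read has to be followed by a runtime type check, which is exactly the bookkeeping the libraries below try to hide or at least make bearable.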

There are a number of JSON parsing libraries that can be used to parse unstructured JSON data. These libraries offer a variety of features, such as automatic type inference, support for nested objects, and performance optimizations. In this article, we will look at some popular JSON parsing libraries and how they handle unstructured JSON data. We will also evaluate the user friendliness and performance of these libraries.

To demonstrate the usage of each library, I will use a piece of code from a production component. The code parses a large JSON file (around 16M) and does some repetitive operations on it. I will rewrite the same logic for different libraries and evaluate their usefulness and performance.

I wrote the code the way a general-purpose programmer would; no fancy streaming approaches are used. The code is also written so that it can run inside JMH performance benchmarks, so there will be some JMH usage in this article. If you are unfamiliar with JMH, you can refer here.

Gson

Gson is a straightforward and easy-to-use JSON parser. However, it is now in maintenance mode and the lead developers have moved on to Moshi. This means there is no active development of new features for Gson, but bugs and security vulnerabilities are still being fixed.

Gson has built-in JsonObject and JsonArray classes for handling object and array types. To parse JSON, simply call gson.fromJson(content, JsonArray.class), where content is a String. For performance-critical needs, use the Gson streaming API.

Now let's look at the code:

@SneakyThrows
public static void gson(String content, Blackhole blackhole, Gson gson) {
    JsonArray array = gson.fromJson(content, JsonArray.class);
    array.forEach(catObj -> {
        JsonObject category = catObj.getAsJsonObject();
        String[] categoryField = category.get("cat").getAsString().split("~");
        String countryCode = categoryField[0];
        String gicsL2 = categoryField[1];

        JsonArray statList = category.getAsJsonArray("statistics");
        JsonObject ratioStatsMap = new JsonObject();
        List<String> ratioList = new ArrayList<>(statList.size());
        statList.forEach(ratObj -> {
            JsonObject ratioStat = (JsonObject) ratObj;
            ratioStatsMap.add(ratioStat.get("type").getAsString(), ratioStat);
            ratioList.add(ratioStat.get("type").getAsString());
        });

        JsonObject symbolRatios = category.getAsJsonObject("ratios");
        symbolRatios.keySet().forEach(symbolKey -> {
            List<String> allRatioList = new LinkedList<>(ratioList);
            String[] field = symbolKey.split("~");
            String exchange = field[0];
            String symbol = field[1];
            JsonObject ratios = symbolRatios.getAsJsonObject(symbolKey);

            ratios.keySet().forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
                allRatioList.remove(key);
            });

            allRatioList.forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
            });

            blackhole.consume(exchange);
            blackhole.consume(symbol);
            blackhole.consume(countryCode);
            blackhole.consume(gicsL2);
            blackhole.consume(ratioStatsMap);
        });
    });
}

The fact that Gson has these intermediate types makes it attractive and easier to work with than some of the other libraries (we will look at them below). Note that each element in a JsonArray is treated as a JsonElement, which can be a JsonObject, a JsonArray, a JsonPrimitive or a JsonNull. The user must first choose the underlying subtype and then convert it to a Java type such as Integer or String, for example category.get("cat").getAsString().
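
The two-step access (narrow to a subtype, then convert to a Java type) can be pictured with a small stand-in hierarchy. This is a sketch under assumptions, not Gson's source: the names Element, Obj, Arr, Prim and Nul are invented here, Gson's real classes are ordinary classes rather than records, and the sample value "LK~4010" is made up.

```java
import java.util.List;
import java.util.Map;

// Invented stand-in for a JSON element hierarchy; these are not Gson's classes.
sealed interface Element permits Obj, Arr, Prim, Nul {

    // Step 1: narrow the element to the subtype we expect.
    default Obj asObject() {
        if (this instanceof Obj obj) return obj;
        throw new IllegalStateException("Not a JSON object: " + this);
    }

    // Step 2: convert a primitive to a plain Java type.
    default String asString() {
        if (this instanceof Prim prim) return String.valueOf(prim.value());
        throw new IllegalStateException("Not a JSON primitive: " + this);
    }
}

record Obj(Map<String, Element> members) implements Element {
    Element get(String name) {
        return members.get(name);
    }
}

record Arr(List<Element> items) implements Element {}

record Prim(Object value) implements Element {}

record Nul() implements Element {}

final class ElementDemo {

    // Same shape as category.get("cat").getAsString().split("~") in the listing.
    static String[] categoryFields(Element category) {
        return category.asObject().get("cat").asString().split("~");
    }

    public static void main(String[] args) {
        Element category = new Obj(Map.of("cat", new Prim("LK~4010")));
        System.out.println(String.join(" / ", categoryFields(category)));
    }
}
```

The point of the design is that a wrong assumption about the shape fails with a descriptive exception at the conversion call, instead of a ClassCastException somewhere further down.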

One thing I do not like about Gson is that it cannot parse a byte[] directly, which would be more GC friendly than first building a String. Most other modern parsers support it out of the box. This is especially problematic when we have to deal with big chunks of JSON arriving over the network.
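
When the payload already arrives as a byte[], one way to avoid materialising a second full copy as a String is to decode it lazily through a Reader. The sketch below uses only the JDK, and the names ByteReaders, utf8Reader and countChars are made up for illustration. Gson does offer fromJson overloads that accept a java.io.Reader, so a reader like this one could be handed to it; whether that actually reduces GC pressure for a given workload is something to measure rather than assume.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: wraps raw UTF-8 bytes in a Reader without building a String first.
final class ByteReaders {

    static Reader utf8Reader(byte[] content) {
        // Characters are decoded in chunks as the consumer reads them.
        return new InputStreamReader(new ByteArrayInputStream(content), StandardCharsets.UTF_8);
    }

    // Stand-in for a parser: drains the reader and counts the characters it was fed.
    static int countChars(Reader reader) {
        char[] buffer = new char[1024];
        int total = 0;
        try (reader) {
            for (int read; (read = reader.read(buffer)) != -1; ) {
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up sample payload.
        byte[] content = "[{\"cat\":\"LK~4010\"}]".getBytes(StandardCharsets.UTF_8);
        // With Gson on the classpath, the reader could instead be passed to
        // gson.fromJson(reader, JsonArray.class).
        System.out.println(countChars(utf8Reader(content)));
    }
}
```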

Jackson

Jackson is the most popular JSON parsing library for Java, especially since Spring uses it by default under the hood. Jackson's JsonNode is the basic JSON holder. It can be anything, including an object, an array, a string, a number, or null. Jackson also offers specific nodes for each type of JSON data, such as ArrayNode, BooleanNode, NullNode, and so on. This makes it easy to work with JSON data, even if it is unstructured. One of the advantages of Jackson is that it does not require you to use these subtypes: you can simply use JsonNode to read any type of JSON data, which makes Jackson very flexible and easy to use. Jackson also offers its own streaming API, which lets us write more performant parsers.

@SneakyThrows
public static void jackson(byte[] content, Blackhole blackhole, ObjectMapper objectMapper) {
    JsonNode array = objectMapper.readTree(content);
    array.forEach(category -> {
        String[] categoryField = category.get("cat").asText().split("~");
        String countryCode = categoryField[0];
        String gicsL2 = categoryField[1];

        JsonNode statList = category.get("statistics");
        Map<String, JsonNode> ratioStatsMap = new HashMap<>(statList.size());
        List<String> ratioList = new ArrayList<>(statList.size());
        statList.forEach(ratioStat -> {
            ratioStatsMap.put(ratioStat.get("type").asText(), ratioStat);
            ratioList.add(ratioStat.get("type").asText());
        });

        JsonNode symbolRatios = category.get("ratios");
        symbolRatios.fieldNames().forEachRemaining(symbolKey -> {
            List<String> allRatioList = new LinkedList<>(ratioList);
            String[] field = symbolKey.split("~");
            String exchange = field[0];
            String symbol = field[1];
            JsonNode ratios = symbolRatios.get(symbolKey);

            ratios.fieldNames().forEachRemaining(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
                allRatioList.remove(key);
            });

            allRatioList.forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
            });

            blackhole.consume(exchange);
            blackhole.consume(symbol);
            blackhole.consume(countryCode);
            blackhole.consume(gicsL2);
            blackhole.consume(ratioStatsMap);
        });
    });
}

Invocation is simple: JsonNode array = objectMapper.readTree(content);. Note the usage of JsonNode: it can act as an array, an object, or any other basic JSON type, all by itself. You do not need separate classes to represent arrays or objects in Jackson.

org.json

org.json is a reference implementation of JSON for Java. The upstream JSON-java project is now released into the public domain (older releases used the JSON License). It is a good choice for simple JSON parsing tasks: it is easy to use and it does not require any dependencies. However, it is not as powerful or versatile as some of the other JSON parsers available for Java. I have personally used it in many production Java components.

@SneakyThrows
public static void orgJson(String content, Blackhole blackhole) {
    JSONArray array = new JSONArray(content);
    array.forEach(catObj -> {
        JSONObject category = (JSONObject) catObj;
        String[] categoryField = category.getString("cat").split("~");
        String countryCode = categoryField[0];
        String gicsL2 = categoryField[1];

        JSONArray statList = category.getJSONArray("statistics");
        JSONObject ratioStatsMap = new JSONObject(statList.length());
        List<String> ratioList = new ArrayList<>(statList.length());
        statList.forEach(ratObj -> {
            JSONObject ratioStat = (JSONObject) ratObj;
            ratioStatsMap.put(ratioStat.getString("type"), ratioStat);
            ratioList.add(ratioStat.getString("type"));
        });

        JSONObject symbolRatios = category.getJSONObject("ratios");
        symbolRatios.keySet().forEach(symbolKey -> {
            List<String> allRatioList = new LinkedList<>(ratioList);
            String[] field = symbolKey.split("~");
            String exchange = field[0];
            String symbol = field[1];
            JSONObject ratios = symbolRatios.getJSONObject(symbolKey);

            ratios.keySet().forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
                allRatioList.remove(key);
            });

            allRatioList.forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
            });

            blackhole.consume(exchange);
            blackhole.consume(symbol);
            blackhole.consume(countryCode);
            blackhole.consume(gicsL2);
            blackhole.consume(ratioStatsMap);
        });
    });
}

There is no fancy setup to get started: calling JSONArray array = new JSONArray(content); or JSONObject object = new JSONObject(content); is enough. We can also get the underlying String, int or other Java type in a single call, for example category.getString("cat"). With Gson and Jackson we first fetch the element and then call a second method to extract the underlying type.

I do not like the way forEach works on a JSONArray: every element arrives as a plain Object, so I have to cast each time, as in JSONObject category = (JSONObject) catObj;. Casting is one of the biggest ease-of-use problems in JSON parsing.

Moshi

Moshi is a newer JSON parsing library that is written in Kotlin. It does not offer as many features as Jackson or Gson, but it is a good choice for simple JSON parsing tasks. The rumour is that the lead developers of Gson left Google and moved over to Square, and that Moshi has been the successor to Gson ever since. Moshi follows in the footsteps of Gson but with some design changes, one of them being that there are no intermediate JSON holder types.

@SneakyThrows
public static void moshi(String content, Blackhole blackhole, JsonAdapter<List<Map<String, Object>>> moshi) {
    List<Map<String, Object>> list = moshi.fromJson(content);
    assert list != null;
    list.forEach(category -> {
        String[] categoryField = String.valueOf(category.get("cat")).split("~");
        String countryCode = categoryField[0];
        String gicsL2 = categoryField[1];

        List<?> statList = (List<?>) category.get("statistics");
        Map<String, Object> ratioStatsMap = new HashMap<>();
        List<String> ratioList = new ArrayList<>(statList.size());
        statList.forEach(ratObj -> {
            Map<String, String> ratioStat = (Map<String, String>) ratObj;
            ratioStatsMap.put(ratioStat.get("type"), ratioStat);
            ratioList.add(ratioStat.get("type"));
        });

        Map<String, Object> symbolRatios = (Map<String, Object>) category.get("ratios");
        symbolRatios.keySet().forEach(symbolKey -> {
            List<String> allRatioList = new LinkedList<>(ratioList);
            String[] field = symbolKey.split("~");
            String exchange = field[0];
            String symbol = field[1];
            Map<String, String> ratios = (Map<String, String>) symbolRatios.get(symbolKey);

            ratios.keySet().forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
                allRatioList.remove(key);
            });

            allRatioList.forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
            });

            blackhole.consume(exchange);
            blackhole.consume(symbol);
            blackhole.consume(countryCode);
            blackhole.consume(gicsL2);
            blackhole.consume(ratioStatsMap);
        });
    });
}

At a glance you can see that invoking Moshi is similar to invoking Gson. However, since Moshi does not have intermediate holders, we have to work with List, Map and Object and cast explicitly to the desired types. For example, List<Map<String, Object>> list = moshi.fromJson(content); is the equivalent of JsonArray array = gson.fromJson(content, JsonArray.class); in the Gson version. You can see the explicit cast in List<?> statList = (List<?>) category.get("statistics");, which makes the parsing code look ugly and less straightforward.
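
One way to keep the casts in a single place is a pair of small helpers that check the runtime type before casting. This is only a sketch built on JDK collections: Loose, asMap and asList are names invented here and are not part of Moshi or any other library, and the sample values are made up.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helpers for walking a tree of Map/List/Object values.
final class Loose {

    // Returns the value as a JSON object, or fails with a readable message.
    @SuppressWarnings("unchecked")
    static Map<String, Object> asMap(Object value) {
        if (value instanceof Map<?, ?>) {
            return (Map<String, Object>) value; // JSON object keys are always strings
        }
        throw new IllegalArgumentException("Expected a JSON object but got: " + value);
    }

    // Returns the value as a JSON array, or fails with a readable message.
    @SuppressWarnings("unchecked")
    static List<Object> asList(Object value) {
        if (value instanceof List<?>) {
            return (List<Object>) value;
        }
        throw new IllegalArgumentException("Expected a JSON array but got: " + value);
    }

    public static void main(String[] args) {
        // Hand-built stand-in for one parsed category.
        Map<String, Object> category = Map.of(
                "cat", "LK~4010",
                "statistics", List.of(Map.of("type", "PE")));

        for (Object stat : asList(category.get("statistics"))) {
            System.out.println(asMap(stat).get("type"));
        }
    }
}
```

The unchecked warnings do not go away, but they are confined to two methods, and a document with an unexpected shape fails with a message that names the offending value.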

Jsoniter

Jsoniter is designed to be faster and more efficient than other JSON parsers, while still being easy to use. It also positions itself as more extensible and flexible than other JSON parsers, with support for a wider range of features.

@SneakyThrows
public static void jsoniter(byte[] content, Blackhole blackhole) {
    Any array = JsonIterator.deserialize(content);
    array.forEach(category -> {
        String[] categoryField = category.toString("cat").split("~");
        String countryCode = categoryField[0];
        String gicsL2 = categoryField[1];

        Any statList = category.get("statistics");
        Map<String, Any> ratioStatsMap = new HashMap<>(statList.size());
        List<String> ratioList = new ArrayList<>(statList.size());
        statList.forEach(ratioStat -> {
            ratioStatsMap.put(ratioStat.toString("type"), ratioStat);
            ratioList.add(ratioStat.toString("type"));
        });

        Map<String, Any> symbolRatios = category.get("ratios").asMap();
        symbolRatios.keySet().forEach(symbolKey -> {
            List<String> allRatioList = new LinkedList<>(ratioList);
            String[] field = symbolKey.split("~");
            String exchange = field[0];
            String symbol = field[1];
            Map<String, Any> ratios = symbolRatios.get(symbolKey).asMap();

            ratios.keySet().forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
                allRatioList.remove(key);
            });

            allRatioList.forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
            });

            blackhole.consume(exchange);
            blackhole.consume(symbol);
            blackhole.consume(countryCode);
            blackhole.consume(gicsL2);
            blackhole.consume(ratioStatsMap);
        });
    });
}

Parsing is as simple as Any array = JsonIterator.deserialize(content);. The Any data structure is very similar to Jackson's JsonNode.

DSL-JSON

DSL-JSON is a high-performance JVM (Java/Android/Scala/Kotlin) JSON library with advanced compile-time databinding support. It is designed for performance and is built for invasive software composition with the DSL Platform compiler. It is the most performant library I have tested so far for unstructured JSON. However, it lacks an intermediate holder type, which makes the code cumbersome to write (similar to Moshi).

@SneakyThrows
public static void dsl(byte[] content, Blackhole blackhole, DslJson<Object> dslJson) {
    List<Map> list = dslJson.deserializeList(Map.class, content, content.length);
    assert list != null;
    list.forEach(category -> {
        String[] categoryField = String.valueOf(category.get("cat")).split("~");
        String countryCode = categoryField[0];
        String gicsL2 = categoryField[1];

        List<?> statList = (List<?>) category.get("statistics");
        Map<String, Object> ratioStatsMap = new HashMap<>();
        List<String> ratioList = new ArrayList<>(statList.size());
        statList.forEach(ratObj -> {
            Map<String, String> ratioStat = (Map<String, String>) ratObj;
            ratioStatsMap.put(ratioStat.get("type"), ratioStat);
            ratioList.add(ratioStat.get("type"));
        });

        Map<String, Object> symbolRatios = (Map<String, Object>) category.get("ratios");
        symbolRatios.keySet().forEach(symbolKey -> {
            List<String> allRatioList = new LinkedList<>(ratioList);
            String[] field = symbolKey.split("~");
            String exchange = field[0];
            String symbol = field[1];
            Map<String, String> ratios = (Map<String, String>) symbolRatios.get(symbolKey);

            ratios.keySet().forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
                allRatioList.remove(key);
            });

            allRatioList.forEach(key -> {
                String segment = getSegment(key);
                blackhole.consume(segment);
            });

            blackhole.consume(exchange);
            blackhole.consume(symbol);
            blackhole.consume(countryCode);
            blackhole.consume(gicsL2);
            blackhole.consume(ratioStatsMap);
        });
    });
}

Parsing is similar to Moshi: List<Map> list = dslJson.deserializeList(Map.class, content, content.length);

Performance

The above logic was tested using JMH with the following settings:

# JMH version: 1.36
# VM version: JDK 19.0.2, OpenJDK 64-Bit Server VM, 19.0.2+7-FR
# VM invoker: D:\programs\java\jdk19.0.2_7\bin\java.exe
# VM options: --add-opens=java.base/java.lang=ALL-UNNAMED
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 10 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time

The performance results are as follows:

Benchmark                Mode  Cnt   Score   Error  Units
BenchmarkJson.dsl       thrpt   60  14.618 ± 0.380  ops/s
BenchmarkJson.gson      thrpt   60  11.399 ± 0.083  ops/s
BenchmarkJson.jackson   thrpt   60   9.562 ± 0.057  ops/s
BenchmarkJson.jsoniter  thrpt   60   9.421 ± 0.183  ops/s
BenchmarkJson.moshi     thrpt   60   6.278 ± 0.036  ops/s
BenchmarkJson.orgJson   thrpt   60   2.589 ± 0.035  ops/s

DSL-JSON is the most performant. Although I have not used it in production, I am inclined to use it in the future, but not for unstructured JSON. The lack of an intermediate datatype to hold values is a big no for me.

14.618 ±(99.9%) 0.380 ops/s [Average]
(min, avg, max) = (12.173, 14.618, 15.388), stdev = 0.849
CI (99.9%): [14.238, 14.998] (assumes normal distribution)

Despite not being actively developed, Gson still comes in 2nd place and, unlike DSL-JSON, it has built-in intermediate types. This makes Gson a prime candidate for unstructured JSON parsing.

11.399 ±(99.9%) 0.083 ops/s [Average]
(min, avg, max) = (10.511, 11.399, 11.676), stdev = 0.185
CI (99.9%): [11.316, 11.482] (assumes normal distribution)

What is interesting is that Moshi, despite being the next generation of Gson, lagged far behind it in throughput.

6.278 ±(99.9%) 0.036 ops/s [Average]
(min, avg, max) = (5.917, 6.278, 6.384), stdev = 0.080
CI (99.9%): [6.242, 6.313] (assumes normal distribution)
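
To put the table in more familiar terms, the throughput scores can be turned into an approximate time per operation and a ratio against the slowest library. The sketch below only re-does arithmetic on the scores reported above; the class and method names are made up, and the error margins are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that re-expresses the JMH throughput scores from the results table.
final class ThroughputMath {

    // Scores in ops/s, copied from the results table (error margins left out).
    static Map<String, Double> scores() {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("dsl", 14.618);
        scores.put("gson", 11.399);
        scores.put("jackson", 9.562);
        scores.put("jsoniter", 9.421);
        scores.put("moshi", 6.278);
        scores.put("orgJson", 2.589);
        return scores;
    }

    // Average time of one operation in milliseconds, derived from ops/s.
    static double millisPerOp(double opsPerSecond) {
        return 1000.0 / opsPerSecond;
    }

    // How many times higher one throughput score is than another.
    static double speedup(double score, double baseline) {
        return score / baseline;
    }

    public static void main(String[] args) {
        double slowest = scores().get("orgJson");
        scores().forEach((name, score) -> System.out.printf(
                "%-9s %6.1f ms/op  %.2fx org.json%n",
                name, millisPerOp(score), speedup(score, slowest)));
    }
}
```

By this arithmetic, one pass over the file takes roughly 68 ms with DSL-JSON and roughly 386 ms with org.json.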