How to create a small JSON library using ANTLR and shapeless

In this article I will show how ANTLR4 and shapeless can be used to create a small JSON library (not for production, of course ^_^)
that can decode arbitrary JSON strings into case classes and encode them back, with some Scala magic.

Project setup

Let’s begin with a project setup.

Generally speaking, it doesn’t matter which IDE you use; I’ll use IntelliJ IDEA, and the Community edition is more than enough. I also recommend installing the ANTLR4 plugin for IntelliJ: it’s not required, but it really helps when writing and debugging an ANTLR grammar.

Now we are ready to create a new project. I’ll build it with sbt, but Maven or Gradle would work as well.
After creating the project we need to add the ANTLR4 compiler plugin, sbt-antlr4, to plugins.sbt. We will also need shapeless for codec derivation, so let’s add that library to build.sbt too. The ANTLR-related settings in build.sbt are:

- antlr4Version: the ANTLR4 compiler version
- antlr4PackageName: the package name of the generated code
- antlr4TreatWarningsAsErrors: warnings can be dangerous, so let’s treat them as errors
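
As a rough sketch, the two files could look like the following. The plugin coordinates are those of the sbt-antlr4 plugin; the version numbers and the package name `jsonserde.parser` are assumptions chosen for illustration, so adjust them to whatever your project actually uses.

```scala
// project/plugins.sbt
// Version is an assumption; pick the release that matches your sbt version.
addSbtPlugin("com.simplytyped" % "sbt-antlr4" % "0.8.3")

// build.sbt
enablePlugins(Antlr4Plugin)

// shapeless is needed later for codec derivation (version is an assumption)
libraryDependencies += "com.chuusai" %% "shapeless" % "2.3.3"

Antlr4 / antlr4Version := "4.8-1"                       // ANTLR4 compiler version
Antlr4 / antlr4PackageName := Some("jsonserde.parser")  // hypothetical package for generated code
Antlr4 / antlr4GenListener := false                     // we do not need the listener
Antlr4 / antlr4GenVisitor := true                       // we do need the visitor
Antlr4 / antlr4TreatWarningsAsErrors := true
```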

antlr4GenListener and antlr4GenVisitor are the most interesting settings. To process parsed data in our code we can use one of two approaches: the listener pattern or the visitor pattern. For our small library I think the visitor is more natural and easier, although the same result can be achieved with a listener. The most important difference is that with a visitor you control which nodes are visited, whereas with a listener you react to every callback that ANTLR’s tree walker makes.

By default, ANTLR4 grammars are expected in the antlr4 directory, and since we didn’t change that, the final project structure should look like this:

root
|-- src
|   |-- main
|       |-- scala
|       |-- antlr4
|-- build.sbt

ANTLR4 grammar

To decode a JSON string we need a parser. We could write one from scratch, but we have ANTLR4, so we will use it for parsing.

> This is not the best solution for json parsing, especially for load-heavy systems, and should be considered as an example only.

First of all, we need a grammar. I think the best place to explore grammars and borrow ideas is the grammars-v4 repository. It contains many example grammars, but we need one in particular: JSON.

Our initial grammar is gently copied from the JSON grammar in grammars-v4.

We will change it a little soon, but for now let’s look at what’s going on there. In general, a grammar consists of parser rules and lexer rules.

Lexer rules specify token definitions. Their syntax resembles that of parser rules, with some differences: for example, lexer rule names must begin with an uppercase letter. It is also possible to define a fragment. Fragments are not visible to the parser, but they help the lexer recognize tokens.

fragment INT is a rule for an integer: it is either zero, or a digit in [1-9] followed by zero or more other digits:
- 0 — ok
- 123 — ok
- 01 — not ok

fragment EXP is a rule for the exponent part. It is an interesting example because it shows that fragments can be combined. Correct examples of an exponent part are e1, e+321 and e-3.

As mentioned before, a fragment is visible only to lexer rules, not to parser rules. The next lexer rule uses these fragments. NUMBER is a rule for a number: it can be negative (optional ‘-’), requires an INT fragment, and can have an optional fractional part (‘.’ [0-9]+)? and an optional exponent part EXP?.
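
To make the NUMBER rule more tangible, here is a small sketch that mirrors it with a plain Scala regular expression. It is not the generated lexer, only an illustration; it assumes that EXP reuses the INT fragment (as the text above suggests), and the name `NumberToken` is made up for this example.

```scala
object NumberToken {
  // fragment INT: zero, or a non-zero digit followed by any digits
  private val IntPart = "(0|[1-9][0-9]*)"
  // fragment EXP: 'e' or 'E', optional sign, then INT (assumption: EXP reuses INT)
  private val ExpPart = s"([eE][+-]?${IntPart})"
  // NUMBER: optional '-', INT, optional fractional part, optional exponent
  private val Number = s"-?${IntPart}(\\.[0-9]+)?${ExpPart}?".r

  // True only when the whole string is a single NUMBER token
  def isNumber(s: String): Boolean = Number.pattern.matcher(s).matches()
}
```

With this sketch, `NumberToken.isNumber("123")` holds while `NumberToken.isNumber("01")` does not, matching the INT examples above.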

Now let’s move on to the parser rules.

Parser rules are the heart of our grammar. By defining parser rules we define how the tokens found by the lexer combine with each other.

Each combination has its own complexity, so don’t overcomplicate the grammar with many recursive rules :)

The first rule is value. As we know, JSON consists of string literals, numbers, arrays of any values, objects with nested values, boolean values and null. In this example true, false and null are defined as literals inside the rule, but explicit lexer rules could be used for them instead. So this rule states that a value is a string OR a number OR an object, and so on.

As mentioned earlier, JSON can have arrays. An array can be empty [] or non-empty [value, value, …]. A non-empty array has at least one value, optionally followed by more values.

A JSON object is a value that consists of key-value pairs. The key is always a string, but the value can be of any type. Like an array, an object can be empty {} or non-empty {"f1": "v1"}.

Finally, there is a rule for the whole JSON string: it is just a single value.

Remember that we wanted to change the grammar a little? Now is the perfect time. We’ll change the rules by adding a label to each of their alternatives.

These labels are very useful, as we will see soon.

To generate the code, run the sbt command sbt antlr4Generate. It generates the following Java classes:

  • JSONBaseVisitor
  • JSONLexer
  • JSONParser
  • JSONVisitor

You can explore the sources, but for now we need to extend JSONBaseVisitor and override some methods.

First of all, let’s define our JSON ADT. I will not overcomplicate it with number decoding magic, which has to be done very carefully; for now we simply represent every number as a BigDecimal.
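
A minimal sketch of such an ADT is shown below. Only `JsonObject`, `JsonNull` and the use of BigDecimal are taken from the text; the other constructor names, the field names and the `JsonExample` helper are assumptions made for illustration.

```scala
sealed trait Json
case object JsonNull extends Json
final case class JsonBoolean(value: Boolean) extends Json
final case class JsonNumber(value: BigDecimal) extends Json // every number is a BigDecimal
final case class JsonString(value: String) extends Json
final case class JsonArray(values: List[Json]) extends Json
final case class JsonObject(fields: Map[String, Json]) extends Json

object JsonExample {
  // The ADT value for {"id": 1, "tags": ["a", "b"], "active": true, "note": null}
  val sample: Json = JsonObject(
    Map(
      "id"     -> JsonNumber(BigDecimal(1)),
      "tags"   -> JsonArray(List(JsonString("a"), JsonString("b"))),
      "active" -> JsonBoolean(true),
      "note"   -> JsonNull
    )
  )

  // Because the trait is sealed, the compiler checks this match for exhaustiveness
  def typeName(json: Json): String = json match {
    case JsonNull       => "null"
    case JsonBoolean(_) => "boolean"
    case JsonNumber(_)  => "number"
    case JsonString(_)  => "string"
    case JsonArray(_)   => "array"
    case JsonObject(_)  => "object"
  }
}
```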

Now we are ready to extend `JSONBaseVisitor`. The implementation is surprisingly small.

Remember the labels we added to the parser rules? JSONBaseVisitor has methods named after those labels, which is very helpful. There is also an additional method, removeQuotes: I didn’t come up with a better way to remove quotes, and that is all this method does. Without it, string values in Scala would still be wrapped in quotes, like "value".
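
The project’s own removeQuotes is not reproduced here, but a helper of this kind could be as simple as the sketch below. It is a hypothetical stand-in: it only strips the surrounding double quotes of a STRING token and does not unescape sequences such as \n or \u0041.

```scala
object Quotes {
  // Strips one leading and one trailing double quote, if both are present.
  // Anything else is returned unchanged.
  def removeQuotes(raw: String): String =
    if (raw.length >= 2 && raw.startsWith("\"") && raw.endsWith("\""))
      raw.substring(1, raw.length - 1)
    else raw
}
```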

Usage of this simple decoder:

Shapeless magic

So far we can only decode a JSON string into the Json ADT, but we want to decode it into a case class instance and encode a case class instance back into JSON. We also don’t want to wire up decoders and encoders manually; they should be derived automatically. All of this can be achieved using shapeless.

First of all, we will define JsonReader[A] and JsonWriter[A] for some basic types. There is no magic yet, just a handful of instances for basic types.
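
A minimal sketch of what these two type classes might look like is given below. The method names `read` and `write`, the use of Either for errors and the trimmed-down Json ADT are assumptions; only the names JsonReader and JsonWriter come from the text.

```scala
sealed trait Json
final case class JsonBoolean(value: Boolean) extends Json
final case class JsonNumber(value: BigDecimal) extends Json
final case class JsonString(value: String) extends Json

// Reads a single value out of a Json node, or explains why it cannot
trait JsonReader[A] { def read(json: Json): Either[String, A] }
// Turns a single value into a Json node
trait JsonWriter[A] { def write(value: A): Json }

object JsonReader {
  def apply[A](implicit reader: JsonReader[A]): JsonReader[A] = reader

  implicit val intReader: JsonReader[Int] = json =>
    json match {
      case JsonNumber(n) if n.isValidInt => Right(n.toInt)
      case other                         => Left(s"expected an Int, got $other")
    }

  implicit val stringReader: JsonReader[String] = json =>
    json match {
      case JsonString(s) => Right(s)
      case other         => Left(s"expected a String, got $other")
    }

  implicit val booleanReader: JsonReader[Boolean] = json =>
    json match {
      case JsonBoolean(b) => Right(b)
      case other          => Left(s"expected a Boolean, got $other")
    }
}

object JsonWriter {
  def apply[A](implicit writer: JsonWriter[A]): JsonWriter[A] = writer

  implicit val intWriter: JsonWriter[Int] = i => JsonNumber(BigDecimal(i))
  implicit val stringWriter: JsonWriter[String] = s => JsonString(s)
  implicit val booleanWriter: JsonWriter[Boolean] = b => JsonBoolean(b)
}
```

Simple values can then be read and written through the summoners, for example `JsonReader[Int].read(JsonNumber(BigDecimal(42)))` or `JsonWriter[String].write("hi")`.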

With these type classes we can already decode and encode simple data types.

So, consider this case class:

case class A(f1: Int, f2: String, f3: Option[Long])

This case class consists of three fields: an integer, a string and an optional long. We can decode or encode each of them separately (except for the optional one, because we haven’t defined a reader and writer for optional types yet). With JsonReader[A] and JsonWriter[A] we can process a single data type, but not a product of data types. Let’s add another layer of abstraction for that.

The new traits, Encoder[A] and Decoder[A], look the same as the previous ones, but their purpose is a little different.
As a playground we will use scastie. It does not take many lines of code to get automatic codec derivation.

Let’s start with Encoder[A], because it is simpler than Decoder[A]. In the companion object we define the summoner method apply and some implicits for derivation.

For Unit we simply return an empty object, but you can change that and return JsonNull or whatever you want. We also want to reuse the `JsonWriter` instances already defined for basic types, so we define an implicit conversion from JsonWriter to Encoder. Finally, we should be able to encode optional values, so we also define an implicit conversion from Encoder[A] to Encoder[Option[A]].

Now let’s move on to EncoderLowPriorityInstances. We put the derivation into a separate trait because we don’t want to create ambiguities for the compiler when it searches for an instance of some type (probably overkill for this example, but a good practice). This is where the magic happens.

This is a typical structure for automatic derivation with shapeless. genericEncoder derives instances for case classes. LabelledGeneric is used so that field names are available.
A is the type of the case class and H is its generic (HList) representation.

hnilEncoder is used to encode an HNil instance. It will never be called, but it is needed for derivation.
Finally, hlistEncoder is used to derive an encoder for our representation.
witness: Witness.Aux[K] is needed to get the field name. hEncoder and tEncoder are the encoders for the head (H) of our case class and the remaining tail (T <: HList). The implementation basically encodes both parts and then combines them into a single `JsonObject`. The return type of this method is not Encoder[H :: T] but Encoder[FieldType[K, H] :: T]:

If we combine Witness and FieldType, we get something very compelling — the ability to extract the field name from a tagged field. We use `FieldType` and `::` in the result type to declare the relationships between the three, and we use a Witness to access the runtime value of the type name.
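
The pieces described above can be put together in a compact sketch like the one below. It assumes shapeless 2.3.x on the classpath and is not the project’s actual code: the Json ADT is trimmed down, the basic instances are written directly as Encoder instances instead of going through JsonWriter, and in this version the HNil encoder is what supplies the empty object that the fields are added to.

```scala
import shapeless.{::, HList, HNil, LabelledGeneric, Lazy, Witness}
import shapeless.labelled.FieldType

sealed trait Json
case object JsonNull extends Json
final case class JsonNumber(value: BigDecimal) extends Json
final case class JsonString(value: String) extends Json
final case class JsonObject(fields: Map[String, Json]) extends Json

trait Encoder[A] { def encode(a: A): Json }

trait EncoderLowPriorityInstances {
  // End of the field list: an empty object to which fields are added
  implicit val hnilEncoder: Encoder[HNil] = _ => JsonObject(Map.empty)

  // One labelled field (K is the field name as a type, H its value type) plus the tail
  implicit def hlistEncoder[K <: Symbol, H, T <: HList](implicit
    witness: Witness.Aux[K],
    hEncoder: Lazy[Encoder[H]],
    tEncoder: Encoder[T]
  ): Encoder[FieldType[K, H] :: T] = { hlist =>
    val head = witness.value.name -> hEncoder.value.encode(hlist.head)
    tEncoder.encode(hlist.tail) match {
      case JsonObject(fields) => JsonObject(fields + head)
      case _                  => JsonObject(Map(head))
    }
  }

  // Case class A -> labelled HList H -> Json
  implicit def genericEncoder[A, H <: HList](implicit
    gen: LabelledGeneric.Aux[A, H],
    hEncoder: Lazy[Encoder[H]]
  ): Encoder[A] = a => hEncoder.value.encode(gen.to(a))
}

object Encoder extends EncoderLowPriorityInstances {
  def apply[A](implicit encoder: Encoder[A]): Encoder[A] = encoder

  implicit val intEncoder: Encoder[Int] = i => JsonNumber(BigDecimal(i))
  implicit val longEncoder: Encoder[Long] = l => JsonNumber(BigDecimal(l))
  implicit val stringEncoder: Encoder[String] = s => JsonString(s)

  implicit def optionEncoder[A](implicit encoder: Encoder[A]): Encoder[Option[A]] =
    opt => opt.map(encoder.encode).getOrElse(JsonNull)
}

// The case class from the text
case class A(f1: Int, f2: String, f3: Option[Long])
```

With these definitions in scope, `Encoder[A].encode(A(1, "x", None))` produces a `JsonObject` with the keys f1, f2 and f3, without any hand-written encoder for A.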

And that’s it! The Decoder follows the same structure; only the implementation details differ.
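
For comparison, a decoder derived in the same style could look like the sketch below. Again this assumes shapeless 2.3.x and is only an illustration: the `decode` signature, the Either-based errors and the choice to treat a missing field as JsonNull are assumptions, not the project’s confirmed behaviour.

```scala
import shapeless.{::, HList, HNil, LabelledGeneric, Lazy, Witness}
import shapeless.labelled.{FieldType, field}

sealed trait Json
case object JsonNull extends Json
final case class JsonNumber(value: BigDecimal) extends Json
final case class JsonString(value: String) extends Json
final case class JsonObject(fields: Map[String, Json]) extends Json

trait Decoder[A] { def decode(json: Json): Either[String, A] }

trait DecoderLowPriorityInstances {
  implicit val hnilDecoder: Decoder[HNil] = _ => Right(HNil)

  implicit def hlistDecoder[K <: Symbol, H, T <: HList](implicit
    witness: Witness.Aux[K],
    hDecoder: Lazy[Decoder[H]],
    tDecoder: Decoder[T]
  ): Decoder[FieldType[K, H] :: T] = { json =>
    json match {
      case obj @ JsonObject(fields) =>
        val name = witness.value.name
        for {
          // a missing field is handed to the head decoder as JsonNull (assumption)
          head <- hDecoder.value.decode(fields.getOrElse(name, JsonNull))
          // the whole object is passed on to the tail decoder
          tail <- tDecoder.decode(obj)
        } yield field[K](head) :: tail
      case other => Left(s"expected an object, got $other")
    }
  }

  implicit def genericDecoder[A, H <: HList](implicit
    gen: LabelledGeneric.Aux[A, H],
    hDecoder: Lazy[Decoder[H]]
  ): Decoder[A] = json => hDecoder.value.decode(json).map(gen.from)
}

object Decoder extends DecoderLowPriorityInstances {
  def apply[A](implicit decoder: Decoder[A]): Decoder[A] = decoder

  implicit val intDecoder: Decoder[Int] = json =>
    json match {
      case JsonNumber(n) if n.isValidInt => Right(n.toInt)
      case other                         => Left(s"expected an Int, got $other")
    }

  implicit val longDecoder: Decoder[Long] = json =>
    json match {
      case JsonNumber(n) if n.isValidLong => Right(n.toLong)
      case other                          => Left(s"expected a Long, got $other")
    }

  implicit val stringDecoder: Decoder[String] = json =>
    json match {
      case JsonString(s) => Right(s)
      case other         => Left(s"expected a String, got $other")
    }

  implicit def optionDecoder[A](implicit decoder: Decoder[A]): Decoder[Option[A]] = json =>
    json match {
      case JsonNull => Right(None)
      case other    => decoder.decode(other).map(Some(_))
    }
}

// The case class from the text
case class A(f1: Int, f2: String, f3: Option[Long])
```

Note that the tail decoder receives the whole object again at every step, which keeps the sketch short but repeats work for every field.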

But that’s still not a super cool implementation. What if the JSON lacks some field but our case class has a default value for it? Or what if we want to rename fields in the JSON? All of this can also be done using shapeless.

Shapeless magic — improvements

For field renaming and default values we will use Decoder as an example (scastie playground).

For custom field names we will use an annotation, so let’s define it.

Now we need an additional trait that helps us handle not only the current field value, but also the custom field name and the default value.

The instances in the companion object remain the same, but those in the trait change a little.

An important note: our decoder now needs not only A and H <: HList, but also HD <: HList and FH <: HList. Here HD and FH are the HLists of default values and field names respectively. There are also some new instances in the method signature.

As their names suggest, they are needed for processing default values and annotations (FieldName in this case). The rest is pretty similar: we still need a decoder instance for HNil and for HList.
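
To see what such instances provide, the sketch below summons them for a small case class. It assumes that the two instances are shapeless’s `Default.AsOptions` and `Annotations` type classes (shapeless 2.3.x), which the text does not state explicitly; the `FieldName` definition and the `User` case class are hypothetical.

```scala
import scala.annotation.StaticAnnotation
import shapeless._

// Hypothetical annotation carrying the custom JSON field name
final case class FieldName(name: String) extends StaticAnnotation

// Hypothetical case class: one renamed field, one field with a default value
final case class User(@FieldName("user_name") name: String, age: Int = 18)

object MetaDemo {
  // One Option per field: None when the field has no default value
  // Type: Option[String] :: Option[Int] :: HNil
  val defaults = Default.AsOptions[User].apply()

  // One Option per field: None when the field has no FieldName annotation
  // Type: Some[FieldName] :: None.type :: HNil
  val fieldNames = Annotations[FieldName, User].apply()
}
```

Both results are HLists aligned with the fields of the case class, which is what lets the decoder walk the fields, defaults and custom names in lockstep.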

witness and hDecoder should already be familiar. tDecoder has changed, and there are new type parameters: TD <: HList, FH <: Option[FieldName] and FT <: HList. tDecoder now has the type DecoderWithMeta. This is needed because we decode the object step by step (field by field): we have to pass the remaining JSON, the default values and the custom field names to the tail decoder, and the tail decoder has to be able to accept and process them.
TD <: HList is the type of the remaining default values. FH <: Option[FieldName] and FT <: HList are the field names of the current head (FH) and of the tail (FT).
The return type is parameterized by FieldType[K, H] :: T, Option[H] :: TD and FH :: FT. You can read it like this:

A decoder for the head of the case class plus its tail,
with an optional default value for its head and tail,
with an optional custom field name for its head and tail

Conclusion

In the end, we have a small library for parsing and decoding JSON. Consider it a proof of concept; there are still many issues, such as:
- No check for duplicate custom field names
- Slow deserialization, because the whole JSON object is processed again and again
- etc.

The main purpose was not to create a library, but to show that ANTLR and shapeless can be combined with an awesome result, and I believe it will help you start exploring ANTLR4 and shapeless.

The source code of the whole project can be found here: https://github.com/nryanov/json-serde

Resources

- Antlr4 doc: https://github.com/antlr/antlr4/tree/master/doc
- The Type Astronaut’s Guide to Shapeless: https://underscore.io/books/shapeless-guide/
