Java type hinting in avro-maven-plugin

Recently, somebody shared with me the following problem: an Avro schema in Schema Registry has magically evolved into a slightly different version, albeit still backward compatible.

Schemas changing…

The first version of the schema looked like this:

{
    "type":"record",
    "name":"Key",
    "namespace":"my-namespace",
    "fields" [
        {
            "name":"UUID",
            "type":"string"
        }
    ],
    "connect.name":"my-topic.Key"
}

After it changed, the schema looked like this:

{
    "type":"record",
    "name":"Key",
    "namespace":"my-namespace",
    "fields" [
        {
            "name":"UUID",
            "type":{
                "type":"string",
                "avro.java.string":"String"
            }
        }
    ],
    "connect.name":"my-topic.Key"
}

To add up to the confusing, this is a topic published by Kafka Connect using MySQL Debezium plugin. Neither the database schema nor Connect or Debezium versions had changed anywhere close to when the schema evolved.

How could this have happened?

The mystery guest…

Although nothing had changed in the stack that was polling record changes from the MySQL database and sending them to Kafka… there was a new element to consider.

After some conversations, it was apparent that there was a new application publishing records to the same topic, for testing. This application was:

  1. Downloading the schema from Schema Registry.
  2. Doing code-generation using avro-maven-plugin against the downloaded .asvc files from Schema Registry.
  3. Producing some records using the newly created Java POJO classes.

Those seem like the right steps. However, looking into the options of avro-maven-plugin, once stood up:

  /**  The Java type to use for Avro strings.  May be one of CharSequence,
   * String or Utf8.  CharSequence by default.
   *
   * @parameter property="stringType"
   */
  protected String stringType = "CharSequence";

Could it be the culprit?

stringType does more than you expect

While the description of the property suggests something as naive as instructing the Avro code generator what class to use for Avro strings… it does more than just that.

Comparing the code for POJOs generated using maven-avro-plugin two things are different. Firstly, fields like the UUID in the schema above change their type from java.lang.CharSequence to java.lang.String; this is as expected.

However, it also changes the internal Avro schema that every Java POJO stores in:

public static final org.apache.avro.Schema SCHEMA$;

Upon changing stringType to String the resulting schema in SCHEMA$ contains the extended type definition that we saw at the beginning. The Java POJOs define this property because it is sent to Schema Registry when producing records (only once, from there one it uses the returned schema id).

Since there is no canonical representation of an Avro schema, Schema Registry chooses to take the schema as is, ignoring that both schemas are semantically identical and it should not create a new version for it.

A solution?

Can we not use stringType = String? Yes, but then all POJOs are generated using CharSequence. In my opinion, that is the best option for mixed environments. After all, this extra hint in the schema only makes sense for Java consumers.

However, if you control the topic end to end (e.g., both producers and consumers), you might as well use with stringType = String by default and guarantee that every client uses String instead of CharSequence.

In any case, both schemas are backward compatible between themselves. A correct Avro library should result in the same schema representation in whatever language you have chosen to use.

Incompatible AVRO schema in Schema Registry

My company uses Apache Kafka as the spine for its next-generation architecture. Kafka is a distributed append-only log that can be used as a pub-sub mechanism. We use Kafka to publish events once business processes have completed successfully, allowing a high degree of decoupling between producers and consumers.

These events are encoded using Avro schemas. Avro is a binary serialization format that enables a compact representation of data, much more than, for instance, JSON. Given the high volume of events we publish to kafka, using a compact format is critical.

In combination with Avro we use Confluent’s Schema Registry to manage our schemas. The registry provides a RESTful API to store and retrieve schemas.

Compatibility modes

The Schema Registry can control what schemas get registered, ensuring a certain level of compatibility between existing and new schemas. This compatibility can be set to one of the next four modes:

  • BACKWARD: a new schema is allowed if it can be used to read all data ever published into the corresponding topic.
  • FORWARD: a new schema is allowed if it can be used to write data that all previous schemas would be able to read.
  • FULL: a new schema that fullfils both registrations.
  • NONE: a schema is allowed as long as it is valid Avro.

By default, Schema Registry sets BACKWARD compatibility, which is most likely your preferred option in PROD environment, unless you want to have a hard time with your consumers not quite understanding events published with a newer, incompatible version of the schema.

Incompatible schemas

In development phase it is perfectly fine to replace schemas with others that are incompatible. Schema Registry will prevent updating the existing schema to an incompatible newer version unless we change its default setting.

Fortunately Schema Registry offers a complete API that allows to register and retrieve schemas, but also to change some of its configuration. More specifically, it offers a /config endpoint to PUT new values for its compatibility setting.

The following command would change the compatibility setting to NONE for all schemas in the Registry:

curl -X PUT http://your-schema-registry-address/config 
     -d '{"compatibility": "NONE"}'
     -H "Content-Type:application/json"

This way next registration would be allowed by the Registry as long as the newer schema were valid Avro. The configuration can be set for an specific schema too, simply appending the name (i.e., /config/subject-name).

Once the incompatible schema has been registered, the setting should be set back to a more cautious value.

Summary

The combination of Kafka, Avro and Schema Registry is a great way to store your events in the most compact way possible, while still retains the ability to evolve the corresponding schemas.

However some of the limitations that the Schema Registry imposes make less sense on a development environment. On some occassions, making incompatible changes in a simple way is necessary and recommendable.

The Schema Registry API allows changing the compatibility setting to accept schemas that, otherwise, would be rejected.