Completeness of "Parsing Canonical Form"

1.0 Introduction

One of the defining characteristics of Avro is that a reader is assumed to have the "same" schema used by the writer of the data the reader is reading. This assumption leads to a data format that's compact and amenable to many forms of schema evolution. However, there are nuances to defining exactly what it means for the reader to have "the same" schema used by the writer. We want to allow, for example, trivial transformations, such as the insertion of whitespace. But we can't allow transformations that change the real meaning of schemas, such as a reordering of fields in a record.

To clearly define what it means for a reader to have "the same" schema as a writer, the Avro specification defines Parsing Canonical Form (PCF), a set of transformations on Avro schemas that strip away irrelevancies (e.g., "doc" attributes) and normalize the JSON text (e.g., dealing with whitespace). Two schemas are defined to be "the same" as far as a reader is concerned if and only if their PCFs are textually equal.
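To give the flavor of these transformations, here is a toy sketch in Python. It is our own illustration, not the full PCF: it strips attributes like "doc", fixes the attribute order (the KEEP list mirrors the spec's ordering rule), and removes whitespace, but omits the spec's other rules (e.g., fullname resolution).

```python
import json

# Toy sketch of PCF-style normalization (ours, not the full spec):
# drop attributes not in KEEP (e.g., "doc"), fix the attribute order,
# and re-serialize with no whitespace.
KEEP = ["name", "type", "fields", "symbols", "items", "values", "size"]

def canonical(node):
    if isinstance(node, dict):
        # fixed key order; anything not in KEEP (e.g., "doc") is dropped
        return {k: canonical(node[k]) for k in KEEP if k in node}
    if isinstance(node, list):
        return [canonical(x) for x in node]
    return node

def pcf(schema_text):
    # separators=(",", ":") eliminates all whitespace in the output
    return json.dumps(canonical(json.loads(schema_text)), separators=(",", ":"))

s1 = '{"type": "record", "name": "r", "doc": "a record", "fields": []}'
s2 = '{"name":"r","type":"record","fields":[]}'
assert pcf(s1) == pcf(s2)   # whitespace, key order, and "doc" don't matter
```

Under this toy normalization, both texts reduce to the same canonical string, so they count as "the same" schema.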

We believe that PCF is sound and complete. Soundness means that the PCF of a schema is logically equivalent to the original form, i.e., we can use the PCF in place of the original form without introducing bugs. Completeness is "maximal soundness": if two schemas are logically equivalent, then their PCFs will be textually identical. The Avro specification claims that PCF is complete when it says: "[if two schemas have the same PCF, then] there is no serialized data that would allow a reader to distinguish data generated by a writer using one of the original schemas from data generated by a writer using the other original schema."

We believe that the transformations that define PCF are "self-evidently" sound to people familiar with Avro. For example, fixing the order of fields in a JSON object, or eliminating irrelevant attributes like doc, or using the simple int in place of {"type":"int"} clearly don't change the meaning of a schema.

Completeness, on the other hand, is much less obvious. How do we know that there aren't two logically equivalent schemas that happen to reduce to different canonical forms? All it takes is one such pair to foil our claim of completeness.

In general, completeness properties like this can be tricky to prove. It turns out that, while soundness is critical to us, completeness is not. If two schemas are operationally equivalent (i.e., a reader can't tell their output apart), but we accidentally treat them as if they are different, then typically all that happens is that we'll do more work. For example, we might generate a decoder object to decode some incoming data when it turns out that we had already cached a decoder object that could do the job. This is not likely to happen often, and thus incompleteness isn't a huge problem.

At the same time, if we knew that our canonical forms were complete, then we might take advantage of that fact in some circumstances (e.g., to serialize schemas). Also, the Schema.equals(Object) method provided in the Avro implementation makes many of the same assumptions made in the PCF definition. Thus, a completeness proof for our canonicalization would give us confidence in the correctness of this equality algorithm. So this issue is not entirely academic.

We haven't worked out a full, formal proof (we hope someone from the community will step up to that task!). However, we've been thinking about it quite a bit, and we thought we'd share our thoughts so far.

2.0 Completeness argument for Parsing Canonical Form

Our formalization of Avro schemas would be based on interpreting them as grammars. In this interpretation, Avro schemas are grammars that generate tagged data streams. Consider, for example, the following schema for a linked-list:

  {"type":"record", "name":"list", "fields":[
     {"name":"value", "type":"int"},
     {"name":"tail",  "type":["null", "list"]}
   ]}
Interpreted as a grammar, it can generate a tagged data-stream that looks like this:
  [record,"list"][field,"value"][int,10][field,"tail"][union,1]
    [record,"list"][field,"value"][int,22][field,"tail"][union,0]
(this is a two-cell linked list whose first cell contains the value "10" and whose second cell contains the value "22"). Avro schemas can trivially be interpreted as grammars for such tagged data streams. Formal proofs involving Avro schemas can be carried out as proofs about languages and grammars.
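The grammar reading can be made concrete with a small sketch. The token encoding as (tag, value) pairs and the name emit_list are our own illustration, not part of any Avro API; the function walks a linked-list value and emits the tagged stream the schema generates.

```python
# Toy illustration of "schemas as grammars": walk a linked-list value and
# emit the tagged token stream generated by the linked-list schema above.
def emit_list(cell):
    tokens = []
    while cell is not None:
        tokens += [("record", "list"), ("field", "value"),
                   ("int", cell["value"]), ("field", "tail")]
        cell = cell["tail"]
        # the union tag says which branch follows: 0 = null, 1 = list
        tokens.append(("union", 1 if cell is not None else 0))
    return tokens

two_cells = {"value": 10, "tail": {"value": 22, "tail": None}}
# emit_list(two_cells) yields the stream shown above:
#   [record,"list"][field,"value"][int,10][field,"tail"][union,1]
#   [record,"list"][field,"value"][int,22][field,"tail"][union,0]
```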

So what does it mean for the canonical form of a schema to be "complete?" Let L(S) denote the language generated by the Avro schema S, and C(S) denote the canonical form of the schema. The canonicalization is complete if:

For all schemas S1 and S2,
    L(S1) = L(S2) ⇒ C(S1) = C(S2)
That is, for any two schemas that generate the same language, their canonicalizations are textually equivalent.

To prove this, we need to define some functions:

J is a variable name we often use to denote a JSON expression representing an Avro schema
C(J) is the Parsing Canonical Form of J as defined in the Avro specification
P(J) is the abstract-syntax graph (ASG) for an Avro schema, generated by parsing J (think of P(J) as a Schema Java object)
S is a variable name we often use to denote such ASGs
L(S) is the language generated by a schema ASG

With all these symbols defined, our completeness criterion is now rendered as:

∀ J1, J2: L(P(J1)) = L(P(J2)) ⇒ C(J1) = C(J2)
We'll prove this by breaking it into two parts:
(1): ∀ S1, S2: L(S1) = L(S2) ⇒ S1 ≅ S2
(2): ∀ J1, J2: P(J1) ≅ P(J2) ⇒ C(J1) = C(J2)
In this two-step decomposition, we've introduced a new operator ≅, which compares the ASGs of two Avro schemas. The ASG of an Avro schema can be viewed as a rooted, labeled, directed graph. Because Avro schemas can be recursive, these graphs can be cyclic. The ≅ operator is "true" between two ASGs when the sets of minimal labeled paths (no cycles, starting from the root) of the two ASGs are the same. (The Schema.equals(Object) method in the Avro implementation computes something close to this ≅ relation, except that ≅ ignores "irrelevant" attributes like doc and aliases.)
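A toy model of ≅ may help fix intuitions. Here an ASG is encoded as a plain dict mapping each node to a list of (edge_label, target) pairs; this encoding and the function names are ours, not Avro's Schema object model. Two rooted graphs compare ≅ when their sets of cycle-free labeled paths from the root agree, so in particular ≅ is insensitive to whether a subgraph is shared or duplicated.

```python
# Toy model of the ≅ relation on schema ASGs (our own encoding, not Avro's):
# a graph is a dict node -> [(edge_label, target), ...].
def labeled_paths(graph, node, seen=()):
    paths = {()}                       # the empty path is always present
    for label, target in graph.get(node, []):
        if target in seen:             # minimal paths only: never re-enter a node
            continue
        for tail in labeled_paths(graph, target, seen + (node,)):
            paths.add((label,) + tail)
    return paths

def asg_equiv(g1, root1, g2, root2):
    return labeled_paths(g1, root1) == labeled_paths(g2, root2)

# ≅ doesn't care whether a subgraph is shared or copied:
shared = {"root": [("a", "x"), ("b", "x")], "x": [("c", "leaf")]}
copied = {"root": [("a", "x1"), ("b", "x2")],
          "x1": [("c", "leaf")], "x2": [("c", "leaf")]}
assert asg_equiv(shared, "root", copied, "root")
```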

It turns out that, implicit in the Avro Specification, there are "canonicalization" rules that are important to our proof of completeness. In particular, the Avro Specification says that a name must be defined "before" it is used, and that a name cannot be defined more than once in a schema. Consider the following redefinition of the linked-list schema, for example:

  {"type":"record", "name":"list", "fields":[
    {"name":"value", "type":"int"},
    {"name":"tail",
      "type":["null", {"type":"record", "name":"list", "fields":[
                        {"name":"value", "type":"int"},
                        {"name":"tail", "type":["null", "list"]}]}]}
  ]}
In this redefinition, we've "unpacked" the recursion in the linked list by one level. In some sense, this is a perfectly fine definition of a linked list, and is operationally equivalent to the more compact version given earlier. So it makes sense that our claim of completeness is dependent upon this kind of "unpacking" not occurring in real schemas.

To deal with this issue in our proof, we pretend that the Avro specification does not require that named schemas be defined just once, and be defined "before" they are used. Rather, we treat this requirement as an additional transformation rule in the definition of Parsing Canonical Form:

  [MINIMIZE] Replace every definition of a named schema, other than the one at its first use, with a simple reference to that name.

(As in the Avro spec, "first use" is defined as the first occurrence in a depth-first, left-to-right traversal of the schema abstract-syntax graph (ASG).)
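A toy version of this MINIMIZE transformation can be sketched as follows. It is our own simplification: it handles records only, and assumes field types are primitives, name references, unions written as lists, or nested records.

```python
import json

# Toy MINIMIZE (ours; records only): in a depth-first, left-to-right walk,
# keep the definition at a name's first use and replace any later
# definition of the same name with a plain reference to it.
def minimize(node, seen=None):
    seen = set() if seen is None else seen
    if isinstance(node, dict) and node.get("type") == "record":
        if node["name"] in seen:
            return node["name"]              # re-definition -> name reference
        seen.add(node["name"])
        return {**node,
                "fields": [{**f, "type": minimize(f["type"], seen)}
                           for f in node["fields"]]}
    if isinstance(node, list):               # union
        return [minimize(b, seen) for b in node]
    return node                              # primitive or name reference

unrolled = json.loads('''
  {"type":"record", "name":"list", "fields":[
    {"name":"value", "type":"int"},
    {"name":"tail",
      "type":["null", {"type":"record", "name":"list", "fields":[
                        {"name":"value", "type":"int"},
                        {"name":"tail", "type":["null", "list"]}]}]}]}''')
compact = json.loads('''
  {"type":"record", "name":"list", "fields":[
     {"name":"value", "type":"int"},
     {"name":"tail",  "type":["null", "list"]}]}''')
assert minimize(unrolled) == compact   # the unpacked list re-packs
```

On the "unpacked" linked list above, this rewrites the inner re-definition of "list" back into a reference, recovering the compact schema.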

Getting back to the proof of (1) and (2) from above, we need to introduce more functions:

P(J)=PA(PJ(J)) - decompose parser into:
  PJ is the JSON parser
  PA is the Avro parser (takes JSON ASTs as input)
C(J)=CJ(CA(CM(J))) - decompose canonicalization into:
  CM(J) the MINIMIZE step
  CA(J) Avro normalizations
  CJ(J) JSON normalizations
M(S) is the "named-schema NFA minimization" of S
"Named-schema NFA minimization" is similar to general NFA minimization, except that we only collapse nodes and edges related to named schema entities and not other nodes. For example, we would not collapse the nodes associated with int or union schemas.

Our proof of (1) looks like this (this proof refers to lemmas (3) and (4), which are defined later):

  ∀ S1, S2: L(S1) = L(S2)
    ⇒ M(S1) = M(S2)    by (3)
    ⇒ S1 ≅ S2          by (4)
Here's the proof of (2) (this proof refers to lemmas (4)-(7), which are defined later):
  ∀ J1, J2: P(J1) ≅ P(J2)
    ⇒ M(P(J1)) = M(P(J2))                by (4)
    ⇒ P(CM(J1)) = P(CM(J2))              by (5)
    ⇒ PA(PJ(CM(J1))) = PA(PJ(CM(J2)))    by definition of P
    ⇒ PJ(CA(CM(J1))) = PJ(CA(CM(J2)))    by (6)
    ⇒ CJ(CA(CM(J1))) = CJ(CA(CM(J2)))    by (7)
    ⇒ C(J1) = C(J2)                      by definition of C
Here are the lemmas needed above:
(3): ∀ S1, S2: L(S1) = L(S2) ⇒ M(S1) = M(S2)
(4): ∀ S1, S2: M(S1) = M(S2) ⇔ S1 ≅ S2
(5): ∀ J: M(P(J)) = P(CM(J))
(6): ∀ J1, J2: PA(PJ(J1)) = PA(PJ(J2)) ⇒ PJ(CA(J1)) = PJ(CA(J2))
(7): ∀ J1, J2: PJ(J1) = PJ(J2) ⇒ CJ(J1) = CJ(J2)

Proving the lemmas:

  (3) This says that the language-related part of our canonicalization is complete, i.e., M finds the equivalence-classes of L. We imagine one could prove this by adapting a proof that the equality of LL(1) grammars is a decidable problem. We haven't gotten very far in showing this, however.
  (4) The right-hand direction of this follows from the definition of minimization. The left-hand direction seems correct, but we're not sure how to prove it (we think it also follows from the definition of minimization).
  (5) This says that the MINIMIZE step (which is done on JSON expressions) is equivalent to doing a named-schema NFA minimization on the ASG representation. This should follow pretty directly from a detailed definition of M, if we provided one.
  (6) This says that the Avro-related part of our canonicalization is complete, i.e., that CA finds the equivalence-classes of PA.
  (7) This says that the JSON-related part of our canonicalization is complete, i.e., that CJ finds the equivalence-classes of PJ. Note that, implicitly, this lemma ranges over only JSON expressions that are legal Avro schemas with no doc strings or default values, and thus (for example) doesn't need to worry about normalization of floating-point literals.
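The last of these lemmas can be seen in miniature with Python's json module as a toy stand-in: json.loads plays the role of PJ, and a deterministic re-serialization plays the role of CJ. (The real CJ uses the spec's normalization rules, not alphabetical key order; sort_keys=True is just what makes this toy CJ a function of the parse tree alone.)

```python
import json

# Lemma (7) in miniature: if two texts parse (PJ) to the same tree, then a
# CJ that depends only on the tree must produce equal strings for them.
def cj(text):
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

j1 = '{"type" : "int"}'
j2 = '{"type":"int"}'
assert json.loads(j1) == json.loads(j2)   # PJ(J1) = PJ(J2) ...
assert cj(j1) == cj(j2)                   # ... implies CJ(J1) = CJ(J2)
```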

3.0 Concluding remarks

Engineers have a history of running ahead of formal mathematical proofs, when things "seem correct" to them. In this case, it seems pretty obvious that Parsing Canonical Form is complete as well as sound, and we should go ahead and treat it as such. At the same time, formal proofs often turn up corner cases and exceptions that are valuable to document and account for. Thus, it'd be nice if someone could provide a better completeness argument than we've been able to so far.