XML Schema: The W3C's Object-Oriented Descriptions for XML

Category: Programming
Author: Eric van der Vlist
3.0
This Month Stack Overflow 1

Comments

by anonymous   2019-01-13

Your issue has a resolution, but it will not be pretty. Here's why:

Violation of non-deterministic content models

You've touched on the very soul of W3C XML Schema's. What you are asking — variable order and variable unknown elements — violates the hardest, yet most basic principle of XSD's, the rule of Non-Ambiguity, or, more formally, the Unique Particle Attribution Constraint:

A content model must be formed such that during validation [..] each item in the sequence can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

In normal English: when an XML is validated and the XSD processor encounters <SurName> it must be able to validate it without first checking whether it is followed by <GivenName>, i.e., no looking forward. In your scenario, this is not possible. This rule exists to allow implementations through Finite State Machines, which should make implementations rather trivial and fast.

This is one of the most-debated issues and is a heritage of SGML and DTD (content models must be deterministic) and XML, that defines, by default, that the order of elements is important (thus, trying the opposite, making the order unimportant, is hard).

As Marc_s already suggested, Relax_NG is an alternative that allows for non-deterministic content models. But what can you do if you're stuck with W3C XML Schema?

Non-working semi-valid solutions

You've already noticed that xs:all is very restrictive. The reason is simple: the same non-deterministic rule applies and that's why xs:any, min/maxOccurs larger then one and sequences are not allowed.

Also, you may have tried all sorts of combinations of choice, sequence and any. The error that the Microsoft XSD processor throws when encountering such invalid situation is:

Error: Multiple definition of element 'http://example.com/Chad:SurName' causes the content model to become ambiguous. A content model must be formed such that during validation of an element information item sequence, the particle contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

In O'Reilly's XML Schema (yes, the book has its flaws) this is excellently explained. Furtunately, parts of the book are available online. I highly recommend you read through section 7.4.1.3 about the Unique Particle Attribution Rule, their explanations and examples are much clearer than I can ever get them.

One working solution

In most cases it is possible to go from an undeterministic design to a deterministic design. This usually doesn't look pretty, but it's a solution if you have to stick with W3C XML Schema and/or if you absolutely must allow non-strict rules to your XML. The nightmare with your situation is that you want to enforce one thing (2 predefined elements) and at the same time want to have it very loose (order doesn't matter and anything can go between, before and after). If I don't try to give you good advice but just take you directly to a solution, it will look as follows:

<xs:element name="User">
    <xs:complexType>
        <xs:sequence>
            <xs:any minOccurs="0" processContents="lax" namespace="##other" />
            <xs:choice>
                <xs:sequence>                        
                    <xs:element name="GivenName" />
                    <xs:any minOccurs="0" processContents="lax" namespace="##other" />
                    <xs:element name="SurName" />
                </xs:sequence>
                <xs:sequence>
                    <xs:element name="SurName" />
                    <xs:any minOccurs="0" processContents="lax" namespace="##other" />
                    <xs:element name="GivenName" />
                </xs:sequence>
            </xs:choice>
            <xs:any minOccurs="0" processContents="lax" namespace="##any" />
        </xs:sequence>
        <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>
</xs:element>

The code above actually just works. But there are a few caveats. The first is xs:any with ##other as its namespace. You cannot use ##any, except for the last one, because that would allow elements like GivenName to be used in that stead and that means that the definition of User becomes ambiguous.

The second caveat is that if you want to use this trick with more than two or three, you'll have to write down all combinations. A maintenance nightmare. That's why I come up with the following:

A suggested solution, a variant of a Variable Content Container

Change your definition. This has the advantage of being clearer to your readers or users. It also has the advantage of becoming easier to maintain. A whole string of solutions are explained on XFront here, a less readable link you may have already seen from the post from Oleg. It's an excellent read, but most of it does not take into account that you have a minimum requirement of two elements inside the variable content container.

The current best-practice approach for your situation (which happens more often than you may imagine) is to split your data between the required and non-required fields. You can add an element <Required>, or do the opposite, add an element <ExtendedInfo> (or call it Properties, or OptionalData). This looks as follows:

<xs:element name="User2">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="GivenName" />
            <xs:element name="SurName" />
            <xs:element name="ExtendedInfo" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" namespace="##any" />
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>

This may seem less than ideal at the moment, but let it grow a bit. Having an ordered set of fixed elements isn't that big a deal. You're not the only one who'll be complaining about this apparent deficiency of W3C XML Schema, but as I said earlier, if you have to use it, you'll have to live with its limitations, or accept the burden of developing around these limitations at a higher cost of ownership.

Alternative solution

I'm sure you know this already, but the order of attributes is by default undetermined. If all your content is of simple types, you can alternatively choose to make a more abundant use of attributes.

A final word

Whatever approach you take, you will loose a lot of verifiability of your data. It's often better to allow content providers to add content types, but only when it can be verified. This you can do by switching from lax to strict processing and by making the types themselves stricter. But being too strict isn't good either, the right balance will depend on your ability to judge the use-cases that you're up against and weighing that in against the trade-offs of certain implementation strategies.