Anyone have examples of XML that can be mutated? My guess is that it wouldn't ta...

masklinn · on Dec 14, 2020

> I expect that a similar problem will be found in many other libraries, if the XML was publicized. XML namespaces made a critical... "mistake" is probably too strong, but "design choice that deviated too far from people's mental model" is about right... that has prevented them from being anywhere near as useful or safe as they could be. In an XML document using XML namespaces, "ns1:tagname" may not equal "ns1:tagname", and "ns1:tagname" can be equal to "ns2:tagname". This breaks people's mental models of how XML works, and correspondingly, breaks people's code that manipulates XML.

That right there is why I like Clark's notation (despite its unholy verbosity), which I learned of because that's how ElementTree manipulates namespaces: in Clark's notation, the document is conceptually

    <{https://sample.com/1}tag>
      <{https://blah.org/1}tag>
        <{https://blah.org/2}tag>
          <{https://anewsite.com/xmlns}tag />
        </{https://blah.org/2}tag>
      </{https://blah.org/1}tag>
    </{https://sample.com/1}tag>

Which is unambiguous. But as you note, it adds challenges for round-trip equality (in fact ElementTree doesn't maintain that; it simply discards the namespace prefixes on parsing, which I have seen outright break supposed XML parsers that were really hard-coded for specific namespace prefixes).

lxml does round-trip prefixes (though it still doesn't round-trip documents) by including a namespace map on each element.
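A minimal sketch of the point above, using Python's standard `xml.etree.ElementTree`. The `urn:example:*` URIs and the document itself are made up for illustration; the only thing being shown is that ElementTree exposes tags in Clark's notation, so the prefix text stops mattering once the document is parsed.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "ns1" is rebound inside <inner>, and "ns2"
# points at the same URI that "ns1" has at the top level.
DOC = """\
<root xmlns:ns1="urn:example:a">
  <ns1:tagname/>
  <inner xmlns:ns1="urn:example:b">
    <ns1:tagname/>
  </inner>
  <ns2:tagname xmlns:ns2="urn:example:a"/>
</root>"""

root = ET.fromstring(DOC)

first = root[0].tag          # ns1:tagname at the top level
rebound = root[1][0].tag     # ns1:tagname after ns1 was redeclared
other_prefix = root[2].tag   # ns2:tagname, bound to the top-level URI

# Same prefix text, different expanded names.
print(first, rebound, first == rebound)
# Different prefix text, same expanded name.
print(first, other_prefix, first == other_prefix)
```

Comparing the expanded names rather than the prefixed strings is what keeps "ns1:tagname" from being confused with a differently bound "ns1:tagname".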

dasyatidprime · on Dec 14, 2020

Not round-tripping prefixes breaks other things, and this is around where I start more thoroughly agreeing that XML namespaces tried to do too much at once: mainly, there are XML-based definitions, I think including XPath and XSLT's use of it (or old XSLT's use of what might not have been XPath yet?), that use the ambient namespace prefix set as context for interpreting attribute values that contain XML names. If you might ever interchange some prefixes, you might even get similar “looks valid but means the wrong thing” problems.
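A rough illustration of that failure mode, assuming Python's standard `xml.etree.ElementTree` as the round-tripping tool. The `select` attribute and the `urn:example:a` URI are invented here; they stand in for any attribute whose value contains a prefixed name.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the attribute value "a:item" depends on the
# in-scope binding of the prefix "a".
SRC = '<rules xmlns:a="urn:example:a" select="a:item"><a:item/></rules>'

# Parse and re-serialize. ElementTree discards the original prefixes on
# parsing and generates its own when writing the document back out.
out = ET.tostring(ET.fromstring(SRC), encoding="unicode")
print(out)

# The element names get a freshly generated prefix, but the attribute
# value is opaque text to the serializer and is copied through as-is.
prefix_still_declared = 'xmlns:a="' in out
value_unchanged = 'select="a:item"' in out
print(prefix_still_declared, value_unchanged)
```

After the round trip the value still says `a:item`, but nothing in the output declares `a` any more, so a consumer that resolves the value against the in-scope prefixes has nothing to resolve it with.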

And if you try to combine namespaces with DTDs (which is just an explosive mix to start with, and I think is just recommended to never do) you get other problems, because you're no longer allowed to add arbitrary namespace declarations in the middle, so anything that round-trips prefixes but might ever add redundant declarations of them won't reliably produce something that DTD-validates, and if you're transforming into a DTD from something that might have used other namespaces, you have to make sure to remove all the extra declarations, and…

Note that most of this is still “well-defined”, it's just awkwardly hairy. This is not to be taken as an excuse to implement the standard badly or incorrectly if you're going to handle it at all.

dasyatidprime · on Dec 14, 2020

In other words, the minds of a large number of human specifiers and implementors have their own roundtrip corruption flaws where when you pass the XML namespaces specification through them, you get something out that's incoherent and doesn't interoperate properly, creating representation mismatch problems down the line back in the digital world.

dasyatidprime · on Dec 14, 2020

To add to this with some (potentially out-of-date!) personal experience: sometime around 2008, now somewhere in my dusty directories, I started implementing my own Ruby XML DOM-like library (based on the Parsifal XML lexer in C) mainly to handle this properly, because the most “normal” REXML did something really horrible in its API for namespaced attributes that made them almost unusable by clients of the library. (I forget why I couldn't use the Ruby bindings to libxml2 at the time; I think maybe they didn't support several things I wanted to do, and I couldn't add them without patching and vendoring libxml2, and that would be its own disaster because of shared libraries, etc.) Specifically, attribute accesses didn't always work consistently with namespaces, the callers had to handle prefix management themselves in places, and looping over the attributes of an element required you to handle both “single” attributes and sub-collections of namespaced attributes if there were any qualified and unqualified attributes with the same short name, because the second was what the outer collection was indexed by…

(I originally wrote this comment with a more fully worked-out example, but after viewing it in context I realized it was way too long to be an only-partly-on-topic comment on this thread, so I'll probably move it to a post elsewhere and submit it later.)

lyxsus · on Dec 14, 2020

> This breaks people's mental models of how XML works, and correspondingly, breaks people's code that manipulates XML.

Because they usually have an incorrect mental model. Blaming namespaces for name ambiguity would be the same as blaming the code "x = a + b" because "a" and "b" could be defined differently.

Namespace prefixes are absolutely irrelevant, they only exist for your convenience.

tptacek · on Dec 14, 2020

That doesn't seem accurate at all. It would be the case if there was some deterministic abbreviation from URL namespace qualifiers down to namespace prefixes, but there is not; instead, they are template variables, which can be shuffled throughout an XML document, requiring security software to constantly and reliably keep track of the value of the variable at multiple points. People sign URLs and JSON documents all the time with schemes that don't have this goofy property.
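To make the bookkeeping concrete, here is a sketch of what "keeping track of the value of the variable" involves, using `iterparse` from Python's standard library with its `start-ns` and `end-ns` events. The function name `expand_prefixed_attrs`, the `type` attribute, and the URIs are all hypothetical, and this is not how any particular security product does it.

```python
import io
import xml.etree.ElementTree as ET

def expand_prefixed_attrs(xml_text, attr_name):
    """Resolve prefixed values of `attr_name` against the bindings in scope.

    Returns a list of (raw value, namespace URI or None, local name).
    """
    bindings = []   # stack of (prefix, uri) pairs currently in scope
    resolved = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end-ns", "start")):
        if event == "start-ns":
            bindings.append(payload)      # payload is a (prefix, uri) tuple
        elif event == "end-ns":
            bindings.pop()                # a binding went out of scope
        else:
            value = payload.get(attr_name)
            if value is not None and ":" in value:
                prefix, local = value.split(":", 1)
                # The innermost binding for the prefix wins.
                uri = next((u for p, u in reversed(bindings) if p == prefix), None)
                resolved.append((value, uri, local))
    return resolved

DOC = """\
<schema xmlns:t="urn:example:one">
  <element type="t:id"/>
  <group xmlns:t="urn:example:two">
    <element type="t:id"/>
  </group>
  <element type="t:id"/>
</schema>"""

results = expand_prefixed_attrs(DOC, "type")
for raw, uri, local in results:
    print(raw, uri, local)
```

The same literal text `t:id` appears three times and resolves to two different namespaces depending on where it sits, which is the property that code comparing raw strings gets wrong.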

There's a similar problem with XML entity references, which have been happily breaking enterprise security for over a decade, because nobody has a good mental model of how entities in XML documents actually behave.

It seems fair at this point to blame the standard.

jerf · on Dec 15, 2020

In hindsight, it probably would have been better to define standard prefixes, let people just sort of register their own for non-standard ones in whatever manner is suitable for their particular top-level document type, and if somebody, somewhere out there did finally manage to stomp on each other, let that particular type of document where that happened deal with it.

While technically suboptimal compared to what currently exists, it would match people's expectations better, and in practice, I can't speak for everybody, but I just don't see a whole lot of documents with hundreds+ namespaces such that collisions are a realistic possibility. And when I do see documents with a lot of namespaces (XMPP, for instance, or XHTML+SVG+some other thing), there's still a top-level type that could keep its own registry just fine. A bit of guidance on naming extensions probably ("don't call it e:, work your name in somehow like with the initial of your company or something") would have 99.9% solved the problem.

Prior to seeing what happened I'd probably still have argued for the current namespaces spec. In principle it doesn't seem that complicated to me. But I'm obviously wrong in practice, because, like I said, I can hardly cite an example of them being used correctly at all.

(Likewise, in hindsight, entities shouldn't have been able to be recursive, and if we were spec'ing out the next generation of XML I'd straight-up remove them except for the ones necessary to XML itself, &lt;, &gt;, and &amp;, because UTF covers the major use case of entities now. I'd discard the "terrible, terrible templating language" use case entirely.)

pvg · on Dec 15, 2020

> In principle it doesn't seem that complicated to me. But I'm obviously wrong in practice, because, like I said, I can hardly cite an example of them being used correctly at all.

A snarky-but-mostly-true oversimplification: the complexity was necessary because XML was supposed to become a machine-readable interchange format for everything but it ended up not becoming that due to the complexity.

progval · on Dec 14, 2020

> instead, they are template variables, which can be shuffled throughout an XML document, requiring security software to constantly and reliably keep track of the value of the variable at multiple points

Isn't the issue here that they are mixing this templating with the business logic? They should be fine if the XML parser (or some post-processing) expanded the namespaces and business logic didn't see them at all.

> People sign URLs and JSON documents all the time with schemes that don't have this goofy property.

Similarly, that might be a design issue. They should only sign documents they 100% built and serialized themselves, so the set of tags and namespaces.

lyxsus · on Dec 14, 2020

> That doesn't seem accurate at all. It would be the case if there was some deterministic abbreviation from URL namespace qualifiers down to namespace prefixes, but there is not;

I'm not sure what you mean by that, tbh. It seems to me that namespace expansion is absolutely straightforward and deterministic. There are scopes, yes, but they're well-defined too (if that's what you mean).

tptacek · on Dec 14, 2020

Yes, you are describing the same feature I am with slightly different words. It obviously causes problems. You could describe XML entity expansion in simple terms too, and it would remain one of the major causes of game-over vulnerabilities in enterprise software over the last decade.

lyxsus · on Dec 14, 2020

Well, yeah, true.

I believe it's mostly implementation and popularisation problems.

The W3C specs surrounding XML/XPath/XSLT/RDF etc. are very well designed, but it's possible to appreciate them only after you spend a ridiculously unreasonable amount of time reading and putting them all together. Otherwise it looks like a stupid pile of complexity with no purpose.

And what upsets me the most is the lack of really good libraries, everything I worked with just sucks so much.

I still have a hope that maybe in 5-15 years things will change.

layoutIfNeeded · on Dec 14, 2020

> Namespace prefixes are absolutely irrelevant, they only exist for your convenience.

This is false. As soon as you need XML canonicalization you very much need those prefixes exactly as they were present in the original document.

lyxsus · on Dec 14, 2020

It doesn't affect the data model encoded in the document even a tiny bit. Namespace prefixes are irrelevant. If changing these prefixes breaks the program, the program is incorrect.

layoutIfNeeded · on Dec 14, 2020

Again, this is false.

“The C14N-20000119 Canonical XML draft described a method for rewriting namespace prefixes such that two documents having logically equivalent namespace declarations would also have identical namespace prefixes. The goal was to eliminate dependence on the particular namespace prefixes in a document when testing for logical equivalence. However, there now exist a number of contexts in which namespace prefixes can impart information value in an XML document. For example, an XPath expression in an attribute value or element content can reference a namespace prefix. Thus, rewriting the namespace prefixes would damage such a document by changing its meaning (and it cannot be logically equivalent if its meaning has changed).”

https://www.w3.org/TR/xml-c14n/#NoNSPrefixRewriting
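For anyone who wants to see the two behaviours side by side, here is a small sketch using `xml.etree.ElementTree.canonicalize` from Python's standard library (available since Python 3.8). Note the assumption: that function implements C14N 2.0, not the Canonical XML 1.0 text quoted above, so it only illustrates prefix-preserving versus prefix-rewriting canonicalization in general. The two documents and the `urn:example:x` URI are made up.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in the prefix they chose.
DOC_A = '<a:item xmlns:a="urn:example:x"><a:child/></a:item>'
DOC_B = '<b:item xmlns:b="urn:example:x"><b:child/></b:item>'

# Default: prefixes are kept, so the canonical forms (and therefore any
# digest computed over them) differ.
kept_a = ET.canonicalize(DOC_A)
kept_b = ET.canonicalize(DOC_B)
print(kept_a == kept_b)

# Opt-in prefix rewriting: both documents collapse to the same bytes.
rewritten_a = ET.canonicalize(DOC_A, rewrite_prefixes=True)
rewritten_b = ET.canonicalize(DOC_B, rewrite_prefixes=True)
print(rewritten_a == rewritten_b)
```

Rewriting makes the element names compare equal, but it is only safe when nothing in the content refers to the old prefixes, which is exactly the case the quoted passage says cannot be assumed.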

lyxsus · on Dec 14, 2020

> However, there now exist a number of contexts in which namespace prefixes can impart information value in an XML document.

Well, yeah. They've given in to a mass of half-assed implementations? So what? I think it's our moral duty to ignore it :)

oever · on Dec 14, 2020

DTD does not know about namespaces and checks against "prefix:local-name".

E.g. the xhtml dtd will not accept this:

  <h:html xmlns:h="http://www.w3.org/1999/xhtml"/>

If you want to change prefixes, use XML Schema or Relax NG.

Ygg2 · on Dec 15, 2020

DTD is the devil spawn. Devil here being massive security vulnerabilities.

lyxsus · on Dec 14, 2020

I would say use XML Schema at least. DTD looks alien to XML anyway.

jsmith45 · on Dec 15, 2020

It is not in general legal to change prefixes and reserialize an XML document. Some official XML formats including XML Schema allow attribute values to reference prefixes in xs:QName types. One needs to bind the schema to detect that.

But it gets worse with XSLT using the prefixes in XPath expressions in attributes. If the prefixes are changed, those values also need to be updated to change the prefix too, which requires complete knowledge of the format. This is because one cannot programmatically detect something like attributes that use custom data types that reference the prefixes in scope, but XSLT's XPath expressions show that W3C considers it legal to create such custom formats.

GoblinSlayer · on Dec 15, 2020

You can see test cases: https://github.com/mattermost/xml-roundtrip-validator/blob/m...

Though they mention something called xml directive. I don't think such a thing exists.

politician · on Dec 14, 2020

Do you think there is a path forward for the Go team to release an XML library without namespace support that simply errors when they are encountered ("XML namespaces are considered harmful")?

jerf · on Dec 14, 2020

They release something not called "encoding/xml". They could do what they did to the syscall package. The syscall package, by its nature, can't conform to the 1.0 compatibility promise Go itself maintains, because it changes outside of the scope of the Go project. So they froze the syscall package at some point, and then offered one in the golang.org/x/ namespace at https://pkg.go.dev/golang.org/x/sys .

I would again emphasize that encoding/xml, to my knowledge, only has problems with this particular roundtripping use case. It can consume non-namespaced XML correctly, and handle namespaced XML as long as you don't plan on re-emitting XML.

What would probably end up happening is a new package appearing on github.com, forked off of encoding/xml, for this use case. (If you're looking for a project that might attain some use, this is a likely candidate.) Unlike something like Python where the core packages are often C-based and thus you can expect better performance from the built-in "set" than somebody's pure-Python "set" implementation from before the built-in, encoding/xml is just a pile of pure Go code whose only advantage is that it ships with the compiler. Anyone can replace it without incurring any other disadvantage whenever they like.

(I looked a few versions ago, FWIW; encoding/xml has deviated so much from what I forked that my fork is essentially dead and no longer releasable without basically starting over from scratch. Plus I built it with the idea that it should be a minimal modification (so I could port it forward, which turned out to not work, but it's still how it was built)... if I was truly forking I'd have done some more extensive changes to it to support namespaces in general, rather than for my particular case.)

Anyhow, upshot, the Go project as a whole is not stuck... it is specifically encoding/xml as the standard, built-in library that is stuck. It's not like Go is completely incapable of handling XML correctly from first principles for some reason or anything.

tptacek · on Dec 14, 2020

There is no good reason for the standard library to include a SAML-safe XML, which is its own huge project, and which is useful only for that one standard. SAML implementations should include their own purpose-built, defensively-written XMLs.

sleepydog · on Dec 14, 2020

XML namespaces are ubiquitous. The utility of such a library would be very questionable.

While they do have the problems described, XML namespaces are what allow for abstraction and composition of documents from disparate systems.

tptacek · on Dec 14, 2020

You can do that, but SAML is heavily namespaced.

dolmen · on Dec 15, 2020

No need to be something published by the Go team.