
Anyone have examples of XML that can be mutated? My guess is that it wouldn't take much.

I expect that a similar problem will be found in many other libraries, if the XML was publicized. XML namespaces made a critical... "mistake" is probably too strong, but "design choice that deviated too far from people's mental model" is about right... that has prevented them from being anywhere near as useful or safe as they could be. In an XML document using XML namespaces, "ns1:tagname" may not equal "ns1:tagname", and "ns1:tagname" can be equal to "ns2:tagname". This breaks people's mental models of how XML works, and correspondingly, breaks people's code that manipulates XML.

(I actually used the Go XML library as an SVG validator in the ~1.8 timeframe and had to fork it to fix namespaces well enough to serve in that role. I didn't know how to exploit it in a specific XML protocol, but I've known about the issues for a while. "Why didn't you upstream it then?" Well, as this security bulletin implies, the data structures in encoding/xml are fundamentally wrong for namespaced XML to be round-tripped and there is no backwards-compatible solution to the problem, so it was obvious to me without even trying that it would be rejected. This has also been discussed on a number of tickets over the years, so the fact that XML namespace handling is weak in the standard library is not news to the Go developers. Note also that it's "round-tripping" that is the problem; if you parse & consume you can write correct code, it's the sending it back out that can be problematic.)

Namespaces fundamentally rewrite the nature of XML tag and attribute names. No longer are they just strings; now they are tuples of the form (namespace URL, tag name)... and namespace URL is NOT the prefix that shows up before the colon! The prefix is an abbreviation of an earlier tag declaration. So in the XML

    <tag xmlns="https://sample.com/1" xmlns:example1="https://blah.org/1">
      <example1:tag xmlns:example2="https://blah.org/2">
        <example2:tag xmlns:example1="https://anewsite.com/xmlns">
          <example1:tag />
        </example2:tag>
      </example1:tag>
    </tag>
not a SINGLE ONE of those "tag"s is the same! They are, respectively, actually (https://sample.com/1, tag), (https://blah.org/1, tag), (https://blah.org/2, tag), and (https://anewsite.com/xmlns, tag). There's a ton of code, and indeed, even quite a few standards, that will get that wrong. (Note the redefinition of 'example1' in there; that is perfectly legal.) Even more excitingly,

    <tag xmlns="https://sample.com/1" xmlns:example1="https://sample.com/1">
      <example1:tag/>
      <example2:tag xmlns:example2="https://sample.com/1" />
    </tag>
ARE all the exact same tag and should be treated as such, despite the different "tag names" appearing.
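A minimal sketch of that second example, assuming Python's standard xml.etree.ElementTree as the parser (my choice for illustration, not something the example depends on). It resolves every prefix while parsing and reports each name as "{namespace URL}tagname":

```python
import xml.etree.ElementTree as ET

DOC = """\
<tag xmlns="https://sample.com/1" xmlns:example1="https://sample.com/1">
  <example1:tag/>
  <example2:tag xmlns:example2="https://sample.com/1" />
</tag>
"""

root = ET.fromstring(DOC)

# root.iter() walks the root and its descendants in document order.
# Each el.tag is already the resolved "{namespace}localname" form.
names = [el.tag for el in root.iter()]

for name in names:
    print(name)

# All three elements collapse to a single (namespace, local name) pair.
print(len(set(names)))
```

Comparing on those resolved names, rather than on the text before the colon, is what treats the three spellings as one tag.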

Reserializing these can be exciting, because A: Your XML library, in principle, ought to be presenting you the (XMLNS, tagname) tuple with the abbreviation stripped away, to discourage you from paying too much attention to the abbreviation but B: humans in general and a lot of code expect the namespace abbreviations to stay the same in a round trip, and may even standardize on what the abbreviations should be. There's a LOT of code out there in the world looking for "'p' or 'xhtml:p'" as the tag name and not ("http://www.w3.org/1999/xhtml", "p").

In general, to maintain roundtrip equality, you have to either A: maintain a table of the abbreviations you see, when they were introduced, and which one was used, or B: just use the (XMLNS, tagname) and ensure while outputting that the relevant namespaces have always been declared. I generally go for option B as it's easier to get correct, and I pair it with a table of the most common namespaces for what I'm working in, so that, for example, XHTML gets a hard-coded "xhtml:" prefix. It is very easy if you try to implement A to screw it up in a way that can corrupt the namespaces on some input.

(Option B has its own pathologies. Consider:

    <tag xmlns:sample="https://example.com/1">
      <sample:tag1 />
      <sample:tag2 />
    </tag>
It's really easy to write code that will drop the xmlns declaration that covers all of the children of "tag", since "tag" itself didn't use it, and if your code throws away where the XMLNS was declared and just looks at whether the NS is currently declared, it'll emit a new declaration of the "sample" namespace on every usage. Technically correct if the downstream code handles namespaces correctly (big if!), but visually unappealing.)
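Option B and the pathology above can be sketched roughly as follows. This is an illustration under assumptions, not anyone's actual implementation: it uses Python's ElementTree for parsing, the PREFERRED table and the serialize helper are hypothetical names, and it ignores comments, processing instructions, and the reserved xml: prefix.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Hypothetical table of preferred prefixes for namespaces we expect to see.
PREFERRED = {
    "http://www.w3.org/1999/xhtml": "xhtml",
    "https://example.com/1": "sample",
}

def split_clark(name):
    """Split '{uri}local' into (uri, local); uri is None for unqualified names."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return uri, local
    return None, name

def serialize(el, in_scope=None):
    """Emit el from (namespace, local name) pairs only, declaring each
    namespace on the first element in a subtree that needs it."""
    in_scope = dict(in_scope or {})  # uri -> prefix; copied so siblings share nothing
    new_decls = []

    def qname(name):
        uri, local = split_clark(name)
        if uri is None:
            return local
        if uri not in in_scope:
            # Assumes PREFERRED never uses a generated-looking "nsN" prefix.
            prefix = PREFERRED.get(uri) or "ns%d" % len(in_scope)
            in_scope[uri] = prefix
            new_decls.append(" xmlns:%s=%s" % (prefix, quoteattr(uri)))
        return "%s:%s" % (in_scope[uri], local)

    tag = qname(el.tag)
    attrs = "".join(" %s=%s" % (qname(k), quoteattr(v)) for k, v in el.attrib.items())
    children = "".join(serialize(child, in_scope) for child in el)
    body = escape(el.text or "") + children
    head = "<%s%s%s" % (tag, "".join(new_decls), attrs)
    out = "%s>%s</%s>" % (head, body, tag) if body else head + "/>"
    return out + escape(el.tail or "")

DOC = '<tag xmlns:sample="https://example.com/1"><sample:tag1 /><sample:tag2 /></tag>'
out = serialize(ET.fromstring(DOC))
print(out)
```

Because the original location of the declaration is gone by the time serialize runs, the outer "tag" gets no declaration and each child carries its own xmlns:sample. The output still parses to the same (namespace, tagname) pairs; it is just noisier than the input.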

Not defending Go here, except inasmuch as it's such a common error to make that I have a hard time naming libraries and standards that get namespaces completely correct, for as simple as they are in principle. (I think SVG and XHTML have it right. XMPP is very, very close, but still has a few places where the "stream" tag is placed in different namespaces and you're just supposed to know to handle it the same in all the namespaces it appears in... which most people do only because it doesn't occur to them that technically these are separate tags, so it all kinda works out in the end.... libxml2 is correct but I've seen a lot of things that build on top of it and they almost all screw up namespaces.)



> I expect that a similar problem will be found in many other libraries, if the XML was publicized. XML namespaces made a critical... "mistake" is probably too strong, but "design choice that deviated too far from people's mental model" is about right... that has prevented them from being anywhere near as useful or safe as they could be. In an XML document using XML namespaces, "ns1:tagname" may not equal "ns1:tagname", and "ns1:tagname" can be equal to "ns2:tagname". This breaks people's mental models of how XML works, and correspondingly, breaks people's code that manipulates XML.

That right there is why I like Clark's notation (despite its unholy verbosity), which I learned of because that's how ElementTree manipulates namespaces: in Clark's notation, the document is conceptually

    <{https://sample.com/1}tag>
      <{https://blah.org/1}tag>
        <{https://blah.org/2}tag>
          <{https://anewsite.com/xmlns}tag />
        </{https://blah.org/2}tag>
      </{https://blah.org/1}tag>
    </{https://sample.com/1}tag>
Which is unambiguous. But as you note, it adds challenges for round-trip equality (in fact ElementTree doesn't maintain that; it simply discards the namespace prefixes on parsing, which I have seen outright break supposed XML parsers that were really hard-coded for specific namespace prefixes).
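For what it's worth, this is easy to see with the standard-library ElementTree. A small sketch using the nested example from the parent comment:

```python
import xml.etree.ElementTree as ET

DOC = """\
<tag xmlns="https://sample.com/1" xmlns:example1="https://blah.org/1">
  <example1:tag xmlns:example2="https://blah.org/2">
    <example2:tag xmlns:example1="https://anewsite.com/xmlns">
      <example1:tag />
    </example2:tag>
  </example1:tag>
</tag>
"""

root = ET.fromstring(DOC)

# Every element name comes back in Clark notation, prefixes already resolved.
clark_names = [el.tag for el in root.iter()]
for name in clark_names:
    print(name)

# Writing the tree back out: the original prefixes were never stored,
# so the serializer has to make up its own.
reserialized = ET.tostring(root, encoding="unicode")
print(reserialized)
```

The four printed names are all different even though every element is spelled "tag", and the reserialized text no longer contains "example1" or "example2" anywhere. It still parses back to the same Clark names, which is the only sense in which it round-trips.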

lxml does round-trip prefixes (though it still doesn't round-trip documents) by including a namespace map on each element.


Not round-tripping prefixes breaks other things, and this is around where I start more thoroughly agreeing that XML namespaces tried to do too much at once: mainly, there are XML-based definitions, I think including XPath and XSLT's use of it (or old XSLT's use of what might not have been XPath yet?), that use the ambient namespace prefix set as context for interpreting attribute values that contain XML names. If you might ever interchange some prefixes, you might even get similar “looks valid but means the wrong thing” problems.

And if you try to combine namespaces with DTDs (which is just an explosive mix to start with, and I think is just recommended to never do) you get other problems, because you're no longer allowed to add arbitrary namespace declarations in the middle, so anything that round-trips prefixes but might ever add redundant declarations of them won't reliably produce something that DTD-validates, and if you're transforming into a DTD from something that might have used other namespaces, you have to make sure to remove all the extra declarations, and…

Note that most of this is still “well-defined”, it's just awkwardly hairy. This is not to be taken as an excuse to implement the standard badly or incorrectly if you're going to handle it at all.


In other words, the minds of a large number of human specifiers and implementors have their own roundtrip corruption flaws where when you pass the XML namespaces specification through them, you get something out that's incoherent and doesn't interoperate properly, creating representation mismatch problems down the line back in the digital world.


To add to this with some (potentially out-of-date!) personal experience: sometime around 2008, now somewhere in my dusty directories, I started implementing my own Ruby XML DOM-like library (based on the Parsifal XML lexer in C) mainly to handle this properly, because the most “normal” REXML did something really horrible in its API for namespaced attributes that made them almost unusable by clients of the library. (I forget why I couldn't use the Ruby bindings to libxml2 at the time; I think maybe they didn't support several things I wanted to do, and I couldn't add them without patching and vendoring libxml2, and that would be its own disaster because of shared libraries, etc.) Specifically, attribute accesses didn't always work consistently with namespaces, the callers had to handle prefix management themselves in places, and looping over the attributes of an element required you to handle both “single” attributes and sub-collections of namespaced attributes if there were any qualified and unqualified attributes with the same short name, because the second was what the outer collection was indexed by…

(I originally wrote this comment with a more fully worked-out example, but after viewing it in context I realized it was way too long to be an only-partly-on-topic comment on this thread, so I'll probably move it to a post elsewhere and submit it later.)


> This breaks people's mental models of how XML works, and correspondingly, breaks people's code that manipulates XML.

Because they usually have an incorrect mental model. Blaming namespaces for name ambiguity would be the same as blaming the code "x = a + b" because "a" and "b" could be defined differently.

Namespace prefixes are absolutely irrelevant; they only exist for your convenience.


That doesn't seem accurate at all. It would be the case if there was some deterministic abbreviation from URL namespace qualifiers down to namespace prefixes, but there is not; instead, they are template variables, which can be shuffled throughout an XML document, requiring security software to constantly and reliably keep track of the value of the variable at multiple points. People sign URLs and JSON documents all the time with schemes that don't have this goofy property.

There's a similar problem with XML entity references, which have been happily breaking enterprise security for over a decade, because nobody has a good mental model of how entities in XML documents actually behave.

It seems fair at this point to blame the standard.


In hindsight, it probably would have been better to define standard prefixes, let people just sort of register their own for non-standard ones in whatever manner is suitable for their particular top-level document type, and if somebody, somewhere out there did finally manage to stomp on each other, let that particular type of document where that happened deal with it.

While technically suboptimal compared to what currently exists, it would match people's expectations better, and in practice, I can't speak for everybody, but I just don't see a whole lot of documents with hundreds+ namespaces such that collisions are a realistic possibility. And when I do see documents with a lot of namespaces (XMPP, for instance, or XHTML+SVG+some other thing), there's still a top-level type that could keep its own registry just fine. A bit of guidance on naming extensions probably ("don't call it e:, work your name in somehow like with the initial of your company or something") would have 99.9% solved the problem.

Prior to seeing what happened I'd probably still have argued for the current namespaces spec. In principle it doesn't seem that complicated to me. But I'm obviously wrong in practice, because, like I said, I can hardly cite an example of them being used correctly at all.

(Likewise, in hindsight, entities shouldn't have been able to be recursive, and if we were spec'ing out the next generation of XML I'd straight-up remove them except for the ones necessary to XML itself, &lt;, &gt;, and &amp; because UTF covers the major use case of entities now. I'd discard the "terrible, terrible templating language" use case entirely.)
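To make the entity point concrete, here is a deliberately tiny sketch, assuming Python's standard ElementTree (which expands internal entities by default). The entity names a, b, and c are made up for the example:

```python
import xml.etree.ElementTree as ET

# Each entity refers to the previous one twice, so the expansion doubles per level.
DOC = """<!DOCTYPE r [
  <!ENTITY a "xx">
  <!ENTITY b "&a;&a;">
  <!ENTITY c "&b;&b;">
]>
<r>&c;</r>"""

root = ET.fromstring(DOC)

# The parser hands back the fully expanded text; the reference "&c;" is gone.
expanded = root.text
print(len(expanded), repr(expanded))
```

Three levels turn a 4-character reference into 8 characters of text. Add more levels, or make each level reference the previous one ten times instead of twice, and the growth becomes the familiar denial-of-service shape, all from a document that looks small on the wire.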


> In principle it doesn't seem that complicated to me. But I'm obviously wrong in practice, because, like I said, I can hardly cite an example of them being used correctly at all.

A snarky-but-mostly-true oversimplification: the complexity was necessary because XML was supposed to become a machine-readable interchange format for everything but it ended up not becoming that due to the complexity.


> instead, they are template variables, which can be shuffled throughout an XML document, requiring security software to constantly and reliably keep track of the value of the variable at multiple points

Isn't the issue here that they are mixing this templating with the business logic? They should be fine if the XML parser (or some post-processing) expanded the namespaces and business logic didn't see them at all.

> People sign URLs and JSON documents all the time with schemes that don't have this goofy property.

Similarly, that might be a design issue. They should only sign documents they 100% built and serialized themselves, so that the set of tags and namespaces is entirely under their control.


> That doesn't seem accurate at all. It would be the case if there was some deterministic abbreviation from URL namespace qualifiers down to namespace prefixes, but there is not;

I'm not sure what you mean by that, tbh. It seems to me that namespace expansion is absolutely straightforward and deterministic. There are scopes, yes, but they're well-defined too (if that's what you mean).


Yes, you are describing the same feature I am with slightly different words. It obviously causes problems. You could describe XML entity expansion in simple terms too, and it would remain one of the major causes of game-over vulnerabilities in enterprise software over the last decade.


Well, yeah, true.

I believe it's mostly implementation and popularisation problems.

The W3C specs surrounding XML/XPath/XSLT/RDF etc. are very well designed, but it's possible to appreciate them only after you spend a ridiculously unreasonable amount of time reading them and putting them all together. Otherwise it looks like a stupid pile of complexity with no purpose.

And what upsets me the most is the lack of really good libraries, everything I worked with just sucks so much.

I still have a hope that maybe in 5-15 years things will change.


>Namespace prefixes are absolutely irrelevant, they only exists for your convenience.

This is false. As soon as you need XML canonicalization you very much need those prefixes exactly as they were present in the original document.


It doesn't affect the data model encoded in the document even a tiny bit. Namespace prefixes are irrelevant. If changing these prefixes breaks the program, the program is incorrect.


Again, this is false.

“The C14N-20000119 Canonical XML draft described a method for rewriting namespace prefixes such that two documents having logically equivalent namespace declarations would also have identical namespace prefixes. The goal was to eliminate dependence on the particular namespace prefixes in a document when testing for logical equivalence. However, there now exist a number of contexts in which namespace prefixes can impart information value in an XML document. For example, an XPath expression in an attribute value or element content can reference a namespace prefix. Thus, rewriting the namespace prefixes would damage such a document by changing its meaning (and it cannot be logically equivalent if its meaning has changed).”

https://www.w3.org/TR/xml-c14n/#NoNSPrefixRewriting


> However, there now exist a number of contexts in which namespace prefixes can impart information value in an XML document.

Well, yeah. They've given in to a mass of half-assed implementations? So what? I think it's our moral duty to ignore it :)


DTD does not know about namespaces and checks against "prefix:local-name".

E.g. the xhtml dtd will not accept this:

    <h:html xmlns:h="http://www.w3.org/1999/xhtml"/>
If you want to change prefixes, use XML Schema or Relax NG.


DTD is the devil spawn. Devil here being massive security vulnerabilities.


I would say use XML Schema at least. DTD looks alien to XML anyway.


It is not in general legal to change prefixes and reserialize an XML document. Some official XML formats including XML Schema allow attribute values to reference prefixes in xs:QName types. One needs to bind the schema to detect that.

But it gets worse with XSLT using the prefixes in XPath expressions in attributes. If the prefixes are changed, those values also need to be updated to use the new prefix, which requires complete knowledge of the format. This is because one cannot programmatically detect something like attributes that use custom data types that reference the prefixes in scope, but XSLT's XPath expressions show that W3C considers it legal to create such custom formats.
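A small sketch of that failure, assuming Python's standard ElementTree as the prefix-rewriting serializer. The "ref" attribute and the urn:example:a namespace are hypothetical stand-ins for an xs:QName-typed attribute or an XPath expression:

```python
import xml.etree.ElementTree as ET

# "ref" holds a prefixed name that only makes sense while the prefix "a" is bound.
DOC = '<root xmlns:a="urn:example:a" ref="a:item"><a:item/></root>'

root = ET.fromstring(DOC)
out = ET.tostring(root, encoding="unicode")
print(out)

# The element was renamed to a generated prefix, but the attribute value is
# opaque text to the library, so it is copied through unchanged.
print('ref="a:item"' in out)
print("xmlns:a=" in out)
```

After the round trip the element lives under a generated prefix, the attribute still says "a:item", and nothing in the output binds "a" any more. A generic library has no way to know that "ref" needed rewriting, which is the complete-knowledge-of-the-format problem.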


You can see test cases: https://github.com/mattermost/xml-roundtrip-validator/blob/m...

Though they mention something called xml directive. I don't think such a thing exists.


Do you think there is a path forward for the Go team to release an XML library without namespace support that simply errors when they are encountered ("XML namespaces are considered harmful")?
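To show roughly what I mean by "errors when they are encountered", here is a sketch in Python rather than Go, purely for illustration. It uses the standard xml.parsers.expat module with namespace processing left off; NamespaceRejected and check_no_namespaces are hypothetical names, and the rule is intentionally blunt (it also rejects xml:lang and friends):

```python
import xml.parsers.expat

class NamespaceRejected(ValueError):
    """Raised when a document uses any namespace declaration or prefixed name."""

def check_no_namespaces(text):
    # Without a namespace_separator argument, expat reports names exactly as
    # written, so "xmlns" attributes and "prefix:name" forms are visible here.
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        # attrs is a dict of raw attribute names to values.
        for n in [name, *attrs]:
            if n == "xmlns" or ":" in n:
                raise NamespaceRejected("namespaced name not allowed: %r" % n)

    parser.StartElementHandler = start
    parser.Parse(text, True)  # exceptions raised in handlers propagate to the caller

check_no_namespaces("<a><b c='d'/></a>")  # plain XML passes silently

try:
    check_no_namespaces('<p:a xmlns:p="urn:x"/>')
except NamespaceRejected as exc:
    print("rejected:", exc)
```

A real library would presumably build a tree as well; the point is only that refusing namespaces outright is a few lines on top of a non-namespace-aware parser.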


They release something not called "encoding/xml". They could do what they did to the syscall package. The syscall package, by its nature, can't conform to the 1.0 compatibility promise Go itself maintains, because it changes outside of the scope of the Go project. So they froze the syscall package at some point, and then offered one in the golang.org/x/ namespace at https://pkg.go.dev/golang.org/x/sys .

I would again emphasize that encoding/xml, to my knowledge, only has problems with this particular roundtripping use case. It can consume non-namespaced XML correctly, and handle namespaced XML as long as you don't plan on re-emitting XML.

What would probably end up happening is a new package appearing on github.com, forked off of encoding/xml, for this use case. (If you're looking for a project that might attain some use, this is a likely candidate.) Unlike something like Python where the core packages are often C-based and thus you can expect better performance from the built-in "set" than somebody's pure-Python "set" implementation from before the built-in, encoding/xml is just a pile of pure Go code whose only advantage is that it ships with the compiler. Anyone can replace it without incurring any other disadvantage whenever they like.

(I looked a few versions ago, FWIW; encoding/xml has deviated so much from what I forked that my fork is essentially dead and no longer releasable without basically starting over from scratch. Plus I built it with the idea that it should be a minimal modification (so I could port it forward, which turned out to not work, but it's still how it was built)... if I was truly forking I'd have done some more extensive changes to it to support namespaces in general, rather than for my particular case.)

Anyhow, upshot, the Go project as a whole is not stuck... it is specifically encoding/xml as the standard, built-in library that is stuck. It's not like Go is completely incapable of handling XML correctly from first principles for some reason or anything.


There is no good reason for the standard library to include a SAML-safe XML, which is its own huge project, and which is useful only for that one standard. SAML implementations should include their own purpose-built, defensively-written XMLs.


XML namespaces are ubiquitous. The utility of such a library would be very questionable.

While they do have the problems described, XML namespaces are what allow for abstraction and composition of documents from disparate systems.


You can do that, but SAML is heavily namespaced.


No need to be something published by the Go team.



