Embedded Languages

I don’t like them.

I’ve ranted before about how the Web is a festering polyglot made horrific by Postel’s Law. Many, including Tim Bray, advocate surfacing more information at the client end when an error occurs while parsing the steaming pile of HTML that forms today’s Web pages. I almost fell in line with this reasoning, because more information is better, right? I thought a draconian policy would so irritate customers that businesses would be quick to fix errors and expend much effort on preventing them, and so all the Web would become well-formed.

Oh how wrong I was!
Mark Pilgrim recounts an interesting tale at http://diveintomark.org/archives/2004/01/14/thought_experiment:

Imagine that you posted a long rant about how this is the way the world should work, that clients should be the gatekeepers of wellformedness, and strictly reject any invalid XML that comes their way. You click ‘Publish’, you double-check that your page validates, and you merrily close your laptop and get on with your life.

A few hours later, you start getting email from your readers that your site is broken. Some of them are nice enough to include a URL, others simply scream at you incoherently and tell you that you suck. (This part of the thought experiment should not be terribly difficult to imagine either, for anyone who has ever dealt with end-user bug reports.) You test the page, and lo and behold, they are correct: the page that you so happily and validly authored is now not well-formed, and it is not showing up at all in any browser. You try validating the page with a third-party validator service, only to discover that it gives you an error message you’ve never seen before and that you don’t understand.

You pore through the raw source code of the page and find what you think is the problem, but it’s not in your content. In fact, it’s in an auto-generated part of the page that you have no control over. What happened was, someone linked to you, and when they linked to you they sent a trackback with some illegal characters (illegal for you, not for them, since they declare a different character set than you do). But your publishing tool had a bug, and it automatically inserted their illegal characters into your carefully and validly authored page, and now all hell has broken loose.

You desperately jump to your administration page to delete the offending trackback, but oh no! The administration page itself tries to display the trackbacks you’ve received, and you get an XML processing error. The same bug that was preventing your readers from reading your published page is now preventing you from fixing it! You’re caught in a catch-22. … All the while, your page is completely inaccessible and visibly broken, and readers are emailing you telling you this over and over again.

Here’s the thing: that wasn’t a thought experiment; it all really happened. It’s a funny story, actually, because it happened to Nick Bradbury, on the very page where he was explaining why it was so important for clients to reject non-wellformed XML. His original post was valid XHTML, and his surrounding page was valid XHTML, but a trackback came in with a character that wasn’t in his character set, and Typepad didn’t catch it, and suddenly his page became non-wellformed XML.
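To make the failure concrete, here is a minimal sketch in Python. The page and the trackback excerpt are made up for illustration (this is not Typepad’s actual pipeline); the point is how a byte sequence that is legal under the sender’s character set, spliced in verbatim, leaves the receiving page non-well-formed.

    # Sketch of the failure in the story: a page declared as UTF-8 gets a
    # trackback excerpt pasted in as raw bytes. The excerpt was written in
    # Latin-1, so its accented character is not valid UTF-8, and a strict
    # XML parser rejects the entire page.
    import xml.etree.ElementTree as ET

    excerpt_latin1 = 'Great post, Andr\xe9!'.encode('latin-1')  # 0xE9 alone is not valid UTF-8

    # Naive publishing tool: byte-level string splicing, no transcoding, no checks.
    page_bytes = (b'<?xml version="1.0" encoding="utf-8"?>'
                  b'<html><body><p>' + excerpt_latin1 + b'</p></body></html>')

    try:
        ET.fromstring(page_bytes)   # a draconian client, like a strict feed reader
    except ET.ParseError as err:
        print('page is no longer well-formed:', err)

A lenient HTML renderer would still display those bytes, mojibake and all; it is the draconian XML client that turns the bad splice into a page nobody can read.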

The moral of the story is actually not one of well-formedness and draconian client validation, but one of security. It should not be possible for somebody else to break your system. The mechanism by which we include foreign content in our pages is fundamentally broken: HTML systems usually build pages by templated string processing, and that practice is what produces the problems above. It’s an issue of content injection and a lack of sandboxing that is merely masquerading as one of well-formedness and validation. Embedded languages shall never escape this quagmire.
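By way of contrast, here is a hedged sketch of a saner inclusion boundary, again in Python, with an illustrative helper of my own invention (not any particular publishing tool’s API): decode the foreign bytes explicitly, escape anything markup-significant, and only then drop the result into the template.

    # Sketch: treat foreign content as data, not as markup. Decode it at the
    # boundary with its declared encoding, escape it, then template it in.
    # include_foreign and the sample excerpt are illustrative, not a real API.
    import xml.etree.ElementTree as ET
    from xml.sax.saxutils import escape

    def include_foreign(excerpt_bytes, declared_encoding='latin-1'):
        # Replace anything undecodable rather than letting raw bytes through.
        text = excerpt_bytes.decode(declared_encoding, errors='replace')
        # Escape markup-significant characters so the excerpt cannot inject tags.
        return escape(text)

    excerpt = b'Great post, Andr\xe9! <script>alert("pwned")</script>'
    page = ('<?xml version="1.0" encoding="utf-8"?>'
            '<html><body><p>{}</p></body></html>').format(include_foreign(excerpt))

    ET.fromstring(page.encode('utf-8'))   # still well-formed, whatever the excerpt contained
    print(page)

The particular helper matters less than where the work happens: foreign content is reduced to inert text at the boundary, so a bad trackback can at worst look ugly; it can no longer break the page or inject markup into it.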