Newsgroups: comp.lang.lisp
Subject: Re: XML and lisp
From: Erik Naggum <e...@naggum.net>
Message-ID: <3207626455633924@naggum.net>
Organization: Naggum Software, Oslo, Norway
User-Agent: Gnus/5.0808 (Gnus v5.8.8) Emacs/20.7
Date: Fri, 24 Aug 2001 07:21:01 GMT
NNTP-Posting-Date: Fri, 24 Aug 2001 09:21:01 MET DST

* Tim Bradshaw <t...@tfeb.org>
> ((:reply :title "Lisp is not just a programming language")
>  (:body
>   (:p "It is also a text-markup language,
>       and many other things, as you can see here"
>    "For instance with a suitable (small) macro, this is quite legal
>    Lisp syntax, which is compiled to *ML.  I have written significantly-sized
>    documents in this notation."))
>  (:signature "--tim"))

As long as we think aloud in alternative syntaxes, I actually prefer to break the _incredibly_ stupid syntactic-only separation of elements and attribute values. SGML and its descendants have made a crucial mistake: For every level of container (there are about 7 of them), there is a new syntax for _two_ properties of the container: (1) the contents is wrapped in one syntax, but (2) the "writing on the box" is in quite another. This means that information and meta-information are massively different concepts, and this artificial separation runs through the whole SGML design. Each level offers a new way to write the two differently. This is what makes it so goddamn hard to reason about SGML documents and to do reasonably intelligent transformations on them without working your butt off specifying all sorts of irrelevant stuff that does _nothing_ but get in your way.

I have come to _loathe_ the half-assed hybrid that some XML-in-Lisp tools use and produce, because it makes XML just as evil in Lisp as it was in XML to begin with, and we have gained absolutely nothing in either power of processing or in abstraction, which is so very un-Lisp-like.

<foo bar="zot">quux</foo> should be read as (foo (bar "zot") "quux") and most definitely _NOT_ as ((:foo :bar "zot") "quux"), which turns this fairly reasonable structure into a morass of complexity worse than it was to begin with. And it does _NOT_ help to represent empty elements only with a keyword. Using three different levels of nesting to represent a single concept is Just Plain Wrong. Also, using keywords is not a good idea because there needs to be a lot of related information associated with elements and attributes, in different contexts, not to mention all the things they do with their funny "namespaces" these days.

Whether something is an attribute or element is _completely_ arbitrary. It is based on some arbitrary choices in the design process that reveal absolutely no inherent qualities. For purely pragmatic reasons, SGML folks will use attributes for some things and elements for others because their tools can deal with some things in attributes and some things in elements. The faulty idea that attributes say something "about" the element and sub-elements somehow constitute its contents is the same kind of premature structuring that code suffers from under premature optimization. The whole language is incredibly misdesigned in making that distinction.

As for writing SGML/XML/HTML/whatever, I have a simple way to get rid of the annoying verbosity of these stupid languages while _retaining_ that mistaken distinction between attribute values and elements, because it is quite hard to make simple regular expression-based conversions retain enough data about an element to decide what should be attribute and element. An element has the form <name [attributes] | [contents]>. Attributes have the form <name | value>. Internal whitespace is only for readability.

XML                         Enamel (NML)            CL
<foo/>                      <foo>                   (foo)
<foo bar="zot"/>            <foo <bar|zot>>         (foo (bar "zot"))
<foo>zot</foo>              <foo|zot>               (foo "zot")
<foo bar="zot">quux</foo>   <foo <bar|zot> |quux>   (foo (bar "zot") "quux")
<foo>Hey, &quux;!</foo>     <foo|Hey, [quux]!>      (foo "Hey, " quux "!")
<foo>AT&T you will</foo>    <foo|AT&T you will>     (foo "AT&T you will")
<foo><bar>zot</bar></foo>   <foo|<bar|zot>>         (foo (bar "zot"))

So I have almost none of the annoying and arbitrary quote/escape mania in attribute values or contents alike, either. Entities I write as [name], and they end up in the Lisp version as symbols, unless they stand for a character purely for syntactic reasons, in which case they become that character. Writing "code" in this language is actually amazingly painless compared to the produced noise. Besides, with a few simple modify-syntax-entry calls in Emacs, I get < and > to match and blink and I can move up and down the structure very easily.

For processing this stuff in Common Lisp, it is _sometimes_ neat to convert the single | attribute/content marker into the zero-length symbol, ||, so pathological cases like

  <foo bar="zot"><bar>"zot"</bar></foo>

which could have been written like this to show how arbitrary the syntactic distinction in SGML/XML is

  <foo <bar|zot>|<bar|zot>>

come out as

  (foo (bar "zot") || (bar "zot"))

The really interesting thing is that writing in Enamel and producing XML is so easy that a Perl or Lisp function that takes an Enamel string as argument and produces XML is quite simple and straightforward. This makes for some interesting-looking "scripting" that blows the mind of the miserable little wrecks that think they have to type the endtag, the quotes and all the other user-inimical features of SGML/XML.

In my personal view, Lisp "markup" has the disadvantage of needing lots of quotes, while Enamel has the strong advantage that in <xxx|yyy>, xxx is always symbolic and yyy is always a string of characters subject to interpretation by whatever the symbolic part instructs in context.

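To give a rough idea of how small such an Enamel-to-XML function can be, here is a minimal sketch. It is not the author's code: it is written in Python rather than Perl or Lisp purely for illustration, the names enamel_to_xml, _element and _attrs are hypothetical, and the grammar is only what can be inferred from the table above. Two assumptions go beyond the table: whitespace around an attribute value is stripped, and a bare & in contents is emitted as &amp; so the output is well-formed XML (the table shows it unescaped).

```python
from xml.sax.saxutils import escape, quoteattr

NAME_END = " \t\r\n<|>"  # characters that terminate an element name


def _attrs(attrs):
    """Render (name, value) pairs as XML attributes with quoted values."""
    return "".join(f" {name}={quoteattr(value)}" for name, value in attrs)


def _element(s, i):
    """Parse the Enamel element starting at s[i]; return (xml, next_index)."""
    if i >= len(s) or s[i] != "<":
        raise ValueError("expected '<'")
    i += 1
    start = i
    while i < len(s) and s[i] not in NAME_END:
        i += 1
    name = s[start:i]
    if not name:
        raise ValueError("missing element name")

    attrs = []
    while True:  # attribute section: zero or more <name|value>
        while i < len(s) and s[i].isspace():
            i += 1  # internal whitespace is only for readability
        if i >= len(s):
            raise ValueError("unterminated element")
        if s[i] == "<":
            bar = s.index("|", i)      # ValueError if the | is missing
            end = s.index(">", bar)
            # Assumption: whitespace around attribute names/values is dropped.
            attrs.append((s[i + 1:bar].strip(), s[bar + 1:end].strip()))
            i = end + 1
        elif s[i] == ">":              # no | marker at all: empty element
            return f"<{name}{_attrs(attrs)}/>", i + 1
        elif s[i] == "|":              # the attribute/content marker
            i += 1
            break
        else:
            raise ValueError(f"unexpected {s[i]!r} in attribute section")

    parts = []  # contents: text, [entity] references, nested elements
    while i < len(s):
        c = s[i]
        if c == ">":
            body = "".join(parts)
            return f"<{name}{_attrs(attrs)}>{body}</{name}>", i + 1
        if c == "<":
            child, i = _element(s, i)
            parts.append(child)
        elif c == "[":
            end = s.index("]", i)
            parts.append("&" + s[i + 1:end] + ";")
            i = end + 1
        else:
            parts.append(escape(c))    # Assumption: & becomes &amp;
            i += 1
    raise ValueError("unterminated element")


def enamel_to_xml(src):
    """Convert a string holding exactly one Enamel element to XML."""
    src = src.strip()
    xml, pos = _element(src, 0)
    if pos != len(src):
        raise ValueError("trailing text after element")
    return xml


print(enamel_to_xml("<foo <bar|zot> |quux>"))  # <foo bar="zot">quux</foo>
```

The sketch is a plain recursive-descent reader: the only state is the current position, and the single | is what switches from the attribute section to the contents. Entity references inside attribute values, and any way of writing a literal <, >, [ or | in contents, are not covered by the table and are left out here.
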
Since the key feature of markup languages is the separation of text from markup, the simple idea in Enamel should carry enough force to make this a fully realizable goal without making an artificial syntactic separation between information and meta-information at any level. If the syntax is good enough for the information, it should be good enough for the meta-information, and I think Enamel is.

Fortunately, I do not have to create a whole new international following and engage in godawful politics to use a better syntax for XML and the like, since XML and the like are only used as interchange syntaxes these days. Nobody in their right mind actually writes anything by hand in such stupid languages that require so much attention to incredibly insignificant details and incomprehensibly irrelevant redundancy, anyway, do they? :)

Finally, note that in Enamel, a complete element is enclosed in <...> and that means it can be subject to a nice little Common Lisp reader macro, and it can be taught to recognize other stuff, as well, such as the neat concept of interpolating expression values where {expression} occurs.

Still at "internal use" stage, I plan to publish some stuff about Enamel not too far into the future.

///