Digital Media

Moshell - Spring 2000

Lecture 5a: XML

This lecture is based on information from the book The XML Handbook, Second Edition, by C. F. Goldfarb and Paul Prescod. NOTE: This lecture replaces LECTURE 19 in the 1999 series, which was full of wrong information! We're calling it Lecture 5a because it comes after 5 and before 6 in the sequence of the Spring 2000 course.
 

Presentation, Rendition, Markup, Abstraction

A presentation of a document is exactly what the reader perceives: on a screen, that means the pixels you see, with fonts, spacing, pictures etc. Clearly a given document can have many different presentations: a flat text file, a nice typeset version, and even someone reading it out loud.

A rendition of that document is a file which contains information sufficient to produce a presentation of it. Common formats for renditions include PostScript, PDF and HTML.

PostScript is a language for communicating typography and page layout to printers. It is possible to hand-write or edit PostScript, which is in an ASCII form, but most of the time PostScript is emitted by programs such as PageMaker or QuarkXPress. We examined some of its features back in Lecture 10 when we discussed Ghostscript - a tool that can be used to convert PostScript graphical output into many different forms such as .gif files. If your printer doesn't handle PostScript directly (and most don't), it can sometimes be a hassle to print PostScript files.

PDF, also known as Adobe Acrobat format, is a compressed page layout format that was designed for distributing finished documents via CD or over the web. Usually a PDF file is smaller than the corresponding PostScript file. Acrobat Reader is a free utility which is used to display and print PDF files. Acrobat Reader seems to work flawlessly with almost everybody's printer.

A markup of a document is a version of the document which contains special symbols called tags. There are several possible purposes for tags, one of which is to produce a rendition. But there is some deeper logic going on here. Why do we have typographic conventions (such as italics and bold face) in the first place? To convey specific meanings. What if we could directly tag the categories of meaning, and then let some automatic process transform that markup into a rendition?

For instance, a simple memo might be represented as

<datewritten>1/24/2000</datewritten>
<to> Charles Goldfarb </to>
<from> Paul Prescod </from>
<re>John Smith's Will </re>
<p>John Smith wants to update his will. Another wife left him on 11/24/99. </p>

The advantages would be enormous. You could automate the construction of presentations, by using a style sheet which said how your company likes memos to look. But much more importantly, the document could also be automatically searched for information, in an intelligent fashion.

Searching plain text is an error-filled process. If you wanted to search a bunch of text files containing some memos to find the dates they were written, you would stumble across all kinds of dates and would not be able to distinguish them from the document's creation date. With the tagging scheme shown above, you could easily do so.
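Here is a rough sketch of that difference in Python, using the standard library's xml.etree.ElementTree. It assumes the memo has been wrapped in a single root element, which I've called memo (XML wants one root element; the name is made up for this example). A pattern search finds everything that looks like a date; a lookup by tag finds only the date the memo was written.

```python
import re
import xml.etree.ElementTree as ET

# The memo from above, wrapped in a hypothetical <memo> root element.
MEMO = """<memo>
<datewritten>1/24/2000</datewritten>
<to>Charles Goldfarb</to>
<from>Paul Prescod</from>
<re>John Smith's Will</re>
<p>John Smith wants to update his will. Another wife left him on 11/24/99.</p>
</memo>"""


def dates_by_pattern(text):
    """Find anything shaped like a date, with no idea what each one means."""
    return re.findall(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b", text)


def date_written(xml_text):
    """Return only the contents of the <datewritten> element."""
    root = ET.fromstring(xml_text)
    return root.findtext("datewritten").strip()


print(dates_by_pattern(MEMO))  # both dates, indistinguishable from each other
print(date_written(MEMO))      # just the date the memo was written
```

The pattern search has no way to tell the writing date from the date the wife left; the tag lookup never sees the second date at all.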

This is the basic idea of XML - mark the meaningful categories of information in a document. We call these categories abstractions.

Abstractions are not all that abstract. For instance, "datewritten" in my example above is an abstraction: it names a category of information, not a way of displaying it. Now it might be that your company likes memos to look like this:

-------

DATE:1/24/2000
TO:Charles Goldfarb
FROM:Paul Prescod
RE:John Smith's Will



John Smith wants to update his will. Another wife left him on 11/24/99.

-------

But you couldn't reliably search such records for the date written, because there is no rule forbidding you to use the uppercase boldface word DATE in the body of the document. You just can't rely on presentation information to manage abstraction information.
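Going the other direction is easy, though. The layout above can be generated mechanically from the tagged memo, which is the job a style sheet does. The following is only a toy stand-in for a real style sheet language: the memo root element, the HEADER_STYLE table and the render_memo function are all names invented for this sketch.

```python
import xml.etree.ElementTree as ET

# The tagged memo, wrapped in a hypothetical <memo> root element.
MEMO = """<memo>
<datewritten>1/24/2000</datewritten>
<to> Charles Goldfarb </to>
<from> Paul Prescod </from>
<re>John Smith's Will </re>
<p>John Smith wants to update his will. Another wife left him on 11/24/99. </p>
</memo>"""

# A made-up "house style": which label is printed for which element, in order.
HEADER_STYLE = [("DATE", "datewritten"), ("TO", "to"), ("FROM", "from"), ("RE", "re")]


def render_memo(xml_text):
    """Produce the plain-text memo layout from the tagged version."""
    root = ET.fromstring(xml_text)
    lines = [f"{label}:{root.findtext(tag).strip()}" for label, tag in HEADER_STYLE]
    lines.append("")  # blank line between the header and the body
    lines.extend(p.text.strip() for p in root.findall("p"))
    return "\n".join(lines)


print(render_memo(MEMO))
```

Changing how memos look means editing HEADER_STYLE, not the memos. The abstraction information stays put while the presentation changes.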

Three Principles Behind XML


1. Programs should support a common document standard.

2. Each industry (e.g. law) needs a specific standard for its special terms and activities.

3. The markup should be based on abstractions, not directly on presentation features.

HTML sorta addressed Principle 1, but it was a "one size fits all" solution. A number of attempts were made to develop common style sheets for industries, to conform to Principle 2, but it just didn't work. Too many folks within a given industry had too many ways of making things look different.

It was only with the realization of Principle 3 that the light began to dawn. When you're working at the right level of abstraction, Principle 2 becomes easier. It then becomes possible to develop formal Document Type Definitions (DTDs), which can be used to check if all the pieces are there. For instance, lawyers might develop a DTD for wills, which would specify all the details required in a legal will. Then a validity checker program can be used to see if a given document conforms to the DTD it allegedly obeys.
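To give a feel for what a validity checker does, here is a hand-rolled sketch for the memo example rather than for wills. Python's standard XML parser checks that a document is well-formed but does not validate against a DTD, so the one rule is coded by hand. The DTD line in the comment, the memo root element and the function name are assumptions made up for this illustration; a real validating parser would read the rule from the DTD itself.

```python
import xml.etree.ElementTree as ET

# A hypothetical DTD rule for memos might read:
#   <!ELEMENT memo (datewritten, to, from, re, p+)>
# meaning: exactly these children, in this order, ending in one or more <p>.
REQUIRED_HEADER = ["datewritten", "to", "from", "re"]


def is_valid_memo(xml_text):
    """Check one document against the hand-coded memo rule above."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed, so it cannot be valid
    if root.tag != "memo":
        return False
    tags = [child.tag for child in root]
    header = tags[:len(REQUIRED_HEADER)]
    body = tags[len(REQUIRED_HEADER):]
    return header == REQUIRED_HEADER and len(body) >= 1 and all(t == "p" for t in body)


GOOD = ("<memo><datewritten>1/24/2000</datewritten><to>Charles Goldfarb</to>"
        "<from>Paul Prescod</from><re>John Smith's Will</re>"
        "<p>John Smith wants to update his will.</p></memo>")
NO_DATE = ("<memo><to>Charles Goldfarb</to><from>Paul Prescod</from>"
           "<re>John Smith's Will</re><p>John Smith wants to update his will.</p></memo>")
UNCLOSED = "<memo><p>John Smith wants to update his will.<p></memo>"

print(is_valid_memo(GOOD), is_valid_memo(NO_DATE), is_valid_memo(UNCLOSED))
```

Note the two different ways to fail: NO_DATE is well-formed XML that breaks the memo rule, while UNCLOSED is not well-formed at all.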

Historically in fact, this was done in the right order. An abstraction-based language called GML was developed at IBM starting around 1969, and its successor SGML - Standard Generalized Markup Language - became an ISO standard in 1986. It's a very heavy duty tool, with lots of structure to it. But it had some real simplicity at its core. Tags were marked with <> brackets, and all files were straight text so you could work on them with any text editor. The SGML system was used by IBM and Boeing and others with huge documentation problems to solve, and it matured and worked well.

HTML was developed in the spirit of SGML, perhaps in a relationship something like "18-wheeler Mack truck compared to a tricycle." That is, HTML is MUCH simpler and doesn't attempt to maintain much in the way of abstractions. It's a language for expressing renditions, but it is at least SLIGHTLY abstract. It declares, for instance, that there must be heading levels H1 through H6, but then lets your browser determine how they will look.

XML is born. There was a big gap between SGML's super-duper power and HTML's low-level friendly webstuff. Ultimately XML was developed as a kind of a "Chrysler mini-van." Not as complex as the Mack truck, but with much more abstraction power than the HTML tricycle.
 

Databases and Documents

We used to think of databases and documents as different kinds of information. Databases had data stored in fields which were formally named, indexed, etc. whereas documents had running text in them, with some rendition information. But with the concept of SGML/XML, documents ARE databases. They can be systematically searched!
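As a small sketch of what "documents ARE databases" can mean in practice, the following treats a file of tagged memos like a table and asks a database-style question of it. Everything here is hypothetical: the memos and memo wrapper elements, the function name, and the second and third memos, which are invented so the query has something to filter.

```python
import xml.etree.ElementTree as ET

# A made-up collection of memos; only the first one comes from the example above.
MEMOS = """<memos>
<memo><datewritten>1/24/2000</datewritten><to>Charles Goldfarb</to><from>Paul Prescod</from><re>John Smith's Will</re></memo>
<memo><datewritten>2/3/2000</datewritten><to>Paul Prescod</to><from>Charles Goldfarb</from><re>Lunch</re></memo>
<memo><datewritten>2/10/2000</datewritten><to>Charles Goldfarb</to><from>Paul Prescod</from><re>Will, second draft</re></memo>
</memos>"""


def subjects_from(xml_text, sender):
    """Roughly: SELECT re FROM memos WHERE from = sender."""
    root = ET.fromstring(xml_text)
    return [memo.findtext("re")
            for memo in root.iter("memo")
            if memo.findtext("from") == sender]


print(subjects_from(MEMOS, "Paul Prescod"))
```

The element names play the role of field names, and each memo plays the role of a record.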
 

XML Structure

Well, I'm outa time to write it up, but other folks have done so. Look at

http://www.projectcool.com/developer/xmlz/index.html

I'll add to this page when I get a chance.

JMM