Presenting conservation information on the internet

Conservation is a trade that uses engineering principles in an environment dominated by historians and archaeologists. Most conservators have an education in the arts and humanities. Few conservators enjoy an easy relationship, meaning shared coffee breaks, with scientists, and keeping up with the technical literature by browsing in the nearest university library from time to time does not come naturally.

The internet should be a great help to conservators, a small profession spread thinly around the world. Writing to the net is not as trivial as writing an email, so there are few conservator-driven web sites outside the larger institutes. A good example of an institutional web site is that of the American Institute for Conservation, which has published the full articles from its Journal from 1977 to 2000 (at the time of writing). The AIC also publishes a very detailed account of how it has taken early issues, available only in paper form, and turned them into XML (more about this acronym in a moment) for web publication.

The AIC project was a very expensive operation. It used the talents of the technical publication industry. Technical books are increasingly being written in XML, which allows extensive indexing, cross referencing and publication in several formats, all from the same original code. This requires collaboration between an information technologist and the writer. At present it is not easy to find out how to use XML for a modest project, such as a personal website about one's own speciality.

Web writing by an individual

I started writing the articles in this compendium in 1995, when HTML was simple, some would say primitive, and consequently easy to write in a text editor. Each article was uploaded to a file system on the web server and an index page was written to bind everything together with hyperlinks. The burden of maintaining the navigation links was not too great, because there weren't many pages to control. The complicated bit was the web server, which at that time was almost always run by a large institution. The difference between writing for the web and writing for a paper publisher was only one of style: the web requires a more compact and direct style, because reading from the screen is more tiresome.

Nowadays the situation is quite different. Anyone can buy space on a web server for about ten euros per month. On the other hand the pressure to complicate the code is severe. The prevailing fashion ordains that pages shall be embellished with pictures of little relevance, 'eye candy' in IT jargon. They should also have fancy navigation menus that jump out as the mouse wanders over the screen. Many programs have been written to facilitate writing these complicated pages. They all suffer from the same fatal fault, from the academic author's viewpoint: they are very short lived, and the code they produce cannot easily, or at all, be taken in by a newer program which may be used to continue development of the site. The developers of the AIC web journal found that most of the original texts for the articles were in unrecoverable digital formats, in obsolete programs or in obsolete storage formats. Many had to be retyped, maybe after a first run of the paper version through an optical character recognition process. The incompatibility and obscurity of early word processor formats is matched by the incomprehensible complexity of modern, machine-generated HTML pages.

Many eager writers have been seduced by the promises of these programs. The results are of varying visual quality but are always difficult to maintain, seen in the perspective of years. Furthermore the intended appearance cannot be ensured over all viewers on all browsers, because of the complexity of the browsers' rendering engines (the program code that generates the visual display from the text markup) and the multiplicity of browsers, all interpreting differently an ambiguously worded standard.

Taming the complexity of digital data and ensuring durability through standards

The problem of non-durable data formats has been tackled by introducing XML, the Extensible Markup Language. The data format of XML is plain text with embedded tags within angle brackets, for example <sect1> for the beginning of a main section within a chapter. The tags provide more detailed information than the HTML language, which can be regarded as a simplified XML. A tag would define, for example, that a short fragment of text is the name of a journal in a literature reference. HTML cannot do this - it only has tags to force the journal name to be shown in italics, and the same italic tag can be used to emphasise any word in the body of the text.
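As an illustration of the difference, compare the two ways of marking up a literature reference. The XML tag names here are invented for the example, in the general style of DocBook rather than any particular schema:

```xml
<!-- XML: the markup says what each fragment of text *is* -->
<biblioentry>
  <author>A. Author</author>
  <title>An article title</title>
  <journaltitle>Journal of Examples</journaltitle>
  <volumenum>1</volumenum>
</biblioentry>

<!-- HTML: the markup only says how the text should *look* -->
<p>A. Author, 'An article title', <i>Journal of Examples</i>, vol. 1.</p>
```

A program can reliably extract every journal name from the XML version; from the HTML version it can only find text that happens to be italic.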

The use of the text format and the universality of the markup syntax immediately give good durability to articles written in XML, but the high precision, and corresponding complexity, of XML makes this text very tedious to write freehand in a text editor. Besides the main body of text, XML requires accompanying files explaining the meaning of the tags, and programs to validate the logic of the document, so that a chapter heading is not immediately followed by a fifth level of subdivision, for example. Other programs use a separate style sheet to convert the content-oriented XML markup into the appearance-oriented HTML, though browsers are beginning to be able to read XML directly.
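One common language for such style sheets is XSLT, itself written in XML. A minimal sketch: suppose the schema has an invented tag <journaltitle> marking journal names; this rule turns every occurrence into italic HTML:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- wherever a journaltitle element occurs in the XML source,
       emit its contents wrapped in the HTML italic tag -->
  <xsl:template match="journaltitle">
    <i><xsl:apply-templates/></i>
  </xsl:template>
</xsl:stylesheet>
```

The decision that journal names appear in italics is thus made once, in one file, not repeated through every article.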

XML is very useful for scientific data which must be in a very tightly defined format to be any use at all: infrared spectra, for example. XML is developing into a language which allows programs from different manufacturers, running on different operating systems, to exchange data. XML is widely used in technical publishing, where the same document must be printable on demand, visible on a screen and maybe even readable on a radio-connected personal digital assistant. Such diversity of media would be useful for an aircraft maintenance manual, for example.

Precision brings complexity

But is this system a practical option for the content of a student compendium like this one? There is immense pressure to make all information as versatile as possible. Throughout my career I have worked beside people inventing ingenious database systems with separate fields for all conceivable fragments of information from an investigation. These all failed because the task is so enormous: indexing every aspect of all materials and artifacts used by mankind over millennia. The database design and thesaurus building took longer than the database program lasted as a commercial product. Conservators were unwilling to spend hours typing reports into the immense number of separate cells on the screen, without any guarantee that the result was more durable than the span of one person's interest in maintaining the system. The relatively new XML standard defines a data structure which supersedes all these earlier attempts. How much effort should we put into learning skills that become obsolete in a few years?

It is difficult to predict which way development will go. Even getting familiar enough with the concepts to make a reasonable judgement is a time consuming effort.

Writing in an age of rapidly changing technology

If I were starting to write this collection of articles now, I would use the DocBook system, which uses XML. The XML-tagged document looks rather like HTML but cannot be read directly by a web browser: the XML file must first be translated to HTML. The same XML file can also be translated into PDF, or any other format. This translation can be done at the time the page is requested from the web browser, so the viewer can click on 'page 2' or 'printable version' or 'pdf' and the web server will generate the appropriate page on the spot. All this requires a considerable time to set up, and it requires considerable control over the web server. I am using a commercial web hosting company, which does not allow this degree of control, unless one pays several times the cost of the basic service.

I have settled for a compromise. There is a dialect of XML called XHTML, which can be rendered directly by a browser. I have edited the old pages to make them valid XHTML pages. XHTML enjoys the advantages of XML: it can be translated into many different final products. Most of the labour of translation from HTML was actually removing complications in the original pages. The layout is now controlled much more simply, by a style defining section in a single file.

HTML code does not print well, so many pages, eventually all, are also available in PDF (Portable Document Format - a format for exchange of page-formatted documents, owned by the Adobe corporation but writable by several open source programs). A great many scientific articles are now put on the web as PDF documents, but this is not so obvious an advantage as it once seemed. PDF documents are big, because they contain formatting information for the printer, and the reader may not wish to print the document out. There is a good case for keeping HTML, or XHTML, as the primary format, using PDF only for documents that need exact layout. Conversion to PDF can be automated to some extent, but since the printed page has much more subtlety of layout, a good deal of refinement by hand is necessary. I have done this in the typesetting program LaTeX, which is much used by physicists and mathematicians because it handles mathematics very well.

Writing an interactive web site

Answering back to the website was a facility built into the first HTML standard. A form is displayed, in which the reader writes a message that is sent back to the server. At the server, this information is given to a program named by a hidden instruction embedded in the HTML code that was sent out to render the form on the viewer's computer.

At this point, the problem of durability appears again. The communication between the viewer writing in the form and the web server receiving this information is standardised in a protocol called CGI (the Common Gateway Interface), but once the information arrives at the server it is handed to a program which anyone could have written, in whatever language the server's operating system can manage.
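The page side of this exchange is only a few lines of HTML. The script name here is invented for illustration; the action attribute is the hidden instruction naming the program that will receive the message at the server:

```html
<!-- the form the reader fills in; on submission the browser sends
     the contents of 'message' to the named program at the server -->
<form method="post" action="/cgi-bin/comment.cgi">
  <textarea name="message" rows="5" cols="60"></textarea>
  <input type="submit" value="Send comment">
</form>
```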

In the early days, these programs were written in C, which has for many years been the fundamental language of computing. Lately, many alternatives have been developed which are more convenient to write and are usable by amateur programmers. A powerful accelerator of interactive web site design has been the introduction of embedded scripting languages. These are bits of program code, in plain text, embedded in ordinary HTML pages. When the server encounters one of these code fragments while preparing to send an HTML page, it stops sending and instead obeys the instructions in the fragment. This code will, for example, fetch information from a database, format it in HTML and hand the result to the server program, which then inserts this slice of text at the position where the script fragment was embedded. The script itself is never sent to the viewer - only the HTML text which it generates.
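In PHP, to take the language used for this site, such a fragment sits between <?php ... ?> markers in an otherwise ordinary HTML page. A trivial sketch; the visitor never sees the code, only the line of text it generates:

```php
<p>This page was sent to you on
<?php
  // executed on the server while the page is being sent;
  // only the echoed text reaches the viewer's browser
  echo date('j F Y');
?>
.</p>
```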

The first, and now the most popular, of these scripting languages was PHP (a recursive acronym: PHP: Hypertext Preprocessor), closely followed by Microsoft's ASP (Active Server Pages). There are, however, many other contenders, such as Java, Python and Ruby, all of which are more elegantly designed than PHP. The durability of PHP is by no means certain: it is a collection of functions rather than a planned language.

I chose nevertheless to use PHP as the language for making the interactive pages, because there is a vast amount of example scripts, tutorials and debugging help available on the internet. PHP is a fine example of that revolution in IT known as 'Open Source', which has proved astonishingly successful precisely because it is not secret and commercial. However, open source does not guarantee durability, though it has a good record in this regard. I have therefore put a good deal of effort into isolating the PHP code, so that it can be changed in the future without the need to comb the entire collection of HTML texts for fragments of program code.

In principle, readers' comments can be intercepted by a PHP script in the page, stored as HTML code in an ordinary text file and retrieved by another PHP script, which inserts them into the relevant page for viewing at the end of the regular text. This is the most durable method, keeping all the content as text files and using only PHP as the relatively ephemeral, but necessary, programming tool.
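A minimal sketch of such a script, assuming a writable text file comments.txt beside the page (the file name and the form field 'message' are invented for the example):

```php
<?php
// store a newly submitted comment, one HTML-escaped line per comment;
// the escaping also stops a malicious reader injecting live HTML or script
if (isset($_POST['message'])) {
    $line = htmlspecialchars($_POST['message']) . "\n";
    file_put_contents('comments.txt', $line, FILE_APPEND);
}
// retrieve the stored comments and insert them at this point in the page
foreach (file('comments.txt') as $comment) {
    echo '<p>' . $comment . '</p>';
}
?>
```

Even if PHP should vanish, the comments survive as a plain text file readable by anything.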

However, a rich collection of readers' comments will require considerable programming effort to sort and connect to the right page. For this sort of job, a database is more convenient.

Adding a database to the enterprise adds yet another durability hazard. Fortunately, all modern databases can export their content as a series of SQL (Structured Query Language) statements, in plain text, or in binary form for pictures, which can quite simply be restored to the original image. SQL is a well established standard. A change of database is less traumatic than a change of scripting language, so one can be relatively relaxed about introducing a third short lived element into the search for a durable format for academic discourse.
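Such an export is plain text of this kind, readable by eye and importable into another database. The table and column names are invented for illustration:

```sql
-- recreate the table of readers' comments
CREATE TABLE comments (
  id INT AUTO_INCREMENT PRIMARY KEY,
  page VARCHAR(200),
  posted DATE,
  message TEXT
);
-- one INSERT statement per stored comment
INSERT INTO comments (page, posted, message)
  VALUES ('example.php', '2006-03-14', 'A useful article, but ...');
```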

I chose MySQL (said to be named after the daughter, My, of one of its originators) as the database. It is open source, with a huge body of users and therefore a correspondingly large advice and trouble-shooting literature on the web. It is now the leading web database for all but the most demanding commercial and air traffic control tasks. The turnover in the top database position is very rapid, so one must certainly allow for rapid obsolescence in this department.

MySQL and PHP are closely linked: PHP has special functions to connect to a MySQL database (and to many others), so the coding is fairly easy.
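The linkage is a handful of function calls, of which these are the classic mysql_* family of the period; the credentials, database and query are of course invented:

```php
<?php
// connect to the database server, choose a database,
// run a query and loop over the resulting rows
$link = mysql_connect('localhost', 'username', 'password');
mysql_select_db('compendium', $link);
$result = mysql_query("SELECT message FROM comments WHERE page = 'example'");
while ($row = mysql_fetch_assoc($result)) {
    echo '<p>' . $row['message'] . '</p>';
}
?>
```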

The choice of web server is the next source of risk in the search for durability and transportability. In the old days, servers just called independent programs using the CGI protocol. In principle one could use any server interchangeably and simply upload one's own data management program to a directory called, by universal convention, /cgi-bin. Nowadays the PHP script interpreter is embedded in the server as a module, merged with the server code at the moment of starting the program, so changing the server program is not so straightforward.

My choice was Apache. This is not a controversial choice: Apache runs most of the world's web services. There are versions of Apache for many operating systems but the most usual operating system is Linux, another open source product of legendary stability, which is not to be confused with durability: there are significant rivals to Linux. This combination of Linux, Apache, MySQL and PHP is so popular that a merged acronym has been coined: LAMP.

These four programs have to be linked together. This involves study of their setup files with a view to establishing a way of working that is as standard as possible, so that transferring the book to another site will not stop it working because of reliance on strange settings of the inner workings of the programs. Making a LAMP structure that is largely independent of the exact setup of the computer requires a time consuming refinement of the code to avoid any site specific setup details. Another detail, which nevertheless takes a vast amount of time, is hardening the code against malicious attack. In recent times, the struggle against malevolent interference has become a major time consumer for system programmers. It came as a shock to me, a programmer with forty years' experience of writing entirely unprotected scientific programs, to have to learn all the tricks of protecting an internet-linked site.

The conservator wishing to publish on the web may well be daunted by these complexities. I was certainly surprised by the time it took to get it all working. However, the core HTML code is easy to write, so if the conservator discards the interactive aspects of the web and uses direct HTML methods for navigation and linking, publishing is very easy.

 

Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.