Archive by Author

ENRICH

Until December 2009 I worked on the ENRICH project, and as it has now finished, I thought that I should reflect on some of what the project has done and the aspects we’ve been involved with here in Oxford. For the most part the project has been attempting to both aggregate manuscript descriptions into the manuscriptorium framework and standardise these manuscript descriptions to a single, common, agreed format. For the background to the ENRICH project, see the website, and especially this article on the ENRICH Project and TEI P5. A list of deliverables is also available.

Standardisation of Specification

The workpackage we were most involved with, partly because we were leading it, was workpackage 3 whose object was:

To ensure interoperability of the metadata used to describe all the shared resources by analysing the various standards used by different partners and ensuring their mapping to a single common format, which will be expressed in a way conformant with current standards.

As one might expect, in practice, this common format was a more tightly constrained subset of the TEI recommendations on Manuscript Description. The difficulty in any such endeavour is getting coherent agreement between a large number of representatives on a wide variety of customisations. As part of this process we undertook a comparison of MASTER, TEI P5, and Manuscriptorium formats. A number of revisions were made to the ENRICH schema through the course of the project. Deliverable D3.1 was a “Revised TEI-Conformant specification” available in a number of schema languages. The ENRICH Schema is publicly and freely available as DTD, RELAX NG, and W3C Schema, but we recommend the RELAX NG format.

Documentation

The next deliverable, D3.2, was “Documentation and training materials for use with the ENRICH Specification”. Because the TEI ODD had been written with documentation in it, the same TEI ODD which generated the schemas above could also be used to generate project-specific documentation. This meant that, in addition to the documentation written specifically for the ENRICH project, we had access to all the internationalised reference material available in the TEI Guidelines as a whole. We could therefore produce versions of the documentation which, while still primarily in English, contained glosses of the elements in another language. So for example:

<msIdentifier> (manuscript identifier) contains the information required to identify the manuscript being described.

in the English documentation for the ENRICH Specification became, in the French:

<msIdentifier> (identifiant du manuscrit) Contient les informations requises pour identifier le manuscrit en cours de description.

While this is admittedly of limited benefit, since the bulk of the documentation remains in English, it can aid comprehension for those reading in a foreign language to have the element descriptions in their own language. The ENRICH Specification documentation is available in the following languages and formats:

(HTML needs odd.css and tei.css)

Training Materials

Training materials were also created as part of D3.2 and took the form of slide sets as PDF, HTML, and TEI XML that project partners were free to take, modify, and use in teaching the ENRICH schema:

Migration Tools

While the primary migration tools from other formats to the ENRICH Specification were undertaken by the lead technical partner, we were tasked with undertaking a case-study-based analysis of the construction of migration tools and making recommendations to the project based on it. The migration case studies focussed on MASTER records that we had accumulated as a testbed and EAD records given to us by the Bodleian Library. The Case Studies on Migration to the ENRICH Specification and all their materials are freely available online. The case studies examined methods for transformation of MASTER and EAD records to TEI P5, mainly using XSLT-based conversions. The report on the Development and Validation of Migration Tools is available online.

ENRICH Garage Engine

Originally D3.4 of the ENRICH Project was a “Report on METS/TEI interoperability, best practice with respect to handling of Unicode and non-Unicode data in Manuscriptorium and P5 conversion techniques”. However, after much investigation it was determined that the use of METS was unnecessary for our extension to the Manuscriptorium platform. (This is not to say that it would not have been suitable for this or other uses.)

Part 1 of D3.4 and some of the work on it was replaced through the development of the ENRICH Garage Engine (EGE) and a report on the Documentation and Use of the ENRICH Garage Engine. This is a primarily web-service based format conversion engine developed by PSNC which enables document conversion through a number of formats. The engine itself consists of a web service and website frontend and underneath consists of a recognizer, a validator, and a converter. As the EGE website explains:

  • Recognizer – this plug-in is responsible for the recognition of the Internet Media Type (MIME type) of the given input data. For example, it will receive the input data and state that the input data has text/xml MIME type. The recognized data may then be further validated to check the format of the data.
  • Validator – this plug-in is responsible for validation of the input data. For example it may be used to validate the ENRICH TEI P5 data stored in a MIME type (e.g. text/xml) either received from end user or created by one of the converters. The following notation is assumed: ENRICH TEI P5 (text/xml) – it means that validator is able to validate ENRICH TEI P5 format encoded in text/xml.
  • Converter – this plug-in is responsible for converting the input data. It may be, for example, conversion from XML to Word, conversion from Word to PDF, conversion of the XML from one form to another (e.g. MASTER -> ENRICH TEI P5) or even cleaning the input data (e.g. removing redundant information).
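
As a rough illustration of how those three plug-in roles fit together, here is a minimal Python sketch. It is not the EGE API: every function name is hypothetical, the "validation" is only a stand-in check on the root element (real validation would be against the ENRICH schema), and the "conversion" is just a cleaning step that drops comments by re-serialising.

```python
import xml.etree.ElementTree as ET

TEI_ROOT = "{http://www.tei-c.org/ns/1.0}TEI"


def recognize(data: bytes) -> str:
    """Recognizer: guess a MIME type for the raw input (hypothetical helper)."""
    try:
        ET.fromstring(data)
        return "text/xml"
    except ET.ParseError:
        return "application/octet-stream"


def validate(data: bytes, mime_type: str) -> bool:
    """Validator stand-in: only checks that the root element is a TEI <TEI>."""
    if mime_type != "text/xml":
        return False
    return ET.fromstring(data).tag == TEI_ROOT


def convert(data: bytes) -> bytes:
    """Converter stand-in: 'cleans' the input by dropping XML comments.

    The default ElementTree parser discards comments, so parsing and
    re-serialising is enough for this illustration.
    """
    return ET.tostring(ET.fromstring(data))


def process(data: bytes) -> bytes:
    """Chain the three roles: recognise, validate, then convert."""
    mime_type = recognize(data)
    if not validate(data, mime_type):
        raise ValueError(f"input of type {mime_type} failed validation")
    return convert(data)
```

The point of the separation is that each role can be swapped independently: a new converter does not need to know how its input was recognised or validated.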

You can try the EGE at its website:

ENRICH gBank and Non-Unicode Characters

One problem encountered in the migration of legacy documents to the ENRICH Specification might be that these records use characters which are not currently present in Unicode. The Medieval Unicode Font Initiative (MUFI) campaigns for inclusion of some of these specialized characters into the Unicode Specification. The second half of the D3.4 deliverable we produced was a report on Best practice in handling non-Unicode characters. This included the description of a software tool, the ENRICH gBank, produced to assist in normalization and documentation of non-Unicode characters. This contains a list of all of the MUFI non-Unicode characters in the Private Use Area (PUA), images of them, and a representation of them using a TEI <char> element. For the most part these were automatically generated from the MUFI Spec. Conversion of this involved exporting the Adobe InDesign file as RTF, converting this to a basic presentational TEI XML, and running a transformation script on this to extract just the data we needed for our own tables. In addition, the PUA references were used, in conjunction with the Andron Scriptor Web font, to produce first SVG files (using Apache Batik) and then specific-sized PNG files from these. This allowed us to have character images for each of the characters in the PUA.
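
For anyone wanting to find such characters in their own legacy records, a minimal Python sketch follows. It is not part of the gBank; it simply reports characters whose Unicode general category is "Co" (private use), and the function names are my own.

```python
import unicodedata
from typing import List


def is_private_use(ch: str) -> bool:
    """True if the character's Unicode general category is Co (private use)."""
    return unicodedata.category(ch) == "Co"


def private_use_chars(text: str) -> List[str]:
    """List the private-use characters in text, in order, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text if is_private_use(ch)]
```

Running private_use_chars() over the text of a record gives a quick list of the characters that would need documenting or normalizing before migration.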

You can see the ENRICH gBank on the ENRICH beta website at:

ENRICH Templates

As part of the ENRICH teaching materials we also created some ENRICH templates, to assist those who wanted a guide as to the kind of material that should be present in an ENRICH manuscript description.

A number of projects have taken these templates as starting points to further customise in their own use of the ENRICH Specification or TEI P5 msDesc.

Conclusions

Working for any large and dispersed EU project always has its benefits and drawbacks. In the case of ENRICH we were able to draw on a wide range of experience, technologies and data because of the diverse nature of the project. One of the major drawbacks stems from being partnered with commercial organisations. While all the work they did in their development and support of the Manuscriptorium platform was top notch, they naturally have the commercial interests of their business model at the forefront of their activities. This meant, for example, that while the ENRICH Specification and all the software, documentation, training materials and tools that we (OUCS) produced were licensed under an open licence, the same was not true of the main commercial company behind Manuscriptorium. The platform itself is not open source; at no point were we able to see the workings of the platform, nor contribute patches or bug fixes to it. This meant any of our development took place in an isolated manner and at arm’s length.

Fair enough, the EU (via its eContent+ programme) funded this project with the understanding, presumably, that this would be the case. However, I feel that it is wrong for the EU to fund projects with commercial partners where those partners are not required to release the products of the funded work under an open licence of some sort. I’m not in any way against these commercial companies, but there are plenty of workable business models which enable them still to profit from materials they have developed and released under an open licence.

The ENRICH project has produced a lot that is good and interesting, and one of its major achievements is the network of individuals, projects, and institutions which are all approaching medieval manuscript description in the same manner. Although ENRICH (as a schema or project) is certainly not the last word in large-scale projects for the aggregation and standardization of medieval manuscript descriptions, it is a good development and milestone along that road.

List of Deliverable Reports

Thunderbird + Lightning Nexus Calendar Export to Google Calendar

There are plenty of ways to sync one’s work (nexus, Oxford’s version of Exchange) calendar with google if you are using Windows and Outlook. However, I’m using Ubuntu Linux. The solution I’ve chosen for getting mail and shared calendaring is Thunderbird + Lightning + Davmail. This works, but has idiosyncrasies such as not allowing you to share calendars (though you can use calendars you have shared through another method such as Outlook2007 or OWA-Messageware).

Let’s be clear here, I do not need full synchronisation. What I want to do is:

  • when looking at my google apps calendars (which I intentionally separate from my work ones) I want to be able to have at least read-only view of my work calendars. Basically I want to just see them so I know that work activities are not overlapping with personal ones.
  • make my calendars available read-only to specific other people who either are not inside *.ox.ac.uk or whose departments do not use calendaring aspects of nexus

The solution I’ve come up with is an ad-hoc one involving a mozilla thunderbird extension called automatic export. Once it is installed and its icon is added to the toolbar, you can select a cyclical export from the icon’s dropdown menu. I have this set to export my calendar every 10 minutes. As long as you export this to a web accessible location then google calendar can subscribe to this. In addition, I store mine on a remote server, so have a shell script that scp’s it to the correct location every 10 minutes…so at very worst it is 20 minutes out of date. On google you just subscribe to the remote .ics file… though it sometimes takes a while for google to finally realise it is there.
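
My own copy step is a shell script, but the idea can be sketched in a few lines of Python. Everything specific here is a placeholder: the export path, the remote target, and the function names are hypothetical, and it assumes scp is installed and key-based login is already set up.

```python
import subprocess
import time

# Hypothetical locations: change these to wherever the extension writes the
# file and wherever your web-accessible directory lives.
EXPORT_FILE = "/home/user/calendar-export/work.ics"
REMOTE_TARGET = "user@example.org:public_html/cal/work.ics"


def build_scp_command(local_path, remote_target):
    """Build the scp invocation (-q keeps scp quiet) as an argument list."""
    return ["scp", "-q", local_path, remote_target]


def worst_case_staleness_minutes(export_interval, copy_interval):
    """An event can just miss an export and then just miss a copy."""
    return export_interval + copy_interval


def push_forever(interval_seconds=600):
    """Copy the exported calendar to the remote server every 10 minutes."""
    while True:
        # check=False: a failed copy is simply retried on the next cycle
        subprocess.run(build_scp_command(EXPORT_FILE, REMOTE_TARGET), check=False)
        time.sleep(interval_seconds)
```

With a 10-minute export and a 10-minute copy, worst_case_staleness_minutes(10, 10) gives the 20 minutes mentioned above.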

Drawbacks

  • The export only works while a copy of thunderbird that is set to do this is running. So, for example, TB on my laptop is not set to do this, and if I add an appointment with OWA-lite it doesn’t end up in my google calendar until I load up TB at work on Monday.
  • It is fairly insecure. The entire calendar is exported as an .ics file that is world readable. While it is in a place that is fairly obscure, security by obscurity isn’t really security.
  • I tried having it on a passworded WebDAV storage, but even giving google the username/password in the url, it had problems finding it.
  • Private events are shared with those with whom you share the calendar… so they basically see anything you see.
  • You need to have a constantly web-accessible location in which to put the calendar, exporting it to your desktop machine isn’t sufficient since google will think it has disappeared when the machine is off. (And we all hibernate our desktops and use OUCS’s Wake-On-Lan service to wake them up when needed… don’t we?)

I don’t know if this will be useful to anyone else… but that is how I export my Thunderbird+lightning+davmail Nexus Calendar to my Google Apps Calendar.

-James

TEI-Comparator

I have just finished my poster for DRHA 2009 which is about the TEI-Comparator that RTS worked on for the Holinshed Project. My poster is available online in PDF and PNG formats. (Though for the record it was created in Inkscape as an SVG file).

The poster discusses the creation of the tool for the Holinshed Project at the University of Oxford. Holinshed’s Chronicles of England, Scotland, and Ireland was the crowning achievement of Tudor historiography and an important historical source for contemporary playwrights and poets. Holinshed’s Chronicles was first printed in 1577 and a second revised and expanded edition followed in 1587. EEBO-TCP had already encoded a version of the 1587 edition, and the Holinshed Project specially commissioned them to create a 1577 edition using the same methodology. The resulting texts were converted to valid TEI P5 XML and used as a base to construct a comparison engine, known as the TEI-Comparator, to assist the editors in understanding the textual differences between the two editions.

Using the TEI-Comparator has several stages. The first was to decide what elements in the two TEI XML files should be compared. In this case the appropriate granularity was at the paragraph (and paragraph-like) level. The project was primarily interested in how portions of text were re-used, replaced, expanded, deleted, and modified from one edition to another. This first stage ran a short preparatory script which added unique namespaced IDs to each relevant element in both the TEI files. It is the proper linking of these two IDs which the TEI-Comparator hoped to facilitate.

The second stage was to prepare a database of initial comparisons between the two texts using a bespoke fuzzy text-comparison n-gram algorithm designed by Arno Mittelbach (the technical lead for the TEI-Comparator). This algorithm, called Shingle Cloud, transforms both input texts (needle and haystack) into sets of n-grams. It matches the haystack’s n-grams against the needle’s and constructs a huge binary string where they match. This binary string is then interpreted by the algorithm to determine whether the needle can be found in the haystack and if so where. The algorithm runs in linear time and, given the language of the originals, was found to work better if the strings of text were regularized (including removal of vowels).

The third stage in using the comparator was for the research assistant on the project to confirm, remove, annotate, or create new links between one edition and the other using a custom interface to the TEI-Comparator constructed in Java using the Google Web Toolkit API. The final stage was to produce output from the work put in by the RA through generating two standalone HTML versions of the texts which were linked together based on the now-confirmed IDs.
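
To give a feel for the n-gram matching in the second stage, here is a deliberately simplified Python sketch. It is not the Shingle Cloud implementation: it assumes character trigrams, uses a crude regularization (lower-case, ASCII letters only, vowels removed), and stops at producing the binary string without interpreting it. All function names are my own.

```python
import re
from typing import List


def regularize(text: str) -> str:
    """Lower-case, keep only a-z, then remove the vowels a, e, i, o, u."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    return re.sub(r"[aeiou]", "", letters)


def ngrams(text: str, n: int = 3) -> List[str]:
    """All overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def match_string(needle: str, haystack: str, n: int = 3) -> str:
    """One character per haystack n-gram: '1' if it also occurs in the needle."""
    needle_grams = set(ngrams(regularize(needle), n))
    return "".join(
        "1" if gram in needle_grams else "0"
        for gram in ngrams(regularize(haystack), n)
    )
```

A long run of 1s in the result suggests where the needle paragraph sits in the haystack; deciding how long and how dense such a run has to be is the part the real algorithm takes care of.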

Shortly the TEI-Comparator will be publicly available on Sourceforge with documentation and examples to make it easy for others to re-purpose this software for other similar uses, and submit bugs and requests for future development.

Although known as the ‘TEI-Comparator’, the program does not require TEI input; it works with XML files of any vocabulary as long as the elements being compared have sufficient unique text in them.

For more information about the TEI-Comparator e-mail: tei@oucs.ox.ac.uk

addingIDs

Rehdon asked me about giving @xml:id attributes to things, so I whipped up this quick XSLT stylesheet. Some people prefer to use generate-id() to get an opaque, unique ID without semantic baggage. In many cases, where IDs are exposed to the public, I prefer to use ones which make sense and are human readable.

Warning: there is a distinct flaw in the lack of testing I’ve done before applying the @xml:id. If something other than a <p> element already has xml:id="p5" then it will still add 'p5' as an @xml:id to the fifth paragraph. This means that it will produce an XML document that is not valid, since one of the requirements of @xml:id is that it is unique in the document. Also it would number paragraphs in other namespaces as well. (This may be a bug or a feature depending on your outlook.) It numbers from tei:text so if you don’t have that in your document you should change the from attribute on xsl:number.

The XSLT stylesheet takes a parameter ‘e’ to which you can pass the local-name of the element in question. It assumes ‘p’ otherwise, but you could use it to number div, head, w, or really any element just by passing it e=w (or whatever).

Update: Rehdon asked about a configurable optional prefix to the ID and a 4-digit zero-padded number for it. So I changed the script to do that.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:tei="http://www.tei-c.org/ns/1.0"
        xmlns="http://www.tei-c.org/ns/1.0"
        exclude-result-prefixes="tei"
        version="1.0">
        <!-- Parameter to pass to the stylesheet, assumes 'p' if nothing given -->
        <xsl:param name="e" select="'p'"/>
        <!-- If it exists, a prefix string: include a separator, like 'text1_' to get 'text1_p0005' -->
        <xsl:param name="pre"/>

        <!-- typical copy-all template -->
        <xsl:template match="@*|node()|comment()|processing-instruction()" priority="-1">
            <xsl:copy><xsl:apply-templates select="@*|node()|comment()|processing-instruction()"/></xsl:copy>
        </xsl:template>

        <!-- higher priority one to match elements -->
        <xsl:template match="*">
            <xsl:copy>
                <!-- If the local-name is the element we've passed it, and there is not an @xml:id attribute -->
                <xsl:if test="local-name() = $e and not(@xml:id)">
                    <!-- make a variable numbering current nodes at any level from tei:text,
                         zero-padded to four digits by the format token '0001' -->
                    <xsl:variable name="num"><xsl:number level="any" from="tei:text" format="0001"/></xsl:variable>
                    <!-- Then create an @xml:id attribute with the prefix, name and number concatenated -->
                    <xsl:attribute name="xml:id"><xsl:value-of select="concat($pre, local-name(), $num)"/></xsl:attribute>
                </xsl:if>
                <!-- apply any other templates (i.e. copy other stuff) -->
                <xsl:apply-templates select="@*|node()|comment()|processing-instruction()"/>
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>

Hope that is useful. I’ll try to remember to add it to the TEI wiki as well.

adding word-level markup

Rehdon, snail, and others have recently asked me about marking up words inside an element which may itself contain markup (sometimes wrapping more than one word), so I thought I’d write it up.

So for example we might have an XML file that looked like:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<line>This is a test;</line>
<line>Only a <seg type="foo">test</seg> ok?</line>
<line>And <seg>so; is</seg> this as well.</line>
</root>

Let’s say we want to mark up each of the whitespace-separated words, and for some reason the randomly added semi-colons, as words with a <w> element. What we can use is <xsl:analyze-string> (an XSLT 2.0 instruction) and a regex. For example:

<xsl:template match="line//text()">
 <xsl:analyze-string regex="(\w+|;+)" select=".">
 <xsl:matching-substring><w><xsl:value-of select="."/></w></xsl:matching-substring>
 <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
 </xsl:analyze-string>
</xsl:template>

In this example we’re matching any text() inside a <line> element at any depth, and if it matches the \w+ regex (or is a run of semicolons) it will get wrapped in a <w> element. If it doesn’t match, then the text that was there gets output unchanged. Because this is line//text() (as opposed to line/text()) it will recurse down into grandchildren elements and further.

So assuming we have a copy-all template something like:

<xsl:template match="@*|node()" priority="-1">
  <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>

(where we basically copy any nodes and attributes unless something else matches them) then we should get the result:

<root>
  <line><w>This</w> <w>is</w> <w>a</w> <w>test</w><w>;</w></line>
  <line><w>Only</w> <w>a</w> <seg type="foo"><w>test</w></seg> <w>ok</w>?</line>
  <line><w>And</w> <seg><w>so</w><w>;</w> <w>is</w></seg> <w>this</w> <w>as</w> <w>well</w>.</line>
</root>

Of course that is only the beginning, as your documents will probably have weird special cases and punctuation that you want to handle differently. And also it would, of course, be useful to create an @xml:id attribute for each word element.
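
If you want to try the regex on its own before putting it in a stylesheet, the same pattern can be exercised in Python. This is only a sketch of the matching step on the string value of a single text node. The function name is my own, it does no XML escaping, and Python’s \w is not guaranteed to agree with the XSLT regex dialect on every character.

```python
import re

# Same pattern as in the stylesheet: a run of word characters or of semicolons
WORD_OR_SEMICOLON = re.compile(r"(\w+|;+)")


def wrap_words(text: str) -> str:
    """Wrap each match in <w>...</w>; non-matching text passes through."""
    return WORD_OR_SEMICOLON.sub(r"<w>\1</w>", text)
```

For example, wrap_words("so; is") gives "<w>so</w><w>;</w> <w>is</w>", which is what ends up inside the <seg> in the third line above.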

-James

Evaluate a string as an XPath

Looking at ways to process a suggested change in TEI P5, I wanted to test that there is a straightforward way to evaluate a string that exists in a document as if it was an XPath you had included in your document.

So say I have a made-up document where I store some xpaths relating to that very document in the document itself as bits of text.

Input

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <paths>
        <path>/foo/blort/wibble[1]</path>
        <path>/foo/blort/wibble[2]</path>
        <path>//*[@xml:id='wibNum2']/splat/@att</path>
    </paths>
    <blort>
        <wibble>test text 1</wibble>
        <wibble>Another wibble </wibble>
        <wibble xml:id="wibNum2">This is <splat att="value1">a
            test</splat></wibble>
    </blort>
</foo>

To grab these and evaluate them as XPaths, you need to use an extension in saxon, unfortunately, saxon:evaluate(). For example in this stylesheet:

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0" xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="#all">
    <xsl:output indent="yes"/>

    <xsl:template match="/foo">
        <foo>
            <xsl:for-each select="paths/path">
                <out>
                    <xsl:value-of select="saxon:evaluate(.)"/>
                </out>
            </xsl:for-each>
        </foo>
    </xsl:template>
</xsl:stylesheet>

This should produce the output:

Output

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <out>test text 1</out>
  <out>Another wibble </out>
  <out>value1</out>
</foo>

This does use the saxon:evaluate(.) extension. There are similar extensions in a variety of other implementations for XSLT1 as well.

-James

XSLT2 collection() with dynamic collections from directory listings

Something I didn’t know about XSLT2’s collection() function. I had previously used it in the form:

<xsl:variable name="files" select="collection('docs.xml')"/>

where docs.xml has a structure of:

<?xml version="1.0"?>
<collection>
    <doc href="blort1.xml"/>
    <doc href="blort2.xml"/>
</collection>

You can then address, via the variable, the structure of those files blort1 and blort2 and iterate over them etc. e.g. you can do something like:

<xsl:for-each select="$files/tei:TEI/tei:text/tei:div">
  <xsl:apply-templates mode="TOC" select="tei:head"/>
</xsl:for-each>

Ok… I already knew how to do that and have used it to run XSLT on a whole raft of files. To get the docs.xml file I used to run “xmlstarlet ls” and then I have a dir2collection.xsl that transforms its output to the correct format.

However, what I didn’t know is that I didn’t need to bother creating the collection file at all. Saxon can generate the collection file from a parameter on the URI that you hand collection(). That is you can do something like:

<xsl:variable name="files" select="collection('../foo/?select=blor*.xml')"/>

And $files is then addressable in the same way as if you had made a collection document of all the files matching blor*.xml in the directory ../foo/ (and of course you can just do *.xml)

But wait, that’s not all. You can get a bit more complicated about it, pass the path as a parameter, and supply the collection() function extra parameters. So something like:

    <xsl:param name="path2collection">../foo/</xsl:param>
    <xsl:variable name="path">
        <xsl:value-of
            select="concat('../',$path2collection,'?select=*.xml;recurse=yes;on-error=warning')"
        />
    </xsl:variable>
    <xsl:variable name="docs" select="collection($path)"/>

And henceforth $docs contains a recursive collection of every XML file under the path2collection parameter you give it.
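
For comparison, here is roughly what those three query parameters amount to, sketched in Python rather than XSLT. This is not how Saxon implements collection(); the function name and its arguments are my own, standing in for select, recurse, and on-error=warning.

```python
import warnings
import xml.etree.ElementTree as ET
from pathlib import Path


def load_collection(directory, pattern="*.xml", recurse=True):
    """Parse every file under directory matching pattern; return root elements.

    pattern plays the role of select=, recurse of recurse=yes, and the
    warning on unparsable files of on-error=warning.
    """
    base = Path(directory)
    paths = base.rglob(pattern) if recurse else base.glob(pattern)
    roots = []
    for path in sorted(paths):
        try:
            roots.append(ET.parse(str(path)).getroot())
        except ET.ParseError as err:
            warnings.warn(f"skipping {path}: {err}")
    return roots
```

The returned list can then be iterated in much the same way as $docs, for instance to pull the first head out of every div in every file.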

Isn’t that fun? Ok, maybe only me.

XIncluding portions of TEI Documents

‘Leoba’ another time asked me what to do when multiple files want to refer to the same textDesc, msDesc, listPerson or similar elements in their teiHeader.

To me, this is the canonical example use-case for W3C XInclude. You can store the individual bits anywhere you want on the web, and point (for example) to an element with an @xml:id attribute on it. There are ways to do more complicated xpointer fragment identifiers, but these aren’t processed automatically in oXygen, my preferred XML editor. oXygen by default processes XIncludes in this format, and so virtually includes the referenced element before validating the file.

So, in your file1.xml where you are encoding an electronic text, you might replace a listPerson element with the following:

<xi:include href="people.xml" xpointer="listPerson1" parse="xml">
    <xi:fallback>
        <listPerson>
            <head>People not available</head>
            <person/>
        </listPerson>
    </xi:fallback>
</xi:include>

This will include the element which has an @xml:id of ‘listPerson1’ (one assumes a listPerson), stored in the (full TEI file) ‘people.xml’, at that point in file1.xml. Here an optional fallback supplies an empty listPerson with a message inside a head element. One of the benefits of this is that many texts can refer to the same listPerson, listPlace, textDesc, msDesc, or what have you, so you share resources across multiple documents, projects, and hopefully institutions. When projects use such a system, in addition to their editions, their standalone listPerson, listPlace, etc. files should also be made transparently available so that other people can point to the same people, places, etc.
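
If you need to resolve this kind of include outside an XInclude-aware editor, the lookup can be sketched in Python. This is an illustration only, not a conformant XInclude processor: it handles just the bare-ID form of xpointer shown above, it falls back both when the file cannot be loaded and when the ID is not found (a simplification), and the function name and the caller-supplied load argument are my own inventions.

```python
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_include(include, load):
    """Return the element that should stand in for one xi:include element.

    load is a caller-supplied function that maps an href to a parsed root
    element, raising OSError if the resource is unavailable.
    """
    target_id = include.get("xpointer")
    try:
        root = load(include.get("href"))
        for element in root.iter():
            if element.get(XML_ID) == target_id:
                return element
    except (OSError, ET.ParseError):
        pass  # fall through to the fallback content, if any
    fallback = include.find(XI + "fallback")
    if fallback is not None and len(fallback):
        return fallback[0]
    raise LookupError(f"cannot resolve #{target_id} and no fallback given")
```

A caller would typically pass something like lambda href: ET.parse(href).getroot() as load, and then swap the returned element in for the xi:include in the parent document.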