A Neo-Latin workshop at the ACDH (Austrian Center for Digital Humanities)

Vienna, ACDH / Sonnenfelsgasse 19, 12-13 May

Croatian / CroALa participant: Neven Jovanović, Department of Classical Philology, Faculty of Humanities and Social Sciences, University of Zagreb

With shabby equipment always deteriorating

(T. S. Eliot, East Coker)

Required for the workshop: bring your own small corpus of Latin letters, tokenized, grammatically annotated, reviewed.

How we produce the corpus -- documentation

The steps:

  1. Select 20 documents with letters ("the corpus") from the CroALa collection
  2. Select only Latin words from the texts (discard notes and apparatus, as well as passages in foreign languages)
  3. Tokenize these words, i.e. wrap each of them in a TEI <w>...</w> element
  4. Create a word list from the w elements in the corpus
  5. Lemmatise the word list using the Perseus Morphological Analysis Service
  6. Separate the unidentified words
  7. Separate the unambiguously identified words from the rest
  8. Analyse the rest (ambiguous lemma or inflection identification)
  9. Add morphological data to the TEI XML markup
  10. Analyse the lemmatised part of the corpus

Select a certain number of documents with a certain type of text

From the CroALa collection in a BaseX XML database (for some background information, cf. From doc to DB) we extract the first 20 documents containing letters and copy them to a directory. We use the following XQuery:

let $ep := //*:TEI[descendant::*:textClass/*:keywords/*:term[matches(.,'prosa oratio - epist')]]
for $c in 1 to 20
let $name := "/home/neven/rad/vienna/" || db:path($ep[$c])
return file:write($name, $ep[$c], map { "method": "xml" })

For further manipulation, make a database acdhcroala from these files:

db:create("acdhcroala", "/home/neven/rad/vienna/" , (), map { 'ftindex': true(), 'intparse': true(), 'stripns': true() })

Select only Latin words from the main text

We want to disregard the TEI header, as well as any notes, abbreviations etc.

First we have to see which tags occur in the text section of the TEI documents. A query such as the following lists the distinct element names:

distinct-values(//*:text//*/local-name())

After reviewing the resulting tag list, and noting what we want to omit and what to analyse further, we arrive at the XQuery which does what we want and returns 1,531 text nodes:

for $t in //*:text//text()[not(ancestor::*:note)][not(parent::*:abbr)][not(parent::*:foreign)][not(parent::*:corr[parent::*:choice])][not(parent::*:orig[parent::*:choice])]
return $t
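The same selection logic can be sketched, as an illustrative analogue outside the actual pipeline, in Python (the sample document and the reduced skip list are made up for the example):

```python
# Illustrative Python analogue of the selection step: collect text nodes
# while skipping the content of "note", "abbr" and "foreign" elements.
import xml.etree.ElementTree as ET

SKIP = {"note", "abbr", "foreign"}

def latin_text(elem):
    """Recursively collect text, omitting the content of skipped elements."""
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        if child.tag not in SKIP:
            parts.extend(latin_text(child))
        # tail text sits outside the child element, so it is always kept
        if child.tail:
            parts.append(child.tail)
    return parts

doc = ET.fromstring("<text>Salve <note>an editor's note</note>amice, <foreign>hello</foreign>vale!</text>")
print("".join(latin_text(doc)))  # Salve amice, vale!
```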

Tokenize the words and wrap them in TEI w element

This is easiest to accomplish with an XSLT stylesheet. The solution with xsl:analyze-string came from Michael Kay on Stack Overflow. David J. Birnbaum also helped.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
    <xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>
    <!-- 16croala-lemmata: copy everything, add w elements to words, keep punctuation -->
    <xsl:include href="copy.xsl"/>

    <xsl:template match="//*:text//text()">
        <xsl:analyze-string select="." regex="\w+">
            <xsl:matching-substring>
                <xsl:element name="w" namespace="http://www.tei-c.org/ns/1.0">
                    <xsl:value-of select="."/>
                </xsl:element>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>

This XSL stylesheet is applied to our 20 documents, collected as an oXygen project. The XML files in the resulting directory, tokenized, are then transformed into an XML database with the following XQuery:

(: create acdhcroala db from a directory with xml files :)
db:create("acdhcroala", "/home/neven/rad/vienna/tokenized/" , (), map { 'ftindex': true(), 'intparse': true(), 'stripns': true(), 'chop': false() })

The corpus database acdhcroala now contains 21,006 w elements.
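The \w+ tokenization performed by the stylesheet can be sketched, as an illustrative analogue outside the actual pipeline, in Python (the sample sentence is invented):

```python
# Minimal sketch of the tokenization step: wrap each \w+ run in a <w>
# element while leaving punctuation and whitespace untouched.
import re

def tokenize_to_w(text):
    return re.sub(r"\w+", lambda m: "<w>" + m.group(0) + "</w>", text)

print(tokenize_to_w("Quid agis, amice?"))
# <w>Quid</w> <w>agis</w>, <w>amice</w>?
```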

Create a word list from the w elements in the corpus

Use this XQuery / XPath (ordering types, that is word-forms, alphabetically for readability):

for $i in distinct-values(//*:w)
order by $i
return $i

There are 9,089 types (distinct word-forms) in the database. Bear in mind that some of them are children of abbr (that is, abbreviations) and the like, some are numbers, and some are unmarked abbreviations; the number of genuine Latin word-forms is therefore closer to 8,700, the figure we work with below.
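The distinction between tokens (occurrences) and types (distinct forms), and the distinct-values step itself, can be illustrated in Python with a made-up token list:

```python
# Sketch of the word-list step: distinct values over tokens, sorted
# alphabetically, as distinct-values(//*:w) does in the XQuery above.
tokens = ["et", "amicus", "et", "vale", "amicus", "et"]
types = sorted(set(tokens))
print(len(tokens), len(types))  # 6 tokens, 3 types
print(types)                    # ['amicus', 'et', 'vale']
```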

Lemmatise the word list, separate the unparsed and ambiguously parsed forms

We expect three outcomes: a) the word-form is not recognized by the parser; b) the word-form is recognized, but ambiguously (several identifications are possible); c) the word-form is recognized unambiguously (and correctly).

Using the Perseus Morphological Service, we query it with this XQuery:

(: extract distinct forms from acdhcroala db :)
(: send them to Perseus Morpheus :)
(: write to an XML file :)

let $morph := 'http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&amp;word=REPLACE_WORD&amp;engine=morpheuslat'
let $wordlist := element list {
for $i in distinct-values(collection("acdhcroala")//*:w)
let $q := doc(replace($morph, 'REPLACE_WORD', $i))
return element word { element w { $i } , $q//*:rest }
}
return file:write("/home/neven/rad/acdh-lemmata.xml", $wordlist, map { "method": "xml" })

Then we create a database from the file acdh-lemmata.xml and see what is there. It turns out that there are 1,570 (18%) unidentified word-forms in the corpus (found with //*:word[not(*:rest)]), and 1,984 word-forms with ambiguous lemma identification (found with //*:word[*:rest[2]]). Of the remaining 5,146 word-forms, however, only 2,784 (32% of the corpus) can be parsed unambiguously down to the inflection (//*:word[not(*:rest[2])][not(.//*:entry/*:infl[2])]); the rest can represent several inflections of the same lemma. But the good news is that, since we only want to lemmatise the words, the parser can do so unambiguously for 59% of our corpus.
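The three-way split of parser responses can be sketched in Python; the mapping from word-form to candidate lemmata is hypothetical sample data, not actual Morpheus output:

```python
# Sketch of the three parsing outcomes: unidentified, ambiguous lemma,
# unambiguous lemma. The analyses dictionary is invented for the example.
analyses = {
    "optet": ["opto"],          # recognized unambiguously
    "amor":  ["amo", "amor"],   # recognized, but ambiguously
    "xyzzy": [],                # not recognized by the parser
}

unidentified = [w for w, lemmas in analyses.items() if not lemmas]
ambiguous    = [w for w, lemmas in analyses.items() if len(lemmas) > 1]
unambiguous  = [w for w, lemmas in analyses.items() if len(lemmas) == 1]
print(unidentified, ambiguous, unambiguous)
# ['xyzzy'] ['amor'] ['optet']
```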

Add morphological information to markup

Following the rules for TEI XML w element, we use morphological information to enrich our word-forms with markup as follows:

<w lemma="opto" lemmaRef="http://data.perseus.org/collections/urn:cite:perseus:latlexent.lex38019.1">optet</w>

We go back to the original database of 20 documents with letters, and add the enriched w markup (with @lemma and @lemmaRef) to all unambiguously lemmatised words. This is done with an updating XQuery expression, using the file acdh-lemmataidentified.xml as the index of (unambiguously) lemmatised words.

let $lemmata := doc("/home/neven/rad/acdh-lemmataidentified.xml")
for $l in $lemmata//*:w
for $w in //*:w
where $w[.=data($l)]
return replace node $w with element w { $l/@lemma , $l/@lemmaRef , data($w) }
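The enrichment logic amounts to a dictionary lookup; here is a hedged Python sketch, reusing the opto/optet example from above (the index would in reality come from acdh-lemmataidentified.xml):

```python
# Sketch of the enrichment step: replace a bare token with a <w> element
# carrying @lemma and @lemmaRef when the form is in the index of
# unambiguously lemmatised words; leave other forms untouched.
index = {
    "optet": ("opto",
              "http://data.perseus.org/collections/urn:cite:perseus:latlexent.lex38019.1"),
}

def enrich(form):
    if form in index:
        lemma, ref = index[form]
        return '<w lemma="%s" lemmaRef="%s">%s</w>' % (lemma, ref, form)
    return form

print(enrich("optet"))
print(enrich("xyzzy"))  # xyzzy (unchanged)
```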

This also makes it possible to produce an index of lemmata and corresponding forms. The XQuery formats the index as an HTML table (with tr and td elements), ordering lemmata by frequency and then alphabetically, and listing all node ids in the database.

(: create an index of lemmatised words :)
(: order by frequency descending, then alphabetically :)
for $w in //*:w[@lemma]
let $idx := $w/@lemma
group by $idx
order by count($w) descending , $idx
return element tr { 
element td { $idx } , element td { count($w) } , element td { db:node-id($w) } }

The resulting HTML page (static) can be seen at our solr server: acdh-lemmata.html. Note that the index entries are linked to nodes in the acdhcroala database.
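The group-and-order logic of the index query can be mimicked in Python with a frequency counter (the lemma list is sample data):

```python
# Sketch of the index ordering: group occurrences by lemma, then sort by
# frequency descending and alphabetically, as the XQuery above does.
from collections import Counter

lemmata = ["et", "opto", "et", "amo", "opto", "et"]
counts = Counter(lemmata)
index = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(index)  # [('et', 3), ('opto', 2), ('amo', 1)]
```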

Now we're all set for an analysis of the lemmatised part of the corpus. For starters, we can say how much of the corpus is lemmatised. But we can also think about the frequencies of lemmata, as well as about co-occurrences (bigrams and trigrams of lemmata, etc.).


The Bitbucket repository acdhcroala contains texts (original), tokenized versions with added w elements, and XQuery and XSL scripts to reproduce everything described here.

Archive of attempts

An only partly successful XQuery attempt to tokenize text strings in XML.

The XQuery modifies our database, adding all tokens as w elements (the original text nodes are preserved -- this looks messy, but works).

(: tokenize certain text nodes under text :)
(: mark tokens with w element :)

for $t in //*:text//text()[not(ancestor::*:note)][not(parent::*:abbr)][not(parent::*:foreign)][not(parent::*:corr[parent::*:choice])][not(parent::*:orig[parent::*:choice])]
for $tokens in ft:tokenize($t, map { 'case': 'sensitive' })
return insert node <w>{$tokens}</w> into $t/parent::*