A Neo-Latin workshop at the ACDH (Austrian Center for Digital Humanities)
Vienna, ACDH / Sonnenfelsgasse 19, 12-13 May
Croatian / CroALa participant: Neven Jovanović, Department of Classical Philology, Faculty of Humanities and Social Sciences, University of Zagreb
With shabby equipment always deteriorating
(T. S. Eliot, East Coker)
Required for the workshop: bring your own small corpus of Latin letters, tokenized, grammatically annotated, reviewed.
How we produce the corpus -- documentation
The steps:
- Select 20 documents with letters ("the corpus") from the CroALa collection
- Select only Latin words from the texts (discard notes and apparatus, as well as passages in foreign languages)
- Tokenize these words, i. e. wrap each of them in a TEI <w>...</w> element
- Create a word list from the w elements in the corpus
- Lemmatise the word list using the Perseus Morphological Analysis Service
- Separate the unidentified words
- Separate the unambiguously identified words from the rest
- Analyse the rest (ambiguous lemma or inflection identification)
- Add morphological data to the TEI XML markup
- Analyse the lemmatised part of the corpus
Select a certain number of documents with a certain type of text
From the CroALa collection in a BaseX XML database (for some background information, cf. From doc to DB) we extract the first 20 documents containing letters and copy them to a directory. We use the following XQuery:
(: extract the first 20 documents with letters and write them to files :)
let $ep := //*:TEI[descendant::*:textClass/*:keywords/*:term[matches(., 'prosa oratio - epist')]]
for $c in 1 to 20
let $name := "/home/neven/rad/vienna/" || db:path($ep[$c])
return file:write($name, $ep[$c], map { "method": "xml" })
For further manipulation, make a database acdhcroala from these files:
db:create("acdhcroala", "/home/neven/rad/vienna/" , (), map { 'ftindex': true(), 'intparse': true(), 'stripns': true() })
Select only Latin words from the main text
We want to disregard the TEI header, as well as any notes, abbreviations etc.
First we have to see which tags occur in the text section of the TEI documents. Here is the XQuery:
distinct-values(//*:text//*/name())
The result, with notes on what we want to omit and analyse further:
- abbr (omit)
- add
- addrLine
- address
- bibl
- body
- choice (select the corr child)
- closer
- corr
- date
- dateline
- div
- docAuthor
- docDate
- docImprint
- docTitle
- emph
- expan
- foreign (omit)
- front
- gap
- head
- hi
- item
- l
- list
- milestone
- name
- note (omit with all children)
- num
- opener
- orgName
- orig (omit if it has reg as sibling)
- p
- pb
- persName
- placeName
- postscript
- pubPlace
- publisher
- q
- quote
- ref
- salute
- sic (omit if it has corr as sibling = //*:sic[preceding-sibling::*:corr] or //*:sic[following-sibling::*:corr])
- signed
- titlePage
- titlePart
The XQuery which does what we want, and returns 1531 text nodes:
for $t in //*:text//text()[not(ancestor::*:note)][not(parent::*:abbr)][not(parent::*:foreign)][not(parent::*:corr[parent::*:choice])][not(parent::*:orig[parent::*:choice])]
return $t
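The same filtering can be cross-checked outside BaseX. Below is a minimal Python sketch (standard library only, and with the choice/corr/orig handling left aside) that walks a TEI fragment and keeps roughly the text nodes the XQuery keeps; the element sets and function names are ours, not part of the workflow above.

```python
import xml.etree.ElementTree as ET

OMIT_SUBTREE = {'note'}               # omit with all descendants
OMIT_OWN_TEXT = {'abbr', 'foreign'}   # omit only the element's own text

def local(tag):
    # drop the namespace from a Clark-notation tag name
    return tag.split('}')[-1]

def latin_text(el, in_note=False):
    # recursively collect the text nodes the XQuery would keep
    in_note = in_note or local(el.tag) in OMIT_SUBTREE
    out = []
    if not in_note and local(el.tag) not in OMIT_OWN_TEXT \
            and el.text and el.text.strip():
        out.append(el.text)
    for child in el:
        out.extend(latin_text(child, in_note))
        # a child's tail text belongs to the current element's context
        if not in_note and child.tail and child.tail.strip():
            out.append(child.tail)
    return out
```

On a toy fragment such as `<text><p>Salve <note>omit</note> mundi <abbr>S.</abbr> amice</p></text>`, the sketch keeps only "Salve", "mundi" and "amice".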
Tokenize the words and wrap them in the TEI w element
This is easiest to accomplish with an XSLT stylesheet. The solution with xsl:analyze-string
came from Michael Kay on Stack Overflow. David J. Birnbaum also helped.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
exclude-result-prefixes="tei">
<xsl:output method = "xml" indent="yes" omit-xml-declaration="no" />
<!-- 16croala-lemmata: copy everything, add w elements to words, keep punctuation -->
<xsl:include href="copy.xsl"/>
<xsl:template match="//*:text//text()">
<xsl:analyze-string select="." regex="\w+">
<xsl:matching-substring><xsl:element name="w" namespace="http://www.tei-c.org/ns/1.0"><xsl:value-of select="."/></xsl:element></xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>
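For readers without an XSLT processor at hand, the effect of the analyze-string template can be approximated with a single regular-expression substitution. A hedged Python sketch (plain strings instead of TEI nodes, and without the TEI namespace on w):

```python
import re

def wrap_words(text):
    # wrap every \w+ run in <w>...</w> and pass punctuation and
    # whitespace through unchanged -- like xsl:analyze-string above
    return re.sub(r'\w+', lambda m: '<w>{}</w>'.format(m.group(0)), text)
```

For example, `wrap_words('Salve, mundi!')` yields `<w>Salve</w>, <w>mundi</w>!`.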
This XSL stylesheet is applied to our 20 documents, collected as an oXygen project. The XML files in the resulting tokenized directory are then transformed into an XML database with an XQuery.
(: create acdhcroala db from a directory with xml files :)
db:create("acdhcroala", "/home/neven/rad/vienna/tokenized/" , (), map { 'ftindex': true(), 'intparse': true(), 'stripns': true(), 'chop': false() })
The corpus database acdhcroala now contains 21,006 w elements.
Create a word list from the w elements in the corpus
Use this XQuery / XPath (ordering types, that is word-forms, alphabetically for readability):
for $i in distinct-values(//*:w)
order by $i
return $i
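The type extraction is easy to emulate in other languages; a small Python sketch of the same operation (note that, like distinct-values, it is case-sensitive, so Et and et count as two types):

```python
def word_list(tokens):
    # distinct word-forms (types), ordered for readability,
    # as in the distinct-values / order by query above
    return sorted(set(tokens))
```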
There are 9,089 types (distinct word-forms) in the database. Bear in mind that some of them are children of abbr (that is, abbreviations) and the like; some are numbers; some are unmarked abbreviations. So the actual number of word-forms is probably closer to our original 8,700.
Lemmatise the word list, separate the unparsed and ambiguously parsed forms
We expect three outcomes: a) the word-form is not recognized by the parser; b) the word-form is recognized, but ambiguously (several identifications are possible); c) the word-form is recognized unambiguously (and correctly).
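In code, this three-way split amounts to counting the analyses returned for each word-form. A hypothetical sketch, in which the list analyses stands in for the rest elements returned by the service:

```python
def classify(analyses):
    # (a) no analysis, (b) several competing analyses, (c) exactly one
    if not analyses:
        return 'unrecognized'
    if len(analyses) > 1:
        return 'ambiguous'
    return 'unambiguous'
```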
Using the Perseus Morphological Service, we query it with this XQuery:
(: extract distinct forms from acdhcroala db :)
(: send them to Perseus Morpheus :)
(: write to an XML file :)
let $morph := ('http://services.perseids.org/bsp/morphologyservice/analysis/word?lang=lat&amp;word=REPLACE_WORD&amp;engine=morpheuslat')
let $wordlist := element list {
for $i in distinct-values(collection("acdhcroala")//*:w)
let $q := (doc(replace($morph, 'REPLACE_WORD',$i)))
return element word { element w { $i } , $q//*:rest }
}
return
file:write("/home/neven/rad/acdh-lemmata.xml",$wordlist)
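The REPLACE_WORD substitution can be reproduced in any language. A Python sketch that only builds the request URL (quote() is our added precaution for non-ASCII word-forms; no request is actually sent here):

```python
from urllib.parse import quote

MORPH = ('http://services.perseids.org/bsp/morphologyservice/analysis/word'
         '?lang=lat&word=REPLACE_WORD&engine=morpheuslat')

def morpheus_url(word):
    # substitute the word-form into the URL template, as the
    # replace() call does in the XQuery above
    return MORPH.replace('REPLACE_WORD', quote(word))
```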
Then we create a database from the file acdh-lemmata.xml and see what is there. It turns out that there are 1,570 (18%) unidentified word-forms in the corpus (found with //*:word[not(*:rest)]), and 1,984 word-forms with ambiguous lemma identification (found with //*:word[*:rest[2]]). Of the remaining 5,146 word-forms, however, only 2,784 (32% of the corpus) can be parsed unambiguously down to the inflection -- the rest, matched by //*:word[not(*:rest[2])]//*:entry/*:infl[2], can represent several inflections of the same lemma. But the good news is that, since we only want to lemmatise words, the parser can do so unambiguously for 59% of our corpus.
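The percentages can be re-checked from the raw counts with a quick Python computation:

```python
# counts of word-forms by parse outcome, as reported above
unidentified, ambiguous_lemma, rest = 1570, 1984, 5146
total = unidentified + ambiguous_lemma + rest

def share(n):
    # percentage of the whole word list, rounded to an integer
    return round(100 * n / total)
```

Here total is 8,700, share(1570) gives 18 and share(5146) gives 59, matching the figures in the text.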
Add morphological information to markup
Following the rules for the TEI XML w element, we use morphological information to enrich our word-forms with markup as follows:
<w lemma="opto" lemmaRef="http://data.perseus.org/collections/urn:cite:perseus:latlexent.lex38019.1">optet</w>
We go back to the original database of 20 documents with letters, and add the enriched w markup (with @lemma and @lemmaRef) to all unambiguously lemmatised words. This is done with an updating XQuery expression, using the file acdh-lemmataidentified.xml as the index of (unambiguously) lemmatised words.
let $lemmata := doc("/home/neven/rad/acdh-lemmataidentified.xml")
for $l in $lemmata//*:w
for $w in //*:w
where $w[.=data($l)]
return replace node $w with element w { $l/@lemma , $l/@lemmaRef , data($w) }
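Conceptually the update is a dictionary lookup over the index of lemmatised forms. A toy Python equivalent (tuples instead of TEI w nodes; as in the XQuery, word-forms missing from the index stay unannotated):

```python
def enrich(tokens, lemma_index):
    # pair each word-form with its lemma, or None when the form
    # was not unambiguously lemmatised
    return [(t, lemma_index.get(t)) for t in tokens]
```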
This makes it possible to also produce an index of lemmata and corresponding forms. The XQuery formats the index as an HTML table (with tr and td elements), ordering lemmata by frequency and then alphabetically, and listing all node ids in the database.
(: create an index of lemmatised words :)
(: order by frequency descending, then alphabetically :)
for $w in //*:w[@lemma]
let $idx := $w/@lemma
group by $idx
order by count($w) descending , $idx
return element tr {
element td { $idx } , element td { count($w) } , element td { db:node-id($w) } }
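The group by / order by logic maps directly onto a counter. A minimal Python sketch of the same ordering (frequency descending, then alphabetical):

```python
from collections import Counter

def lemma_frequency_index(lemmas):
    # count occurrences per lemma, order by count descending,
    # then by lemma alphabetically -- as in the XQuery above
    counts = Counter(lemmas)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```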
The resulting (static) HTML page can be seen on our Solr server: acdh-lemmata.html. Note that the index entries are linked to nodes in the acdhcroala database.
Now we're all set for an analysis of the lemmatised part of the corpus. For starters, we can say how much of the corpus is lemmatised. But we can also think about the frequencies of lemmata, as well as about co-occurrences (bigrams and trigrams of lemmata, etc.).
Material
The Bitbucket repository acdhcroala contains the texts (original), tokenized versions with added w elements, and the XQuery and XSL scripts to reproduce everything described here.
Archive of attempts
A not entirely successful XQuery attempt to tokenize text strings in XML. The XQuery modifies our database, adding all tokens as w elements (the original text nodes are preserved! -- the result looks messy, but works).
(: tokenize certain text nodes under text :)
(: mark tokens with w element :)
for $t in
collection("acdhcroala")//*:text//text()
[not(ancestor::*:note)]
[not(parent::*:abbr)]
[not(parent::*:foreign)]
[not(parent::*:corr[parent::*:choice])]
[not(parent::*:orig[parent::*:choice])]
for $tokens in ft:tokenize($t, map { 'case': 'sensitive' })
return insert node <w>{$tokens}</w> into $t/parent::*
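Why the result looks messy can be seen on a toy example: insert node ... into appends at the end of the parent element, so all tokens end up after the untouched text node, detached from their punctuation. A Python illustration of the same effect:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>Salve, mundi!</p>')
for token in ['Salve', 'mundi']:
    # appending, like `insert node <w>{$tokens}</w> into $t/parent::*`,
    # leaves the original text node in place
    w = ET.SubElement(p, 'w')
    w.text = token
result = ET.tostring(p, encoding='unicode')
```

The serialized result keeps the full original sentence first, followed by the bare tokens.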