Quo modo LaUrDal cum LiLa lemmatibus conectatur / Connecting LaUrDal to LiLa lemmata, a demonstration¶

Prepared by Neven Jovanović, ORCID 0000-0002-9119-399, Faculty of Humanities and Social Sciences, University of Zagreb.

This page is available online at: croala.ffzg.unizg.hr/eklogai/laurdal-lila/.

A PDF of the notes is available at: github.com/nevenjovanovic/laudationes-urbium-dalmaticarum/blob/main/laurdal-lila/latex-handout/23-05-26-jovanovic-lila-laurdal.pdf.

Written with the help and support of: Architectural Culture of the Early Modern Eastern Adriatic (AdriArchCult).

What is LaUrDal¶

Laudationes urbium Dalmaticarum is a digital library of Latin works (and excerpts) in which Dalmatian cities are mentioned or praised. An older version is available here (Philologic).

Cf.¶

Jovanović, Neven. 2012. ‘Dubrovnik in the Corpus of Eastern Adriatic Humanist Laudationes Urbium’, Dubrovnik Annals, 16: 23–36
Jovanović, Neven. 2011. ‘Marulić i laudationes urbium’, Colloquia Maruliana, XX (Književni krug Split - Marulianum, centar za proučavanje Marka Marulića i njegova humanističkoga kruga): 141–63 (with a summary in English)

Works in this demonstration¶

Leonardus Montagna, epigram in praise of the nobility of Split (1452) (Split); Markdown on GitHub
Michael Marullus, De laudibus Rhacusae (before 1489) (Dubrovnik); Markdown on GitHub
Letter of Daniel Clarius to Iulianus, archbishop of Ragusa (1505) (Dubrovnik); Markdown on GitHub
P. Nardinus Celineus, De situ Iadrae (1508) (Zadar); Markdown on GitHub

Alphabetical word indices (indices verborum)¶

For Montagna's epigram; also in HTML here; on GitHub
For Marullus's ode; also in HTML here; on GitHub
For Clarius's letter; also in HTML here; on GitHub
For Nardinus Celineus's poem; also in HTML here; on GitHub

Word indices by part of speech¶

For Montagna's epigram; also in HTML here; on GitHub
For Marullus's ode; also in HTML here; on GitHub
For Clarius's letter; also in HTML here; on GitHub
For Nardinus Celineus's poem; also in HTML here; on GitHub

Why should we lemmatize editions?¶

A scholarly edition is the result of all the knowledge and understanding an editor has gathered about the edited text. In the world of print, a lot of that knowledge remained tacit, or implicit (what form is this word in? what does this word mean in this place?); this made discussion and criticism difficult, or kept them at a general level. Ideally, in a scholarly edition everything I know about a text will be stated explicitly, and everything I state about a text can be referred to and cited.

Making knowledge explicit also opens new possibilities for research: if we annotate all lemmata and all forms, it is easy to search for a lemma, or to reorder words in the text (all words) as a frequency index.

But... this has to matter to a certain number of scholars.

Some thoughts on the current state of Latin lemmatization¶

Lemmatizer designers tend to solve very general questions (a lemmatizer for all historical languages, a lemmatizer using different language models for a language) and to give little thought to uniquely identifying a lemma. Ideally, to avoid ambiguity, this identification should be a unique number, just as people or organizations have Personal Identification Numbers.

If you think that statistically it does not matter... there are more homographs and homonyms in Latin than you would expect (cf. Latin homographs and homonyms on Dickinson College Wiki).

In Linked Open Data (LOD): a permanent URI serves as a peg or a hinge to hang different information on.

AGLDT / Morpheus (homonyms distinguished by index numbers: praedico1, praedico2); cf. a word list of a Neo-Latin text
LiLa / LemLat (homonyms distinguished by URIs: praedico: http://lila-erc.eu/data/id/lemma/118832, praedico: http://lila-erc.eu/data/id/lemma/118833)
There are other lemmatizers: Collatinus (web and desktop), Whitaker's Words, CLTK... and a number of smaller, less documented projects (Deucalion / Pie, Lamon...); in all of them, the concept of Linked Open Data is either unclear or considered unnecessary for lemmatization
Morpheus, LemLat, and Whitaker's Words can be used on a local machine and queried programmatically (by a script: Bash, Python, XQuery...)
Unique IDs are not easy to use in Morpheus, and it is not clear what they mean; Collatinus has internal unique IDs but no public reference endpoint, since it was not designed for LOD reference
LiLa offers the Text Linker (http://lila-erc.eu:8080/LiLaTextLinker/), where LOD results can be exported as RDF in Turtle (ttl) serialization (LiLa URIs serve as links for Linked Data)
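
The practical difference between a lemma label and a lemma URI can be shown with a few lines of Python. This is a minimal sketch, not part of LaUrDal: the three annotated tokens are invented, and the assignment of each form to one of the two praedico URIs quoted above is an assumption made only for illustration.

```python
from collections import Counter

# The two LiLa lemma URIs quoted in the list above for the homographs "praedico".
PRAEDICO_A = "http://lila-erc.eu/data/id/lemma/118832"
PRAEDICO_B = "http://lila-erc.eu/data/id/lemma/118833"

# Hypothetical annotated tokens: (form, lemma label, lemma URI).
# Which URI belongs to which verb is NOT asserted here; it is illustrative.
tokens = [
    ("praedicat", "praedico", PRAEDICO_A),
    ("praedicit", "praedico", PRAEDICO_B),
    ("praedicant", "praedico", PRAEDICO_A),
]

# Counting by label merges the two verbs into one entry.
by_label = Counter(label for _, label, _ in tokens)

# Counting by URI keeps them apart.
by_uri = Counter(uri for _, _, uri in tokens)

print(by_label)
print(by_uri)
```

Counting by label yields a single entry with three occurrences, while counting by URI yields two entries; only the second index can tell the homographs apart.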

Annotating lemmata, a workflow in LaUrDal¶

Tokenize sentences, words¶

What do we do with verse lines? (Answers: do not tokenize sentences, remove line breaks, replace them with a milestone empty node to avoid overlapping annotations)
What do we do with enclitics? (Answers: ignore, separate enclitic from the word, introduce new annotation layer)
This can be done programmatically (XSLT; cf. LaUrDal XSLT scripts), but the result has to be checked
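
The two questions above can be made concrete with a small Python sketch. LaUrDal itself uses XSLT for this step; the code below is not a port of those scripts. It assumes one possible set of answers: verse lines are flattened into a single token stream with a milestone marker between them, and only the enclitic -que is separated, using a naive suffix rule. The exception list and the function names are hypothetical, and punctuation is simply dropped.

```python
import re

# Hypothetical and deliberately tiny list of words that end in -que
# without containing the enclitic; a real list would be much longer.
NOT_ENCLITIC = {"atque", "quoque", "neque", "itaque", "denique", "usque"}
LINE_BREAK = "<lb/>"  # milestone marker standing in for a verse boundary


def split_enclitic(word):
    """Return [word], or [stem, 'que'] when the naive suffix rule applies."""
    lower = word.lower()
    if lower in NOT_ENCLITIC:
        return [word]
    if lower.endswith("que") and len(lower) > 4:
        return [word[:-3], word[-3:]]
    return [word]


def tokenize_verse(lines):
    """Flatten verse lines into one token list, with a milestone between lines."""
    tokens = []
    for i, line in enumerate(lines):
        if i > 0:
            tokens.append(LINE_BREAK)
        for word in re.findall(r"\w+", line):
            tokens.extend(split_enclitic(word))
    return tokens


verse = [
    "Arma virumque cano, Troiae qui primus ab oris",
    "Italiam fato profugus Laviniaque venit",
]
tokens = tokenize_verse(verse)
print(tokens)
```

A suffix rule of this kind will still split words such as quisque or plerumque wrongly, which is why the output has to be checked by hand.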

Normalize spellings¶

Keep spellings from the original edition, use the @norm attribute to add canonical / standard spellings
This has to be done manually... and checked
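
How the original spelling and the standard spelling can live side by side is sketched below in Python. The mapping table is a hypothetical stand-in for the manual work described above, the three sample words are invented, and the TEI namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-made table: spelling in the edition -> standard spelling.
NORMALIZED = {"foelix": "felix", "charus": "carus", "lachrymae": "lacrimae"}


def make_w(form):
    """Build a <w> element that keeps the original spelling as its text
    and records the standard spelling, if one is known, in @norm."""
    w = ET.Element("w")
    w.text = form
    norm = NORMALIZED.get(form.lower())
    if norm is not None:
        w.set("norm", norm)
    return w


s = ET.Element("s")
for form in ["foelix", "urbs", "charus"]:
    s.append(make_w(form))

xml = ET.tostring(s, encoding="unicode")
print(xml)
```

Words already in standard spelling get no @norm attribute, so a later step can fall back on the text of the element.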

Use normalized spellings for lemmatization¶

Use LiLa's Text Linker to add lemma URIs
Convert the LiLa RDF Turtle (ttl) export to an XML representation (using the Python script rdfconvert)
From XML, create a BaseX database, query the database to annotate the documents
An XQuery script from LaUrDal to add URIs to words: add-lemmauri-2.xq
Check everything...

Add lemma labels and URIs to files¶

Use the TEI XML attributes @lemma and @lemmaRef (TEI Guidelines 17.1 Linguistic Segment Categories)
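
The annotation step can be sketched in Python as follows. In LaUrDal it is done with XQuery in BaseX (add-lemmauri-2.xq); the code below is not that script. It assumes a lookup table from normalized form to lemma label and URI; the table, the example.org URIs, and the function name are placeholders, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup: normalized form -> (lemma label, lemma URI).
# The example.org URIs are placeholders, not real LiLa identifiers.
LEMMA_LOOKUP = {
    "felix": ("felix", "http://example.org/lemma/felix"),
    "urbs": ("urbs", "http://example.org/lemma/urbs"),
}


def annotate(root):
    """Add @lemma and @lemmaRef to every <w> that can be resolved.

    The lookup key is @norm when present, otherwise the text of the word.
    Returns the list of keys that could not be resolved.
    """
    missing = []
    for w in root.iter("w"):
        key = (w.get("norm") or w.text or "").lower()
        hit = LEMMA_LOOKUP.get(key)
        if hit is None:
            missing.append(key)
            continue
        w.set("lemma", hit[0])
        w.set("lemmaRef", hit[1])
    return missing


doc = ET.fromstring('<s><w norm="felix">foelix</w><w>urbs</w><w>Iadra</w></s>')
unresolved = annotate(doc)
print(ET.tostring(doc, encoding="unicode"))
print(unresolved)
```

A lookup by form alone cannot choose between homographs, so the list of unresolved words and every ambiguous form still need a manual check.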

What can we get when we have URIs of lemmata?¶

Produce indices of words (indices verborum), by frequency or alphabetically, with contexts of occurrences
Produce indices of parts of speech (indices partium orationis)
Produce stylistic comparisons: parts of speech in documents, in sentences...
Produce specialized terminology indices (proper names, architecture, or affects, or politics)
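
How such indices can be derived from annotated words is sketched below in Python. The sample document, the example.org URIs, and the use of a @pos attribute for parts of speech are assumptions made for illustration; the LaUrDal indices linked above were not produced with this code.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample; the example.org URIs are placeholders.
SAMPLE = """<text>
<s n="1"><w lemma="urbs" lemmaRef="http://example.org/lemma/urbs" pos="NOUN">urbs</w>
<w lemma="sum" lemmaRef="http://example.org/lemma/sum" pos="VERB">est</w></s>
<s n="2"><w lemma="urbs" lemmaRef="http://example.org/lemma/urbs" pos="NOUN">urbem</w>
<w lemma="laudo" lemmaRef="http://example.org/lemma/laudo" pos="VERB">laudat</w></s>
</text>"""


def build_index(root):
    """Map each lemma URI to its occurrences as (form, sentence context)."""
    occurrences = defaultdict(list)
    labels = {}
    for s in root.iter("s"):
        context = " ".join(w.text for w in s.iter("w"))
        for w in s.iter("w"):
            uri = w.get("lemmaRef")
            if uri is None:
                continue
            labels[uri] = (w.get("lemma"), w.get("pos"))
            occurrences[uri].append((w.text, context))
    return occurrences, labels


def by_frequency(occurrences, labels):
    """Lemma labels with counts, most frequent first, then alphabetically."""
    pairs = [(labels[uri][0], len(occ)) for uri, occ in occurrences.items()]
    return sorted(pairs, key=lambda p: (-p[1], p[0]))


def by_pos(labels):
    """Group lemma labels by part of speech."""
    groups = defaultdict(list)
    for label, pos in labels.values():
        groups[pos].append(label)
    return {pos: sorted(items) for pos, items in groups.items()}


root = ET.fromstring(SAMPLE)
occurrences, labels = build_index(root)
frequency_index = by_frequency(occurrences, labels)
pos_index = by_pos(labels)
print(frequency_index)
print(pos_index)
```

Because the index is keyed on the URI and not on the label, homographs stay in separate entries even when their labels are identical.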

Repository on GitHub¶

Repository: Laudationes urbium Dalmaticarum (LaUrDal)

Supported by¶

This repository is part of a project that has received funding from the European Union's Horizon 2020 Research and Innovation Programme (GA n. 865863 ERC-AdriArchCult).