Connecting LaUrDal to LiLa lemmata, a demonstration

Prepared by Neven Jovanović, ORCID 0000-0002-9119-399, Faculty of Humanities and Social Sciences, University of Zagreb.

This page is available online at: croala.ffzg.unizg.hr/eklogai/laurdal-lila/.

A PDF of the notes is available at: github.com/nevenjovanovic/laudationes-urbium-dalmaticarum/blob/main/laurdal-lila/latex-handout/23-05-26-jovanovic-lila-laurdal.pdf.

Written with the support of: Architectural Culture of the Early Modern Eastern Adriatic (AdriArchCult).

What is LaUrDal

Laudationes urbium Dalmaticarum is a digital library of Latin works (and excerpts) in which Dalmatian cities are mentioned or praised. An older version can be found here (Philologic).

Works in this demonstration

  1. Leonardus Montagna, epigram in praise of the nobility of Split (1452) (Split); Markdown on Github
  2. Michael Marullus, De laudibus Rhacusae (before 1489) (Dubrovnik); Markdown on Github
  3. Letter of Daniel Clarius to Iulianus, archbishop of Ragusa (1505) (Dubrovnik); Markdown on Github
  4. P. Nardinus Celineus, De situ Iadrae (1508) (Zadar); Markdown on Github

Alphabetical word indices

  1. For Montagna's epigram; also in HTML here; on Github
  2. For Marullus' ode; also in HTML here; on Github
  3. For Clarius' letter; also in HTML here; on Github
  4. For Nardinus Celineus' poem; also in HTML here; on Github

Word indices by part of speech

  1. For Montagna's epigram; also in HTML here; on Github
  2. For Marullus' ode; also in HTML here; on Github
  3. For Clarius' letter; also in HTML here; on Github
  4. For Nardinus Celineus' poem; also in HTML here; on Github

Why should we lemmatize editions?

A scholarly edition is the result of all the knowledge and understanding an editor has gathered about the edited text. In the world of print, much of that knowledge remained tacit or implicit (what form is this word in? what does this word mean in this place?), which made discussion and criticism difficult, or kept them general. Ideally, in a scholarly edition everything I know about a text will be stated explicitly, and everything I state about a text can be referred to.

Making knowledge explicit also opens new possibilities for research: if we annotate all lemmata and all forms, it is easy to search for a lemma, or to reorder words in the text (all words) as a frequency index.

But... this has to matter to a certain number of scholars.

Some thoughts on the current state of Latin lemmatization

Lemmatizer designers tend to solve very general questions (a lemmatizer for all historical languages, a lemmatizer using different language models for a language) and to give little thought to uniquely identifying a lemma. Ideally, to avoid ambiguity, this identification should be a unique number, just as people or organizations have Personal Identification Numbers.

If you think that statistically it does not matter... there are more homographs and homonyms in Latin than you would expect (cf. Latin homographs and homonyms on Dickinson College Wiki).

In Linked Open Data (LOD), a permanent URI serves as a peg or a hinge on which to hang different information.

  • AGLDT / Morpheus (homonyms distinguished by index numbers: praedico1 praedico2); cf. a word list of a Neo-Latin text
  • LiLa / LemLat (homonyms distinguished by URIs: praedico: http://lila-erc.eu/data/id/lemma/118832, praedico: http://lila-erc.eu/data/id/lemma/118833)
  • There are other lemmatizers: Collatinus (web and desktop), Whitaker's Words, CLTK... and a number of smaller, less documented projects (Deucalion / Pie, Lamon...); in all of them, the concept of Linked Open Data is either unclear or considered unnecessary for lemmatization
  • Morpheus, LemLat, and Whitaker's Words can be used on a local machine and queried programmatically (by a script: Bash, Python, XQuery...)
  • Unique IDs are not easy to use in Morpheus, and it is not clear what they mean; Collatinus has internal unique IDs but no public reference endpoint, as it was not designed for LOD reference
  • LiLa offers the Text Linker http://lila-erc.eu:8080/LiLaTextLinker/ where LOD results can be exported as RDF in Turtle (ttl) serialization (LiLa URIs serve as links for Linked Data)
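
Why a URI matters more than a lemma label can be shown with a small Python sketch. It uses the two LiLa URIs for praedico quoted above; the token list is invented for illustration, and the assignment of each form to one URI or the other is an assumption made only to have something to group, not a statement about LiLa's data.

```python
from collections import defaultdict

# The two LiLa lemma URIs for the homographs "praedico" quoted in the list above.
PRAEDICO_A = "http://lila-erc.eu/data/id/lemma/118832"
PRAEDICO_B = "http://lila-erc.eu/data/id/lemma/118833"

# Hypothetical annotated tokens: (form in the text, lemma label, lemma URI).
# Which URI stands for which verb is NOT asserted here; the pairing is illustrative.
tokens = [
    ("praedicat", "praedico", PRAEDICO_A),
    ("praedicit", "praedico", PRAEDICO_B),
    ("praedicant", "praedico", PRAEDICO_A),
]


def group_by(tokens, key_index):
    """Group the forms by one field of the token tuple (1 = label, 2 = URI)."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token[key_index]].append(token[0])
    return dict(groups)


by_label = group_by(tokens, 1)  # the label alone conflates the two verbs
by_uri = group_by(tokens, 2)    # the URI keeps them apart

print(len(by_label), "group(s) by label;", len(by_uri), "group(s) by URI")
```

Grouping by label yields a single entry for both verbs, while grouping by URI keeps two; any index or search built on labels alone inherits that conflation.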

Annotating lemmata, a workflow in LaUrDal

Tokenize sentences and words

  • What do we do with verse lines? (Answers: do not tokenize sentences, remove line breaks, replace them with a milestone empty node to avoid overlapping annotations)
  • What do we do with enclitics? (Answers: ignore, separate enclitic from the word, introduce new annotation layer)
  • Can be done programmatically (XSLT; cf. LaUrDal XSLT scripts), but the result has to be checked
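
The choices listed above can be sketched in Python (the LaUrDal scripts themselves are XSLT; this is not a port of them). The sketch assumes one option for each question: line ends become empty milestone tokens, and a final -que is split off as a separate token. The exception list `NOT_ENCLITIC` and the token type names `w`, `pc`, `enclitic` and `lb` are hypothetical, and the sample line is the opening of the Aeneid, not a LaUrDal text.

```python
import re

# Hypothetical and deliberately incomplete list of words whose final -que
# is part of the word, not an enclitic.
NOT_ENCLITIC = {"atque", "neque", "quoque", "itaque", "usque", "quisque", "denique"}


def tokenize_verse(lines):
    """Return a flat token list; each line end becomes an ("lb", None) milestone."""
    tokens = []
    for line in lines:
        # Runs of word characters are words; any other non-space character is punctuation.
        for item in re.findall(r"\w+|[^\w\s]", line):
            lower = item.lower()
            if not item[0].isalnum():
                tokens.append(("pc", item))
            elif len(lower) > 3 and lower.endswith("que") and lower not in NOT_ENCLITIC:
                # Naive rule: split off -que unless the word is a known exception.
                tokens.append(("w", item[:-3]))
                tokens.append(("enclitic", item[-3:]))
            else:
                tokens.append(("w", item))
        # Empty milestone instead of a wrapping line element, to avoid overlap.
        tokens.append(("lb", None))
    return tokens


verse = ["Arma virumque cano, Troiae qui primus ab oris"]
tokens = tokenize_verse(verse)
for token in tokens:
    print(token)
```

A rule this naive will split words such as plerumque that are missing from the exception list, which is one reason the output has to be checked by hand.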

Normalize spellings

  • Keep spellings from the original edition, use the @norm attribute to add canonical / standard spellings
  • This has to be done manually... and checked
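
Once a table of spellings has been compiled by hand, applying it can be scripted. The following Python sketch assumes that words are already wrapped in `<w>` elements; the sample fragment and the spelling table are invented for illustration and do not come from the LaUrDal files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a verse line whose words are already tokenized as <w>.
SOURCE = "<l><w>foelix</w> <w>urbs</w> <w>charissima</w></l>"

# Hypothetical, manually compiled table: spelling in the edition -> standard spelling.
NORMALIZED = {"foelix": "felix", "charissima": "carissima"}


def add_norm(root, table):
    """Add @norm to every <w> whose text is in the table; return how many were changed."""
    changed = 0
    for w in root.iter("w"):
        standard = table.get(w.text)
        if standard is not None:
            # The original spelling stays as element text; only the attribute is added.
            w.set("norm", standard)
            changed += 1
    return changed


root = ET.fromstring(SOURCE)
count = add_norm(root, NORMALIZED)
print(count, "word(s) normalized")
print(ET.tostring(root, encoding="unicode"))
```

Words that are not in the table are left without a @norm attribute, so a later step can fall back on the original spelling.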

Use normalized spellings for lemmatization

  • Use LiLa's Text Linker to add lemma URIs
  • Convert the LiLa RDF ttl output to an XML representation (using the Python script rdfconvert)
  • From the XML, create a BaseX database, then query the database to annotate the documents
  • An XQuery script from LaUrDal to add URIs to words: add-lemmauri-2.xq
  • Check everything...

Add lemma labels and URIs to files
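
The LaUrDal workflow performs this step with BaseX and XQuery (add-lemmauri-2.xq). As a rough illustration of the logic only, here is a Python sketch under several assumptions: the lookup table `CANDIDATES` stands in for the converted Text Linker output, the attribute names `lemma` and `lemmaRef` are assumed, and the `example.org` URI is a placeholder. Only the two praedico URIs are taken from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup standing in for the converted Text Linker output:
# normalized form -> list of (lemma label, lemma URI) candidates.
CANDIDATES = {
    "felix": [("felix", "http://example.org/lemma/felix")],  # placeholder URI
    "praedico": [
        ("praedico", "http://lila-erc.eu/data/id/lemma/118832"),
        ("praedico", "http://lila-erc.eu/data/id/lemma/118833"),
    ],
}

# Hypothetical sentence with one normalized spelling.
SOURCE = '<s><w norm="felix">foelix</w> <w>urbs</w> <w>praedico</w></s>'


def add_lemma_refs(root, candidates):
    """Annotate <w> elements that have exactly one candidate; return the others."""
    unresolved = []
    for w in root.iter("w"):
        # Prefer the normalized spelling, fall back on the text of the edition.
        form = (w.get("norm") or w.text or "").lower()
        found = candidates.get(form, [])
        if len(found) == 1:
            label, uri = found[0]
            w.set("lemma", label)
            w.set("lemmaRef", uri)
        else:
            # No candidate or several candidates: leave the word for manual checking.
            unresolved.append(form)
    return unresolved


root = ET.fromstring(SOURCE)
unresolved = add_lemma_refs(root, CANDIDATES)
print("to check by hand:", unresolved)
```

Leaving ambiguous and unknown forms untouched, instead of picking the first candidate, keeps the manual check explicit: the returned list is the to-do list.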

What can we get when we have URIs of lemmata?

  • Produce indices of words (indices verborum), by frequency or alphabetically, with contexts of occurrences
  • Produce indices of parts of speech (indices partium orationis)
  • Produce stylistic comparisons: parts of speech in documents, in sentences...
  • Produce specialized terminology indices (proper names, architecture, affects, politics)
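
A minimal Python sketch of the first two kinds of index follows. The tokens, the `example.org` URIs and the part-of-speech tags are invented for illustration; in practice they would be read from the annotated files.

```python
from collections import Counter, defaultdict

BASE = "http://example.org/lemma/"  # placeholder, not a LiLa URI

# Hypothetical annotated tokens:
# (sentence number, form, lemma label, lemma URI, part of speech).
tokens = [
    (1, "urbs", "urbs", BASE + "urbs", "NOUN"),
    (1, "felix", "felix", BASE + "felix", "ADJ"),
    (1, "est", "sum", BASE + "sum", "VERB"),
    (2, "urbem", "urbs", BASE + "urbs", "NOUN"),
    (2, "laudat", "laudo", BASE + "laudo", "VERB"),
    (2, "poeta", "poeta", BASE + "poeta", "NOUN"),
]


def build_index(tokens):
    """Map (lemma label, lemma URI) to the list of (sentence, form) occurrences."""
    index = defaultdict(list)
    for sentence, form, label, uri, _pos in tokens:
        index[(label, uri)].append((sentence, form))
    return index


def alphabetical(index):
    """Entries sorted by lemma label, then by URI."""
    return sorted(index.items())


def by_frequency(index):
    """Entries sorted by descending number of occurrences, then alphabetically."""
    return sorted(index.items(), key=lambda item: (-len(item[1]), item[0]))


index = build_index(tokens)
pos_counts = Counter(pos for *_rest, pos in tokens)

for (label, uri), occurrences in by_frequency(index):
    print(label, len(occurrences), occurrences)
print(dict(pos_counts))
```

Because the index key includes the URI, two homographs with the same label would get separate entries; the sentence numbers give each occurrence a context to look up.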

Repository on Github

Repository: Laudationes urbium Dalmaticarum (LaUrDal)

Supported by

This repository is part of a project that has received funding from the European Union's Horizon 2020 Research and Innovation Programme (GA n. 865863 ERC-AdriArchCult).