Connecting LaUrDal to LiLa lemmata, a demonstration¶
Prepared by Neven Jovanović, ORCID 0000-0002-9119-399, Faculty of Humanities and Social Sciences, University of Zagreb.
This page is available online at: croala.ffzg.unizg.hr/eklogai/laurdal-lila/.
A PDF of the handout is available at: github.com/nevenjovanovic/laudationes-urbium-dalmaticarum/blob/main/laurdal-lila/latex-handout/23-05-26-jovanovic-lila-laurdal.pdf.
Prepared with the help and support of: Architectural Culture of the Early Modern Eastern Adriatic (AdriArchCult).
What is LaUrDal¶
Laudationes urbium Dalmaticarum is a digital library of Latin works (and excerpts) in which Dalmatian cities are mentioned or praised. An older version is available here (Philologic).
Cf.¶
- Jovanović, Neven. 2012. ‘Dubrovnik in the Corpus of Eastern Adriatic Humanist Laudationes Urbium’, Dubrovnik Annals, 16: 23–36
- Jovanović, Neven. 2011. ‘Marulić i laudationes urbium’, Colloquia Maruliana XX, (Književni krug Split - Marulianum, centar za proučavanje Marka Marulića i njegova humanističkoga kruga): 141–63 (with a summary in English)
Works in this demonstration¶
- Leonardi Montagnae epigramma in laudem nobilitatis Aspalatensis (1452) (Split); Markdown on GitHub
- Michaelis Marulli De laudibus Rhacusae (ante 1489) (Dubrovnik); Markdown on GitHub
- Epistola Danielis Clarii ad Iulianum archiepiscopum Ragusinum (1505) (Dubrovnik); Markdown on GitHub
- P. Nardini Celinei De situ Iadrae (1508) (Zadar); Markdown on GitHub
Alphabetical word indices¶
- To Montagna's epigram; also in HTML here; on GitHub
- To Marullus's ode; also in HTML here; on GitHub
- To Clarius's letter; also in HTML here; on GitHub
- To Nardinus Celineus's poem; also in HTML here; on GitHub
Word indices by part of speech¶
- To Montagna's epigram; also in HTML here; on GitHub
- To Marullus's ode; also in HTML here; on GitHub
- To Clarius's letter; also in HTML here; on GitHub
- To Nardinus Celineus's poem; also in HTML here; on GitHub
Why should we lemmatize editions?¶
A scholarly edition is the result of all the knowledge and understanding an editor has gathered about the edited text. In the world of print, much of that knowledge remained tacit, or implicit (what form is this word in? what does this word mean in this place?); this made discussion and criticism difficult, or kept them general. Ideally, in a scholarly edition everything I know about a text will be stated explicitly, and everything I state about a text can be referenced.
Making knowledge explicit also opens new possibilities for research: if we annotate all lemmata and all forms, it becomes easy to search for a lemma, or to rearrange all the words of the text into a frequency index.
But... this has to matter to a sufficient number of scholars.
Some thoughts on the current state of Latin lemmatization¶
Lemmatizer designers tend to solve very general questions (a lemmatizer for all historical languages, a lemmatizer using different language models for a language) and to give little thought to uniquely identifying a lemma. Ideally, to avoid ambiguity, this identification should be a unique number, just as people or organizations have Personal Identification Numbers.
If you think that statistically it does not matter... there are more homographs and homonyms in Latin than you would expect (cf. Latin homographs and homonyms on Dickinson College Wiki).
In Linked Open Data (LOD), a permanent URI serves as a peg or a hinge on which to hang different pieces of information.
- AGLDT / Morpheus (homonyms distinguished by index numbers: `praedico1`, `praedico2`); cf. a word list of a Neo-Latin text
- LiLa / LemLat (homonyms distinguished by URIs: `praedico`: http://lila-erc.eu/data/id/lemma/118832, `praedico`: http://lila-erc.eu/data/id/lemma/118833)
- There are other lemmatizers: Collatinus (web and desktop), Whitaker's Words, CLTK... and a number of smaller, less documented projects (Deucalion / Pie, Lamon...); in all of them, the concept of Linked Open Data is either unclear or considered unnecessary for lemmatization
- Morpheus, LemLat, and Whitaker's Words can be used on a local machine and queried programmatically (by a script: Bash, Python, XQuery...)
- It is not easy to use unique IDs in Morpheus, and it is not clear what they mean; Collatinus has internal unique IDs, but no public reference endpoint, since it was not designed for LOD reference
- LiLa offers the Text Linker http://lila-erc.eu:8080/LiLaTextLinker/ where LOD results can be exported as a `ttl` RDF representation (LiLa URIs serve as links for Linked Data)
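To see why the identifier matters, here is a minimal Python sketch that counts the same tokens once by lemma label and once by lemma URI. The two URIs are the `praedico` URIs quoted above; the token forms, and the assignment of each form to one URI or the other, are invented for illustration and have not been checked against LiLa.

```python
from collections import Counter

# (form, lemma label, lemma URI). Forms and form-to-URI assignments are
# illustrative assumptions; only the two URIs come from the list above.
tokens = [
    ("praedicat", "praedico", "http://lila-erc.eu/data/id/lemma/118832"),
    ("praedicit", "praedico", "http://lila-erc.eu/data/id/lemma/118833"),
    ("praedicant", "praedico", "http://lila-erc.eu/data/id/lemma/118832"),
]

# Counting by label merges the two homographs into a single entry.
by_label = Counter(label for _, label, _ in tokens)

# Counting by URI keeps the two lemmata apart.
by_uri = Counter(uri for _, _, uri in tokens)

print(len(by_label), "entry by label;", len(by_uri), "entries by URI")
```

With labels alone, an index cannot tell the two verbs apart; with URIs as keys, each lemma gets its own entry.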
Annotating lemmata, a workflow in LaUrDal¶
Tokenize sentences, words¶
- What do we do with verse lines? (Answers: do not tokenize sentences, remove line breaks, replace them with a `milestone` empty node to avoid overlapping annotations)
- What do we do with enclitics? (Answers: ignore, separate enclitic from the word, introduce new annotation layer)
- Can be done programmatically (XSLT; cf. LaUrDal XSLT scripts)... has to be checked
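LaUrDal does this step in XSLT; the Python sketch below only illustrates the idea and is not the project's script. It assumes that TEI's empty `<lb/>` element serves as the milestone for line starts, that words go into `<w>` and punctuation into `<pc>`, and that enclitics stay attached to their word (the "ignore" option). The function name `tokenize_verse` and the wrapping `<seg>` are hypothetical, and the sample lines are Vergil's Aeneid 1.1-2, not a LaUrDal text.

```python
import re
import xml.etree.ElementTree as ET

def tokenize_verse(lines):
    """Wrap words in <w>, punctuation in <pc>, and mark each line start with an empty <lb/>."""
    seg = ET.Element("seg")
    for n, line in enumerate(lines, start=1):
        # The empty milestone replaces the line break, so word annotations never overlap with lines.
        ET.SubElement(seg, "lb", n=str(n))
        # A token is either a run of letters or a single non-space character.
        for tok in re.findall(r"[^\W\d_]+|\S", line):
            tag = "w" if tok.isalpha() else "pc"
            ET.SubElement(seg, tag).text = tok
    return seg

lines = [
    "Arma virumque cano, Troiae qui primus ab oris",
    "Italiam fato profugus Laviniaque venit",
]
seg = tokenize_verse(lines)
print(ET.tostring(seg, encoding="unicode"))
```

Note that `virumque` and `Laviniaque` come out as single tokens; separating the enclitic would need an exception list (for words such as `atque`), which is one reason the result has to be checked by hand.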
Normalize spellings¶
- Keep spellings from the original edition, use the `@norm` attribute to add canonical / standard spellings
- This has to be done manually... and checked
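The decisions themselves are manual, but once they are recorded in a table, applying them can be scripted. A minimal sketch, assuming tokenized `<w>` elements without a namespace; the table entries, the sample sentence, and the function name `add_norm` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-made table: spelling in the edition -> standard spelling (hypothetical entries).
NORMALIZED = {"foelix": "felix", "charus": "carus"}

def add_norm(root, table):
    """Set @norm on every <w> whose lowercased text is in the table; return the number of changes."""
    changed = 0
    for w in root.iter("w"):
        key = (w.text or "").lower()
        if key in table:
            # The original spelling stays as element text; only the attribute is added.
            w.set("norm", table[key])
            changed += 1
    return changed

doc = ET.fromstring("<s><w>Foelix</w> <w>urbs</w></s>")
print(add_norm(doc, NORMALIZED), "word(s) normalized")
print(ET.tostring(doc, encoding="unicode"))
```

Words without an entry in the table are left untouched, so the edition's spelling is never overwritten.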
Use normalized spellings for lemmatization¶
- Use LiLa's Text Linker to add lemma URIs
- Convert LiLa RDF `ttl` to `XML` representation (using Python script rdfconvert)
- From XML, create a BaseX database, query the database to annotate the documents
- An XQuery script from LaUrDal to add URIs to words: add-lemmauri-2.xq
- Check everything...
Add lemmata labels and URIs to files¶
- use TEI XML attributes `@lemma` and `@lemmaRef` (TEI Guidelines 17.1 Linguistic Segment Categories)
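LaUrDal adds these attributes with an XQuery script run against a BaseX database; the Python sketch below only shows what the result looks like on a `<w>` element. The lookup table stands in for data that would come from the Text Linker output, and its single entry is a placeholder: the URI is one of the `praedico` URIs quoted earlier, but whether it is the right one for this form has not been verified. The function name `add_lemma_refs` and the sample sentence are hypothetical, and the elements carry no namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup: normalized form -> (lemma label, lemma URI).
LOOKUP = {
    "praedicat": ("praedico", "http://lila-erc.eu/data/id/lemma/118832"),
}

def add_lemma_refs(root, lookup):
    """Add @lemma and @lemmaRef to <w> elements; return the forms that found no match."""
    unmatched = []
    for w in root.iter("w"):
        # Prefer the normalized spelling in @norm, fall back to the text of the element.
        form = (w.get("norm") or w.text or "").lower()
        if form in lookup:
            lemma, uri = lookup[form]
            w.set("lemma", lemma)
            w.set("lemmaRef", uri)
        else:
            unmatched.append(form)
    return unmatched

doc = ET.fromstring("<s><w>praedicat</w> <w>urbem</w></s>")
print("to check by hand:", add_lemma_refs(doc, LOOKUP))
print(ET.tostring(doc, encoding="unicode"))
```

A lookup by form alone cannot decide between homographs, so ambiguous forms still need a decision per occurrence; the list of unmatched forms gives a starting point for the manual check.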
What can we get when we have URIs of lemmata?¶
- Produce indices of words (indices verborum), by frequency or alphabetically, with contexts of occurrences
- Produce indices of parts of speech (indices partium orationis)
- Produce stylistic comparisons: parts of speech in documents, in sentences...
- Produce specialized terminology indices (proper names, architecture, or affects, or politics)
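As an illustration of the first item, the sketch below builds an index keyed by lemma URI from annotated `<w>` elements and sorts it alphabetically and by frequency. It is not the script used for the LaUrDal indices. The sample sentence and the `example.org` URIs are placeholders, the function names are hypothetical, and the elements carry no namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Placeholder data: the URIs are not LiLa URIs.
SAMPLE = (
    '<s>'
    '<w lemma="urbs" lemmaRef="http://example.org/lemma/urbs">urbem</w>'
    '<w lemma="laudo" lemmaRef="http://example.org/lemma/laudo">laudat</w>'
    '<w lemma="urbs" lemmaRef="http://example.org/lemma/urbs">urbis</w>'
    '</s>'
)

def build_index(root):
    """Map each lemma URI to its label and the forms found, in document order."""
    index = {}
    for w in root.iter("w"):
        uri = w.get("lemmaRef")
        if uri is None:
            continue  # skip words that have not been annotated yet
        entry = index.setdefault(uri, {"lemma": w.get("lemma", ""), "forms": []})
        entry["forms"].append(w.text)
    return index

def alphabetical(index):
    """Entries sorted by lemma label, then by URI to keep homographs in a stable order."""
    return sorted(index.items(), key=lambda kv: (kv[1]["lemma"], kv[0]))

def by_frequency(index):
    """Entries sorted by number of occurrences, most frequent first."""
    return sorted(index.items(), key=lambda kv: (-len(kv[1]["forms"]), kv[1]["lemma"]))

index = build_index(ET.fromstring(SAMPLE))
for uri, entry in by_frequency(index):
    print(len(entry["forms"]), entry["lemma"], entry["forms"], uri)
```

Because the key is the URI and not the label, two homographic lemmata would appear as two separate entries in either ordering.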
Repository on GitHub¶
Repository: Laudationes urbium Dalmaticarum (LaUrDal)
Supported by¶
This repository is part of a project that has received funding from the European Union's Horizon 2020 Research and Innovation Programme (GA n. 865863 ERC-AdriArchCult).