From annotation to learners’ corpora

Neven Jovanović (neven.jovanovic@ffzg.hr)
Leipzig, July 6-7, 2017

This page: croala.ffzg.unizg.hr/annotation-learners-corpora/
Repository: bitbucket.org/nevenjovanovic/discipulus-leipzig-2017

The Plan

On learning a (historical) language

On using annotations (treebanks) for exercises

Conclusion: from research to learning... and back

10,000

Hours needed to achieve mastery in a field, according to Malcolm Gladwell, Outliers (2008)

300

Hours of language-related classes per academic year (est.)

جائع (Arabic: "hungry")

Feedback?

Computer-assisted exercises

in the style of Memrise etc.

Cover as much of the difference (10,000 − 300)
as you like (or have to)

Teacher's perspective

The exercises are a pain to create

How to keep track of the students?
(regarding both content and achievement)

Learner's perspective

Feedback?

Accessibility?

Ingredients?

A (simple) corpus

An annotated corpus

Lemma, meaning, morphology, syntax...

A frequency list

Greek and Latin: The Dickinson College Core Vocabulary

A tool for transforming data

XQuery (in my case)

A learning environment (with exercise formats)

Moodle (in my case)

Vocabulary exercises

A frequency list

From Timo Korkiakangas's Late Latin Charter Treebank (LLCT)


unknown 26469
et 10251
 9227
in 7057
ego 5133
qui 3791
sum 3642
ipse 3299
meus 2835
tu 2769
de 2306
is 2203
ad 2176
omnis 2082
per 2036
hic 1979
vel 1931
ecclesia 1918
manus 1911
filius 1897
rogo 1889
  

A frequency list

From Timo Korkiakangas's Late Latin Charter Treebank (LLCT)


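(: Count how often each lemma occurs in the treebank. :)
for $f in //*:LM
let $lemma := $f/@lemma
group by $lemma
order by count($f) descending
(: Strip a trailing digit from the lemma for display; grouping uses the full value. :)
return element w { replace($lemma, '[0-9]$', ''), count($f) }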
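
To try the query without the full treebank, here is a self-contained sketch that runs it on a small invented sample. Only the `LM` element and the `lemma` attribute come from the query above; the wrapper element, the sample lemmas and the numbered homonyms (`hic1`, `hic2`) are assumptions made for illustration.

```xquery
xquery version "3.1";

(: Hypothetical miniature sample; the real LLCT files are far larger and
   their exact structure is not reproduced here. :)
let $sample :=
  <tree>
    <LM lemma="et"/>
    <LM lemma="hic1"/>
    <LM lemma="et"/>
    <LM lemma="hic2"/>
    <LM lemma="et"/>
    <LM lemma="hic1"/>
  </tree>
(: Same logic as the query above, applied to the sample only. :)
for $f in $sample//*:LM
let $lemma := $f/@lemma
group by $lemma
order by count($f) descending
return element w { replace($lemma, '[0-9]$', ''), count($f) }
```

On this sample the result is `<w>et 3</w>`, `<w>hic 2</w>`, `<w>hic 1</w>`. Because the digit is stripped only after grouping, numbered homonyms stay on separate rows, which may be why a lemma can appear twice in such a list.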
  

A frequency list

From Timo Korkiakangas's Late Latin Charter Treebank (LLCT)


unknown 26469
ET
 9227
IN
EGO
QVI
SVM
IPSE
MEVS
TV
DE
IS
AD
OMNIS
PER
HIC
HIC
VEL
ecclesia 1918
MANVS
filius 1897
ROGO
AB
SANCTVS
NOS
RES
chartula 1645
suprascriptus 1553
DEVS
SIGNVM
casa 1488
TESTIS
NOSTER
presbyter 1429
CVM
DO
VT
comma 1168
ANNVS
VNVS
clericus 1049
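
The slides do not say how this second list was produced. One possible reading is that capitals mark lemmas also found in a core vocabulary, while lemmas outside it keep their lower case and their count. A minimal sketch under that assumption follows; the `list`/`w` structure and the two-word `$core` sequence are invented stand-ins, not the format of any real core list.

```xquery
xquery version "3.1";

(: Hypothetical input: four entries copied from the frequency list above. :)
let $frequencies :=
  <list>
    <w lemma="et" n="10251"/>
    <w lemma="ecclesia" n="1918"/>
    <w lemma="manus" n="1911"/>
    <w lemma="filius" n="1897"/>
  </list>
(: Invented stand-in for a core vocabulary list. :)
let $core := ("et", "manus")
for $w in $frequencies/w
return
  if ($w/@lemma = $core)
  (: Core lemma: show it in capitals, writing V for U as in the list above. :)
  then translate(upper-case($w/@lemma), "U", "V")
  (: Lemma outside the core list: keep it as is, with its frequency. :)
  else concat($w/@lemma, " ", $w/@n)
```

The entries printed with a count are then the corpus-specific vocabulary that a general core list would not cover.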
  

A list of phrases

From Timo Korkiakangas's Late Latin Charter Treebank (LLCT)


in integrum 514
inter nos 379
in omnibus 309
per annos singulos 290
In nomine Dei 266
a me 222
post traditam 203
a Deo 137
ad successoribus tuis 135
in Italia 132
post tradita 123
In nomine et Patris Filii Spiritus sancti 120
in nomine Dei 118
cum heredes meis 116
per cartulam hanc 106
per cartula 90
per nos 89
ad casa ipsa 88
per cartulam 87
a nos 86
in prefinito 85
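
All phrases in the list begin with a preposition. How they were extracted is not stated here; one possible approach, assuming a dependency treebank in which every token records its head, is to collect each preposition together with everything that depends on it and then count the resulting strings. The `tok` elements and their `id`, `head`, `form` and `pos` attributes below are hypothetical and do not reproduce the LLCT format; the verbs are invented context.

```xquery
xquery version "3.1";

(: A token followed by all tokens that depend on it, directly or indirectly. :)
declare function local:subtree($t as element(tok)) as element(tok)* {
  $t,
  for $c in $t/../tok[@head = $t/@id]
  return local:subtree($c)
};

(: Hypothetical three-sentence sample. :)
let $corpus :=
  <corpus>
    <s>
      <tok id="1" head="0" form="confirmo" pos="verb"/>
      <tok id="2" head="1" form="per" pos="prep"/>
      <tok id="3" head="2" form="cartulam" pos="noun"/>
      <tok id="4" head="3" form="hanc" pos="pron"/>
    </s>
    <s>
      <tok id="1" head="4" form="per" pos="prep"/>
      <tok id="2" head="1" form="cartulam" pos="noun"/>
      <tok id="3" head="2" form="hanc" pos="pron"/>
      <tok id="4" head="0" form="dedi" pos="verb"/>
    </s>
    <s>
      <tok id="1" head="3" form="inter" pos="prep"/>
      <tok id="2" head="1" form="nos" pos="pron"/>
      <tok id="3" head="0" form="convenit" pos="verb"/>
    </s>
  </corpus>
for $p in $corpus/s/tok[@pos = "prep"]
(: Put the phrase back into sentence order before joining the forms. :)
let $phrase := string-join(
  (for $t in local:subtree($p)
   order by xs:integer($t/@id)
   return string($t/@form)),
  " ")
group by $phrase
order by count($p) descending, $phrase
return element phrase { $phrase, count($p) }
```

On this sample the result is `<phrase>per cartulam hanc 2</phrase>` followed by `<phrase>inter nos 1</phrase>`.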
  

An algorithm for producing exercises

Multiple choice - fill in the gaps

Add the word missing from the phrase

  1. Find the phrase
  2. Mark the word to be added
  3. Find distractors (different from the word)
  4. Fill in the XML question template
  5. Import questions into Moodle
  6. Select questions for an exercise

Produce a lot of exercises quickly!
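
The steps above can be sketched in XQuery roughly as follows. This is an illustration, not the code actually used: `local:question` is a hypothetical helper, the distractors are invented, and the element names follow the Moodle XML import format as commonly documented, so they should be checked against your Moodle version.

```xquery
xquery version "3.1";

(: Hypothetical helper: turns one phrase into a Moodle XML
   multiple-choice question with one word blanked out. :)
declare function local:question(
  $phrase as xs:string,
  $missing as xs:string,
  $distractors as xs:string*
) as element(question) {
  (: Steps 1-2: take the phrase and blank out the word to be added. :)
  let $gapped := string-join(
    (for $w in tokenize($phrase, ' ')
     return if ($w eq $missing) then '_____' else $w),
    ' ')
  return
    <question type="multichoice">
      <name><text>{ $gapped }</text></name>
      <questiontext format="html">
        <text>Add the word missing from the phrase: { $gapped }</text>
      </questiontext>
      <single>true</single>
      <shuffleanswers>1</shuffleanswers>
      <answer fraction="100"><text>{ $missing }</text></answer>
      {
        (: Step 3: keep only distractors that differ from the word. :)
        for $d in distinct-values($distractors)[. ne $missing]
        return <answer fraction="0"><text>{ $d }</text></answer>
      }
    </question>
};

(: Step 4: fill the template. The resulting file is what steps 5-6
   import into Moodle and select from. :)
<quiz>{
  local:question('per annos singulos', 'annos', ('annis', 'anni', 'annos'))
}</quiz>
```

Calling the helper once per phrase in a list such as the one shown earlier would yield one question per phrase; the teacher's work then shifts from writing questions to selecting them.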

Conclusions

Erkenntnis des Erkannten ("knowledge of what is known", August Boeckh's definition of philology)

Annotations embody our knowledge about the text.

Exercises bring this knowledge closer to the learners.

Not everyone has the time or the patience to produce many, many (many...) exercises by hand.

Many people have the skill to select exercises from the set of all exercises.

Share your treebank with your class!

(Whether they know it or not.)