From doc to db
A set of recipes to turn your (Microsoft Word or OpenOffice) document into an XML database.
From doc to XML
There are many possible routes, and your mileage may vary. You may also prefer to use OxGarage.
Inspiration for this recipe comes from XSL stylesheets for TEI XML.
What do we need:
- a text in Microsoft Word, OpenOffice, or LibreOffice format
- oXygen XML editor
What do we want:
- transform a transcription of a text into a basic TEI XML document, which can be enriched with more encoding and metadata later
Steps:
- Download and install oXygen XML Editor (if you don't have it already)
- From oXygen, open the .docx or .odt document you want to convert to TEI
- In the directory tree on the left, select any file with the .xml extension (Content_Types in the screenshot). Double-click it to open it
- Click on the button on the right that shows a red triangle in a white circle and is labelled "Apply Transformation Scenario(s)"
- In the OOXML section, select the DOCX TEI P5 stylesheet. Then click on "Apply Associated"
- Various messages flash by in the lower pane, eventually ending with a "Build successful" message
- If you want to make the TEI more readable, place the cursor behind the
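Why does a word-processor file show a directory tree at all? Both .docx and .odt files are ZIP archives that hold several XML files, and that is the tree oXygen displays. The following Python sketch is not part of the recipe; it only illustrates the point by listing the XML parts of such an archive with the standard library. The names `list_xml_parts` and `build_demo_docx` are hypothetical, and the tiny in-memory archive is a stand-in, not a real Word document.

```python
import io
import zipfile


def list_xml_parts(archive):
    """Return the sorted names of all .xml members of a .docx/.odt archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return sorted(name for name in zf.namelist() if name.endswith(".xml"))


def build_demo_docx():
    """Build a tiny in-memory stand-in for a .docx file (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("word/document.xml", "<document/>")
        zf.writestr("word/media/image1.png", b"not really a picture")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name in list_xml_parts(build_demo_docx()):
        print(name)
```

With a real document you would pass its path instead, for example `list_xml_parts("my-transcription.docx")` (a made-up file name).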
From XML to an XML database
What do we need:
- one or more (valid) XML documents (e.g. those produced by the transformations above)
- The BaseX XML database, freely available at the basex.org site
What do we want:
- create an XML database from our TEI XML documents, so that we can use XQuery queries and transformations to analyse the XML
Steps:
Installing BaseX
- Download the BaseX database (preferably as a .zip archive) from the BaseX site
- Unpack the .zip archive anywhere (I like to put it immediately in my Home directory)
- Click on BaseX.jar, or open it with Java 7 ("Oracle Java 7 Runtime")
- If all is well, you will see the message that BaseX is launching
- And eventually, the database interface ("GUI") will open
Creating a database from XML
- In BaseX GUI, select Database / New (or Ctrl-N)
- Select the file to be imported into the database; if you want to import several files, put them all in a directory and then select that directory
- Name the database and play with the parameters in the other tabs (Parsing, Indexes, Full-Text...); don't worry, you won't break anything
- Finally, press OK
- The database is being created...
- And voilà!
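Before importing a whole directory, it can save time to find out which files would fail to parse. Here is a minimal sketch in Python, assuming all files sit directly in one directory and end in .xml. It checks only well-formedness, which is a weaker condition than validity against a TEI schema. The function names `find_malformed` and `demo` are hypothetical, and the two sample files are invented for the illustration.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(directory):
    """Return the sorted names of .xml files in `directory` that fail to parse."""
    malformed = []
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError:
            # The parser rejected the file, so it is not well-formed XML.
            malformed.append(path.name)
    return malformed


def demo():
    """Write one good and one broken file to a temporary directory and check them."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "good.xml").write_text("<TEI><text/></TEI>", encoding="utf-8")
        # The <text> element is never closed, so this file is not well-formed.
        Path(tmp, "broken.xml").write_text("<TEI><text></TEI>", encoding="utf-8")
        return find_malformed(tmp)


if __name__ == "__main__":
    print(demo())
```

An empty list from `find_malformed` means every file parsed; any names it returns are worth opening in oXygen to see where the markup breaks.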
Querying an XML database with BaseX and XQuery
Now things get interesting. BaseX, like other XML databases (such as eXist DB), uses a query language called XQuery to get data out of collections of XML documents. XQuery uses XPath expressions to address different parts ("nodes") of XML documents. Beyond that, there are also functions for working with the database itself. A bit of reading is required here.
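To give a feeling for how a path addresses nodes, here is a small sketch in Python rather than in BaseX. Python's standard library supports only a limited subset of XPath, so this is an illustration of the idea and not a substitute for XQuery. The sample document, the `tei` prefix binding, and the function name `chapter_summary` are assumptions made up for the example; the namespace URI is the standard TEI one.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used in the path expressions below.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A made-up miniature TEI document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="chapter"><head>One</head><p>First paragraph.</p></div>
      <div type="chapter"><head>Two</head><p>Second paragraph.</p><p>Third paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def chapter_summary(xml_text):
    """Return (heading, paragraph count) for every chapter div in the document."""
    root = ET.fromstring(xml_text)
    summary = []
    # ".//tei:div[@type='chapter']" selects every div, at any depth,
    # whose type attribute equals "chapter".
    for div in root.findall(".//tei:div[@type='chapter']", TEI_NS):
        head = div.findtext("tei:head", default="", namespaces=TEI_NS)
        paragraphs = div.findall("tei:p", TEI_NS)  # direct child paragraphs only
        summary.append((head, len(paragraphs)))
    return summary


if __name__ == "__main__":
    for head, count in chapter_summary(SAMPLE):
        print(head, count)
```

The same kind of path, with the namespace declared in the way XQuery requires, is what you would type into the BaseX query editor; the resources below explain the full syntax.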
Useful resources:
- The BaseX wiki - you may start at Tutorials; I even wrote some of the notes there myself
- The XQuery Wikibook is worth studying; I certainly did so
- The XPath Tutorial at Zvon.org is helpful for understanding how to access different locations in the XML document
- A battered copy of XSLT Quickly by Bob DuCharme (Manning) taught me the key concepts of XML / XSLT / XPath
- Traces of my learning are in my links and bibliography on XML, XQuery, XPath, TEI XML (BibSonomy)
Using other people's XQueries
The most promising aspect of XQuery, however, is that queries (searches) can be written down and exchanged: you can run my queries on your documents, and so on. A large, sparsely documented repository of queries used for CroALa and related projects is the croalatransform repository on Bitbucket.