From doc to db

A set of recipes to turn your (Microsoft Word or OpenOffice) document into an XML database.

From doc to XML

There are many possible routes, and your mileage may vary. You may also prefer to use OxGarage.

Inspiration for this recipe comes from XSL stylesheets for TEI XML.

What do we need:

What do we want:

Steps:

  1. Download and install oXygen XML Editor (if you don't have it already)
  2. From oXygen, open the .docx or .odt document you want to convert to TEI
  3. In the directory tree on the left, select any document with an .xml extension (Content_Types in the screenshot) and double-click it to open it
  4. Click the red triangle in a white circle on the right, labelled "Apply Transformation Scenario(s)"
  5. In the OOXML section, select the DOCX TEI P5 stylesheet, then click "Apply Associated"
  6. Various messages flash in the lower pane, eventually ending in a "Build successful" message
  7. If you want to make the TEI more readable, place the cursor behind the tag (as shown) and from the Document menu select Source / Format and Indent (or press Ctrl + Shift + P)
  8. You now have a basic (and valid, i.e. conformant to the TEI schema) TEI XML document, to which you should add metadata (title, author, editor...) and in which you should change the textual encoding as necessary
  9. If you want to check that the document is valid, click the sheet with a checkmark to the left of the "Apply Transformation" triangle, or go to the Document / Validate / Validate menu (or press Ctrl + Shift + V)
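
Why does a .docx open as a directory tree in the first place? Because a .docx (like an .odt) is a ZIP archive with XML files inside. The minimal Python sketch below lists the XML parts of such a package. It is not part of the oXygen recipe; the function names and the tiny demo package are hypothetical stand-ins, so point `list_xml_parts` at the bytes of your own file to see the real contents.

```python
import io
import zipfile


def list_xml_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted names of the .xml members inside a ZIP-based document."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(name for name in package.namelist() if name.endswith(".xml"))


def build_demo_package() -> bytes:
    """Build a tiny stand-in for a .docx (hypothetical content, not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/media/image1.png", b"not really an image")
    return buffer.getvalue()


if __name__ == "__main__":
    for part in list_xml_parts(build_demo_package()):
        print(part)
```

For a real document, something like `list_xml_parts(open("my.docx", "rb").read())` should show the same kind of listing that oXygen displays in its directory tree.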

From XML to an XML database

What do we need:

What do we want:

Steps:

Installing BaseX

  1. Download the BaseX database (preferably as a .zip archive) from the BaseX site
  2. Unpack the .zip archive anywhere (I like to put it immediately in my Home directory)
  3. Click on BaseX.jar, or open it with Java 7 ("Oracle Java 7 Runtime")
  4. If all is well, you will see the message that BaseX is launching
  5. And eventually, the database interface ("GUI") will open

Creating a database from XML

  1. In BaseX GUI, select Database / New (or Ctrl-N)
  2. Find the file or files you want to add, name the database, play with parameters in other tabs (Parsing, Indexes, Full-Text...); don't worry, you won't break anything
  3. Select a file to be imported into the database; if you want to import several files, put them all in a directory and then select that directory; finally, press OK
  4. The database is being created...
  5. And voilà!
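
If you import a whole directory, it can be handy to check beforehand that every file in it is well-formed XML. The following is only a rough sketch of such a check in Python, assuming all files sit directly in one directory and end in .xml; the function names and the two demo files are hypothetical. It tests well-formedness only, not TEI validity.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(directory: Path) -> list[str]:
    """Return the names of .xml files in `directory` that fail to parse."""
    malformed = []
    for path in sorted(directory.glob("*.xml")):
        try:
            ET.parse(path)
        except ET.ParseError:
            malformed.append(path.name)
    return malformed


def demo() -> list[str]:
    """Run the check on two hypothetical files in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "good.xml").write_text("<TEI><text/></TEI>", encoding="utf-8")
        # The <text> element is never closed, so this file is not well-formed.
        (folder / "broken.xml").write_text("<TEI><text></TEI>", encoding="utf-8")
        return find_malformed(folder)


if __name__ == "__main__":
    print(demo())
```

To use it on your own files, call `find_malformed(Path("path/to/your/directory"))` and fix whatever it reports before creating the database.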

Querying an XML database with BaseX and XQuery

Now things get interesting. BaseX, like other XML databases (such as eXist DB), uses a query language called XQuery to get data out of collections of XML documents. XQuery uses XPath expressions to address different parts ("nodes") of XML documents. There are also some other database routines. A bit of reading is required here.
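
To get a feel for how a path picks out nodes, here is a small sketch that runs two path expressions over a TEI-like fragment. Note the assumptions: it uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath and is not XQuery, and the sample document is a made-up, minimal fragment (not a complete valid TEI file). The only thing taken from TEI itself is the namespace, `http://www.tei-c.org/ns/1.0`.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping; the "tei" prefix is our own choice.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, minimal TEI-like fragment used only for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample title</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def title_and_paragraphs(xml_text: str) -> tuple[str, list[str]]:
    """Return the title and the text of each paragraph in the body."""
    root = ET.fromstring(xml_text)
    # Any <title> that is a child of <titleStmt>, at any depth.
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    # Every <p> that is a direct child of <body>.
    paragraphs = [p.text or "" for p in root.findall(".//tei:body/tei:p", TEI_NS)]
    return title, paragraphs


if __name__ == "__main__":
    print(title_and_paragraphs(SAMPLE))
```

The same idea carries over to XQuery, where paths such as `//tei:body/tei:p` are written directly in the query once the `tei` prefix has been declared.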

Useful resources:

Using other people's XQueries

The most promising aspect of XQuery, however, is the ability to write down queries (searches) and exchange them: you can use my queries on your documents, and so on. A large, sparsely documented repository of queries used for CroALa and related projects is croalatransform on Bitbucket.