Individual queries with the revised software and data have been sped up by between 10 and 100 times, with an overall run time improvement of the CLAROS query test suite of about 20 times. I think this is a useful result. Some further performance gains may be possible, but I suspect we are nearing the point of diminishing returns.
More detailed information about the performance improvements can be found at http://code.google.com/p/milarq/wiki/ClarosServerPerformanceNotes.
Essentially, 4 techniques have been used:
- reordering of queries so that more selective elements are evaluated earlier (this can also be performed automatically by the ARQ query processor in Jena).
- "materialization" of property paths and UNIONs in queries: adding "short cut" properties to the triple store, and using these properties in queries.
- customized indexes for finding the earliest and latest occurrences of a given object type, and also for providing consistent ordering in other keyword-based object access queries. These new indexes are not Lucene-based, as originally intended, because Lucene's handling of result sorting is less scalable than had been anticipated. Instead, a simple arrangement of flat files named by keyword, with contents sorted by the ordering key, is used.
- pre-calculation of object counts by various categories, so that counting queries can run without having to access every matching object.
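To illustrate the flat-file index idea, here is a minimal sketch in Python. It is not the MILARQ code: the function names, the `.idx` file extension, the tab-separated line layout and the example URIs are all assumptions made for illustration, and it assumes keywords are safe to use as file names and that the ordering key is an integer year.

```python
import tempfile
from pathlib import Path


def write_keyword_index(index_dir, entries):
    """Write one flat file per keyword, with lines sorted by ordering key.

    entries is an iterable of (keyword, sort_key, object_uri) tuples.
    The file naming and line layout are illustrative assumptions.
    """
    by_keyword = {}
    for keyword, sort_key, uri in entries:
        by_keyword.setdefault(keyword, []).append((sort_key, uri))

    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    for keyword, rows in by_keyword.items():
        rows.sort()  # sort once at build time, not at query time
        with open(index_dir / f"{keyword}.idx", "w", encoding="utf-8") as f:
            for sort_key, uri in rows:
                f.write(f"{sort_key}\t{uri}\n")


def read_keyword_index(index_dir, keyword, limit=None, latest=False):
    """Return (sort_key, uri) pairs for a keyword, earliest first.

    With latest=True the order is reversed, so limit=1 gives the
    latest occurrence instead of the earliest.
    """
    path = Path(index_dir) / f"{keyword}.idx"
    if not path.exists():
        return []
    with open(path, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("\t", 1) for line in f]
    if latest:
        rows.reverse()
    if limit is not None:
        rows = rows[:limit]
    return [(int(key), uri) for key, uri in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_keyword_index(tmp, [
            ("amphora", -450, "http://example.org/obj/2"),
            ("amphora", -520, "http://example.org/obj/1"),
        ])
        print(read_keyword_index(tmp, "amphora", limit=1))
        print(read_keyword_index(tmp, "amphora", limit=1, latest=True))
```

Because the sorting cost is paid once when the index is built, an "earliest" or "latest" query only has to read one end of the file for that keyword. This sketch reads the whole file for simplicity; a real implementation would presumably avoid that for large indexes.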
The revised query framework makes it easy to incorporate and configure new, specialized indexes to be accessed by SPARQL queries. This gives us a useful path to future inclusion of, e.g., spatial indexes, so we might pose questions like "Find all amphorae found within 50 km of Athens". Actually deploying such an index would be an additional development, and will not be implemented by the MILARQ project.
Some remaining problems have been exposed or rediscovered:
- some of the object counting queries over Arachne data are not accurate - some objects get double-counted - this is due to a problem in the counting queries, not the Arachne data
- some data problems have been noted, with inconsistent date formats and many instances of zero years (which are not valid for xsd:gYear)
- some of the date-ordered keyword indexes are quite cumbersome, and could usefully be slimmed down
- lack of numeric (sortable) date information in data from the Arachne database.
- use of strings that are not Unicode NFC - strictly, these are not valid RDF, though the software still works with them
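The date and Unicode problems noted above are the kind of thing that could be checked mechanically before loading data. The following Python sketch shows one possible approach; the helper names are hypothetical and not part of the project code. The year pattern is a simplified reading of the xsd:gYear lexical form, and it treats a zero year as invalid, as stated above.

```python
import re
import unicodedata

# Simplified xsd:gYear pattern (an assumption for illustration):
# optional minus sign, at least four digits, optional timezone.
GYEAR = re.compile(r"(-?)(\d{4,})(Z|[+-]\d{2}:\d{2})?")


def is_valid_gyear(text):
    """Return True if text looks like a gYear and the year is not zero."""
    match = GYEAR.fullmatch(text)
    if match is None:
        return False
    return int(match.group(2)) != 0


def is_nfc(text):
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def to_nfc(text):
    """Return text converted to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)
```

A check like this could be run over literal values during data export, either to report offending values or, in the NFC case, to normalize them on the way out.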
Remaining work to be done:
- address the problems noted above
- optionally: update data export to use new, stable URIs adopted for the Erlangen CIDOC-CRM implementation.
- generate and incorporate updated LIMC data, per previous CLAROS project discussions
- load new database and deploy query server on public host
- update CLAROS explorer to use new queries, test and ascertain query mix in actual workload.
I am hoping to have a final meeting with Epimorphics soon, to discuss final improvements to and tidying up of the code provided, bringing the project to a hopefully satisfactory conclusion.
