Tuesday, 2 November 2010

MILARQ project final progress report

An initial draft of the project's final progress report has been posted at http://code.google.com/p/milarq/wiki/MILARQ_final_progress_post.

Details may be updated as needed over the coming few days, but the broad picture presented represents the project's state at the end of its funded activity.

Thursday, 30 September 2010

MILARQ technical review and planning meeting

This meeting, held on 30 September, was arranged to provide a possibly final review of progress and to identify the remaining steps for wrapping up the project. Progress over the preceding two months was very satisfactory, and remaining technical issues are mainly matters of detail.

The intent from our previous meeting was that we would have made substantial progress on deploying MILARQ and other performance improvements in a query server over the CLAROS data.  This has substantially been achieved.

More detailed notes from the meeting are at http://code.google.com/p/milarq/wiki/20100930_Meeting_Project_Progress

Tuesday, 28 September 2010

MILARQ progress update

I have now installed a copy of the MILARQ revised software and used this to construct a new query server over the CLAROS data.  This required scripting some additional preprocessing of the data while loading the triple store and new indexes. The query language used has been updated to SPARQL 1.1.

Individual queries with the revised software and data have been sped up by between 10 and 100 times, with an overall run time improvement of the CLAROS query test suite of about 20 times.  I think this is a useful result.  Some more performance gains may be possible, but I also suspect we are nearing a point of diminishing returns.

More detailed information about the performance improvements can be found at http://code.google.com/p/milarq/wiki/ClarosServerPerformanceNotes.

Essentially, 4 techniques have been used:
  1. reordering of queries so that more selective elements are evaluated earlier (this can also be performed automatically by the ARQ query processor in Jena).
  2. "materialization" of property paths and UNIONS in queries - adding "short cut" properties to the triple store, and using these properties in queries.
  3. customized indexes for finding earliest and latest occurrences of a given object type, and also for providing consistent ordering in other keyword-based object access queries.  These new indexes are not Lucene-based, as originally intended, because Lucene's handling of result sorting is less scalable than had been anticipated.  Instead, a simple arrangement of flat files named by keywords, with contents sorted by the ordering key, is used.
  4. pre-calculation of object counts by various categories, so that counting queries can run without having to access every matching object.
While the creation of additional indexes has had an important role to play in improving performance, I have been slightly surprised at the extent of gains that have been realized using techniques based on existing triple-store capabilities.  Of the 4 techniques discussed above, only number (3) depends on the new MILARQ indexes.
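
The flat-file arrangement described in technique (3) can be illustrated with a minimal sketch. This is not the MILARQ code: the ".idx" file suffix, the tab-separated line layout, the integer year as sort key and the function names build_index and lookup are all assumptions made for illustration, and keywords are assumed to be safe to use as file names.

```python
import os
import tempfile


def build_index(index_dir, records):
    """Write one flat file per keyword, with lines sorted by the ordering key.

    records: iterable of (keyword, sort_key, subject_id) tuples.
    The file layout here is hypothetical, not the MILARQ format.
    """
    by_keyword = {}
    for keyword, sort_key, subject_id in records:
        by_keyword.setdefault(keyword, []).append((sort_key, subject_id))
    for keyword, entries in by_keyword.items():
        entries.sort()  # sort once, at index-build time
        path = os.path.join(index_dir, keyword + ".idx")
        with open(path, "w", encoding="utf-8") as f:
            for sort_key, subject_id in entries:
                f.write(f"{sort_key}\t{subject_id}\n")


def lookup(index_dir, keyword, limit=None, latest=False):
    """Return (sort_key, subject_id) pairs for a keyword, already in key order.

    No sorting happens at query time; the file is simply read
    (and reversed when the latest entries are wanted).
    """
    path = os.path.join(index_dir, keyword + ".idx")
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("\t") for line in f]
    pairs = [(int(key), subject) for key, subject in rows]
    if latest:
        pairs.reverse()
    return pairs if limit is None else pairs[:limit]


# Example data: negative years stand for dates BC (an assumption of this sketch).
with tempfile.TemporaryDirectory() as index_dir:
    build_index(index_dir, [
        ("amphora", -480, "obj:3"),
        ("amphora", -520, "obj:1"),
        ("krater", -450, "obj:2"),
    ])
    earliest = lookup(index_dir, "amphora", limit=1)
    latest = lookup(index_dir, "amphora", limit=1, latest=True)
```

The point of the design is that the ordering cost is paid when the index is built, so an "earliest" or "latest" query only needs to read from one end of a single file.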

The revised query framework makes it easy to incorporate and configure new, specialized indexes to be accessed by SPARQL queries.  This gives us a useful path to future inclusion of, e.g., spatial indexes so we might pose questions like "Find all amphorae found within 50 km of Athens". Actually deploying such an index would be an additional development, and will not be implemented by the MILARQ project.

Some remaining problems have been exposed or rediscovered:
  • some of the object counting queries over Arachne data are not accurate - some objects get double-counted - this is due to a problem in the counting queries, not the Arachne data
  • some data problems have been noted, with inconsistent date formats and many instances of zero years (which are not valid for xsd:gYear)
  • some of the date-ordered keyword indexes are quite cumbersome, and could usefully be slimmed down
  • lack of numeric (sortable) date information in data from the Arachne database.
  • use of strings that are not Unicode NFC - strictly, these are not valid RDF, though the software still works with them
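
Two of the data problems listed above (zero years and non-NFC strings) lend themselves to a simple automated check. The following is a sketch only, not part of the MILARQ software: the function names are hypothetical, and the year check follows my reading of the XML Schema 1.0 lexical rules for xsd:gYear (at least four digits, optional leading minus sign, optional timezone, no year zero), with the timezone pattern simplified.

```python
import re
import unicodedata

# Simplified xsd:gYear lexical form: optional '-', four or more digits,
# optional timezone ('Z' or +hh:mm / -hh:mm). Timezone ranges are not checked.
GYEAR_RE = re.compile(r"-?(\d{4,})(Z|[+-]\d{2}:\d{2})?")


def is_valid_gyear(text):
    """Return True if text looks like a valid XSD 1.0 gYear (year zero excluded)."""
    match = GYEAR_RE.fullmatch(text)
    if match is None:
        return False
    digits = match.group(1)
    # Leading zeros are only allowed as padding up to four digits.
    if len(digits) > 4 and digits[0] == "0":
        return False
    return int(digits) != 0


def is_nfc(text):
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


# Example values (invented for illustration, not taken from the CLAROS data).
year_checks = {value: is_valid_gyear(value) for value in ["2010", "-0480", "0000", "480"]}
composed_ok = is_nfc("caf\u00e9")      # precomposed e-acute
decomposed_ok = is_nfc("cafe\u0301")   # 'e' followed by a combining acute accent
```

A check like this could be run over literal values during the preprocessing step, so that problem records are reported before they reach the triple store.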
Remaining work for deployment in CLAROS:
  • address the problems noted above
  • optionally: update data export to use new, stable URIs adopted for the Erlangen CIDOC-CRM implementation.
  • generate and incorporate updated LIMC data, per previous CLAROS project discussions
  • load new database and deploy query server on public host
  • update CLAROS explorer to use new queries, test and ascertain query mix in actual workload.
Further notes about outstanding issues are at http://code.google.com/p/milarq/wiki/OutstandingIssues, but not all of these are in scope for being addressed by the MILARQ project.

I am hoping to have a final meeting with Epimorphics soon, to discuss final improvements to and tidying up of the code provided, bringing the project to a hopefully satisfactory conclusion.

Tuesday, 17 August 2010

MILARQ technical review and planning meeting

This meeting, held on 30 July just before GK left for 2 weeks vacation, was arranged primarily to provide technical coordination between Epimorphics and Oxford. Progress over the preceding period had been distracted by a number of issues, and synchronization of activities and goals had become somewhat de-focused.

The original intent from our previous meeting was that by the time of this meeting, we would have software and supporting structures in place to allow GK to start work on trialling the accelerated query service within the CLAROS query service and test suite. In the event, though much of the technical work underpinning accelerated queries had been completed, the goal of having components ready for deployment in the CLAROS query service had not been achieved.

The main outcomes were:
  • The revised query framework was reviewed. Accelerated queries are provided by using the keyword to access a sorted file of corresponding subject identifiers and sort key values.
  • A number of technical problems were uncovered.
  • Steps to be completed by the time of GK's return from vacation were agreed, with the aim of allowing work to start on a CLAROS service deployment using the new structures.

Saturday, 12 June 2010

At last, test cases pass with new Jena deployment

After what seems like weeks of struggle (a few days, actually), I have managed to replicate a copy of the CLAROS demo server that passes all test cases using newer Jena libraries.

My problems, it seems, were down to a failure of dataset configuration management.  I really should have known better, but sometimes it seems the fundamental lessons need reinforcing.  Ho hum.

Anyway, now I can look to loading up the latest data from LIMC, which promises to add some interesting capabilities.  And I will maintain (compressed) copies of the test dataset in subversion!

Friday, 11 June 2010

JISC project working with outside contractors

I've been reflecting on the ramifications for the MILARQ project of working with outside contractors.

While the initial motivation and justification for working with Epimorphics was for access to their technical expertise concerning Jena, I've noticed that working with an experienced external team is providing valuable views and insights into how to run this kind of project, which I'd like to think can be propagated to other JISC projects over time.

MILARQ technical review and planning meeting

This meeting was arranged at relatively short notice (i.e. unplanned) as we were facing some technical questions which we thought would benefit from face-to-face contact. But we also took the opportunity to treat the meeting as a mini sprint review.

The technical issues concerned (a) some identified optimizations which, while effective on a limited test query, we felt might not be effective across the range of queries CLAROS needed to perform, and (b) difficulties in replicating the updated server environment at Oxford.

Important outcomes are:
  • The hypothesis that specialized indexes can resolve query performance problems is supported by some concrete evidence
  • Substantial performance gains, to the extent that sub-second query performance may be achievable on current data, can be realized by materializing query results and appropriate arrangement of the queries used, but these techniques will probably not scale well to even larger data volumes (scaling is probably linear). Isolating values from about 66,000 object records takes over 0.5s
  • More care is needed to ensure consistency of source data used for development and testing purposes

Tuesday, 8 June 2010

Sprint 3 plan

I prepared this a couple of weeks ago, but forgot to announce it: http://code.google.com/p/milarq/wiki/SprintPlan_3.

Following the query performance scoping experiments, the general plan for sprint 3 is to analyze existing software and plan ways to include additional indexing information, and for Oxford to replicate Epimorphics' running version of the Claros query service using a more recent version of Jena.

Thursday, 27 May 2010

Progress review with Epimorphics

Our first progress review meeting was held a little later than intended, due to a combination of technical problems with the development and test environment inhibiting progress, and scheduling conflicts.
Notes from the meeting are at http://code.google.com/p/milarq/wiki/20100526_Meeting_Project_Progress.

The scoping experiments generally confirmed the premise on which the project is based, namely that it is the use of ordering in complex queries that gives rise to the most serious query performance problems. Additionally, some unanticipated (though, in hindsight, unsurprising) results were also noted:
  • Even without the cost of sorting, some of the original queries do not meet the sub-second query execution criterion. Engineering solutions have been tested for these cases.
  • Queries involving "joins" impose a significantly increased cost. That is, queries that involve discovering chains of triples, rather than combinations of triples with a common subject, impose a significant query performance penalty.
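
The difference between a chain of triples and triples with a common subject can be shown with a toy in-memory example. This is an illustrative sketch only: the property names (hasFindEvent, tookPlaceAt, foundAt) and identifiers are invented and are not the CIDOC-CRM terms used in CLAROS, and a real triple store would use indexes rather than a scan. It also shows how adding a "short cut" property, the materialization described elsewhere in these notes, turns a two-hop chain into a single lookup on the subject.

```python
# Toy data: each object is linked to a place only via an intermediate event.
triples = {
    ("obj:1", "rdf:type", "Amphora"),
    ("obj:1", "hasFindEvent", "event:1"),
    ("event:1", "tookPlaceAt", "place:athens"),
    ("obj:2", "rdf:type", "Amphora"),
    ("obj:2", "hasFindEvent", "event:2"),
    ("event:2", "tookPlaceAt", "place:corinth"),
}


def objects(store, subject, predicate):
    """Return all values o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in store if s == subject and p == predicate}


def find_places_chain(store, subject):
    """Chain query: subject -> event -> place, which needs a join across two hops."""
    places = set()
    for event in objects(store, subject, "hasFindEvent"):
        places |= objects(store, event, "tookPlaceAt")
    return places


def materialize_shortcut(store):
    """Return a new store with a direct 'foundAt' triple for each two-hop chain."""
    added = set()
    for s, p, o in store:
        if p == "hasFindEvent":
            for place in objects(store, o, "tookPlaceAt"):
                added.add((s, "foundAt", place))
    return store | added


chain_result = find_places_chain(triples, "obj:1")
shortcut_store = materialize_shortcut(triples)
# After materialization the place is a property of the object itself.
direct_result = objects(shortcut_store, "obj:1", "foundAt")
```

Both approaches return the same answer; the shortcut version pays the join cost once, at load time, instead of on every query.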
The plan for the next month will focus on determining how to provide additional indexing to accelerate queries that depend on ordering information:
  • as a priority, examine ARQ processing with a view to understanding how index ordering info can be used
  • look at current external index mechanisms used by ARQ (currently Lucene plus property function hooks).
From this, we hope for a well-understood and easily implemented plan for software enhancements to deliver the required query performance.

Friday, 23 April 2010

Sprint 2 plan

Following the Kick-off meeting with Epimorphics, I've updated the first sprint plan (http://code.google.com/p/milarq/wiki/SprintPlan_1) and prepared the second one (http://code.google.com/p/milarq/wiki/SprintPlan_2).

There was no formal review of the first sprint, but there wasn't really much to review. And the Epimorphics meeting really was the planning session for the second sprint. The next such meeting will probably combine sprint review and planning.

The main work of sprint 2 will be to perform some scoping experiments for SPARQL queries against the CLAROS database, to get some early indications of what kinds of performance improvement might be expected through different query handling strategies.

Monday, 19 April 2010

Kick-off meeting with Epimorphics

A two-day project kick-off meeting with Epimorphics was held last week (15-16 April), which I think went very well.

All intended management and technical goals for the meeting were achieved, with goals confirmed, roles clarified, a technical plan in place and a framework for evaluation in place.

The main project success criterion is that we can improve query performance sufficiently to take CLAROS public.  Roughly, this means achieving sub-second response times for all queries.

During the project, Epimorphics will undertake exploration and modification of the query mechanisms, while I will support the test and evaluation elements by applying any agreed query redesigns to the test suite, and evaluating revised query handling mechanisms in the full CLAROS system.

In the first month, Epimorphics will perform a series of scoping experiments, to explore the unknown elements in the project plan, and "de-risk" the project.  After this, we will hold a second meeting to decide concrete steps for further development, by which time the actual steps that need to be performed should be much clearer.

During the two days, we installed the CLAROS query software and test suite from its various subversion repositories and loaded up a copy of the data currently running on the CLAROS demonstration server at OeRC.  The resulting system passed all test cases in the test suite, which represent instances of all the queries used by the CLAROS Explorer VRE application.

All-in-all, I felt this was a positive and successful meeting.

Tuesday, 23 March 2010

Project plan, infrastructure setup and first sprint

The initial project planning and setup for MILARQ is now done, and I'm starting to look at technical issues (testing, evaluation, etc.) in preparation for a kick-off meeting with Epimorphics next month.

The formal project plan for the JISC is posted here: http://milarq.googlecode.com/hg/docs/MILARQ_Projectplan_VRERI_JISC.pdf.

The working project plan is posted here: http://code.google.com/p/milarq/wiki/ProjectPlanOutline_201003_201010.

The first sprint plan is posted here: http://code.google.com/p/milarq/wiki/SprintPlan_1.

Monday, 15 March 2010

MILARQ project startup

This blog is used for announcements relating to the MILARQ project.

More information about the project can be found at http://code.google.com/p/milarq/.