Portal Design

Design document for the portal to be used in the Fishnet project.

Issues

The main issue to be addressed is performance. The existing DiGIR portal applications are slow to respond to user queries, which severely limits the types of web applications that can be constructed. Poor performance in a DiGIR portal has two main causes:

  1. Query response time of each data provider
  2. Time to retrieve and collate the matching data

Bandwidth demand is not usually high, but it can become significant with large record sets. For example, a full Darwin Core record is typically about 3 KB, so 10,000 records amount to about 30 MB, which can be a significant network load, especially when multiple simultaneous queries are being serviced.
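The arithmetic behind that estimate can be sketched as follows. This is only a back-of-envelope calculation: the function names are illustrative, the 3 KB record size is the typical figure quoted above, and decimal units (1 MB = 1000 KB) are assumed.

```python
def transfer_size_mb(record_count, record_size_kb=3.0):
    """Approximate payload size in MB, assuming 1 MB = 1000 KB."""
    return record_count * record_size_kb / 1000.0


def aggregate_load_mb(record_count, simultaneous_queries, record_size_kb=3.0):
    """Total payload when several queries of the same size are served at once."""
    return transfer_size_mb(record_count, record_size_kb) * simultaneous_queries


if __name__ == "__main__":
    print(transfer_size_mb(10000))      # 30.0
    print(aggregate_load_mb(10000, 5))  # 150.0
```

Five simultaneous queries of that size would therefore move roughly 150 MB, which is where the load starts to matter.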

Another issue is the difficulty of integrating other data with the Darwin Core records. The result is a fairly insular system that does not interact well with other services, such as mash-ups or data feedback services (quality control, annotations, georeferencing), because there is no effective mechanism for referencing a single record. Single records can be referenced, but the URLs are very cumbersome and the response is slow.

Existing Portal Model

The existing infrastructure looks something like this:

Proposed Portal Model

A redesign of the information flow is required to address these concerns. A typical solution is to create a copy of the data in a database optimized for the task and located on a high-bandwidth network. However, this would require all contributors to upload their data to the warehouse, which is unlikely to be satisfactory.

Instead, a hybrid solution is being implemented for the Fishnet portal. Rather than being simply a portal for accessing the data providers, it becomes an important component of the overall infrastructure for sharing these data. The high-level schematic of such a system is indicated below:

The main features of this system are: the data store (4), a simple file-system-like service optimized to contain a very large number of small data objects, each identifiable by a unique URI; the index (5), an Apache Lucene based index that typically provides sub-second query response times; and the portal (6), which provides the web interface to the infrastructure and supports both a programmatic interface (REST + JSON) and a user interface.

The role of the data store is to provide a cache of Darwin Core (DwC) records that can be rapidly retrieved by the portal and indexer. It potentially contains hundreds of millions of DwC XML documents, each of which is still owned by the data source and can be deleted, updated or hidden from the portal (and so from other users). Existing DiGIR and TAPIR providers (1) are accessed using a Harvester that retrieves all the records of relevance to the portal (e.g. those of a particular taxon or geographic region). The Harvester operates on a regular basis to maintain synchronization between the data available on the provider and that available in the data store. If an existing data provider would prefer not to have their content harvested, then an adapter (1a) can be used to index the data and make it accessible to the portal. In this case, the original records are not available from the data store, but can be retrieved by connecting directly to the data provider.
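One way the Harvester's synchronization step could work is to compare a checksum of each provider record with the checksum held for the cached copy. The sketch below is an assumption for illustration, not the project's actual Harvester: the function names, the use of SHA-1, and the dictionary-based inputs are all hypothetical.

```python
import hashlib


def checksum(xml_text):
    """Fingerprint a record so changes can be detected without a full comparison."""
    return hashlib.sha1(xml_text.encode("utf-8")).hexdigest()


def plan_sync(provider_records, store_checksums):
    """Decide which record ids to create, update, or delete in the data store.

    provider_records: dict mapping record id -> current XML text on the provider
    store_checksums:  dict mapping record id -> checksum of the cached copy
    Returns three sorted lists of ids: (to_create, to_update, to_delete).
    """
    to_create = sorted(k for k in provider_records if k not in store_checksums)
    to_update = sorted(
        k for k, xml in provider_records.items()
        if k in store_checksums and checksum(xml) != store_checksums[k]
    )
    to_delete = sorted(k for k in store_checksums if k not in provider_records)
    return to_create, to_update, to_delete


if __name__ == "__main__":
    provider = {"a": "<r>1</r>", "b": "<r>2 changed</r>", "c": "<r>3</r>"}
    store = {
        "b": checksum("<r>2</r>"),
        "c": checksum("<r>3</r>"),
        "d": checksum("<r>4</r>"),
    }
    print(plan_sync(provider, store))  # (['a'], ['b'], ['d'])
```

Separating the planning step from the actual reads and writes keeps the comparison logic testable without access to a live provider.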

An alternative to operating a data provider is indicated by (2), where a "Record Builder" generates DwC representations of the collection database and uploads these records directly to the data store. The task of the Record Builder is to generate the DwC records and maintain their representation in the data store (create, update, delete as necessary). In this scenario, the data source does not need to operate a data provider, web server, and associated hardware. Instead, the Record Builder script is run periodically and pushes the content to the data store.
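The record generation step of a Record Builder might look something like the sketch below, which turns one database row into a small XML document using Darwin Core term names. The `record` wrapper element, the `build_record` function, and the row layout are assumptions made for illustration; the document schema actually expected by the data store is not specified here.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core terms; the prefix is registered for readable output.
DWC_NS = "http://rs.tdwg.org/dwc/terms/"
ET.register_namespace("dwc", DWC_NS)


def build_record(row):
    """Serialize one collection-database row (dict of term -> value) to XML text."""
    root = ET.Element("record")  # hypothetical wrapper element
    for term in sorted(row):
        value = row[term]
        if value is None or value == "":
            continue  # skip empty fields rather than emit empty elements
        child = ET.SubElement(root, "{%s}%s" % (DWC_NS, term))
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    row = {
        "institutionCode": "XYZ",
        "catalogNumber": 12345,
        "scientificName": "Micropterus salmoides",
        "county": "",
    }
    print(build_record(row))
```

The resulting text would then be pushed to the data store under the record's URI, with the create, update, or delete decision made separately.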

This infrastructure also supports the ability to index non-specimen based records as indicated in (3). In this case, a portal may want to provide access to a library of documents where DwC records (or portions thereof) can provide effective search terms for locating information of relevance to the role of the portal (e.g. the documents may be reports on invasive species, and the portal may want to provide the ability to locate specimens and documents related to a specific search term).

The portal (6) interacts primarily with the index (5) to support searches, and retrieves content from the data store only when original records are desired. Since all content is referenced by URIs, the index can also support search across multiple data stores (4a). A particular index may therefore be built to support very specific types of searches across multiple networks (e.g. biodiversity in Mississippi) or very broad searches across multiple networks (as in the case of systems such as GBIF and EoL).
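The division of labour between index and data store can be illustrated with a toy in-memory model. This is not the Lucene-based index described above: `ToyIndex`, `fetch`, and the example URIs are hypothetical stand-ins meant only to show that a search returns URIs, and that original records are fetched from whichever store holds them only when needed.

```python
from collections import defaultdict


class ToyIndex:
    """Minimal inverted index: token -> set of record URIs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, uri, text):
        for token in text.lower().split():
            self._postings[token].add(uri)

    def search(self, query):
        """Return sorted URIs of records containing every query token."""
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            hits &= self._postings.get(token, set())
        return sorted(hits)


def fetch(uri, stores):
    """Look up the original record in the first data store that holds the URI."""
    for store in stores:
        if uri in store:
            return store[uri]
    return None


if __name__ == "__main__":
    store_a = {"http://store-a.example/rec/1": "Micropterus salmoides Mississippi"}
    store_b = {"http://store-b.example/rec/9": "Micropterus dolomieu Tennessee"}
    index = ToyIndex()
    for store in (store_a, store_b):
        for uri, text in store.items():
            index.add(uri, text)
    for uri in index.search("micropterus mississippi"):
        print(uri, "->", fetch(uri, [store_a, store_b]))
```

Because the index deals only in URIs, adding a second data store requires no change to the search path.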

Since the portal supports a REST API for programmatic access to the data, it is also feasible for sophisticated desktop applications to be built, or for existing applications to be adapted to work with the search environment. Poor performance of the existing DiGIR networks has placed a significant practical limitation on this capability.
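A client for such a REST + JSON interface could be quite small. The sketch below is purely illustrative: the `/search` path, the `q`, `start` and `rows` parameters, and the `hits`/`uri` response fields are assumptions, not the portal's documented API.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url, query, start=0, rows=10):
    """Compose a search URL; the path and parameter names are hypothetical."""
    params = urlencode({"q": query, "start": start, "rows": rows})
    return "%s/search?%s" % (base_url.rstrip("/"), params)


def record_uris(response_text):
    """Extract record URIs from a JSON response of the assumed shape."""
    payload = json.loads(response_text)
    return [hit["uri"] for hit in payload.get("hits", [])]


if __name__ == "__main__":
    print(build_search_url("http://portal.example/api/", "Micropterus salmoides"))
    sample = '{"hits": [{"uri": "http://store.example/rec/1"}]}'
    print(record_uris(sample))
```

A desktop application would issue the request with any HTTP library and then retrieve individual records by URI as needed.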

Since individual records are easily referenced with a simple URL and can be retrieved very quickly, integration with other services (e.g. annotation, data quality evaluation, georeferencing services) is much improved. Since the data store supports the ability to write data back, approved services may also modify records held in the data store, and those modified records could be returned to the source collections as part of the synchronization process.

Implementation

The new portal design is implemented in the Python programming language using the Django web framework, running under the Apache web server. The indexer is a modified implementation of the Apache Solr indexer, which in turn is based on the Lucene search engine.

The portal API is documented at wiki:Draft/DataStoreAPI
