Specification of DwCStore Protocol Used by Fishweb

Status
DRAFT - actively edited
Protocol URI
http://purl.oclc.org/NET/WASABI/itemstore/

The DwCStore protocol (DWCS) is used by the fishweb data store to enable remote management of a collection of records, with a primary focus on Darwin Core records expressed in XML. Although the initial focus is on Darwin Core records, there should be no specific properties of the service that limit it to this type of data.

DWCS uses HTTP for transport.

DWCS response messages are JSON encoded objects.

The primary purpose of DWCS is to provide a REST interface to a collection of objects, and metadata describing those objects.

A DWCS server may support subsetting operations by querying an index of the content. Subsetting applies to the List and Retrieve operations.

The collection may contain any content, and basic metadata is recorded about the content as it is inserted or modified in the collection.

Object Metadata

Each object stored in the DWCS has the following associated metadata.

guid         Unique identifier
uid          User identifier
gid          Group identifier
permissions  Access flag
otype        Object type (URI)
mimetype     Mime type, the representation of the object
mjdcreated   MJD indicating when object was inserted into cache
mjdmodified  Last time the object was modified
mjdindexed   When the object was indexed (for collections that support indexing)
chash        A hash of the object for change detection
origin       Origin of the object, typically a URI
bytesize     Size in bytes of the object
nhits        Number of times object has been retrieved
content      The actual object

Operations

Operations are accessed by URLs. Operations and their respective URL patterns:

Operation    URL                                           Method  Response Type     About

Authentication
login        /slogin/                                      POST    JSON              Authenticate and receive token
logout       /slogout/                                     GET     JSON              Invalidates an authentication token

Introspection
collections  /collections/                                 GET     JSON              List available collections
types        /<collection>/types/                          GET     JSON              List types present in a collection (otype values)
indices      /indices/                                     GET     JSON              List available indices
fields       /<index>/fields/                              GET     JSON              List fields in an index
field        /<index>/fields/<field>/                      GET     JSON              Provide detail about a field of an index

Collection Interaction
create       /<collection>/                                POST    JSON              Create a new item
get          /<collection>/<id>/                           GET     object mime-type  Retrieve an item
info         /<collection>/<id>/meta/                      GET     JSON              Retrieve metadata about an item
update       /<collection>/<id>/                           PUT     JSON              Modify an existing item
delete       /<collection>/<id>/                           DELETE  JSON              Remove an item from a collection
list         /<collection>/[?i=<index>&q=<query>&...]      GET     JSON              List items in a collection
updated      /<collection>/updated/<start_mjd>/<end_mjd>/  GET     JSON              Retrieve all items from a collection updated within a time range
created      /<collection>/created/<start_mjd>/<end_mjd>/  GET     XML               Retrieve all items from a collection created within a time span
retrieve     /<collection>/retrieve/?q=<query>&            GET     JSON              Retrieve all items from a collection encapsulated within an XML document

Administrative
reindex      /<index>/_reindex/                            GET     JSON              Rebuilds the specified index. Requires administrative privileges

All JSON responses support the following parameters:

Key      Value
varname  For JSON responses, indicates that the response will include a variable assignment to that name.

The list and retrieve operations support several parameters:

Key    Value
i      Name of index to use. The default index is used if not specified.
q      Query to pass on to the index.
f      Facet to pass on to the index to restrict application of the query.
start  Starting page number of paged results.
limit  Number of entries per page.

The retrieve operation also supports these additional parameters:

Key  Value
doc  The root element of the resulting XML document. Default = 'doc'
ns   The namespace (1) of the root element. Default = None

(1) When specifying a namespace, prefix the root element with the namespace prefix. For example:

&doc=ns1:doc&ns=http%3A%2F%2Fwww.example.com%2Fsome%2Fnamespace

would specify a document like:

<ns1:doc xmlns:ns1="http://www.example.com/some/namespace">
  <record>...</record>
  ...
</ns1:doc>
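
The query string for a retrieve request can be assembled as in the following sketch. The parameter names (q, doc, ns, start, limit) come from the tables above; the function name retrieve_query and the query value "example" are hypothetical. Note that urlencode also percent-encodes the colon in the root element name, which is equivalent to the unencoded form shown above.

```python
from urllib.parse import urlencode

def retrieve_query(q, doc="doc", ns=None, start=None, limit=None):
    """Build the query string for a retrieve request (illustrative helper)."""
    params = {"q": q, "doc": doc}
    # Optional parameters are only sent when the caller supplies them.
    if ns is not None:
        params["ns"] = ns
    if start is not None:
        params["start"] = start
    if limit is not None:
        params["limit"] = limit
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return urlencode(params)

query = retrieve_query("example", doc="ns1:doc",
                       ns="http://www.example.com/some/namespace")
print(query)
```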

Times are represented as Modified Julian Dates (MJD). An MJD is a double precision floating point value giving the number of days since 1858-11-17 00:00:00.00. MJD values are always expressed in UTC.

A time period is specified by an upper and lower bound, with at least one boundary specified. In the updated and created operations, an underscore in place of an MJD value indicates no time boundary. So for example, the URL:

http://some.service/items/_created/_/54726.709/

would retrieve a list of all objects in the items collection that were created before MJD=54726.709 (about 2008-09-17 10:00AM PDT).
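
The MJD arithmetic and the underscore convention can be sketched as follows. This is a minimal illustration, not part of the protocol: the helper names are hypothetical, and the operation path segment is passed in by the caller rather than hard-coded.

```python
from datetime import datetime, timedelta, timezone

# MJD 0 corresponds to 1858-11-17 00:00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)

def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date to a timezone-aware UTC datetime."""
    return MJD_EPOCH + timedelta(days=mjd)

def datetime_to_mjd(moment):
    """Convert a timezone-aware datetime to a Modified Julian Date."""
    return (moment - MJD_EPOCH) / timedelta(days=1)

def time_range_path(collection, operation, start_mjd=None, end_mjd=None):
    """Build the path of a time-bounded request; '_' marks an open bound."""
    if start_mjd is None and end_mjd is None:
        raise ValueError("at least one time boundary is required")
    lower = "_" if start_mjd is None else repr(float(start_mjd))
    upper = "_" if end_mjd is None else repr(float(end_mjd))
    return f"/{collection}/{operation}/{lower}/{upper}/"

print(mjd_to_datetime(54726.709))
print(time_range_path("items", "created", end_mjd=54726.709))
```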

Responses are either raw or JSON encoded structures. Raw responses are used to return actual content of items in the collection. JSON encoded responses are used for all other actions. The general form of a JSON response is a dictionary with two required keys, data and errors:

response = {'data': {},
            'errors': [error_info],
           }

Additional dictionary entries may be present which may be processed or ignored by the client.

The data entry is always a dictionary containing information specific to the response.

The errors entry contains information about the failure of the operation, which may be fatal errors where no data is returned, warnings, or any other information about problems encountered generating the response. Each error_info entry in the errors list should be a two-element list [error_code, error_text], where error_code is a value from the list of collection operation errors and error_text provides additional human readable information.
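
A client might unpack this structure as in the sketch below. It assumes the body is strict JSON (double-quoted strings) and that each error entry is a two-element list; the function name split_response and the sample error code and text are invented for illustration.

```python
import json

def split_response(body):
    """Parse a JSON response body into (data, errors).

    Raises ValueError if either required key is missing.
    """
    response = json.loads(body)
    for key in ("data", "errors"):
        if key not in response:
            raise ValueError(f"response is missing required key {key!r}")
    # Each error entry is expected to be an [error_code, error_text] pair.
    errors = [(code, text) for code, text in response["errors"]]
    return response["data"], errors

# Hypothetical body; the error code and text are placeholders.
body = '{"data": {}, "errors": [[404, "Object does not exist"]]}'
data, errors = split_response(body)
print(data, errors)
```

Extra top-level keys are ignored here, in line with the rule that additional dictionary entries may be processed or ignored by the client.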

Login

Validates credentials and returns a token used in the HTTP Authorization header for other operations.

Method
POST
Target
http://<server>/slogin/
login_response <- login(user, password)
user
User identification
password
Password verifying identification
login_response
JSON encoded authorization token
response = {'data': {'app': '<application key>',
                     'sid': '<authorization token>',
                     'header': '<header text>',
                    },
            'errors': [error_info, ], 
           };

Errors:

  • HTTP 401: Invalid credentials
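
As a rough sketch, a login request could be prepared as shown below. Several details are assumptions rather than part of this specification: that the credentials are sent as form-encoded POST fields named user and password, and that the 'header' value of the response is sent verbatim as the HTTP Authorization header of later requests. The credentials are placeholders, and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_login_request(server, user, password):
    """Prepare (but do not send) a login request.

    Assumes form-encoded POST fields named 'user' and 'password'.
    """
    body = urlencode({"user": user, "password": password}).encode("utf-8")
    return Request(f"http://{server}/slogin/", data=body, method="POST")

def authorization_headers(login_response):
    """Return headers for later requests from a decoded login response.

    Assumes the 'header' value is usable as-is in the Authorization header.
    """
    return {"Authorization": login_response["data"]["header"]}

request = build_login_request("localhost:8000", "alice", "secret")
print(request.get_method(), request.full_url)
```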

Logout

Invalidates a token.

Method
GET
Target
http://<server>/slogout/
response = {'data': {'status':'True | False', },
            'errors': [error_info, ], 
           };

If response['data']['status'] is not True, then the token was not invalidated, most likely because the token itself was not valid.

Errors:

Collections

Provides a list of collections on the server.

Method
GET
Target
http://<server>/collections/
response = {'data': {'collections': [<collection_name>, ], },
            'errors': [error_info, ], 
           };

Errors:

Create

Creates a new item in the collection. If the connected user does not have write permission on the collection, then an HTTP 403 error is returned with a JSON encoded body.

The default permissions of the new object are specified by dumpster/src/itemstore/store.DEFAULT_PERMISSIONS (0322 # rwr-r-)

Method
POST
Target
http://<server>/<collection>/
create_response <- create(content, guid, type, origin=None, doindex=1)

POST parameters:

content (required)
The payload to be inserted into the collection. This is treated as an opaque object by both the client and server. Additional operations (pre/post processing) may be performed by either client or server, but such operations are not defined in this specification.
guid (required)
The globally unique identifier for the object. An error is returned if the specified GUID already exists within the collection.
type (required)
The object type of the supplied item. This is used to label items within a collection as being instances of a particular type of data. It is recommended that URIs are used to identify object types.
origin (optional)
A URL pointing to information about the origin of the item. If not specified, then the URL of the collection will be used.
doindex (optional)
Collections may have an indexer attached. This parameter provides a hint that the server may use to delay indexing of the item, which can be useful when a large volume of content is being uploaded. The server can choose to ignore the parameter. A server that does not support indexing must ignore this parameter.
create_response
A JSON encoded instance of a response object.
response = {'data': {'<guid>':'<Full URL pointing to object>', },
            'errors': [error_info, ], 
           };

Both 'data' and 'errors' will be present. On failure, data will be an empty dictionary, with additional information in the error array.

Errors:

  • HTTP 401: User is not authenticated and so should do so before submitting
  • HTTP 403: User is authenticated but not allowed to write to collection
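
The POST parameters of a create request could be encoded along these lines. This is only a sketch: the specification does not state the body encoding, so form encoding is an assumption, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

def build_create_body(content, guid, otype, origin=None, doindex=1):
    """Encode the POST parameters of a create request.

    Assumes form encoding; `otype` is sent as the 'type' parameter.
    """
    if not guid or not otype:
        raise ValueError("guid and type are required")
    params = {"content": content, "guid": guid, "type": otype,
              "doindex": doindex}
    # origin is optional; the server falls back to the collection URL.
    if origin is not None:
        params["origin"] = origin
    return urlencode(params).encode("utf-8")

def created_url(response, guid):
    """Return the URL of the new item, or None if data is empty (failure)."""
    return response["data"].get(guid)

body = build_create_body("<record/>", ".KU.KU Fish.1004", "DwC")
print(body)
```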

Get

Retrieves a single item from the collection. The response contains the actual content (bytes) of the item. Store-specific metadata about the item is retrieved using the info() operation. If the specified item does not exist in the collection, then an HTTP 404 error is returned. If the connected user does not have read permission on the item, then an HTTP 403 error is returned.

Method
GET
Target
http://<server>/<collection>/<guid>/
get_response <- get(guid)
guid
The GUID of the item to retrieve
get_response
The item bytes.

Errors:

  • HTTP 401: Authentication is required to view object. The object exists but does not allow anonymous read.

  • HTTP 403: Insufficient privileges to view object. The object exists and the user is identified, but the user is not allowed to view the object (not in group).

  • HTTP 404: Object does not exist.
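
One way a client might act on these status codes is sketched below. The returned labels and the function name are hypothetical, and treating any 2xx status as success is an assumption, since this specification only lists the error codes.

```python
def next_step_for_get(status):
    """Map the HTTP status of a get request to a client-side action label."""
    if status == 401:
        return "login"      # item exists but anonymous read is not allowed
    if status == 403:
        return "forbidden"  # user is identified but lacks read permission
    if status == 404:
        return "missing"    # no such item in the collection
    if 200 <= status < 300:
        return "ok"         # body holds the raw item bytes
    return "error"          # anything else is outside this specification

for code in (200, 401, 403, 404):
    print(code, next_step_for_get(code))
```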

Info

Retrieves collection-specific metadata about the item identified by GUID. The metadata provides attributes similar to those of a file system store and are described in getmeta_response. If the specified item does not exist in the store, then an HTTP 404 error is returned. If the connected user does not have read permission on the item, then an HTTP 403 error is returned.

Method
GET
Target
http://<server>/<collection>/<guid>/meta/
getmeta_response <- getmeta(guid, varname=None)
guid
The GUID of the item for which metadata is to be retrieved.
varname
If set, then the JSON response will be set to that variable name, otherwise just the structure is returned.
getmeta_response
Item metadata. Response types of RDF and JSON are supported and selected by the Accept: header sent by the client.

Errors:

  • HTTP 401: Authentication is required to view object. The object exists but does not allow anonymous read.

  • HTTP 403: Insufficient privileges to view object. The object exists and the user is identified, but the user is not allowed to view the object (not in group).

  • HTTP 404: Object does not exist.

JSON encoded response:

response = {'data': 
             {'<guid>': 
               {'type':'<DC type (otype) value of item>',
                'source':'<DC origin of item>',
                'creator': '<DC creator (uid) of item - use uri pointing to member collection>',
                'created':<DC UTC time item was first inserted into store>,
                'modified':<DC UTC time item was last updated>,
                'permissions':<integer permissions flag>,
                'bytesize':<integer size in bytes of content>,
                'hash': '<md5 hash of content>'
               },
             },
            'errors': [error_info],
           }

RDF encoded response:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:rcp="http://fishnet2.net/vocab/">
  <rdf:Description rdf:about="http://localhost:8000/items/.KU.KU%20Fish.1004/">
    <dc:identifier>.KU.KU%20Fish.1004/</dc:identifier>
    <dc:type rdf:resource="http://purl.oclc.org/NET/WASABI/darwincore/" />
    <dc:creator rdf:resource="http://localhost:8000/members/1/" />
    <dc:created>2008-06-01T10:42:11.0Z</dc:created>
    <dc:modified>2008-06-01T10:42:11.0Z</dc:modified>
    <rcp:bytesize>1234</rcp:bytesize>
    <rcp:permissions>210</rcp:permissions>
    <rcp:hash>33ee40b2e431dda458247005355bea14</rcp:hash>
  </rdf:Description>
</rdf:RDF>

(Note: investigate overhead requirements for adding a read count for each item).

Example:

$ curl "http://localhost:8000/items/.KU.KU%20Fish.1004/meta/?varname=test" \
 -H "Accept: application/json"

test={"origin": "http://digir.nhm.ku.edu:80/digir/DiGIR.php", 
      "bytesize": 2761, 
      "modified": 54643.9320763, 
      "chash": "33ee40b2e431dda458247005355bea14", 
      "otype": "DwC", 
      "created": 54643.9144452, 
      "guid": ".KU.KU Fish.1004", 
      "permissions": 210}
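
A response that uses varname is a variable assignment rather than bare JSON, so a non-browser client has to strip the prefix before decoding. The sketch below does this for the example above; the function name is hypothetical, and tolerating a trailing semicolon is an assumption.

```python
import json

def parse_varname_response(body, varname):
    """Strip a leading '<varname>=' assignment and decode the JSON payload."""
    prefix = varname + "="
    text = body.strip()
    if not text.startswith(prefix):
        raise ValueError(f"response does not start with {prefix!r}")
    # A trailing semicolon, if the server emits one, is dropped (assumption).
    return json.loads(text[len(prefix):].rstrip(";"))

body = """test={"origin": "http://digir.nhm.ku.edu:80/digir/DiGIR.php",
      "bytesize": 2761,
      "modified": 54643.9320763,
      "chash": "33ee40b2e431dda458247005355bea14",
      "otype": "DwC",
      "created": 54643.9144452,
      "guid": ".KU.KU Fish.1004",
      "permissions": 210}"""
meta = parse_varname_response(body, "test")
print(meta["guid"], meta["bytesize"])
```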

Update

Updates an existing item with new content. The otype or origin of an item cannot be changed without deleting the existing instance and creating a new one. An error (HTTP 404) is returned if the item with the specified GUID does not exist in the collection. If the connected user does not have write permission on the item, then an HTTP 403 error is returned.

Method
PUT
Target
http://<server>/<collection>/<guid>/
update_response <- update(guid, content)

guid
The GUID of the item for which content is to be updated.
content
The content of the item that will replace the existing content.
update_response
A JSON encoded response object.
Errors:

  • HTTP 401: Authentication is required to modify the object. The object exists but does not allow anonymous modification.

  • HTTP 403: Insufficient privileges to modify object. The object exists and the user is identified, but the user is not allowed to modify the object (not in group).

  • HTTP 404: Object does not exist.

response = {'data':
              {'<guid>': '<Full URL pointing to object>',
               'hash': '<md5 hash of object (used for change detection)>', 
              },
            'errors': [error_info, ], 
           };
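
Since the hash is described as an md5 hash of the content used for change detection, a client could skip redundant updates by comparing hashes first. The following is a minimal sketch with hypothetical helper names; it assumes the hash is the hexadecimal md5 digest of the raw content bytes.

```python
import hashlib

def content_hash(content):
    """Return the hexadecimal md5 digest of the raw content bytes."""
    return hashlib.md5(content).hexdigest()

def has_changed(content, stored_hash):
    """True if the local content differs from the hash held by the store."""
    return content_hash(content) != stored_hash.lower()

stored = content_hash(b"<record>old</record>")
print(has_changed(b"<record>old</record>", stored))
print(has_changed(b"<record>new</record>", stored))
```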

Delete

Removes an item and its associated metadata from the collection. If the item does not exist, then an HTTP 404 error is returned. If the connected user does not have write permission on the item, then an HTTP 403 error is returned.

Method
DELETE
Target
http://<server>/<collection>/<guid>/
delete_response <- delete(guid)

guid
The GUID of the item to be deleted.

delete_response
A JSON encoded response object.

Errors:

  • HTTP 401: Authentication is required to delete the object. The object exists but does not allow anonymous deletion.

  • HTTP 403: Insufficient privileges to delete object. The object exists and the user is identified, but the user is not allowed to delete the object (not in group).

  • HTTP 404: Object does not exist.

response = {'data':
              {'<guid>': 'True | False',
              },
            'errors': [error_info, ], 
           }

List

Retrieves a list of metadata entries for items in the collection. The entire collection operates in a manner similar to an indexed array, however there is no guarantee that the array indexes will reference the same objects between calls (as insert / delete operations may have occurred). If no writes are made to the collection, then paging through the list operation will retrieve all objects readable by the logged in credentials.

If the user is not logged in, then only anonymously readable objects are listed.

If the user is logged in, then only objects readable by that user and group or anonymous users are listed.

By default, the set of items being accessed by the list is all items that are readable with the credentials. This set may be further restricted on collections that support an index by specifying a query with the q parameter.

The default return type is the standard JSON encoded list. A CSV encoded response, with one item per row, is returned if the client sends an HTTP Accept header with a mime type of text/plain or application/csv.

Method
GET
Target
http://<server>/<collection>/[<index>/]
list_response <- list(start=0, pagesize=1000, q=None, )

start
Index of first value to retrieve
pagesize
The number of items to retrieve
q
Query that defines a subset of the collection. This parameter is only used on collections that have an index. The syntax of the query is determined by the type of index as indicated in the Indices operation.
list_response
A JSON encoded list of getmeta_response structures.

response = [{'identifier':'<DC identifier (GUID) of item>',
              'type':'<DC type (otype) value of item>',
              'source':'<DC origin of item>',
              'creator': '<DC creator (uid) of item - use uri pointing to member collection>',
              'created':<DC UTC time item was first inserted into store>,
              'modified':<DC UTC time item was last updated>,
              'permissions':<integer permissions flag>,
              'bytesize':<integer size in bytes of content>,
              'hash': '<md5 hash of content>'
             },
             {}, 
             {}, 
             ... ]
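
Paging through the list operation might look like the sketch below. It treats start as the index of the first entry to retrieve and stops at the first short page; fetch_page is a caller-supplied stand-in for the actual HTTP request, and all names here are hypothetical.

```python
def iter_items(fetch_page, pagesize=1000):
    """Yield every entry by requesting consecutive pages.

    `fetch_page(start, pagesize)` must return one page as a list.
    """
    if pagesize < 1:
        raise ValueError("pagesize must be at least 1")
    start = 0
    while True:
        page = fetch_page(start, pagesize)
        yield from page
        # A short (or empty) page means the end of the listing was reached.
        if len(page) < pagesize:
            break
        start += len(page)

# Stand-in for a server holding five metadata entries.
entries = [{"identifier": f"item-{n}"} for n in range(5)]

def fake_fetch(start, pagesize):
    return entries[start:start + pagesize]

print([e["identifier"] for e in iter_items(fake_fetch, pagesize=2)])
```

As noted above, the result is only complete if no writes are made to the collection while paging.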

CSV response:

"identifier","type","source","creator","created","modified","permissions","bytesize","hash"
...
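
The CSV form can be read with a standard CSV parser, as in this sketch. It assumes the header row shown above is included in the response. The data row is assembled from example values that appear elsewhere in this document and is not an actual server response; note that every value arrives as a string.

```python
import csv
import io

def parse_csv_listing(text):
    """Parse a CSV list response into one dictionary per item."""
    return list(csv.DictReader(io.StringIO(text)))

sample = (
    '"identifier","type","source","creator","created","modified",'
    '"permissions","bytesize","hash"\n'
    '".KU.KU Fish.1004","DwC","http://digir.nhm.ku.edu:80/digir/DiGIR.php",'
    '"http://localhost:8000/members/1/","54643.9144452","54643.9320763",'
    '"210","2761","33ee40b2e431dda458247005355bea14"\n'
)
rows = parse_csv_listing(sample)
print(rows[0]["identifier"], rows[0]["bytesize"])
```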

Types

Retrieves a list of distinct otype values along with the number of occurrences of each.

Method
GET
Target
http://<server>/<collection>/types/

type_list_response <- types(start=0, limit=None)
start
Index of first value to retrieve
limit
The number of items to retrieve
response = {'results': [[<otype>, <count>],
                         ...],
           }

Retrieve

Like List except the result is a set of items rather than metadata.

Method
GET
Target
http://<server>/<collection>/retrieve/
retrieve_response <- retrieve(start=0, pagesize=1000, q=None)
q
Query that defines a subset of the collection. The syntax of the query is that of the Apache Lucene indexer.
start
Index of first value to retrieve
pagesize
The number of items to retrieve
retrieve_response
A JSON encoded list of items.

TODO: Define JSON encoding rules. XML as a hierarchical dictionary, binary objects URLs.

Indices

Retrieves a list of indices supported by the collection. A particular index supports a set of search terms. Multiple indices may be necessary to improve efficiency of search.

One index is tagged as the default. When a request does not specify an index, the default index is used.

If only one index is defined, then it is the default regardless of the value of the "default" flag.

Method
GET
Target
http://<server>/<collection>/indices/
  indices_response <- indices()
  response = {"indices": [{"name": "<name of index (label used in URL)>",
                           "URI": "<URI for index definition>",
                           "syntax": "<URI for syntax description>",
                           "label": "<human readable label for index>",
                           }, 
                          ... 
                          ],
              "default": "<name of the default index>", 
             }
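
The rule for choosing the default index can be sketched as follows. The function name is hypothetical, and the sample index names are placeholders.

```python
def default_index(indices_response):
    """Return the name of the index to use when a request names none."""
    indices = indices_response["indices"]
    if not indices:
        raise ValueError("the collection has no indices")
    # A single index is the default regardless of the "default" entry.
    if len(indices) == 1:
        return indices[0]["name"]
    return indices_response["default"]

# Placeholder index names, for illustration only.
response = {"indices": [{"name": "dwc"}, {"name": "dc"}], "default": "dc"}
print(default_index(response))
```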

Errors:

Fields

Retrieves a list of fields from the index.

Fields are defined as search points.

TODO: There needs to be a mapping between fields and their definition (URI + description). This mapping should be retrievable from the service. Perhaps the response should be something like fieldname + type + URI + range

TODO: Support multiple indexes on the collection? Example - dublin core + dwc

Method
GET
Target
http://<server>/<collection>[/index]/fields/ (1)

(1) The default index is used if not specified.

fields_response <- fields()
  response = {"<field_name>": {"URI": "<URI for field definition>",
                               "type": "<Data type of field>",
                               "label": "<Human readable label for field>",
                              }, 
              ... 
             }

Field

Retrieves a list of distinct values for the specified field.

Method
GET
Target
http://<server>/<collection>/fields/<field name>/

Item Representation in JSON

XML

XML items are converted to a hierarchy of dictionaries.

xml:

<doc xmlns='http://default.name.space/' 
     xmlns:a='http://a.name.space/'>
  <a:item>
  some text
  </a:item>
</doc>

JSON:

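
The JSON encoding rules are still marked TODO in the Retrieve section, so the following is only one plausible mapping and not the encoding defined by this protocol. It turns each element into a dictionary holding its namespace-qualified tag, its text, and its children; the key names tag, text, attributes and children are assumptions.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Convert an XML element into a nested dictionary (illustrative mapping)."""
    # ElementTree reports tags in '{namespace}localname' form.
    node = {"tag": element.tag, "text": (element.text or "").strip()}
    if element.attrib:
        node["attributes"] = dict(element.attrib)
    children = [element_to_dict(child) for child in element]
    if children:
        node["children"] = children
    return node

XML = """<doc xmlns='http://default.name.space/'
     xmlns:a='http://a.name.space/'>
  <a:item>
  some text
  </a:item>
</doc>"""

converted = element_to_dict(ET.fromstring(XML))
print(json.dumps(converted, indent=2))
```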