Specification of DwCStore Protocol Used by Fishweb
- Status
- DRAFT - actively edited
- Protocol URI
- http://purl.oclc.org/NET/WASABI/itemstore/
The DwCStore protocol (DWCS) is used by the fishweb data store to enable remote management of a collection of records, with a primary focus on Darwin Core records expressed in XML. Although the initial focus is on Darwin Core records, there should be no specific properties of the service that limit it to this type of data.
DWCS uses HTTP for transport.
DWCS response messages are JSON encoded objects.
The primary purpose of DWCS is to provide a REST interface to a collection of objects, and metadata describing those objects.
A DWCS server may support subsetting operations by querying an index of the content. Subsetting applies to the List and Retrieve operations.
The collection may contain any content, and basic metadata is recorded about the content as it is inserted or modified in the collection.
Object Metadata
Each object stored in the DWCS has the following associated metadata.
| guid | Unique identifier |
| uid | User identifier |
| gid | Group identifier |
| permissions | Access flag |
| otype | Object type (URI) |
| mimetype | Mime type, the representation of the object |
| mjdcreated | MJD indicating when object was inserted into cache |
| mjdmodified | Last time the object was modified |
| mjdindexed | When the object was indexed (for collections that support indexing) |
| chash | A hash of the object for change detection |
| origin | Origin of the object, typically a URI |
| bytesize | Size in bytes of the object |
| nhits | Number of times object has been retrieved |
| content | The actual object |
Operations
Operations are accessed by URLs. Operations and their respective URL patterns:
| Operation | URL | Method | Response Type | About |
| Authentication | ||||
| login | /slogin/ | POST | JSON | Authenticate and receive token |
| logout | /slogout/ | GET | JSON | Invalidates an authentication token |
| Introspection | ||||
| collections | /collections/ | GET | JSON | List available collections |
| types | /<collection>/types/ | GET | JSON | List types present in a collection (otype values) |
| indices | /indices/ | GET | JSON | List available indices |
| fields | /<index>/fields/ | GET | JSON | List fields in an index |
| field | /<index>/fields/<field>/ | GET | JSON | Provide detail about a field of an index |
| Collection Interaction | ||||
| create | /<collection>/ | POST | JSON | Create a new item |
| get | /<collection>/<id>/ | GET | object mime-type | Retrieve an item |
| info | /<collection>/<id>/meta/ | GET | JSON | Retrieve metadata about an item |
| update | /<collection>/<id>/ | PUT | JSON | Modify an existing item |
| delete | /<collection>/<id>/ | DELETE | JSON | Remove an item from a collection |
| list | /<collection>/[?i=<index>&q=<query>&...] | GET | JSON | List items in a collection |
| updated | /<collection>/updated/<start_mjd>/<end_mjd>/ | GET | JSON | Retrieve all items from a collection updated within time range |
| created | /<collection>/created/<start_mjd>/<end_mjd>/ | GET | XML | Retrieve all items from a collection created within time range |
| retrieve | /<collection>/retrieve/?q=<query>& | GET | JSON | Retrieve all items from a collection encapsulated within an XML document |
| Administrative | ||||
| reindex | /<index>/_reindex/ | GET | JSON | Rebuilds the specified index. Requires administrative privileges |
All JSON responses support the following parameters:
| Key | Value |
| varname | If set, the JSON response is returned as an assignment to a variable of that name, for example test={...} for varname=test. |
The list and retrieve operations support several parameters:
| Key | Value |
| i | Name of index to use. The default index is used if not specified. |
| q | Query to pass on to the index. |
| f | Facet to pass on to the index to restrict application of the query. |
| start | Starting page number of paged results. |
| limit | Number of entries per page. |
The retrieve operation also supports these additional parameters:
| Key | Value |
| doc | The root element of the resulting XML document. Default = 'doc' |
| ns | The namespace (1) of the root element. Default = None |
(1) When specifying a namespace, prefix the root element with the namespace prefix. For example:
&doc=ns1:doc&ns=http%3A%2F%2Fwww.example.com%2Fsome%2Fnamespace
would specify a document like:
<ns1:doc xmlns:ns1="http://www.example.com/some/namespace"> <record>...</record> ... </ns1:doc>
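The query string for a retrieve request can be assembled with the standard library, which takes care of the percent-encoding shown above. This is a minimal sketch: the helper name `retrieve_query` and the sample query `genus:Lepomis` are illustrative assumptions and not part of the protocol.

```python
from urllib.parse import urlencode


def retrieve_query(query, doc="doc", ns=None, start=None, limit=None):
    """Build the query string for a retrieve request (hypothetical helper)."""
    params = {"q": query, "doc": doc}
    # Optional parameters are only sent when the caller supplies them.
    if ns is not None:
        params["ns"] = ns
    if start is not None:
        params["start"] = start
    if limit is not None:
        params["limit"] = limit
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return urlencode(params)


qs = retrieve_query(
    "genus:Lepomis",  # sample query, syntax depends on the index in use
    doc="ns1:doc",
    ns="http://www.example.com/some/namespace",
)
print(qs)
```

Note that `urlencode` also encodes the colon in `ns1:doc` as `%3A`. A server that decodes query parameters in the usual way sees the same value as with the unencoded colon.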
Times are represented as Modified Julian Dates (MJD). An MJD is a double-precision floating-point value giving the number of days since 1858-11-17 00:00:00.00. MJD values are always expressed in UTC.
A time period is specified by an upper and lower bound, with at least one boundary specified. In the updated and created operations, an underscore in place of an MJD value indicates no time boundary. So for example, the URL:
http://some.service/items/created/_/54726.709/
would retrieve a list of all objects in the items collection that were created before MJD 54726.709 (about 2008-09-17 10:00 AM PDT).
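The MJD arithmetic and the underscore convention can be sketched in a few lines of Python. The function names here are illustrative assumptions. The path layout follows the updated and created URL patterns in the operations table.

```python
from datetime import datetime, timedelta, timezone

# MJD 0.0 corresponds to 1858-11-17 00:00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)


def mjd_to_datetime(mjd):
    """Convert an MJD value (days since the epoch) to an aware UTC datetime."""
    return MJD_EPOCH + timedelta(days=mjd)


def datetime_to_mjd(moment):
    """Convert a timezone-aware datetime to an MJD value."""
    return (moment - MJD_EPOCH) / timedelta(days=1)


def time_range_path(collection, operation, start_mjd=None, end_mjd=None):
    """Build the path for an 'updated' or 'created' request.

    A missing boundary is written as an underscore. At least one boundary
    must be given.
    """
    if start_mjd is None and end_mjd is None:
        raise ValueError("at least one time boundary must be specified")

    def bound(value):
        return "_" if value is None else repr(float(value))

    return f"/{collection}/{operation}/{bound(start_mjd)}/{bound(end_mjd)}/"


print(mjd_to_datetime(54726.709))
print(time_range_path("items", "created", end_mjd=54726.709))
```

`datetime_to_mjd` expects a timezone-aware value. Passing a naive datetime raises a TypeError, which guards against silently mixing local time with UTC.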
Responses are either raw or JSON encoded structures. Raw responses are used to return actual content of items in the collection. JSON encoded responses are used for all other actions. The general form of a JSON response is a dictionary with two required keys, data and errors:
response = {'data': {},
'errors': [error_info],
}
Additional dictionary entries may be present which may be processed or ignored by the client.
The data entry is always a dictionary containing information specific to the response.
The errors entry is a list describing problems encountered while generating the response. These may be fatal errors where no data is returned, warnings, or any other diagnostic information. Each error_info entry in the errors list should be a two-element list [error_code, error_text], where error_code is a value drawn from the list or collection operation errors and error_text provides additional human-readable information.
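A client might unpack the response envelope along the following lines. This is a sketch only: the function name `parse_response` and the sample error entry are assumptions, since this document does not enumerate the error codes.

```python
import json


def parse_response(body):
    """Split a JSON response body into (data, errors).

    Raises ValueError if a required key is missing or 'data' is not a
    dictionary. Additional top-level keys are ignored.
    """
    response = json.loads(body)
    for key in ("data", "errors"):
        if key not in response:
            raise ValueError(f"response is missing required key {key!r}")
    data = response["data"]
    if not isinstance(data, dict):
        raise ValueError("'data' must be a dictionary")
    # Each error_info entry is expected to be [error_code, error_text].
    errors = [(code, text) for code, text in response["errors"]]
    return data, errors


# Hypothetical body; the error code and text are illustrative values.
sample = '{"data": {}, "errors": [[404, "Object does not exist."]], "extra": 1}'
data, errors = parse_response(sample)
print(data, errors)
```

The structure snippets in this document use single-quoted strings, but a strict JSON parser such as `json.loads` requires double quotes.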
Login
Validates credentials and returns a token used in the HTTP Authorization header for other operations.
- Method
- POST
- Target
- http://<server>/slogin/
login_response <- login(user, password)
- user
- User identification
- password
- Password verifying identification
- login_response
-
JSON encoded authorization token
response = {'data': {'app': '<application key>', 'sid': '<authorization token>', 'header': '<header text>', }, 'errors': [error_info, ], };
Errors:
- HTTP 401: Invalid credentials
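The login exchange could be prepared as sketched below. Several details are assumptions rather than statements of the protocol: that user and password are sent as form-encoded POST fields with those names, and that the 'header' value in the login response is the complete text to place in the Authorization header. The helper names are hypothetical. The code only builds the request object and does not contact a server.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(server, user, password):
    """Prepare (but do not send) a POST to the login target."""
    # Assumption: credentials travel as form fields named 'user' and 'password'.
    body = urlencode({"user": user, "password": password}).encode("ascii")
    return Request(f"http://{server}/slogin/", data=body, method="POST")


def authorization_headers(login_data):
    """Build headers for later requests from the 'data' part of a login response.

    Assumption: 'header' holds the full Authorization header value.
    """
    return {"Authorization": login_data["header"]}


request = build_login_request("some.service", "alice", "secret")
print(request.get_method(), request.full_url)
```

Sending the prepared request with `urllib.request.urlopen(request)` raises `urllib.error.HTTPError` with code 401 when the server rejects the credentials.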
Logout
Invalidates a token.
- Method
- GET
- Target
- http://<server>/slogout/
response = {'data': {'status':'True | False', },
'errors': [error_info, ],
};
If response['data']['status'] is not True, then the token was not invalidated, most likely because the token itself was not valid.
Errors:
Collections
Provides a list of collections on the server.
- Method
- GET
- Target
-
http://<server>/collections/
response = {'data': {'collections': [<collection_name>, ], }, 'errors': [error_info, ], };
Errors:
Create
Creates a new item in the collection. If the connected user does not have write permission on the collection, then an HTTP 403 error is returned with a JSON encoded body.
The default permissions of the new object are specified by dumpster/src/itemstore/store.DEFAULT_PERMISSIONS (octal 0322, which is 210 in decimal; rwr-r-).
- Method
- POST
- Target
- http://<server>/<collection>/
create_response <- create(content, guid, type, origin=None, doindex=1)
POST parameters:
- content (required)
- The payload to be inserted into the collection. Both the client and the server treat it as an opaque object. Either side may perform additional pre- or post-processing, but such operations are not defined in this specification.
- guid (required)
- The globally unique identifier for the object. An error is returned if the specified GUID already exists within the collection.
- type (required)
- The object type of the supplied item. This is used to label items within a collection as being instances of a particular type of data. It is recommended that URIs are used to identify object types.
- origin (optional)
- A URL pointing to information about the origin of the item. If not specified, then the URL of the collection will be used.
- doindex (optional)
- Collections may have an indexer attached. This parameter is a hint that the server may use to delay indexing of the item, which can be useful when a large volume of content is being uploaded. The server may choose to ignore the parameter. A server that does not support indexing must ignore it.
- create_response
-
A JSON encoded instance of a response object.
response = {'data': {'<guid>':'<Full URL pointing to object>', }, 'errors': [error_info, ], };
Both 'data' and 'errors' will be present. On failure, data will be an empty dictionary, with additional information in the errors list.
Errors:
- HTTP 401: User is not authenticated and so should do so before submitting
- HTTP 403: User is authenticated but not allowed to write to collection
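A create request might be assembled as follows. The sketch assumes the POST parameters are sent form-encoded, which this specification does not state explicitly. The helper name and the `auth_headers` argument are hypothetical. No request is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_create_request(server, collection, content, guid, otype,
                         origin=None, doindex=1, auth_headers=None):
    """Prepare (but do not send) a POST that creates a new item."""
    fields = {"content": content, "guid": guid, "type": otype}
    # origin is optional; the server falls back to the collection URL.
    if origin is not None:
        fields["origin"] = origin
    # doindex is only a hint, which the server is free to ignore.
    fields["doindex"] = doindex
    return Request(
        f"http://{server}/{collection}/",
        data=urlencode(fields).encode("ascii"),
        headers=dict(auth_headers or {}),
        method="POST",
    )


request = build_create_request(
    "some.service",
    "items",
    "<record>...</record>",  # placeholder payload
    ".KU.KU Fish.1004",
    "http://purl.oclc.org/NET/WASABI/darwincore/",
)
print(request.get_method(), request.full_url)
```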
Get
Retrieves a single item from the collection. The response contains the actual content (bytes) of the item. Store-specific metadata about the item is retrieved using the info() operation. If the specified item does not exist in the collection, then an HTTP 404 error is returned. If the connected user does not have read permission on the item, then an HTTP 403 error is returned.
- Method
- GET
- Target
- http://<server>/<collection>/<guid>/
get_response <- get(guid)
- guid
- The GUID of the item to retrieve
- get_response
- The item bytes.
Errors:
- HTTP 401: Authentication is required to view object. The object exists but does not allow anonymous read.
- HTTP 403: Insufficient privileges to view object. The object exists and the user is identified, but the user is not allowed to view the object (for example, not in the object's group).
- HTTP 404: Object does not exist.
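The three error conditions imply an order of checks, because 401 and 403 both presuppose that the object exists. A minimal sketch of that decision follows. The function and argument names are illustrative and do not describe the server implementation.

```python
def get_status(exists, authenticated, readable):
    """Return the HTTP status a get request would receive.

    exists        -- the object is present in the collection
    authenticated -- the request carries a valid token
    readable      -- the requester (anonymous or not) may read the object
    """
    if not exists:
        return 404  # Object does not exist.
    if readable:
        return 200  # Item bytes are returned.
    if not authenticated:
        return 401  # Exists, but anonymous read is not allowed.
    return 403      # Exists, user is identified, but lacks privileges.


print(get_status(exists=True, authenticated=False, readable=False))
```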
Info
Retrieves collection-specific metadata about the item identified by GUID. The metadata provides attributes similar to those of a file system store, which are described in getmeta_response. If the specified item does not exist in the store, then an HTTP 404 error is returned. If the connected user does not have read permission on the item, then an HTTP 403 error is returned.
- Method
- GET
- Target
- http://<server>/<collection>/<guid>/meta/
getmeta_response <- getmeta(guid, varname=None)
- guid
- The GUID of the item for which metadata is to be retrieved.
- varname
- If set, then the JSON response will be set to that variable name, otherwise just the structure is returned.
- getmeta_response
- Item metadata. Response types of RDF and JSON are supported and selected by the Accept: header sent by the client.
Errors:
- HTTP 401: Authentication is required to view object. The object exists but does not allow anonymous read.
- HTTP 403: Insufficient privileges to view object. The object exists and the user is identified, but the user is not allowed to view the object (for example, not in the object's group).
- HTTP 404: Object does not exist.
JSON encoded response:
response = {'data':
{'<guid>':
{'type':'<DC type (otype) value of item>',
'source':'<DC origin of item>',
'creator': '<DC creator (uid) of item - use uri pointing to member collection>',
'created':<DC UTC time item was first inserted into store>,
'modified':<DC UTC time item was last updated>,
'permissions':<integer permissions flag>,
'bytesize':<integer size in bytes of content>,
'hash': '<md5 hash of content>'
},
},
'errors': [error_info],
}
RDF encoded response:
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:rcp="http://fishnet2.net/vocab/">
  <rdf:Description rdf:about="http://localhost:8000/items/.KU.KU%20Fish.1004/">
    <dc:identifier>.KU.KU%20Fish.1004/</dc:identifier>
    <dc:type rdf:resource="http://purl.oclc.org/NET/WASABI/darwincore/" />
    <dc:creator rdf:resource="http://localhost:8000/members/1/" />
    <dc:created>2008-06-01T10:42:11.0Z</dc:created>
    <dc:modified>2008-06-01T10:42:11.0Z</dc:modified>
    <rcp:bytesize>1234</rcp:bytesize>
    <rcp:permissions>210</rcp:permissions>
    <rcp:hash>33ee40b2e431dda458247005355bea14</rcp:hash>
  </rdf:Description>
</rdf:RDF>
(Note: investigate overhead requirements for adding a read count for each item).
Example:
$ curl "http://localhost:8000/items/.KU.KU%20Fish.1004/meta/?varname=test" \
-H "Accept: application/json"
test={"origin": "http://digir.nhm.ku.edu:80/digir/DiGIR.php",
"bytesize": 2761,
"modified": 54643.9320763,
"chash": "33ee40b2e431dda458247005355bea14",
"otype": "DwC",
"created": 54643.9144452,
"guid": ".KU.KU Fish.1004",
"permissions": 210}
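When varname is set, the body is a variable assignment rather than plain JSON, so a non-browser client has to strip the prefix before parsing. A small sketch follows, using an abbreviated copy of the response above. The helper name is an assumption, and the tolerance for a trailing semicolon is a guess based on the structure snippets elsewhere in this document.

```python
import json


def parse_varname_response(body, varname):
    """Parse a response of the form '<varname>=<json>' into a Python object."""
    text = body.strip()
    prefix = varname + "="
    if text.startswith(prefix):
        text = text[len(prefix):]
    # Assumption: some responses may end with ';', which is dropped here.
    return json.loads(text.rstrip().rstrip(";"))


body = 'test={"bytesize": 2761, "guid": ".KU.KU Fish.1004", "permissions": 210}'
meta = parse_varname_response(body, "test")
print(meta["guid"], oct(meta["permissions"]))
```

The permissions value is reported in decimal: 210 is the same flag as octal 0322.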
Update
Updates an existing item with new content. The otype or origin of an item cannot be changed without deleting the existing instance and creating a new one. An error (HTTP 404) is returned if the item with the specified GUID does not exist in the collection. If the connected user does not have write permission on the item, then an HTTP 403 error is returned.
- Method
- PUT
- Target
- http://<server>/<collection>/<guid>/
update_response <- update(guid, content)
- guid
- The GUID of the item for which content is to be updated.
- content
- The content of the item that will replace the existing content.
- update_response
- A JSON encoded response object.
Errors:
- HTTP 401: Authentication is required to modify the object. The object exists but does not allow anonymous modification.
- HTTP 403: Insufficient privileges to modify object. The object exists and the user is identified, but the user is not allowed to modify the object (for example, not in the object's group).
- HTTP 404: Object does not exist.
response = {'data':
{'<guid>': '<Full URL pointing to object>',
'hash': '<md5 hash of object (used for change detection)>',
},
'errors': [error_info, ],
};
Delete
Removes an item and its associated metadata from the collection. If the item does not exist, then an HTTP 404 error is returned. If the connected user does not have write permission on the item, then an HTTP 403 error is returned.
- Method
- DELETE
- Target
- http://<server>/<collection>/<guid>/
delete_response <- delete(guid)
- guid
- The GUID of the item to be deleted.
- delete_response
- A JSON encoded response object.
Errors:
- HTTP 401: Authentication is required to delete the object. The object exists but does not allow anonymous deletion.
- HTTP 403: Insufficient privileges to delete object. The object exists and the user is identified, but the user is not allowed to delete the object (for example, not in the object's group).
- HTTP 404: Object does not exist.
response = {'data':
{'<guid>': 'True | False',
},
'errors': [error_info, ],
}
List
Retrieves a list of metadata entries for items in the collection. The entire collection operates in a manner similar to an indexed array. However, there is no guarantee that the array indexes will reference the same objects between calls, as insert or delete operations may have occurred. If no writes are made to the collection, then paging through the list operation will retrieve all objects readable with the current credentials.
If the user is not logged in, then only anonymously readable objects are listed.
If the user is logged in, then only objects readable by that user, by the user's group, or by anonymous users are listed.
By default, the set of items being accessed by the list is all items that are readable with the credentials. This set may be further restricted on collections that support an index by specifying a query with the q parameter.
The default return type is the standard JSON encoded list. A CSV encoded response with one item per row can be retrieved if the client sends an HTTP Accept header with a MIME type of text/plain or application/csv.
- Method
- GET
- Target
- http://<server>/<collection>/[<index>/]
list_response <- list(start=0, pagesize=1000, q=None, )
- start
- Index of first value to retrieve
- pagesize
- The number of items to retrieve
- q
- Query that defines a subset of the collection. This parameter is only used on collections that have an index. The syntax of the query is determined by the type of index, as indicated by the Indices operation.
- list_response
- A JSON encoded list of getmeta_response structures.
response = [{'identifier':'<DC identifier (GUID) of item>',
'type':'<DC type (otype) value of item>',
'source':'<DC origin of item>',
'creator': '<DC creator (uid) of item - use uri pointing to member collection>',
'created':<DC UTC time item was first inserted into store>,
'modified':<DC UTC time item was last updated>,
'permissions':<integer permissions flag>,
'bytesize':<integer size in bytes of content>,
'hash': '<md5 hash of content>'
},
{},
{},
... ]
CSV response:
"identifier","type","source","creator","created","modified","permissions","bytesize","hash" ...
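A CSV list response can be read with the standard csv module. The sketch assumes the first row is the header shown above. The data row is made up for illustration from values that appear elsewhere in this document, and `parse_list_csv` is a hypothetical helper name.

```python
import csv
import io

# Fields that the JSON form of the response describes as integers.
INTEGER_FIELDS = ("permissions", "bytesize")


def parse_list_csv(text):
    """Parse a CSV list response into a list of dictionaries keyed by column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in INTEGER_FIELDS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


sample = (
    '"identifier","type","source","creator","created","modified",'
    '"permissions","bytesize","hash"\n'
    '".KU.KU Fish.1004","DwC","http://digir.nhm.ku.edu:80/digir/DiGIR.php",'
    '"http://localhost:8000/members/1/","54643.9144452","54643.9320763",'
    '"210","2761","33ee40b2e431dda458247005355bea14"\n'
)
items = parse_list_csv(sample)
print(items[0]["identifier"], items[0]["bytesize"])
```

The time columns are left as strings because this document shows them both as MJD values and as ISO 8601 timestamps.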
Types
Retrieves a list of distinct otype values along with the number of occurrences of each.
- Method
- GET
- Target
-
http://<server>/<collection>/types/
type_list_response <- types(start=0, limit=None)
- start
- Index of first value to retrieve
- limit
- The number of items to retrieve
response = {'results': [[<otype>, <count>],
...],
}
Retrieve
Like List except the result is a set of items rather than metadata.
- Method
- GET
- Target
- http://<server>/<collection>/retrieve/
retrieve_response <- retrieve(start=0, pagesize=1000, q=None)
- q
- Query that defines a subset of the collection. The syntax of the query is that of the Apache Lucene indexer.
- start
- Index of first value to retrieve
- pagesize
- The number of items to retrieve
- retrieve_response
- A JSON encoded list of items.
TODO: Define JSON encoding rules. XML as a hierarchical dictionary, binary objects URLs.
Indices
Retrieves a list of indices supported by the collection. A particular index supports a set of search terms. Multiple indices may be necessary to improve efficiency of search.
One index is tagged as the default. The default index is used when a request does not specify an index.
If only one index is defined, then it is the default regardless of the value of the "default" flag.
- Method
- GET
- Target
-
http://<server>/<collection>/indices/
indices_response <- indices()
response = {"indices": [{"name": "<name of index (label used in URL)>",
"URI": "<URI for index definition>",
"syntax": "<URI for syntax description>",
"label": "<human readable label for index>",
},
...
],
"default": "<name of the default index>",
}
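The rule for choosing the default index can be expressed compactly. This is a sketch over the response structure above. The function name and the index names in the sample are illustrative assumptions.

```python
def default_index(indices_response):
    """Return the name of the index to use when a request names none.

    A single defined index is the default regardless of the 'default'
    value. With several indices, the 'default' entry decides.
    """
    indices = indices_response.get("indices", [])
    if not indices:
        return None  # No index is available on this collection.
    if len(indices) == 1:
        return indices[0]["name"]
    return indices_response.get("default")


sample = {
    "indices": [{"name": "dwc"}, {"name": "dc"}],  # hypothetical index names
    "default": "dc",
}
print(default_index(sample))
```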
Errors:
Fields
Retrieves a list of fields from the index.
Fields are defined as search points.
TODO: There needs to be a mapping between fields and their definition (URI + description). This mapping should be retrievable from the service. Perhaps the response should be something like fieldname + type + URI + range
TODO: Support multiple indexes on the collection? Example - dublin core + dwc
- Method
- GET
- Target
- http://<server>/<collection>[/index]/fields/ (1)
(1) The default index is used if not specified.
fields_response <- fields()
response = {"<field_name>": {"URI": "<URI for field definition>",
"type": "<Data type of field>",
"label": "<Human readable label for field>",
},
...
}
Field
Retrieves a list of distinct values for the specified field.
- Method
- GET
- Target
- http://<server>/<collection>/fields/<field name>/
Item Representation in JSON
XML
XML items are converted to a hierarchy of dictionaries.
xml:
<doc xmlns='http://default.name.space/' xmlns:a='http://a.name.space/'> <a:item> some text </a:item> </doc>
JSON:
