White Paper on the OAI "result set filtering" issue
Martin Vesely - Tibor Simko - Thomas Baron
Last revision: 26 October 2001
1. Purpose
The objective of the result set filtering specification is to enhance the
functionality of the OAI-PMH (OAI Protocol for Metadata Harvesting) [1]
so that harvesting can be more finely selective, and thereby to widen the
range of applications in which the protocol can be used. There are currently
two mechanisms for selective harvesting: selection by time range, based on
the date of the last record modification (datestamp), and selection by
metadata sets, which represent the internal repository structure as defined
by the data provider. These mechanisms make periodic and incremental
harvesting of entire metadata sets possible, but they allow no further
refinement within these sets. For certain applications, selection based on
these mechanisms alone is not sufficient. Result set filtering should allow
further refinement of large sets on the data provider side. The introduction
of result set filtering into the OAI-PMH is the subject of this document.
2. Selection modes
With the current OAI-PMH v1.1, the following two selection modes are already
possible:
1) selection by datestamp (mandatory), represented by keywords 'from'
and 'until'
This selection mode is used for periodic incremental harvesting, when
data provider (DP) repositories are fully mirrored at the service provider (SP).
2) selection by sets (optional, refinement by the Data Provider - DP),
represented by keyword 'set'
This selection mode is used when only part of the repository metadata needs
to be harvested, so the repositories are not fully mirrored. In principle
this selection is "pushed" by data providers, as they define what can be
selected. In cross-archive, and even more so in cross-disciplinary,
applications this would require a uniform definition of sets in all
repositories, based on some accepted standard. Furthermore, the required set
contents of the harvested metadata may change over time or differ from one SP
to another. Selection by sets is therefore quite rigid for cross-archive
and cross-disciplinary applications.
In order to add more refinement capabilities, we propose a third selection
mode:
3) selection by matching query (optional, refinement by the Service
Provider - SP)
Result set filtering with a matching query is useful in applications that
require low network traffic, or where the requested metadata set is
relatively small compared to the entire harvestable metadata set of a large
repository. For small and numerous repositories (e.g. as in the Kepler
project [3]) result set filtering should remain optional. In particular,
applications with hierarchical harvesting and metadata brokering [4] would
benefit from such extended functionality. Result set filtering would also
allow service providers to model the distributed-archive paradigm within
the OAI framework.
Note: These three selection modes could be combined within the same OAI
request.
3. From hierarchical harvesting to metadata brokering
These selection modes play different roles in the relation between data
providers and service providers.
- Hierarchical harvesting (see [4]):
With the current OAI-PMH, hierarchical harvesting can be schematically
described as:
DP--\
DP---SP/DP---SP
DP--/
- Metadata brokering:
The metadata brokering model uses a mediator (broker, B) between data
providers and service providers. The broker can filter and refine harvested
metadata sets before passing them on to the service providers. This
facilitates metadata exchange between numerous DPs on one side and numerous
SPs on the other.
DP--\ /--SP
DP---B---SP
DP--/ \--SP
Selection modes 1) and 2) are expected mostly in the DP-B relationship,
while selection mode 3) is expected mostly in the B-SP relationship.
4. Distributed-archive paradigm
Using the matching query approach, the protocol will allow remote access to
metadata in DP repositories without any metadata mirroring, which would
enable its use in cross-search-like applications (document searches, personal
alerts and personal baskets, to name a few examples). According to this
paradigm, the metadata would not have to be harvested periodically, but
would rather be accessed at its original location on demand.
However, this is not the primary aim of the result set filtering extension,
as further specifications would have to be adopted in order to obtain
a fully distributed protocol (e.g. a SOAP extension [5]).
5. Protocol extension proposal
The Dublin Core Metadata Element Set [2] is the mandatory metadata format
in OAI, so it is a natural candidate for expressing the queries. Ideally,
however, the matching query should not depend on a specific metadata format.
The query should be expressed in a simple way so that it can be easily
parsed by the repository server. Nested boolean logic with the basic
operations (AND, OR, NOT) should be included. The use of this new feature
should be optional.
5.1 Proposal 1
The matching query will be carried in the HTTP request by introducing a new
OAI keyword, 'Matching'. It is encoded in the HTTP query string as follows:
?Verb=OAIVerb&Matching=OAIQuery
where OAIQuery is composed as a boolean expression over the queried fields:
Matching={ElementID=Value[;|,|~]}
and the boolean operations between fields are expressed in a glimpse-like
syntax [6]:
';' (semicolon): conjunction
',' (comma): disjunction
'~' (tilde): negation
Nested queries can be expressed using parentheses.
Selection can be based on any combination of the three selection keywords
(from/until, set, Matching), which are not mutually exclusive.
Matching query example: a query issued weekly to a particular repository,
requesting new records created by Smith and published in 2000:
?Verb=ListIdentifiers&Matching=creator=Smith;date=2000
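As an illustration only, the example request above could be assembled by a
harvester as in the following Python sketch. The function name build_request,
the base URL (borrowed from the requestURL used in the ListMatching example of
this document) and the 'from' date are assumptions made for the example;
'Matching' is the keyword proposed here and is not part of OAI-PMH v1.1. The
sketch percent-encodes the query value, because ';' and '=' are reserved
characters in a URL query string; the repository would decode it back to the
glimpse-like form before parsing.

```python
from urllib.parse import urlencode


def build_request(base_url, verb, matching=None, from_=None, until=None, set_spec=None):
    """Return an OAI request URL; every selection keyword is optional."""
    # 'Verb' is spelled as in the proposal's example request.
    params = [("Verb", verb)]
    if from_:
        params.append(("from", from_))
    if until:
        params.append(("until", until))
    if set_spec:
        params.append(("set", set_spec))
    if matching:
        # Proposed keyword; not part of OAI-PMH v1.1.
        params.append(("Matching", matching))
    return base_url + "?" + urlencode(params)


url = build_request(
    "http://an.oa.org/OAI-script",       # hypothetical repository address
    "ListIdentifiers",
    matching="creator=Smith;date=2000",
    from_="2001-10-19",                  # assumed date of the previous weekly harvest
)
print(url)
```

Because the three keywords are appended independently, the same helper covers
requests that use any combination of datestamp, set and matching selection.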
- The response format of the verb used will not be changed by the addition
of this new keyword.
- If the archive does not support matching at all, or does not support
matching on the requested field, a standard exception with no metadata
container should be returned.
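On the data provider side, the matching query has to be parsed and evaluated
against each candidate record. The following Python sketch shows one possible
way to do this for the syntax of Proposal 1. Several points are assumptions,
since the proposal does not fix them: ';' is taken to bind more tightly than
',', a value is taken to match when it is a case-insensitive substring of any
value of the element, element names are compared case-insensitively, and a
record is represented as a dictionary of lists. The names parse_matching and
record_matches are hypothetical.

```python
import re


def parse_matching(query):
    """Parse 'ElementID=Value' terms joined by ; , ~ and ( ) into a nested tuple."""
    tokens = [t.strip() for t in re.findall(r"[;,~()]|[^;,~()]+", query) if t.strip()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():                      # ',' = disjunction (assumed lowest precedence)
        node = parse_and()
        while peek() == ",":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():                     # ';' = conjunction
        node = parse_unary()
        while peek() == ";":
            take()
            node = ("and", node, parse_unary())
        return node

    def parse_unary():                   # '~' = negation, '(' ... ')' = nesting
        tok = take()
        if tok == "~":
            return ("not", parse_unary())
        if tok == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if tok is None or tok in (";", ",", ")") or "=" not in tok:
            raise ValueError(f"expected ElementID=Value, got {tok!r}")
        field, value = tok.split("=", 1)
        return ("match", field.strip().lower(), value.strip())

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree


def record_matches(tree, record):
    """Evaluate a parsed query against a record given as {element: [values]}."""
    op = tree[0]
    if op == "and":
        return record_matches(tree[1], record) and record_matches(tree[2], record)
    if op == "or":
        return record_matches(tree[1], record) or record_matches(tree[2], record)
    if op == "not":
        return not record_matches(tree[1], record)
    _, field, value = tree
    # Assumed semantics: case-insensitive substring match on any value.
    return any(value.lower() in v.lower() for v in record.get(field, []))


record = {"creator": ["Smith, J."], "date": ["2000-05-01"], "title": ["On harvesting"]}
print(record_matches(parse_matching("creator=Smith;date=2000"), record))
print(record_matches(parse_matching("(creator=Jones,creator=Smith);~date=1999"), record))
```

A malformed query raises ValueError, which a repository could map to the
standard exception response mentioned above.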
5.2 Other issues
Since this new feature should be optional, some questions remain:
- How can a harvester know whether an archive supports the Matching keyword
or not?
- If the archive supports it, which elements can be used for matching
(all of standard DC, part of it, other metadata formats)?
Here are two possible ways to answer these questions:
5.2.1 Added fields on "Identify" verb
- An optional repeatable tag could be added to the response to the Identify
verb, indicating which elements are available for matching.
- Example implementation using the "description" container:
<Identify
    xmlns="http://www.openarchives.org/OAI/1.1/OAI_Identify"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/1.1/OAI_Identify
                        http://www.openarchives.org/OAI/1.1/OAI_Identify.xsd">
  [...]
  <description>
    <oai-matching
        xmlns="..."
        xmlns:xsi="..."
        xsi:schemaLocation="...">
      <matching>
        <format>dc</format>
        <elementID>Title</elementID>
      </matching>
      <matching>
        <format>oai_marc</format>
        <elementID>24513_a</elementID>
        <elementID>100___a</elementID>
      </matching>
    </oai-matching>
  </description>
  [...]
</Identify>
- If an archive does not support matching at all, none of these fields is
sent back in response to the Identify query.
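A harvester could use such a response to decide whether, and on which
elements, it may send a matching query. The Python sketch below extracts the
advertised elements per metadata format. It is only an illustration: the
function name supported_matching is hypothetical, the namespace
"urn:example:oai-matching" is a placeholder standing in for the unspecified
"..." above, and tags are compared by local name so that the sketch does not
depend on whichever namespace is eventually chosen.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def supported_matching(xml_text):
    """Return {format: [elementID, ...]} for every <matching> block in a response."""
    supported = {}
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "matching":
            continue
        fmt = None
        ids = []
        for child in elem:
            name = local_name(child.tag)
            if name == "format":
                fmt = (child.text or "").strip()
            elif name == "elementID":
                ids.append((child.text or "").strip())
        if fmt:
            supported.setdefault(fmt, []).extend(ids)
    return supported


# Trimmed-down Identify response; the oai-matching namespace is a placeholder.
SAMPLE = """<Identify xmlns="http://www.openarchives.org/OAI/1.1/OAI_Identify">
  <description>
    <oai-matching xmlns="urn:example:oai-matching">
      <matching>
        <format>dc</format>
        <elementID>Title</elementID>
      </matching>
      <matching>
        <format>oai_marc</format>
        <elementID>24513_a</elementID>
        <elementID>100___a</elementID>
      </matching>
    </oai-matching>
  </description>
</Identify>"""

capabilities = supported_matching(SAMPLE)
print(capabilities)
```

An archive that sends none of these fields yields an empty dictionary, which
the harvester can treat as "matching not supported".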
5.2.2 Create new verb
- A "ListMatching" verb is added.
An example reply to this request could be:
<?xml version="1.0" encoding="UTF-8"?>
<ListMatching
    xmlns="http://www.openarchives.org/OAI/1.1/OAI_ListMatching"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/1.1/OAI_ListMatching
                        http://www.openarchives.org/OAI/1.1/OAI_ListMatching.xsd">
  <responseDate>2001-06-01T19:20:30-04:00</responseDate>
  <requestURL>http://an.oa.org/OAI-script?verb=ListMatching</requestURL>
  <matching>
    <format>dc</format>
    <elementID>Title</elementID>
  </matching>
  <matching>
    <format>oai_marc</format>
    <elementID>24513_a</elementID>
    <elementID>100___a</elementID>
  </matching>
</ListMatching>
6. Recommendations
The OAI-PMH is not intended to replace other, more complex protocols
such as Z39.50. This is mainly because the protocol is young and because
its designers want to keep it simple, so that as many users as possible
adopt it. Once this protocol becomes a widely used standard in metadata
exchange (which is likely to happen), will data providers want to implement
other protocols just to gain access to the missing functionality?
Result set filtering should be planned for the relatively near term,
and its implementation should be made optional in order to preserve the
simplicity of implementation targeted by OAI.
7. References
[1] OAI Protocol for Metadata Harvesting: http://www.openarchives.org/OAI_protocol/openarchivesprotocol.html
[2] Dublin Core Metadata Element Set: http://www.dublincore.org/documents/dces/
[3] Kepler project: http://kepler.cs.odu.edu
[4] Liu X, Maly K, Zubair M: Arc - An OAI Service Provider for Digital Library Federation, D-Lib Magazine 7/2001
[5] Simple Object Access Protocol (SOAP) 1.1: http://www.w3.org/TR/SOAP/
[6] Glimpse man page: http://webglimpse.org/glimpsehelp.html