White paper on the OAI "result set filtering" issue

Martin Vesely - Tibor Simko - Thomas Baron
Last revision: 26 October 2001

1. Purpose

The objective of the result set filtering specification is to enhance the functionality of the OAI-PMH (OAI Protocol for Metadata Harvesting) [1] so that harvesting can be more finely selective, which would widen the range of applications the protocol can serve. There are currently two mechanisms for selective harvesting: selection by a time range applied to the date of the last record modification (datestamp), and selection by metadata sets, which represent the internal repository structure as defined by the data provider. These mechanisms make periodic and incremental harvesting of entire metadata sets possible; however, they allow no further refinement within these sets. For certain applications, selection based on these mechanisms alone is not sufficient. Result set filtering should allow large sets to be refined further on the data provider side. The introduction of result set filtering into the OAI-PMH is the subject of this document.
 

2. Selection modes

With the current OAI-PMH v1.1, the following two selection modes are already possible:
    1) selection by datestamp (mandatory), represented by keywords 'from' and 'until'
      This selection mode is used for periodic incremental harvesting, when Data Provider (DP) repositories are fully mirrored at the Service Provider (SP).

    2) selection by sets (optional, refinement by the Data Provider - DP), represented by keyword 'set'
      This selection mode is used when only part of the repository metadata needs to be harvested, so the repositories are not fully mirrored. In principle this selection is "pushed" by data providers, as they define what can be selected. In cross-archive, and even more in cross-disciplinary, applications this would require a uniform definition of sets in all repositories, based on some accepted standard. Furthermore, the set contents that an SP needs may change over time or differ from one SP to another. Selection by sets is therefore quite rigid for cross-archive and cross-disciplinary applications.

In order to add more refinement capabilities, we propose a third selection mode:
    3) selection by matching query (optional, refinement by the Service Provider - SP)
      Result set filtering using a matching query is useful in applications that require low network traffic, or where the requested metadata set is small compared to the entire harvestable metadata set of a large repository. For small and numerous repositories (e.g. as in the Kepler project [3]), result set filtering should remain optional. In particular, applications with hierarchical harvesting and metadata brokering [4] would profit from such extended functionality. Result set filtering would also allow service providers to model the distributed-archive paradigm within the OAI framework.

Note: These three selection modes could be combined within the same OAI request.
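
To make the two existing selection modes concrete, the sketch below assembles a harvesting request that uses the 'from', 'until' and 'set' keywords. It is only an illustration: the base URL, the function name selective_harvest_url, the set name "physics" and the choice of the ListRecords verb with the oai_dc metadata prefix are assumptions made for this example, not part of this proposal.

```python
from urllib.parse import urlencode


def selective_harvest_url(base_url, from_date=None, until_date=None,
                          set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request using selection modes 1) and 2).

    All argument names and defaults are illustrative assumptions.
    """
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if from_date:
        params.append(("from", from_date))      # mode 1: datestamp lower bound
    if until_date:
        params.append(("until", until_date))    # mode 1: datestamp upper bound
    if set_spec:
        params.append(("set", set_spec))        # mode 2: set defined by the DP
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, used only for illustration.
BASE = "http://repository.example.org/oai"

# Incremental harvest of one week of changes.
print(selective_harvest_url(BASE, from_date="2001-10-19", until_date="2001-10-26"))
# Partial harvest restricted to a set named "physics" (an assumed set name).
print(selective_harvest_url(BASE, set_spec="physics"))
```

Neither request can narrow the result below the granularity of a datestamp range or a provider-defined set, which is the limitation the third selection mode is meant to address.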

3. From hierarchical harvesting to metadata brokering

These selection modes play different roles in the relation between data providers and service providers.
DP--\          
DP---SP/DP---SP
DP--/          
DP --> SP/DP --> SP

 
DP--\ /--SP
DP---B---SP
DP--/ \--SP
DP --> Broker --> SP
Selection modes 1) and 2) are expected mostly in the DP-B (Data Provider to Broker) relationship, while selection mode 3) is expected mostly in the B-SP (Broker to Service Provider) relationship.

4. Distributed-archive paradigm

Using the matching query approach, the protocol would allow remote access to metadata in DP repositories without any metadata mirroring, which would make it usable in cross-search-like applications (document searches, personal alerts and personal baskets, to name a few examples). According to this paradigm, the metadata would not have to be harvested periodically; it would instead be accessed at its original location on demand.

However, this is not the primary aim of the result set filtering extension, as further specifications would have to be adopted in order to obtain a fully distributed protocol (e.g. a SOAP extension [5]).
 
 

5. Protocol extension proposal

The Dublin Core Metadata Element Set [2] is the mandatory OAI metadata format, so it is a natural candidate for expressing queries. Ideally, however, the matching query should not depend on a specific metadata format. The query should be expressed in a simple way so that it can be easily parsed by the repository server. Nested boolean logic with the basic operations (AND, OR, NOT) should be supported. The use of this new feature should be optional.
 

5.1 Proposal 1

The matching query will be carried in the HTTP request by introducing a new OAI keyword 'Matching'.
It is encoded in the HTTP query string as follows:
 
?Verb=OAIVerb&Matching=OAIQuery

where OAIQuery is composed as a boolean expression over the queried fields:
 
 
Matching={ElementID=Value[;|,|~]}

where boolean operations between fields are expressed in a glimpse-like syntax [6]:

';'    semicolon    conjunction (AND)
','    comma        disjunction (OR)
'~'    tilde        negation (NOT)

Nested queries can be expressed using parentheses.
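
As an illustration of how easily a repository server could parse such a query, the following sketch implements a small recursive-descent parser and an evaluator for it. Several points are assumptions, since the proposal does not fix them: '~' is treated as a prefix operator, negation binds tighter than ';', which binds tighter than ',', values may not contain the operator characters, and a field matches when one of the record's values equals the queried value, ignoring case. The function names tokenize, parse and matches are hypothetical.

```python
OPERATORS = set(";,~()")


def tokenize(query):
    """Split a query into operator characters and ElementID=Value terms."""
    tokens, buf = [], []
    for ch in query:
        if ch in OPERATORS:
            if buf:
                tokens.append("".join(buf).strip())
                buf = []
            tokens.append(ch)
        else:
            buf.append(ch)
    if buf:
        tokens.append("".join(buf).strip())
    return [t for t in tokens if t]


def parse(query):
    """Return a nested tuple tree: ("term", id, value), ("not", n),
    ("and", l, r) or ("or", l, r). Precedence is an assumption."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():                      # lowest precedence: ','
        node = parse_and()
        while peek() == ",":
            take()
            node = ("or", node, parse_and())
        return node

    def parse_and():                     # middle precedence: ';'
        node = parse_not()
        while peek() == ";":
            take()
            node = ("and", node, parse_not())
        return node

    def parse_not():                     # highest precedence: prefix '~'
        if peek() == "~":
            take()
            return ("not", parse_not())
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        if tok is None or tok in OPERATORS or "=" not in tok:
            raise ValueError(f"expected ElementID=Value, got {tok!r}")
        element, value = tok.split("=", 1)
        return ("term", element.strip(), value.strip())

    tree = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree


def matches(node, record):
    """Evaluate a parsed query against a record given as {element: [values]}."""
    kind = node[0]
    if kind == "term":
        _, element, value = node
        return value.lower() in (v.lower() for v in record.get(element, []))
    if kind == "not":
        return not matches(node[1], record)
    if kind == "and":
        return matches(node[1], record) and matches(node[2], record)
    return matches(node[1], record) or matches(node[2], record)
```

For example, parse("(creator=Smith,creator=Jones);~date=1999") yields an "and" node whose left child is the "or" of the two creator terms and whose right child is the negated date term. A data provider could apply matches to each candidate record before including it in the response.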

Selection can be based on any combination of the three selection keywords (from/until, set, Matching), which are not mutually exclusive.

Matching query example: a query requesting, on a weekly basis, the new records registered in a particular repository that were created by Smith and published in 2000:
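
A request of this kind could be assembled as in the following sketch. It assumes that the fields are named after the Dublin Core elements 'creator' and 'date', that the weekly window is expressed with the existing 'from' keyword, and that the request uses the ListRecords verb with the oai_dc metadata prefix; the repository address and the function name weekly_matching_url are hypothetical. Because ';' and '=' are reserved in query strings, the sketch percent-encodes the value of 'Matching'.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def weekly_matching_url(base_url, today, matching, metadata_prefix="oai_dc"):
    """Build a request for records changed in the last 7 days that also
    satisfy the proposed 'Matching' query (names are assumptions)."""
    week_ago = today - timedelta(days=7)
    params = [
        ("verb", "ListRecords"),
        ("from", week_ago.isoformat()),      # existing datestamp selection
        ("metadataPrefix", metadata_prefix),
        ("Matching", matching),              # proposed result set filter
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical repository address; 'creator' and 'date' are assumed field names.
url = weekly_matching_url(
    "http://repository.example.org/oai",
    date(2001, 10, 26),
    "creator=Smith;date=2000",
)
print(url)
# http://repository.example.org/oai?verb=ListRecords&from=2001-10-19&metadataPrefix=oai_dc&Matching=creator%3DSmith%3Bdate%3D2000
```

Here the datestamp selection limits the response to recently modified records, and the matching query then filters that result set on the data provider side.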

5.2 Other issues

Since this new feature should be optional, some questions remain. Here are two possible solutions to answer these questions:

5.2.1 Added fields on "Identify" verb

5.2.2 Create new verb

6. Recommendations

The OAI-PMH does not intend to replace other, more complex protocols such as Z39.50. This is mainly due to the youth of the protocol and its inventors' wish to keep it simple, so that as many users as possible adopt it. Once this protocol becomes a widely used standard in metadata exchange (which is likely to happen), will data providers want to implement other protocols just to gain access to the missing functionality?
Result set filtering should be planned for the fairly short term, and its implementation should be made optional in order to preserve the simplicity of implementation targeted by OAI.
 

7. References

[1] OAI Protocol for Metadata Harvesting: http://www.openarchives.org/OAI_protocol/openarchivesprotocol.html

[2] Dublin Core Metadata Element Set: http://www.dublincore.org/documents/dces/

[3] Kepler project: http://kepler.cs.odu.edu

[4] Liu X, Maly K, Zubair M: Arc - An OAI Service Provider for Digital Library Federation, D-Lib Magazine 7/2001

[5] Simple Object Access Protocol (SOAP) 1.1: http://www.w3.org/TR/SOAP/

[6] Glimpse man page: http://webglimpse.org/glimpsehelp.html