1 Rank Xerox Research Centre, Grenoble Laboratory
6, chemin de Maupertuis. F-38240 Meylan, France
{borghoff, pareschi, willamow}@grenoble.rxrc.xerox.com
2 Department of Physics, Carl von Ossietzky University Oldenburg
D-26111 Oldenburg, Germany
{hilf, severien, stamer}@merlin.physik.uni-oldenburg.de
Although our motivation is the physics community, it goes without saying that the ideas and issues of this project apply in general.
The coming of the electronic age in document handling has changed this practice, especially in physics because of competitive pressure, high mobility of researchers, and large and distributed groups of collaborators, many of whom are also skillful programmers.
However, as Figure 1 points out, for a full integration there are still some missing links. Some tasks, such as document management and information retrieval are already tightly linked together. Other links have still to be established or improved. For example, workflow management systems are not fully integrated with the document management and retrieval process. The same holds true for publishing/viewing and for electronic report distribution/output systems, such as a printing-on-demand system.
Figure 1: Islands of Integration (courtesy of PERSEO consulting). This paper addresses the integration of
Document Management and Report Distribution, or more generally the integration of Document Management
/ Information Retrieval and Report Distribution / Output.
This paper addresses the improvement and integration of two of these "islands", namely the document management/information retrieval system "island" and the report distribution/output systems "island". An application to document handling in physics illustrates our approach.
Two groups of servers can be distinguished: a) servers dedicated to high quality documents only; and b) servers created and maintained locally where all kinds of other not refereed ("gray") documents can be posted and found.
Authors post about 40000 scientific publications per year to the globally interconnected distributed preprint hosts belonging to the first group. Such hosts run at the physics laboratories LANL , home of their creator P. Ginsparg, SLAC, Brown University, as well as at the European mirror sites SISSA and Augsburg University . They make the publications along with complete bibliographic data available through well-adapted retrieval engines. They offer addon services such as links to documents cited by the author, and links to documents citing the document itself. They also give the authors a priority assuring entrance date, and provide world-wide access to the documents, as well as readability and archiving.
Other documents which their authors deem not suitable for distribution via preprint servers, such as laboratory annual reports, project status reports, or Ph.D. theses are often still accessible via local department Web-servers. Since the documents are not formally registered, each author individually decides on the format and the content of the information s/he makes available. Apart from the documents themselves, authors often also provide their publication list, references to older published papers, and links to full scanned documents.
Thus, electronic posting of documents is now a generally accepted practice in physics, but integrated automatic searching, retrieval and printing/delivery of the many repositories can be improved greatly. The search engines for preprint servers, publishers' servers and local department Web-servers are not connected to each other. The necessary bibliographic metainformation is, especially in department servers, not sufficiently marked-up. Search engines use proprietary retrieval languages and heterogeneous semantics/logics. There is no agreed standard for presenting the retrieved information. In some cases, just raw data is provided, which is difficult to interpret.
As a consequence, a user searching for documents on a certain topic must
Among the specific services provided (Hilf et al. 1996a-c) there are PhysDoc and PhysDiss, supporting, respectively, the search of locally stored scientific documents and Ph.D. theses. But, due to the heterogeneity of the information provided by the authors, these services often provide rather cryptic output. Hence, the problem still remains of extracting the relevant data, and making them readable in some familiar user-friendly format. Under discussion is fixing a common set of metadata to be provided by the authors; such a uniform standard should help yield more homogeneous output from the search services.
In order to integrate the EPS services with other services, such as the preprint system cited above, the Constraint-Based Knowledge Broker (CBKB) system , currently under develop- ment at the Rank Xerox Research Centre in Grenoble, France, was chosen. The CBKB system supports efficient document retrieval, and allows uniform access to the different existing search engines. The first prototype was presented at the 1st PAAM Conference (Borghoff et al. 1996), as well as at the joint IuK-Conference, in Munich, Germany (IuK 1996).
The CBKB system acts as a front-end to a printing-on-demand system, originally called Physicists Network Publishing System (PNPS), which addresses the problems of document printing and delivery. It was developed during 1995 in a joint effort between the physicists at the University of Oldenburg, Germany and Rank Xerox Business Services, Germany. The initial design and an early prototype were presented at the DPG-Conference 1995, in Berlin, Germany.
The CBKB system forwards the retrieved documents to the printing-on-demand system. It thus bridges the gap between the information retrieval/document management system "island", represented by the Web-servers with their search interfaces, and the report distribution/output systems "island", represented by the printing-on-demand system.
The Wide-Area Information Servers (WAIS) (Kahle and Medlar 1991), for instance, provide a Z39.50-based search and retrieval interface. Another famous approach is Archie (Emtage and Deutsch 1992) which periodically contacts a set of registered ftp-servers to index the file names.
At the University of Karlsruhe, A.-C. Achilles supports a well-known application on top of Glimpse (Manber and Wu 1994), namely the sophisticated search facility for a large collection of computer science bibliographies.
Based on Glimpse, Hardy et al. (1994-1996) developed Harvest, in which they separated a gatherer and a broker. Using special indexing-techniques, Harvest reduces network load.
Although unspecific undedicated multi-source index/search subsystems have already been built for Gopher, with Veronica, and for the WWW, e.g. with Alta Vista, Lycos, Yahoo, retrieval engines or retrieval support systems for heterogeneous information are an active research field (Barbara 1993).
The current search tools can be roughly divided into four groups. The first group of unspecific undedicated brokers tries to surpass each other in their coverage of html-pages (e.g., Alta Vista). They index "everything" their spiders find. They have a very efficient, but rather simple keyword-based interface allowing keyword grouping with and and or. However, the user is then left to extract the desired specific answer out of a possibly huge number of hits, say the few scientific publications out of the thousands of retrieved user group annotations. Even if the advanced query interface of Alta Vista for example provides means to rank the retrieved documents, this is generally not enough to relieve the user from filtering out the interesting documents by himself.
The second group, the dedicated ("homogenous") brokers, search on a defined coherent set of hopefully well marked up documents with a correspondingly well adapted search engine. Examples include brokers for books in a library, or the abstract data bases of commercial hosts. The user starts by choosing a data repository (e.g. Library of Congress or some university's digital library), gets a corresponding form showing what she/he can ask for there, and, finally, performs the search at that repository gaining a well marked up set of answers.
The third group of dedicated heterogeneous brokers normally just take a simple keyword query and pushes it to several distributed brokers and search engines, independently and in parallel. The user receives a list of answers with comments where the hits were found and which tool was successful in finding them. Recent examples in physics are a broker by M. Bischoff, searching in about 50 distributed libraries in the Darmstadt area in Germany, and the One Stop In Physics, searching in the preprint databases of major high-energy research facilities and in the LANL repository. However, these systems fall short of the ideal requirements due to restricted functionality, possibly unspecific and excessive querying of the addressed brokers, and to poor precision. Also, over all they still provide only poor recall.
The final group includes offline Web agents designed for asynchronous browsing-support, such as
To answer this need we have designed and implemented a heterogeneous broker, which logically combines heterogeneous information retrieved from other brokers, on-line repositories, and databases (Borghoff and Schlichter 1996), in reply to a physicist's search query.
Intelligent agents (CACM 37:7 1994, Wooldridge and Jennings 1995) or knowledge brokers (Barbara and Clifton 1992) also address this problem. A related thrust in knowledge brokerage has been defined by the consortium of the KRAFT project.
One basic law governing the future society of brokers negotiating and querying each other automatically is that the querying broker be at least as "intelligent" (as far as query semantics and result parsing are concerned) as the queried ones. Otherwise, the user should be passed directly to the more "intelligent" broker.
This is why we specify that referred specific, and heterogeneous brokers be at least as "intelligent" as the queried ones. The brokers provide added-on value by intelligent combining and evaluating the heterogeneous data. We define as task: creating a search service on the Web where the user can formulate a complex, well-specified request, including interdependencies among the attribute-based entries of different, heterogeneous data sources. This service should offer on-line as well as off-line help for putting together the request, and should exploit the full search semantics of the search engines involved. In addition to that, the tool could be enhanced by local constraint solving techniques to extend the search capabilities in order to allow local filtering and sifting of information. We will give examples in the later paper. Finally, we envision the pure search service integrated into the networked publishing system, as outlined by Borghoff et al. (1996b).
The Constraint-Based Knowledge Broker system (CBKB) fullfills these specifications. Constraints have been introduced to flexibly manage the search space of broker agents, as well as to flexibly adapt user requests and answers from information providers. Andreoli et al. (1995, 1996) present the theoretical background of the CBKB system. Protocol issues within the CBKB system are addressed by Arcelli et al. (1995) and by Borghoff et al. (1997).
Fikes et al. (1995) also use logic-based models to capture the domain of expertise of information brokers. Instead of using constraints, their modeling language is based on a predicate logic with contexts. The Tsimmis project (Chawathe et al. 1994) takes a different approach using a self-describing object model for the internal representation of information and queries.
The proposed service for integrated document delivery in physics in Europe combines the strengths of the search capabilities of the Constraint-Based Knowledge Broker model with the advantages of the Physicists Network Publishing System. At the backend, the service will connect, among others, to the Harvest servers within the European Physical Society, and the global physics preprint network. At the frontend, the user interacts with the service via her/his preferred Java-enabled Web-browser.
The objectives of the Oldenburg server are different from those of the Augsburg preprint server. The Oldenburg server aims to provide uniform access to as large as possible a set of documents available at European physics departments. It provides only basic information about the heterogeneously marked-up and distributed documents. The Augsburg server gives access to a large number of formally registered scientific documents with complete bibliographical data. It provides detailed information about each registered document. Thus the possibilities to formulate a query and the format and complexity of the results obtained from these servers differs greatly. Also, until the CBKB integrated these two services, the Web query-forms to these services consisted only of a textfield for query specification (see Figure 2).
With the Oldenburg search services, the original Web search form allowed formulating queries either using a large number of possible search fields, such as document-type, author, or title, or alternatively via fulltext search. However, the possibility to retrieve a document via specific search fields depends on the mark-up/meta-information provided by the author. For example, if the author field is not explicitly marked up, it cannot be indexed by the Harvest engine and can therefore not be used as search field to retrieve the document. Due to the non-uniformity of the mark-up and the provided bibliographic meta-information, the results returned by the Oldenburg server only contain the URL of the matched document, the corresponding host and path, a one-line-description, and the lines of the document containing a match to the query.
In order to homogenize indexed documents and query results information, a common set of meta-information fields for all the connected department servers is currently being defined.
Figure 2: The search form for the PhysDep service only provides a textfield to enter the search term.
With the Augsburg search service, a query always triggers a fulltext search. The search keywords are matched against a WAIS index covering the author, the title and the abstract of the registered documents. Thus through the service's web search form, one cannot, for example, explicitly search for articles containing specific search keywords in the title: all articles containing the keyword in the abstract are then also listed. On the other hand, the filtering capabilities of the CBKB-system can filter hits. In contrary to the Oldenburg server, the results returned from the Augsburg server contain the abstract and complete bibliographical data as well as the URL of the matching documents. These result fields can also be constrained in order to further filter out the most relevant documents.
The goal of the Constraint-Based Knowledge Broker system (CBKB) is to provide a uniform search interface to a number of the existing backends, no matter which search engine or indexing tool they use. The CBKB interface exploits all the capabilities of the underlying backends but provides a uniform formalism for formulating queries and to present the results obtained from the different servers. Several relevant fields, such as author, title etc., are proposed to the user for specifying queries (Borghoff et al. 1996). The final output can be displayed in any of a number of formats selectable by the user: readable on screen as bibliographical data (authors, title, status, link, abstract); in summary form (ranking or sorting the results alphabetically by title or author for example); or displaying the complete information found about each matched document.
Figure 3 shows CBKB in action: the system is managing several queries concurrently (in the foreground), and displays query results (in the background).
Figure 3: The graphical user interface of the Constraint-Based Knowledge Broker system presenting results to
a query about documents containing laser in the title.
In the example, Query 1 requests documents about lasers. Twelve hits were found in the global preprint service (Augsburg mirror) and the Oldenburg server altogether. The user can dynamically customize this presentation: choose format, and specificity (e.g. display only the information responding to the main query or display information corresponding to possible subqueries of a complex query). The user can then either directly inspect the retrieved documents (by clicking on the corresponding URL) or decide to select relevant hits for saving or sending to the printing-on-demand service (see Section 4).
Figure 4: The graphical user interface of the Constraint-Based Knowledge Broker during complex query
formulation. In this example the user's query is composed of two subqueries; the mainquery is displayed in this
screenshot. The user is searching for documents written after may 1995, by the same author as the documents
retrieved with subquery 1. Subquery 1 could e.g. be the query for an article for which the user knows the title
but not the author.
For the moment, the user can only enter simple flat queries, although the user interface and the brokers are able to manage complex ones, i.e. queries that must be decomposed into interdependent subqueries. For example, the results of a first subquery could be used as search pattern in a second subquery (see Figure 4). More concretely, one could search for articles written by co-authors of a specific author. This capability cannot be exploited at the present time because the results obtained from the different connected backends are still too heterogeneous and some result fields would require further natural language analysis. For instance, if the contents of the result-field author is to be used in a subsequent query, its content should be split into several fields if it contains several names. Furthermore other backends have to be connected in order to obtain sufficient coverage to formulate and respond to complex queries.
Connecting an external server to the system is done by analyzing its search interface, then writing a wrapper for it. This wrapper receives the description of the constraints corresponding to the query, translates them into the query-string required by the search script, verifies that the indicated fields are accepted by the server and provides default values for required fields not specified by the user. It then queries the server and receives the results in html-format. Finally it parses the results and translates them into the constraint format accepted by the CBKB system. Additional filtering and sifting of results according to user-constrained field entries that are not properly handled due to restrictions in the backend search capabilities is managed by the local constraint solvers of the CBKB system. Problems such as homogenizing the presentation of results from different backends, e.g. presenting information always in the same order, are solved by the CBKB user interface.
The work required to integrate a new server depends solely on the complexity of the query interface the server provides. While writing of the corresponding code can be done relatively quickly, it could be accelerated if the backend servers automatically provided the format specification for the query, enumerating the query fields they accept, and the format in which they send back results, listing the possible result-fields. Currently result parsing is done using the html-tags used to separate the items; thus, whenever the tag is changed on the server side, the interface program must be modified accordingly.
One field always provided as result by the connected backends is the URL of the matched document. This allows to retrieve the document for printing. The URL together with other relevant information out of the result set can then be forwarded to the document delivery subsystem, described in the following.
The global physics document server solves this by storing the original LaTeX documents, tested as compiling to a postscript file. The attached graphics elements are not guaranteed however, as they are archived at the local server of the author. In addition, these attachments are often very large. It is frustrating for the reader to download these attachments over small bandwidth connections and then to have problems printing them out. Printing-on-demand services can remove these burdens from the user.
Writing a script to funnel a given URL to a local printer is easy. However, for a professional (and thus commercial) printing-on-demand service, the transfer is more complex. A prototype has been written which comprises an initial Web-form for the order, customer identification, and includes the URLs of the documents in question. The prototype phase of the transfer operation passes the printing requests to Oldenburg, then downloads the document from the archive where it resides, checks for printability, and processes the document. In addition, a job ticket transfers the necessary information to the provider of the service. Finally, the document is printed, bound (when requested), and sent by surface mail to the customer. Although this process was time consuming and took several days for delivery, the customer got - with seemingly no effort - a perfectly bound book with professional color print-out on his desk. The prototype system was successfully tested in late 1995 and received a positive response from many countries all over the world. The launch of the respective commercial service by Rank Xerox is still pending. Different business models are currently being evaluated.
The printing-on-demand service can be seamlessly integrated as one backend option of the broker. An other printout service is offered by some large University Libraries as well as the British National Library. Here the document has to be available at the respective library as a paper document. If a document is ordered by e-mail, fax or letter, the library scans in the respective document picked from the shelf, and delivers it by fax (outside Germany), or, in some cases, by e-mail to the reader. While this is cumbersome for large documents it certainly fills a gap.
The real benefit to a reader, however, is to have a single user-friendly search interface, and to receive the document readable on his desk, with no effort on his/her side. Thus, all the different parts of such a service should be smoothly embedded into the future envisioned Physicists Network Publishing System.
At the present time, each Web-server to be searched must be connected manually to a centralized system. However, a centralized solution for registration and adaptation of the many different data formats used within the search/index subsystems will eventually fail. Therefore, we plan to provide a semi-automized wrapper-builder which allows rapid connection of new servers/repositories to the existing service. Once, a critical mass has been reached and the service is well-established, the open design of the integrated document retrieval and delivery system will enable remote sites to register their repositories in a do-it-yourself style. Changes to the remote data format or modifications in the search interface could then be notified by the data owners themselves. This will greatly improve the consistency and coherence of the service.
While it is straightforward to coherently search through ordinary Web-servers, a society of collaborating heterogeneous brokers has yet to emerge in the near future. The CBKB framework addresses the problem of extracting a maximum amount of information from other distributed brokers, repositories and document servers, by logically combining the heterogeneous information these provide. Thus, it reduces the load on the user, and, in some cases, on the net as well. Care has to be taken to transfer the full search semantics an individual broker offers to the CBKB system. If an underlying broker is searched less specifically than necessary, too much information must be downloaded and the underlying broker will be used more intensively than necessary. The overall reputation of this ,meta-broker" might suffer. As a first step, a CBKB operator has to negotiate the access modes among the individual brokers.
Andreoli, J.-M., Borghoff, U. M., Pareschi, R. (1996). The Constraint-Based Knowledge Broker Model: Semantics, Implementation and Analysis. J. Symbolic Computation 21:4, 635-667.
Arcelli, F., Borghoff, U. M., Formato, F., Pareschi, R. (1995). Tuning Constraint-Based Communication in Distributed Problem Solving. In Proc. 1st Int. Workshop on Concurrent Constraint Programming (CCP '95), Venice, Italy.
Barbara, D., Clifton, C. (1992). Information Brokers: Sharing Knowledge in a Heterogeneous Distributed System. Technical Report MITL-TR-31-92, Matsushita Information Technology Lab., Princeton, NJ.
Barbara, D. (1993). Extending the Scope of Database Systems. Technical Report MITL-TR-44-93, Matsushita Information Technology Lab., Princeton, NJ.
Borghoff, U. M., Pareschi, R., Arcelli, F., Formato, F. (1997). Constraint-Based Protocols for Distributed Problem Solving. Science of Computer Programming, to appear in Oct. 1997.
Borghoff, U. M., Pareschi, R., Karch, H., N”hmeier, M., Schlichter, J. H. (1996). Constraint- Based Information Gathering for a Network Publication System. In PAAM `96, London, pp. 45-59.
CACM 37:7 (1994). Special Issue on Intelligent Agents. Communications of the ACM.
Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., Widom, J. (1994). The Tsimmis Project: Integration of Heterogeneous Information Sources. In Proc. IPSJ Conf., Tokyo, Japan. Los Alamitos, CA: IEEE Comp. Soc. Press.
Emtage, A., Deutsch, P. (1992). Archie: An Electronic Directory Service for the Internet. In Proc. Usenix Conf. Winter '92, pp. 93-110, Sunset Beach, CA. Berkeley, CA: Usenix Association.
Fikes, R., Engelmore, R., Farquhar, A., Pratt, W. (1995). Network-Based Information Brokers. In Proc. AAAI Spring Symp. Series on Information Gathering from Distributed Heterogeneous Environments, Stanford, CA.
Hardy, D. R., Schwartz, M. F., Wessels, D. (1994-1996). Harvest User's Manual.
Kahle, B., Medlar, A. (1991). An Information System for Corporate Users: Wide Area Information Servers. Connexions: The Interoperability Report 5:11, 2-9.
Manber, U., Wu, S. (1994). Glimpse: A Tool to Search Trough Entire File Systems. In Proc. Usenix Conf. Winter '94, pp. 23-32, San Francisco, CA. Berkeley, CA: Usenix Association.
Obraczka, K., Danzig, P. B., Li, S.-H. (1993). Internet Resource Discovery Services. IEEE Computer 26:9, 8-22.
Schwartz, M. F., Emtage, A., Kahle, B., Neuman, B. C. (1992). A Comparison of Internet Resource Discovery Approaches. Computing Systems 5:4, 461-493.
Wooldridge, M., Jennings, N. R. (1995). Intelligent Agents: Theory and Practice. Knowledge Engineering Review 10:2, 115-152.