Nicholson, S. (2003). Bibliomining for automated collection development in a digital library setting: Using data mining to discover web-based scholarly research works. Journal of the American Society for Information Science and Technology, 54(12), 1081-1090.
Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Works

Scott Nicholson
Syracuse University School of Information Studies
4-127 Center for Science and Technology, Syracuse, NY 13244
Phone: 315-443-1640  Fax: 315-443-5806
http://www.scottnicholson.com
http://www.bibliomining.org
firstname.lastname@example.org

This is a preprint of an article accepted for publication in the Journal of the American Society for Information Science and Technology. ©2003 John Wiley & Sons.
0. ABSTRACT

This research creates an intelligent agent for automated collection development in a digital library setting. It uses a predictive model based on facets of each Web page to select scholarly works. The criteria came from the academic library selection literature, and a Delphi study was used to refine the list to 41 criteria. A Perl program was designed to analyze a Web page for each criterion and was applied to a large collection of scholarly and non-scholarly Web pages. Bibliomining, or data mining for libraries, was then used to create different classification models. Four techniques were used: logistic regression, non-parametric discriminant analysis, classification trees, and neural networks. Accuracy and return were used to judge the effectiveness of each model on test datasets. In addition, a set of problematic pages that were difficult to classify because of their similarity to scholarly research was gathered and classified using the models.
The resulting models could be used in the selection process to automatically create a digital library of Web-based scholarly research works. In addition, the technique can be extended to create a digital library of any type of structured electronic information.
Keywords: Digital Libraries, Collection Development, World Wide Web, Search Engines, Bibliomining, Data Mining, Intelligent Agents
1. INTRODUCTION

Web sites contain information that ranges from the highly significant through to the trivial and obscene, and because there are no quality controls or any guide to quality, it is difficult for searchers to take information retrieved from the Internet at face value. The Internet will not become a serious tool for professional searchers until the quality issues are resolved.
The Quality of Electronic Information Products and Services, IMO
One purpose of the academic library is to provide access to scholarly research. Librarians select material appropriate for academia by applying a set of explicit and tacit selection criteria. This manual task has been manageable for the world of print. However, in order to aid selectors with the rapid proliferation and frequent updating of Web documents, an automated solution must be found to help searchers find scholarly research works published on the Web. Bibliomining, or data mining for libraries, provides a set of tools that can be used to discover patterns in large amounts of raw data, and those patterns can form the basis of a model for an automated collection development aid (Nicholson and Stanton, in press; Nicholson, 2002).
One of the difficulties in creating this solution is determining the criteria and specifications for the underlying decision-making model. A librarian makes this decision by examining facets of the document and determining from those facets if the work is a research work. The librarian is able to do this because he/she has seen many examples of research works and papers that are not research works, and recognizes patterns of facets that appear in research works.
Therefore, to create this model, many samples of Web-based scholarly research papers are collected along with samples of other Web-based material. For each sample, a program in Perl (a pattern-matching computer language) analyzes the page and determines the value for each criterion. Different bibliomining techniques are then applied to the data in order to determine the best set of criteria to discriminate between scholarly research and other works. The best model produced by each technique is tested with a different set of Web pages. The models are then judged using measures called accuracy and return, which are based on the traditional evaluation measures of precision and recall. Finally, the performance of each model is examined with a set of pages that are difficult to classify.
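The original Perl data collection program is not reproduced in this paper; the following is a minimal sketch, under our own assumptions, of the kind of per-criterion check it describes. The criterion names and regular expressions are illustrative placeholders, not the 41 criteria produced by the Delphi study.

#!/usr/bin/perl
# Illustrative sketch only: checks one saved Web page against a few
# hypothetical criteria and prints one value per criterion, mirroring
# the per-page data collection step described in this paper.
use strict;
use warnings;

my $file = shift @ARGV or die "usage: $0 page.html\n";
open my $fh, '<', $file or die "cannot open $file: $!\n";
my $page = do { local $/; <$fh> };    # slurp the whole page
close $fh;

# Each criterion pairs a name with a code reference that returns its value.
my %criteria = (
    has_abstract   => sub { $_[0] =~ /\babstract\b/i ? 1 : 0 },
    has_references => sub { $_[0] =~ /\b(?:references|bibliography)\b/i ? 1 : 0 },
    year_mentions  => sub { my @years = $_[0] =~ /\b(?:19|20)\d{2}\b/g; scalar @years },
    word_count     => sub { my @words = split /\s+/, $_[0]; scalar @words },
);

for my $name (sort keys %criteria) {
    printf "%s\t%s\n", $name, $criteria{$name}->($page);
}

In practice, the values produced for each page would be written to a data table so that the modeling techniques described below can be trained and tested on them.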
1.1 Problem Statement
Researchers need a digital library consisting of Web-based scholarly works due to the rapidly growing amount of academic research published on the Web. The general search tools overwhelm the researcher with non-scholarly documents, and the subject-specific academic search tools may not meet the needs of those in other disciplines. An automated collection development agent is one way to quickly discover online academic research works.
In order to create a tool for identifying Web-based scholarly research, a decision-making model for selecting scholarly research must first be designed. Therefore, the goal of the present study is to develop a decision-making model that can be used by a Web search tool to automatically select Web pages that contain scholarly research works, regardless of discipline. This tool could then be used as a filter for the pages collected by a traditional Web page spider, which could aid in the collection development task for a scholarly digital library.
1.2 Definitions

1.2.1 Scholarly Research Works

To specify the types of resources that this predictive model will identify, the term “scholarly research works” must be defined. For this study, scholarly research is limited to research written by students or faculty of an academic institution, works produced by a non-profit research institution, or works published in a scholarly peer-reviewed journal. Research, as defined by Dickinson in Science and Scientific Reasoning, is a “systematic investigation towards increasing the sum of knowledge” (1984, pg. 33). This investigation, therefore, may be a literature review, a qualitative or quantitative study, a thinkpiece, or another type of scholarly exploration. A research work is defined as a Web page (a single HTML or text file) that contains the full text of a research report. As the Web page has become the standard unit for indexing and reference by search tools and style manuals, the Web page is used here as the information container.
1.2.2 Accuracy / Precision and Return / Recall
The models are judged using measures named accuracy and return; these are based on the traditional IR measures of precision and recall. Accuracy (precision) and return (recall) are both defined in their classical information retrieval sense, as first defined by Cleverdon (1962). Accuracy is measured by dividing the number of pages that are correctly identified as scholarly research by the total number of pages identified as scholarly research by the model. Return is determined by dividing the number of pages correctly identified as scholarly research by the total number of pages in the test set that are scholarly research. When applied to the Web as a whole, return cannot be easily defined. However, a higher return in the test environment may indicate which tool will be able to discover more scholarly research published on the Web.
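Stated as formulas (the notation is ours, not the paper's), the two measures over a test set are:

\[
\text{accuracy} = \frac{|\text{pages identified as scholarly} \cap \text{pages that are scholarly}|}{|\text{pages identified as scholarly}|}
\qquad
\text{return} = \frac{|\text{pages identified as scholarly} \cap \text{pages that are scholarly}|}{|\text{pages in the test set that are scholarly}|}
\]

For example, if a model flagged 100 test pages as scholarly research, 80 of them correctly, and the test set contained 200 scholarly research pages in total, accuracy would be 80/100 = 0.80 and return would be 80/200 = 0.40.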
1.2.3 Problematic Pages

Problematic pages are Web pages that might appear to this agent to be scholarly research works (as defined above in 1.2.1), but are not. Categories of problematic pages are author biographies, syllabi, vitae, abstracts, corporate research, research that is in languages other than English, and pages containing only part of a research work. Future researchers will want to incorporate some of these categories into digital library tools, and this level of failure analysis will assist those researchers in adjusting the models presented in this research.
1.3 Research Overview

First, a set of criteria used in academic libraries for print selection is collected from the literature, and a Delphi study is conducted with a panel of librarians to refine the list. The criteria are then translated into terms appropriate for Web documents, and a Perl program is written that collects the aspects of a Web page that correspond to the criteria.
This data collection tool is used to gather information on 5,000 pages with scholarly research works and 5,000 pages without these works. This data set is split, with the majority of the pages used to train the models and the rest used to test the models. The training set is used to create different models using logistic regression, memory-based reasoning (through non-parametric n-nearest neighbor discriminant analysis), decision trees, and neural networks.
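To illustrate how one of these techniques could be applied once trained, the sketch below scores a page's criterion values with a logistic-regression-style model. The coefficients are hypothetical placeholders rather than the weights estimated in this study, and the criterion names follow the earlier sketch rather than the actual criterion list.

#!/usr/bin/perl
# Sketch of applying a fitted logistic-regression model to one page's
# criterion values. All coefficients below are hypothetical placeholders.
use strict;
use warnings;

my %weight = (
    intercept      => -2.0,
    has_abstract   =>  1.5,
    has_references =>  2.1,
    year_mentions  =>  0.05,
);

# Criterion values for one page, as produced by the data collection tool.
my %page = ( has_abstract => 1, has_references => 1, year_mentions => 24 );

# Linear predictor: intercept plus the weighted sum of criterion values.
my $z = $weight{intercept};
for my $criterion (grep { $_ ne 'intercept' } keys %weight) {
    $z += $weight{$criterion} * (exists $page{$criterion} ? $page{$criterion} : 0);
}

# The logistic (sigmoid) function turns the score into a probability.
my $p = 1 / (1 + exp(-$z));
printf "P(scholarly research work) = %.3f -> %s\n",
    $p, $p >= 0.5 ? 'select' : 'reject';

A page would be added to the digital library only when its predicted probability crosses a chosen threshold; the choice of threshold trades accuracy against return.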
Another set of data is used to tweak the models and make them less dependent on the training set. Each model is then applied to the testing set. Accuracy and return are determined for each model, and the best models are identified.
1.4 Literature Review

This section explores closely related literature and the placement of this research in the areas of the selection of quality materials, data mining, and similar projects.
1.4.1 Selection of Quality Materials

Should the librarian be a filter for quality? S. D. Neill argues for it in his 1989 piece, suggesting that librarians, along with other information professionals, become information analysts who sift through scientific articles and remove those that are not internally valid. By looking for those pieces that are “poorly executed, deliberately (or accidentally) cooked, fudged, or falsified” (Neill, 1989, pg. 6), information analysts can help in filtering for quality of print information.
Piontek and Garlock also discuss the role of librarians in selecting Web resources. They argue that collection development librarians are ideal in this role because of “their experience in the areas of collection, organization, evaluation, and presentation” (1996, pg. 20). Academic librarians have been accepted as quality filters for decades. Therefore, the literature from library and information science will be examined for appropriate examples of criteria for quality from print selection and Internet resource selection.
1.4.1.1 Selection of Print Materials
The basic tenet in selection of materials for a library is to follow the library’s policy, which in an academic library is based upon supporting the school’s curriculum (Evans, 2000). Because of this, there are not many published sets of generalized selection criteria for academic libraries.
One of the most well-known researchers in this area is S. R. Ranganathan. His five laws of librarianship (as cited in Evans, 2000) are a classical base for many library studies. There are two points he makes in this work that may be applicable here. First, if something is already known about an author and the author is writing in the same area, then the same selection decision can be made with some confidence. Second, selection can be made based upon the past selection of works from the same publishing house. The name behind the book may imply quality or a lack thereof, and this can make it easier to make a selection decision.
Library Acquisition Policies and Procedures (Futas, 1995) is a collection of selection policies from across the country. By examining these policies from academic institutions, one can find the following criteria for
quality works that might be applicable in the Web environment:
• Reference materials like encyclopedias, handbooks, dictionaries, statistical compendia, standards, style manuals, and bibliographies.
1.4.1.2 Selection of Online and Internet Resources

Before the Internet was a popular medium for information, libraries were faced with electronic database selection. In 1989, a wish list was created for database quality by the Southern California Online Users Group (Basch, 1990). This list had 10 items, some of which were coverage, scope, accuracy, integration, documentation, and value-to-cost ratio.
This same users group discussed quality on the Internet in 1995 (as cited in Hofman and Worsfold, 1999).
They noted that Internet resources were different from the databases because those creating the databases were doing so to create a product that would produce direct fiscal gain, while those creating Internet resources, in general, were not looking for this same gain. Because of this fact, they felt that many Internet resource providers did not have the impetus to strive for a higher-quality product.