

Scott Nicholson - Bibliomining for Automated Collection Development in...etting: Using Data Mining to Discover Web-Based Scholarly Research Work

Nicholson, S. (2003). Bibliomining for automated collection development in a digital library setting: Using data mining to discover web-based scholarly research works. Journal of the American Society for Information Science and Technology, 54(12), 1081-1090.

Bibliomining for Automated Collection Development in a Digital Library Setting: Using Data Mining to Discover Web-Based Scholarly Research Works

Scott Nicholson
Syracuse University School of Information Studies
4-127 Center for Science and Technology
Syracuse, NY 13244
Phone: 315-443-1640
Fax: 315-443-5806
http://www.scottnicholson.com
http://www.bibliomining.org
scott@scottnicholson.com

This is a preprint of an article accepted for publication in Journal of the American Society for Information Science and Technology ©2003 John Wiley & Sons.

0. ABSTRACT

This research creates an intelligent agent for automated collection development in a digital library setting. It uses a predictive model based on facets of each Web page to select scholarly works. The criteria came from the academic library selection literature, and a Delphi study was used to refine the list to 41 criteria. A Perl program was designed to analyze a Web page for each criterion and applied to a large collection of scholarly and non-scholarly Web pages. Bibliomining, or data mining for libraries, was then used to create different classification models. Four techniques were used: logistic regression, non-parametric discriminant analysis, classification trees, and neural networks. Accuracy and return were used to judge the effectiveness of each model on test datasets. In addition, a set of problematic pages that were difficult to classify because of their similarity to scholarly research was gathered and classified using the models.

The resulting models could be used in the selection process to automatically create a digital library of Web-based scholarly research works. In addition, the technique can be extended to create a digital library of any type of structured electronic information.

Keywords: Digital Libraries, Collection Development, World Wide Web, Search Engines, Bibliomining, Data Mining, Intelligent Agents

1. INTRODUCTION

Web sites contain information that ranges from the highly significant through to the trivial and obscene, and because there are no quality controls or any guide to quality, it is difficult for searchers to take information retrieved from the Internet at face value. The Internet will not become a serious tool for professional searchers until the quality issues are resolved.

The Quality of Electronic Information Products and Services, IMO

One purpose of the academic library is to provide access to scholarly research. Librarians select material appropriate for academia by applying a set of explicit and tacit selection criteria. This manual task has been manageable for the world of print. However, Web documents proliferate rapidly and are updated frequently, so an automated solution is needed to aid selectors and to help searchers find scholarly research works published on the Web. Bibliomining, or data mining for libraries, provides a set of tools that can be used to discover patterns in large amounts of raw data, and can provide the patterns needed to create a model for an automated collection development aid (Nicholson and Stanton, in press; Nicholson, 2002).

One of the difficulties in creating this solution is determining the criteria and specifications for the underlying decision-making model. A librarian makes this decision by examining facets of the document and determining from those facets whether the work is a research work. The librarian is able to do this because he or she has seen many examples of research works and of papers that are not research works, and recognizes patterns of facets that appear in research works.

Therefore, to create this model, many samples of Web-based scholarly research papers are collected along with samples of other Web-based material. For each sample, a program in Perl (a pattern-matching computer language) analyzes the page and determines the value for each criterion. Different bibliomining techniques are then applied to the data in order to determine the best set of criteria to discriminate between scholarly research and other works. The best model produced by each technique is tested with a different set of Web pages. The models are then judged using two measures, called accuracy and return, that are based on the traditional evaluation techniques of precision and recall. Finally, the performance of each model is examined with a set of pages that are difficult to classify.
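To make the facet-extraction step concrete, the following is a minimal sketch in Python rather than Perl. The three facets shown (`word_count`, `has_reference_section`, `citation_count`) are hypothetical stand-ins chosen for illustration; the 41 criteria used in the study are not listed in this section, and the function name `extract_facets` is likewise an assumption.

```python
import re


def extract_facets(page_text: str) -> dict:
    """Return a few illustrative facet values for one Web page.

    The facets below are hypothetical examples, not the study's criteria.
    """
    words = re.findall(r"[A-Za-z']+", page_text)
    # A line consisting only of "References" or "Bibliography".
    reference_heading = re.search(
        r"^\s*(references|bibliography)\s*$", page_text, re.I | re.M
    )
    # Parenthetical author-year citations such as "(Evans, 2000)".
    citations = re.findall(r"\([A-Z][^()]*?,\s*(?:19|20)\d{2}\)", page_text)
    return {
        "word_count": len(words),
        "has_reference_section": int(reference_heading is not None),
        "citation_count": len(citations),
    }


sample = "A study of selection (Evans, 2000).\nReferences\nEvans, G. (2000)."
facets = extract_facets(sample)
```

Each page would yield one such row of values, and the collected rows form the data set to which the classification techniques are applied.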

1.1 Problem Statement

Researchers need a digital library consisting of Web-based scholarly works due to the rapidly growing amount of academic research published on the Web. The general search tools overwhelm the researcher with non-scholarly documents, and the subject-specific academic search tools may not meet the needs of those in other disciplines. An automated collection development agent is one way to quickly discover online academic research works.





In order to create a tool for identifying Web-based scholarly research, a decision-making model for selecting scholarly research must first be designed. Therefore, the goal of the present study is to develop a decision-making model that can be used by a Web search tool to automatically select Web pages that contain scholarly research works, regardless of discipline. This tool could then be used as a filter for the pages collected by a traditional Web page spider, which could aid in the collection development task for a scholarly digital library.

1.2 Definitions

1.2.1 Scholarly Research Works

To specify the types of resources that this predictive model will identify, the term “scholarly research works” must be defined. For this study, scholarly research is limited to research written by students or faculty of an academic institution, works produced by a non-profit research institution, or works published in a scholarly peer-reviewed journal. Research, as defined by Dickinson in Science and Scientific Reasoning, is a “systematic investigation towards increasing the sum of knowledge” (1984, pg. 33). This investigation, therefore, may be a literature review, a qualitative or quantitative study, a think piece, or another type of scholarly exploration. A research work is defined as a Web page (a single HTML or text file) that contains the full text of a research report. As the Web page has become the standard unit for indexing and reference by search tools and style manuals, the Web page is used here as the information container.

1.2.2 Accuracy / Precision and Return / Recall

The models are judged using measures named accuracy and return; these are based on the traditional IR measures of precision and recall. Accuracy (precision) and return (recall) are both defined in their classical information retrieval sense, as first defined by Cleverdon (1962). Accuracy is measured by dividing the number of pages that are correctly identified as scholarly research by the total number of pages identified as scholarly research by the model. Return is determined by dividing the number of pages correctly identified as scholarly research by the total number of pages in the test set that are scholarly research. When applied to the Web as a whole, return cannot be easily defined. However, a higher return in the test environment may indicate which tool will be able to discover more scholarly research published on the Web.
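The two definitions can be expressed as a short calculation. The following Python sketch assumes that model output and ground truth are available as parallel lists of booleans, where True means "scholarly research"; the function name `accuracy_and_return` and the choice to report 0.0 when a denominator is zero are illustrative assumptions, not part of the study.

```python
def accuracy_and_return(predicted, actual):
    """Compute accuracy (precision) and return (recall) for one model.

    predicted[i] is True if the model labeled page i as scholarly research;
    actual[i] is True if page i really is scholarly research.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # Pages correctly identified as scholarly research.
    correct = sum(1 for p, a in zip(predicted, actual) if p and a)
    # All pages the model identified as scholarly research.
    identified = sum(1 for p in predicted if p)
    # All pages in the test set that are scholarly research.
    scholarly = sum(1 for a in actual if a)
    accuracy = correct / identified if identified else 0.0
    return_ = correct / scholarly if scholarly else 0.0
    return accuracy, return_


predicted = [True, True, False, False, False, False]
actual = [True, False, True, True, True, False]
acc, ret = accuracy_and_return(predicted, actual)
```

In this toy example the model identifies two pages as scholarly and one of them is correct, so accuracy is 1/2; the test set holds four scholarly pages, so return is 1/4.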

1.2.3 Problematic Pages

Problematic pages are Web pages that might appear to this agent to be scholarly research works (as defined above in 1.2.1), but are not. Categories of problematic pages are author biographies, syllabi, vitae, abstracts, corporate research, research that is in languages other than English, and pages containing only part of a research work. Future researchers will want to incorporate some of these categories into digital library tools, and this level of failure analysis will assist those researchers in adjusting the models presented in this research.


1.3 Research Overview

First, a set of criteria used in academic libraries for print selection is collected from the literature, and a Delphi study is conducted with a panel of librarians to refine the list. The criteria are then translated into terms appropriate for Web documents, and a Perl program is written that collects aspects of a Web page that correspond to the criteria.

This data collection tool is used to gather information on 5,000 pages with scholarly research works and 5,000 pages without these works. This data set is split, with the majority of the pages used to train the models and the rest used to test the models. The training set is used to create different models using logistic regression, memory-based reasoning (through non-parametric n-nearest neighbor discriminant analysis), decision trees, and neural networks.

Another set of data is used to tweak the models and make them less dependent on the training set. Each model is then applied to the testing set. Accuracy and return are determined for each model, and the best models are identified.
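The partitioning described above can be sketched as follows. This section does not state the proportions used, so the 60/20/20 split, the fixed random seed, and the names `split_pages`, `train_frac`, and `tune_frac` are all assumptions made for illustration.

```python
import random


def split_pages(pages, train_frac=0.6, tune_frac=0.2, seed=0):
    """Shuffle pages and split them into training, tuning, and testing sets.

    The fractions are hypothetical; the remainder after training and
    tuning becomes the testing set.
    """
    if train_frac <= 0 or tune_frac < 0 or train_frac + tune_frac >= 1:
        raise ValueError("fractions must leave a non-empty testing set")
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_tune = round(len(shuffled) * tune_frac)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]
    return train, tune, test


# 5,000 scholarly pages (True) and 5,000 other pages (False), as (id, label).
pages = [(i, i < 5000) for i in range(10000)]
train, tune, test = split_pages(pages)
```

Keeping the tuning and testing pages separate from the training pages is what allows the reported accuracy and return to reflect performance on pages the models have not seen.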

1.4 Literature Review

This section explores closely related literature and the placement of this research in the areas of the selection of quality materials, data mining, and similar projects.

1.4.1 Selection of Quality Materials

Should the librarian be a filter for quality? S.D. Neill argues for it in his 1989 piece. He suggests that librarians, along with other information professionals, become information analysts who sift through scientific articles and remove those that are not internally valid. By looking for those pieces that are “poorly executed, deliberately (or accidentally) cooked, fudged, or falsified” (Neill, 1989, pg. 6), information analysts can help in filtering for quality of print information.

Piontek and Garlock also discuss the role of librarians in selecting Web resources. They argue that collection development librarians are ideal in this role because of “their experience in the areas of collection, organization, evaluation, and presentation” (1996, pg. 20). Academic librarians have been accepted as quality filters for decades. Therefore, the literature from library and information science will be examined for appropriate examples from print selection and Internet resource selection of criteria for quality.

1.4.1.1 Selection of Print Materials

The basic tenet in selection of materials for a library is to follow the library’s policy, which in an academic library is based upon supporting the school’s curriculum (Evans, 2000). Because of this, there are not many published sets of generalized selection criteria for academic libraries.

One of the best-known researchers in this area is S. R. Ranganathan. His five laws of librarianship (as cited in Evans, 2000) are a classical base for many library studies. There are two points he makes in this work that may be applicable here. First, if something is already known about an author and the author is writing in the same area, then the same selection decision can be made with some confidence. Second, selection can be made based upon the past selection of works from the same publishing house. The name behind the book may imply quality or a lack thereof, and this can make it easier to make a selection decision.

Library Acquisition Policies and Procedures (Futas, 1995) is a collection of selection policies from across the country. By examining these policies from academic institutions, one can find the following criteria for quality works that might be applicable in the Web environment:

–  –  –

• Reference materials like encyclopedias, handbooks, dictionaries, statistical compendia, standards, style manuals, and bibliographies.

1.4.1.2 Selection of Online and Internet Resources

Before the Internet was a popular medium for information, libraries were faced with electronic database selection. In 1989, a wish list was created for database quality by the Southern California Online Users Group (Basch, 1990). This list had 10 items, some of which were coverage, scope, accuracy, integration, documentation, and value-to-cost ratio.

This same users group discussed quality on the Internet in 1995 (as cited in Hofman and Worsfold, 1999). They noted that Internet resources were different from the databases because those creating the databases were doing so to create a product that would produce direct fiscal gain, while those creating Internet resources, in general, were not looking for this same gain. Because of this fact, they felt that many Internet resource providers did not have the impetus to strive for a higher-quality product.


