Statistical Analysis of Web of Data Usage

Markus Luczak-Rösch and Markus Bischoff

Freie Universität Berlin, Networked Information Systems WG, 14109 Berlin,

markus.luczak-roesch@fu-berlin.de, markus.wt@web.de

WWW home page: http://www.ag-nbi.de

Abstract. The Linked Data initiative has gained momentum inside as well as outside of the research community. Thus, investigating usage mining in the context of the Web of Data from various perspectives is already an accepted research issue. We are currently working on an approach that applies such usage mining methods and analyses to support ontology and dataset maintenance tasks. This paper presents one part of this work, namely a method to detect errors or weaknesses within ontologies used for Linked Data population based on statistics and network visualizations. We contribute a detailed description of a log file preprocessing algorithm for Web of Data endpoints, a set of statistical measures that help to visualize different usage aspects, and an exemplary analysis of one of the most prominent Linked Data sets, DBpedia, which aims to show the feasibility and the potential of our approach.

Keywords: linked data, web usage mining, ontology maintenance

1 Introduction

The Linked Data initiative has gained momentum inside as well as outside of the research community. The recent open government data approaches, at least, support that assumption. It is therefore reasonable to expect that the real-world usage of Linked Data, in the sense of querying and accessing it, will increase. Investigating usage mining in the context of the Web of Linked Data (for short: Web of Data) is already an accepted research issue. We are currently working on an approach that applies such usage mining methods and analyses to support dataset and ontology maintenance. This paper presents one part of this work, namely a method to detect errors and weaknesses within ontologies used for Linked Data population based on statistical measures and their visualization by means of a network analysis tool.

1.1 Motivation, Terminology and Challenges

Applying the methods of classical Web usage mining to this new discipline, which one could call Web of Data usage mining, is not trivial in all cases. A first problem is the terminology that people are familiar with from the Web of documents. To the best of our knowledge, only one W3C effort exists that aimed to define a terminology characterizing the structure and the content of the Web¹. This terminology does not properly cover the entities that are of interest on the Web of Data: resources that represent individual “things” named by URIs (or IRIs, respectively), and a collection of RDF statements about such resources served in one place (a dataset) and maintained by a Web data publisher. So far, this only calls for an adapted set of terms. However, even though a Linked Data endpoint is not required to offer a SPARQL endpoint, many dataset providers on the Web of Data do so. Hence, resources on the Web of Data are requested both directly via their URIs and by means of SPARQL queries, which raises at least one central problem: the Web server very often observes requests for only one single Web resource (the SPARQL endpoint URI), while potentially more than one resource has been accessed as part of the query patterns.

Analyzing server logs is an intuitive way to perform Web usage mining. However, another problem on the Web of Data in its current shape is that the usual meaning of HTTP status codes² does not always hold. When a client accesses a URI that does not point to any resource on a Web server, the server responds with the 404 code. The SPARQL protocol³, in contrast, requires servers to respond with the 200 HTTP status code and a serialization of the SPARQL results format that contains no bindings when a SELECT query is performed correctly but yields an empty result set. The HTTP 1.1 status code definitions⁴ would recommend the 204 status code in this case. At first sight this looks like a misuse of HTTP response codes, but it may also be a desired feature for developers who handle empty result sets in an application-dependent way and detect them when the serialization of the result is processed. During our intensive work with logs from several Web of Data endpoints, such as DBpedia⁵, the Semantic Web Dog Food server⁶, and Linked Geo Data⁷, we observed that queries must be re-run to find out whether they returned any result or not.
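The status-code problem described above can be made concrete with a minimal sketch. It assumes the result was serialized in the SPARQL JSON results format (the paper only speaks of "a serialization of the SPARQL results format"), and the function name and sample bodies are hypothetical. The point is that a 200 response alone says nothing; the body has to be inspected for bindings.

```python
import json


def select_result_is_empty(body: str) -> bool:
    """Return True if a SPARQL JSON SELECT result carries no bindings."""
    doc = json.loads(body)
    # A successful SELECT with no matches still has the "results" object,
    # but its "bindings" list is empty.
    return len(doc.get("results", {}).get("bindings", [])) == 0


# Hypothetical response bodies; both would be served with HTTP status 200.
empty_body = '{"head": {"vars": ["s"]}, "results": {"bindings": []}}'
full_body = (
    '{"head": {"vars": ["s"]}, "results": {"bindings": '
    '[{"s": {"type": "uri", '
    '"value": "http://dbpedia.org/resource/Guano_Apes"}}]}}'
)

print(select_result_is_empty(empty_body))
print(select_result_is_empty(full_body))
```

Since server logs usually record only the status code and the response size, this check cannot be run on the log alone, which is why the queries have to be executed again.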

Listing 1.1. Anonymized excerpt of a DBpedia log file showing some of the different types of requests and the returned HTTP status codes.

xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /page/Jeroen_Simaeys HTTP/1.1" 200 26777 "" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /resource/Guano_Apes HTTP/1.1" 303 0 "" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org..." 200 1844 "" ""

¹ http://www.w3.org/1999/05/WCA-terms
² http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
³ http://www.w3.org/TR/rdf-sparql-protocol/
⁴ http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
⁵ http://dbpedia.org
⁶ http://data.semanticweb.org/
⁷ http://linkedgeodata.org/

The above-mentioned problems show that analyzing usage on the Web of Data, especially requests against SPARQL endpoints, is an interesting issue.
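Log entries of the kind shown in Listing 1.1 can be separated into direct resource requests and SPARQL requests mechanically. The following is a minimal sketch, not the authors' implementation: the regular expression, the function name `classify`, and the sample query string are assumptions made for illustration (the query in Listing 1.1 is truncated and is not reconstructed here).

```python
import re
from urllib.parse import parse_qs, urlsplit

# Leading fields of a common/combined log format line.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def classify(line):
    """Return (kind, status, payload) for a log line, or None if unparsable."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    parts = urlsplit(match.group("target"))
    status = int(match.group("status"))
    if parts.path == "/sparql":
        # parse_qs undoes the URL encoding of the query parameter.
        query = parse_qs(parts.query).get("query", [""])[0]
        return ("sparql", status, query)
    return ("resource", status, parts.path)


resource_line = (
    'xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] '
    '"GET /resource/Guano_Apes HTTP/1.1" 303 0 "" "Mozilla/5.0"'
)
# Hypothetical, complete query used only for this example.
sparql_line = (
    'xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600] '
    '"GET /sparql?query=SELECT+%3Fs+WHERE+%7B+%3Fs+%3Fp+%3Fo+%7D HTTP/1.1" '
    '200 1844 "" ""'
)

print(classify(resource_line))
print(classify(sparql_line))
```

Every SPARQL request shares the same path, so the resources that were actually touched only become visible once the decoded query string is analyzed.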

This paper deals with the research question of how usage analysis can support the maintenance of linked datasets. Altogether we make three central contributions: first, an innovative log file preprocessing algorithm for Web of Data endpoints; second, a set of statistical measures that help to visualize different usage aspects; third, a statistical analysis of the usage of the DBpedia dataset with the purpose of identifying problems in the data or the underlying schema.

The remainder of this paper is structured as follows: first, we present a survey of related work in the following subsection. Section 2 then introduces our preprocessing algorithm for log files of Linked Data endpoints, before Section 3 describes the set of statistics and visualizations we propose for the analysis of the usage data. Sections 4 and 5 complete this work with an evaluation of our approach by means of an exemplary study, a discussion of the results, and an outlook on future work.

1.2 Related Work

Classical Web usage mining has been placed within the Web mining hierarchy as a child of Web mining and a sibling to Web content mining [7]. Essential parts of Web usage mining are the characteristic metrics and patterns one has to identify, such as hits, page impressions, visits, time and navigation heuristics, unique visitors, clickthrough, viewtime, sessions, path analysis, association rules, sequential patterns, classification rules or clustering [13,14]. In this work we do not apply complex data mining methods to our data, such as sequential pattern mining or clustering, but remain on the statistical level.

We have already mentioned several differences between the classical Web and the Web of Data with reference to usage mining methods and techniques. Such a difference is also recognizable when we regard the use of the Web of Data in practice, which has been described in works such as [6], [8] and [9]. Altogether, one can summarize that Linked Data is typically used (1) to provide unambiguous concept identifiers within Web applications, (2) to enhance the experience of Web users by aggregating and integrating corresponding content within content management systems and Web applications, and (3) to be browsed and mashed up in a user-specific way. It becomes apparent that the classical browsing scenario plays a minor role. It is outweighed by the access and use of Web resources through libraries or applications that are not, or only indirectly, connected with a human user's interaction, and the SPARQL⁸ query language plays an important role in these scenarios.

Already in 2002 and again in 2004, Berendt et al. [2,3] identified a new research area, the so-called Semantic Web mining. The authors describe how the two disciplines, namely the Semantic Web and Web mining, may converge. They present three perspectives that reflect this: first, how Web mining can help to extract semantics from the Web; second, the exploitation of semantics for Web mining; and third, the mining of the Semantic Web. The last perspective is the one that best matches the focus of our work. It is subdivided into Semantic Web structure and content mining as well as Semantic Web usage mining. Again, the latter is the most interesting one with reference to our work because it deals with the analysis of the usage of semantic data on the Web. Even though Berendt et al. mention one early approach that could result in log files containing information about the usage of semantically rich content [10], it seems that research in that area and on the analysis of such log files has not been very active since then.

⁸ http://www.w3.org/TR/rdf-sparql-query/

Today this area is gaining new momentum due to the broader success of the Linked Data ideas. To the best of our knowledge, Möller et al. [12] published the next notable piece of work in this area in 2010. As a motivation for Linked Data usage analysis, the authors raise a set of challenges, namely reliability, peak load, performance, usefulness, and attacks. Möller et al. address these challenges by analyzing raw logs in order to learn about user clients, requested content types, and the structure of SPARQL queries. Our work relies on the above-mentioned challenges but addresses them from a different angle. We preprocess the logs in order to analyze the usage data on the level of basic graph patterns and the ontology primitives used in them.

Even after a very recent workshop on usage analysis and the Web of Data⁹ [4,5], this perspective is still unique. Only two papers at the workshop were related to log file analysis and worked on the USEWOD challenge dataset, which is partially a subset of the data we are working on. Kirchberg et al. [11] present an approach that combines data about real-world events and log files to derive a notion of time-windowed relevance of data. Arias et al. [1] analyze the syntactical and structural use of SPARQL in real-world scenarios to provide recommendations for index and store designers.

2 Log File Preprocessing

To overcome the above-mentioned issues with log files of Web of Data endpoints, we propose an innovative preprocessing method. Our approach runs on server log files that follow the extended common log format¹⁰. These logs contain information about the access to RDF resources via their URIs and via SPARQL queries. The first step of our preprocessing is to clean the log of all entries that contain 40x and 50x response codes. Afterwards we transform each single request for a resource into a SPARQL DESCRIBE query in order to obtain a normalized view of the usage of the dataset on the level of SPARQL queries. For all (1) basic graph patterns and (2) triple patterns of each single query, as well as for the original query itself, we run auto-generated queries that yield information about the success of individual graph patterns and triple patterns and about the existence of resources and predicates in the dataset. The pseudocode of our algorithm is shown in Listing 1.2 and the resulting usage database in Figure 1.
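The steps described above can be illustrated with a short sketch. It is not the authors' Listing 1.2 but a simplified reading of the prose: the `Entry` type, the `BASE` namespace, all function names, and the use of ASK queries to test individual triple patterns are assumptions, and triple patterns are passed in as ready-made tuples instead of being extracted by a SPARQL parser.

```python
from dataclasses import dataclass

BASE = "http://dbpedia.org"  # assumed namespace of the analyzed dataset


@dataclass
class Entry:
    path: str        # request path taken from the log line
    status: int      # HTTP status code of the response
    query: str = ""  # decoded SPARQL query; empty for plain resource requests


def is_error(status: int) -> bool:
    """Treat all client and server error codes as entries to be cleaned."""
    return 400 <= status < 600


def to_query(entry: Entry) -> str:
    """Normalize an entry to the level of SPARQL queries."""
    if entry.query:
        return entry.query
    return f"DESCRIBE <{BASE}{entry.path}>"


def ask_for(triple_pattern) -> str:
    """Build a query that checks whether one triple pattern matches at all."""
    s, p, o = triple_pattern
    return f"ASK {{ {s} {p} {o} }}"


def preprocess(entries):
    """Drop error entries and return one SPARQL query per remaining entry."""
    return [to_query(e) for e in entries if not is_error(e.status)]


log = [
    Entry("/resource/Guano_Apes", 303),
    Entry("/resource/Missing", 404),
    Entry("/sparql", 200, "SELECT ?s WHERE { ?s ?p ?o }"),
]
print(preprocess(log))
print(ask_for(("?s", "<http://www.w3.org/2000/01/rdf-schema#label>", "?o")))
```

In a full pipeline, the generated queries would be sent to the endpoint and their outcomes stored per query, graph pattern, and triple pattern.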

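⁹ http://data.semanticweb.org/usewod/2011/
¹⁰ http://www.w3.org/TR/WD-logfile.html

Fig. 1. Schema of the resulting database of the log file preprocessing

[Listing 1.2: pseudocode of the log file preprocessing algorithm, not recoverable from the extracted text]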

3 Visualization of Web of Data Usage

The collected data is visualized with an extension of the software “SONIVIS:Tool”¹¹, which enables network generation and analysis. We implemented network visualizations for different perspectives on usage data, e.g. ontology, request host, or time perspectives. Each perspective is supported by a set of widgets that present detailed information about a selected entity of the network.

To visualize the usage data on the basis of a given ontology, a transformation of the preprocessed data is necessary. Hence, a mapping is established between the resources used in queries and the classes that represent the corresponding types in an ontology used for data population in the respective dataset. In this section we introduce each of the implemented visualizations, the underlying metrics, and the interpretations of observations that the visualizations make possible. Due to limited space we do not present images of each visualization here, but we do so for a representative selection in Section 4.

¹¹ see http://sonivis.org
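The mapping from requested resources to ontology classes can be pictured as a simple aggregation. The sketch below is an assumption about how such a transformation might look, not a description of the SONIVIS extension: the function name, the prefixed identifiers, and the rule that a resource with several types contributes to each of them are all illustrative choices.

```python
from collections import Counter


def class_usage(resource_counts, resource_types):
    """Aggregate per-resource request counts onto the classes of each resource."""
    usage = Counter()
    for resource, count in resource_counts.items():
        # Resources without a known type do not contribute to any class.
        for cls in resource_types.get(resource, ()):
            usage[cls] += count
    return usage


# Hypothetical usage counts and type assignments.
resource_counts = {
    "dbpedia:Guano_Apes": 3,
    "dbpedia:Berlin": 2,
    "dbpedia:Unknown": 1,
}
resource_types = {
    "dbpedia:Guano_Apes": ["dbo:Band"],
    "dbpedia:Berlin": ["dbo:City", "dbo:Place"],
}

print(class_usage(resource_counts, resource_types))
```

The resulting per-class totals are the kind of weights a network view could use to size and color its nodes.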

3.1 Ontology Heat Map

The ontology heat map provides an overview of the ontology primitives associated with the resources and predicates used in queries. This is the global perspective on ontology usage. Its concept of a network visualization with weighted nodes and edges, a so-called heat map, is the basic concept of all further visualizations as well.

Views: The central network view shows how often a specific primitive was used in queries. The more often a certain primitive is used, the bigger the corresponding node in the graph view becomes, and a specific color is applied to it. Zoom levels make it possible to focus on parts of the network that are of special interest. Two widgets contain lists that support (a) the examination of the corresponding primitives of the resources that are present in the collected usage data and (b) statistical results for each primitive (count, absolute, relative).

Metrics: The view is based on metrics that sum the number of requests for each primitive that appears in triple patterns. “Count” is the absolute number of occurrences of a primitive as a specific part of triple patterns. “Absolute” is the percentage of triple patterns using a chosen primitive out of all requested triple patterns. “Relative” is the percentage of queries that had no variable in that part of the triple pattern and used the chosen primitive.
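The three metrics can be worked through on a small example. The sketch below rests on one possible reading of the definitions: it counts at the level of triple patterns, matches the primitive directly against the term in the pattern, and takes "Relative" to mean the share among patterns whose chosen part is not a variable. The function name, the `position` parameter (0 = subject, 1 = predicate, 2 = object), and the sample patterns are assumptions.

```python
def primitive_metrics(triple_patterns, primitive, position):
    """Return (count, absolute %, relative %) for one primitive and position."""
    total = len(triple_patterns)
    # Patterns whose chosen part is a concrete term, not a variable.
    bound = [tp for tp in triple_patterns if not tp[position].startswith("?")]
    count = sum(1 for tp in triple_patterns if tp[position] == primitive)
    absolute = 100.0 * count / total if total else 0.0
    relative = 100.0 * count / len(bound) if bound else 0.0
    return count, absolute, relative


# Hypothetical triple patterns extracted from logged queries.
patterns = [
    ("?s", "rdf:type", "dbo:Band"),
    ("?s", "rdfs:label", "?l"),
    ("?s", "?p", "?o"),
    ("?s", "rdf:type", "dbo:City"),
    ("?x", "foaf:name", "?n"),
]

print(primitive_metrics(patterns, "rdf:type", 1))
```

Here `rdf:type` appears in two of five patterns (Absolute 40%), but in two of the four patterns with a concrete predicate (Relative 50%), which shows how the two percentages can diverge.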
