Statistical Analysis of Web of Data Usage
Markus Luczak-Rösch and Markus Bischoff
Freie Universität Berlin, Networked Information Systems WG, 14109 Berlin,
WWW home page: http://www.ag-nbi.de
Abstract. The Linked Data initiative gained momentum inside as well
as outside of the research community. Thus, it is already an accepted
research issue to investigate usage mining in the context of the Web of
Data from various perspectives. We are currently working on an approach that applies such usage mining methods and analyses to support ontology and dataset maintenance tasks. This paper presents one part of this work, namely a method to detect errors and weaknesses within ontologies used for Linked Data population, based on statistics and network visualizations. We contribute a detailed description of a log file preprocessing algorithm for Web of Data endpoints, a set of statistical measures that help to visualize different usage aspects, and an exemplary analysis of one of the most prominent Linked Data sets – DBpedia – aimed to show the feasibility and the potential of our approach.
Keywords: linked data, web usage mining, ontology maintenance

1 Introduction

The Linked Data initiative gained momentum inside as well as outside of the research community; the recent open government data initiatives underline this development. It is therefore reasonable to expect that the real-world usage of Linked Data, in the sense of querying and accessing it, will increase. It is already an accepted research issue to investigate usage mining in the context of the Web of Linked Data (or, for short: the Web of Data). We are currently working on an approach that applies such usage mining methods and analyses to support dataset and ontology maintenance. This paper presents one part of this work, namely a method to detect errors and weaknesses within ontologies used for Linked Data population, based on statistical measures and their visualization by means of a network analysis tool.
1.1 Motivation, Terminology and Challenges

It is not always trivial to apply methods from classical Web usage mining to this new discipline, which one could call Web of Data usage mining. A first problem is terminology, which is only established for the Web of documents. To the best of our knowledge, only one W3C effort exists which aimed to define a terminology that characterizes the structure and the content of the Web1. This terminology does not properly cover the entities which are of interest on the Web of Data: resources that represent individual "things" named by URIs (or IRIs, respectively), and collections of RDF statements about such resources served in one place – a dataset – maintained by a Web data publisher. So far, this only calls for an adapted set of terms. But, even though a Linked Data endpoint is not required to offer a SPARQL endpoint, many dataset providers on the Web of Data do so. Hence, resources on the Web of Data are requested both directly via their URIs and by means of SPARQL queries, which raises at least one central problem: the Web server very often observes requests for one single Web resource (the SPARQL endpoint URI), while potentially more than one resource has been accessed as part of the query patterns.
Analyzing server logs is an intuitive way to perform Web usage mining. However, another problem on the Web of Data in its current shape is that the meaning of HTTP status codes2 does not hold at all times. When accessing a URI which does not point to any resource on a Web server, the server responds with a 404 code. The SPARQL protocol3, however, requires servers to respond with the 200 HTTP status code and a serialization of the SPARQL results format that contains no bindings in the case that a SELECT query is performed correctly but yields an empty result set. The HTTP 1.1 status code definitions4 would recommend the 204 status code in this case. At first sight this looks like a misuse of HTTP response codes, but it may also be a desired feature for developers who handle empty result sets in an application-dependent way and detect them when the serialization of the result is processed. During our intensive work with logs from several Web of Data endpoints such as DBpedia5, the Semantic Web Dog Food server6, and LinkedGeoData7, we observed that queries must be re-run to find out whether they returned any result or not.
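Whether a re-run SELECT query actually returned any bindings can only be decided by inspecting the result serialization itself, since the endpoint answers 200 either way. The following is a minimal sketch, assuming the SPARQL results-in-JSON format; the function name and the replay machinery around it are our own, not part of the preprocessing tool described later:

```python
import json

def is_empty_select_result(body: str) -> bool:
    """Return True if a SPARQL JSON result serialization carries no bindings.

    The endpoint answers 200 even for an empty result, so the payload itself
    must be inspected. The JSON layout follows the SPARQL results-in-JSON
    format; the function name and its surroundings are our own sketch.
    """
    result = json.loads(body)
    if "boolean" in result:  # ASK query: a bare true/false answer
        return result["boolean"] is False
    bindings = result.get("results", {}).get("bindings", [])
    return len(bindings) == 0
```

For example, a body of `{"head": {"vars": ["s"]}, "results": {"bindings": []}}` is classified as empty even though it arrived with status code 200.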
Anonymized excerpt of a DBpedia log file showing some of the different types of requests and the HTTP status codes returned.
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /page/Jeroen_Simaeys HTTP/1.1" 200 26777 "" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:00 -0600] "GET /resource/Guano_Apes HTTP/1.1" 303 0 "" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
xxx.xxx.xxx.xxx - - [21/Sep/2009:00:00:01 -0600] "GET /sparql?query=PREFIX+rdfs%3A+%3Chttp%3A%2F%2Fwww.w3.org..." 200 1844 "" ""

1 http://www.w3.org/1999/05/WCA-terms
2 http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
3 http://www.w3.org/TR/rdf-sparql-protocol/
4 http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
5 http://dbpedia.org
6 http://data.semanticweb.org/
7 http://linkedgeodata.org/

The above mentioned problems show that it is an interesting issue to analyze usage on the Web of Data – especially requests against SPARQL endpoints.
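Entries such as the ones in the excerpt can be split into their fields with a few lines of code. This is an illustrative sketch, not part of our tool chain; the regular expression and the field names are our own choice:

```python
import re

# Our own illustration of a parser for the extended common log format; the
# regular expression and field names are assumptions, not part of any tool
# mentioned in the paper.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line: str) -> dict:
    """Split one access-log entry into its named fields."""
    match = LOG_RE.match(line)
    if match is None:
        raise ValueError("not an extended common log format line")
    return match.groupdict()
```

Applied to the second entry of the excerpt, the parser yields the requested path `/resource/Guano_Apes` and the 303 redirect status that is typical for direct resource requests.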
This paper deals with the research question of how usage analysis can support the maintenance of linked datasets. Altogether, we make three central contributions:
First, an innovative log file preprocessing algorithm for Web of Data endpoints.
Second, a set of statistical measures that help to visualize different usage aspects.
Third, a statistical analysis of the usage of the DBpedia dataset with the purpose of identifying problems in the data or the underlying schema. The remainder of this paper is structured as follows: first, we present a survey of related work in the following subsection. Afterwards, Section 2 introduces our preprocessing algorithm for log files of Linked Data endpoints, before Section 3 describes the set of statistics and visualizations we propose for the analysis of the usage data.
Sections 4 and 5 complete this work with an evaluation of our approach by an exemplary study and a discussion of the results, as well as an outlook on future work.
1.2 Related Work
Classical Web usage mining has been placed within the Web mining hierarchy as a child of Web mining and a sibling of Web content mining. Essential parts of Web usage mining are the characteristic metrics and patterns one has to identify, such as hits, page impressions, visits, time and navigation heuristics, unique visitors, clickthrough, viewtime, sessions, path analysis, association rules, sequential patterns, classification rules or clustering [13,14]. In this work we do not apply complex data mining methods to our data, such as sequential pattern mining or clustering, but remain on the statistical level.
We mentioned several differences between the classical Web and the Web of Data with regard to usage mining methods and techniques above. Such a difference is also recognizable when we look at the use of the Web of Data in practice, which has been described in a number of works. Altogether, one can summarize that Linked Data is typically used (1) to provide unambiguous concept identifiers within Web applications, (2) to enhance the experience of Web users by aggregation and integration of corresponding content within CMS systems and Web applications, and (3) to be browsed and mashed up in a user-specific way. It becomes apparent that the classical browsing scenario plays a minor role and is outweighed by the access and use of Web resources through libraries or applications which are not, or only indirectly, connected with a human user's interaction; the SPARQL8 query language plays an important role in these scenarios.
Already in 2002, and again in 2004, Berendt et al. [2,3] identified a new research area – so-called Semantic Web mining. The authors describe how the two disciplines, namely the Semantic Web and Web mining, may converge. They present three perspectives which reflect this: first, how Web mining can help to extract semantics from the Web; second, the exploitation of semantics for Web mining; and third, the perspective of mining the Semantic Web. The latter perspective is the one which matches the focus of our work best. It is subdivided into Semantic Web structure and content mining as well as Semantic Web usage mining. Again, the latter is the most interesting one with reference to our work because it deals with the analysis of the usage of semantic data on the Web. Even though Berendt et al. mention one early approach that could result in log files which contain information about the usage of semantically rich content, it seems that research in that area and in the analysis of such log files has not been very active since then.

8 http://www.w3.org/TR/rdf-sparql-query/
Today this area gains new momentum due to the broader success of the Linked Data ideas. To the best of our knowledge, Möller et al. published the next notable piece of work in this area in 2010. As a motivation for Linked Data usage analysis the authors raise a set of challenges, namely reliability, peak load, performance, usefulness, and attacks. Möller et al. address these challenges by analyzing raw logs in order to learn about user clients, requested content types, and the structure of SPARQL queries. Our work relies on the above mentioned challenges but addresses them under a different scope. We preprocess the logs in order to analyze the usage data on the level of basic graph patterns and the ontology primitives used in them.
Even after a very recent workshop on usage analysis and the Web of Data9 [4,5], this perspective is still unique. Only two papers at the workshop were related to log file analysis and worked upon the USEWOD challenge dataset, which is partially a subset of the data we are working on. Kirchberg et al. present an approach that combines data about real-world events and log files to retrieve a notion of time-windowed relevance of data. Arias et al. introduced an analysis of the syntactical and structural use of SPARQL in real-world scenarios to provide recommendations for index and store designers.
2 Log File Preprocessing

To overcome the above mentioned issues with log files of Web of Data endpoints, we propose an innovative preprocessing method. Our approach runs on server log files following the extended common log format10. These logs contain information about the access to RDF resources via their URIs and via SPARQL queries. The first step of our preprocessing is to clean the log from all entries that contain 40x and 50x response codes. Afterwards, we transform each single request for a resource into a SPARQL DESCRIBE query to retrieve a normalized view of the usage of the dataset on the level of SPARQL queries. For all (1) basic graph patterns and (2) triple patterns of each single query, as well as for the original query itself, we perform auto-generated queries that result in information about the success of individual graph patterns and triple patterns and about the existence of resources and predicates in the dataset. The pseudocode of our algorithm is shown in Listing 1.2 and the resulting usage database in Figure 1.
9 http://data.semanticweb.org/usewod/2011/
10 http://www.w3.org/TR/WD-logfile.html

Fig. 1. Schema of the resulting database of the log file preprocessing
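Since Listing 1.2 gives only pseudocode, the two normalization steps can be illustrated in Python as follows. This is a simplified sketch: the helper names, the hard-coded DBpedia base URI, and the naive pattern splitting are our own assumptions, and a real implementation would use a full SPARQL parser instead of a regular expression:

```python
import re
from urllib.parse import unquote_plus

BASE = "http://dbpedia.org"  # assumed base URI of the endpoint

def normalize_request(path: str) -> str:
    """Normalize a logged request to a SPARQL query.

    Direct resource requests become DESCRIBE queries; requests against the
    SPARQL endpoint are URL-decoded back into the original query string.
    """
    prefix = "/sparql?query="
    if path.startswith(prefix):
        return unquote_plus(path[len(prefix):])
    return f"DESCRIBE <{BASE}{path}>"

def triple_patterns(query: str):
    """Very rough extraction of the triple patterns of a WHERE clause."""
    match = re.search(r"\{(.*)\}", query, re.S)
    if match is None:
        return []  # e.g. a DESCRIBE query without a graph pattern
    return [p.strip() for p in match.group(1).split(" .") if p.strip()]
```

The extracted triple patterns are then re-run as auto-generated queries against the dataset, which yields the per-pattern success information stored in the usage database of Figure 1.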
3 Visualization of Web of Data Usage

The visualization of the collected data is done with an extension of the software "SONIVIS:Tool"11, which enables network generation and analysis. We implemented network visualizations for different perspectives on usage data, e.g. the ontology, requesting hosts, or time perspectives. Each perspective is supported by a set of widgets that represent detailed information about a selected entity of the network.
To visualize the usage data on the basis of a given ontology, a transformation of the preprocessed data is necessary. Hence, a mapping is established between the resources used in queries and the classes which represent the corresponding types in an ontology that was used for data population in the respective dataset. In this section we introduce each of the implemented visualizations, the underlying metrics, and the interpretations of observations which are possible due to the visualizations. Due to limited space we do not present images of each visualization here, but we do so for a representative selection in Section 4.

11 see http://sonivis.org
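The mapping step can be pictured as a simple lookup from resources to their classes. The table below is a hypothetical example (the entry and the fallback to owl:Thing are our own assumptions); in practice it would be filled from the rdf:type triples of the dataset:

```python
# Hypothetical lookup table; in practice it is filled from the rdf:type
# triples of the dataset that the logged queries were run against.
TYPE_OF = {
    "http://dbpedia.org/resource/Guano_Apes": "http://dbpedia.org/ontology/Band",
}

OWL_THING = "http://www.w3.org/2002/07/owl#Thing"

def to_ontology_level(resource_uri: str) -> str:
    """Map a concrete resource to the class it was populated with."""
    return TYPE_OF.get(resource_uri, OWL_THING)
```

With such a mapping, every request for a concrete resource contributes to the usage count of its class, which is what the ontology-level visualizations aggregate.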
3.1 Ontology Heat Map

The ontology heat map provides an overview of the associated ontology primitives12 of resources and predicates being used in queries. This is the global perspective on ontology usage. Its concept of a network visualization with weighted nodes and edges – a so-called heat map – is the basic concept of all further visualizations as well.
Views: The central network view shows how often a specific primitive was used in queries. The more often a certain primitive is used, the bigger the corresponding node in the graph view becomes, and a specific color is applied to it. Zoom levels make it possible to focus on parts of the network which are of special interest. Two widgets contain lists that support (a) the examination of corresponding primitives of the resources that are present in the collected usage data and (b) statistical results for each primitive (count, absolute, relative).
Metrics: The view is based on metrics that sum the number of requests for each primitive that appears in triple patterns. "Count" is the absolute number of occurrences of a primitive as a specific part of triple patterns. "Absolute" is the percentage of triple patterns using a chosen primitive out of all requested triple patterns. "Relative" is the percentage of triple patterns that had no variable in the respective part and used the chosen primitive.
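The three figures can be stated precisely in a few lines of code. This is a hedged sketch in which triple patterns are represented as (subject, predicate, object) string tuples and a leading "?" marks a variable; both conventions are our simplifying assumptions:

```python
def heat_map_metrics(patterns, primitive, part=1):
    """Return (count, absolute, relative) for one ontology primitive.

    patterns  - triple patterns as (subject, predicate, object) string tuples
    part      - position to inspect: 0 = subject, 1 = predicate, 2 = object
    count     - occurrences of the primitive in that position
    absolute  - share of all patterns using the primitive there
    relative  - share among the patterns whose position is not a variable
    """
    count = sum(1 for p in patterns if p[part] == primitive)
    bound = [p for p in patterns if not p[part].startswith("?")]
    absolute = count / len(patterns) if patterns else 0.0
    relative = count / len(bound) if bound else 0.0
    return count, absolute, relative
```

For instance, if rdf:type appears as the predicate of two out of four requested patterns, and only three patterns have a bound predicate at all, "Count" is 2, "Absolute" is 50%, and "Relative" is 2/3.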