FREE ELECTRONIC LIBRARY - Abstract, dissertation, book

Pages:   || 2 | 3 | 4 | 5 |   ...   | 7 |

«An integrated, generic approach to pattern mining: data mining template library Vineet Chaoji · Mohammad Al Hasan · Saeed Salem · Mohammed J. Zaki ...»

-- [ Page 1 ] --

Data Min Knowl Disc

DOI 10.1007/s10618-008-0098-x

An integrated, generic approach to pattern mining:

data mining template library

Vineet Chaoji · Mohammad Al Hasan ·

Saeed Salem · Mohammed J. Zaki

Received: 8 May 2007 / Accepted: 29 April 2008

Springer Science+Business Media LLC 2008


Frequent pattern mining (FPM) is an important data mining paradigm to

extract informative patterns like itemsets, sequences, trees, and graphs. However, no

practical framework for integrating the FPM tasks has been attempted. In this paper, we describe the design and implementation of the Data Mining Template Library (DMTL) for FPM. DMTL utilizes a generic data mining approach, where all aspects of mining are controlled via a set of properties. It uses a novel pattern property hierarchy to define and mine different pattern types. This property hierarchy can be thought of as a systematic characterization of the pattern space, i.e., a meta-pattern specification that allows the analyst to specify new pattern types, by extending this hierarchy. Furthermore, in DMTL all aspects of mining are controlled by a set of different mining properties. For example, the kind of mining approach to use, the kind of data types and formats to mine over, the kind of back-end storage manager to use, are all specified as a list of properties. This provides tremendous flexibility to customize the toolkit for various applications. Flexibility of the toolkit is exemplified by the ease with which support for a new pattern can be added. Experiments on synthetic and public dataset are conducted to demonstrate the scalability provided by the Responsible editor: Hannu Toivonen.

V. Chaoji · M. Al Hasan · S. Salem · M. J. Zaki (B) Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA e-mail: zaki@cs.rpi.edu V. Chaoji e-mail: chaojv@cs.rpi.edu M. Al Hasan e-mail: alhasan@cs.rpi.edu S. Salem e-mail: salems@cs.rpi.edu V. Chaoji et al.

persistent back-end in the library. DMTL been publicly released as open-source software (http://dmtl.sourceforge.net/), and has been downloaded by numerous researchers from all over the world.

Keywords Frequent pattern mining · Itemset mining · Sequence mining · Tree mining · Graph mining · Generic programming 1 Introduction Frequent pattern mining (FPM) is an important data mining paradigm that helps to discover patterns that conceptually represent relations among discrete entities (or items). Depending on the complexity of these relations, different types of patterns arise.

The most common type of patterns are sets, where the relation is the co-occurrence of items. A well-known example of a set pattern is a market-basket, the set of items that are bought together by a customer, say at a supermarket. Next, there are sequence patterns, wherewe require an ordering (temporal or positional) between items. Examples include time-series data in financial markets, genome sequence data in bioinformatics, etc. Data mining researchers also work with tree and graph patterns. In tree patterns the item relationship takes a hierarchical form, and in graph patterns the relationship is mostly arbitrary. Mining web log data, XML, or semi-structured data are examples of tree mining, and mining chemical compounds for drug discovery, or web communities in a web graph, are examples of graph mining. It is thus clear that different applications require the ability to define and mine different types of patterns; some may even require the ability to define custom pattern types (Horváth et al. 2006).

All of these scenarios require efficient and flexible FPM algorithms and supporting data/index structures, which can be reused in a variety of domains.

It is worth noting that there exist open source data mining suites such as Weka (Witten and Frank 1999), which span commonly used data mining methods like association rules, clustering, classification, and regression. For the specific case of itemset mining, there also exist repositories of separate methods, such as Frequent Itemset Mining Implementations (FIMI) (Goethals and Zaki 2003). However, no unified framework for various FPM tasks currently exists.

The Data Mining Template Library (DMTL) is the first such effort in realizing a unified framework for mining various pattern types. DMTL is implemented in C++, which was chosen due to its support for generic programming. These features include template classes and functions, partial specialization, and so on. The most appealing part of C++ template features is that the binding is resolved at compile time, hence no runtime overhead is incurred. In contrast, while an object-oriented approach provides re-usability through inheritance, sometimes it suffers from runtime inefficiency, due to virtual table lookup for dynamic binding. In addition, most data mining libraries do not provide support for persistence. As a result they are restricted by the size of available main memory. For instance, Weka (Witten and Frank 1999) does not provide any mechanism to store intermediate results so that it can scale to much larger datasets.

However, there has been some recent work to address this limitation (Zou et al. 2006).

An integrated, generic approach to pattern mining: data mining template library

The major contributions of our work are as follows:

• DMTL adopts the generic software development approach using C++ templates, inspired by the state-of-the-art generic libraries such as the C++ Standard Template Library (STL) (Musser et al. 2001) and Boost Graph Library (Siek et al. 2002), and hence provides widespread usability without compromising on efficiency.

• DMTL offers algorithms for different pattern mining tasks in a unified platform.

Currently the library has implementations for mining four key patterns—itemsets, sequences, trees, and graphs.

• DMTL offers flexible interfaces for each of the algorithms, including each of its sub-tasks so that it is very simple for end users to use it as a library component in their software development.

• DMTL is extensible; new patterns can be mined with minimal effort from the end user. Users need to define some template parameters to ensure that the library selects the proper mining algorithm to mine that pattern successfully. Some additional specialized code may be required for efficiency reasons.

• In DMTL all aspects of mining are controlled by properties. Pattern-properties and mining-properties are used in DMTL to specify the following aspects of the mining process: (a) pattern to be mined, (b) input data source and format, (c) data structure to be used in the mining algorithm, (d) storage management, and (e) mining algorithm/approach.

• Support for multiple back-ends, which enables the library to scale to much larger datasets. The current implementation includes a file-based back-end that provides transparent persistency for mining out-of-core datasets.

The DMTL is available as open-source software from the world-wide Sourceforge repository (http://dmtl.sourceforge.net/). It has already been downloaded by numerous researchers and practitioners from all over the world.1 Below, we describe the main elements of the design and implementation of DMTL. We explicitly show via an example, how the DMTL can be easily extended to mine custom defined patterns.

For instance, we extend DMTL to mine multiset patterns. Finally, we perform extensive experimental studies to demonstrate the scalability of DMTL for mining various pattern types, and for mining very large, out-of-core datasets.

2 Related work

Frequent structure mining refers to an important class of exploratory mining tasks, namely those dealing with extracting patterns in massive databases representing complex interactions between entities. It encompasses mining techniques like itemsets (Agrawal et al. 1996), sequences (Agrawal and Srikant 1995), trees (Zaki 2002), and graphs (Inokuchi et al. 2003; Kuramochi and Karypis 2004). As one increases the complexity of the structures to be discovered, one extracts more informative patterns.

Here we briefly review the existing methods for FPM.

1 As of Nov 2007, DMTL has been downloaded by around 3,000 researchers from the Sourceforge site (it is averaging about 2,000 hits and 100 downloads a month).

–  –  –

Itemset mining The itemset mining problem is to discover frequently co-occurring sets of items (or attributes). Since its introduction by Agrawal et al. (1993), over the past decade many interesting algorithms have been proposed for mining frequent itemsets (Agrawal et al. 1996; Zaki et al. 1997; Savasere et al. 1995; Brin et al. 1997; Han et al.

2000a; Zaki and Gouda 2003). It continues to be an active area of research. Methods for mining maximal (those that have no frequent superset) and closed patterns (those that have no superset with the same frequency) have appeared in (Pasquier et al. 1999;

Zaki and Hsiao 2002; Zaki and Hsiao 2005; Wang et al. 2003; Burdick et al. 2001a).

Recent advances are described in the comparative study on Frequent Itemset Mining Implementations (FIMI) by Goethals and Zaki (2003).

Sequence mining Sequence mining helps discover frequent sequential patterns across time or positions in a given data set. Within data mining, the problem of mining sequential patterns was introduced by Agrawal and Srikant (1995). Many other approaches have followed since then (Srikant and Agrawal 1996; Mannila et al. 1995; Mannila and Toivonen 1996; Oates et al. 1997; Zaki 2001; Ayres et al. 2002; Pei et al. 2001).

Methods that consider constraints like maximum/minimum gaps, sliding windows, regular expressions, and taxonomies have also been proposed (Srikant and Agrawal 1996; Zaki 2000b; Garofalakis et al. 1999). Methods for mining closed sequences appear in Wang and Han (2004), Balcazar and Casas-Garriga (2005).

Tree mining Several algorithms for tree mining have been proposed recently, starting with the earlier work in Wang and Liu (1998), Asai et al. (2002), and Zaki (2002).

The new methods mine different kinds of tree patterns, such as ordered/unordered embedded trees (Zaki 2005b; Wang et al. 2004; Termier et al. 2002; Termier et al.

2004; Zaki 2005a) or induced trees (Chi et al. 2003; Shasha et al. 2004; Xiao et al.

2003; Nijssen and Kok 2003; Asai et al. 2003; Chi et al. 2004a). Methods for mining closed and maximal tree patterns appear in Chi et al. (2004b).

Graph mining Given a database of graph objects, the goal of graph mining is to find all the commonly occurring sub-graph patterns. Some of the early work in graph mining include Cook and Holder (1994), Yoshida and Motoda (1995), and Dehaspe et al.

(1998). Many recent methods have been proposed which improve the efficiency of mining, these include Inokuchi et al. (2000), Kramer et al. (2001), Kuramochi and Karypis (2001), Yan and Han (2002a), Huan et al. (2003a), and Nijssen and Kok (2004). Closed graph mining methods have also been proposed (Yan and Han 2003).

As reviewed above, there have been many stand-alone algorithms to mine different types of patterns. On closer examination, certain common themes and common algorithmic paradigms permeate all of the existing methods. The goal of the DMTL is to abstract these common elements into generic primitives both in terms of the data structures used and in terms of the algorithms. In addition, DMTL explicitly handles the issue of persistency, i.e., the ability to mine out-of-core datasets. An initial version of DMTL was described in Zaki et al. (2004), however that design was not based on the property-based framework we present here. In a recent paper on DMTL (Hasan et al.

2005), we focused on the software design; neither did we emphasize the data mining aspects, nor did we present any experiments results. This paper describes the novel property-based DMTL framework, it shows how DMTL can be extended to mine new pattern types, and it presents a comprehensive experimental evaluation demonstrating DMTL’s scalability to large datasets.

An integrated, generic approach to pattern mining: data mining template library We would like to point out that the work by the authors in Mannila and Toivonen (1997), is similar in concept to that of ours. In Mannila and Toivonen (1997), the authors provide an abstract theoretical analysis for the problem of traversing the partial order in a levelwise fashion. In the pattern space, they mainly focus on the itemset and sequence patterns and their variations. The authors also discuss bounds on the number of evaluations of the “interestingness criterion” in terms of the size of the border. Our work lends an empirical grounding for their theoretical work. Moreover, our framework supports both levelwise and depthwise traversal techniques. We also explore more complex patterns such as trees and graphs.

3 Preliminaries

The problem of mining frequent patterns can be stated as follows: Let N = {x1, x2,..., xnv } be a set of nv distinct nodes or vertices. A pair of nodes (xi, xj ) is called an edge.

Let L = {l1, l2,..., lnl }, be a set of nl distinct labels. Let Ln : N → L, be a node labeling function that maps a node to its label Ln (xi ) = li, and let Le : N × N → L be an edge labeling function, that maps an edge to its label Le (xi, xj ) = lk.

It is intuitive to represent a pattern P as a graph (PV, PE ), with labeled vertex set PV ⊆ N and labeled edge set PE = {(xi, xj ) | xi, xj ∈ PV }. The number of nodes in a pattern P is called its size. A pattern of size k is called a k-pattern, and the class of frequent k-patterns is referred to as Fk. In some applications P is comprised of undirected edges, i.e., the edges define a symmetric relation: (xi, xj ) ≡ (xj, xi ), while in other applications P is comprised of directed edges, i.e., the edges define an anti-symmetric relation: (xi, xj ) ≡ (xj, xi ). A path in P is a set of distinct nodes {xi0, xi1,..., xin }, such that (xij, xij +1 ) is an edge in PE for all j = 0,..., n − 1. xi0 is the start node and xin is the end node. The number of edges gives the length of the path. If xi and xj are connected by a path of exactly length n we denote it as xi n xj.

Thus the edge (xi, xj ) can also be written as xi 1 xj.

Pages:   || 2 | 3 | 4 | 5 |   ...   | 7 |

Similar works:

«11/19/2015 Syllabus for BSCI222-WB11: Principles of Genetics-Winter 2016 tammatha Jump to Today Course Syllabus BSCI222  Principles of Genetics                                                                                           Winter 2016...»

«Brochure More information from http://www.researchandmarkets.com/reports/2971752/ Global and Chinese Star Anise Extract Industry, 2009-2019 Market Research Report Description: The 'Global and Chinese Star Anise Extract Industry, 2009-2019 Market Research Report' is a professional and in-depth study on the current state of the global Star Anise Extract industry with a focus on the Chinese market.The report provides key statistics on the market status of the Star Anise Extract manufacturers and...»

«Puns The three bears had been having some trouble recently and had ended up in family court. Momma and Poppa bear were splitting up, and baby bear had to decide who he was going to live with. So, the judge wanted to talk to baby bear to see what he thought about living with either of his parents. When he asked baby bear about living with his father, baby bear said No, I can't live with Poppa bear, he beats me terribly. OK, said the judge, then you want to live with your mother, right? No way!...»

«Working Papers on Global Financial Markets No. 36 Die japanischen Lehren für die europäische Krise Gunther Schnabl GLOBAL FINANCIAL MARKETS University Jena Carl-Zeiss-Str. 3 D-07743 Jena University Halle Universitätsplatz 5 D-06099 Halle Tel.: +49 3641 942261 +49 345 5523180 E-Mail: info@gfinm.de www.gfinm.de August 2012 Working Papers on Global Financial Markets No. 36 Die japanischen Lehren für die europäische Krise* Zusammenfassung Japan hat nicht nur 15 Jahre vor Europa einen...»

«E ticket – industry default Effective from June 1st, 2008 May 22nd, 2008 Version 1.0 E Ticket Industry defaultBrussels Airlines Travel Agents procedures – Version 1-23/05/2008 Table of contents Introduction Documents Electronic tickets 2.1 Paper tickets 2.2 Other documents: MPD 2.3 2.3.1 Interim GDS MPD solution VMPD via BSPlink 2.3.2 When and how to request a paper ticket at Brussels Airlines 3.1 Prepayment (PAAPrepaid At Airport) in the Travel Agency via MPD VMPD via BSPlink: 3.1.1 3.1.2...»

«Dr. Robert L. Martin U.S. Food and Drug Administration RECD FE6 3 2006 5100 Paint Branch Parkway, HFS-255 College Park, MD 20740-3835 United States of America Date: January 252006 Re: GRAS notification for Eupoly-EPA and Eupoly-DHA To whom it may concern: Pursuant to regulation 21 CFR 170.36, 62 FR 18938 (April 17, 1997) Puleva Biotech hereby notifies that Eupoly-EPA and Eupoly-DHA (both fish oils) are exempt from the premarket approval requirement of the Federal Food, Drug and Cosmetic Act...»

«Revision Statistics That sizes the cost will download been based sales just. With online places, the year has left to develop generally with a and a prospect in the good pdf logging discouraged. Exactly you set administrate to care strong to not deal the community not surrendered. The if the, redundancies want the good expertise as understanding for his best pdf. Into how the extensive industry Revision Statistics point refers to be to garden who needs up the conjunction of a information, it...»

«2016 ELECTRI International Cross-Border Meeting Cartegena, Columbia March 16-18, 2016 ELECTRI Columbia PostMeeting Tour March 19-25, 2016 Date of Sale Dear Electri members and guests: Time of Sale American Travel offers Cartagena meeting goers the chance to travel Describe your location further into Columbia to discover a country full of magical experiences. by landmark or area Join us after the meeting for visits to the floral center of Medellin, the of town. coffee region of Pereira, and the...»

«Hegemonicagendas,intermesticity andconflictsinthepost-colonialstate  Ademola Araoye* Abstract Drawing from the literature and interpreting the evidence, this article explores the sources, factors and forces that interact to spark and drive conflict in the post-colonial state and its environment. It advances that the structure of the post-colonial state and its immediate environment is characterised by the juxtaposition of transnational groups and proto states interacting with sovereign...»

«INTERNATIONAL CENTRE FOR SETTLEMENT OF INVESTMENT DISPUTES (ICSID): WENA HOTELS LTD V. ARAB REPUBLIC OF EGYPT (ANNULMENT PROCEEDING) [January 28, 2002] +Cite as 41 ILM 933 (2002)+ CERTIFICATE Wena Hotels Limited v. Arab Republic of Egypt (ICSID Case No. ARB/98/4) Annulment Proceeding I hereby certify that the attached document is a true copy of the Decision of the ad hoc Committee on an application for the annulment of the Arbitral Award of December 8, 2000 in the above case. lsI Ko-Yung Tung...»

«21.11.2011 http://at.e-fundresearch.com/article.php?aID=17219&pg=5#pg 21.11.2011, 04:30 Uhr Die besten Gold & Edelmetall Aktienfonds Die Fondsmanager der besten Gold & Edelmetall Aktienfonds haben exklusiv fünf Fragen zu den fundamentalen Faktoren für die Markt-Einschätzung, ihrem generellen Ausblick, den wichtigsten Elementen im Investment-Prozess sowie den Gewichtungen und Performances beantwortet. e-fundresearch.com: Welche fundamentale Faktoren sind für die Markteinschätzung von Gold,...»

«Munich Personal RePEc Archive Determinants of increased real prices of livestock in Balochistan Nadeem Uz Zaman and Sohrab Khan Marri Balochistan University of Information Technology, Engineering and Management Sciences, Quetta, Pakistan 3. August 2011 Online at https://mpra.ub.uni-muenchen.de/32608/ MPRA Paper No. 32608, posted 7. August 2011 18:38 UTC Determinants of Increased Real Prices of Livestock in Balochistan Nadeem Uz Zaman Assistant Professor Balochistan University of Information...»

<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.