FREE ELECTRONIC LIBRARY - Abstract, dissertation, book

Pages:   || 2 | 3 | 4 | 5 |

«Harvesting Relational Tables from Lists on the Web Hazem Elmeleegy · Jayant Madhavan · Alon Halevy Received: date / Accepted: date Abstract A large ...»

-- [ Page 1 ] --

Noname manuscript No.

(will be inserted by the editor)

Harvesting Relational Tables from Lists on the Web

Hazem Elmeleegy · Jayant Madhavan · Alon


Received: date / Accepted: date


A large number of web pages contain data structured in the form of “lists”.

Many such lists can be further split into multi-column tables, which can then be used

in more semantically meaningful tasks. However, harvesting relational tables from such

lists can be a challenging task. The lists are manually generated and hence need not have well defined templates – they have inconsistent delimiters (if any) and often have missing information.

We propose a novel technique for extracting tables from lists. The technique is domain-independent and operates in a fully unsupervised manner. We first use multiple sources of information to split individual lines into multiple fields, and then compare the splits across multiple lines to identify and fix incorrect splits and bad alignments.

In particular, we exploit a corpus of HTML tables, also extracted from the Web, to identify likely fields and good alignments. For each extracted table, we compute an extraction score that reflects our confidence in the table’s quality.

We conducted an extensive experimental study using both real web lists and lists derived from tables on the Web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique on a large sample of about 100,000 lists crawled from the Web. The analysis of the extracted tables have led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the Web.

Keywords Structured Data on the Web · Table Extraction · List Segmentation · WebTables · ListExtract H. Elmeleegy AT&T Labs, Florham Park, NJ, USA E-mail: hazem@research.att.com J. Madhavan Google Inc., Mountain View, CA, USA E-mail: jayant@google.com A. Halevy Google Inc., Mountain View, CA, USA E-mail: halevy@google.com 1 Introduction The World Wide Web is a large, but as yet under-utilized, source of structured data.

Consequently, managing structured data on the web has recently become the focus of many research efforts (e.g. [1, 8, 13, 23, 25, 31]). Solutions have been proposed to find, extract, and integrate structured data. Building web-scale structured-data stores and exposing them offers many advantages, e.g., more sophisticated querying of web data and performing table search to bootstrap other data management tasks. In addition, the analysis of large amounts of structured data on the Web has enabled features such as schema auto-complete and synonymy discovery [8].

In recent work, Cafarella et al. [8] have shown that HTML tables are a particularly rich source of structured data. Their results indicate that there are more than 150 million HTML tables containing relational data on the web. In this paper we consider a complementary, and equally plentiful, source of relational data – lists on the Web.

The key challenge concerning HTML lists is that there is no clear notion of columns or cells, as is the case with tables. Each line in the list is largely unstructured text.

Delimiters are typically very inconsistent, if at all existent, and hence cannot be relied upon to split each line into the correct fields. Moreover, information might be missing on some lines and hence not all lines can be split into the same number of fields.

For example, consider the cartoon listing in Figure 1. Each cartoon has a serial number, the cartoon name, the production company, and the production year. Some cartoons are missing information, such as “6. Gertie The Dinosaur”, where the production year is not specified. While it might seem that the list is uniformly delimited, a closer examination reveals several inconsistencies. First, a period is used both to separate the serial number from the cartoon name, and to terminate abbreviations (e.g., the production company “Warner Bros.”). Second, while the delimiter between the production company and year is typically a single slash (“/”), when abbreviations are used, the delimiter sequence is a period and a slash (“./”). Third, the slash also appears in the name of one of the cartoons (“Duck Dodgers in the 24 1/2th Century”).

Finally, for cartoon “6. Gertie The Dinosaur”, the slash delimiter is absent (along with the production year). These inconsistencies, while fairly easy for a human observer to detect, can be very confusing for an automated system.

The above example also demonstrates that the problem of segmenting lists is different from the traditional information extraction problem of wrapper generation [1, 2, 10, 11, 13, 16, 19, 20, 22, 29–31]. The typical assumption in wrapper generation is that web pages (or parts of web pages) are automatically generated for each record in an underlying table using an HTML template. Thus, the layout of each record can be assumed to be consistent (with different data fields being separated by HTML tags) and hence they can be inferred from multiple examples.

This paper proposes a technique for extracting tables from lists, ListExtract, that addresses the above challenges. Given an input list, ListExtract searches for the best possible table that the list can be segmented into. ListExtract is designed to be completely domain-independent and hence apply to any list found on the Web.

The over-arching idea underlying ListExtract is that finding the best table involves interleaving local decisions within each line in the list and table-oriented decisions across lines of the list. Within the lines, ListExtract uses some typical signals such as the data types, syntax, and delimiters. ListExtract also uses two new sources: (1) a large-scale language model (e.g., like in [6]) that records word co-occurrence scores, and (2) a large corpus of automatically extracted HTML tables [8]. The language model is Fig. 1 List of the 50 Greatest Cartoons: An example of Web lists that contain structured data that can be extracted into relational tables.

used to identify candidate phrases that should not be split within a line, and the table corpus identifies phrases that occur elsewhere in table cells.

When looking across lines of the list, ListExtract identifies splitting errors by considering the cohesion of values across the column of the resulting table. Here too, the table corpus is helpful because it identifies values that have appeared in the same column in other tables. In addition, when a splitting error is found by ListExtract, it realizes that the error must affect a streak of values occurring to the left or to the right of the value. As we describe, ListExtract operates in several phases that interleave these two types of decisions.

In summary, we make the following contributions:

1. We present a novel technique for extracting tables from lists that is both domainindependent and is completely unsupervised. These qualities are essential in making the technique applicable on a web scale.

2. We describe how language models and a corpus of tables can be used to identify segments in lines that are well suited to be cell-values in tables. We also show how the table corpus is instrumental in aligning segments across different lines in a list.

3. We present the results of an extensive experimental study based on real web lists, in addition to synthetic lists derived from HTML tables. The experiments demonstrate the effectiveness of our technique, and the impact of its various components. We also show that existing information extraction techniques cannot be applied to our problem effectively.

4. We take a first step towards estimating the number of high-quality tables that can be extracted from lists on the Web. From a sample of 100,000 web pages selected at random, we show that ListExtract can extract between 1400 and 9700 tables with more than one column from HTML lists, depending on the required quality threshold. The distribution of the number of columns in the extracted tables can be found in [15].

We note that to complete relational table extraction, we also need to assign column headers to the columns of output tables. However, we only focus on the task of splitting the list’s lines into the table’s columns. Finding meaningful column headers is an area of future work, and some of the techniques in [9, 28] can be directly applied.

This paper extends a previous conference publication [15] in the following ways: (1) we discuss the implementation challenges of ListExtract (Section 6), (2) we describe

–  –  –

Fig. 2 ListExtract proceeds as a sequence of operations that can grouped into the independent splitting, the alignment, and the refinement phases.

application scenarios for ListExtract (Section 8), (3) we provide additional details for the alignment phase (Algorithm 4.3) and the effects of selecting our line-splitting algorithm (Section 7.8). We also updated our discussion of recent related work.

The rest of the paper is organized as follows. Section 2 presents our problem formulation and an overview of our approach. Sections 3, 4, and 5 describe the three phases of our algorithm – splitting, alignment, and refinement. Section 6 outlines our implementation of ListExtract while Section 7 presents an experimental evaluation.

Section 8 discusses application sof ListExtract, Section 9 discusses related work and Section 10 concludes.

2 Problem Statement and Overview We begin by stating the problem we address and giving an overview of our solution.

2.1 Terminology and problem statement Consider a list L of n lines, where the ith line li consists of mi words wi1, wi2,..., wimi. Our goal is to extract a table T that contains n rows and some number of columns, say k.

We refer to each line in the list (that becomes a row in the table) as a record and each cell value as a field. Thus, the ith record in T contains the k fields fi1, fi2,..., fik. The field fij consists of mij successive words wipij, wi(pij +1),..., wi(pij +mij −1), where pij is the position of the first word in fij. We use the term field candidate to refer to a sequence of words that is being considered as a potential field.

In this work, we only consider records that are formed by a non-overlapping and complete splitting of a line in the list, i.e., each field is assumed to be disjoint and all words are assigned to some field.

Given a list L, our goal is to extract a table T that is the most likely representation of the underlying relational data. It is important to note that there is not necessarily a single right answer to the table extraction problem. Solutions may differ on how many columns they identify and how they deal with irregularities in the data. Ultimately, solution quality is subjective.

2.2 Algorithm overview Our ListExtract technique executes as a sequence of operations over the input list (see Figure 2). The underlying operations can be grouped into three main phases:

an independent splitting phase, an alignment phase, and a final refinement phase. We use two scoring functions to decide where to split individual records. We use a Field Quality Score, F Q(f ), to measure the quality of an individual field candidate f, and a Field-to-Field Consistency Score, F 2F C(f1, f2 ), to measure the likelihood of two field candidates f1 and f2 being in the same column. Both the scores take into consideration multiple sources of information.

–  –  –

Figure 3 shows the intermediate results of applying our technique on the first 17 rows in the cartoons list in Figure 1.

Phase 1 (Splitting): Each line in the input list is split into a multi-field record. The splitting is performed independently hence the obtained records do not necessarily have the same number of fields.

As shown in Figure 3(a), after the independent splitting phase, 13 out of the 17 lines are correctly split into records of four fields representing the sequence number, cartoon’s name, production company, and production year. Line 6 is also split correctly, though into a record of three fields only, as it is missing the year information. However, lines 4, 15 and 17 (highlighted) were incorrectly split. Interestingly, the cartoon’s name in line 17 had two very common substrings (“Popeye the Sailor” and “Sinbad the Sailor”), which lead the splitting algorithm to assign high scores to them, and thus treat each of them as a separate field.

Phase 2 (Alignment): An initial candidate table (TI ) is constructed by first determining a single number of likely columns in the output table. Records with too many fields are re-merged and re-split to make sure that they fit into the output table.

Records with too few fields are expanded by inserting fields with a null value. The fields are aligned into columns so as to ensure consistency among fields in the same column. Since the splitting decisions in the first phase were made independently for each record, the TI can still have some incorrect fields.

In our example, the number of columns in the output table is set at four (the most common number of fields across all records). Since record 17 has more than four fields, it is merged and re-split to force it to have no more than four fields. As shown in Figure 3(b), it re-splits into exactly four fields. However, the split between the first and second fields was inaccurate, again because the common sub-string “Sinbad the Sailor” is recognized as a separate field candidate.

In the initial table TI (Figure 3(c)), note that a null field was correctly placed for the missing year in record 6. Moreover, the fields, which were correctly split in records 4 and 15, were also correctly positioned in TI. In particular, the production year of record 4, and the serial number and production year for record 15 were placed in their correct columns.

Pages:   || 2 | 3 | 4 | 5 |

Similar works:

«Bench Scale Apparatus Measurement Uncertainty and Uncertainty Effects on Measurement of Fire Characteristics of Material Systems by Lei Zhao A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE In partial fulfillment of the requirements for the Degree of Master of Science In Fire Protection Engineering By May 2005 APPROVED: Professor Nicholas A. Dembsey, Major Advisor Dr. John L. de Ris, Co-Advisor FM Global Research Dr. Robert Bill, Co-Advisor FM Global Research Professor...»

«Pharmaceutical Scheduling Issues: Bringing New Treatments to Market Robert C. Newbold, CEO, ProChain Solutions, Inc. Wendell P. Simpson III, Ph.D., Principal Consultant, ProChain Solutions, Inc. Abstract Over the past fifteen years, the authors have worked with several of the top pharmaceutical companies in the world. In this paper they share the experiences and techniques they have used to help large pharma companies gain dramatic improvements in on-time performance and start-to-finish project...»

«1 Lasers, Spectrographs, and Detectors Fred LaPlant Abstract The introduction of Raman spectroscopy into new fields has been driven largely by advances in the underlying technology. While the spectrometer is still comprised of a light source, a wavelength selector, and a detector, the improvement in functionality of each of these components has had dramatic impacts on areas where Raman was once thought impractical, if not impossible. In addition, esoteric techniques once confined to academic...»

«Friedrich-Alexander-Universität Erlangen-Nürnberg Technische Fakultät, Department Informatik Lora Todorova Master Thesis Business Case: Confused Modeling at Eva's Way Submitted on November 25, 2013 Matriculation Number: 21244008 Supervisor: Prof. Dr. Dirk Riehle, M.B.A. Ann Barcomb, MSc Professur für Open-Source-Software Department Informatik, Technische Fakultät Friedrich-Alexander University Erlangen-Nürnberg Business Case: Confused Modeling at Eva’s Way Eidesstattliche Erklärung Ich...»

«C:\Dokumente und Einstellungen\MeurerNadine\Lokale Einstellungen\Temporary Internet Files\OLK31A\23052007_NM_BSI_TR03116_eCardProjekte_Sicherheitsanforderungen Kryptokatalog für Gw V1_0_20070323.doc Technische Richtlinie BSI TR-03116-1 Kryptographische Vorgaben für Projekte der Bundesregierung Teil 1: Telematikinfrastruktur Version: 3.19 Datum: 03.12.2015 Autoren: Technische Arbeitsgruppe TR-03116-1 Status: Veröffentlichung Fassung: Dezember 2015 BSI Technische Richtlinie 03116-1...»

«Ant Systems Eric D. Taillard Istituto Dalle Molle di Studi sull'Intelligenza Arti ciale IDSIA, Corso Elvezia 36, CH-6900 Lugano, Switzerland Technical Report IDSIA-05-99, February 1999 Abstract This article describes Ant Systems, a meta-heuristic based on an ant foraging metaphor. The presentation of Ant Systems has been somewhat generalized by adding a Queen process in charge of co-ordinating classical Ant processes, so that recent Ant Systems can be naturally included while remaining close...»

«Enhancing the use of anthropometric data Johan Molenbroek (1) and Renate de Bruin (2) Associate Professor Applied Ergonomics & Design Faculty Industrial Design Engineering Delft University of Technology Landbergstraat 15 2628 CE Delft The Netherlands Tel +31-15-2783086 Fax +31-278 7179 J.f.m.molenbroek@io.tudelft.nl Research Manager Applied Ergonomics & Design Faculty Industrial Design Engineering Delft University of Technology Erin, Ergonomics Consultancy, Nijmegen mail@ontwerpergonomie.nl...»

«S tat i S t i c S Pa P e r S e r i e S N O 3 / S e P t e m b e r 2013 QUaLitY meaSUreS iN NON-raNDOm SamPLiNG mFi iNtereSt rate StatiSticS techNicaL exPert GrOUP ON mFi iNtereSt rate StatiSticS In 2013 all ECB publications feature a motif taken from the €5 banknote. NOte: This Statistics Paper should not be reported as representing the views of the European Central Bank (ECB). The views expressed are those of the authors and do not necessarily reflect those of the ECB. Technical Expert Group...»

«Twin Heart Meditation Step-by-step instructions in twin heart meditation by V.C. Vishwanathan The twin heart meditation technique developed by Master Choa Kok Sui, the Philippines based founder of pranic healing, is a powerful form of meditation, for it enhances your physical, mental, emotional and spiritual well-being. If practiced regularly, it brings about a deep inner transformation and expansion of consciousness so that you achieve illumination, self-realization, perfect harmony and...»

«Cort Lippe 1 Real-time Granular Sampling Using the IRCAM Signal Processing Workstation Cort Lippe IRCAM, 31 rue St-Merri, Paris, 75004, France Running Title: Real-time Granular Sampling [This copy of this paper is only a sample. Please do not reproduce. The original was published in the Contemporary Music Reveiw, published by Harwood Academic Publishers, 1994] Cort Lippe 2 Abstract This paper describes research into granular sampling, including time-stretching techniques, and presents a musical...»

«Клуб успешных трейдеров Robot-Forex.biz Мы знаем, как заработать на Форекс! A MARKETPLACE BOOK Trading Chaos Maximize Profits with Proven Technical Techniques SECOND EDITION JUSTINE GREGORY-WILLIAMS and BILL M. WILLIAMS John Wiley & Sons, Inc. КНИГА О РЫНКЕ Торговый Хаос Максимизируйте прибыль, используя доказанные технические приемы ВТОРОЕ ИЗДАНИЕ...»

«1 Yumiko Kumagai Dr. Eijun Senaha Scholar & Scholarship 1 30 August 1999 An Annotated Bibliography: Katherine Mansfield’s Modernist Exploration of Truth of Woman’s Life in Prelude Introduction Katherine Mansfield’s Prelude is her unique story in which she explores female consciousness hidden under an ordinary family life while re-creating her own childhood of New Zealand. Mansfield depicts the inner lives at various stages of woman's life, projecting into them not only her memories of...»

<<  HOME   |    CONTACTS
2016 www.abstract.xlibx.info - Free e-library - Abstract, dissertation, book

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.