Harvesting Relational Tables from Lists on the Web
Hazem Elmeleegy · Jayant Madhavan · Alon Halevy
Abstract A large number of web pages contain data structured in the form of “lists”.
Many such lists can be further split into multi-column tables, which can then be used
in more semantically meaningful tasks. However, harvesting relational tables from such
lists can be a challenging task. The lists are manually generated and hence need not have well deﬁned templates – they have inconsistent delimiters (if any) and often have missing information.
We propose a novel technique for extracting tables from lists. The technique is domain-independent and operates in a fully unsupervised manner. We ﬁrst use multiple sources of information to split individual lines into multiple ﬁelds, and then compare the splits across multiple lines to identify and ﬁx incorrect splits and bad alignments.
In particular, we exploit a corpus of HTML tables, also extracted from the Web, to identify likely ﬁelds and good alignments. For each extracted table, we compute an extraction score that reﬂects our conﬁdence in the table’s quality.
We conducted an extensive experimental study using both real web lists and lists derived from tables on the Web. The experiments demonstrate the ability of our technique to extract tables with high accuracy. In addition, we applied our technique on a large sample of about 100,000 lists crawled from the Web. The analysis of the extracted tables has led us to believe that there are likely to be tens of millions of useful and query-able relational tables extractable from lists on the Web.
Keywords Structured Data on the Web · Table Extraction · List Segmentation · WebTables · ListExtract

H. Elmeleegy
AT&T Labs, Florham Park, NJ, USA

J. Madhavan
Google Inc., Mountain View, CA, USA

A. Halevy
Google Inc., Mountain View, CA, USA

1 Introduction

The World Wide Web is a large, but as yet under-utilized, source of structured data.
Consequently, managing structured data on the web has recently become the focus of many research efforts (e.g. [1, 8, 13, 23, 25, 31]). Solutions have been proposed to find, extract, and integrate structured data. Building web-scale structured-data stores and exposing them offers many advantages, e.g., more sophisticated querying of web data and performing table search to bootstrap other data management tasks. In addition, the analysis of large amounts of structured data on the Web has enabled features such as schema auto-complete and synonymy discovery.
In recent work, Cafarella et al. have shown that HTML tables are a particularly rich source of structured data. Their results indicate that there are more than 150 million HTML tables containing relational data on the web. In this paper we consider a complementary, and equally plentiful, source of relational data – lists on the Web.
The key challenge concerning HTML lists is that there is no clear notion of columns or cells, as is the case with tables. Each line in the list is largely unstructured text.
Delimiters are typically very inconsistent, if at all existent, and hence cannot be relied upon to split each line into the correct ﬁelds. Moreover, information might be missing on some lines and hence not all lines can be split into the same number of ﬁelds.
For example, consider the cartoon listing in Figure 1. Each cartoon has a serial number, the cartoon name, the production company, and the production year. Some cartoons are missing information, such as “6. Gertie The Dinosaur”, where the production year is not speciﬁed. While it might seem that the list is uniformly delimited, a closer examination reveals several inconsistencies. First, a period is used both to separate the serial number from the cartoon name, and to terminate abbreviations (e.g., the production company “Warner Bros.”). Second, while the delimiter between the production company and year is typically a single slash (“/”), when abbreviations are used, the delimiter sequence is a period and a slash (“./”). Third, the slash also appears in the name of one of the cartoons (“Duck Dodgers in the 24 1/2th Century”).
Finally, for cartoon “6. Gertie The Dinosaur”, the slash delimiter is absent (along with the production year). These inconsistencies, while fairly easy for a human observer to detect, can be very confusing for an automated system.
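To make these inconsistencies concrete, consider a naive splitter that treats every period and slash as a delimiter (a hypothetical sketch; the line strings are illustrative stand-ins modeled on the examples above):

```python
import re

def naive_split(line):
    """Split a line on every "." and "/", treating both as delimiters."""
    return [f.strip() for f in re.split(r"[./]", line) if f.strip()]

# A line with a missing year yields only 2 fields ...
print(naive_split("6. Gertie The Dinosaur"))
# ... while a slash inside a title, plus the "./" sequence after an
# abbreviation, fragments this line into 5 fields instead of 4.
print(naive_split("31. Duck Dodgers in the 24 1/2th Century. Warner Bros./1953"))
```

The two lines come out with two and five fields respectively, although the underlying table has four columns, which is exactly why delimiters alone cannot be relied upon.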
The above example also demonstrates that the problem of segmenting lists is different from the traditional information extraction problem of wrapper generation [1, 2, 10, 11, 13, 16, 19, 20, 22, 29–31]. The typical assumption in wrapper generation is that web pages (or parts of web pages) are automatically generated for each record in an underlying table using an HTML template. Thus, the layout of each record can be assumed to be consistent (with diﬀerent data ﬁelds being separated by HTML tags) and hence they can be inferred from multiple examples.
This paper proposes a technique for extracting tables from lists, ListExtract, that addresses the above challenges. Given an input list, ListExtract searches for the best possible table that the list can be segmented into. ListExtract is designed to be completely domain-independent and hence apply to any list found on the Web.
The over-arching idea underlying ListExtract is that finding the best table involves interleaving local decisions within each line in the list and table-oriented decisions across lines of the list. Within the lines, ListExtract uses some typical signals such as the data types, syntax, and delimiters. ListExtract also uses two new sources: (1) a large-scale language model that records word co-occurrence scores, and (2) a large corpus of automatically extracted HTML tables. The language model is used to identify candidate phrases that should not be split within a line, and the table corpus identifies phrases that occur elsewhere in table cells.

Fig. 1 List of the 50 Greatest Cartoons: An example of Web lists that contain structured data that can be extracted into relational tables.
When looking across lines of the list, ListExtract identiﬁes splitting errors by considering the cohesion of values across the column of the resulting table. Here too, the table corpus is helpful because it identiﬁes values that have appeared in the same column in other tables. In addition, when a splitting error is found by ListExtract, it realizes that the error must aﬀect a streak of values occurring to the left or to the right of the value. As we describe, ListExtract operates in several phases that interleave these two types of decisions.
In summary, we make the following contributions:
1. We present a novel technique for extracting tables from lists that is both domain-independent and completely unsupervised. These qualities are essential in making the technique applicable on a web scale.
2. We describe how language models and a corpus of tables can be used to identify segments in lines that are well suited to be cell-values in tables. We also show how the table corpus is instrumental in aligning segments across diﬀerent lines in a list.
3. We present the results of an extensive experimental study based on real web lists, in addition to synthetic lists derived from HTML tables. The experiments demonstrate the eﬀectiveness of our technique, and the impact of its various components. We also show that existing information extraction techniques cannot be applied to our problem eﬀectively.
4. We take a ﬁrst step towards estimating the number of high-quality tables that can be extracted from lists on the Web. From a sample of 100,000 web pages selected at random, we show that ListExtract can extract between 1400 and 9700 tables with more than one column from HTML lists, depending on the required quality threshold. The distribution of the number of columns in the extracted tables can be found in .
We note that to complete relational table extraction, we also need to assign column headers to the columns of output tables. However, we only focus on the task of splitting the list’s lines into the table’s columns. Finding meaningful column headers is an area of future work, and some of the techniques in [9, 28] can be directly applied.
This paper extends a previous conference publication in the following ways: (1) we discuss the implementation challenges of ListExtract (Section 6), (2) we describe application scenarios for ListExtract (Section 8), (3) we provide additional details for the alignment phase (Algorithm 4.3) and the effects of selecting our line-splitting algorithm (Section 7.8). We also updated our discussion of recent related work.

Fig. 2 ListExtract proceeds as a sequence of operations that can be grouped into the independent splitting, the alignment, and the refinement phases.
The rest of the paper is organized as follows. Section 2 presents our problem formulation and an overview of our approach. Sections 3, 4, and 5 describe the three phases of our algorithm – splitting, alignment, and reﬁnement. Section 6 outlines our implementation of ListExtract while Section 7 presents an experimental evaluation.
Section 8 discusses applications of ListExtract, Section 9 discusses related work and Section 10 concludes.
2 Problem Statement and Overview

We begin by stating the problem we address and giving an overview of our solution.
2.1 Terminology and problem statement

Consider a list L of n lines, where the i-th line l_i consists of m_i words w_{i1}, w_{i2}, ..., w_{i m_i}. Our goal is to extract a table T that contains n rows and some number of columns, say k.

We refer to each line in the list (that becomes a row in the table) as a record and each cell value as a field. Thus, the i-th record in T contains the k fields f_{i1}, f_{i2}, ..., f_{ik}. The field f_{ij} consists of m_{ij} successive words w_{i p_{ij}}, w_{i (p_{ij}+1)}, ..., w_{i (p_{ij}+m_{ij}-1)}, where p_{ij} is the position of the first word in f_{ij}. We use the term field candidate to refer to a sequence of words that is being considered as a potential field.
In this work, we only consider records that are formed by a non-overlapping and complete splitting of a line in the list, i.e., each ﬁeld is assumed to be disjoint and all words are assigned to some ﬁeld.
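The space of candidate records for a single line follows directly from this definition: every record corresponds to a choice of split points between words. The sketch below (a hypothetical helper, not the paper's code) enumerates that space:

```python
def segmentations(words):
    """Yield every complete, non-overlapping segmentation of `words`:
    each field is a contiguous run of words, and together the fields
    cover the whole line."""
    if not words:
        yield []
        return
    for cut in range(1, len(words) + 1):
        head = " ".join(words[:cut])              # first field: words[:cut]
        for rest in segmentations(words[cut:]):   # segment the remainder
            yield [head] + rest

for record in segmentations("Gertie The Dinosaur".split()):
    print(record)
```

A line of m words admits 2^(m-1) such segmentations, so the extraction process needs scoring functions to rank candidates rather than relying on enumeration alone.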
Given a list L, our goal is to extract a table T that is the most likely representation of the underlying relational data. It is important to note that there is not necessarily a single right answer to the table extraction problem. Solutions may diﬀer on how many columns they identify and how they deal with irregularities in the data. Ultimately, solution quality is subjective.
2.2 Algorithm overview

Our ListExtract technique executes as a sequence of operations over the input list (see Figure 2). The underlying operations can be grouped into three main phases: an independent splitting phase, an alignment phase, and a final refinement phase. We use two scoring functions to decide where to split individual records. We use a Field Quality Score, FQ(f), to measure the quality of an individual field candidate f, and a Field-to-Field Consistency Score, F2FC(f1, f2), to measure the likelihood of two field candidates f1 and f2 being in the same column. Both scores take into consideration multiple sources of information.
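As a toy illustration of the interfaces of these two scores (the statistics and values below are invented stand-ins for what the language model and table corpus would provide, not the paper's actual formulas):

```python
# Invented stand-ins for corpus statistics; the real system derives these
# from a large corpus of extracted HTML tables and a language model.
CELL_FREQ = {"Warner Bros.": 0.9, "Gertie The Dinosaur": 0.4, "The Dinosaur": 0.05}
SAME_COLUMN = {("Warner Bros.", "Disney"): 0.8, ("Warner Bros.", "1957"): 0.01}

def FQ(field):
    """Field Quality Score: how plausible `field` is as a standalone
    table cell; here, simply its frequency as a cell in the toy corpus."""
    return CELL_FREQ.get(field, 0.0)

def F2FC(f1, f2):
    """Field-to-Field Consistency Score: the likelihood that f1 and f2
    belong in the same column; here, a symmetric co-occurrence lookup."""
    return SAME_COLUMN.get((f1, f2), SAME_COLUMN.get((f2, f1), 0.0))

# "Warner Bros." is a far better field candidate than the fragment
# "The Dinosaur", and it is column-consistent with "Disney" but not "1957".
print(FQ("Warner Bros."), F2FC("Warner Bros.", "Disney"))
```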
Figure 3 shows the intermediate results of applying our technique to the first 17 lines of the cartoons list in Figure 1.
Phase 1 (Splitting): Each line in the input list is split into a multi-field record. The splitting is performed independently for each line, and hence the obtained records do not necessarily have the same number of fields.
As shown in Figure 3(a), after the independent splitting phase, 13 out of the 17 lines are correctly split into records of four fields representing the sequence number, cartoon's name, production company, and production year. Line 6 is also split correctly, though into a record of only three fields, as it is missing the year information. However, lines 4, 15 and 17 (highlighted) were incorrectly split. Interestingly, the cartoon's name in line 17 had two very common substrings (“Popeye the Sailor” and “Sinbad the Sailor”), which led the splitting algorithm to assign high scores to them, and thus treat each of them as a separate field.
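One plausible way to realize the per-line search is dynamic programming over split positions, choosing the segmentation that maximizes the total Field Quality Score. The sketch below is under that assumption and uses an invented toy FQ function; the paper's actual splitting algorithm is described in Section 3.

```python
def best_split(words, FQ):
    """Split `words` into fields maximizing the summed Field Quality Score.

    best[i] holds (score, fields) for the best segmentation of words[:i];
    each step chooses where the last field begins.
    """
    best = [(0.0, [])]
    for i in range(1, len(words) + 1):
        candidates = []
        for j in range(i):                       # last field is words[j:i]
            field = " ".join(words[j:i])
            score, fields = best[j]
            candidates.append((score + FQ(field), fields + [field]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]

def toy_FQ(field):
    # Invented scores: phrases "known" to the corpus score high.
    return 1.0 if field in {"6", "Gertie The Dinosaur"} else 0.1

print(best_split("6 Gertie The Dinosaur".split(), toy_FQ))
# → ['6', 'Gertie The Dinosaur']
```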
Phase 2 (Alignment): An initial candidate table (TI) is constructed by first determining a single likely number of columns for the output table. Records with too many fields are re-merged and re-split to ensure that they fit into the output table.
Records with too few fields are expanded by inserting fields with a null value. The fields are aligned into columns so as to ensure consistency among fields in the same column. Since the splitting decisions in the first phase were made independently for each record, TI can still have some incorrect fields.
In our example, the number of columns in the output table is set at four (the most common number of ﬁelds across all records). Since record 17 has more than four ﬁelds, it is merged and re-split to force it to have no more than four ﬁelds. As shown in Figure 3(b), it re-splits into exactly four ﬁelds. However, the split between the ﬁrst and second ﬁelds was inaccurate, again because the common sub-string “Sinbad the Sailor” is recognized as a separate ﬁeld candidate.
In the initial table TI (Figure 3(c)), note that a null field was correctly placed for the missing year in record 6. Moreover, the fields that were correctly split in records 4 and 15 were also correctly positioned in TI. In particular, the production year of record 4, and the serial number and production year of record 15, were placed in their correct columns.
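A minimal sketch of the alignment step's skeleton, under two simplifying assumptions: the column count is chosen as the most common field count, and null fields are appended on the right (the actual alignment instead positions nulls and aligns fields using the Field-to-Field Consistency Score). The record contents are toy data.

```python
from collections import Counter

def align(records):
    """Build an initial table: pick the most common field count k across
    the records, then pad shorter records with nulls on the right.
    Records with more than k fields are assumed to have been re-merged
    and re-split already, as in the alignment phase described above."""
    k = Counter(len(r) for r in records).most_common(1)[0][0]
    return [r + [None] * (k - len(r)) for r in records if len(r) <= k]

records = [
    ["1", "Title A", "Warner Bros.", "1957"],
    ["2", "Title B", "MGM", "1946"],
    ["3", "Title C", "Disney", "1934"],
    ["6", "Gertie The Dinosaur", "McCay"],   # missing year
]
for row in align(records):
    print(row)
```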