Cognitive Science Online, Vol.1, pp.46–57, 2003 http://cogsci-online.ucsd.edu
Lexical dynamics and conceptual change:
Analyses and implications for information retrieval
Robert Liebscher & Richard K. Belew
Department of Cognitive Science
University of California, San Diego
9500 Gilman Drive
La Jolla, CA 92093-0515
One important aspect of a document’s context is the time at which it was
written. We report here on analyses of formal (dissertation abstracts) and informal (discussion board postings) communications among academics within two separate disciplines. We focus on academic communications because these especially must be understood within the context of what has been said before, together with what is considered relevant and worth saying at the time of publication. All corpora include time-stamp information that allows temporal analysis of changing lexical frequencies across decades. Using techniques borrowed from time series analysis, we find distinct patterns of “rising” and “falling” bigram frequencies in both domains, and argue that this information can be exploited to improve retrieval of relevant documents.
1 Introduction

A common idealization of many information retrieval (IR) tasks is to reduce retrievable documents and queries to simply the sets of lexical keywords they contain. But it is becoming increasingly acknowledged that if IR systems are ever to qualitatively improve their ability to help users seek relevant documents, increased attention must be paid to the context shaping the use of these keywords by both the browsing user and the documents’ original authors. In many situations, one obvious feature of both users’ and authors’ contexts is their respective places in time. As typically addressed, the retrieval task makes almost no assumptions about time. In general, the time frame in which an author commits her thoughts to writing, as well as the time frame within which the querying user operates, is either assumed to be irrelevant to effective retrieval, or tacitly assumed to be roughly contemporaneous with one another.
But the conceptual frameworks within which authors write can play an enormous role in what they choose to say, and how they choose to say it. This is especially true in scientific writings, where acceptance by peer reviewers is essential to success and recognition. By the same token, scientists and students searching historically through the body of work of a science will often be familiar with a different vocabulary as they pursue older documents.

Our understanding of the conceptual development of a science is potentially informed by observation of word frequency statistics combined with metadata capturing the document’s publication time. Words and phrases change in both frequency of use and meaning through time. For example, the token WEB is found in many areas of discourse today, though little more than a decade ago, prior to the advent of the World Wide Web, its use was much more circumscribed.
In this paper, it is assumed that within a given domain, the overall frequency of a term k at time t is proportional to a community’s collective “interest” in k at t. As interest in a topic waxes or wanes, the raw number of documents containing information about that topic rises or falls through time. These arguments, along with observations about the nature of scientific discourse, are used to construct a temporal weighting scheme that places a document in an appropriate historical context.

Consider a “rising” term k, which moves from obscurity to immense popularity over the duration of a corpus. In an academic context, it is not unreasonable to assume that the term was once part of a small group of “seminal papers” that helped to launch a field of inquiry, a technology, a methodology, etc. (To be concrete, imagine a search for the now ubiquitous term DNA. We would surely want Watson & Crick’s one-page paper of 1953 to be deemed relevant!) Under an atemporal paradigm, a query for k will return a temporally random subset of documents in the corpus, leaving our user in the dark with regard to any notion of the conceptual development of her query term. The seminal papers have a chance of being deemed relevant that is proportional only to their length.

But with the temporal weighting scheme introduced in Section 3, when the frequency of this rising term k is initially low (i.e. used in very few documents), its weight will be amplified. At a later point in time, when its use is much more common, its weight will be dampened so as not to over-emphasize the many documents about k that exist at that time.

Alternatively, imagine a “dying” term, where k is omnipresent at one point in time, then falls in frequency until it is no longer used. Under a temporal weighting scheme, its initial use is dampened, and its later use is amplified. One consequence of this would be to emphasize historical documents that are written retrospectively about the term in question.
These provide a good starting place for someone who wishes, as above, to gain an historical perspective on the development of k.
This paper reports attempts to formalize these arguments and improve information retrieval by incorporating the time at which a document was written into the retrieval process.
2.1 Corpora

The analysis will concentrate on abstracts from Doctoral and Masters’ dissertations because these are available across decades in a relatively consistent format. Further, the focus is placed on a particular discipline, artificial intelligence, in order to relate the analysis of changing frequency statistics to the semantics of the evolving science that generates them.
While this is terrifically rich data, the amount of text provided by only dissertations’ titles and abstracts is not great. This is especially unfortunate, because standard time series analysis demands large data volumes. However, as will be demonstrated, much information can be gained from an analysis that employs the most straightforward linear models of trend.
Three corpora were used in these studies. The first, AIT, contains approximately 5,000 abstracts from Ph.D. and Masters theses in artificial intelligence, collected by University Microfilms, Inc. from 1986 to 1997 (Belew, 2000). Each document is labeled with its year of publication.
The second corpus, CommDis, also from University Microfilms, Inc., contains approximately 4,000 abstracts from Ph.D. and Masters theses in language and communicative disorders. The abstracts run from 1980 to 2002, and each is labeled with its year of publication.
The third corpus, AIList Digest (hereafter AIList), is a subscription-based electronic newsletter that contains over 10,000 discussion board postings, conference announcements, and essays on artificial intelligence that were collected and distributed weekly from 1983 to 1988. Each document is labeled with the exact time and date on which it was written. As the intent of this study was to treat AIList as a record of informal academic communication, some documents, such as bibliographies and subscription statistics, needed to be removed.
A very simple type/token ratio filter worked well to preserve relevant articles of discourse while filtering out the unwanted documents, which generally contained few grammatical terms relative to the total number of tokens.
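A filter of this kind can be sketched in a few lines. The function-word inventory, threshold, and `looks_like_discourse` name below are illustrative assumptions, not details taken from the study:

```python
# Sketch of a type/token-ratio style filter for separating discursive
# postings from list-like documents (bibliographies, statistics tables).
# The function-word set and threshold are illustrative assumptions.

FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "to", "and", "is", "that", "it",
    "for", "on", "with", "as", "this", "by", "be", "are", "was",
}

def looks_like_discourse(text, min_function_ratio=0.15):
    """Keep documents whose grammatical (function) words make up a
    reasonable share of all tokens; lists and tables score low."""
    tokens = text.lower().split()
    if not tokens:
        return False
    n_function = sum(1 for t in tokens if t in FUNCTION_WORDS)
    return n_function / len(tokens) >= min_function_ratio

prose = "It is not unreasonable to assume that the term was once part of a small group of papers"
listing = "Smith 1984 AAAI 212-218 Jones 1985 IJCAI 44-49 Brown 1986 ML 7-13"
```

A bibliography line scores near zero on this ratio, while ordinary discussion prose scores well above the threshold.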
2.2 Tracking lexical frequency change
Bigram frequency counts were made for each of the 12 years in AIT, for each of 12 bins of roughly equal size (approximately 110,000 tokens each) in AIList, and for each of the 23 years of CommDis. Statistics were also collected on unigrams, but for the purposes of this work, bigrams provide a much greater level of detail and are considered better descriptors (and therefore more likely to serve as query terms) in domain-specific corpora (Damerau, 1993). Unigrams of particular interest are abbreviations and acronyms, which are discussed in Section 5.
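The binned counting step can be sketched as follows; the `bigram_counts_by_year` helper and its naive whitespace tokenization are simplifying assumptions, not the paper's actual preprocessing:

```python
from collections import Counter, defaultdict

def bigram_counts_by_year(docs):
    """docs: iterable of (year, text) pairs. Returns {year: Counter of
    bigrams}, where a bigram is a pair of adjacent lowercased tokens.
    Tokenization here is deliberately naive (whitespace splitting)."""
    counts = defaultdict(Counter)
    for year, text in docs:
        tokens = text.lower().split()
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[year][(w1, w2)] += 1
    return counts

docs = [
    (1986, "expert systems and expert systems"),
    (1995, "neural networks for machine learning"),
]
counts = bigram_counts_by_year(docs)
```

The per-year Counters then serve directly as the temporal frequency plots analyzed below.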
The task of modeling the frequencies of terms that have consistently increased or decreased through time lends itself well to formal methods in time series analysis (Box et al., 1994).
However, the 12 data points associated with AIT and AIList are too sparse to allow accurate modeling, and even the 23 points associated with CommDis only approach data sufficiency.
For this reason, the analysis here is restricted to simple linear models of trend.
Mean smoothing was performed over the temporal frequency plot of each term that met a minimum frequency requirement, each bin being averaged with its two neighbors. Smoothing is especially necessary for AIList, as an extended thread of conversation or long essay might cause a spike in the frequency of a particular term that does not accurately reflect the level of community interest at that time.
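A minimal sketch of this neighbor-averaging step follows. Edge bins have only one neighbor, so they are averaged with that single neighbor here; the paper does not specify its edge handling, so this is an assumption:

```python
def smooth(series):
    """Average each bin with its two neighbors (3-point mean smoothing).
    Endpoints, which have a single neighbor, are averaged with it."""
    if len(series) < 3:
        return list(series)
    out = [(series[0] + series[1]) / 2]
    for i in range(1, len(series) - 1):
        out.append((series[i - 1] + series[i] + series[i + 1]) / 3)
    out.append((series[-2] + series[-1]) / 2)
    return out

# A single-bin spike (as from one long AIList thread) is flattened:
smoothed = smooth([0, 9, 0, 9, 0])
```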
Each smoothed frequency plot was then fit with a regression line, and the adjusted correlation coefficient (between time and frequency) r and slope of the line b were measured. The slope b of a term’s temporal frequency plot is only a meaningful number when compared to the slopes of other terms. Terms which steadily increased in frequency through time will have positive slopes; those which steadily decreased will have negative slopes. Terms with a slope of 2s rose (or fell) twice as quickly as those with a slope of s.
Figure 1 shows the temporal frequency plot of two examples drawn from each corpus.
Terms that met a minimum threshold for r (0.70) and absolute value of b (corpus-specific)¹ were extracted for further study. These included 49%, 42%, and 28% of the bigrams, respectively, for AIT, AIList, and CommDis, and represent the terms that might benefit from a temporal analysis of frequency.
Table 1 depicts the top ten “rising” and “falling” terms extracted from the AIT corpus, along with their b and r values. While the frequencies of many terms within a domain are temporally invariant or random, there are many others that undergo large frequency changes over time. In the traditional retrieval task, this information is not exploited.

¹ Careful manual examination of the data led to the choices of 5.0, 4.0, and 7.0 for threshold values of b for AIT, AIList, and CommDis, respectively.

Figure 1: Examples of rising and falling terms from the three different corpora, with regression lines. Top to bottom: AIT, AIList, CommDis.
Appendix A contains similar tables for AIList and CommDis. The results from the CommDis dissertations provide strong confirmation that the analysis of conceptual change within the domain of AI transfers well to this new domain, despite large differences in the time frame and rate of dissertation publication across the two corpora. The results of the AIList newsgroup analysis are more mixed. While some of AIList’s changing bigrams do indeed overlap semantically with those found in AIT, others appear to be “noise.” These are most likely a consequence of both the small size of the AIList corpus and difficulties in cleanly parsing the highly variable news postings (e.g., identifying common “signature lines” used by frequent posters). These anomalies may, however, also point to qualitative features of informal language use within such groups that limit the utility of the methods being used.
Table 1: Top ten rising and falling bigrams from AIT (1986-1997). Informal queries of AI practitioners revealed that the terms in these lists matched well with their memory of developments in the field over the period in question.

The simplest form of TF-IDF weighting multiplies the raw term frequency (TF) of a term in a document by the term’s inverse document frequency (IDF) weight:

    w_kd = f_kd × log(N_Doc / D_k)

where f_kd is the frequency with which keyword k occurs in document d, N_Doc is the total number of documents in the corpus, and D_k is the number of documents containing keyword k.
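This weight can be computed directly; the natural logarithm is an assumption here (the base only rescales all weights uniformly):

```python
import math

def tfidf(f_kd, n_doc, d_k):
    """w_kd = f_kd * log(N_Doc / D_k): raw term frequency in the
    document times the term's inverse document frequency weight."""
    return f_kd * math.log(n_doc / d_k)

# A term occurring 3 times in a document, appearing in 10 of 1,000 documents:
w = tfidf(f_kd=3, n_doc=1000, d_k=10)
```

A term that appears in every document gets weight zero, since log(N_Doc / N_Doc) = 0.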
In a temporal model, the corpus is divided into a series of independent sub-corpora, each associated with documents occurring within a particular time slice. IDF weights can then be computed independently for each sub-corpus.
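A sketch of this per-slice computation; the `temporal_idf` helper and its input format are illustrative assumptions:

```python
import math
from collections import defaultdict

def temporal_idf(docs):
    """docs: iterable of (time_slice, set_of_terms) pairs. Computes an
    IDF weight per term *within each time slice*, treating the slices
    as independent sub-corpora."""
    n_docs = defaultdict(int)                     # slice -> document count
    d_k = defaultdict(lambda: defaultdict(int))   # slice -> term -> doc count
    for t, terms in docs:
        n_docs[t] += 1
        for k in set(terms):
            d_k[t][k] += 1
    return {t: {k: math.log(n_docs[t] / d_k[t][k]) for k in d_k[t]}
            for t in n_docs}

docs = [
    (1986, {"expert systems"}),
    (1986, {"expert systems", "neural networks"}),
    (1995, {"neural networks"}),
    (1995, {"neural networks"}),
]
idf = temporal_idf(docs)
```

Note how a rising term is weighted: NEURAL NETWORKS is rare in the 1986 slice (high weight) but ubiquitous in the 1995 slice (weight zero), matching the amplify-then-dampen behavior described in the introduction.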
One way to characterize the change is to contrast the temporally changing weights we propose, as a difference with the traditional IDF weighting:

    Δ_k(t) = log(N_Doc(t) / D_k(t)) − log(N_Doc / D_k)

where N_Doc(t) and D_k(t) are the corresponding counts restricted to the sub-corpus for time slice t.
Two assumptions must be made explicit before proceeding. First, f_kd, the number of times a term k occurs in document d, is only expected to correlate with the document’s length. The average number of times that a document of length l mentions a term k, then, does not vary with respect to time. Second, documents of the same type within a domain do not become longer or shorter, on average, over time.
If these assumptions are true, then TF_k(t), the total frequency of k in time slice t, should be proportional to D_k(t), a rough measure of collective community interest in k at time t. This means that the following should hold:
    log(D_k / D_k(t)) ≈ log(TF_k / TF_k(t))    (6)

Table 2 shows that the number of documents in which a term appears at time t can be approximated by the total term frequency, as measured by the correlation coefficient r averaged over all terms. Note the striking similarity between TF_k(t) and D_k(t). Furthermore, the correlation between total term frequency and within-document frequency is low², supporting our first assumption above. f_kd is a constant over all t, as the burden of “temporal […]

² TF(t) and f(t) are not entirely uncorrelated (r ≠ 0.0) because there are some very infrequent terms in each corpus that have one document written specifically about them. These terms spike in […]
Figure 3: TF_k(t) and D_k(t) for the term ARTIFICIAL INTELLIGENCE over the duration of the AIT corpus. TF_k(t) is frequency per hundred terms for scaling purposes.