Computers and the Humanities 31: 351–365, 1998.
© 1998 Kluwer Academic Publishers. Printed in the Netherlands.

The State of Authorship Attribution Studies: Some Problems and Solutions

JOSEPH RUDMAN
Carnegie Mellon, Pittsburgh, Pennsylvania 15213, U.S.A. (e-mail: email@example.com)
Key words: authorship attribution, statistics, stylistics
Abstract. The statement, “Results of most non-traditional authorship attribution studies are not universally accepted as deﬁnitive,” is explicated. A variety of problems in these studies are listed and discussed: studies governed by expediency; a lack of competent research; ﬂawed statistical techniques; corrupted primary data; lack of expertise in allied ﬁelds; a dilettantish approach; inadequate treatment of errors. Various solutions are suggested: construct a correct and complete experimental design; educate the practitioners; study style in its totality; identify and educate the gatekeepers;
develop a complete theoretical framework; form an association of practitioners.
1. Introduction

Non-traditional authorship attribution studies – those employing the computer, statistics, and stylistics – have had enough time to pass through any “shake-down” phase and enter one marked by solid, scientific, and steadily progressing studies.
But, after over 30 years and 300 publications, they have not.
These studies (experiments) must not only form and force a consensus on methodology among their practitioners but they also must demand an intellectual and a scientiﬁc respect for and belief in their results. This is lacking. There is more wrong with authorship attribution studies than there is right.
In this paper I attempt to:
1. Show that serious problems exist in non-traditional attribution studies;
2. Detail a few of the more common or crucial problems;
3. Highlight some solutions.
But most of all I would like to fuel a concerted effort to look at this ﬁeld in a scientiﬁc way – to treat each study as a unique, hard scientiﬁc experiment with the concomitant controls, rigor, reproducibility, open availability of all data, programs and tests performed, and with a well articulated theoretical framework.
There are many more problems and solutions than those treated below. There also is a real need to list and discuss what is “right” with non-traditional authorship attribution studies. Many practitioners have done credible work and have advanced the field. However, this paper concentrates on the majority of studies – studies that evidence major problems.
Nor can the question of whether all of the building blocks of non-traditional authorship studies are set on a solid foundation or on quicksand be treated in this paper. An in-depth, book-length treatment of every facet of the field is forthcoming.
2. Problems Exist

The Bibliographies of stylistics contain thousands of titles, there is no lack of observed facts; however, the polysemy of concepts, the imprecision of methods, the uncertainty about the very goal of this research hardly make for a prosperous discipline.

Todorov1

The results of most non-traditional authorship attribution studies are not universally accepted as definitive. One major indication that there are problems in any field is when there is no consensus on results, no consensus as to accepted or correct methodology, and no consensus as to accepted or correct techniques. An even stronger indication of problems is disagreement over many of the underlying assumptions – in our case in the “core” fields of statistics and stylistics – assumptions such as the consciousness or unconsciousness of style or the randomness of word selection.
I am not the ﬁrst to point out this lack of consensus. Others, Ledger,2 Brunet,3 and Burrows,4 to name just a few, describe aspects of this debilitating fact. But so far with little effect. It seems that for every paper announcing an authorship attribution method that “works” or a variation of one of these methods, there is a
counter paper that points out real or imagined crucial shortcomings:
• Even as early as 1903, Robert Moritz pointed out major ﬂaws in the 1888 “Sherman principle” of sentence length as an indicator of style and authorship;5
• Mealand called Neumann’s heavy reliance on discriminant analysis “problematic”;6
• Donald McNeil pointed out that scientists strongly disagree as to Zipf’s Law;7
• Christian Delcourt raised objections against some uses of co-occurrence analysis;8
• Portnoy and Peterson pointed out what they considered errors in Radday and Wickmann’s use of the correlation coefﬁcient, chi-squared test, and t-test;9
• Hilton and Holmes showed problems in Morton’s QSUM (cusum) technique;10
• Smith raised many objections against Morton’s early methods;11
• In fact, Morton’s methods have been assailed since 1965, when Ellison said that Morton’s methods were, “... an abuse of both computers and scholarship.” “When put to the same tests... [Morton’s] own writings seemed to bear the stamp of multiple authorship”;12
• There are the lengthy and well documented Merriam versus Smith controversies;13
• Foster’s attribution of “A Funeral Elegy” to Shakespeare is under ﬁre;14
• And there is the current Foster versus Elliott and Valenza brouhaha unfolding on the pages of Computers and the Humanities.15

This widespread disagreement not only threatens to undermine the legitimate studies in the court of public and professional opinion but it also has kept authorship attribution studies out of most United States court proceedings. For example, the judge in the Patty Hearst trial ruled that Dr. Singer’s testimony on stylistic comparisons should not be admitted into evidence.16 Great Britain’s judicial system, which accepts authorship attribution as a legitimate science, is faced with a serious quandary since one of its star expert witnesses in these cases, Morton, had his method seemingly debunked on live television.17

The cause of so much disagreement and misunderstanding does not always lie with the reader. The onus of competency, clarity, and completeness is on the practitioner. The researcher must document and make clear every step of the way.
No smoke and mirrors, no hocus-pocus, no “trust me on this.”

There is also a lack of continuity. Many, if not most, of the attribution studies are done by a “one problem” practitioner with no long-range commitment to the field. This might always be a problem, but understandably so. Once a scholar’s specific attribution study is completed (with or without valid results), why should that scholar continue with other attribution studies in alien fields?
Non-traditional authorship attribution studies bring a unique problem to interdisciplinary studies: who is the authority? who is the experimental spokesman? the group leader? Is it the linguist? the statistician? the computer scientist? the rhetorician? Is it the expert in the field of the questioned work: literature? classics? law? philosophy? religion? economics?
What journal or journals do we turn to for an imprimatur or even a nihil obstat?
A quick scan of my working bibliography shows that non-traditional authorship attribution studies have been published in well over 76 journals representing 11 major ﬁelds – not to mention the 50 or so books, 11 dissertations, and numerous conference proceedings.
Most authorship attribution studies have been governed by expediency, e.g.:
1. The copy text is not the one that should be used but it was available in electronic form and isn’t too bad.18 Neither time constraints nor funding constraints should preclude the correct copy text.
2. This is not how the data should have been treated but the packaged program that I used didn’t do exactly what I wanted.
Never let the computer program dictate the design of the experiment. Practitioners should at least understand enough about programming to know what the computer can and cannot do.
3. The control data aren’t complete but it would have been too complicated to input the complete set.
4. The control data are not from the correct time period (authors, genre) but they were available in machine readable form.
5. I only had one year to do the research and the study, so some corners had to be cut.
It is important that both readers and practitioners realize that there is nothing, nothing in an authorship attribution study that is beyond the responsibility of the practitioner. If you are planning a study and cannot get the correct electronic texts, or you realize that control texts do not exist, do not do the study. If packaged programs cannot do the needed analysis, either write the program, hire it out, or do not do the study.
PROBLEM

There is a lack of competent and complete bibliographical research, and there is little experimental memory. Researchers working in the same subject area of authorship attribution often fail to cite and make use of pertinent previous efforts.
Willard McCarty’s recent posting on Humanist, although in a more general context, points this out:

... scholarship in the field is significantly inhibited, I would argue, by the low degree to which previous work in humanities computing and current work in related fields is known and recognized.19

How many authorship attribution practitioners are aware of William Benjamin Smith, who, under the pen name of Conrad Mascol, published two articles, one in 1887 and the other in 1888, describing his “curve of style”?20 This is the same year – 1887 – that Mendenhall published his “Characteristic Curves of Composition.”21 But Smith is just not mentioned. In 1888, Sherman’s “principle of sentence length as an indicator of style and attribution” was published, but Sherman is very rarely mentioned. Mendenhall is usually cited as if in a vacuum.
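Mendenhall’s “characteristic curve” is, at bottom, a word-length frequency spectrum: tabulate how often words of each length occur and compare the resulting profiles. A minimal sketch of the idea follows; the tokenization, the pooling of long words into a last bin, and the crude distance score are my own illustrative choices, not Mendenhall’s actual procedure.

```python
from collections import Counter


def word_length_spectrum(text: str, max_len: int = 10) -> list[float]:
    """Relative frequency of word lengths 1..max_len (a rough analogue of
    Mendenhall's 'characteristic curve'); lengths above max_len are pooled
    into the last bin."""
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    counts = Counter(min(len(w), max_len) for w in words)
    total = sum(counts.values())
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def spectrum_distance(a: str, b: str) -> float:
    """Sum of absolute differences between two spectra -- a crude
    similarity score, not a significance test."""
    return sum(abs(x - y) for x, y in zip(word_length_spectrum(a),
                                          word_length_spectrum(b)))
```

Such a score only ranks similarity; by itself it says nothing about statistical significance, which is exactly the kind of gap the studies criticized here too often leave unaddressed.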
Kenneth Neumann’s impressive 1990 dissertation, The Authenticity of the Pauline Epistles in the Light of Stylostatistical Analysis, didn’t reference Mascol’s two 1888 articles on the “Curves of Pauline and Pseudo-Pauline Style.”22 Most of us are aware of David Holmes’ “The Analysis of Literary Style – A Review.”23 It is one of the most referenced works on authorship attribution studies. But, how often has Gerald McMenamin’s excellent 1993 book, Forensic Stylistics,24 been referenced?
How many studies and articles written in English reference the untranslated works from the French, the German, the Russian, and other languages?
PROBLEM

Professor G.E.P. Box and Dr. F. Yates expressed reservations about the encouragement of unthinking manipulation of numbers. We share their view that statistical methods should not be applied to numbers but rather to the situations giving rise to the data.

Andrews & Hertzberg25

Many researchers are led into this swampy quagmire of authorship attribution studies by the ignis fatuus of a more sophisticated statistical technique. Too many researchers have a great new technique and go looking for a quick and easy problem – one with available data. Simply using statistics does not give validity to attribution studies. Too many papers place too much emphasis on statistical technique – they try to create an aura of scientific invincibility without scientific rigor.
The earlier examples of non-consensus mentioned in Section 2 are all examples of a disagreement over statistics.
Blind borrowing of statistical techniques from other disciplines must stop:
• The Efron-Thisted tests (expanded from Fisher) are from butterﬂy collecting;
• Simpson’s index is based on the distribution of different species co-existing in a given ecosystem;
• The modal analysis used by Elliott’s group is derived from signal processing;
• Morton’s QSUM is based on industrial process and quality control monitoring.
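To make one of these borrowings concrete, here is Simpson’s index computed over word types rather than species. The formula is standard (Simpson, 1949); the substitution of word types for species is precisely the kind of untested transfer the list above warns about. Tokenization is left to the caller.

```python
from collections import Counter


def simpsons_index(words: list[str]) -> float:
    """Simpson's index over word types: the probability that two tokens
    drawn without replacement are of the same type. Originally a measure
    of species concentration within an ecosystem."""
    counts = Counter(words)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two tokens")
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
```

Whether the ecological interpretation of the index survives the move from co-existing species to an author’s vocabulary is an assumption that each study would need to argue, not assume.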
The Efron-Thisted tests are based on the assumption that things (words) are well mixed in time. The assumption is that you will not capture all the members of one species early on and all of the members of another species later.26 McNeil, in his work on estimating an author’s vocabulary, assumes that vocabulary is fixed and finite and that the author writes by successively drawing words from this collection, independently of previous selections.27 We must be leery of assumptions. We must be able to prove any assumptions.
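Assumptions like these can at least be probed before being trusted. A minimal sketch of one such check follows; the slicing scheme is my own illustration, not a published diagnostic. It counts how many new word types each successive slice of a text introduces; under a “well mixed in time” assumption the counts should fall off smoothly, while a large count in a late slice suggests the vocabulary is clustered rather than well mixed.

```python
def new_types_per_slice(words: list[str], n_slices: int = 4) -> list[int]:
    """Count how many previously unseen word types appear in each
    successive slice of a text."""
    size = -(-len(words) // n_slices)  # ceiling division
    slices = [words[i:i + size] for i in range(0, len(words), size)]
    seen: set[str] = set()
    new_counts = []
    for chunk in slices[:n_slices]:
        # types in this slice that have not appeared in any earlier slice
        new_counts.append(len(set(chunk) - seen))
        seen.update(chunk)
    return new_counts
```

This does not prove the assumption; it can only flag texts for which the assumption is visibly violated, which is still more than most studies report.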
Statistics should not be the tail that wags the dog of attribution studies.
Where is compliance with, or even reference to, the 1978 “Proposed Criteria for Publishing Statistical Results,” that appeared in the Bulletin of the Association for Literary and Linguistic Computing28 or the 1980 “Statement on Statistics,” that was printed in Computers and the Humanities?29 Are they still adequate? Should they be updated?
But, statistics should not become the bugaboo of attribution studies. Statistics is a sine qua non.
PROBLEM

As incorrect and inappropriate as some statistics are, it is the primary data that is at the root of many if not most of the problems in authorship attribution studies. It is a given that the primary data or texts being used in attribution studies should be as close to the original holograph as possible – each stage of removal introduces systematic and other errors that may be fatal.
Many studies fail to comprehend that the concept of “author” changes throughout the ages and plays a signiﬁcant part in setting up each authorship study.
• Oral Tradition – Homer. How long after the initial composition were the Iliad and the Odyssey ﬁrst put in written form? How much of the text is formulaic phrases used as memorization aids?30 How do you account for this in an attribution study?
• Scribal Tradition – The scribe in ancient Hebrew literature not only re-wrote but interpreted.