Behavior Profiling of Email
Salvatore J. Stolfo, Shlomo Hershkop, Ke Wang, Olivier Nimeskern, and
Columbia University, New York, NY 10027, USA
Abstract. This paper describes the forensic and intelligence analysis capabilities of the Email Mining Toolkit (EMT) under development at the Columbia Intrusion Detection (IDS) Lab. EMT provides the means of loading, parsing and analyzing email logs, including content, in a wide range of formats. Many tools and techniques are available from the fields of Information Retrieval (IR) and Natural Language Processing (NLP) for analyzing documents of various sorts, including emails. EMT, however, extends these kinds of analyses with an entirely new set of analyses that model “user behavior”. EMT thus models the behavior of individual user email accounts, or groups of accounts, including the “social cliques” revealed by a user’s email behavior.
1 Introduction
This paper describes the forensic and intelligence analysis capabilities of the Email Mining Toolkit (EMT) under development at the Columbia IDS Lab.
EMT provides the means of loading, parsing and analyzing email logs, including content, in a wide range of formats. Many tools and techniques are available from the fields of IR and NLP for analyzing documents of various sorts, including emails. EMT, however, extends these kinds of analyses with an entirely new set of analyses that model “user behavior”. EMT thus models the behavior of individual user email accounts, or groups of accounts, including the “social cliques” revealed by a user’s email behavior. EMT’s design has been driven by its core security application: detecting virus propagations, spambot activity and security policy violations. However, the technology also provides critical intelligence gathering and forensic analysis capabilities for agencies to analyze disparate Internet data sources for the detection of malicious users, attackers, and other targets of interest. This dual use is graphically displayed in Figure ??. For example, one target application for intelligence gathering supported by EMT is the identification of likely “proxy email accounts”, that is, email accounts that exhibit similar behavior and thus may be used by a single person. Although EMT has been designed specifically for email analysis, the principles of its operation are equally relevant to other Internet audit sources.
This data mining technology, previously reported [?,?,?] and graphically displayed in Figure ??, has been shown to automatically compute both signature-based misuse detection and anomaly detection-based misuse discovery.
The application of this technology to diverse Internet objects and events (e.g., email and web transactions) allows for a broad range of behavior-based analyses, including the detection of proxy email accounts and of groups of user accounts that communicate with one another, including covert group activities.
Data mining applies machine learning and statistical techniques to automatically discover and detect misuse patterns, as well as anomalous activities in general. When applied to network-based activities and user account observations for the detection of errant or misuse behavior, these methods are referred to as behavior-based misuse detection.
Behavior-based misuse detection can provide important new assistance for counter-terrorism intelligence. In addition to standard Internet misuse detection, these techniques will automatically detect certain patterns across user accounts that are indicative of covert, malicious or counter-intelligence activities.
Moreover, behavior-based detection provides workbench functionalities to interactively assist an intelligence agent with targeted investigations and off-line forensic analyses.
Intelligence officers have a myriad of tasks and problems confronting them each day. The sheer volume of source materials requires a means of homing in on those sources of maximal value to their mission. A variety of techniques can be applied, drawing upon the research and technology developed in the field of Information Retrieval. There is, however, an additional source of information available that can be used to aid even the simplest task of rank ordering and sorting documents for inspection: behavior models associated with the documents can be used to identify and group sources in interesting new ways. This is demonstrated by the Email Mining Toolkit, which applies a variety of data mining techniques for profiling and behavior modeling of email sources.
The deployment of behavior-based techniques for intelligence investigation and tracking tasks represents a significant qualitative step in the counter-intelligence “arms race”. Because there is no way to predict what data mining will discover over any given data set, “counter-escalation” is particularly difficult.
Behavior-based misuse detection is more robust than standard knowledge-based techniques. Behavior-based detection has the capability to detect new patterns (i.e., patterns that have not been previously observed), provide early warning alerts to users and analysts, and automatically adapt to both normal and misuse behavior. By applying statistical techniques over actual system and user account behavior measurements, automatically generated models and rules are tuned to the particular source material. This process, in turn, avoids the human bias that is intrinsic when misuse signatures, patterns and other knowledge-based models are designed by hand, as is the norm.
Despite this, no general infrastructure has been developed for the systematic application of behavior-based (misuse) detection across a broad set of detection and intelligence analysis tasks such as fraudulent Internet activities, virus detection, intrusion detection and user account profiling. Today’s Internet security systems are specialized to apply a small range of techniques, usually knowledge-based, to an individual misuse detection problem, such as intrusion, virus or spam detection. Moreover, these systems are designed for one particular network environment, such as medium-sized network enclaves, and only tap into an individual cross-section of network activity such as email activity or TCP/IP activity. Behavior-based detection technology as proposed herein will likely provide a quantum leap in security and in intelligence analysis in both offline and online task environments.
EMT has been described in another publication, focusing on its use for security applications, including virus and spam detection, as well as security policy violations. In this paper, we focus on several of its features specific to intelligence applications, namely the clustering of email by content-based analyses, the identification of “similar email accounts” by measuring similarity between account profiles represented by histograms, and the clique analyses supported by EMT.
1.1 Applying Behavior-Based Detection to Email Sources
Table ?? enumerates a range of behavior-based Internet applications. These applications cover a set of detection, security and marketing applications that exist within the government, commercial and private sectors. Each of these applications is within the capabilities of behavior-based techniques by applying data mining algorithms over appropriate audit data sources.
Our current research has applied behavior-based methods directly to the first six applications listed in Table ??: fraud detection, malicious email detection, intrusion detection, user community discovery, behavior pattern discovery, and analyst workbench. Each of these is an Internet security application, applying to both outbound and inbound network- and email-based traffic.
Solving Internet security problems greatly assists surveillance intelligence activities. For example, the discovery of user account communities and the discovery and detection of certain community behavior patterns can be directed to uncover certain classes of covert, clandestine or espionage behavior performed with Internet resources. Furthermore, fraud detection in particular has direct benefit for an intelligence agency by profiling and identifying users and clusters of users that participate in malicious Internet activities such as fraud.
Behavior-based detection has been proven in analogous security applications. The finance, telecom and energy industries have protected their customers from fraudulent misuse of their services (e.g., fraudulent use of credit card accounts and telephone calling cards, theft of utility service, etc.) by modeling each individual customer account and detecting deviations from that model. The behavior-based protection paradigm applied to the Internet thus has an historical precedent that is now ubiquitous and transparent, as exemplified by the credit card in the reader’s wallet or purse.
1.2 EMT as an Analyst Workbench for Interactive Intelligence Investigations
The “Malicious Email Tracking” (MET) system [?] is an online system that uses email flow statistics to capture new viruses, which are largely undetectable by the “signature” detection methods of today’s state-of-the-art commercial virus detection systems. Specifically, all email attachments are tracked by tracing a private hash value, and temporal statistics such as replication rate are recorded to trace the attachments’ trajectory, e.g., across LANs; these statistics directly inform the detection of self-replicating, malicious software attachments. MET has been developed and deployed as an extension to mail servers and is fully described elsewhere. MET is an example of an online “behavior-based” security system that defends and protects a system not solely by attempting to identify known attacks against it, but rather by detecting deviations from the system’s normal behavior. Many approaches to “anomaly detection” have been proposed, including research systems that aim to detect masqueraders by modeling user behaviors in command line sequences, or even keystrokes. MET, however, is architected to protect user accounts by modeling user email flows to detect malicious email attachments, especially polymorphic viruses that are not detectable or traceable via signature-based detection methods.
The “Email Mining Toolkit” (EMT), on the other hand, is an offline system applied to email files gathered from server logs or client email programs. EMT computes information about email flows from and to email accounts, aggregates statistical information from groups of accounts, and analyzes content fields of emails. The EMT system provides temporal statistical feature computations and behavior-based modeling techniques through an interactive user interface, enabling targeted intelligence investigations and semi-manual forensic analysis of email files.
Figure ?? illustrates the general architecture of a behavior-based system deploying dual functionality:
1. An online security detection application (in this case, MET for malicious email detection)
2. A general analyst workbench for intelligence investigations (EMT, for email source analysis)
As this figure illustrates, these functionalities share a great deal of overhead.
With regard to the implementation, the audit module, computation of temporal statistics, user modeler and database of user models each serve both functionalities. Moreover, with regard to the conceptual design, the particular set of temporal statistics and user model processes designed for one can improve the performance of the other. In particular, temporal features, as well as user account models and clusters, are representatively general “fundamental building blocks.”
EMT provides the following functionalities, interactively:
– Querying a database (warehouse) of email data and computed feature values, including:
  • Ordering and sorting emails on the basis of content analysis (n-gram analysis, keyword spotting, and classification of email supported by an integrated supervised learning feature using a Naive Bayes classifier trained on user-selected features).
  • Historical features that profile user groups by statistically measuring behavior characteristics.
  • User models that group users according to features such as typical emailing patterns (as represented by histograms over different selectable statistics) and email communities (including the “social cliques” revealed in email exchanges between email accounts).
– Applying statistical models to email data to alert on abnormal or unusual email events.
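As an illustration of the histogram-based user models listed above, the following minimal Python sketch builds a normalized hour-of-day histogram of an account's outgoing email and compares accounts with an L1 distance. The choice of statistic (hour of day), the L1 distance, the sample data, and the names hourly_profile, l1_distance and at_hours are assumptions made for illustration only; they are not taken from EMT's implementation.

```python
from collections import Counter
from datetime import datetime


def hourly_profile(timestamps):
    """Normalized 24-bin histogram of the hours at which emails were sent."""
    counts = Counter(ts.hour for ts in timestamps)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * 24  # account with no observed email
    return [counts[h] / total for h in range(24)]


def l1_distance(p, q):
    """Sum of absolute bin differences: 0.0 (identical) up to 2.0 (disjoint)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def at_hours(hours):
    """Hypothetical helper: build send times on one day from a list of hours."""
    return [datetime(2003, 1, 6, h) for h in hours]


# Illustrative (made-up) send times for three accounts.
profile_a = hourly_profile(at_hours([9, 9, 10, 17]))
profile_b = hourly_profile(at_hours([9, 10, 10, 17]))
profile_c = hourly_profile(at_hours([2, 3, 3, 4]))

print(l1_distance(profile_a, profile_b))  # 0.5: similar daily rhythm
print(l1_distance(profile_a, profile_c))  # 2.0: no overlapping hours
```

Under these assumptions, accounts whose profiles lie close together would be candidates for the "similar account" grouping described above; any other histogram statistic or distance function could be substituted.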
Table 1. Behavior-Based Internet Applications for Security and Beyond
EMT is also designed as a plug-in to a data mining platform, originally designed and implemented at Columbia, called the DW/AMG architecture (Data Warehouse/Adaptive Model Generation system). That work has been transferred to System Detection Inc (SysD http://www.sysd.com), a DARPA spinout from Columbia that has commercialized the system as the Hawkeye Security Platform.
2 EMT Features
The full range of EMT features has been described elsewhere. For the present paper, we provide a brief overview of several key features of direct relevance to security analysis and intelligence applications, along with descriptive screenshots of EMT in operation.
Fig. 1. User account profiling, dual use: online detection and offline analysis.
2.1 Attachment models
MET was initially conceived to statistically model, in real time, the behavior of email attachments flowing through an enclave’s email server, and to support the coordinated sharing of information among a wide area of email servers to identify malicious attachments and halt their propagation before saturation. In order to properly share such information, each attachment must be uniquely identified, which is accomplished through the computation of an MD5 hash of the entire attachment.
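The identification step can be sketched in a few lines of Python using the standard hashlib module. This is a minimal illustration, not MET's code: the function name attachment_id, the sample payloads, and the use of a Counter to tally sightings are all assumptions introduced here.

```python
import hashlib
from collections import Counter


def attachment_id(payload: bytes) -> str:
    """Hex MD5 digest of the entire attachment, used as its identifier."""
    return hashlib.md5(payload).hexdigest()


# Hypothetical attachments observed in three emails; two are byte-identical.
observed = [b"invoice contents", b"invoice contents", b"holiday photo"]

# Identical attachments map to the same key, so sightings can be tallied
# per attachment regardless of file name or the email that carried it.
sightings = Counter(attachment_id(p) for p in observed)
for digest, count in sightings.items():
    print(digest, count)
```

Because the digest is computed over the full payload, only byte-identical copies share an identifier; an attachment that differs by even one byte receives a different one.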
EMT runs an analysis on each attachment in the database to calculate a number of metrics. These include birth rate, lifespan, incident rate, prevalence, threat, spread, and death rate. They are explained fully in 1, and are displayed graphically in Figure 3.