Data Mining Methodologies for Supporting Engineers During System Identification
THESIS NO. 4056 (2008)
PRESENTED ON 15 MAY 2008
TO THE FACULTÉ DE L'ENVIRONNEMENT NATUREL, ARCHITECTURAL ET CONSTRUIT
LABORATOIRE D'INFORMATIQUE ET DE MÉCANIQUE APPLIQUÉES À LA CONSTRUCTION
PROGRAMME DOCTORAL EN INFORMATIQUE, COMMUNICATIONS ET INFORMATION
ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE
FOR THE DEGREE OF DOCTOR OF SCIENCES
BY
Sandro SAITTA
computer science engineer, EPF graduate, Swiss national, originally from Bavois (VD)
accepted on the recommendation of the jury:
Prof. B. Moret, jury president
Prof. I. Smith, Dr B. Raphael, thesis directors
Prof. B. Faltings, examiner
Prof. P. Struss, examiner
Dr E. Viennet, examiner
Switzerland

Acknowledgments
This work was funded by the Swiss National Science Foundation under grant #200020-109257.
My first acknowledgment goes to my co-advisor, Prof. Ian Smith, who guided my work throughout my PhD. He was also a valuable person for explaining to me crucial aspects of thesis work.

Benny Raphael, my other co-advisor and an Assistant Prof. at the National University of Singapore, followed my thesis from the very beginning to the end. Working in an autonomous manner during four years is not always straightforward. When I faced an obstacle, Benny was always there to help me. I will always remember a sentence he wrote to me: “Remember, you can make something good come out of everything if you do it in the right spirit”.

Prakash Kripakaran, a postdoctoral researcher at EPFL, is also the kind of person that you only meet once in your life. I think my work would never have reached the present state without his help. I had a lot of fruitful discussions with him regarding difficult issues in my research. Instead of only giving a simple answer, Prakash always suggested a new way to face a particular problem.

François Fleuret, a researcher at IDIAP, is an expert in machine learning. I had a lot of general discussions with him about data mining and it was really good food for thought. He is also the first person who made me understand what research really is about.

I also thank the examiners for their time and interest in my work: Prof. Boi Faltings (EPFL), Prof. Younès Bennani (Université Paris 13) and Prof. Peter
Abstract
Data alone are worth almost nothing. While data collection is increasing exponentially worldwide, a clear distinction between retrieving data and obtaining knowledge has to be made. Data are retrieved while measuring phenomena or gathering facts. Knowledge refers to data patterns and trends that are useful for decision making. Data interpretation creates a challenge that is particularly present in system identification, where thousands of models may explain a given set of measurements. Manually interpreting such data is not reliable. One solution is to use data mining. This thesis thus proposes an integration of techniques from data mining, a field of research where the aim is to find knowledge from data, into an existing multiple-model system identification methodology.
It is shown that, within a framework for decision support, data mining techniques constitute a valuable tool for engineers performing system identification. For example, clustering techniques group similar models together in order to guide subsequent decisions, since the resulting groups might indicate possible states of a structure. A main issue concerns the number of clusters, which is usually unknown.
A score function is proposed for determining the correct number of clusters in data and for estimating the quality of a clustering algorithm. The score function is a reliable index for estimating the number of clusters in a given data set, thus increasing understanding of results. Furthermore, feature selection techniques provide useful information for engineers who perform system identification. They allow selection of relevant parameters that explain candidate models. The core algorithm is a feature selection strategy based on global search.
In addition to providing information about the candidate model space, data mining is found to be a valuable tool for supporting decisions related to subsequent sensor placement. When integrated into a methodology for iterative sensor placement, clustering provides a rational basis for deciding where to place further sensors on existing structures. Greedy and global search strategies should be selected according to the context. Experiments show that whereas global search is more efficient for initial sensor placement, a greedy strategy is more suitable for iterative sensor placement.
Keywords: data mining, machine learning, correlation, PCA, clustering, K-means, cluster validity, feature selection, PGSL, SVM, system identification, decision support, sensor placement, measurement system design.
λ_i    i-th Lagrange multiplier
C      SVM tuning parameter representing the penalty of misclassifying training examples
P_i    i-th probability
H(X)   Entropy of variable X
“The capacity of digital data storage worldwide has doubled every nine months for at least a decade, at twice the rate predicted by Moore’s Law for the growth of computing power during the same period.” (Fayyad and Uthurusamy, 2002)
This chapter introduces the context of the thesis. It briefly describes related topics such as data mining, system identification and sensor placement. The last section presents research questions as well as the research methodology for achieving the objectives of this thesis.
1.1 Context
Data alone is worth almost nothing. While data is increasing exponentially, people in some fields are “starving” for knowledge, and the gap between data and knowledge may be huge. These days, the meaning of the word data is often confused with knowledge. Knowledge is obtained through the understanding of data. The amazing increase in data worldwide brings several challenges. The more data there is, the more difficult it is to understand. It is sometimes assumed that the increase of knowledge is proportional to the increase of data. The reason for such an assertion might be the lack of appreciation of the difference between obtaining and understanding data.
The increase of data is a challenge that is particularly present in engineering. The number of sensors is increasing while costs are decreasing. In many domains, engineers are saturated with data of many types. Good examples of tasks that require interpreting such data are model-based diagnosis (de Kleer and Williams, 1987) and system identification (Ljung, 1999). Recently, a new methodology (Robert-Nicoud, 2003) has been developed in which system identification is treated as a constraint satisfaction problem (CSP) instead of the more traditional optimization problem. This approach results in a set of several candidate models instead of a single model.
When there are many models, engineers need sophisticated tools to interpret them. Data mining (Tan et al., 2006) may provide help. In this work, data mining techniques are used to identify characteristics of candidate models. Better system identification is possible by integrating data mining into the overall process. No work has been done on mining models.
More specifically, data mining techniques have never been used for identifying characteristics of candidate models that explain observations (Chapter 2 provides more details). The present work is an attempt to fill this knowledge gap by developing an overall methodology for multiple-model system identification that integrates data mining to provide support for engineers.
1.2 Data Mining
Data mining techniques are becoming important in the context of the increasing trend in data worldwide as explained in Section 1.1. There are more and more sensors capturing changes in our environment and our infrastructure. Therefore, a growing challenge involves determining the meaning of data. As written in Piatetsky-Shapiro (2007), “[...] as long as the world keeps producing data of all kinds [...] at an ever increasing rate, the demand for data mining will continue to grow.” Data mining is a field which is concerned with understanding data. In other words, the aim is to look for patterns in data (Pal and Mitra, 2004). As these patterns may be very difficult to find, data mining is sometimes compared to gold mining in rivers (Figure 1.1); gravel represents the enormous amount of data and gold nuggets are the hidden patterns to find.
Although civil engineers were among the first of all traditional engineering disciplines to use the power of computers five decades ago, they are now lagging behind other professions in the use of advanced techniques such as data mining. Indeed, data mining techniques have proven their efficiency in domains such as handwritten digit recognition, image and speech recognition, DNA sequences, financial time series and web mining. Although data mining has been used in engineering, most of this work takes advantage of the predictive abilities of data mining methods. Very little work applies data mining techniques to tasks such as describing the structure of data. To the author's knowledge, there has been no attempt to apply data mining to models in system identification. This work is thus a new application for data mining.
1.3 System Identification
Several years after construction, structures may no longer fulfill their intended functions. As written in Levy and Salvadori (2002), “It is the destiny of the man-made environment to vanish [...] ”. People outside of civil engineering domains have the misconception that civil engineers know exactly how structures behave in service. The complexity of both the structures and the materials involved makes an exact understanding of structural behavior impossible. One way to learn about the state of a structure before it collapses or, as frequently happens, before it reaches a stage where repair costs increase by orders of magnitude, is through diagnosis. When the goal of diagnosis is to determine models that reasonably explain measured responses, the approach is commonly known as system identification. Although system identification is closely related to diagnosis, the focus of this work is on helping engineers identify the system, not diagnose it.
The aim is not to propose a way to repair the system, as is the case in diagnosis, but rather to find the state of the system (even if it is not damaged) in order to improve management of artifacts that are expected to last more than one hundred years.
The goal of system identification is to determine the state of a system and values of system parameters through comparisons of predicted with observed responses. Traditionally, this is treated as an optimization problem in which the best combination of values of model parameters is selected such that differences between model predictions and measurements are minimal.
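The traditional optimization view described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy model `predict` (response proportional to position and inversely proportional to a single stiffness parameter), the made-up measurements, the discrete list of candidate parameter values and the sum of squared differences used as the error measure.

```python
def predict(stiffness, positions, load=10.0):
    """Hypothetical toy model: response at position x is load * x / stiffness."""
    return [load * x / stiffness for x in positions]


def sum_squared_error(predicted, measured):
    """Sum of squared differences between predictions and measurements."""
    return sum((p - m) ** 2 for p, m in zip(predicted, measured))


def identify_by_optimization(candidates, positions, measured):
    """Return the single parameter value whose predictions best match the measurements."""
    return min(
        candidates,
        key=lambda k: sum_squared_error(predict(k, positions), measured),
    )


# Made-up sensor positions and measurements (illustrative only).
positions = [1.0, 2.0, 3.0]
measured = [0.51, 0.99, 1.52]
candidates = [10.0, 15.0, 20.0, 25.0, 30.0]

best_stiffness = identify_by_optimization(candidates, positions, measured)
print(best_stiffness)
```

Note that this formulation returns exactly one model, even when several parameter values would explain the measurements almost equally well.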
Recent work has brought out the different types of errors that can occur in system identification processes (Robert-Nicoud, 2003). These errors make optimization in system identification unreliable since the global optimum may not correspond to the true state of the system due to compensating modeling and measurement errors. In such situations, treating the task as a constraint satisfaction problem (CSP) is more appropriate (see Section 2.5). It is noted that recent work proposes a distributed version of the constraint programming approach (Faltings, 2006).
Since measurements are indirect, the use of models is necessary. Even though a design model may be the most appropriate for designing and analyzing the structure prior to construction, it often cannot be used for system identification. This is usually because design models are conservative. On the other hand, diagnosis models have to be as accurate as possible in order to avoid wrong diagnoses. The current work is a combination of model-based reasoning concepts from computer science (de Kleer and Williams, 1987) and traditional model updating techniques used in engineering (Ljung, 1999). A correct understanding of the output of such techniques is an important challenge.
A difficulty associated with system identification is that, since many model predictions might match observations within certain limits, the best matching model may not be the correct model.
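The idea that several models may match observations within certain limits can be sketched as a simple filter. This is only an illustration under stated assumptions, not the methodology of the thesis: the toy model `predict`, the measurements, the candidate values and the single absolute `tolerance` applied at every measurement location are all hypothetical.

```python
def predict(stiffness, positions, load=10.0):
    """Hypothetical toy model: response at position x is load * x / stiffness."""
    return [load * x / stiffness for x in positions]


def is_candidate(stiffness, positions, measured, tolerance):
    """True if every prediction lies within `tolerance` of its measurement."""
    predicted = predict(stiffness, positions)
    return all(abs(p - m) <= tolerance for p, m in zip(predicted, measured))


def candidate_models(values, positions, measured, tolerance):
    """Keep every parameter value that satisfies the tolerance at all locations."""
    return [k for k in values if is_candidate(k, positions, measured, tolerance)]


# Made-up sensor positions and measurements (illustrative only).
positions = [1.0, 2.0, 3.0]
measured = [0.51, 0.99, 1.52]
values = [18.0, 19.0, 20.0, 21.0, 22.0]

accepted = candidate_models(values, positions, measured, tolerance=0.1)
print(accepted)
```

With these made-up numbers, three parameter values pass the filter, so the measurements alone do not single out one model; the model with the smallest error is only one of several acceptable candidates.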
In this work, the reliability of identification is defined as the probability that the candidate model(s) obtained through system identification correspond to reality. Reliability is poor when many models predict similar responses at measured locations. Factors that affect the reliability of system identification have been studied in previous research (Robert-Nicoud et al., 2004). The present work is an extension of this research and uses data mining techniques for a better estimation of the reliability of identification.
1.4 Sensor Placement
A basic assumption of system identification is that there is a set of sensors measuring an effect. There are thousands of ways to measure physical phenomena in structures and many new technologies are emerging. Although their development has been the result of significant scientific effort, decisions related to the choice of measurement technology, specifications of performance and positioning of measurement locations are often not based on systematic and rational methodologies. While use of engineering experience and judgment may often result in measurement systems that provide useful results, a poorly designed measurement system can waste time and money.
When placing sensors on a structure, the analogy with medical diagnosis is relevant. People usually go to the doctor for a diagnosis of their conditions. They want to know what is wrong. For that, the doctor measures physiological parameters such as temperature and pulse rate. Doctors try to infer causes from what is measured. The way doctors conduct the measurements is iterative.