Computer Science & Engineering: An International Journal (CSEIJ), Vol. 4, No. 1, February 2014
BREAST CANCER DIAGNOSIS VIA DATA MINING:
PERFORMANCE ANALYSIS OF SEVEN DIFFERENT ALGORITHMS
Zehra Karapinar Senturk 1 and Resul Kara1
Department of Computer Engineering, Duzce University, Duzce, Turkey
According to the World Health Organization (WHO), breast cancer is the most common cancer in women in both the developed and the developing world. Increased life expectancy, urbanization, and the adoption of Western lifestyles are increasing the incidence of breast cancer in the developing world. Most cases are diagnosed in the late stages of the illness, so early detection is crucial for improving breast cancer outcomes and survival.
In this study, we aim to contribute to the early diagnosis of breast cancer. An analysis of breast cancer diagnoses is presented. First, data on patients whose cancers have already been diagnosed are gathered and arranged; then, on the basis of those data, we try to predict whether other patients have breast cancer. The predictions are made with seven different algorithms, and the accuracy of each is reported. The patient data were taken from the UCI Machine Learning Repository, thanks to Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. During the prediction process, the RapidMiner 5.0 data mining tool is used to apply the desired algorithms.
KEYWORDS
Data mining, breast cancer diagnoses, RapidMiner
1. INTRODUCTION
The amount of data we hold has grown greatly over time, and controlling and managing these rapidly growing data has become correspondingly harder. When paper records were no longer sufficient to store data, and finding a particular item became difficult, the need arose for easily manageable and relatively large systems. With the proliferation of computers, rapidly growing data began to be kept on computer hard disks. Although hard disks alone seemed to be a solution at first glance, difficulties with operations such as accessing data that occupy large amounts of storage and modifying individual items led to the idea of database management systems. These systems make it easy to operate on stored data: operations that would normally take a long time are completed quickly and with a minimal error rate. However, current database management systems become insufficient when more information must be extracted from the data. The need to gather more and more information from data is felt in almost all areas of life, and methods have been devised to satisfy these needs. Realistic predictions about the future are made by analyzing the data at hand with these methods. The process of obtaining information from data is called “data mining”.
Data mining can be defined as analyzing data from different perspectives and summarizing it to obtain useful information. This information may be used for purposes such as increasing income or decreasing costs. Technically, data mining is the process of finding relationships or patterns among dozens of fields in very large relational databases.
The purpose of this study is to perform an analysis that can be used for diagnosing breast cancer with data mining. In this way, the delay in diagnosis, which is a vital problem in cancer, would be eliminated, and the time gained could be used for treating the illness.
In the literature, there are many studies on cancer detection and/or data mining. One study used data mining for the diagnosis of ovarian cancer. For the analysis, serum proteomics that distinguish ovarian cancer cases from non-cancer cases were used. An SVM (Support Vector Machine) based method was applied, and statistical testing and GA (Genetic Algorithm) based methods were used for feature selection. Another study proposed a new 3-D microwave approach based on an SVM classifier whose output is transformed into an a posteriori probability of tumor presence. Gene expression data sets for ovarian, prostate, and lung cancers were analyzed in another paper, which proposed an integrated gene search algorithm for genetic expression data analysis (preprocessing: GA and correlation-based heuristics; prediction/data mining: decision tree and SVM algorithms). In another work, the clinical and imaging diagnostic rules of peripheral lung cancer were extracted by data mining techniques, namely Association Rules (AR) from the knowledge discovery process, and the Rough Set (RS) reduction algorithm and Genetic Algorithm (GA) of the generic data analysis tool ROSETTA. A further study deals with the complementary learning fuzzy neural network (CLFNN) for the diagnosis of ovarian cancer. CLFNN-micro-array, CLFNN-blood test, and CLFNN-proteomics demonstrate good sensitivity and specificity, showing that CLFNN outperforms most conventional methods in ovarian cancer diagnosis.
Another study applies classification technology to construct an optimum cerebrovascular disease predictive model. The classification algorithms used are decision tree, Bayesian classifier, and back-propagation neural network.
The objective of another study is to develop an original method for extracting sets of relevant molecular biomarkers (gene sequences) that can be used for class prediction and as a prognostic and predictive tool. Molecular biomarkers are generated through the analysis of DNA microarrays, and this analysis is based on a specific data mining technique: Sequential Pattern Discovery.
The performance of data classification obtained by integrating artificial neural networks with the multivariate adaptive regression splines (MARS) approach has been explored for mining breast cancer patterns. This approach first uses MARS to model the classification problem; the significant variables obtained are then used as the input variables of the designed neural network model. A comparison of three data mining techniques (artificial neural networks, decision trees, and logistic regression) is carried out in a study predicting the survivability of breast cancer. Accuracy rates are found as 93.6%, 91.2%, and 89.2% respectively. Many aspects of possible relationships between DNA viruses and breast tumors have been considered. In that study, feasible clusters of DNA virus combinations that depend on the observed probability of breast cancer, fibroadenoma, and normal mammary tissue are created, and the viral prerequisites for breast carcinogenesis and the protective factors are determined. Another study aims to obtain bioinformatics about breast tumors and DNA viruses and to build an accurate diagnosis model for breast cancer and fibroadenoma. A hybrid SVM-based strategy with feature selection is constructed to discriminate between breast cancer and fibroadenoma and to find important risk factors for breast cancer. The DNA viruses HSV-1, EBV, CMV, HPV, and HHV-8 are evaluated. In another study related to breast cancer, breast cancer patterns are mined using discrete particle swarm optimization and a statistical method.
In addition, association rules (AR) and neural networks (NN) have been used to detect breast cancer. AR is used to reduce the dimension of the database, and NN is used for intelligent classification. In Menendeza et al (2010), a Self-Organizing Map (SOM) based clustering algorithm for preprocessing samples from a breast cancer screening program is introduced.
The prediction of breast cancer recurrence has also been investigated. The accuracies of the Cox Regression and SVM algorithms are compared, and it is shown that combining adequate treatment with follow-up guided by recurrence prediction can prevent the recurrence of breast cancer.
In this study, unlike the studies above, breast cancer cases are classified as benign or malignant using seven different algorithms that have not yet been tried for breast cancer in the literature, and a performance analysis of these algorithms is carried out.
2. MATERIALS AND METHODS
In this study, data mining is applied to the health sector. Possible cancer diagnoses are predicted for new patients whose other data (laboratory results) exist in hospital databases but whose diagnoses have not yet been determined, using the data of patients whose breast cancer has already been diagnosed. Different algorithms are used for the prediction, and the one with the highest confidence can then be preferred.
The required data about breast cancer patients have been taken from UCI Machine Learning Repository thanks to Dr. William H. Wolberg from the University of Wisconsin Hospitals, Madison. This data includes 699 samples with 10+1 attributes (1 for class). These attributes are as
This data set contains 458 benign and 241 malignant cases. Some attribute values were recorded as “?”, and these were removed from the set during the data preprocessing phase, before mining.
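The preprocessing step described above can be sketched as follows. This is a minimal illustration and not the authors' RapidMiner process: it assumes that the data arrive as comma-separated lines, that a missing value is marked with “?”, and that a whole record is dropped when any of its fields is missing. The three sample rows, the column layout (identifier, nine scores, class), and the function name `drop_incomplete` are assumptions made for the example.

```python
# Illustrative rows only: identifier, nine attribute scores, class label.
# The layout and the values are assumptions for this sketch.
RAW_ROWS = [
    "1000025,5,1,1,1,2,1,3,1,1,2",
    "1057013,8,4,5,1,2,?,7,3,1,4",   # contains a missing value
    "1017122,8,10,10,8,7,10,9,7,1,4",
]


def drop_incomplete(rows, missing="?"):
    """Keep only the records in which no field equals the missing marker."""
    kept = []
    for line in rows:
        fields = line.strip().split(",")
        if missing not in fields:
            # Every remaining field is numeric, so it is safe to convert.
            kept.append([int(field) for field in fields])
    return kept


clean = drop_incomplete(RAW_ROWS)
print(len(RAW_ROWS), "records before cleaning,", len(clean), "after")
```

Dropping whole records is only one possible reading of the cleaning step; removing an affected attribute column instead would change the number of inputs seen by every algorithm.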
After the data are obtained and cleaned, they are divided into two sets, one for training and one for testing. Some of the samples are used in the training phase, and the rest are used for testing the algorithms. The data are then transferred to the RapidMiner data mining tool, and the breast cancer diagnosis for each sample in the test set is predicted with seven different algorithms: Discriminant Analysis, Artificial Neural Networks, Decision Trees, Logistic Regression, Support Vector Machines, Naïve Bayes, and KNN. Finally, a performance analysis of these algorithms is carried out, and the best one for breast cancer diagnosis is determined.
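The split-and-evaluate procedure can be sketched in a few lines. The text does not state the proportion of samples used for testing, so the 30% test fraction and the fixed seed below are assumptions, as are all function names. The majority-class predictor is only a hypothetical stand-in for one of the seven algorithms, included so that the example runs on its own.

```python
import random
from collections import Counter


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into (train, test)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class matches the true one."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def majority_class_model(train_labels):
    """Placeholder model: always predicts the most frequent training class."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda features: most_common


# Toy records of the form (features, label); the values are illustrative only.
records = [([i, i + 1], 2 if i % 3 else 4) for i in range(10)]
train, test = split_train_test(records)
model = majority_class_model([label for _, label in train])
predictions = [model(features) for features, _ in test]
print("test accuracy:", accuracy([label for _, label in test], predictions))
```

Replacing `majority_class_model` with each real classifier in turn, while keeping the same split, is what makes the resulting accuracies comparable.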
The prediction mechanism in RapidMiner can be summarized as shown in the figure below. The Model box stands for the selected algorithm. In our case, this structure is set up and run seven times, once for each of our seven algorithms.
The algorithms used in RapidMiner for the diagnosis of breast cancer are given below, together with their explanations from the RapidMiner 5.0 Help.
2.1. Discriminant Analysis
Discriminant analysis in RapidMiner is applied with nominal labels and numerical attributes. It is used to determine which variables discriminate between two or more naturally occurring groups, and it may have a descriptive or a predictive objective. RapidMiner offers discriminant analysis in three forms: linear, quadratic, and regularized. In the linear case, the method looks for a linear combination of features that best separates two or more classes of examples; the resulting combination is then used as a linear classifier. Linear Discriminant Analysis (LDA) resembles analysis of variance and regression analysis, with one difference: in those two methods the dependent variable is numerical, whereas in LDA it is categorical. LDA is also related to principal component analysis (PCA) and factor analysis (both look for linear combinations of variables that best explain the data), but PCA and similar methods do not consider class differences, while LDA attempts to model the difference between the classes of data.
Quadratic Discriminant Analysis (QDA) is closely related to linear discriminant analysis (LDA), where it is assumed that the measurements are normally distributed. Unlike LDA however, in QDA there is no assumption that the covariance of each of the classes is identical.
The regularized discriminant analysis (RDA) is a generalization of the LDA and QDA. Both algorithms are special cases of this algorithm. If the alpha parameter is set to 1, RDA operator performs LDA. Similarly if the alpha parameter is set to 0, RDA operator performs QDA.
In our problem, we applied the linear form of discriminant analysis.
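The relationship between the three variants can be written compactly. The formulas below are the standard textbook forms for Gaussian class models and are not taken from the RapidMiner help; the symbols are assumptions introduced here: $\mu_k$, $\Sigma_k$, and $\pi_k$ denote the mean, covariance matrix, and prior probability of class $k$, and $\Sigma$ denotes the pooled covariance matrix shared by all classes. A sample $x$ is assigned to the class with the largest score $\delta_k(x)$. The blending formula in the last line is one common way to obtain the behaviour described for the alpha parameter; the exact formula used inside RapidMiner may differ.

```latex
\begin{align*}
% LDA: one pooled covariance matrix, so the score is linear in x
\delta_k^{\mathrm{LDA}}(x) &= x^{\top}\Sigma^{-1}\mu_k
  - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k + \log \pi_k \\
% QDA: each class keeps its own covariance matrix, so the score is quadratic in x
\delta_k^{\mathrm{QDA}}(x) &= -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
  - \tfrac{1}{2}\,(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log \pi_k \\
% RDA: blend of the two covariance estimates, used in the QDA score
\Sigma_k(\alpha) &= \alpha\,\Sigma + (1-\alpha)\,\Sigma_k ,
  \qquad 0 \le \alpha \le 1
\end{align*}
```

With $\alpha = 1$ every class uses the pooled matrix $\Sigma$, which gives LDA; with $\alpha = 0$ every class uses its own $\Sigma_k$, which gives QDA.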
2.2. Artificial Neural Networks (Multi Layer Perceptron)
Multi Layer Perceptron is a classifier that uses back propagation to classify instances. This network can be built by hand, created by an algorithm, or both. The network can also be monitored and modified during training time. The nodes in this network are all sigmoid (except when the class is numeric, in which case the output nodes become unthresholded linear units).
Parameters of this algorithm are:
L: Learning Rate for the backpropagation algorithm. (Value should be between 0 - 1, Default = 0.3). Range: real; -∞ to +∞
M: Momentum Rate for the backpropagation algorithm. (Value should be between 0 - 1, Default = 0.2). Range: real; -∞ to +∞
N: Number of epochs to train through. (Default = 500). Range: real; -∞ to +∞
V: Percentage size of validation set to use to terminate training (if this is non zero it can pre-empt num of epochs). (Value should be between 0 - 100, Default = 0). Range: real; -∞ to +∞
S: The value used to seed the random number generator. (Value should be >= 0 and a long, Default = 0). Range: real; -∞ to +∞
E: The consecutive number of errors allowed for validation testing before the network terminates. (Value should be > 0, Default = 20). Range: real; -∞ to +∞
G: GUI will be opened. (Use this to bring up a GUI). Range: boolean; default: false
A: Autocreation of the network connections will NOT be done. (This will be ignored if -G is NOT set). Range: boolean; default: false
B: A NominalToBinary filter will NOT automatically be used. (Set this to not use a NominalToBinary filter). Range: boolean; default: false
H: The hidden layers to be created for the network. (Value should be a list of comma separated Natural numbers or the letters 'a' = (attribs + classes) / 2, 'i' = attribs, 'o' = classes, 't' = attribs + classes for wildcard values, Default = a). Range: string; default: 'a'
C: Normalizing a numeric class will NOT be done. (Set this to not normalize the class if it's numeric). Range: boolean; default: false
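Two of the parameters above can be illustrated with a short sketch. The first function shows how the wildcard letters of the H parameter translate into hidden layer sizes; it assumes that the division in 'a' is rounded down, which the help text does not state. The second function shows the usual form of a backpropagation weight update with a learning rate (L) and a momentum rate (M). Both function names are hypothetical, and the code is not the implementation used by RapidMiner.

```python
def resolve_hidden_layers(spec, n_attribs, n_classes):
    """Translate an H-style specification such as 'a' or 'i,4,o' into layer sizes."""
    wildcards = {
        "a": (n_attribs + n_classes) // 2,  # assumption: integer division
        "i": n_attribs,
        "o": n_classes,
        "t": n_attribs + n_classes,
    }
    sizes = []
    for token in spec.split(","):
        token = token.strip()
        sizes.append(wildcards[token] if token in wildcards else int(token))
    return sizes


def momentum_step(weight, gradient, previous_delta,
                  learning_rate=0.3, momentum=0.2):
    """One weight update: move against the gradient, plus a share of the last move."""
    delta = -learning_rate * gradient + momentum * previous_delta
    return weight + delta, delta


# With nine input attributes and two classes, the default 'a' gives one hidden layer.
print(resolve_hidden_layers("a", n_attribs=9, n_classes=2))

# A single update with the default L and M values.
new_weight, delta = momentum_step(weight=1.0, gradient=0.5, previous_delta=0.1)
print(new_weight, delta)
```

The momentum term reuses part of the previous update, which tends to smooth the path taken during training; a larger M gives earlier updates more influence on the current one.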