
Computer Science & Engineering: An International Journal (CSEIJ), Vol. 4, No. 1, February 2014




Zehra Karapinar Senturk and Resul Kara

Department of Computer Engineering, Duzce University, Duzce, Turkey


According to the World Health Organization (WHO), breast cancer is the most common cancer in women in both the developed and the developing world. Increased life expectancy, urbanization, and the adoption of western lifestyles are increasing the incidence of breast cancer in the developing world. Most cases are diagnosed in the late stages of the illness, so early detection is crucial for improving breast cancer outcomes and survival.

This study aims to contribute to the early diagnosis of breast cancer by analyzing breast cancer diagnoses for a set of patients. First, data about patients whose cancers have already been diagnosed are gathered and arranged; then, on the basis of those data, it is predicted whether other patients have breast cancer. The predictions are made with seven different algorithms, and the accuracy of each is reported. The patient data were taken from the UCI Machine Learning Repository, thanks to Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. The RapidMiner 5.0 data mining tool is used to apply the desired algorithms during the prediction process.


Data mining, breast cancer diagnoses, RapidMiner

1. INTRODUCTION

The amount of data we hold has grown greatly over time, and controlling and managing it has become correspondingly harder. When paper records could no longer hold the data, and finding a particular item became difficult, the need arose for easily manageable, relatively large systems. As computer use spread, rapidly growing data came to be stored on computer hard disks. Although hard disks alone seemed to be a solution at first glance, difficulties with operations such as accessing data that occupy large amounts of storage and modifying individual items led to the idea of database management systems. These systems make operations on stored data easy: operations that would normally take a long time are completed quickly and with a minimal error rate. However, current database management systems become insufficient when more information must be obtained from the data. The need to extract more and more information from data is felt in almost all areas of life, and methods are being devised to satisfy these needs. With these methods, realistic predictions about the future are made by analyzing the data at hand. The process of obtaining information from data is called “data mining”.

Data mining can be defined as analyzing data from different perspectives and summarizing it to obtain useful information. This information may be used for purposes such as increasing income or decreasing costs. Technically, data mining is the process of finding relationships or patterns among dozens of fields in very large relational databases.

The purpose of this study is to perform an analysis, using data mining, that can be used for the diagnosis of breast cancer. In this way, the delay in diagnosis, which is critical in cancer, can be eliminated and the time gained can be used for treating the illness.

In the literature, there are many studies on cancer detection and/or data mining. In [7], data mining is used for the diagnosis of ovarian cancer. The analysis uses serum proteomics that distinguish ovarian cancer cases from non-cancer ones. An SVM (Support Vector Machine) based method is applied, and statistical testing and GA (Genetic Algorithm) based methods are used for feature selection. In [6], a new 3-D microwave approach is proposed, based on an SVM classifier whose output is transformed into an a posteriori probability of tumor presence. Gene expression data sets for ovarian, prostate, and lung cancers are analyzed in another paper [13], which proposes an integrated gene search algorithm for genetic expression data analysis (preprocessing: GA and correlation-based heuristics; prediction/data mining: decision tree and SVM algorithms). In [11], clinical and imaging diagnostic rules for peripheral lung cancer are extracted with data mining techniques: Association Rules (AR) from the knowledge discovery process, and the Rough Set (RS) reduction algorithm and Genetic Algorithm (GA) of the generic data analysis tool ROSETTA. [14] deals with a complementary learning fuzzy neural network (CLFNN) for the diagnosis of ovarian cancer. CLFNN-micro-array, CLFNN-blood test, and CLFNN-proteomics demonstrate good sensitivity and specificity, showing that CLFNN outperforms most conventional methods in ovarian cancer diagnosis.

[15] applies classification technology to construct an optimum cerebrovascular disease predictive model. The classification algorithms used are decision tree, Bayesian classifier, and back-propagation neural network.

The objective of [3] is to develop an original method for extracting sets of relevant molecular biomarkers (gene sequences) that can be used for class prediction and as a prognostic and predictive tool. The biomarkers are generated by analyzing DNA microarrays with a specific data mining technique: Sequential Pattern Discovery.

The performance of data classification that integrates artificial neural networks with the multivariate adaptive regression splines (MARS) approach is explored for mining breast cancer patterns [1]. This approach first uses MARS to model the classification problem; the significant variables obtained are then used as input variables of the neural network model. Three data mining techniques (artificial neural networks, decision trees, and logistic regression) are compared in a study predicting the survivability of breast cancer [2]; the accuracy rates found are 93.6%, 91.2%, and 89.2%, respectively. Many aspects of possible relationships between DNA viruses and breast tumors are considered in [9]. In that study, feasible clusters of DNA virus combinations, depending on the observed probability of breast cancer, fibroadenoma, and normal mammary tissue, are created, and the viral prerequisites for breast carcinogenesis and the protective factors are determined. [4] aims to obtain bioinformatics about breast tumors and DNA viruses and to build an accurate diagnosis model for breast cancer and fibroadenoma. A hybrid SVM-based strategy with feature selection is constructed to distinguish breast cancer from fibroadenoma and to find important risk factors for breast cancer. The DNA viruses HSV-1, EBV, CMV, HPV, and HHV-8 are evaluated. In another breast cancer study, breast cancer patterns are mined using discrete particle swarm optimization and a statistical method [16].

In addition, association rules (AR) and a neural network (NN) are used to detect breast cancer in [5]: AR reduces the dimension of the database, and NN performs intelligent classification. Menendeza et al. (2010) introduce a Self-Organizing Map (SOM) based clustering algorithm for preprocessing samples from a breast cancer screening program.

Prediction of the recurrence of breast cancer is investigated in [7]. The accuracies of the Cox regression and SVM algorithms are compared, and it is shown that combining adequate treatment with follow-up guided by recurrence prediction prevents the recurrence of breast cancer.

In this study, unlike the studies above, breast cancer cases are predicted as benign or malignant with seven different algorithms that have not yet been tried for breast cancer in the literature, and a performance analysis of these algorithms is carried out.


In this study, data mining is applied to the health sector. For new patients whose other data (laboratory results) exist in hospital databases but whose diagnoses have not yet been determined, possible cancer diagnoses are predicted using the data of patients whose breast cancer has already been diagnosed. Different algorithms are used for the prediction, and the one with the highest confidence can then be preferred.

The required data about breast cancer patients were taken from the UCI Machine Learning Repository, thanks to Dr. William H. Wolberg of the University of Wisconsin Hospitals, Madison. The data set includes 699 samples with 10+1 attributes (1 for the class). These attributes are as follows:


[Table: attributes of the data set]

The data set contains 458 benign and 241 malignant cases. Some attributes had the value “?”; these were removed from the set in the data preprocessing phase, before mining.
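The preprocessing step can be sketched in Python as follows. This is only an illustration under stated assumptions: the column names follow the UCI description of the Wisconsin breast cancer data rather than anything given in the text, whole records containing “?” are dropped (the text does not say exactly what was removed), and the three sample rows are made up.

```python
import csv
import io

# Column names are an assumption based on the UCI description of the
# Wisconsin breast cancer data; the text itself does not list them.
COLUMNS = [
    "sample_code_number", "clump_thickness", "uniformity_of_cell_size",
    "uniformity_of_cell_shape", "marginal_adhesion",
    "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "class",
]

def load_and_clean(text):
    """Parse comma-separated records and drop any record containing '?'."""
    kept, dropped = [], 0
    for row in csv.reader(io.StringIO(text)):
        row = [value.strip() for value in row]
        if not row:
            continue  # skip blank lines
        if "?" in row:
            dropped += 1  # assumption: the whole record is discarded
            continue
        kept.append(dict(zip(COLUMNS, (int(value) for value in row))))
    return kept, dropped

# Three made-up records in the assumed file layout (not real patient data).
sample = """1000001,5,1,1,1,2,1,3,1,1,2
1000002,8,10,10,8,7,?,9,7,1,4
1000003,5,4,4,5,7,10,3,2,1,2
"""
records, n_dropped = load_and_clean(sample)
print(len(records), "kept,", n_dropped, "dropped")
```

Dropping whole records is the simplest reading of the cleaning step; imputing the missing values would be an alternative, but the text gives no indication that this was done.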

After the data are obtained and cleaned, they are divided into two sets, training and testing: some samples are used in the training phase and the rest are used for testing the algorithms. The data are then transferred to the RapidMiner data mining tool, and the breast cancer diagnosis for each sample in the test set is predicted with seven different algorithms: Discriminant Analysis, Artificial Neural Networks, Decision Trees, Logistic Regression, Support Vector Machines, Naïve Bayes, and KNN. Finally, a performance analysis of these algorithms is carried out and the best one for breast cancer is determined.
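The split, predict, and compare workflow can be illustrated with a short Python sketch. It is not the RapidMiner process used in the study: the split fraction, the value of k, the majority-class baseline, and the toy data are all assumptions chosen for illustration, and only KNN of the seven algorithms is shown.

```python
import math
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def knn_predict(train, features, k=3):
    """Majority vote among the k training rows nearest to `features`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], features))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def majority_predict(train, features):
    """Baseline: always predict the most frequent training label."""
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def accuracy(predict, train, test):
    """Fraction of test rows whose predicted label equals the true label."""
    hits = sum(predict(train, x) == y for x, y in test)
    return hits / len(test)

# Toy two-feature data (hypothetical values, not the UCI records).
toy = [((1, 1), "benign"), ((2, 1), "benign"), ((1, 2), "benign"),
       ((2, 2), "benign"), ((1, 3), "benign"), ((3, 1), "benign"),
       ((8, 9), "malignant"), ((9, 8), "malignant"), ((9, 9), "malignant"),
       ((8, 8), "malignant"), ((7, 9), "malignant"), ((9, 7), "malignant")]

train, test = train_test_split(toy)
models = {"KNN (k=3)": knn_predict, "majority baseline": majority_predict}
for name, predict in models.items():
    print(name, accuracy(predict, train, test))
```

Keeping every model behind the same `predict(train, features)` signature means the same training and test sets can be reused for each algorithm, which is what makes the accuracies comparable.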

The prediction mechanism in RapidMiner can be summarized as shown in the figure below. The Model box stands for the selected algorithm. In our case, this structure is set up and run seven times, once for each of our seven algorithms.

[Figure: prediction mechanism in RapidMiner]

The algorithms used in RapidMiner for the diagnosis of breast cancer are given below, with their explanations from the RapidMiner 5.0 Help.


2.1. Discriminant Analysis

Discriminant analysis in RapidMiner is applied with nominal labels and numerical attributes. It is used to determine which variables discriminate between two or more naturally occurring groups, and it may have a descriptive or a predictive objective. RapidMiner performs discriminant analysis in three ways: linear, quadratic, and regularized. In the linear case, the method looks for a linear combination of features that best separates two or more classes of examples; the resulting combination is then used as a linear classifier. Linear discriminant analysis (LDA) resembles analysis of variance and regression analysis, with one difference: in those two methods the dependent variable is numerical, whereas in LDA it is categorical. LDA is also related to principal component analysis (PCA) and factor analysis, since all of them look for linear combinations of variables that best explain the data; however, PCA and the other methods do not consider class differences, while LDA attempts to model the difference between the classes of data.

Quadratic discriminant analysis (QDA) is closely related to LDA: in both, the measurements are assumed to be normally distributed. Unlike LDA, however, QDA does not assume that the covariance of each class is identical.

Regularized discriminant analysis (RDA) is a generalization of LDA and QDA; both algorithms are special cases of it. If the alpha parameter is set to 1, the RDA operator performs LDA. Similarly, if the alpha parameter is set to 0, the RDA operator performs QDA.
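The relationship between the three variants can be written out in the standard textbook form. The symbols here are assumptions introduced for illustration: $\mu_k$, $\Sigma_k$, and $\pi_k$ denote the mean, covariance, and prior of class $k$, and $\Sigma$ denotes the pooled covariance. The blending formula is one common way of defining regularized discriminant analysis that matches the behavior described for alpha; how RapidMiner computes it internally is not stated in the text.

```latex
% QDA: each class has its own covariance, so the boundary is quadratic in x
\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
  - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k) + \log\pi_k

% LDA: all classes share one covariance, so the score is linear in x
\delta_k^{\mathrm{LDA}}(x) = x^{\top}\Sigma^{-1}\mu_k
  - \tfrac{1}{2}\mu_k^{\top}\Sigma^{-1}\mu_k + \log\pi_k

% RDA (assumed form): blend the pooled and per-class covariances
\Sigma_k(\alpha) = \alpha\,\Sigma + (1-\alpha)\,\Sigma_k,
  \qquad 0 \le \alpha \le 1
```

With $\alpha = 1$ every class uses the pooled covariance and the classifier reduces to LDA; with $\alpha = 0$ each class keeps its own covariance and it reduces to QDA. In all cases a sample $x$ is assigned to the class with the largest score $\delta_k(x)$.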

In our problem, we applied the linear form of discriminant analysis.

2.2. Artificial Neural Networks (Multi Layer Perceptron)

Multi Layer Perceptron is a classifier that uses backpropagation to classify instances. The network can be built by hand, created by an algorithm, or both. It can also be monitored and modified during training. The nodes in this network are all sigmoid, except when the class is numeric, in which case the output nodes become unthresholded linear units.

The parameters of this algorithm are:

L: Learning Rate for the backpropagation algorithm. (Value should be between 0 - 1, Default = 0.3). Range: real; -∞ to +∞

M: Momentum Rate for the backpropagation algorithm. (Value should be between 0 - 1, Default = 0.2). Range: real; -∞ to +∞

N: Number of epochs to train through. (Default = 500). Range: real; -∞ to +∞

V: Percentage size of the validation set used to terminate training (if this is non-zero, it can pre-empt the number of epochs). (Value should be between 0 - 100, Default = 0). Range: real; -∞ to +∞

S: The value used to seed the random number generator. (Value should be >= 0 and a long, Default = 0). Range: real; -∞ to +∞

E: The consecutive number of errors allowed for validation testing before the network terminates. (Value should be > 0, Default = 20). Range: real; -∞ to +∞

G: GUI will be opened. (Use this to bring up a GUI). Range: boolean; default: false

A: Autocreation of the network connections will NOT be done. (This will be ignored if -G is NOT set). Range: boolean; default: false

B: A NominalToBinary filter will NOT automatically be used. (Set this to not use a NominalToBinary filter). Range: boolean; default: false

H: The hidden layers to be created for the network. (Value should be a list of comma-separated natural numbers or the letters 'a' = (attribs + classes) / 2, 'i' = attribs, 'o' = classes, 't' = attribs + classes for wildcard values, Default = a). Range: string; default: 'a'

C: Normalizing a numeric class will NOT be done. (Set this to not normalize the class if it's numeric). Range: boolean; default: false
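To make the roles of L, M, N, and H concrete, the following Python sketch resolves the hidden-layer wildcards and trains a single sigmoid unit with a learning rate and momentum. It is not the multi-layer network itself and not RapidMiner's code: the function names are hypothetical, integer division for 'a' is an assumption, the attribute and class counts in the example call are illustrative, and the update rule is the common textbook form of gradient descent with momentum on squared error.

```python
import math

def hidden_layer_sizes(spec, attribs, classes):
    """Resolve a comma-separated hidden-layer spec with wildcards a, i, o, t."""
    wildcards = {
        "a": (attribs + classes) // 2,  # integer division is an assumption
        "i": attribs,
        "o": classes,
        "t": attribs + classes,
    }
    tokens = (part.strip() for part in spec.split(","))
    return [wildcards[tok] if tok in wildcards else int(tok) for tok in tokens]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_sigmoid_unit(data, learning_rate=0.3, momentum=0.2, epochs=500):
    """Online gradient descent on squared error for one sigmoid unit.

    The last weight is the bias. Each weight change is
    learning_rate * gradient term + momentum * previous change.
    """
    n_inputs = len(data[0][0])
    weights = [0.0] * (n_inputs + 1)
    previous = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for inputs, target in data:
            x = list(inputs) + [1.0]  # constant input for the bias
            out = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            err_term = (target - out) * out * (1.0 - out)
            for j, xj in enumerate(x):
                delta = learning_rate * err_term * xj + momentum * previous[j]
                weights[j] += delta
                previous[j] = delta
    return weights

def predict(weights, inputs):
    x = list(inputs) + [1.0]
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

# Logical OR as a tiny, linearly separable training set (illustrative only).
or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_sigmoid_unit(or_gate)
print(hidden_layer_sizes("a", attribs=9, classes=2))
print([round(predict(w, x)) for x, _ in or_gate])
```

The defaults in `train_sigmoid_unit` mirror the listed defaults for L, M, and N; a full multi-layer perceptron applies the same kind of update to every layer, propagating the error term backwards.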
