
Clustering using Levy Flight Cuckoo Search

J. Senthilnath¹ᵃ, Vipul Das²ᵇ, S.N. Omkar³ᵃ, V. Mani⁴ᵃ

ᵃ Department of Aerospace Engineering, Indian Institute of Science, Bangalore, India
ᵇ Department of Information Technology, National Institute of Technology, Karnataka, India

{¹snrj@aero.iisc.ernet.in; ²vipulramdas@gmail.com; ³omkar@aero.iisc.ernet.in; ⁴mani@aero.iisc.ernet.in}

Abstract. In this paper, a comparative study of three nature-inspired algorithms, namely the Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and Cuckoo Search (CS), is carried out on the clustering problem. Cuckoo Search is used with Levy flight, exploiting the heavy-tailed property of the Levy distribution. These algorithms are applied to three standard benchmark datasets and one real-time multi-spectral satellite dataset. The results are tabulated and analysed using various techniques. We conclude that, under the given set of parameters, Cuckoo Search works efficiently for the majority of the datasets, and that Levy flight plays an important role.

Keywords: Genetic algorithm, Particle swarm optimization, Cuckoo search, Levy flight, Clustering.

1 Introduction

Clustering is an unsupervised learning method in which objects with closer resemblance are grouped together to form a cluster based on a similarity measure. The objective of clustering is to minimize the intra-cluster distance while the inter-cluster distance is maximized [1]. Clustering has various applications, including data analysis, machine learning, image analysis and other engineering applications.

Clustering can be classified into two types: hierarchical and partition. In hierarchical clustering, objects can belong to more than one cluster, forming a hierarchical pattern. Hierarchical clustering is carried out by splitting and merging the dataset.

In splitting, the number of cluster centres generated is greater than the number of classes, while merging groups the dataset into the exact number of classes. In partition clustering, objects are clustered into disjoint groups without forming a hierarchy. In both methods, a similarity measure is used to generate the cluster centres.

Previously, the most popularly used and tested partition-based algorithm is k-means clustering. The main disadvantage of k-means clustering is convergence to a local minimum [2]. In the literature, nature-inspired algorithms have been used effectively on the clustering problem as they converge to the global minimum [2, 3]. These algorithms are based on the exploration and exploitation behaviour observed in nature and are used effectively in optimization problems.

[Published in: J. C. Bansal et al. (eds.), Proceedings of Seventh International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012), Advances in Intelligent Systems and Computing 202, DOI: 10.1007/978-81-322-1041-2_6, © Springer India 2013]

In this paper, a comparative performance study is carried out based on the results obtained using three nature-inspired algorithms, namely the genetic algorithm (GA), particle swarm optimization (PSO) and the cuckoo search algorithm (CS), on the clustering problem. The standard benchmark clustering datasets used in our study are those available in the UCI machine learning repository [4], together with a real-time multi-spectral satellite image for crop type classification. Xin-She Yang et al. [5] implemented and analysed the CS algorithm by comparing it with GA and PSO on standard benchmark functions. In their study, the CS algorithm is used with Levy flight and is found to perform better than the other two methods. In the literature, CS has been used without the Levy distribution for the clustering problem on a satellite image [3]. In our study, we use CS with Levy flight, as in [5], on clustering datasets and compare it with GA and PSO. The important property of Levy flight is that it ensures the whole search space is covered, owing to the heavy-tailed property of the Levy distribution [6-10]. We split the data into training and testing samples. The cluster centres are determined by running the algorithms on the training dataset, and the testing dataset is used to determine the classification error percentage (CEP).

The remaining sections are organised as follows: in Section 2 the problem formulation for clustering is discussed; in Section 3 a brief description of the algorithms is presented; in Sections 4 and 5 we present the analysis of the results obtained and the conclusion, respectively.

2 Problem Formulation
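The objective equations of this section are referenced later as Eq. 1 (the per-cluster distance measure) and Eq. 3 (the fitness of a candidate solution). A common intra-cluster objective consistent with that usage — an assumption here, not text recovered from the paper — is the summed Euclidean distance of each training sample to its nearest cluster centre:

```python
import numpy as np

def cluster_fitness(centres, data):
    """Assumed form of the clustering objective: sum over all samples of
    the Euclidean distance to the nearest cluster centre (to be minimised)."""
    # pairwise distances, shape (n_samples, n_centres)
    d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    return d.min(axis=1).sum()
```

For example, with centres (0.5, 0) and (10, 0) and samples (0, 0), (1, 0), (10, 0), the nearest-centre distances are 0.5, 0.5 and 0, giving a fitness of 1.0.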

3 Methodology

This section gives a brief introduction to the algorithms used in our study, the way they have been applied to the clustering problem, and the pseudo-code for each algorithm.

**3.1 Genetic Algorithm**

This algorithm is based on the natural selection process seen in nature [11, 12]. The best-fit organisms of the current generation carry their genes on to the next generation. The concept of genetic operators (crossover and mutation) is included in the algorithm, whereby a change in the gene structure is introduced that produces an entirely different trait. The main idea behind the genetic algorithm lies in the operators used, namely reproduction, crossover and mutation.

This algorithm takes a predetermined number of random solutions (the population) in the search space, called chromosomes. A convergence criterion is used to terminate the algorithm. At each iteration the chromosomes are crossed over using single-point crossover, and the fitness of each chromosome is calculated as

fi = f(xi), i = 1, 2, ..., n (3)

where f(xi) is the fitness function given by Eq. 1, considering the clusters individually, and n is the population size.

The fittest chromosomes (solutions) among the entire population are considered for the next generation (iteration). At a random point, the chromosomes undergo mutation based on the mutation rate. The fitness is calculated and the best solutions are carried on until the termination criterion is reached. Thus, the cluster centres are generated using the training dataset.

Pseudo-code

1. Initialize population of n chromosomes

2. Repeat till stopping criteria

a) Calculate fitness using Eq. 3

b) Apply elitism by sorting the fitness value of the population

c) Retain the best fit solutions (reproduction)

d) Crossover the adjacent chromosomes at a random position using single point crossover

e) Mutate randomly selected point within a chromosome

3. Cluster centre will be the best fit solution from the population
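The steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the population size, crossover pairing and mutation scheme are illustrative assumptions, and Eq. 1 is assumed to be the summed nearest-centre distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(chrom, data, k):
    """Eq. 3 for one chromosome, decoded as k cluster centres.
    Eq. 1 is assumed here to be the summed nearest-centre distance."""
    centres = chrom.reshape(k, -1)
    d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    return d.min(axis=1).sum()

def ga_cluster(data, k, n=20, iters=100, mutation_rate=0.1):
    """Minimal GA following the pseudo-code: elitism, single-point
    crossover of adjacent chromosomes, random point mutation."""
    dim = k * data.shape[1]
    lo, hi = data.min(), data.max()
    pop = rng.uniform(lo, hi, (n, dim))            # 1. initialise n chromosomes
    for _ in range(iters):                         # 2. repeat till stopping criterion
        fit = np.array([fitness(c, data, k) for c in pop])  # a) fitness (Eq. 3)
        pop = pop[np.argsort(fit)]                 # b) elitism: sort by fitness
        children = pop[: n // 2].copy()            # c) retain the best-fit half
        for i in range(0, n // 2 - 1, 2):          # d) single-point crossover
            p = rng.integers(1, dim)
            children[i, p:], children[i + 1, p:] = (
                children[i + 1, p:].copy(), children[i, p:].copy())
        for c in children:                         # e) random point mutation
            if rng.random() < mutation_rate:
                c[rng.integers(dim)] = rng.uniform(lo, hi)
        pop = np.vstack([pop[: n - n // 2], children])
    fit = np.array([fitness(c, data, k) for c in pop])
    return pop[fit.argmin()].reshape(k, -1)        # 3. best solution = centres
```

Because the sorted best half is retained unchanged at every generation, the best fitness found never worsens across iterations.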

**3.2 Particle Swarm Optimization**

This is a population based method which iteratively improves the solution by moving the solutions closer to the optimal solution. Here each particle moves towards the optimal solution with a velocity vi at each iteration. Eventually all particles converge to an optimal position [13].

Initially, n particles are created and randomly distributed in the search space. The fitness of each particle is evaluated using Eq. 3 and Eq. 1, considering the classes individually. All the particles are made to move one step towards the fittest particle (the global best solution) as well as towards their personal best positions with a velocity vi given by

vi(t+1) = w·vi(t) + bp·rand·(pi − ci) + bg·rand·(g − ci) (4)

where pi is the personal best position of the particle, ci is the current position of the particle, g is the global best among all particles, w is the inertial constant, bp is the personal best constant, bg is the global best constant, and i = 1, 2, ..., n. Each particle then moves using

ci(t+1) = ci(t) + vi(t+1) (5)

The fitness of each particle is calculated, and the personal best positions and the global best are updated. This process is repeated until the stopping criterion is met.

The global best position will be the cluster centre to the given data set.

Pseudo-code

1. Initialize n particles

2. Repeat till stopping criteria met

a) Calculate fitness of each particle using Eq.3

b) Global best position is the best fit particle

c) Move all the particles towards the global best position using Eq. 4 and Eq. 5

d) For each particle, if (fitness of current position is better than fitness of personal best) then personal best = current position

e) Update personal best position for each particle

f) Global best fitness value is retained

3. Cluster centre is the global best position
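A minimal sketch of this loop, written for a generic fitness function to be minimised, is below. Note an assumption: the experiments in Section 4.1 set w = bp = bg = 1, but this sketch defaults to w = 0.7, since an inertia below 1 keeps the illustrative run numerically stable.

```python
import numpy as np

rng = np.random.default_rng(1)

def pso_minimise(f, dim, lo, hi, n=20, iters=200, w=0.7, bp=1.0, bg=1.0):
    """Minimal PSO following the pseudo-code; w, bp and bg are the inertial,
    personal-best and global-best constants of Eq. 4."""
    c = rng.uniform(lo, hi, (n, dim))   # current positions ci
    v = np.zeros((n, dim))              # velocities vi
    p = c.copy()                        # personal best positions pi
    pf = np.array([f(x) for x in c])    # personal best fitness values
    g = p[pf.argmin()].copy()           # global best position
    for _ in range(iters):
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        v = w * v + bp * r1 * (p - c) + bg * r2 * (g - c)   # Eq. 4
        c = c + v                                           # Eq. 5
        fit = np.array([f(x) for x in c])
        better = fit < pf               # minimisation: lower fitness is better
        p[better], pf[better] = c[better], fit[better]
        g = p[pf.argmin()].copy()       # retain the global best
    return g
```

For clustering, f would be the Eq. 3 fitness of a particle decoded as cluster centres, and the returned global best position is the cluster-centre solution.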

**3.3 Cuckoo Search**

This algorithm is based on the breeding pattern of parasitic cuckoos [3, 5, 14].

Some species of cuckoo, namely the Ani and the Guira, lay their eggs in the nests of other birds. Such an act can lead to: i) the host bird's eggs being destroyed by the cuckoo itself, or by the cuckoo chick upon hatching; ii) the host bird realising the presence of a foreign egg in its nest and either throwing away these eggs or abandoning the nest altogether to build a new nest elsewhere [5].

These are the processes in nature that this algorithm emulates. The basic assumptions made are: 1) each cuckoo lays one egg at a time and dumps it into a randomly chosen nest; 2) the best nests with high-quality eggs will carry over to the next generation; 3) each nest contains only one egg and the number of host nests is fixed; and 4) the probability that the host bird discovers the cuckoo egg is pa.

This implies that a fraction pa of the n nests is replaced by new nests (with new random solutions) [5].

Each nest represents a solution and a cuckoo egg represents a new solution. The aim is to use the new and potentially better solutions (cuckoo eggs). An initial population of host nests is generated randomly, and the algorithm runs until convergence is reached. At each iteration a cuckoo is selected at random and generates a new solution using Levy flight, as given in [5]:

xi(t+1) = xi(t) + α·L (6)

where α is the step size, L is a value drawn from the Levy distribution, and i = 1, 2, ..., n, with n the number of nests considered. The fitness of the cuckoo is calculated using Eq. 3 and Eq. 1, considering the classes individually.

A random nest is then chosen from the given population of nests and its fitness is evaluated using Eq. 3. If the fitness of the new solution is better than that of the old one, the old solution is replaced by the new one. A fraction pa of the total number of nests is replaced by new nests with new random solutions. The best nests with the fittest eggs (solutions) are carried on to the next generation.

This continues until the termination criterion is reached, and the best nest with the fittest egg is taken as the optimal value. Thus the cluster centres are generated from this optimal value.
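The Levy step of Eq. 6 and the loop described above can be sketched as follows. The Levy values here are drawn with Mantegna's algorithm, a common sampler for Levy-stable steps; the paper does not state which sampler it uses, and n, iters and α are illustrative choices (pa = 0.25 as reported in Section 4.1).

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def levy_step(dim, beta=1.5):
    """Heavy-tailed random step L for Eq. 6, via Mantegna's algorithm
    (an assumed sampler; any Levy-stable draw would serve the sketch)."""
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
             / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(f, dim, lo, hi, n=15, iters=300, pa=0.25, alpha=0.1):
    """Minimal CS loop for a fitness f to be minimised."""
    nests = rng.uniform(lo, hi, (n, dim))
    fit = np.array([f(x) for x in nests])
    for _ in range(iters):
        i = rng.integers(n)                        # pick a cuckoo at random
        new = nests[i] + alpha * levy_step(dim)    # Eq. 6
        fn = f(new)
        j = rng.integers(n)                        # pick a random nest
        if fn < fit[j]:                            # keep the better solution
            nests[j], fit[j] = new, fn
        worst = np.argsort(fit)[int(n * (1 - pa)):]        # abandon fraction pa
        nests[worst] = rng.uniform(lo, hi, (len(worst), dim))
        fit[worst] = np.array([f(x) for x in nests[worst]])
    return nests[fit.argmin()]                     # best nest = optimal value
```

Since only the worst fraction pa of nests is abandoned, the best nest found so far always survives to the next generation.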

Pseudo-code

1. Initialise n nests

2. Repeat till stopping criteria is met

a) Randomly select a cuckoo using levy flight using Eq.6

b) Calculate its fitness using Eq.3 (Fc)

c) Randomly select a nest

d) Calculate its fitness using Eq.3 (Fn)

e) If (Fc is better than Fn) then replace the nest with the cuckoo

f) A fraction pa of the nests is replaced by new nests

g) Calculate fitness and keep best nests

h) Store the best nest as optimal fitness value

3. Cluster centre will be the best nest position

4 Results and discussion

The performance measures used in this paper are the classification error percentage [2], a statistical significance test [4], receiver operating characteristic analysis [15, 16] and time complexity analysis.

**4.1. Classification error percentage**

The algorithms are run five times and the average of the results is shown in Table 2. The values are obtained using the testing dataset. Parameters such as the maximum number of generations and the number of initial random solutions are kept the same for all the algorithms. Each algorithm is run until it converges to a point within a tolerance of 0.01. In GA, the best 40% of the parent generation and the best 60% of the offspring generation are carried on to the next generation. In PSO, the inertial constant (w), the personal best constant (bp) and the global best constant (bg) are all set to 1. In the CS algorithm, the probability factor pa is set to 0.25.
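The CEP on a testing set can be computed as below; this is the standard definition (fraction of misclassified test samples, as a percentage), which is assumed here since the paper's defining equation is not reproduced in this excerpt.

```python
def cep(predicted, actual):
    """Classification error percentage: share of test samples whose
    predicted class differs from the actual class, as a percentage."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * wrong / len(actual)

# Example: one of four test samples misclassified gives a CEP of 25.0
```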


**4.2. Statistical significance test**

The statistical significance test is carried out to ascertain that the results are obtained consistently: no matter where the initial random solutions are picked from, they should always converge to the global optimum position (cluster centre). This implies that an algorithm which performed better than the others will always perform better when run under similar initial conditions. In this study a binomial test is conducted [3] between CS and GA, and between CS and PSO, based on the results obtained on the image segmentation dataset.

Assume the test is carried out between CS and GA. The total number of test runs is N, i.e., the results of CS and GA differ in N places. Let S (success) be the number of times CS gave the correct result and F (failure) the number of times GA gave the correct result. The p-value (the probability of S successes out of N trials) is then calculated using the binomial distribution as

p = C(N, S) · (1/2)ˢ · (1/2)ᴺ⁻ˢ = C(N, S) / 2ᴺ

where C(N, S) is the binomial coefficient and the null hypothesis assigns both algorithms an equal probability 1/2 of winning a differing run.
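This can be computed directly; a sketch is below. The one-sided cumulative version (probability of S or more successes), which is the usual sign-test p-value, is included as well, since the excerpt does not show whether the exact or the cumulative probability is reported.

```python
from math import comb

def binomial_p(S, N, p=0.5):
    """Probability of exactly S successes in N Bernoulli(p) trials:
    C(N, S) * p**S * (1 - p)**(N - S)."""
    return comb(N, S) * p ** S * (1 - p) ** (N - S)

def sign_test_p(S, N):
    """One-sided p-value: probability of S or more successes when both
    algorithms are equally likely (p = 1/2) to win a differing run."""
    return sum(binomial_p(k, N) for k in range(S, N + 1))
```

For example, with N = 10 differing runs and S = 8 wins for CS, the one-sided p-value is (45 + 10 + 1) / 1024 ≈ 0.0547.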

**4.3. Receiver operating characteristic**

In our case, we analyse the image segmentation dataset, and give an example of the analysis using cuckoo search. The classification matrix obtained after applying the cuckoo search algorithm to the image segmentation data is given in Table 5.

In this representation, the row indicates the class an element belongs to and the column indicates the class the element is classified into using the cluster centres obtained from the CS algorithm. The principal diagonal elements represent correctly classified elements. Consider class 1: from Table 5, we have TP = 126, FN = 174, FP = 168 and TN = 1632. From this data, we calculate the true positive rate, the false positive rate and the accuracy of the given algorithm on class 1 of the clustering dataset. From Eq. 9, Eq. 10 and Eq. 11, the TPR is 0.4200, the FPR is 0.0933 and the ACC is 0.8371. This implies that 42% of what actually belonged to class 1 was correctly classified, and 9% of the data which did not belong to class 1 was added to class 1. The overall efficiency of the algorithm with respect to class 1 is 83%. The ROC analyses for all the classes of the image segmentation dataset for the three algorithms are given in Table 6.
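The three quantities follow directly from the confusion-matrix counts; the standard definitions (assumed to match Eq. 9-11, which are not reproduced in this excerpt) reproduce the values quoted above:

```python
def roc_metrics(TP, FN, FP, TN):
    """Standard ROC quantities from confusion-matrix counts."""
    TPR = TP / (TP + FN)                   # true positive rate (Eq. 9)
    FPR = FP / (FP + TN)                   # false positive rate (Eq. 10)
    ACC = (TP + TN) / (TP + FN + FP + TN)  # accuracy (Eq. 11)
    return TPR, FPR, ACC

# Class 1 values quoted in the text:
# roc_metrics(126, 174, 168, 1632) -> (0.42, 0.0933..., 0.8371...)
```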