How to do Affinity based Clustering in R?

This recipe helps you do Affinity based Clustering in R

Recipe Objective

People prefer organised, logical groups of information over unorganised data. For example, anyone finds it easier to remember information when it is clustered together by its common characteristics.

Likewise, Clustering is a machine learning technique that provides a way to find groups/clusters of observations within a dataset. Because there is no response variable, it is considered an unsupervised method: the relationships between the 'n' observations are found without training on a response variable. A few applications of cluster analysis are:

  1. Customer segmentation: process for dividing customers into groups based on similar characteristics.
  2. Stock Market Clustering based on the performance of the stocks
  3. Reducing Dimensionality

Some of the most commonly used Clustering algorithms are:

  1. KMeans Clustering: commonly used when we have a large dataset
  2. Hierarchical or Agglomerative Clustering: commonly used when we have a small dataset
  3. Density based clustering (DBSCAN)
  4. Affinity Propagation

Affinity propagation is a clustering algorithm developed by Frey and Dueck that identifies exemplars among data points and forms clusters of data points around these exemplars. One of the drawbacks of KMeans is that it is sensitive to the initial random selection of exemplars. Affinity propagation overcomes this problem, and we do not need to specify the number of clusters in advance: it computes the optimal number of clusters for us.
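As a quick illustration of this property, the following sketch (using hypothetical toy data, not the Mall dataset) runs apcluster() on two well-separated point clouds. Note that we never tell it how many clusters to find:

```r
library(apcluster)

# two synthetic groups of points (toy data, for illustration only)
set.seed(42)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

# run affinity propagation with negative squared distances as similarity
res <- apcluster(negDistMat(r = 2), toy)

length(res@clusters)   # number of clusters, determined automatically
res@exemplars          # indices of the exemplar points
```

The number of clusters comes out of the message-passing procedure itself; only the "preference" value (by default the median similarity) influences how many exemplars emerge.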

This recipe demonstrates Affinity Propagation clustering on a real-life Mall dataset to carry out customer segmentation in R.

STEP 1: Importing Necessary Libraries

# For Data Manipulation
library(tidyverse)

# For the Clustering algorithm
library(cluster)
install.packages("apcluster")
library(apcluster)

# For cluster visualisation
library(factoextra)

STEP 2: Loading the Dataset

Dataset description: It is a basic data about the customers going to the supermarket mall. This can be used for customer segmentation. There are 200 observations(customers) and no missing data.

It consists of five columns, i.e. measured attributes:

  1. CustomerID is the customer identification number.
  2. Gender is Female and Male.
  3. Age is the age of customers.
  4. Annual Income (k) is the annual income of clients in thousands of dollars.
  5. Spending Score (1-100) is the spending score assigned by the shopping center according to the customer's purchasing behavior

# creating a dataframe customer_seg
customer_seg = read.csv('R_350_Mall_Customers.csv')

# getting the required information about the dataset
glimpse(customer_seg)

Observations: 200
Variables: 5
$ CustomerID              1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ Gender                  Male, Male, Female, Female, Female, Female, ...
$ Age                     19, 21, 20, 23, 31, 22, 35, 23, 64, 30, 67, ...
$ Annual.Income..k..      15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 19, ...
$ Spending.Score..1.100.  39, 81, 6, 77, 40, 76, 6, 94, 3, 72, 14, 99,...

For the simplicity of demonstrating Affinity Propagation clustering with visualisation, we will only consider two measured attributes (Age and Annual Income).

# assigning columns 3 and 4 to a new dataset customer_prep
customer_prep = customer_seg[3:4]

STEP 3: Data Preprocessing (Scaling)

This is a pre-modelling step. In this step, the data must be scaled or standardised so that different attributes become comparable. Standardised data has mean zero and standard deviation one. We do this using the scale() function.

Note: Scaling is a mandatory pre-modelling step.

# scaling the dataset
customer_prep = scale(customer_prep)
customer_prep %>% head()

Age		Annual.Income..k..
-1.4210029	-1.734646
-1.2778288	-1.734646
-1.3494159	-1.696572
-1.1346547	-1.696572
-0.5619583	-1.658498
-1.2062418	-1.658498
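We can sanity-check the standardisation (assuming customer_prep from the step above): each column should now have mean approximately zero and standard deviation one.

```r
# column means should be ~0 (up to floating-point rounding)
round(colMeans(customer_prep), 10)

# column standard deviations should be 1
apply(customer_prep, 2, sd)
```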

STEP 4: Performing Affinity based clustering

We use the apcluster(s = , x = ) function to carry out this task.

  1. s = a similarity function or matrix for the input data. The choice negDistMat(r=2), i.e. negative squared distances, is the standard similarity measure used in the papers of Frey and Dueck.
  2. x = input data

# running affinity propagation with negative squared distances as similarity
a = apcluster(negDistMat(r=2), x=customer_prep)

# printing the result summary
a

APResult object

Number of samples     =  200 
Number of iterations  =  138 
Input preference      =  -2.981604 
Sum of similarities   =  -28.08527 
Sum of preferences    =  -38.76086 
Net similarity        =  -66.84613 
Number of clusters    =  13 

Exemplars:
   16 21 31 45 49 83 90 106 124 164 175 191 196
Clusters:
   Cluster 1, exemplar 16:
      1 2 3 4 6 8 14 16 18 22 30 32 34 36
   Cluster 2, exemplar 21:
      5 7 10 12 15 17 20 21 24 26 28 29 39
   Cluster 3, exemplar 31:
      9 11 13 19 25 31 41 54
   Cluster 4, exemplar 45:
      23 27 33 35 37 43 45 47 51 55 56 57 60 64 67
   Cluster 5, exemplar 49:
      38 40 42 44 46 48 49 50 52 53 59 70
   Cluster 6, exemplar 83:
      58 61 63 65 68 71 73 74 75 83 91 103 107 109 110 111 117
   Cluster 7, exemplar 90:
      72 77 80 81 84 86 87 90 93 97 99 102 105 108 118 119 120 129 131
   Cluster 8, exemplar 106:
      62 66 69 76 79 85 88 92 96 98 100 101 104 106 112 114 115 116 121 125 133 
      135 139 163
   Cluster 9, exemplar 124:
      78 82 89 94 95 113 122 123 124 127 128 130 132 137 140 151 152 153 154 
      157 167
   Cluster 10, exemplar 164:
      126 134 136 138 142 143 144 145 146 148 149 150 156 158 159 160 162 164 
      166 168 169 170 171 172 173 174 176 178
   Cluster 11, exemplar 175:
      141 147 155 161 165 175 177 179 183 187
   Cluster 12, exemplar 191:
      180 181 182 184 185 186 188 189 190 191 192
   Cluster 13, exemplar 196:
      193 194 195 196 197 198 199 200

# optimal number of clusters
cat("optimal number of clusters:", length(a@clusters), "\n")

optimal number of clusters: 13 
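The a@clusters slot is a list of index vectors, one per cluster. For downstream work (e.g. attaching a segment label to each customer), it is often handy to flatten this into a membership vector; a sketch, assuming the APResult object a from the step above:

```r
# build a membership vector: one cluster label per observation
labels <- rep(NA_integer_, nrow(customer_prep))
for (k in seq_along(a@clusters)) {
  labels[a@clusters[[k]]] <- k
}

# cluster sizes
table(labels)
```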

STEP 5: Cluster Visualization

# cluster visualisation
plot(a, customer_prep)

This plot helps us analyse the different clusters of customers formed, so that we can target the respective clusters separately in our marketing strategy.
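As an alternative view, the apcluster package also provides a heatmap method for APResult objects, which displays the aggregated similarity structure of the clusters; a sketch, assuming a and customer_prep from the steps above:

```r
# heatmap of the clustering result; negDistMat(customer_prep, r = 2)
# recomputes the similarity matrix used during clustering
heatmap(a, negDistMat(customer_prep, r = 2))
```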


