Customer Clustering

Clustering customers based on income and amount spent on products

Posted by Begen Yussupov on November 24, 2021 · 14 mins read

Companies are investing in and exploring strategies designed to retain their current customer base, acquire new customers, and increase customer lifetime value.

As competition rises, customer relationship management plays a significant role in identifying and analyzing a company’s most valuable customers and adopting the best marketing strategies.

This project illustrates a clustering technique that identifies customers with similar characteristics and behaviors and segregates them into homogeneous clusters.

We assume that there are distinct groups of customers who behave differently and follow different approaches in their spending and purchasing habits.

So the main aim of the project is to identify different customer types and segment them into clusters of similar profiles, so that target marketing can be executed effectively and efficiently.

As a result, the company can develop high-quality, long-term customer relationships that increase loyalty, growth, and profit.

In this project, we will use the KMeans clustering algorithm. Clustering is a data mining technique used in areas such as machine learning, pattern recognition, and classification.

Our dataset has information about mall visitors, such as income and the total amount spent on certain products. Using the KMeans algorithm, we will separate those customers into several clusters. The marketing department can then make customized product offers aimed at increasing sales. So our algorithm builds a clustering model of the given dataset.

Once the model has been fit to previously seen data, it can be used to predict and understand new observations.
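A minimal sketch of that fit-then-predict workflow with scikit-learn’s KMeans (the toy income and spend values here are made up purely for illustration, not taken from the dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two obvious groups of (income, total spent) points.
X = np.array([[20_000, 100], [22_000, 150], [80_000, 900], [85_000, 950]])

# Fit the model on previously seen data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# New, previously unseen customers are assigned to the nearest centroid.
new_customers = np.array([[21_000, 120], [82_000, 920]])
labels = km.predict(new_customers)
```

The low-income and high-income newcomers land in different clusters, which is exactly what lets us reason about new observations with an already-fitted model.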

We have data on 2240 customers visiting stores, with the following information:

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

pd.options.display.max_columns = None
pd.options.display.max_rows = None
np.set_printoptions(suppress=True)
df = pd.read_csv('marketing_campaign.csv', sep=';')
df.head()

Below are the features we have:

  • Education level
  • Marital status
  • Kids at home
  • Teen at home
  • Income
  • Amounts spent on fish products
  • Amounts spent on meat products
  • Amounts spent on fruits
  • Amounts spent on sweet products
  • Amounts spent on gold products
  • Amounts spent on wines
  • Number of purchases made with discounts
  • Number of purchases made via catalog
  • Number of purchases made in store
  • Website purchases
  • Number of visits to website
  • Number of days since the last purchase

We also have data on customer acceptance of campaigns 1 to 5.

Now let’s look at the shape of our data and information about the dataframe, including the data type of each column and the memory usage of the entire dataset.

df.shape

The data contains 2240 rows and 28 columns.

The pandas info() method prints information about the dataframe, including the data types and non-null counts.

df.info()


 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2240 non-null   int64  
 1   Education            2240 non-null   object 
 2   Marital_Status       2240 non-null   object 
 3   Income               2216 non-null   float64
 4   Kidhome              2240 non-null   int64  
 5   Teenhome             2240 non-null   int64  
 6   Dt_Customer          2240 non-null   object 
 7   Recency              2240 non-null   int64  
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumDealsPurchases    2240 non-null   int64  
 15  NumWebPurchases      2240 non-null   int64  
 16  NumCatalogPurchases  2240 non-null   int64  
 17  NumStorePurchases    2240 non-null   int64  
 18  NumWebVisitsMonth    2240 non-null   int64  
 19  AcceptedCmp3         2240 non-null   int64  
 20  AcceptedCmp4         2240 non-null   int64  
 21  AcceptedCmp5         2240 non-null   int64  
 22  AcceptedCmp1         2240 non-null   int64  
 23  AcceptedCmp2         2240 non-null   int64  
 24  Complain             2240 non-null   int64  
 25  Z_CostContact        2240 non-null   int64  
 26  Z_Revenue            2240 non-null   int64  
 27  Response             2240 non-null   int64  
dtypes: float64(1), int64(24), object(3)
memory usage: 490.1+ KB

The describe() method gives us descriptive statistics of the dataframe.

df.describe().T
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.013393 0.114976 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Z_CostContact 2240.0 3.000000 0.000000 3.0 3.00 3.0 3.00 3.0
Z_Revenue 2240.0 11.000000 0.000000 11.0 11.00 11.0 11.00 11.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0

Let’s see if the dataset has any categorical data.

df.describe(include='O').T
count unique top freq
Education 2240 5 Graduation 1127
Marital_Status 2240 8 Married 864
Dt_Customer 2240 663 2012-08-31 12

Feature Engineering

Our features must not contain missing or null values or outliers.

As part of feature engineering, we will check our data for missing or null values. We can find out whether our dataset contains any null values by running the following line of code:

df.isnull().sum() 
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

We can see that Income contains 24 missing values. We could simply drop those rows. In our case, however, we will replace the missing values with the average income. This should not significantly affect our results.

df['Income']=df['Income'].fillna(df['Income'].mean())
df.isnull().sum()
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64
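As an aside, when a feature contains extreme values, as Income does here, the median is a more robust fill value than the mean. A small sketch with made-up numbers (not part of the analysis in this post):

```python
import numpy as np
import pandas as pd

# Toy Income column with one missing value and one extreme outlier.
df = pd.DataFrame({"Income": [30_000, 50_000, np.nan, 666_666]})

# The median (50,000) is barely affected by the 666,666 outlier,
# whereas the mean would be pulled far above any typical income.
df["Income"] = df["Income"].fillna(df["Income"].median())
```

Here the missing value is filled with 50,000 rather than the mean of roughly 248,889, which the outlier would otherwise dominate.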

The next step is to check whether there are any outliers in our dataset. We’ll write a detect_outliers function to find them.

def detect_outliers(frame):
    # Draw a boxplot for every numeric column; outliers appear as
    # points beyond the whiskers.
    for col in frame.columns:
        if frame[col].dtype in ('int64', 'float64'):
            sns.boxplot(x=frame[col])
            plt.show()

detect_outliers(df)

[Figure: boxplots of the numeric features]

We can see that Income has an extremely high value that is an outlier. Such an outlier has to be removed, as we do not expect a small number of visitors with extremely high incomes.

We have several options for dealing with outliers. In our case, we will drop the rows where Income lies more than three standard deviations from the mean. This is a reasonable approach and should not bias our outcome significantly.

df = df[np.abs(df.Income - df.Income.mean()) <= (3 * df.Income.std())]
sns.boxplot(x=df['Income'])

[Figure: boxplot of Income after removing the outlier]

The extreme outlier has now been removed, as the boxplot above shows.

Let’s take a look at the correlation matrix between the features.

plt.figure(figsize=(30, 30))
# numeric_only avoids errors from the object columns in newer pandas
sns.heatmap(df.corr(numeric_only=True), annot=True)

[Figure: correlation heatmap of the features]

Modeling

KMeans clustering is one of the most popular clustering algorithms, as it is simple and efficient to use. The aim of the KMeans algorithm is to divide M points in N dimensions into K clusters fixed a priori. K cluster points that will act as centroids are placed in the space among the data points, and each data point is assigned to the centroid to which its distance is smallest. The algorithm finishes when the objective function, the sum of squared errors, reaches a minimum.
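The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a simplified single Lloyd iteration, not the full scikit-learn implementation, and the points and starting centroids are invented for illustration:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    # Distance from every point to every centroid, shape (n_points, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array(
        [X[labels == j].mean(axis=0) for j in range(len(centroids))]
    )
    return labels, new_centroids

# Two well-separated groups and two rough starting centroids.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, centroids = kmeans_step(X, centroids)
```

Repeating this step until the assignments stop changing is exactly what drives the sum of squared errors down to a (local) minimum.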

KMeans clustering requires the number of clusters as an input. To identify that number, we will use the elbow method, which helps us find the recommended optimal number of clusters.

# TotalSpent is the sum of the amounts spent across all product categories
df['TotalSpent'] = (df['MntWines'] + df['MntFruits'] + df['MntMeatProducts']
                    + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds'])

X1 = df[['Income', 'TotalSpent']].values
clusters = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    kmeans.fit(X1)
    clusters.append(kmeans.inertia_)
plt.plot(range(1, 11), clusters)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In the figure below, it can be observed that the elbow point occurs at K=4. After K=4, the differences are not significant. Hence we selected K=4 clusters and will use it as the input to our KMeans algorithm.

[Figure: elbow plot of inertia against the number of clusters]

X = df[['Income', 'TotalSpent']].copy()  # copy avoids SettingWithCopyWarning

km_4 = KMeans(n_clusters=4, init='k-means++', random_state=0)
km_4.fit(X)
centroids = km_4.cluster_centers_
X['Labels'] = km_4.labels_

plt.figure(figsize=(12, 8))

sns.scatterplot(data=X, x='Income', y='TotalSpent', hue='Labels',
                palette=sns.color_palette('hls', 4))

plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200)

plt.title('Customer clusters by income and total spend', fontsize=18)
plt.show()

[Figure: scatter plot of the four customer clusters with their centroids]

As shown above, the scatter plot of the clusters plots Income on the x-axis against TotalSpent on the y-axis. The data points in each cluster are drawn in a different color, and the centroids are shown in red.

Based on this clustering, each group’s values should be studied, and the marketing department should develop appropriate strategies aimed at increasing customer loyalty and profits.
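One simple way to study each group’s values is to average the features per cluster label. The sketch below uses a tiny made-up frame with the same column names as in this post, so the numbers are illustrative only:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame standing in for the mall data (column names match the post).
df = pd.DataFrame({
    "Income":     [20_000, 25_000, 90_000, 95_000],
    "TotalSpent": [150, 200, 1_200, 1_400],
})

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[["Income", "TotalSpent"]])
df["Labels"] = km.labels_

# Average income and spend per cluster gives marketing a profile of
# each segment, e.g. "low income / low spend" vs "high income / high spend".
profile = df.groupby("Labels")[["Income", "TotalSpent"]].mean()
```

Each row of `profile` summarizes one segment, which is the kind of per-group view marketing can build concrete offers around.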