Customer Clustering

Clustering customers based on income and amount spent on products

Posted by Begen Yussupov on November 24, 2021 · 14 mins read

Companies are investing in and exploring strategies designed to retain their current customer base, acquire new customers, and increase customer lifetime value.

As competition rises, customer relationship management plays a significant role in identifying and analyzing a company’s most valuable customers and adopting the best marketing strategies.

This project illustrates a clustering technique that identifies customers with similar characteristics and behaviors and segregates them into homogeneous clusters.

We assume that there are distinct groups of customers who behave differently and follow different approaches in their spending and purchasing habits.

So the main aim of the project is to identify different customer types and segment them into clusters of similar profiles, so that target marketing can be executed effectively and efficiently.

As a result, the company can develop high-quality, long-term customer relationships that increase loyalty, growth, and profit.

In this project, we will use the KMeans clustering algorithm. Clustering is a data mining technique used in areas such as machine learning, pattern recognition, and classification.

Our dataset has information about mall visitors, such as income and the total amount spent on certain products. Using the KMeans algorithm, we will separate those customers into several clusters. The marketing department can then make customized product offers aimed at increasing sales. So our algorithm builds a clustering model of the given dataset.

Once the model has been fit to previously seen data, it can be used to predict and understand new observations.
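A minimal sketch of that fit-then-predict workflow with scikit-learn’s KMeans (the toy income and spend values here are made up purely for illustration, not taken from the dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two obvious groups of (income, total spent) points.
X = np.array([[20_000, 100], [22_000, 150], [80_000, 900], [85_000, 950]])

# Fit the model on previously seen data.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# New, previously unseen customers are assigned to the nearest centroid.
new_customers = np.array([[21_000, 120], [82_000, 920]])
labels = km.predict(new_customers)
```

The low-income and high-income newcomers land in different clusters, which is exactly what lets us reason about new observations with an already-fitted model.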

We have data on 2240 customers visiting stores, with the following information:

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

pd.options.display.max_columns = None
pd.options.display.max_rows = None
np.set_printoptions(suppress=True)
df = pd.read_csv('marketing_campaign.csv', sep=';')
df.head()

Below are the features we have:

  • Education level
  • Marital status
  • Kids at home
  • Teen at home
  • Income
  • Amounts spent on fish products
  • Amounts spent on meat products
  • Amounts spent on fruits
  • Amounts spent on sweet products
  • Amounts spent on gold products
  • Amounts spent on wines
  • Number of purchases made with discounts
  • Number of purchases made via catalog
  • Number of purchases made in store
  • Website purchases
  • Number of visits to website
  • Number of days since the last purchase

We also have data on customer acceptance of campaigns 1 to 5.

Now let’s look at the shape of our data and information about the dataframe, including the data type of each column and the memory usage of the entire dataset.

df.shape

The data contains 2240 rows and 28 columns.

The pandas info() method prints information about the dataframe, including the data types and non-null counts.

df.info()


 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Year_Birth           2240 non-null   int64  
 1   Education            2240 non-null   object 
 2   Marital_Status       2240 non-null   object 
 3   Income               2216 non-null   float64
 4   Kidhome              2240 non-null   int64  
 5   Teenhome             2240 non-null   int64  
 6   Dt_Customer          2240 non-null   object 
 7   Recency              2240 non-null   int64  
 8   MntWines             2240 non-null   int64  
 9   MntFruits            2240 non-null   int64  
 10  MntMeatProducts      2240 non-null   int64  
 11  MntFishProducts      2240 non-null   int64  
 12  MntSweetProducts     2240 non-null   int64  
 13  MntGoldProds         2240 non-null   int64  
 14  NumDealsPurchases    2240 non-null   int64  
 15  NumWebPurchases      2240 non-null   int64  
 16  NumCatalogPurchases  2240 non-null   int64  
 17  NumStorePurchases    2240 non-null   int64  
 18  NumWebVisitsMonth    2240 non-null   int64  
 19  AcceptedCmp3         2240 non-null   int64  
 20  AcceptedCmp4         2240 non-null   int64  
 21  AcceptedCmp5         2240 non-null   int64  
 22  AcceptedCmp1         2240 non-null   int64  
 23  AcceptedCmp2         2240 non-null   int64  
 24  Complain             2240 non-null   int64  
 25  Z_CostContact        2240 non-null   int64  
 26  Z_Revenue            2240 non-null   int64  
 27  Response             2240 non-null   int64  
dtypes: float64(1), int64(24), object(3)
memory usage: 490.1+ KB

The describe() method gives us descriptive statistics of the dataframe.

df.describe().T
count mean std min 25% 50% 75% max
Year_Birth 2240.0 1968.805804 11.984069 1893.0 1959.00 1970.0 1977.00 1996.0
Income 2216.0 52247.251354 25173.076661 1730.0 35303.00 51381.5 68522.00 666666.0
Kidhome 2240.0 0.444196 0.538398 0.0 0.00 0.0 1.00 2.0
Teenhome 2240.0 0.506250 0.544538 0.0 0.00 0.0 1.00 2.0
Recency 2240.0 49.109375 28.962453 0.0 24.00 49.0 74.00 99.0
MntWines 2240.0 303.935714 336.597393 0.0 23.75 173.5 504.25 1493.0
MntFruits 2240.0 26.302232 39.773434 0.0 1.00 8.0 33.00 199.0
MntMeatProducts 2240.0 166.950000 225.715373 0.0 16.00 67.0 232.00 1725.0
MntFishProducts 2240.0 37.525446 54.628979 0.0 3.00 12.0 50.00 259.0
MntSweetProducts 2240.0 27.062946 41.280498 0.0 1.00 8.0 33.00 263.0
MntGoldProds 2240.0 44.021875 52.167439 0.0 9.00 24.0 56.00 362.0
NumDealsPurchases 2240.0 2.325000 1.932238 0.0 1.00 2.0 3.00 15.0
NumWebPurchases 2240.0 4.084821 2.778714 0.0 2.00 4.0 6.00 27.0
NumCatalogPurchases 2240.0 2.662054 2.923101 0.0 0.00 2.0 4.00 28.0
NumStorePurchases 2240.0 5.790179 3.250958 0.0 3.00 5.0 8.00 13.0
NumWebVisitsMonth 2240.0 5.316518 2.426645 0.0 3.00 6.0 7.00 20.0
AcceptedCmp3 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp4 2240.0 0.074554 0.262728 0.0 0.00 0.0 0.00 1.0
AcceptedCmp5 2240.0 0.072768 0.259813 0.0 0.00 0.0 0.00 1.0
AcceptedCmp1 2240.0 0.064286 0.245316 0.0 0.00 0.0 0.00 1.0
AcceptedCmp2 2240.0 0.013393 0.114976 0.0 0.00 0.0 0.00 1.0
Complain 2240.0 0.009375 0.096391 0.0 0.00 0.0 0.00 1.0
Z_CostContact 2240.0 3.000000 0.000000 3.0 3.00 3.0 3.00 3.0
Z_Revenue 2240.0 11.000000 0.000000 11.0 11.00 11.0 11.00 11.0
Response 2240.0 0.149107 0.356274 0.0 0.00 0.0 0.00 1.0

Let’s see if the dataset has any categorical data.

df.describe(include='O').T
count unique top freq
Education 2240 5 Graduation 1127
Marital_Status 2240 8 Married 864
Dt_Customer 2240 663 2012-08-31 12

Feature Engineering

Our features must not contain missing or null values or outliers.

As part of feature engineering, we will check our data for missing or null values. We can find out whether our dataset contains any null values by running the following line of code:

df.isnull().sum() 
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

We can see that Income contains 24 missing values. We could simply drop those rows. In our case, however, we will replace the missing values with the average income. This should not significantly affect our results.

df['Income']=df['Income'].fillna(df['Income'].mean())
df.isnull().sum()
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
MntWines               0
MntFruits              0
MntMeatProducts        0
MntFishProducts        0
MntSweetProducts       0
MntGoldProds           0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Z_CostContact          0
Z_Revenue              0
Response               0
dtype: int64
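As an aside, when a feature contains extreme values, as Income does here, the median is a more robust fill value than the mean. A small sketch with made-up numbers (not part of the analysis in this post):

```python
import numpy as np
import pandas as pd

# Toy Income column with one missing value and one extreme outlier.
df = pd.DataFrame({"Income": [30_000, 50_000, np.nan, 666_666]})

# The median (50,000) is barely affected by the 666,666 outlier,
# whereas the mean would be pulled far above any typical income.
df["Income"] = df["Income"].fillna(df["Income"].median())
```

Here the missing value is filled with 50,000 rather than the mean of roughly 248,889, which the outlier would otherwise dominate.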

The next step is to check whether there are any outliers in our dataset. We’ll write a detect_outliers function to find them.

def detect_outliers(frame):
    # Draw a boxplot for every numeric column; outliers appear as
    # points beyond the whiskers.
    for col in frame.columns:
        if frame[col].dtype in ('int64', 'float64'):
            sns.boxplot(x=frame[col])
            plt.show()

detect_outliers(df)

[Figure: boxplots of the numeric features]

We can see that Income has an extremely high value that is an outlier. Such an outlier has to be removed, as we do not expect a small number of visitors with extremely high incomes.

We have several options for dealing with outliers. In our case, we will drop the rows where Income lies more than three standard deviations from the mean. This is a reasonable approach and should not bias our outcome significantly.

df = df[np.abs(df.Income - df.Income.mean()) <= (3 * df.Income.std())]
sns.boxplot(x=df['Income'])

[Figure: boxplot of Income after removing the outlier]

The extreme outlier has now been removed, as the boxplot above shows.

Let’s take a look at the correlation matrix between the features.

plt.figure(figsize=(30, 30))
# numeric_only avoids errors from the object columns in newer pandas
sns.heatmap(df.corr(numeric_only=True), annot=True)

[Figure: correlation heatmap of the features]

Modeling

KMeans clustering is one of the most popular clustering algorithms, as it is simple and efficient to use. The aim of the KMeans algorithm is to divide M points in N dimensions into K clusters fixed a priori. K cluster points that will act as centroids are placed in the space among the data points, and each data point is assigned to the centroid to which its distance is smallest. The algorithm finishes when the objective function, the sum of squared errors, reaches a minimum.
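The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a simplified single Lloyd iteration, not the full scikit-learn implementation, and the points and starting centroids are invented for illustration:

```python
import numpy as np

def kmeans_step(X, centroids):
    """One Lloyd iteration: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    # Distance from every point to every centroid, shape (n_points, k).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array(
        [X[labels == j].mean(axis=0) for j in range(len(centroids))]
    )
    return labels, new_centroids

# Two well-separated groups and two rough starting centroids.
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
labels, centroids = kmeans_step(X, centroids)
```

Repeating this step until the assignments stop changing is exactly what drives the sum of squared errors down to a (local) minimum.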

KMeans clustering requires the number of clusters as an input. To identify that number, we will use the elbow method, which helps us find the recommended optimal number of clusters.

# TotalSpent is the sum of the amounts spent across all product categories
df['TotalSpent'] = (df['MntWines'] + df['MntFruits'] + df['MntMeatProducts']
                    + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds'])

X1 = df[['Income', 'TotalSpent']].values
clusters = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=0)
    kmeans.fit(X1)
    clusters.append(kmeans.inertia_)
plt.plot(range(1, 11), clusters)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

In the figure below, it can be observed that the elbow point occurs at K=4. After K=4, the differences are not significant. Hence we selected K=4 clusters and will use it as the input to our KMeans algorithm.

[Figure: elbow plot of inertia against the number of clusters]

X = df[['Income', 'TotalSpent']].copy()  # copy avoids SettingWithCopyWarning

km_4 = KMeans(n_clusters=4, init='k-means++', random_state=0)
km_4.fit(X)
centroids = km_4.cluster_centers_
X['Labels'] = km_4.labels_

plt.figure(figsize=(12, 8))

sns.scatterplot(data=X, x='Income', y='TotalSpent', hue='Labels',
                palette=sns.color_palette('hls', 4))

plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200)

plt.title('Customer clusters by income and total spend', fontsize=18)
plt.show()

[Figure: scatter plot of the four customer clusters with their centroids]

As shown above, the scatter plot of the clusters plots Income on the x-axis against TotalSpent on the y-axis. The data points in each cluster are drawn in a different color, and the centroids are shown in red.

Based on this clustering, each group’s values should be studied, and the marketing department should develop appropriate strategies aimed at increasing customer loyalty and profits.
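One simple way to study each group’s values is to average the features per cluster label. The sketch below uses a tiny made-up frame with the same column names as in this post, so the numbers are illustrative only:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame standing in for the mall data (column names match the post).
df = pd.DataFrame({
    "Income":     [20_000, 25_000, 90_000, 95_000],
    "TotalSpent": [150, 200, 1_200, 1_400],
})

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(df[["Income", "TotalSpent"]])
df["Labels"] = km.labels_

# Average income and spend per cluster gives marketing a profile of
# each segment, e.g. "low income / low spend" vs "high income / high spend".
profile = df.groupby("Labels")[["Income", "TotalSpent"]].mean()
```

Each row of `profile` summarizes one segment, which is the kind of per-group view marketing can build concrete offers around.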