A Machine Learning approach to Lead Conversion Score Prediction — Python

Prasad Surapaneni
18 min read · Jul 7, 2021


Introduction:

Most businesses run marketing campaigns to reach out to customers and generate leads, but the challenging task is to categorize those leads so that the potential ones can be identified and converted into customers.

Identifying potential leads and focusing on them saves the company a lot of time and money, which can then be invested in other business promotions.

This case study deals with lead scoring with the help of Machine Learning algorithms. The main focus of the study is how to increase the productivity of marketing campaigns and use resources efficiently by reducing manual effort.

Business Problem:

A bank conducts a marketing campaign to offer a term deposit to its existing clients. The bank wants to identify the leads with the highest chance of conversion. For each client it has basic information, loan details, previous campaign details (if the client was already contacted) and some attributes related to the social and economic context.

The conversion status for the previous campaigns is also captured in the system. We need to predict the lead conversion probability based on the data available from both the previous campaigns and the current campaign.

Use of ML:

It takes a lot of human effort to categorize leads manually by just looking at the data, and that is feasible only when the number of captured input values is very limited.

This is exactly where Machine Learning comes in handy. A well-designed ML application can drastically reduce human effort and time, and it is also likely to produce more accurate results than human judgement.

In this case study, let us discuss one such end-to-end ML approach in detail, in which various models and performance metrics were tried.

Data Source:

The data set for this case study has been taken from a publicly available source: https://archive.ics.uci.edu/ml/datasets/bank+marketing

Data set Overview:

The data contains 21 columns in total, including the target variable, which shows the conversion status for each client. A detailed explanation of each attribute can be found at the source mentioned above. All the columns in the data set are listed below, and each attribute is described in the EDA section.

As we can see, there are no null values in any of the attributes in the dataset. However, the dataset is pre-filled with missing values as “unknown”, which we will see later in the EDA section.

To make the programming easier, I have divided the attributes into numerical and categorical variables; the remaining variable is the target variable “y”. There are 10 numerical features and 10 categorical features in the dataset.
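The original post shows these lists as an image; based on the UCI documentation and the variable names used later in the code, the split would look roughly like this (a reconstruction, not the exact code from the study):

num_features = ['age', 'duration', 'campaign', 'pdays', 'previous',
                'emp.var.rate', 'cons.price.idx', 'cons.conf.idx',
                'euribor3m', 'nr.employed']

cat_features = ['job', 'marital', 'education', 'default', 'housing',
                'loan', 'contact', 'month', 'day_of_week', 'poutcome']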

There are 41,188 instances in the data set, out of which 4,640 leads were converted into customers who actually subscribed to a term deposit offered by the bank. If we consider the converted leads as positive instances and the rest as negative instances, the proportion of positive to negative instances shows that the data is imbalanced.
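A quick way to check this imbalance, assuming the raw data is loaded into a dataframe called data with the target column y:

# class balance: only about 11% of the leads converted ("yes"), so the data is imbalanced
print(data['y'].value_counts())
print(data['y'].value_counts(normalize=True) * 100)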

Existing Solutions:

Some solutions to this problem already exist in Machine Learning communities and on blog platforms; most of them focus on lift index, AUC and F1 score as the performance metric.

Links to some research papers and blogs are provided in the reference section of this blog. In the current solution, I have tried to maximize the Recall score to ensure the company doesn’t lose too many potential leads.

A research paper on a similar problem discussed the lift index computation in detail, and that computation is also used in the current solution.

Improvements to the existing approaches:

As mentioned earlier, in this approach I have focused more on the Recall value. Since the conversion rate is very low, at 11 percent of the total leads, the company cannot afford to lose potential leads. Aiming for a higher recall value may increase the number of false positive instances, but we don’t miss too many leads that are likely to convert.

In the process of maximizing the recall value and the lift index, I got a slightly higher number of false positive predictions, accounting for about 10 percent of the total leads. The company can afford to reach out to this extra 10 percent of leads in order to capture 90+ percent of the actual clients who are most likely to subscribe to a term deposit.

Exploratory Data Analysis:

We will explore the data in detail and analyze each attribute with the help of the most popular Python libraries: pandas, numpy, matplotlib and seaborn. All attributes are analyzed with plotting tools, and observations for each plot/table are written in detail.

Let’s start exploring each input variable or feature…

Note: here the words variable, attribute and feature are used interchangeably.

Uni-variate Analysis: duration -> last contact duration in seconds

  • From the above plot, it can be clearly observed that the rate of conversion is high when the call duration is long. There is some overlap between the two classes when the call duration is low.
  • It seems there are some outliers in the data in terms of call duration.

Based on the observations from the above box plot for last contact call duration, let’s find the percentiles of the duration.

We can observe that the duration value is below 2,175 seconds for 99.9 percent of the instances, while the maximum duration is around 5,000 seconds. Let’s drop the data points whose duration is greater than 2,175 seconds, as sketched below.
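A minimal sketch of how this percentile check and filtering could be done (variable names follow the dataframe used later in the post):

import numpy as np

# upper percentiles of the last contact duration
for p in [90, 95, 99, 99.9, 100]:
    print(p, np.percentile(data['duration'], p))

# keep only the rows whose duration is within the 99.9th percentile cutoff
duration_cutoff = np.percentile(data['duration'], 99.9)
data = data[data['duration'] <= duration_cutoff]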

Uni-variate Analysis: age -> client age

  • From the above box plot for the age attribute, age alone does not have much impact on the conversion rate.

Let’s try another plot for the age feature.

Below is the plot for probability distribution and cumulative distribution of the age feature.

  • From the CDF curves in the above plot, we can observe that the rate of conversion decreases slightly as age increases.
  • The rate of lead conversion is slightly higher for people below 40 years of age.

To analyze the rest of the variables, I have written a function which plots a bar chart for the given attribute and also displays a category table with the conversion percentage for each category; a sketch of such a helper is shown below.
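The exact helper is not shown in the post; here is a minimal sketch of what it could look like, assuming the raw 'yes'/'no' target labels at EDA time (names are illustrative):

import matplotlib.pyplot as plt

def plot_category_conversion(df, feature, target='y'):
    """Bar chart of lead counts per category plus a table with the
    conversion percentage for each category."""
    summary = df.groupby(feature)[target].agg(total='count',
                                              converted=lambda s: (s == 'yes').sum())
    summary['conversion_pct'] = 100 * summary['converted'] / summary['total']
    summary['total'].plot(kind='bar', figsize=(8, 4), title=feature)
    plt.ylabel('number of leads')
    plt.show()
    print(summary.sort_values('conversion_pct', ascending=False))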

Uni-variate Analysis: job -> type of client’s job

  • It appears the rate of conversion is high for students and retired people.

Uni-variate Analysis: marital -> marital status of the client

  • The conversion rate varies between 10% and 15% irrespective of marital status.

Uni-variate Analysis: education -> educational qualification of the client

  • The conversion rate varies between 7% and 15% across all education levels except illiterate, which is a very small group.
  • Education is “unknown” for nearly 4% of the data instances.

Uni-variate Analysis: default -> if client has any credit in default?

  • People who are not credit defaulters have the highest subscription rate to term deposits.
  • Around 21% of the data instances have an unknown default status.

Uni-variate Analysis: housing -> if the client has housing loan?

  • The term deposit subscription rate does not depend much on having a housing loan, and the housing loan status is unknown for 2% of the data.

Uni-variate Analysis: loan -> if the client has personal loan?

  • The term deposit subscription rate does not depend much on having a personal loan, and the personal loan status is unknown for 2% of the data.

Uni-variate Analysis: contact -> communication type

  • The conversion rate is higher, at 14%, for the ‘cellular’ contact type.

Uni-variate Analysis: month -> last contact month

  • The number of leads is high in May, July, August, June, November and April.
  • However, the conversion rate is much higher in March, December, September and October, even though the number of leads in those months is small.

Uni-variate Analysis: day_of_week -> last contact day of the week

  • The day of the week alone does not have a significant impact on the conversion rate.

Uni-variate Analysis: campaign -> number of contacts performed during the current campaign for this client

  • Leads with a small number of contacts in the current campaign account for most of the data and also show a better rate of conversion.

Since campaign is a numerical feature, we can also use a box plot to check the data.

  • It seems campaign also has some outliers; let us check the percentiles.
  • Since the 99.9th percentile value is double the 99th percentile value, we can take the 99th percentile value as the cutoff and remove the 406 rows above it.

Uni-variate Analysis: pdays -> number of days that passed by after the client was last contacted from a previous campaign

  • pdays = 999 means the client was not contacted previously; 96% of the leads are completely new leads.
  • The conversion rate is high for clients who were contacted in previous campaigns, but the number of leads in that category is very small.

Uni-variate Analysis: previous -> number of contacts performed before this campaign for this client

  • 86% of the people were not contacted in previous campaigns.
  • The conversion rate increases as the number of contacts in previous campaigns increases.

Uni-variate Analysis: poutcome -> outcome of the previous marketing campaign

  • For 86% of the people there was no previous campaign contact, so the previous outcome is recorded as “nonexistent”.

Bi-variate Analysis: correlation table for numerical features

Using pandas’ corr() function and seaborn’s heatmap, we can plot the correlation matrix as shown below.
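A minimal sketch of how such a heatmap could be produced, assuming the raw ‘yes’/‘no’ target is mapped to 0/1 so its correlation with each feature shows up in the same plot:

import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix of the numerical features plus the binarized target
corr = data[num_features].assign(y=data['y'].map({'no': 0, 'yes': 1})).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()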

From the above correlation table for numerical features, the observations are as follows…

  • “emp.var.rate” and “euribor3m” are highly correlated.
  • “euribor3m” and “nr.employed” are highly correlated.
  • So “emp.var.rate”, “euribor3m” and “nr.employed” are all highly correlated with each other. We can drop two of these three variables while training, since doing so will not make a significant difference in performance.
  • Variables like duration and previous are positively correlated with the target variable.
  • The variables nr.employed, pdays, euribor3m and emp.var.rate are negatively correlated with the target variable.

Bi-variate analysis: association matrix for categorical features

Interpretations of the above association matrix for the categorical variables…

  • Housing and Loan are moderately associated with each other
  • Contact and Month are moderately associated with each other
  • poutcome (outcome of the previous marketing campaign), month (last contact month), job and contact (contact communication type) are highly associated with the target variable.
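The post does not say which association measure was used for this matrix; Cramér’s V is one common choice for pairs of categorical variables, and a sketch of it (uncorrected version, names illustrative) looks like this:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V association between two categorical series (0 = none, 1 = perfect)."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.values.sum()
    r, k = confusion.shape
    return np.sqrt((chi2 / n) / (min(r, k) - 1))

cols = cat_features + ['y']
assoc = pd.DataFrame([[cramers_v(data[a], data[b]) for b in cols] for a in cols],
                     index=cols, columns=cols)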

First-cut Approach:

Data Cleaning:

Let us remove outliers from the data based on the EDA. We have already observed outliers in the call duration and campaign values.

  • We are left with 98.91 percent of the original data after removing outliers for training and modeling purposes.
  • Since the “emp.var.rate”, “euribor3m” and “nr.employed” attributes are highly correlated, we drop “emp.var.rate” and “euribor3m” and keep only “nr.employed”. All three variables are negatively correlated with the target variable, but “nr.employed” has the strongest negative correlation with the target compared to the other two.
  • We should convert all the feature text to lower case and also modify some values of the categorical variables. Since missing values are filled with “unknown” in all categorical features, encoding would produce duplicate column names (“unknown”) belonging to different categorical variables; the same applies to “yes” and “no”, which appear in multiple categorical variables. To avoid this conflict, we prefix each value with its feature name, as sketched after this list.
  • Among the categorical variables, job and education are highly associated with each other, and education has very little impact on the target variable. Dropping education from the categorical variables does not make much difference.
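A minimal sketch of the lower-casing and prefixing step described above (assuming the cleaned dataframe is still called data):

# lower-case the text values and prefix each with its column name so that
# encoded columns like "unknown", "yes" and "no" stay unambiguous across features
for col in cat_features:
    data[col] = col + '_' + data[col].str.lower()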

Train Test split:

Split the data into train and test sets for training and evaluation purposes. Let’s split the data in a stratified fashion using the class labels.

from sklearn.model_selection import train_test_split

# stratified 80:20 split on the class labels
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('y', axis=1), data.y.values,
    stratify=data.y, test_size=0.2, random_state=10)

print("Train data shape: ", X_train.shape, y_train.shape)
print("Test data shape: ",X_test.shape, y_test.shape)

Train data shape: (32592, 19) (32592,)
Test data shape: (8148, 19) (8148,)

Data Transformation:

Transform all numerical features with a standard scaler. We fit the scaler on the training data only, so that no information from the test data leaks into training and we know the model’s true prediction capability.

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training data only, then transform the numerical
# features and keep them in new dataframes
scaler = StandardScaler()
scaler = scaler.fit(X_train[num_features])
X_train_processed = X_train[num_features].copy()
X_test_processed = X_test[num_features].copy()
X_train_processed[num_features] = scaler.transform(X_train[num_features])
X_test_processed[num_features] = scaler.transform(X_test[num_features])

One-hot encoding was used for the categorical feature transformation.

from sklearn.preprocessing import OneHotEncoder

# fit on the train data only; wrap the encoded output in dataframes so it can be concatenated below
onehotencoder = OneHotEncoder(handle_unknown='ignore')
cat_features_ohe = onehotencoder.fit(X_train[cat_features])
ohe_cols = cat_features_ohe.get_feature_names_out(cat_features)
X_train_ohe = pd.DataFrame(cat_features_ohe.transform(X_train[cat_features]).toarray(), columns=ohe_cols, index=X_train.index)
X_test_ohe = pd.DataFrame(cat_features_ohe.transform(X_test[cat_features]).toarray(), columns=ohe_cols, index=X_test.index)

Finally merge both transformed numerical features and categorical features into one final set for training and evaluation.

X_train_final = pd.concat([X_train_ohe, X_train_processed], axis=1)
X_test_final = pd.concat([X_test_ohe, X_test_processed], axis=1)

print("Final processed train data and test data shapes: ")
print(X_train_final.shape, X_test_final.shape)

Final processed train data and test data shapes: 
(32592, 60) (8148, 60)

Having tried all the basic classification models (Logistic Regression, Naive Bayes, SVM with different kernels, Decision Tree, Random Forest, XGBoost, CatBoost and LGBM), I found that the LGBM classifier performs slightly better than the other classifiers.

Accuracy, AUC, F1 score and Lift index are at their best for LGBM classifier.

The above plot shows the accuracy and confusion matrix values on the test data with the LGBM classifier. The basic classifier model gives an accuracy of around 91.68%. But accuracy is not our metric for this problem, so let’s move on to other metrics.

The baseline model achieved a Train AUC of 96.94 and a Test AUC of 94.90.

We can see the precision, recall and F1 scores in the above image. Since our focus is the recall value for the positive class, the baseline model gives a recall of 54%. That means around 46% of the actual positive instances are predicted as negative; we are missing almost half of the leads that have a higher chance of conversion.

The above plot is the cumulative gain curve, or lift curve, which shows the actual gain for each chunk of data/leads. The orange line in the plot represents the positive class.

In the cumulative gain curve, the X-axis shows the percentile bins of the total data points and the Y-axis shows the cumulative share of actual converted leads captured up to each bin. Looking at the above curve, around 96 to 98 percent of the total converted leads fall within the first 30 percentile chunks.

That means that by reaching out to just 30% of the total leads, the company can capture 96 to 98% of the leads that are most likely to convert.
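A compact sketch of how such a gain table could be computed with pandas (assuming y_true is encoded as 0/1; names are illustrative):

import numpy as np
import pandas as pd

def cumulative_gain_table(y_true, y_prob, n_bins=10):
    """Sort leads by predicted probability, cut them into percentile bins and
    return the cumulative share of actual conversions captured up to each bin."""
    df = pd.DataFrame({'y': y_true, 'prob': y_prob}).sort_values('prob', ascending=False)
    df['bin'] = np.ceil(np.arange(1, len(df) + 1) / len(df) * n_bins).astype(int)
    return df.groupby('bin')['y'].sum().cumsum() / df['y'].sum()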

Below table is the performance comparison of all basic classifiers:

Then I did hyperparameter tuning for all the classification models trained earlier. While tuning, I also tuned the class_weight parameter for all classifiers to address the data imbalance by giving more weight to the minority class, i.e. the positive class in our problem.
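A minimal sketch of what such a recall-focused search could look like for one of the models; the parameter grid here is an illustrative assumption (the actual grids are not shown in the post) and the target is assumed to be encoded as 0/1 with 1 = converted lead:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# illustrative search space, including class_weight to handle the imbalance
param_dist = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 20, None],
    'class_weight': ['balanced', {0: 1, 1: 5}, {0: 1, 1: 10}],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=10),
                            param_dist, n_iter=10, scoring='recall',
                            cv=3, n_jobs=-1, random_state=10)
search.fit(X_train_final, y_train)
print(search.best_params_, search.best_score_)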

As we are focusing on maximizing recall, the Random Forest classifier achieved the best recall value of 91.97% after hyperparameter tuning and class-weight balancing.

Below is the performance comparison of all classifier models after hyperparameter tuning and class-weight handling.

Custom Stacking/Ensemble Model:

Having trained the basic classification models and the built-in ensemble models, I decided to build a custom ensemble model to leverage the diversity of a wide range of classifiers. Unlike popular ensemble models such as Random Forest, which uses bagging, and XGBoost, CatBoost and LGBM, which use boosting, our custom model uses another ensemble technique, stacking, combined with bagging.

To build this custom model, we split the data into three sets: D1, D2 and a test set. We first split the whole data into train and test in an 80:20 ratio, and then split the train data into two datasets, D1 and D2, in a 50:50 ratio.

The below snippet of code splits the whole dataset into train and test sets in an 80:20 ratio in a stratified fashion using the class labels.

# stratified 80:20 split on the class labels (same as in the first-cut approach)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop('y', axis=1), data.y.values,
    stratify=data.y, test_size=0.2, random_state=10)

print("Train data shape: ", X_train.shape, y_train.shape)
print("Test data shape: ",X_test.shape, y_test.shape)

Then split the train data into D1 and D2 sets in a 50:50 ratio using the same stratified technique on the target labels.

# split the train data set into D1 and D2 (stratified 50:50)
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_train, y_train, stratify=y_train, test_size=0.5, random_state=10)

print("D1 data set shape: ", X_d1.shape, y_d1.shape)
print("D2 data set shape: ",X_d2.shape, y_d2.shape)

After splitting the data into three sets, the shapes of the three datasets are…
D1 data set shape: (16296, 18) (16296,)
D2 data set shape: (16296, 18) (16296,)
Test data shape: (8148, 18) (8148,)

Data Transformation:

All numerical features are transformed with a standard scaler, and only the D1 data is used to fit the scaler. The D2 and Test datasets are transformed with the scaler trained on D1, so they remain unseen data for evaluation purposes.

# fit the scaler on D1 only so that D2 and the test set remain unseen
scaler = StandardScaler()
scaler = scaler.fit(X_d1[num_features])
X_d1_processed = X_d1[num_features].copy()
X_d2_processed = X_d2[num_features].copy()
X_test_processed = X_test[num_features].copy()
X_d1_processed[num_features] = scaler.transform(X_d1[num_features])
X_d2_processed[num_features] = scaler.transform(X_d2[num_features])
X_test_processed[num_features] = scaler.transform(X_test[num_features])

Categorical features are transformed with a one-hot encoder, which is also fitted on the D1 data only.

# fit the one-hot encoder on D1 only and wrap the outputs in dataframes
onehotencoder = OneHotEncoder(handle_unknown='ignore')
cat_features_ohe = onehotencoder.fit(X_d1[cat_features])
ohe_cols = cat_features_ohe.get_feature_names_out(cat_features)
X_d1_ohe = pd.DataFrame(cat_features_ohe.transform(X_d1[cat_features]).toarray(), columns=ohe_cols, index=X_d1.index)
X_d2_ohe = pd.DataFrame(cat_features_ohe.transform(X_d2[cat_features]).toarray(), columns=ohe_cols, index=X_d2.index)
X_test_ohe = pd.DataFrame(cat_features_ohe.transform(X_test[cat_features]).toarray(), columns=ohe_cols, index=X_test.index)

Then finally we need to merge the transformed numerical features and transformed categorical features into one dataset.

X_d1_final = pd.concat([X_d1_ohe, X_d1_processed], axis=1)
X_d2_final = pd.concat([X_d2_ohe, X_d2_processed], axis=1)
X_test_final = pd.concat([X_test_ohe, X_test_processed], axis=1)

At the end of this step, we have three transformed datasets in total: D1, D2 and Test.

Now, as the next step in our custom model, we generate random samples from dataset D1 and, for each sample, train a base learner chosen from a pool of base learners. Logistic Regression, SVC, Decision Tree, Random Forest, CatBoost and LGBM classifiers are chosen as the base learners in this study. Each time, a randomly chosen base learner from this list is trained on a sample generated from D1.

The number of samples and the set of base learners can be thought of as hyperparameters. Each randomly picked base learner is trained on one randomly generated sample, and we save that trained model. After training K base models on K samples (K can be chosen by hyperparameter tuning), we start predicting the target labels for the D2 and Test datasets with the trained models, as sketched below.
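The training loop and the predict_targets helper are not shown in the post; here is a minimal sketch under some assumptions: a reduced base-learner pool (the full study also uses SVC, CatBoost and LGBM), a sample size of half of D1, and a target already encoded as 0/1.

import random
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# illustrative pool of base learners
base_learner_pool = [LogisticRegression(max_iter=1000),
                     DecisionTreeClassifier(random_state=10),
                     RandomForestClassifier(n_estimators=100, random_state=10)]
K = 100                              # number of samples / base models
trained_models = []

for _ in range(K):
    # draw a bootstrap-style random sample from D1 (sample size is an assumption)
    sample_idx = np.random.choice(len(X_d1_final), size=len(X_d1_final) // 2, replace=True)
    X_sample, y_sample = X_d1_final.iloc[sample_idx], y_d1[sample_idx]
    # pick a random base learner, train a fresh copy on the sample and keep it
    model = clone(random.choice(base_learner_pool))
    model.fit(X_sample, y_sample)
    trained_models.append(model)

def predict_targets(X):
    """Stack the K base-model predictions column-wise into meta features of shape (len(X), K)."""
    return np.column_stack([m.predict(X) for m in trained_models])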

predict_labels = predict_targets(X_d2_final)
predict_labels_test = predict_targets(X_test_final)

So we get K predictions for each data point in the D2 and Test sets. We then stack all these predictions horizontally to create meta datasets for training and evaluating our custom stacked model. Let’s call these datasets D_meta_train and D_meta_test, and call the custom stacked model the meta model.

The shapes of the D_meta_train and D_meta_test sets become (number of points in D2, K) and (number of points in Test, K). For example, in our case, if we take K as 100 then…

Shape of D_meta_train is (16296, 100)

Shape of D_meta_test is (8148, 100)

Then we train an LGBM classifier as our meta classification model on the D_meta_train set and use the D_meta_test set to evaluate the meta model.

from lightgbm import LGBMClassifier

lgbm_meta_clf = LGBMClassifier(class_weight='balanced', random_state=10)
lgbm_meta_clf.fit(D_meta_train, y_d2)

The below plot shows the custom stacked model’s accuracy and confusion matrix on the D_meta_test set.

From the above confusion matrix values, we can observe that the model correctly predicts 90+ percent of the positive instances.

The above plot is the ROC-AUC curve for the custom model: Train AUC of 91.87 and Test AUC of 91.92, so there is no sign of over-fitting.

Precision, Recall and F1 score values for both classes for the custom model are shown in the above screenshot.

lift_index = get_lift_index(y_test, y_test_pred_prob)
lift_index

Lift Index value: 0.9383947939262473
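The get_lift_index function is not shown in the post; one common definition is the decile-based lift index from the direct-marketing literature (Ling & Li style), and a sketch of that version, assuming a 0/1 target, would look roughly like this. Treat the exact weighting as an assumption about the formula used in the study.

import numpy as np
import pandas as pd

def get_lift_index(y_true, y_prob, n_bins=10):
    """Decile-based lift index: sort leads by predicted probability, count the
    positives per decile S_i and return sum(w_i * S_i) / sum(S_i) with
    weights w_i = 1.0, 0.9, ..., 0.1."""
    df = pd.DataFrame({'y': y_true, 'prob': y_prob}).sort_values('prob', ascending=False)
    df['bin'] = np.ceil(np.arange(1, len(df) + 1) / len(df) * n_bins).astype(int)
    positives_per_bin = df.groupby('bin')['y'].sum().reindex(range(1, n_bins + 1), fill_value=0)
    weights = np.linspace(1.0, 1.0 / n_bins, n_bins)
    return (weights * positives_per_bin.values).sum() / positives_per_bin.sum()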

Finally, here is the cumulative gain curve for the test data set. As explained in detail in the first-cut model section, the company can reach around 94% of the roughly 11.8 percent of leads that are most likely to convert by targeting just 23% of the total leads.

Let’s look at the confusion matrix again for the final conclusion…
TP + FP = 10.24% + 12.38% = 22.62% of the total data points were predicted as the positive class by the model.
TP + FN = 10.24% + 1.08% = 11.32% of the total data points actually belong to the positive class.

That is, by targeting those ~22% of leads we capture 10.24% out of the 11.32% of total positive leads (90+% of the converted leads). In other words, if our model predicts a lead as positive, there is a 45–50% chance that the lead will actually convert; if a lead is predicted as negative, there is only a very low probability, around 1–2%, that it would have converted.

Based on this analysis, the company can target its clients smartly with reduced cost and effort, and spend the savings on business promotions and other business development activities.

Performance Comparison:

As mentioned earlier about the number of samples (K), different values of K were tried for random sample generation, and two classification models (LGBM and CatBoost) were trained as meta models. After experimenting with different K values, all the metrics are listed in the table below for comparison.

From the above comparison table, we can observe that LGBM with K=100 gives the best results in terms of Recall and Lift Index. The last column shows what percentage of leads must be targeted to achieve the best conversion rate; it is 26 percent of the total leads in our case.

Web Application and Deployment of the Model:

Once training and evaluation of the final model (LGBM with K=100) were done, a web application was developed to deploy the trained model for making predictions on unseen new data.

We do not train any model in the final web application; it only uses the trained base learners and the trained meta model, which were already saved as pickle files.
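A minimal sketch of how the saved artifacts could be loaded and used for inference inside the application (the pickle file names and helper name here are illustrative assumptions):

import pickle
import numpy as np

# load the already-trained artifacts saved during modeling
with open('base_learners.pkl', 'rb') as f:
    trained_models = pickle.load(f)
with open('meta_model.pkl', 'rb') as f:
    lgbm_meta_clf = pickle.load(f)

def predict_lead(X_new):
    """Score a new (already transformed) lead with the stacked pipeline."""
    meta_features = np.column_stack([m.predict(X_new) for m in trained_models])
    return lgbm_meta_clf.predict(meta_features)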

It can be deployed and hosted on any cloud service such as AWS, Google Cloud, etc.

Inference:

A web interface was developed to capture new lead data and predict the chance of conversion (positive/negative) when the data is saved/submitted. This interface was developed only for testing purposes; the actual production interface will have some additional fields, whereas this one was designed with only the fields used in model training.

prediction of new instances

Future Work:

As a further improvement to the current model, we can predict the conversion probability instead of the class label directly.
If the model returns a probability instead of a class label, the company can decide whether to contact a customer based on that probability score, for example as sketched below.
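With the existing meta model this would be a one-line change (a sketch, assuming the same D_meta_test features):

# probability of conversion for each lead instead of a hard 0/1 label
conversion_prob = lgbm_meta_clf.predict_proba(D_meta_test)[:, 1]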

Feature importance from the final model can also be explored to understand the most important features and target clients accordingly. As per our exploratory data analysis, last contact duration, month of contact, number of contacts before the current campaign (previous) and job are the most important features.
