Comparative Analysis of Non-Parametric Prediction Methods on Retail Dataset

This project serves as the capstone for my Master’s in Data Analytics. It employs a Random Forest Regressor and a K-Nearest Neighbors (KNN) Regressor to predict customer spending based on available features in the dataset. Model performance is evaluated using Mean Absolute Error (MAE) and R-squared (R²) metrics. Additionally, an Analysis of Variance (ANOVA) is conducted to compare each model’s effectiveness in explaining the variance within the dataset.

Research Question

Can a Random Forest Regressor be constructed to predict customer spending on the retail dataset with an accuracy of at least 80%?

A null hypothesis and an alternative hypothesis provide direction for this study: 

Null Hypothesis (H0) – A Random Forest Regressor cannot be created to predict customer spending with an accuracy of at least 80%. 

Alternate Hypothesis (H1) – A Random Forest Regressor can be created to predict customer spending with an accuracy of at least 80%. 

Data Acquisition

The dataset was sourced from Kaggle and contains 100,000 records of simulated data with the following variables:

Target Variable (Dependent) – Purchase_Amount
Feature Variables (Independent) – Age, Gender, Income, Education, Region, Loyalty_Status, Purchase_Frequency, Product_Category, Promotion_Usage, Satisfaction_Score
Remaining Variables (Not Used in Analysis) – Customer_ID
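For reference, the sketch below shows one way the data could be loaded and split into target and features with pandas. The file name retail_data.csv is an assumption for illustration; the original notebook was not published.

```python
import pandas as pd

# Load the Kaggle export (file name assumed for illustration).
df = pd.read_csv("retail_data.csv")

# Customer_ID is an identifier only, so it is excluded from modeling.
features = [
    "Age", "Gender", "Income", "Education", "Region", "Loyalty_Status",
    "Purchase_Frequency", "Product_Category", "Promotion_Usage",
    "Satisfaction_Score",
]
X = df[features]
y = df["Purchase_Amount"]
```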

Data Cleaning

The dataset was inspected, and no null values or duplicates were found. Outliers were identified using the Interquartile Range (IQR) method but were determined to have minimal impact on the analysis.
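The sketch below illustrates the inspection and IQR check described above, continuing from the loaded DataFrame; the exact code used in the project was not published, so this is only a representative version.

```python
# Confirm there are no missing values or duplicate records.
print(df.isnull().sum())
print(df.duplicated().sum())

# Flag potential outliers in the target with the 1.5 * IQR rule.
q1, q3 = df["Purchase_Amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Purchase_Amount"] < q1 - 1.5 * iqr) |
              (df["Purchase_Amount"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers flagged")
```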

Data Exploration

Univariate visualizations were used to examine the distribution of each variable, while bivariate visualizations explored relationships with the target variable (Purchase_Amount). The use of non-parametric methods was justified by confirming that the residuals of a fitted linear model were not normally distributed.
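As an illustration of that residual check, the sketch below fits an ordinary least squares model and applies a Shapiro-Wilk test to a sample of its residuals. The specific normality test and the one-hot encoding of the categorical features are assumptions; the write-up does not state which were used.

```python
import statsmodels.api as sm
from scipy import stats

# One-hot encode categorical features so a linear model can be fitted (assumed encoding).
X_num = pd.get_dummies(X, drop_first=True).astype(float)

# Fit OLS and test whether the residuals look normally distributed.
ols = sm.OLS(y, sm.add_constant(X_num)).fit()
stat, p_value = stats.shapiro(ols.resid.sample(5000, random_state=42))
print(f"Shapiro-Wilk p-value: {p_value:.4f}")  # p < 0.05 suggests non-normal residuals
```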

Model Development

Both the Random Forest Regressor and KNN Regressor were optimized using GridSearchCV, which determined the best hyperparameters while incorporating cross-validation to reduce the risk of overfitting.
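A minimal sketch of that tuning step is shown below, continuing from the encoded feature frame. The hyperparameter grids and the 80/20 train/test split are placeholder assumptions, since the original settings were not published.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set for evaluation (split ratio assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X_num, y, test_size=0.2, random_state=42)

# Grid search with 5-fold cross-validation for the Random Forest (grid values assumed).
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=5, scoring="neg_mean_absolute_error")
rf_search.fit(X_train, y_train)

# Grid search with 5-fold cross-validation for KNN (grid values assumed).
knn_search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [5, 10, 20], "weights": ["uniform", "distance"]},
    cv=5, scoring="neg_mean_absolute_error")
knn_search.fit(X_train, y_train)
```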

Model Evaluation

The two models produced the following evaluation metrics:

Model | MAE | R²
Random Forest Regressor | 1,108 | 0.89
KNN Regressor | 1,837 | 0.75

The Random Forest Regressor had a Mean Absolute Error (MAE) of $1,108, meaning its predictions were, on average, $1,108 off from the actual Purchase Amount. In comparison, the KNN Regressor’s MAE was $1,837, indicating larger errors. The R² score of 0.89 for the Random Forest Regressor suggests it explained 89% of the variation in Purchase Amount, whereas the KNN Regressor, with an R² of 0.75, explained only 75%.
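Continuing from the tuned models, these metrics can be reproduced on the held-out test set with scikit-learn, as sketched below.

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Score each tuned model on the test set.
for name, search in [("Random Forest", rf_search), ("KNN", knn_search)]:
    preds = search.best_estimator_.predict(X_test)
    print(name,
          "MAE:", round(mean_absolute_error(y_test, preds)),
          "R²:", round(r2_score(y_test, preds), 2))
```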

This analysis also employed ANOVA to compare the variance between actual purchase amounts and the predictions made by the Random Forest Regressor and KNN Regressor:

Source | Sum of Squares | Degrees of Freedom | F-statistic | P-Value
Random Forest Regressor Predictions | 5.030419e+10 | 1.0 | 21405.013625 | 0.000000
KNN Regressor Predictions | 3.840113e+06 | 1.0 | 1.63412 | 0.201164
Residual | 4.699520e+10 | 19997.0 | NaN | NaN

The Random Forest Regressor’s P-value is below the 0.05 threshold, which indicates that it explains a statistically significant portion of the variance of the data. Conversely, the KNN Regressor’s P-value of 0.201 suggests that it does not. Therefore, the ANOVA results show that the Random Forest Regressor is significantly better at explaining the variance in the data than the KNN Regressor.
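One plausible way to produce an ANOVA table of this shape is to regress the actual purchase amounts on both models' test-set predictions and run anova_lm on that fit; this reconstruction is an assumption, as the original code was not shown.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Regress actual spend on each model's predictions (variable names assumed).
anova_df = pd.DataFrame({
    "actual": y_test.values,
    "rf_pred": rf_search.best_estimator_.predict(X_test),
    "knn_pred": knn_search.best_estimator_.predict(X_test),
})
fit = smf.ols("actual ~ rf_pred + knn_pred", data=anova_df).fit()
print(anova_lm(fit))  # sum_sq, df, F, and PR(>F) for each predictor plus the residual
```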

Feature Importance

The feature importance analysis from the Random Forest Regressor indicates that Income was the only variable with a significant impact on the prediction of a customer's purchase amount.
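These importances can be pulled from the tuned model's feature_importances_ attribute, as in the short sketch below (continuing from the encoded feature frame used earlier).

```python
# Rank features by importance from the tuned Random Forest.
importances = pd.Series(
    rf_search.best_estimator_.feature_importances_,
    index=X_num.columns).sort_values(ascending=False)
print(importances)
```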

Data Summary and Implications

To put the MAE into context, the range of Purchase Amount was calculated to determine how each model’s MAE compared to the range of the data:

Purchase Amount Maximum – Purchase Amount Minimum = Purchase Amount Range

26,204 – 1,118 = 25,086

5% of the Purchase Amount Range is 1,254

The Random Forest Regressor's MAE of 1,108 is below 1,254, meaning its predictions were, on average, off by less than 5% of the range of the data. Conversely, the KNN Regressor's MAE of 1,837 exceeds 1,254, meaning its predictions were, on average, off by more than 5% of the range of the data.
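The same comparison can be expressed as a short check, using the MAE values reported above.

```python
# Compare each model's MAE to 5% of the Purchase_Amount range.
amount_range = df["Purchase_Amount"].max() - df["Purchase_Amount"].min()
threshold = 0.05 * amount_range                      # 0.05 * 25,086 ≈ 1,254
print("Random Forest within 5% of range:", 1108 <= threshold)  # True
print("KNN within 5% of range:", 1837 <= threshold)            # False
```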

R² represents the amount of variance in the target variable that each model can explain with the predictor variables. The Random Forest Regressor was able to explain 89% of the variance in Purchase Amount based on the features in the dataset, whereas the KNN Regressor was only able to explain 75%. Because regression models do not have an "accuracy" metric in the way classification models do, R² was selected as the best indicator of each model's accuracy. Therefore, the Null Hypothesis, which stated that a Random Forest Regressor cannot be created to predict customer spending with an accuracy of at least 80%, is rejected.

The results of the ANOVA test show that the predictions made by the Random Forest Regressor are statistically significant, with a P-value of 0.00, below the 0.05 threshold, while the predictions made by the KNN Regressor, with a P-value of 0.201, are not. The ANOVA therefore provides evidence that there is a statistically significant difference in the variance explained by each model and that the difference is unlikely to be due to chance.

These results show that the business should be comfortable putting the Random Forest Regressor into production to begin making predictions of how much customers are likely to spend.  

