This project serves as the capstone for my Master’s in Data Analytics. It employs a Random Forest Regressor and a K-Nearest Neighbors (KNN) Regressor to predict customer spending based on available features in the dataset. Model performance is evaluated using Mean Absolute Error (MAE) and R-squared (R²) metrics. Additionally, an Analysis of Variance (ANOVA) is conducted to compare each model’s effectiveness in explaining the variance within the dataset.
Research Question
Can a predictive Random Forest Regressor model be constructed on the retail dataset?
A null hypothesis and an alternative hypothesis provide direction for this study:
Null Hypothesis (H0) – A Random Forest Regressor cannot be created to predict customer spending with an accuracy of at least 80%.
Alternative Hypothesis (H1) – A Random Forest Regressor can be created to predict customer spending with an accuracy of at least 80%.
Data Acquisition
The dataset was sourced from Kaggle and contains 100,000 records of simulated data with the following variables:
Target Variable – Dependent | Purchase_Amount |
Feature Variables – Independent | Age, Gender, Income, Education, Region, Loyalty_Status, Purchase_Frequency, Product_Category, Promotion_Usage, Satisfaction_Score |
Remaining Variables – Not Used in Analysis | Customer_ID |
Data Cleaning
The dataset was inspected, and no null values or duplicates were found. Outliers were identified using the Interquartile Range (IQR) method but were determined to have minimal impact on the analysis.
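The cleaning checks described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual code: the values, the `iqr_outlier_mask` helper, and the conventional 1.5 × IQR multiplier are all assumptions.

```python
import pandas as pd

# Tiny made-up frame standing in for the retail dataset (values are illustrative only).
df = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4, 5, 6],
    "Income": [30000, 32000, 31000, 29000, 33000, 90000],
    "Purchase_Amount": [5000, 5200, 5100, 4900, 5300, 20000],
})


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


null_count = int(df.isnull().sum().sum())     # total missing cells
duplicate_count = int(df.duplicated().sum())  # fully duplicated rows
outliers = iqr_outlier_mask(df["Purchase_Amount"])

print(f"nulls={null_count}, duplicates={duplicate_count}, outliers={int(outliers.sum())}")
```

Whether flagged rows are dropped or kept is a separate decision; here they are only counted, which matches the post's conclusion that the outliers were left in place.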
Data Exploration
Univariate visualizations were used to examine the distribution of each variable, while bivariate visualizations explored relationships with the target variable (Purchase_Amount). The use of non-parametric methods was justified by confirming that the residuals of a fitted linear model were not normally distributed.
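One way to check residual normality is sketched below. The post does not say which test was used, so the D'Agostino-Pearson test (`scipy.stats.normaltest`) is an assumption, as are the synthetic data and the single-predictor linear fit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in: one predictor and a target with skewed (non-normal) noise.
x = rng.uniform(0, 10, size=2000)
y = 3.0 * x + rng.exponential(scale=2.0, size=2000)

# Fit an ordinary least squares line (intercept + slope) and compute its residuals.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

# D'Agostino-Pearson test: the null hypothesis is that the residuals are normal.
statistic, p_value = stats.normaltest(residuals)
residuals_look_normal = p_value >= 0.05
print(f"p-value = {p_value:.3g}, residuals look normal: {residuals_look_normal}")
```

A small p-value here rejects normality of the residuals, which is the kind of evidence the post cites for preferring non-parametric models.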
Model Development
Both the Random Forest Regressor and KNN Regressor were optimized using GridSearchCV, which determined the best hyperparameters while incorporating cross-validation to reduce the risk of overfitting.
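The tuning step could look something like the sketch below. The parameter grids, the 3-fold setting, the MAE-based scoring, and the synthetic data are all assumptions chosen to keep the example small; the post does not list the grids that were actually searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the encoded retail features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=300)

# Hypothetical, deliberately small grids; each candidate is scored by
# cross-validated mean absolute error (negated, as scikit-learn maximizes scores).
searches = {
    "random_forest": GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
        cv=3,
        scoring="neg_mean_absolute_error",
    ),
    "knn": GridSearchCV(
        KNeighborsRegressor(),
        param_grid={"n_neighbors": [3, 5, 9]},
        cv=3,
        scoring="neg_mean_absolute_error",
    ),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, f"CV MAE = {-search.best_score_:.3f}")
```

After fitting, `best_estimator_` on each search holds the model refit on all supplied data with the winning hyperparameters.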
Model Evaluation
The two models produced the following evaluation metrics:
Model | MAE | R² |
Random Forest Regressor | 1,108 | 0.89 |
KNN Regressor | 1,837 | 0.75 |
The Random Forest Regressor had a Mean Absolute Error (MAE) of $1,108, meaning its predictions were, on average, $1,108 off from the actual Purchase Amount. In comparison, the KNN Regressor’s MAE was $1,837, indicating larger errors. The R² score of 0.89 for the Random Forest Regressor suggests it explained 89% of the variation in Purchase Amount, whereas the KNN Regressor, with an R² of 0.75, explained only 75%.
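For readers unfamiliar with the two metrics, the sketch below computes them on four made-up predictions, once with scikit-learn and once by hand. The numbers are illustrative only and unrelated to the project's results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up actual and predicted purchase amounts.
y_true = np.array([10_000.0, 12_000.0, 8_000.0, 15_000.0])
y_pred = np.array([11_000.0, 11_500.0, 9_000.0, 14_000.0])

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# The same quantities from their definitions.
mae_manual = np.mean(np.abs(y_true - y_pred))        # average absolute miss
ss_res = np.sum((y_true - y_pred) ** 2)              # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)       # total variation
r2_manual = 1 - ss_res / ss_tot

print(f"MAE = {mae:.0f}, R² = {r2:.3f}")
```

MAE stays in the units of the target (dollars), while R² is unitless, which is why the post reports both.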
This analysis also employed ANOVA to compare the variance between actual purchase amounts and the predictions made by the Random Forest Regressor and KNN Regressor:
Source | Sum of Squares | Degrees of Freedom | F-statistic | P-Value |
Random Forest Regressor Predictions | 5.030419e+10 | 1.0 | 21405.013625 | 0.000000 |
KNN Regressor Predictions | 3.840113e+06 | 1.0 | 1.63412 | 0.201164 |
Residual | 4.699520e+10 | 19997.0 | NaN | NaN |
The Random Forest Regressor’s P-value is below the 0.05 threshold, which indicates that it explains a statistically significant portion of the variance of the data. Conversely, the KNN Regressor’s P-value of 0.201 suggests that it does not. Therefore, the ANOVA results show that the Random Forest Regressor is significantly better at explaining the variance in the data than the KNN Regressor.
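A table of this shape (one row per set of predictions plus a Residual row) can be produced by regressing the actual values on both models' predictions and running an ANOVA on that fit. The sketch below assumes that setup, a Type II ANOVA from statsmodels, and made-up predictions; the column names `actual`, `rf_pred`, and `knn_pred` are hypothetical, and the post does not confirm that this is how its table was built.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 500
actual = rng.uniform(1_000, 26_000, size=n)

# Made-up predictions: "rf_pred" tracks the actual values closely, "knn_pred" is noisier.
frame = pd.DataFrame({
    "actual": actual,
    "rf_pred": actual + rng.normal(scale=1_400, size=n),
    "knn_pred": actual + rng.normal(scale=2_300, size=n),
})

# Regress actual values on both prediction columns, then run a Type II ANOVA.
model = smf.ols("actual ~ rf_pred + knn_pred", data=frame).fit()
table = anova_lm(model, typ=2)
print(table)
```

Under this assumed setup, each row tests whether that column adds explanatory power once the other column is already in the model, so a large P-value for one set of predictions means it contributes little beyond the other.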
Feature Importance
The feature importance analysis from the Random Forest Regressor indicates that Income was the only variable with a substantial impact on the prediction of a customer’s purchase amount.
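Feature importances of this kind are typically read from a fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data in which the target is driven almost entirely by Income, purely to illustrate the mechanics; the three columns and the data-generating rule are assumptions, not the project's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "Income": rng.uniform(20_000, 120_000, size=n),
    "Age": rng.integers(18, 80, size=n),
    "Satisfaction_Score": rng.integers(1, 11, size=n),
})
# Synthetic target that depends on Income plus a little noise.
y = 0.2 * X["Income"] + rng.normal(scale=500, size=n)

forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based scores rank features by how much they reduce error in the trees; they are not a significance test.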

Data Summary and Implications
To put the MAE into context, the range of Purchase Amount was calculated to determine how each model’s MAE compared to the range of the data:
Purchase Amount Maximum – Purchase Amount Minimum = Purchase Amount Range
26,204 – 1,118 = 25,086
5% of the Purchase Amount Range is 1,254
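The arithmetic above can be reproduced in a few lines. The figures come from the post; the variable names are only for illustration.

```python
purchase_max = 26_204
purchase_min = 1_118
purchase_range = purchase_max - purchase_min  # 25,086
threshold = 0.05 * purchase_range             # about 1,254

mae_by_model = {"Random Forest Regressor": 1_108, "KNN Regressor": 1_837}

# Express each model's MAE as a share of the target's range.
share_of_range = {name: mae / purchase_range for name, mae in mae_by_model.items()}
for name, share in share_of_range.items():
    print(f"{name}: MAE is {share:.1%} of the range")
```

This prints roughly 4.4% for the Random Forest Regressor and 7.3% for the KNN Regressor.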
The Random Forest Regressor had an MAE of 1,108, which is less than 1,254; on average, its predictions were off by less than 5% of the range of the data. Conversely, the KNN Regressor had an MAE of 1,837, which is greater than 1,254; on average, its predictions were off by more than 5% of the range of the data.
R² represents the proportion of variance in the target variable that each model can explain with the predictor variables. The Random Forest Regressor explained about 89% of the variance in Purchase Amount based on the features in the dataset, whereas the KNN Regressor explained only 75%. Because regression models do not have an “accuracy” metric in the way classification models do, R² was selected as the best indicator of each model’s accuracy. With an R² of 0.89, above the 80% threshold, the Null Hypothesis, which stated that a Random Forest Regressor cannot be created to predict customer spending with an accuracy of at least 80%, is rejected.
The results from the ANOVA test show that the predictions made by the Random Forest Regressor are statistically significant, with a reported P-value of 0.00, below the 0.05 threshold. Meanwhile, the predictions made by the KNN Regressor have a P-value of 0.201, above the 0.05 threshold. The ANOVA test provides evidence that there is a statistically significant difference in the variance explained by each model and that the difference is unlikely to be due to chance.
These results show that the business should be comfortable putting the Random Forest Regressor into production to begin making predictions of how much customers are likely to spend.