Traps to avoid when building a sales forecasting model

Over the past year, we have encountered several companies that have been unable to build effective sales forecasting models. Indeed, several businesspeople appear quite cynical about the possibility of analyzing their retail or distribution network and ultimately predicting, with reasonable accuracy, whether a new site will be profitable for their company and, if so, to what extent.

After discussions with many corporate managers, several of whom work in the American market, I have been able to identify certain common elements regarding how these modeling exercises were conducted. It turns out that the same mistakes are often repeated from one company to the next. These methodological or conceptual errors alone account for the failure of these models to work properly.

I have thus drawn up a list of the most common errors committed in the creation of forecasting models used to evaluate the viability of future business sites. I have summarized these reasons below. Over the coming months, I will more thoroughly examine a few of these factors that merit a more systematic explanation.

Sales forecasting models often do not work because:

1. The size of the sample is too small: Parametric statistics rest on the normal distribution, the familiar bell curve, and a common rule of thumb calls for a sample of at least 30 observations. If the sample of sites or stores used for the modeling exercise is smaller than that, the model is unlikely to be representative of reality and cannot be relied upon to forecast the sales of future sites with any kind of accuracy.

2. Assumed linearity: The majority of the models we have encountered are based upon a simple linear regression equation. This statistical technique is often misused or overused, because no one verifies whether the phenomena being studied are in fact linear. Indeed, the relationship between the sales of a network of sites and the variables that explain them is seldom linear. In addition to linear models, log-linear, exponential and logarithmic models, among others, can also be used.

3. Collinearity is ignored: One often sees models that use variables with a strong collinear relationship. This is a major error: it inserts “noise” into the model and prevents other potentially explanatory variables from taking their proper place. Collinearity exists when two or more variables explain the same thing. For example, income, age and type of job are three variables that usually point in the same direction. It is thus advisable to perform a factor analysis prior to the modeling exercise in order to combine the effect of collinear variables and thereby reduce their undue weight.

4. Equality of variance ignored: A test for the equality of variance (homoscedasticity) exists for most statistical techniques used in linear modeling. This test is extremely important with regard to the model’s capacity to forecast the sales of a new site. When the variance of the model’s errors is not constant across the sample, the standard errors and prediction intervals become unreliable. The model may then describe this particular sample, but it cannot safely be extrapolated to another sample or to the reference population.

5. Commercial zones that are too big (aggregated variables that do not discriminate enough): Sales forecasting models must be built exclusively upon easily accessible socio-demographic or market data, because this is the only data that will ultimately be available for the analysis of any future site. When building sales models, one has to aggregate the socio-demographic or market variables available around existing sites in order to create the query file. Often, the commercial zone being used is too big and overlaps heavily with neighboring zones. One then ends up counting the same geographic areas for different sites, which limits the discriminating power of the variables being used.

6. Weighting of total sales: This point follows directly from the size of the commercial zone used to aggregate socio-demographic data. Unless one is analyzing a network of points of sale that are very far apart, it will never be possible to define a commercial zone that is large enough to include 100% of the sales and yet has no overlap at all. Total sales per store must thus be weighted by the percentage of sales that come from the primary commercial zone. Accordingly, if only 34% of the sales come from the primary zone (the transit effect), then one must try to model only 34% of total sales.

7. Segmentation of the sites: It is a generally accepted principle that not all customers are created equal, hence the fad for segmentation systems (Mosaic, Prismz, Focus). Commercial sites are not all created equal either. They differ in size, in pulling power, in age, in the products available and in the markets they serve. For that reason, it is wise to segment the sites before the modeling exercise and thus to build more than one model.

8. Accuracy of the model: Before using a forecasting model to evaluate the viability of a new site, it is worthwhile to validate its ability to estimate the sales of current sites. To accomplish this, one builds the model on one half of the sample of sites and then tests it on the other half. In this manner, one can confirm the accuracy of the model, whose success rate should exceed 85%.
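
The split-half check described in point 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a single predictor and a plain least-squares line, the function names are hypothetical, the data are synthetic, and "success rate" is taken here to mean the share of held-out sites whose predicted sales fall within 15% of actual sales. That definition and the 15% tolerance are my assumptions for the example, not a standard.

```python
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def split_half_success_rate(x, y, tolerance=0.15, seed=0):
    """Fit on a random half of the sites, then score on the other half.

    Returns the share of held-out sites whose predicted sales are within
    `tolerance` (relative error) of actual sales. Both the tolerance and
    this definition of success are assumptions made for illustration.
    """
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    half = len(order) // 2
    train, test = order[:half], order[half:]

    intercept, slope = fit_simple_ols(
        [x[i] for i in train], [y[i] for i in train]
    )

    hits = 0
    for i in test:
        predicted = intercept + slope * x[i]
        if abs(predicted - y[i]) <= tolerance * abs(y[i]):
            hits += 1
    return hits / len(test)


# Synthetic data only: 60 made-up sites, one made-up predictor.
rng = random.Random(42)
households = [rng.uniform(1000, 10000) for _ in range(60)]
sales = [200000 + 90 * h + rng.gauss(0, 20000) for h in households]

rate = split_half_success_rate(households, sales)
print(f"Share of held-out sites predicted within 15%: {rate:.0%}")
```

With real data, the sales figures passed in would be the weighted sales from point 6, and a separate model would be validated this way for each segment from point 7.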

As I write these few lines, four additional errors that distort the accuracy and reliability of certain sales forecasting models come to mind: a random blend of independent corporate and socio-demographic variables, sloppy analyses (univariate, followed by bivariate, and then multivariate), the use of incomplete or presumably complete information, and the use of variables created at different aggregation levels. I will discuss these elements in a future article.
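
Returning to point 3 in the list above, a quick bivariate screen can reveal collinear variables before any model is built. The sketch below is not a factor analysis; it is a simpler stand-in that flags pairs of candidate variables whose Pearson correlation is strong. The variable names, the figures and the 0.8 threshold are all assumptions made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def collinear_pairs(variables, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `variables` maps a variable name to its list of values, one per site.
    The 0.8 threshold is an arbitrary assumption, not a fixed rule.
    """
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Made-up values for eight hypothetical sites.
demo = {
    "median_income": [42, 55, 61, 38, 70, 49, 66, 58],
    "median_age": [31, 38, 41, 29, 47, 35, 44, 40],
    "traffic_count": [12, 7, 15, 9, 8, 14, 6, 11],
}

for name_a, name_b, r in collinear_pairs(demo):
    print(f"{name_a} and {name_b} move together (r = {r})")
```

Variables flagged this way are candidates for being combined into a single factor, or for having one of the pair dropped, before the modeling exercise begins.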

All too often, building a sales forecasting model is taken lightly. In addition, people tend to think that statistical analysis software can do all the work. This is simply not the case. Statistical modeling is a discipline in which logical thinking and an intimate familiarity with the phenomena under consideration categorically trump technology.