Calculating correlation coefficients is a method of determining whether or not two sets of data are related to one another. In addition, correlation coefficients can tell you whether the datasets are positively or negatively (inversely) related. Positive relationships exist when the values in the data change in the same direction, either going down or up at the same time. Inverse relationships exist when values in the datasets go up and down contrary to one another. Both positively and inversely related datasets can be useful forecasting indicators as long as the correlation coefficient between the two datasets is strong. If we know that two datasets are related, then we can potentially use the known values in one dataset to estimate the unknown values in the other dataset.
A typical formula for calculating a correlation coefficient is Pearson's Correlation Coefficient, which returns a value between -1 and 1. Values closer to -1 and 1 are deemed to indicate a strong correlation (not causation!), while values closer to zero indicate a weak or no correlation. Negative values indicate an inverse relationship, while positive values indicate a positive relationship. Pearson's Correlation Coefficient is given by the following formula:
Here, n represents the count of dataset pairs, x represents one set of data points, and y represents the corresponding set of data points. r is the correlation coefficient.
However, when comparing two sets of data, you must be aware that one or both of the sets of data may exhibit what is called seasonality. In short, seasonality refers to regular, predictable patterns that occur within time series data. These patterns occur on a frequency that is less than a year, such as monthly or quarterly. There are many factors that might cause these patterns, such as vacations, holidays, and so on. The classic example of seasonality is retail sales in the United States, which predictably spike (peak) between Thanksgiving and Christmas. The problem with seasonality and calculating correlation coefficients is that, unless both sets of data exhibit the same seasonality, the calculation of the correlation coefficient will turn out to be weaker than it actually is. Thus, if you're dealing with time series data that exhibits seasonality, it is imperative that you de-seasonalize the data prior to calculating the correlation coefficient.
Years ago, I wrote one of the first – if not the first – implementations of using DAX to de-seasonalize data in order to calculate a correlation coefficient and forecast a future value. The article I wrote appeared on the venerable TechNet. Being fairly new to DAX, the process was very column-centric. Years later, Power BI community member Daniil published an excellent DAX correlation coefficient calculation as a measure. However, Daniil's formula did not include de-seasonalization. Microsoft then created, in my opinion, a deficient, Power BI Quick Measure based on Daniil's formula that is now included in Power BI Desktop. Somehow, the whole forecasting aspect of the correlation coefficient was lost over the years. But what good is having a correlation coefficient if you cannot forecast values with it? This recipe attempts to combine the entire history of calculating correlation coefficients in DAX and demonstrate how to de-seasonalize data before calculating a correlation coefficient and then, most importantly, forecasting future values.