Knowing the underlying (probability) distribution of your data has many modeling advantages. The easiest way to determine the underlying distribution is to visually inspect the random variable(s) using a histogram. With a candidate distribution, various plots can then be created, such as the Probability Density Function (PDF) plot, the Cumulative Distribution Function (CDF) plot, and the QQ plot. However, to determine the exact distribution parameters (e.g., loc, scale), quantitative methods are essential. In this blog, I will describe why it is important to determine the underlying probability distribution for your data set, what the differences are between parametric and non-parametric distributions, how to determine the best fit using a quantitative approach, and how to confirm it using visual inspections. Analyses are performed using the distfit library, and a notebook accompanies this post for easy access and experimenting.
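To make the quantitative approach concrete, here is a minimal sketch using scipy.stats as a stand-in for what a library like distfit automates: fit a few candidate distributions by maximum likelihood and rank them by a goodness-of-fit statistic (here the Kolmogorov-Smirnov statistic). The candidate list and the synthetic data are illustrative choices, not part of the original post.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration: 1000 draws from a normal distribution
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=1000)

# Fit each candidate by MLE and score it with the KS statistic
candidates = ["norm", "expon", "uniform"]
results = []
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(X)                     # MLE estimates (shape, loc, scale)
    ks = stats.kstest(X, name, args=params)  # goodness of fit vs. fitted CDF
    results.append((name, ks.statistic, params))

# Smaller KS statistic = better fit; the true (normal) family should rank first
results.sort(key=lambda r: r[1])
best_name, best_stat, best_params = results[0]
```

The fitted `loc` and `scale` in `best_params` are exactly the kind of distribution parameters the quantitative step is meant to recover.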
The importance of distribution fitting and Probability Density Functions.
The probability density function is a fundamental concept in statistics. Briefly, for a given random variable X, we aim to specify the function f that gives a natural description of the distribution of X. See also the terminology section at the bottom for more about probability density functions. Although there is a lot of great material that describes these concepts [1], it can remain challenging to understand why it is important to know the underlying distribution for your data set. Let me try to explain the importance with a small analogy. Suppose you need to go from location A to B: which type of car would you prefer? The answer is straightforward. You will likely start by exploring the terrain. With that information, you can then select the best-suited car (a sports car, a four-wheel drive, etc.). Logically, a sports car is better suited for smooth, flat terrain, while a four-wheel drive is better suited for rough, hilly terrain. In other words, without an exploratory analysis of the terrain, it is hard to select the best possible car. Yet such an exploratory step is easily forgotten or neglected in data modeling.
Before making modeling decisions, you need to know the underlying data distribution.
When it comes to data, it is equally important to explore its fundamental characteristics, such as skewness, kurtosis, outliers, and distribution shape (unimodal, bimodal, etc.). Based on these characteristics, it is easier to decide which models are best to use, because most models have prerequisites for the data. As an example, a well-known and popular technique is Principal Component Analysis (PCA). This method computes the covariance matrix, and its results are most reliable when the data is approximately multivariate normal. In addition, PCA is also known to be sensitive to outliers. Thus, before doing a PCA step, you need to know whether your data needs a (log)normalization or whether outliers need to be removed. More details about PCA can be found here [2].
Histograms can build a sense of intuition.
The histogram is a well-known plot in data analysis: a graphical representation of the distribution of a dataset. The histogram summarizes the number of observations that fall within each bin. With functions such as matplotlib's hist(), it is straightforward to make a visual inspection of the data. Varying the number of bins helps to identify whether the density resembles a common probability distribution by the shape of the histogram. An inspection will also give hints as to whether the data is symmetric or skewed and whether it has multiple peaks or outliers. In most cases, you will observe a distribution shape as depicted in Figure 1.
The bell shape of the Normal distribution.
The descending or ascending shape of an Exponential or Pareto distribution.
The flat shape of the Uniform distribution.
The complex shape that does not fit any of the theoretical distributions (e.g., multiple peaks).
In case you find distributions with multiple peaks (bimodal or multimodal), the peaks should not disappear with different numbers of bins. Bimodal distributions usually hint toward mixed populations. In addition, if you observe large spikes in density for a given value or a small range of values, it may point toward possible outliers. Outliers are expected to be far away from the rest of the density.
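That bin-stability check can be sketched in a few lines of numpy (the bimodal mixture and the peak-counting helper are illustrative, not part of the original post): if the two peaks are real, they should survive as the bin count changes.

```python
import numpy as np

# Synthetic bimodal sample: mixture of two well-separated normals
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 1000), rng.normal(3, 0.5, 1000)])

def count_peaks(data, bins):
    """Count strict local maxima among the interior histogram bins."""
    counts, _ = np.histogram(data, bins=bins)
    return sum(
        counts[i] > counts[i - 1] and counts[i] > counts[i + 1]
        for i in range(1, len(counts) - 1)
    )

# Real modes persist across a range of bin counts; spurious ones do not
peaks_coarse = count_peaks(x, 10)
```

With very fine binning, sampling noise can add spurious local maxima, which is exactly why the check should be run over a range of bin counts rather than a single one.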
A histogram is a great way to inspect a relatively small number of samples (random variables, or data points). However, when the number of samples increases, or more than two histograms are plotted, the visuals become cluttered, and a visual comparison with a theoretical distribution becomes difficult to judge. Instead, a Cumulative Distribution Function (CDF) plot or Quantile-Quantile (QQ) plot can be more insightful. But these plots require candidate theoretical distribution(s) that best match (or fit) the empirical data distribution. So let’s determine the best theoretical distribution in the next section! See also the terminology section at the bottom for more information about random variables and theoretical distributions.
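The ingredients of both plots can be sketched with scipy/numpy. This is a hedged stand-in using a normal candidate fitted by MLE (distfit produces equivalent plots from its best fit); the data and the candidate family are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration
rng = np.random.default_rng(7)
X = np.sort(rng.normal(10, 3, 500))

# Empirical CDF: step heights at the sorted sample values
ecdf = np.arange(1, len(X) + 1) / len(X)

# Candidate theoretical CDF with MLE-fitted parameters
loc, scale = stats.norm.fit(X)
tcdf = stats.norm.cdf(X, loc=loc, scale=scale)

# Largest vertical gap between the two curves (roughly the KS statistic);
# in a CDF plot this is the visual distance between the curves
gap = np.max(np.abs(ecdf - tcdf))

# QQ plot ingredients: theoretical vs. empirical quantiles, plus a
# correlation coefficient r that measures how straight the QQ line is
(osm, osr), (slope, intercept, r) = stats.probplot(X, dist="norm")
```

A small `gap` and an `r` close to 1 indicate that the candidate distribution tracks the empirical one closely, which is exactly what the CDF and QQ plots let you judge by eye.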