Understanding the strength of the relationships between variables in a data set is important because variables with statistically similar behavior can affect the reliability of models. To detect such relationships and remove the so-called multicollinearity, we can use correlation measures for continuous variables. However, when a data set also contains categorical variables, and is thus of mixed type, testing for multicollinearity becomes even more challenging. Statistical tests, such as the Hypergeometric test and the Mann-Whitney U test, can be used to test for associations across variables in mixed data sets. Although these tests are powerful, they require various intermediate steps, such as typing the variables, one-hot encoding, and multiple-testing corrections, among others. This entire pipeline is readily implemented in a method named HNet. In this blog, I will demonstrate how to detect variables with similar behavior so that multicollinearity can be easily detected.
Data understanding is a crucial step.
Real-world data often contains measurements with both continuous and discrete values. We need to look at each variable and use common sense to determine whether variables can be related to each other. But when there are tens (or more) of variables, each of which can have multiple categorical states, manually checking all the variables becomes time-consuming and error-prone. We can automate this task by combining intensive pre-processing steps with statistical testing methods. This is where HNet [1, 2] comes into play: it uses statistical tests to determine the significant relationships across all variables in a data set. You can feed your raw, unstructured data into the model, and it outputs a network that sheds light on the complex relationships across variables. A minimal usage sketch is shown below; after that, let's go to the next section, where I will explain how to detect variables with similar behavior using statistical testing.
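To give an impression of the workflow, here is a minimal sketch based on the hnet library's documented interface; the exact parameters of the `hnet` class and its `association_learning` method may differ across versions, so treat this as an assumption rather than a definitive recipe.

```python
# pip install hnet
import pandas as pd
from hnet import hnet

# Load a mixed-type data set (the path is a placeholder for your own data).
df = pd.read_csv('my_mixed_dataset.csv')

# Initialize HNet; internally it types the variables, one-hot encodes
# the categorical ones, and applies multiple-testing corrections.
hn = hnet()

# Learn the significant associations across all variables.
results = hn.association_learning(df)

# Plot the resulting association network.
hn.plot()
```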
Detection of variables with similar behavior.
When we talk about multicollinearity, it means that variables in a data set show statistically similar behavior, which as a consequence can hamper the reliability and/or robustness of models. As an example, suppose we have a data set in which multiple measurements are taken from the same sensor. We can then easily compute the correlation between the variables and determine which variables are (in)dependent. In other words, we can set a threshold, such as r > 0.8, and remove the dependent variables (see the sketch below). However, in the case of a data set with categorical variables, we cannot compute correlations; we need to compute associations instead. From a statistical point of view, there are many ways to test for association, such as the Chi-square test, Fisher's exact test, and the Hypergeometric test. These tests are typically used when one or both of the variables are ordinal or nominal. In the next section, I will demonstrate how the Hypergeometric test can be used to analyze whether two variables are associated.
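For the continuous case, the filtering step can be sketched as follows; the threshold of 0.8 and the synthetic sensor data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic continuous data; x2 is nearly a copy of x1 (same sensor).
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
df = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(scale=0.1, size=500),  # highly correlated with x1
    'x3': rng.normal(size=500),                  # independent variable
})

# Absolute pairwise Pearson correlations.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one variable from every pair with r > 0.8.
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # ['x2']
```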
The Hypergeometric Test Detects Variables with Statistical Overlap.
The Hypergeometric test can be used to test whether two variables overlap in a certain state more than you would expect by chance. I will import the data science salary data set, which is derived from ai-jobs.net [3]. The data set contains 11 features for 4134 samples; the variables are shown in the code section below.
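As a sketch of how such a test works in practice, the snippet below uses `scipy.stats.hypergeom` to test whether two categorical states co-occur more often than expected by chance. The file path and the column names (`job_title`, `company_size`) are assumptions about the salary data set, so adjust them to your own copy.

```python
import pandas as pd
from scipy.stats import hypergeom

# Load the salary data set (the filename is a placeholder).
df = pd.read_csv('ds_salaries.csv')
print(df.shape)    # expected: (4134, 11)
print(df.columns)  # the 11 variables

# Test whether the state 'Data Scientist' (job_title) overlaps with
# the state 'L' (company_size) more than expected by chance.
M = len(df)                                        # population size
n = (df['job_title'] == 'Data Scientist').sum()    # successes in population
N = (df['company_size'] == 'L').sum()              # number of draws
k = ((df['job_title'] == 'Data Scientist') &
     (df['company_size'] == 'L')).sum()            # observed overlap

# P(X >= k) under the hypergeometric null of no association.
pvalue = hypergeom.sf(k - 1, M, n, N)
print(f'P(overlap >= {k}) = {pvalue:.4g}')
```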