Causal Data Science
Causal Discovery
Chat with Your Dataset using Bayesian Inferences.

The ability to ask questions of your data set has always been an intriguing prospect. You will be surprised how easy it is to learn a causal Bayesian model that can be used to interrogate your data set.

With the rise of large language models (LLMs), it has become accessible for a broader audience to analyze their own data sets and, so to speak, “ask questions”. Although this is great, such an approach also has disadvantages when used as an analytical step in automated pipelines, especially when the outcome of a model can have a significant impact. To maintain control and ensure accurate results, we can also use Bayesian inference to talk to our data set. In this blog, we will go through the steps of learning a Bayesian model and applying do-calculus to the data science salary data set. I will demonstrate how to create a model that allows you to “ask questions” of your data set while maintaining control. You will be surprised by how easy it is to create such a model using the bnlearn library.


Introduction.

Extracting valuable insights from data sets is an ongoing challenge for data scientists and analysts. ChatGPT-like models have made it easier to interactively analyze data sets, but at the same time, why particular choices are made can become less transparent or even unknown. Relying on such black-box approaches is far from ideal in automated analytical pipelines. Creating transparent models is especially important when the outcome of a model affects the actions that are taken.

The ability to communicate effectively with data sets has always been an intriguing prospect for researchers and practitioners alike.

In the next sections, I will first introduce the bnlearn library [1] and how it learns causal networks. Then I will demonstrate how to learn causal networks on a mixed data set, and how to apply do-calculus to effectively query the data set. Let’s see how Bayesian inference can help us interact with our data sets!


The Bnlearn library.

Bnlearn is a powerful Python package that provides a comprehensive set of functions for causal analysis using Bayesian networks. It can handle discrete, continuous, and mixed data sets, and offers a wide range of user-friendly functionalities for causal learning, including structure learning, parameter learning, and making inferences [1–3]. Before we can make inferences, we need to understand structure learning and parameter learning, because inference relies on both.

Learning the causal structure of a data set is one of the great features of bnlearn. Structure learning eliminates the need for prior knowledge or assumptions about the underlying relationships between variables. There are three approaches in bnlearn to learn a causal model and capture the dependencies between variables. Structure learning results in a so-called Directed Acyclic Graph (DAG). Although all three techniques produce a causal DAG, some can handle a large number of features while others achieve higher accuracy.

  • Score-based structure learning: Uses scoring functions such as BIC, BDeu, K2, BDs, and AIC, in combination with search strategies such as ExhaustiveSearch, HillClimbSearch, Chow-Liu, Tree-augmented Naive Bayes (TAN), and Naive Bayes.

  • Constraint-based structure learning (PC): Uses statistical tests, such as the chi-square test, to assess edge strength prior to modeling.

  • Hybrid structure learning: A combination of the score-based and constraint-based techniques.


Parameter learning is the second important part of Bayesian network analysis, and bnlearn excels in this area as well. By leveraging a set of data samples and a (pre-determined) DAG, we can estimate the Conditional Probability Distributions or Tables (CPDs or CPTs).

Bnlearn also provides a plethora of functions and helper utilities to assist users throughout the analysis process. These include data set transformation functions, derivation of topological orderings, graph comparison tools, insightful interactive plotting capabilities, and more. The bnlearn library supports loading .bif files, converting directed graphs to undirected ones, and performing statistical tests for assessing independence among variables.

In the next section, we will dive into making inferences using do-calculus, with hands-on examples. This allows us to ask questions of our data set. As mentioned earlier, structure learning and parameter learning form the basis.
