## Definition A **Bayesian network** (also: belief network, directed graphical model) is a directed acyclic graph (DAG) representing a joint probability distribution over a set of variables, where each node has an associated *conditional probability distribution* given its parents. Introduced by Judea Pearl in the 1980s. ## Structure - **Nodes** — random variables $X_1, \dots, X_n$. - **Edges** — direct probabilistic dependencies (typically interpreted as causal). - **Each node** $X_i$ has a **Conditional Probability Table (CPT)** $P(X_i \mid \text{Parents}(X_i))$. The joint distribution factorises: $ P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid \text{Parents}(X_i)) $ This factorisation is the key efficiency: a joint over $n$ binary variables has $2^n - 1$ parameters; the Bayesian network has only $\sum_i 2^{|\text{Parents}(X_i)|}$ — exponentially fewer for sparse graphs. ## Canonical Example: Alarm Network ``` Burglary Earthquake \ / \ / Alarm / \ JohnCalls MaryCalls ``` Five variables. Burglary and earthquake independently affect the alarm; John and Mary each independently call if they hear the alarm. CPTs encode probabilities like $P(\text{Alarm} \mid \text{Burglary}, \text{Earthquake})$. ## Conditional Independence Encoded Two nodes are *conditionally independent* given a set if no active path connects them — see [[D-Separation]] for the formal test. This sparsity is what makes inference tractable. ## Inference Tasks - **Posterior inference.** Given some evidence, compute $P(X_q \mid \text{evidence})$ for a query variable $X_q$. - **MAP (Maximum A Posteriori).** Find the most probable assignment of unobserved variables. - **Most Probable Explanation (MPE).** MAP for *all* unobserved variables. Exact inference is **NP-hard** in general. Algorithms: - **[[Variable Elimination]]** — exact, sums out variables in an ordering. - **Junction tree algorithm** — exact, more efficient for repeated queries. - **Gibbs sampling, MCMC** — approximate. - **Variational inference** — approximate, often faster than MCMC. - **Belief propagation** — exact on trees, approximate on loopy graphs. ## Learning Bayesian Networks - **Parameters** (CPTs) — learn from data with maximum likelihood or Bayesian estimation. - **Structure** — much harder. Search over DAGs scored by BIC or marginal likelihood; constraint-based methods using conditional independence tests. ## Where They Win and Lose **Strengths:** - Interpretable causal-ish structure. - Handles missing data naturally. - Probabilistic outputs with calibrated uncertainty. - Can encode expert knowledge alongside data. **Limitations:** - Discrete variables most natural; continuous needs conditional Gaussians or other tricks. - Inference is exponential in worst case. - Structure learning is brittle without lots of data. ## Modern Status Bayesian networks are no longer the dominant ML paradigm — neural networks have taken that role. But they remain decisive in: - **Medical diagnosis** — interpretability and uncertainty calibration matter. - **Industrial fault diagnosis** — limited data, expert knowledge available. - **Causal inference** — judea Pearl's *do-calculus* sits on top of Bayesian networks. ## Related - [[Bayes Theorem]] - [[D-Separation]] - [[Variable Elimination]] - [[Naive Bayes]] - [[Hidden Markov Model]] - [[Markov Network]]