## Definition
A **Bayesian network** (also: belief network, directed graphical model) is a directed acyclic graph (DAG) representing a joint probability distribution over a set of variables, where each node has an associated *conditional probability distribution* given its parents. Introduced by Judea Pearl in the 1980s.
## Structure
- **Nodes** — random variables $X_1, \dots, X_n$.
- **Edges** — direct probabilistic dependencies (typically interpreted as causal).
- **Each node** $X_i$ has a **Conditional Probability Table (CPT)** $P(X_i \mid \text{Parents}(X_i))$.
The joint distribution factorises:
$
P(X_1, \dots, X_n) = \prod_{i=1}^n P(X_i \mid \text{Parents}(X_i))
$
This factorisation is the key efficiency: a joint over $n$ binary variables has $2^n - 1$ parameters; the Bayesian network has only $\sum_i 2^{|\text{Parents}(X_i)|}$ — exponentially fewer for sparse graphs.
## Canonical Example: Alarm Network
```
Burglary Earthquake
\ /
\ /
Alarm
/ \
JohnCalls MaryCalls
```
Five variables. Burglary and earthquake independently affect the alarm; John and Mary each independently call if they hear the alarm. CPTs encode probabilities like $P(\text{Alarm} \mid \text{Burglary}, \text{Earthquake})$.
## Conditional Independence Encoded
Two nodes are *conditionally independent* given a set if no active path connects them — see [[D-Separation]] for the formal test. This sparsity is what makes inference tractable.
## Inference Tasks
- **Posterior inference.** Given some evidence, compute $P(X_q \mid \text{evidence})$ for a query variable $X_q$.
- **MAP (Maximum A Posteriori).** Find the most probable assignment of unobserved variables.
- **Most Probable Explanation (MPE).** MAP for *all* unobserved variables.
Exact inference is **NP-hard** in general. Algorithms:
- **[[Variable Elimination]]** — exact, sums out variables in an ordering.
- **Junction tree algorithm** — exact, more efficient for repeated queries.
- **Gibbs sampling, MCMC** — approximate.
- **Variational inference** — approximate, often faster than MCMC.
- **Belief propagation** — exact on trees, approximate on loopy graphs.
## Learning Bayesian Networks
- **Parameters** (CPTs) — learn from data with maximum likelihood or Bayesian estimation.
- **Structure** — much harder. Search over DAGs scored by BIC or marginal likelihood; constraint-based methods using conditional independence tests.
## Where They Win and Lose
**Strengths:**
- Interpretable causal-ish structure.
- Handles missing data naturally.
- Probabilistic outputs with calibrated uncertainty.
- Can encode expert knowledge alongside data.
**Limitations:**
- Discrete variables most natural; continuous needs conditional Gaussians or other tricks.
- Inference is exponential in worst case.
- Structure learning is brittle without lots of data.
## Modern Status
Bayesian networks are no longer the dominant ML paradigm — neural networks have taken that role. But they remain decisive in:
- **Medical diagnosis** — interpretability and uncertainty calibration matter.
- **Industrial fault diagnosis** — limited data, expert knowledge available.
- **Causal inference** — judea Pearl's *do-calculus* sits on top of Bayesian networks.
## Related
- [[Bayes Theorem]]
- [[D-Separation]]
- [[Variable Elimination]]
- [[Naive Bayes]]
- [[Hidden Markov Model]]
- [[Markov Network]]