
Learning Bayesian Networks
Richard E. Neapolitan
Northeastern Illinois University
Chicago, Illinois
In memory of my dad, a difficult but loving father, who raised me well.

Contents

Preface  ix

I  Basics  1

1  Introduction to Bayesian Networks  3
   1.1  Basics of Probability Theory  5
        1.1.1  Probability Functions and Spaces  6
        1.1.2  Conditional Probability and Independence  9
        1.1.3  Bayes' Theorem  12
        1.1.4  Random Variables and Joint Probability Distributions  13
   1.2  Bayesian Inference  20
        1.2.1  Random Variables and Probabilities in Bayesian Applications  20
        1.2.2  A Definition of Random Variables and Joint Probability Distributions for Bayesian Inference  24
        1.2.3  A Classical Example of Bayesian Inference  27
   1.3  Large Instances / Bayesian Networks  29
        1.3.1  The Difficulties Inherent in Large Instances  29
        1.3.2  The Markov Condition  31
        1.3.3  Bayesian Networks  40
        1.3.4  A Large Bayesian Network  43
   1.4  Creating Bayesian Networks Using Causal Edges  43
        1.4.1  Ascertaining Causal Influences Using Manipulation  44
        1.4.2  Causation and the Markov Condition  51

2  More DAG/Probability Relationships  65
   2.1  Entailed Conditional Independencies  66
        2.1.1  Examples of Entailed Conditional Independencies  66
        2.1.2  d-Separation  70
        2.1.3  Finding d-Separations  76
   2.2  Markov Equivalence  84
   2.3  Entailing Dependencies with a DAG  92
        2.3.1  Faithfulness  95
        2.3.2  Embedded Faithfulness  99
   2.4  Minimality  104
   2.5  Markov Blankets and Boundaries  108
   2.6  More on Causal DAGs  110
        2.6.1  The Causal Minimality Assumption  110
        2.6.2  The Causal Faithfulness Assumption  111
        2.6.3  The Causal Embedded Faithfulness Assumption  112

II  Inference  121

3  Inference: Discrete Variables  123
   3.1  Examples of Inference  124
   3.2  Pearl's Message-Passing Algorithm  126
        3.2.1  Inference in Trees  127
        3.2.2  Inference in Singly-Connected Networks  142
        3.2.3  Inference in Multiply-Connected Networks  153
        3.2.4  Complexity of the Algorithm  155
   3.3  The Noisy OR-Gate Model  156
        3.3.1  The Model  156
        3.3.2  Doing Inference With the Model  160
        3.3.3  Further Models  161
   3.4  Other Algorithms that Employ the DAG  161
   3.5  The SPI Algorithm  162
        3.5.1  The Optimal Factoring Problem  163
        3.5.2  Application to Probabilistic Inference  168
   3.6  Complexity of Inference  170
   3.7  Relationship to Human Reasoning  171
        3.7.1  The Causal Network Model  171
        3.7.2  Studies Testing the Causal Network Model  173

4  More Inference Algorithms  181
   4.1  Continuous Variable Inference  181
        4.1.1  The Normal Distribution  182
        4.1.2  An Example Concerning Continuous Variables  183
        4.1.3  An Algorithm for Continuous Variables  185
   4.2  Approximate Inference  205
        4.2.1  A Brief Review of Sampling  205
        4.2.2  Logic Sampling  211
        4.2.3  Likelihood Weighting  217
   4.3  Abductive Inference  221
        4.3.1  Abductive Inference in Bayesian Networks  221
        4.3.2  A Best-First Search Algorithm for Abductive Inference  224

5  Influence Diagrams  239
   5.1  Decision Trees  239
        5.1.1  Simple Examples  239
        5.1.2  Probabilities, Time, and Risk Attitudes  242
        5.1.3  Solving Decision Trees  245
        5.1.4  More Examples  245
   5.2  Influence Diagrams  259
        5.2.1  Representing with Influence Diagrams  259
        5.2.2  Solving Influence Diagrams  266
   5.3  Dynamic Networks  272
        5.3.1  Dynamic Bayesian Networks  272
        5.3.2  Dynamic Influence Diagrams  279

III  Learning  291

6  Parameter Learning: Binary Variables  293
   6.1  Learning a Single Parameter  294
        6.1.1  Probability Distributions of Relative Frequencies  294
        6.1.2  Learning a Relative Frequency  303
   6.2  More on the Beta Density Function  310
        6.2.1  Non-integral Values of a and b  311
        6.2.2  Assessing the Values of a and b  313
        6.2.3  Why the Beta Density Function?  315
   6.3  Computing a Probability Interval  319
   6.4  Learning Parameters in a Bayesian Network  323
        6.4.1  Urn Examples  323
        6.4.2  Augmented Bayesian Networks  331
        6.4.3  Learning Using an Augmented Bayesian Network  336
        6.4.4  A Problem with Updating; Using an Equivalent Sample Size  348
   6.5  Learning with Missing Data Items  357
        6.5.1  Data Items Missing at Random  358
        6.5.2  Data Items Missing Not at Random  363
   6.6  Variances in Computed Relative Frequencies  364
        6.6.1  A Simple Variance Determination  364
        6.6.2  The Variance and Equivalent Sample Size  366
        6.6.3  Computing Variances in Larger Networks  372
        6.6.4  When Do Variances Become Large?  373

7  More Parameter Learning  381
   7.1  Multinomial Variables  381
        7.1.1  Learning a Single Parameter  381
        7.1.2  More on the Dirichlet Density Function  388
        7.1.3  Computing Probability Intervals and Regions  389
        7.1.4  Learning Parameters in a Bayesian Network  392
        7.1.5  Learning with Missing Data Items  398
        7.1.6  Variances in Computed Relative Frequencies  398
   7.2  Continuous Variables  398
        7.2.1  Normally Distributed Variable  399
        7.2.2  Multivariate Normally Distributed Variables  413
        7.2.3  Gaussian Bayesian Networks  425

8  Bayesian Structure Learning  441
   8.1  Learning Structure: Discrete Variables  441
        8.1.1  Schema for Learning Structure  442
        8.1.2  Procedure for Learning Structure  445
        8.1.3  Learning From a Mixture of Observational and Experimental Data  449
        8.1.4  Complexity of Structure Learning  450
   8.2  Model Averaging  451
   8.3  Learning Structure with Missing Data  452
        8.3.1  Monte Carlo Methods  453
        8.3.2  Large-Sample Approximations  462
   8.4  Probabilistic Model Selection  468
        8.4.1  Probabilistic Models  468
        8.4.2  The Model Selection Problem  472
        8.4.3  Using the Bayesian Scoring Criterion for Model Selection  473
   8.5  Hidden Variable DAG Models  476
        8.5.1  Models Containing More Conditional Independencies than DAG Models  477
        8.5.2  Models Containing the Same Conditional Independencies as DAG Models  479
        8.5.3  Dimension of Hidden Variable DAG Models  484
        8.5.4  Number of Models and Hidden Variables  486
        8.5.5  Efficient Model Scoring  487
   8.6  Learning Structure: Continuous Variables  491
        8.6.1  The Density Function of D  491
        8.6.2  The Density function of D Given a DAG pattern  495
   8.7  Learning Dynamic Bayesian Networks  505

9  Approximate Bayesian Structure Learning  511
   9.1  Approximate Model Selection  511
        9.1.1  Algorithms that Search over DAGs  513
        9.1.2  Algorithms that Search over DAG Patterns  518
        9.1.3  An Algorithm Assuming Missing Data or Hidden Variables  529
   9.2  Approximate Model Averaging  531
        9.2.1  A Model Averaging Example  532
        9.2.2  Approximate Model Averaging Using MCMC  533

10  Constraint-Based Learning  541
   10.1  Algorithms Assuming Faithfulness  542
        10.1.1  Simple Examples  542
        10.1.2  Algorithms for Determining DAG patterns  545
        10.1.3  Determining if a Set Admits a Faithful DAG Representation  552
        10.1.4  Application to Probability  560
   10.2  Assuming Only Embedded Faithfulness  561
        10.2.1  Inducing Chains  562
        10.2.2  A Basic Algorithm  568
        10.2.3  Application to Probability  590
        10.2.4  Application to Learning Causal Influences¹  591
   10.3  Obtaining the d-separations  599
        10.3.1  Discrete Bayesian Networks  600
        10.3.2  Gaussian Bayesian Networks  603
   10.4  Relationship to Human Reasoning  604
        10.4.1  Background Theory  604
        10.4.2  A Statistical Notion of Causality  606

11  More Structure Learning  617
   11.1  Comparing the Methods  617
        11.1.1  A Simple Example  618
        11.1.2  Learning College Attendance Influences  620
        11.1.3  Conclusions  623
   11.2  Data Compression Scoring Criteria  624
   11.3  Parallel Learning of Bayesian Networks  624
   11.4  Examples  624
        11.4.1  Structure Learning  625
        11.4.2  Inferring Causal Relationships  633

IV  Applications  647

12  Applications  649
   12.1  Applications Based on Bayesian Networks  649
   12.2  Beyond Bayesian networks  655

Bibliography  657

Index  686

¹ The relationships in the examples in this section are largely fictitious.

Preface
Bayesian networks are graphical structures for representing the probabilistic
relationships among a large number of variables and doing probabilistic inference
with those variables. During the 1980’s, a good deal of related research was done
on developing Bayesian networks (belief networks, causal networks, influence
diagrams), algorithms for performing inference with them, and applications that
used them. However, the work was scattered throughout research articles. My
purpose in writing the 1990 text Probabilistic Reasoning in Expert Systems was
to unify this research and establish a textbook and reference for the field which
has come to be known as ‘Bayesian networks.’ The 1990’s saw the emergence
of excellent algorithms for learning Bayesian networks from data. However,
by 2000 there still seemed to be no accessible source for ‘learning Bayesian
networks.’ Similar to my purpose a decade ago, the goal of this text is to
provide such a source.
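As a minimal illustration of the kind of object the text studies (this sketch is mine, not taken from the book, and its variables, graph, and numbers are made up), the Python program below encodes a three-variable network H -> B, H -> L with conditional probability tables and answers a conditional-probability query by brute-force enumeration of the joint distribution that the network's factorization P(H)P(B|H)P(L|H) defines, rather than by any of the inference algorithms covered in the book.

# A minimal sketch of a Bayesian network, assuming a hypothetical
# three-node graph H -> B, H -> L with made-up probability tables.
from itertools import product

P_H = {True: 0.2, False: 0.8}
P_B_given_H = {True: {True: 0.25, False: 0.75},    # P(B | H=True)
               False: {True: 0.05, False: 0.95}}   # P(B | H=False)
P_L_given_H = {True: {True: 0.003, False: 0.997},
               False: {True: 0.00005, False: 0.99995}}

def joint(h, b, l):
    """Joint probability from the network's factorization P(H)P(B|H)P(L|H)."""
    return P_H[h] * P_B_given_H[h][b] * P_L_given_H[h][l]

def prob(query, evidence):
    """P(query | evidence), each a dict mapping variable names to values."""
    def consistent(assignment, constraints):
        return all(assignment[v] == val for v, val in constraints.items())
    num = den = 0.0
    for h, b, l in product([True, False], repeat=3):
        a = {"H": h, "B": b, "L": l}
        p = joint(h, b, l)
        if consistent(a, evidence):
            den += p
            if consistent(a, query):
                num += p
    return num / den

# Example query: P(H = True | B = True).
print(prob({"H": True}, {"B": True}))

Enumeration like this is exponential in the number of variables; the point of the algorithms discussed later in the text is to exploit the network structure to do better.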
In order to make this text a complete introduction to Bayesian networks,
I discuss methods for doing inference in Bayesian networks and influence diagrams. However, there is no effort to be exhaustive in this discussion. For
example, I give the details of only two algorithms for