comparison pylearn/algorithms/mcRBM.py @ 984:5badf36a6daf

mcRBM - added notes to leading comment
author James Bergstra <bergstrj@iro.umontreal.ca>
date Tue, 24 Aug 2010 13:50:26 -0400
parents 2a53384d9742
children 78b5bdf967f6

Ranzato, M. and Hinton, G. E. (2010)
Modeling pixel means and covariances using factored third-order Boltzmann machines.
IEEE Conference on Computer Vision and Pattern Recognition.

and performs one of the experiments on CIFAR-10 discussed in that paper. There are some minor
discrepancies between the paper and the accompanying code (train_mcRBM.py); in those cases the
accompanying code has been taken to be correct, because I could not get things to work
otherwise.


Math
====

E = \sum_f h_f ( \sum_i C_{if} v_i )^2


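As a quick illustration (the array names follow the equation, but the sizes and random values here are assumptions, not from the source), this covariance energy can be evaluated with numpy:

```python
import numpy as np

rng = np.random.RandomState(0)
I, F = 16, 8              # visible units, covariance filters (assumed sizes)
C = rng.randn(I, F)       # filter matrix C_{if}
v = rng.randn(I)          # visible vector
h = rng.rand(F)           # covariance hidden units h_f, in [0, 1]

# E = \sum_f h_f ( \sum_i C_{if} v_i )^2, vectorized
E = np.dot(h, np.dot(C.T, v) ** 2)

# the same quantity with explicit sums, for checking
E_loop = sum(h[f] * np.dot(C[:, f], v) ** 2 for f in range(F))
assert np.allclose(E, E_loop)
```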
Version in paper
----------------

Full Energy of the Mean and Covariance RBM, with
:math:`h_k = h_k^{(c)}`,
:math:`g_j = h_j^{(m)}`,
:math:`b_k = b_k^{(c)}`,
:math:`c_j = b_j^{(m)}`,
:math:`U_{if} = C_{if}`,

E (v, h, g) =
    - 0.5 \sum_f \sum_k P_{fk} h_k ( \sum_i U_{if} v_i / (|U_{.f}| |v|) )^2
    - \sum_k b_k h_k
    + 0.5 \sum_i v_i^2
    - \sum_j \sum_i W_{ij} g_j v_i
    - \sum_j c_j g_j

For the energy function to correspond to a probability distribution, P must be non-positive. P
is initialized to be diagonal, and in our experience it can be left as such, because even in
the paper it has a very low learning rate and is only allowed to be updated after the filters
in U have been learned (in effect).

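A minimal numpy sketch of this energy, assuming illustrative sizes, random parameters, and a diagonal non-positive P as described above (none of these values come from the source):

```python
import numpy as np

rng = np.random.RandomState(0)
I, F, K, J = 16, 8, 8, 4            # assumed sizes: visibles, filters, cov-hiddens, mean-hiddens
U = rng.randn(I, F)                 # covariance filters U_{if}
P = -np.eye(F, K)                   # diagonal and non-positive, as the text requires
W = rng.randn(I, J)                 # mean filters W_{ij}
b, c = rng.randn(K), rng.randn(J)   # hidden biases
v = rng.randn(I)                    # visible vector
h, g = rng.rand(K), rng.rand(J)     # hidden activations in [0, 1]

# cosine-like response of filter f: \sum_i U_{if} v_i / (|U_{.f}| |v|)
cos_fv = np.dot(U.T, v) / (np.sqrt((U ** 2).sum(axis=0)) * np.sqrt((v ** 2).sum()))

E = (- 0.5 * np.dot(cos_fv ** 2, np.dot(P, h))   # covariance term
     - np.dot(b, h)
     + 0.5 * np.dot(v, v)
     - np.dot(np.dot(W, g), v)                   # mean term
     - np.dot(c, g))
```

By Cauchy-Schwarz the normalized responses lie in [-1, 1], which is what keeps the covariance term bounded for fixed h.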
Version in published train_mcRBM code
-------------------------------------

The train_mcRBM file implements learning with a similar but technically different energy function:

E (v, h, g) =
    - 0.5 \sum_f \sum_k P_{fk} h_k ( \sum_i U_{if} v_i / sqrt(\sum_i v_i^2 / I + 0.5) )^2
    - \sum_k b_k h_k
    + 0.5 \sum_i v_i^2
    - \sum_j \sum_i W_{ij} g_j v_i
    - \sum_j c_j g_j

There are two differences with respect to the paper:

- 'v' is not normalized by its length; rather, it is normalized to have length close to
  the square root of the number of its components. The variable called 'small' that
  "avoids division by zero" is orders of magnitude larger than machine precision, and is
  on the order of the normalized sum-of-squares, so I have included it in the energy
  function.

- 'U' is also not normalized by its length. U is initialized to have columns that are
  shorter than unit length (approximately 0.2 with the 105 principal components in the
  train_mcRBM data). During training, the columns of U are manually constrained to have
  equal lengths (see the use of normVF), but that common Euclidean norm is allowed to
  change: during learning it quickly converges towards 1 and then exceeds 1. This
  column-wise normalization of U does not seem to be justified by maximum likelihood,
  and I have no intuition for why it is used.

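The first difference can be illustrated by contrasting the two normalizations of 'v'. The value small = 0.5 follows the energy written above; the vector itself and its size are assumptions for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
I = 16
v = rng.randn(I)

# Paper-style normalization: v scaled to unit length.
v_paper = v / np.sqrt((v ** 2).sum())

# train_mcRBM-style normalization: v scaled so its length approaches sqrt(I);
# 'small' = 0.5 is far above machine precision, hence part of the energy itself.
small = 0.5
v_code = v / np.sqrt((v ** 2).mean() + small)

assert np.allclose(np.dot(v_paper, v_paper), 1.0)
# |v_code|^2 = |v|^2 / (|v|^2 / I + 0.5), which is always below I
# and approaches I as |v| grows
assert np.dot(v_code, v_code) < I
```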

Version in this code
--------------------

This file implements the same algorithm as the train_mcRBM code, except that the P matrix is
omitted for clarity and replaced analytically with a negative identity matrix.

E (v, h, g) =
    + 0.5 \sum_k h_k ( \sum_i U_{ik} v_i / sqrt(\sum_i v_i^2 / I + 0.5) )^2
    - \sum_k b_k h_k
    + 0.5 \sum_i v_i^2
    - \sum_j \sum_i W_{ij} g_j v_i
    - \sum_j c_j g_j


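A minimal numpy sketch of this simplified energy (sizes and values are illustrative assumptions; with P replaced by a negative identity, each covariance hidden unit pairs with exactly one column of U, so the f and k indices coincide):

```python
import numpy as np

rng = np.random.RandomState(0)
I, K, J = 16, 8, 4                  # assumed sizes: visibles, cov-hiddens, mean-hiddens
U = rng.randn(I, K)                 # one filter per covariance hidden unit
W = rng.randn(I, J)
b, c = rng.randn(K), rng.randn(J)
v = rng.randn(I)
h, g = rng.rand(K), rng.rand(J)

# normalized filter responses, as in the train_mcRBM-style energy
u = np.dot(U.T, v) / np.sqrt((v ** 2).mean() + 0.5)

E = (+ 0.5 * np.dot(h, u ** 2)      # sign flipped by P = -identity
     - np.dot(b, h)
     + 0.5 * np.dot(v, v)
     - np.dot(np.dot(W, g), v)
     - np.dot(c, g))
```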
Conventions in this file
========================

This file contains some global functions, as well as a class (MeanCovRBM) that makes using them a little

- `U`, a matrix whose rows are visible covariance directions (I x F)
- `W`, a matrix whose rows are visible mean directions (I x J)
- `b`, a vector of hidden covariance biases (K)
- `c`, a vector of hidden mean biases (J)

Matrices are generally laid out and accessed according to a C-order convention.

"""

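A sketch instantiating parameters with the shapes listed above. The sizes and initial scales are illustrative assumptions only, loosely echoing the ~0.2 column lengths and the 105 principal components mentioned earlier:

```python
import numpy as np

rng = np.random.RandomState(0)
I, F, K, J = 105, 256, 256, 100          # illustrative sizes only

# Parameters named as in the docstring, as C-ordered (row-major) numpy arrays:
U = rng.randn(I, F) * 0.2 / np.sqrt(I)   # visible covariance directions (I x F),
                                         # columns of length roughly 0.2
W = rng.randn(I, J) * 0.01               # visible mean directions (I x J)
b = np.zeros(K)                          # hidden covariance biases (K)
c = np.zeros(J)                          # hidden mean biases (J)

for arr in (U, W):
    assert arr.flags['C_CONTIGUOUS']     # C-order layout convention
```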
#
# WORKING NOTES
# THIS DERIVATION IS BASED ON THE ** PAPER ** ENERGY FUNCTION,
# NOT THE ENERGY FUNCTION IN THE CODE!!!
#
# Free energy is the marginal energy of the visible units
# Recall:
#   Q(x) = exp(-E(x)) / Z  ==>  -log(Q(x)) - log(Z) = E(x)
#
#