# HG changeset patch
# User James Bergstra
# Date 1282672226 14400
# Node ID 5badf36a6daf8f033e95f86cb59faa9779674688
# Parent 15371ff780a0f77fefb6257e5908545014ab6d0c
mcRBM - added notes to leading comment

diff -r 15371ff780a0 -r 5badf36a6daf pylearn/algorithms/mcRBM.py
--- a/pylearn/algorithms/mcRBM.py	Mon Aug 23 16:06:23 2010 -0400
+++ b/pylearn/algorithms/mcRBM.py	Tue Aug 24 13:50:26 2010 -0400
@@ -5,7 +5,10 @@
 Modeling pixel means and covariances using factored third-order Boltzmann
 machines. IEEE Conference on Computer Vision and Pattern Recognition.
 
-and performs one of the experiments on CIFAR-10 discussed in that paper.
+and performs one of the experiments on CIFAR-10 discussed in that paper.  There are some minor
+discrepancies between the paper and the accompanying code (train_mcRBM.py); in those cases the
+accompanying code has been taken to be correct, because I couldn't get things to work
+otherwise.
 
 
 Math
@@ -24,25 +27,71 @@
 
 
-Full Energy of mean and Covariance RBM, with
+Version in paper
+----------------
+
+Full Energy of the Mean and Covariance RBM, with
 :math:`h_k = h_k^{(c)}`,
 :math:`g_j = h_j^{(m)}`,
 :math:`b_k = b_k^{(c)}`,
 :math:`c_j = b_j^{(m)}`,
 :math:`U_{if} = C_{if}`,
-:
+
+    E (v, h, g) =
+        - 0.5 \sum_f \sum_k P_{fk} h_k ( \sum_i U_{if} v_i / (|U_{.f}| |v|) )^2
+        - \sum_k b_k h_k
+        + 0.5 \sum_i v_i^2
+        - \sum_j \sum_i W_{ij} g_j v_i
+        - \sum_j c_j g_j
+
+For the energy function to correspond to a probability distribution, P must be non-positive.
+P is initialized to be diagonal, and in our experience it can be left as such, because even in
+the paper it has a very low learning rate and is (in effect) only allowed to be updated after
+the filters in U are learned.
+
+Version in published train_mcRBM code
+-------------------------------------
+
+The train_mcRBM file implements learning for a similar, but technically different, energy
+function:
 
 E (v, h, g) =
-    - 0.5 \sum_f \sum_k P_{fk} h_k ( \sum_i U_{if} v_i )^2 / |U_{*f}|^2 |v|^2
+    - 0.5 \sum_f \sum_k P_{fk} h_k ( \sum_i U_{if} v_i / sqrt(\sum_i v_i^2/I + 0.5) )^2
     - \sum_k b_k h_k
     + 0.5 \sum_i v_i^2
     - \sum_j \sum_i W_{ij} g_j v_i
     - \sum_j c_j g_j
 
-For the energy function to correspond to a probability distribution, P must be non-positive.
+There are two differences with respect to the paper:
+
+  - 'v' is not normalized by its length; rather, it is normalized to have length close to the
+    square root of the number of its components.  The variable called 'small' that "avoids
+    division by zero" is orders of magnitude larger than machine precision, and is on the
+    order of the normalized sum-of-squares, so I've included it in the energy function.
+
+  - 'U' is also not normalized by its length.  U is initialized to have columns that are
+    shorter than unit length (approximately 0.2 with the 105 principal components in the
+    train_mcRBM data).  During training, the columns of U are manually constrained to have
+    equal lengths (see the use of normVF), but that shared Euclidean norm is allowed to
+    change: during learning it quickly converges towards 1 and then exceeds 1.  This
+    column-wise normalization of U does not seem to be justified by maximum likelihood, and I
+    have no intuition for why it is used.
 
+Version in this code
+--------------------
+
+This file implements the same algorithm as the train_mcRBM code, except that the P matrix is
+omitted for clarity and replaced analytically with a negative identity matrix.
+
+    E (v, h, g) =
+        + 0.5 \sum_k h_k ( \sum_i U_{ik} v_i / sqrt(\sum_i v_i^2/I + 0.5) )^2
+        - \sum_k b_k h_k
+        + 0.5 \sum_i v_i^2
+        - \sum_j \sum_i W_{ij} g_j v_i
+        - \sum_j c_j g_j
+
+
 Conventions in this file
 ========================
@@ -64,10 +113,15 @@
 - `b`, a vector of hidden covariance biases (K)
 - `c`, a vector of hidden mean biases (J)
 
-Matrices are generally layed out according to a C-order convention.
+Matrices are generally laid out and accessed according to a C-order convention.
 
 """
 
+#
+# WORKING NOTES
+# THIS DERIVATION IS BASED ON THE ** PAPER ** ENERGY FUNCTION,
+# NOT THE ENERGY FUNCTION IN THE CODE!!!
+#
 # Free energy is the marginal energy of visible units
 # Recall:
 #   Q(x) = exp(-E(x))/Z  ==>  -log(Q(x)) - log(Z) = E(x)
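
The "Version in this code" energy function (P replaced analytically by a negative identity) is straightforward to evaluate numerically. The sketch below is illustrative only: the function name, argument shapes, and array layout are my own choices and do not come from the pylearn source.

```python
import numpy as np

def mcrbm_energy(v, h, g, U, W, b, c):
    """Energy of the mcRBM variant with P replaced by a negative identity.

    Shapes (hypothetical, chosen for illustration):
      v: (I,) visible vector            U: (I, K) covariance filters
      h: (K,) hidden covariance units   W: (I, J) mean weights
      g: (J,) hidden mean units         b: (K,) and c: (J,) biases
    """
    I = v.shape[0]
    # train_mcRBM-style normalization: divide v by sqrt(\sum_i v_i^2 / I + 0.5),
    # so the normalized v has length close to sqrt(I) rather than unit length
    vn = v / np.sqrt((v ** 2).sum() / I + 0.5)
    return (0.5 * np.sum(h * (U.T @ vn) ** 2)  # + 0.5 \sum_k h_k (...)^2
            - b @ h                            # - \sum_k b_k h_k
            + 0.5 * (v @ v)                    # + 0.5 \sum_i v_i^2
            - g @ (W.T @ v)                    # - \sum_j \sum_i W_{ij} g_j v_i
            - c @ g)                           # - \sum_j c_j g_j
```

Each return term maps one-to-one onto a line of the plain-text energy above, which makes it easy to check the sign conventions term by term.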