
=========================
Optimization for Learning
=========================

Members: Bergstra, Lamblin, Delalleau, Glorot, Breuleux, Bordes
Leader: Bergstra



Initial Writeup by James
=========================================



Previous work - scikits, openopt, and scipy provide function-optimization
algorithms.  These are not currently GPU-enabled, but they may be in the future.


IS PREVIOUS WORK SUFFICIENT?
--------------------------------

In many cases it is (I used it for sparse coding, and it was ok).

These packages provide batch optimization, whereas we typically need online
optimization.

It can be faster (to run) and more convenient (to implement) to have
optimization algorithms as Theano update expressions.
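
For concreteness, here is a minimal sketch of what "optimization algorithms as
Theano update expressions" means for plain SGD.  The helper name and its
arguments are illustrative, not a proposed API:

import theano.tensor as T

def sgd_updates(params, cost, lr=0.01):
    """Sketch: return (shared_variable, new_value) pairs for one SGD step."""
    grads = T.grad(cost, params)
    return [(p, p - lr * g) for p, g in zip(params, grads)]

The caller then compiles these pairs into a training function with
theano.function(..., updates=...).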


What optimization algorithms do we want/need?
---------------------------------------------

 - sgd
 - sgd + momentum (see the sketch after this list)
 - sgd with an annealing schedule
 - TONGA
 - James Martens' Hessian-free optimization
 - Conjugate gradients, batch and (large) mini-batch [that is also what Martens' method uses]
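
The sketch below shows how an algorithm with state, such as sgd + momentum,
fits the update-expression style: each parameter gets an auxiliary velocity
shared variable, and both the parameter and its velocity receive an update
pair.  Names and defaults are illustrative only.

import numpy
import theano
import theano.tensor as T

def sgd_momentum_updates(params, cost, lr=0.01, momentum=0.9):
    """Sketch: SGD + momentum as a list of Theano update pairs."""
    grads = T.grad(cost, params)
    # One velocity buffer per parameter, initialized to zero.
    velocities = [theano.shared(numpy.zeros_like(p.get_value()))
                  for p in params]
    updates = []
    for p, g, v in zip(params, grads, velocities):
        new_v = momentum * v - lr * g
        updates.append((p, p + new_v))
        updates.append((v, new_v))
    return updates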

Do we need anything to make batch algorithms work better with the rest of Pylearn?
 - conjugate methods? yes
 - L-BFGS? maybe, when needed
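
One way the batch algorithms can already interoperate with Theano-based models:
compile the cost and gradient with Theano and hand them to a scipy optimizer.
A rough sketch, using a placeholder quadratic problem:

import numpy
import scipy.optimize
import theano
import theano.tensor as T

# Placeholder problem (illustrative only): minimize sum((x - 3)^2).
x = T.dvector('x')
cost = T.sum((x - 3) ** 2)

# fmin_l_bfgs_b accepts a function returning (f, g) when fprime is omitted.
f_and_g = theano.function([x], [cost, T.grad(cost, x)])

def func(v):
    c, g = f_and_g(v)
    return float(c), numpy.asarray(g, dtype='float64')

x_opt, f_opt, info = scipy.optimize.fmin_l_bfgs_b(func, numpy.zeros(5))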





Proposal for API
================

Stick to the same style of API that we've used for SGD so far.  I think it has
worked well.  It takes Theano expressions as inputs and returns Theano
expressions as results.  The caller is responsible for building those
expressions into a callable function that does the minimization (and possibly
other things too).


def stochastic_gradientbased_optimization_updates(parameters, cost=None, grads=None, **kwargs):
   """
   :param parameters: list or tuple of Theano variables (typically shared vars)
       that we want to optimize iteratively.

   :param cost: scalar-valued Theano variable that computes a noisy estimate of
       the cost (what are the conditions on the noise?).  The cost is ignored if
       grads are given.

   :param grads: list or tuple of Theano variables representing the gradients on
       the corresponding parameters.  These default to tensor.grad(cost,
       parameters).

   :param kwargs: algorithm-dependent arguments

   :returns: a list of pairs (v, new_v) that indicate the value (new_v) each
      variable (v) should take in order to carry out the optimization procedure.

      The first entries of the returned list correspond to the terms in
      `parameters`; the optimization algorithm can return additional update
      pairs afterward.  This list of pairs can be passed directly to the
      dict() constructor to create a dictionary such that dct[v] == new_v.
   """


Why not a class interface, with an __init__ that takes the kwargs and an
updates() method that returns the updates?  Because it would be wrong for the
auxiliary shared variables allocated by such an object to be involved in two
different sets of updates (as could happen if updates() were called twice), the
interface should not encourage separating those two steps into distinct methods.