view doc/v2_planning/optimization.txt @ 1239:470beb000694

merge
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Thu, 23 Sep 2010 11:49:42 -0400
parents 0e12ea6ba661
children
line wrap: on
line source

=========================
Optimization for Learning
=========================

Members: Bergstra, Lamblin, Delalleau, Glorot, Breuleux, Bordes
Leader: Bergstra



Initial Writeup by James
=========================================



Previous work - scikits, openopt, scipy  provide function optimization
algorithms.  These are not currently GPU-enabled but may be in the future.


IS PREVIOUS WORK SUFFICIENT?
--------------------------------

In many cases it is (I used it for sparse coding, and it was ok).

These packages provide batch optimization, whereas we typically need online
optimization.

It can be faster (to run) and more convenient (to implement) to have
optimization algorithms as Theano update expressions.


What optimization algorithms do we want/need?
---------------------------------------------

 - sgd 
 - sgd + momentum
 - sgd with annealing schedule
 - TONGA
 - James Marten's Hessian-free
 - Conjugate gradients, batch and (large) mini-batch [that is also what Marten's thing does]

Do we need anything to make batch algos work better with Pylearn things?
 - conjugate methods? yes
 - L-BFGS? maybe, when needed





Discussion
==========

OD asks: Could it be more convenient for x0 to be a list?
 
JB replies: Yes, but that's not the interface used by other minimize()
routines (e.g. in scipy).  Maybe another list-based interface is required?

OD replies: I think most people would prefer to use a list-based interface, so
    they don't have to manually pack / unpack multiple arrrays of parameters. So I
    would vote in favor or having both (where the main reason to also provide a
    non-list interface would be to allow one to easily switch e.g. to scipy's
    minimize). 
    I would guess the reason scipy's interface is like this is because it makes
    it easier for the optimization algorithm. However, this does not really
    matter if we are just wrapping a theano-based algorithm (that already has
    to handle multiple parameters), and avoiding useless data copies on each call
    to f / df can only help speed-wise.

JB replies: Done, I added possibility that x0 is list of ndarrays to the api
doc.



OD asks: Why make a difference between iterative and one-shot versions? A one-shot
    algorithm can be seen as an iterative one that stops after its first
    iteration. The difference I see between the two interfaces proposed here
    is mostly that one relies on Theano while the other one does not, but
    hopefully a non-Theano one can be created by simply wrapping around the
    Theano one.

JB replies: Right, it would make more sense to distinguish them by the fact that
one works on Theano objects, and the other on general Python callable functions.
There is room for an iterative numpy interface, but I didn't make it yet.  Would
that answer your question?

OD replies and asks: Partly. Do we really need a non-iterative interface?

OD: I wish we could get closer to each other the Theano and Numpy interfaces.
It would be nice if we could do something like:

.. code-block:: python

    # Theano version.
    updates = sgd([p], gradients=[g], stop=stop, step_size=.1)
    sgd_step = theano.function([input_var, target_var], [], updates=updates)
    while not stop.value:
        input, target = training_iter.next()
        sgd_step(input, target)

    # Numpy version (you can replace *.value by regular numpy arrays).
    sgd_step = sgd([p.value], gradients=g_func, stop=stop.value, step_size=.1)
    while not stop.value:
        input, target = training_iter.next()
        sgd_step(input, target)

where sgd would look something like:

.. code-block:: python

    class sgd(...):
        def __init__(self, parameters, cost=None, gradients=None, stop=None,
                     step_size=None):
            # Allow for extra arguments to be provided in self.__call__, that
            # are forwarded to the underlying gradients function.
            self.gradients = lambda *lst, **kw: gradients(*(parameters + lst),
                                                          **kw)
            ...

        def __call__(*lst, **kw):
            grads = self.gradients(*lst, **kw)
            for param, grad in izip(self.parameters, grads):
                param -= self.step_size * grad

Then a wrapper to provide a scipy-like interface could be:

.. code-block:: python

    def minimize(x0, f, df, algo, **kw):
        stop = numpy.array(0, dtype=numpy.int8)
        algo_step = eval(algo)([x0], cost=f, gradients=lambda x: (df(x), ),
                               stop=stop, **kw)
        while not stop:
            algo_step()