changeset 1048:0464f891129b

merge
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Wed, 08 Sep 2010 14:14:02 -0400
parents 1b61cbe0810b (current diff) f1732269bce8 (diff)
children ff9361e39c97
files
diffstat 1 files changed, 207 insertions(+), 92 deletions(-) [+]
--- a/doc/v2_planning/learner.txt	Wed Sep 08 14:13:43 2010 -0400
+++ b/doc/v2_planning/learner.txt	Wed Sep 08 14:14:02 2010 -0400
@@ -173,116 +173,231 @@
 the picture and make a useful boosting implementation.
 
 
+Using External Hyper-Parameter Optimization Software
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+TODO: use-case - show how we could use the optimizer from
+http://www.cs.ubc.ca/labs/beta/Projects/ParamILS/
+
 
 Implementation Details / API
 ----------------------------
 
-TODO: PUT IN TERMINOLOGY OF LEARNER, HYPER-LEARNER.
-
-TODO: SEPARATE DISCUSSION OF PERSISTENT STORAGE FROM LEARNER INTERFACE.
-
-TODO: API describing hyperparameters (categorical, integer, bounds on values, etc.)
-
-TODO: use-case - show how we could use the optimizer from
-      http://www.cs.ubc.ca/labs/beta/Projects/ParamILS/
-
-ExperimentGraph
-~~~~~~~~~~~~~~~
-
-One API that needs to be defined for this perspective to be practical is the
-ExperimentGraph.  I'll present it in terms of global functions, but an
-object-oriented things probably makes more sense in the code itself.
-
-
-    def explored_nodes(graph):
-       """Return iterator over explored nodes (ints? objects?)"""
+Learner
+~~~~~~~
+    An object that allows us to explore the graph discussed above.  Specifically, it represents
+    an explored node in that graph.
 
-    def forget_nodes(graph, nodes):
-       """Clear the nodes from memory (save space)"""
-
-    def all_edges_from(graph, node):
-       """Return iterator over all possible edges
-
-       Edges might be parametric - like "set learn_rate to (float)"
-
-       Edges might contain a reference to their 'from' end... not sure.
-       
-       """
-    def explored_edges_from(graph, node):
-        """Return the edges that have been explored
-        """
-
-    def add_node(graph, new_node):
-        """add a node.  It may be serialized."""
-
-    def add_edge(graph, edge):
-        """add edge, it may be serialize"""
-
-    def connect(graph, from_node, to_node, edge):
-        """
-        to_node = None for un-explored edge
+    def active_instructions()
+        """ Return a list/set of Instruction instances (see below) that the Learner is prepared
+        to handle.
         """
 
-It makes sense to have one ExperimentGraph implementation for each storage
-mechanism - Memory, JobMan, sqlite, couchdb, mongodb, etc.
-
-The nodes should be serializable objects (like the 'learner' objects in Yoshua's
-text above, so that you can do node.learner.predict() if the edge leading to
-`node` trained something new).
-
-The nodes could also contain the various costs (train, valid, test), and other
-experiment statistics that are node-specific.
+    def copy(), deepcopy()
+        """ Learners should be serializable """
 
 
-Some implementations might also include functions for asynchronous updating of
-the ExperimentGraph:
+    To make the implementation easier, I found it helpful to introduce a string-valued
+    `fsa_state` member attribute and to associate methods with these states.  That made it
+    syntactically easy to build relatively complex finite-state transition graphs describing
+    which instructions were active at which times in the life-cycle of a learner.
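A minimal sketch of this `fsa_state` idea.  The state names and the TrainStep/Stop instructions are invented for illustration; they are not part of any agreed-upon API.

```python
class Instruction:
    """Base class; concrete instructions override execute()."""
    def execute(self, learner, args, kwargs):
        raise NotImplementedError

class TrainStep(Instruction):
    def execute(self, learner, args, kwargs):
        learner.n_steps += 1        # stand-in for real training work

class Stop(Instruction):
    def execute(self, learner, args, kwargs):
        learner.fsa_state = 'done'  # transition in the finite-state graph

class Learner:
    # hand-written transition table: which instruction classes are
    # active in which life-cycle state
    _active = {
        'initialized': [TrainStep, Stop],
        'done': [],
    }
    def __init__(self):
        self.fsa_state = 'initialized'
        self.n_steps = 0
    def active_instructions(self):
        return [cls() for cls in self._active[self.fsa_state]]

learner = Learner()
for instr in learner.active_instructions():
    instr.execute(learner, (), {})
```

The table makes the life-cycle explicit: once Stop fires, `active_instructions()` becomes empty and exploration ends.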
 
 
-ExperimentGraphEdge
-~~~~~~~~~~~~~~~~~~~
-
-The ExperimentGraph is primarily a dictionary container for nodes and edges.
-An ExperimentGraphEdge implementation is the model-dependent component that
-actually interprets the edges as computations.
-
-    def estimate_compute_time(graph, node, edge):
-       """Return an estimated walltime expense for the computation"""
+Instruction
+~~~~~~~~~~~
+    An object that represents a potential edge in the graph discussed above.  It is an
+    operation that a learner can perform.
 
-    def compute_edge(graph, node, edge, async=False, priority=1):
-       """Run the computations assocated with this graph edge, and store the
-       resulting 'to_node' to the graph when complete.
-
-       If async is True, the function doesn't return until the graph is updated
-       with `to_node`.
+    arg_types
+        """a list of Type object (see below) indicating what args are required by execute"""
 
-       The priority is used by implementations that use cluster software or
-       something to manage a worker pool that computes highest-priority edges
-       first.
-
-       """
-
-    def list_compute_queue(graph):
-        """Return edges scheduled for exploration (and maybe a handle for
-        where/when they started running and other backend details)
+    def execute(learner, args, kwargs):
+        """ Perform some operation on the learner (follow an edge in the graph discussed above)
+        and modify the learner in-place.  Calling execute 'moves' the learner from one node in
+        the graph along an edge.  To have the old learner as well, it must be copied prior to
+        calling execute().
         """
 
-Different implementations of ExperimentGraphExplorer will correspond to
-different experiments.  There can also be ExperimentGraphExplorer
-implementations that are proxies, and perform the computations in different
-threads, or across ssh, or cluster software.
+    def expense(learner, args, kwargs, resource_type='CPUtime'):
+        """ Return an estimated cost of performing this instruction (calling execute), in time,
+        space, number of computers, disk requirement, etc.
+        """
+
+Type
+~~~~
+    An object that describes a parameter domain for a call to Instruction.execute.
+    It is not necessary that a Type specify exactly which arguments are legal, but it should
+    `include` all legal arguments, and exclude as many illegal ones as possible.
+
+    def includes(value):
+        """return True if value is a legal argument"""
 
 
-Learner
-~~~~~~~
-
-A learner is a program that implements a policy for graph exploration by
-exploiting the ExperimentGraph and ExperimentGraphEdge interfaces.
-
-The convenience of the API hinges on the extent to which we can implement
-policies that work on different experiment-graphs (where the labels on the edges
-and semantics are different).  The use-cases above make me optimistic that it
-will work sufficiently well to be worth doing in the absence of better ideas.
+    To make things a bit more practical, there are some Type subclasses like Int, Float, Str,
+    ImageDataset, SgdOptimizer, that include additional attributes (e.g. min, max, default) so
+    that automatic graph exploration algorithms can generate legal arguments with reasonable
+    efficiency.
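A hedged sketch of what a Float Type subclass might look like.  The `sample` helper and the exact attribute names (min, max, default) are assumptions for illustration, not a settled interface.

```python
import random

class Float:
    def __init__(self, min=0.0, max=1.0, default=None):
        self.min, self.max, self.default = min, max, default
    def includes(self, value):
        # over-approximate the legal set: every float in [min, max]
        return isinstance(value, float) and self.min <= value <= self.max
    def sample(self, rng):
        # lets an automatic graph explorer generate a legal argument
        return rng.uniform(self.min, self.max)

lr_type = Float(min=1e-5, max=1.0, default=0.01)
```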
 
 
 
+The proxy pattern is a powerful way to combine learners, especially when proxy Learner
+instances also introduce proxy Instruction classes.
 
+For example, it is straightforward to implement a hyper-learner by implementing a Learner with
+another learner (sub-learner) as a member attribute.  The hyper-learner makes some
+modifications to the instruction_set() return value of the sub-learner, typically to introduce
+more powerful instructions and hide simpler ones.
+
+It is less straightforward, but still consistent with the design, to implement a Learner that
+encompasses job management.  Such a learner would retain the semantics of the sub-learner's
+instruction set, but would replace the Instruction objects themselves with Instructions that
+arrange for remote procedure calls (e.g. jobman, multiprocessing, bqtools, etc.).  Such a
+learner would replace synchronous instructions (return on completion) with asynchronous ones
+(return after scheduling), and the active instruction set would also change asynchronously,
+but neither of these things is inconsistent with the Learner API.
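A sketch of the hyper-learner proxy described above: a Learner that wraps a sub-learner and rewrites its instruction set, hiding a fine-grained instruction behind a more powerful composite one.  All class names here are invented.

```python
class TrainStep:
    def execute(self, learner, args, kwargs):
        learner.n_steps += 1

class TrainManySteps:
    """Composite instruction introduced by the hyper-learner."""
    def __init__(self, n):
        self.n = n
    def execute(self, learner, args, kwargs):
        for _ in range(self.n):     # delegate to the hidden instruction
            TrainStep().execute(learner, args, kwargs)

class SubLearner:
    def __init__(self):
        self.n_steps = 0
    def active_instructions(self):
        return [TrainStep()]

class HyperLearner:
    def __init__(self, sub):
        self.sub = sub
    def active_instructions(self):
        # hide the sub-learner's fine-grained step, expose the composite
        return [TrainManySteps(10)]

hyper = HyperLearner(SubLearner())
hyper.active_instructions()[0].execute(hyper.sub, (), {})
```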
+
+
+TODO
+~~~~
+
+I feel like something is missing from the API - and that is an interface to the graph structure
+discussed above.  The nodes in this graph are natural places to store meta-information for
+visualization, statistics-gathering etc.   But none of the APIs above corresponds to the graph
+itself. In other words, there is no API through which to attach information to nodes.  It is
+not good to say that the Learner instance *is* the node because (a) learner instances change
+during graph exploration and (b) learner instances are big, and we don't want to have to keep a
+whole saved model just to attach meta-info (e.g. a validation score).  Choosing this API spills
+over into other committees, so we should get their feedback about how to resolve it.
+
+Comment by OD
+~~~~~~~~~~~~~
+(I hope it's ok to leave comments even though I'm not in committee... I'm
+interested to see how the learner interface is shaping up so I'll be keeping
+an eye on this file)
+I'm wondering what's the benefit of such an API compared to simply defining a
+new method for each instruction. It seems to me that typically, the 'execute'
+method would end up being something like
+    if instruction == 'do_x':
+        self.do_x(..)
+    elif instruction == 'do_y':
+        self.do_y(..)
+    ...
+so why not directly call do_x / do_y instead?
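One possible benefit, sketched below with toy classes (DoX/DoY are invented stand-ins): when instructions are objects rather than method names, a generic exploration policy can enumerate and invoke them without any if/elif dispatch.

```python
class DoX:
    def execute(self, learner, args, kwargs):
        learner.log.append('x')

class DoY:
    def execute(self, learner, args, kwargs):
        learner.log.append('y')

class Learner:
    def __init__(self):
        self.log = []
    def active_instructions(self):
        return [DoX(), DoY()]

def explore(learner):
    # generic driver: never mentions do_x / do_y by name
    for instr in learner.active_instructions():
        instr.execute(learner, (), {})

learner = Learner()
explore(learner)
```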
+
+
+Comment by RP
+~~~~~~~~~~~~~
+
+James, correct me if I'm wrong, but I think each instruction has an execute
+method.  The job of the learner is to traverse the graph and, for each edge
+that it decides to cross, to call that edge's execute.  Maybe James has
+something else in mind, but this was my understanding.
+
+
+
+Just another view/spin on the same idea (Razvan)
+================================================
+
+
+My idea is probably just a spin-off from what James wrote.  It is an extension
+of what I sent on the mailing list some time ago.
+
+Big Picture
+-----------
+
+What do we care about?
+~~~~~~~~~~~~~~~~~~~~~~~
+
+This is the list of the main points that I have in mind:
+
+ * Re-usability
+ * Extensibility
+ * Simplicity or easily readable code (connected to re-usability)
+ * Modularity (connected to extensibility)
+ * Fast-to-write code (sort of comes out of simplicity)
+ * Efficient code
+
+
+Composition
+~~~~~~~~~~~
+
+To me this reads as code generated by composing pieces.  Imagine this:
+you start off with something primitive that I will call a "variable", which
+is probably a very unsuitable name.  Then you compose those initial
+"variables", or transform them, through several "functions".  Each such
+"function" hides some logic that you, as the user, don't care about.
+You can have low-level or micro "functions" and high-level or macro
+"functions", where a high-level function is just a certain compositional
+pattern of low-level "functions".  There are several classes of "functions"
+and "variables" that can be interchangeable.  This is how modularity is
+obtained: by changing between functions from a certain class.
+
+Now when you want to research something, what you do is first select
+the step you want to look into.  If you are lucky you can re-write this
+step as a certain decomposition of low-level transformations (there can be
+multiple such decompositions).  If not, you have to implement such a
+decomposition according to your needs.  Pick the low-level transformations
+you want to change and write new versions that implement your logic.
+
+I think the code will be easy to read, because it just applies a fixed
+set of transformations, one after the other.  Whoever writes the code can
+decide how explicitly to write things by switching between high-level
+and low-level functions.
+
+I think the code written this way is re-usable, because you can just take this
+chain of transformations and replace the one you care about, without looking
+into the rest.
+
+You get this fractal property of the code.  Zooming in, you always get just
+a set of functions applied to a set of variables.  In the beginning those might
+not be there, and you would have to create new "low-level" decompositions,
+maybe even new "variables" that pass data between those decompositions.
+
+The thing with variables here is that I don't want these "functions" to have
+state.  All the information is passed along through the variables.  This way
+understanding the graph is easy, and debugging it is also easier (than having
+all these hidden states).
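A small sketch of this stateless "functions over variables" idea: each function takes variables in and returns new variables, keeping no hidden state, so a pipeline is just a chain of applications.  The function names are invented.

```python
def scale(xs, factor):
    # low-level "function": pure, no hidden state
    return [x * factor for x in xs]

def shift(xs, offset):
    return [x + offset for x in xs]

def normalize_ish(xs):
    # a high-level "function" is just a compositional pattern of
    # low-level ones
    return shift(scale(xs, 0.5), 1.0)

result = normalize_ish([0.0, 2.0, 4.0])
```

Because no function keeps state, replacing one link in the chain (say, a different `scale`) cannot affect the others, which is what makes the chain easy to re-use and debug.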
+
+Note that while doing so we might (and I strongly think we should) create
+a (symbolic) DAG of operations (this is where it becomes what James was
+saying).  In such a DAG the "variables" will be the nodes and the functions
+will be the edges.  I think having a DAG is useful in many ways (all these
+are things that one might think about implementing in the far future; I'm not
+proposing to implement them unless we want to use them - like the
+reconstruction):
+  * there exists the possibility of writing optimizations (theano style)
+  * there exists the possibility of adding global-view utility functions (like
+    a reconstruction function for SdA - extremely low level here), or
+    global-view diagnostic tools
+  * the possibility of creating a GUI (where you just create the graph by
+    picking transforms and variables from a list) or working interactively
+    and then generating code that will reproduce the graph
+  * you can view the graph at different granularity levels to understand
+    things (global diagnostics)
+
+We should have a taxonomy of possible classes of functions and possible
+classes of variables, but those should not be exclusive.  We can work at a
+high level for now, and decompose those high-level functions into lower-level
+ones when we need to.  We can introduce new classes of functions or
+intermediate variables between those low-level functions.
+
+
+Similarities with James' idea
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ As I said before, this is, I think, just another view of what James proposed.
+ The learner in his case is the module that traverses the graph of these
+ operations, which makes sense here as well.
+
+ The 'execute' command in his API is just applying a function to some
+ variables in my case.
+
+ The learner keeps track of the graph that is formed, I think, in both cases.
+
+ His view is a bit more general.  I see the graph as fully created by the
+ user, and the learner just has to go from the start to the end.  In his case
+ the traversal is conditioned on some policies.  I think these ideas can be
+ mixed / united.  What I would see in my case, to have this functionality, is
+ something similar to the lazy linker for Theano.
+ 
+
+
+