doc/v2_planning/main_plan.txt @ 1087:8c448829db30
learning committee first draft of an api
author: Razvan Pascanu <r.pascanu@gmail.com>
date: Sat, 11 Sep 2010 20:33:34 -0400

Motivation
==========

Yoshua (points discussed Thursday Sept 2, 2010 at LISA tea-talk)
----------------------------------------------------------------

****** Why we need to get better organized in our code-writing ******

- the current state of affairs on top of Theano is anarchic and does not lend itself to easy code re-use
- the lab is growing and will continue to grow significantly, and more people outside the lab are using Theano
- we have new industrial partners and funding sources that demand deliverables, and more/better collectively organized efforts

*** Who can take advantage of this ***

- us, directly, taking advantage of the different advances made by different researchers in the lab to yield better models
- us, finding it easier to compare different models and different datasets with different metrics on the different computing platforms available to us
- future us, new students, able to quickly move into 'production' mode without having to reinvent the wheel
- students in the two ML classes, able to play with the library to explore new ML variants
- other ML researchers in academia, able to play with our algorithms, try new variants, cite our papers
- non-ML users in or out of academia, and our user-partners


*** Move with care ***

- Write down use-cases and examples for each type of module; do not try to be TOO general
- We want to keep the ease of exploration and flexibility, not create a prison
- Too many constraints can lead to paralysis, especially in a C++-style object-oriented model
- Too few guidelines lead to code components that are not interchangeable
- Poor code practice leads to buggy, spaghetti code

*** What ***

- define standards
- write up a few instances of each basic type (dataset, learner, optimizer, hyper-parameter exploration boilerplate, etc.), enough to implement some of the basic algorithms we use often (e.g. those in the tutorials)
- let the library grow according to our needs
- keep tight reins on it to control quality

*** Content and Form ***

We need to establish guidelines and conventions for

* Content: what are the re-usable components? define conventions or an API for each, and make sure they fit with each other
* Form: social engineering, coding practices and conventions, code review, incentives

Yoshua:
-------

We are missing a *Theano Machine Learning library*.

The deep learning tutorials do a good job, but they lack the following features, which I would like to see in a ML library:

- a well-organized collection of Theano symbolic expressions (formulas) for handling most of
  what is needed either in implementing existing well-known ML and deep learning algorithms or
  in creating new variants (without having to start from scratch each time); that is, the
  mathematical core,

- a well-organized collection of python modules to help with the following:

  - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
  - generic utility code for optimization

    - stochastic gradient descent variants
    - early stopping variants
    - interfacing to generic 2nd-order optimization methods
    - 2nd-order methods tailored to work on minibatches
    - optimizers for sparse coefficients / parameters

  - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
  - generic code for performance estimation and experimental statistics
  - visualization tools (using existing python libraries) and examples for all of the above
  - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them

  [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.]

- a well-documented set of python scripts using the above library to show how to run the most
  common ML algorithms (possibly with examples showing how to run multiple experiments with
  many different models and collect statistical comparative results). This is particularly
  important for pure users to adopt Theano in ML application work.

Ideally, there would be one person in charge of this project, making sure a coherent and
easy-to-read design is developed, along with many helping hands (to implement the various
helper modules, formulae, and learning algorithms).


James:
------

I am interested in the design and implementation of the "well-organized collection of Theano
symbolic expressions...".

I would like to explore algorithms for hyper-parameter optimization, following up on some
"high-throughput" work. I'm most interested in the "generic code for model selection and
hyper-parameter optimization..." and the "generic code for performance estimation...".

I have some experience with the data-access requirements, and some lessons I'd like to share
on that, but no time to work on that aspect of things.

I will continue to contribute to the "well-documented set of python scripts using the above to
showcase common ML algorithms...". I have an Olshausen & Field-style sparse coding script that
could be polished up. I am also implementing the mcRBM, and I'll be able to add that when it's
done.



Suggestions for how to tackle various desiderata
================================================

Theano Symbolic Expressions for ML
----------------------------------

We could make this a submodule of pylearn: ``pylearn.nnet``.

Yoshua: I would use a different name, e.g., "pylearn.formulas", to emphasize that it is not just
about neural nets, and that this is a collection of formulas (expressions) rather than
completely self-contained classes for learners. We could have a "nnet.py" file for
neural nets, though.

There are a number of ideas floating around for how to handle classes /
modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn), so let's implement as much
math as possible in global functions with no classes. There are no models in
the wish list that require more than a few vectors and matrices to parametrize.
Global functions are more reusable than classes.
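To make the "global functions, no classes" point concrete, here is a minimal sketch of what formulas-as-functions could look like. The names are illustrative, not a proposed API, and plain numpy stands in for Theano symbolic expressions (the real formulas would operate on symbolic variables):

```python
import numpy as np

# Hypothetical "pylearn.formulas"-style helpers: plain functions of arrays,
# parametrized by a few vectors and matrices. No classes involved, so any
# model can compose and reuse them freely.

def sigmoid(x):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def hidden_layer(x, W, b):
    """Affine transform followed by a sigmoid nonlinearity."""
    return sigmoid(np.dot(x, W) + b)

def cross_entropy(p, t):
    """Mean binary cross-entropy between predictions p and targets t."""
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))
```

Because these are global functions rather than methods, a new model variant is just a new composition of formulas, which is the reusability argument above.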

Data access
-----------

A general interface to datasets from the perspective of an experiment driver
(e.g. kfold) is to see them as a function that maps an index (typically an integer)
to an example (whose type and nature depend on the dataset; it could, for
instance, be an (image, label) pair). This interface permits iterating over
the dataset, shuffling the dataset, and splitting it into folds. For
efficiency, it is nice if the dataset interface supports looking up several
index values at once, because looking up many examples at once can sometimes
be faster than looking each one up in turn. In particular, looking up
a consecutive block of indices, or a slice, should be well supported.
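A minimal sketch of this "dataset as a function of index" view, with the batch and fold operations described above (names and details are assumptions, not a settled interface):

```python
import random

class MemoryDataset:
    """Toy random-access dataset backed by a list of examples.

    Supports single-index lookup, slice/batch lookup, iteration,
    shuffling, and splitting into k folds.
    """

    def __init__(self, examples):
        self._examples = list(examples)

    def __len__(self):
        return len(self._examples)

    def __getitem__(self, idx):
        # Slices and index lists are looked up in one call, so a smarter
        # backend could fetch a consecutive block more efficiently than
        # a Python loop over single lookups.
        if isinstance(idx, slice):
            return self._examples[idx]
        if isinstance(idx, (list, tuple)):
            return [self._examples[i] for i in idx]
        return self._examples[idx]

    def shuffled(self, seed=0):
        """Return a shuffled copy (deterministic given the seed)."""
        rng = random.Random(seed)
        perm = list(range(len(self)))
        rng.shuffle(perm)
        return MemoryDataset(self[perm])

    def kfolds(self, k):
        """Yield (train, valid) dataset pairs for k-fold cross-validation."""
        fold = len(self) // k
        for i in range(k):
            lo, hi = i * fold, (i + 1) * fold
            valid = self[lo:hi]
            train = self[:lo] + self[hi:]
            yield MemoryDataset(train), MemoryDataset(valid)
```

A streaming dataset would implement the same surface but raise NotImplementedError from `__getitem__`, matching the non-random-access case discussed below.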

Some datasets may not support random access (e.g. a random number stream), and
that's fine as long as an exception is raised. The user will see a NotImplementedError
or similar, and try something else. We might want a way to test
whether a dataset is random-access or not without having to load an example.
945
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
141 |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
142 |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
143 A more intuitive interface for many datasets (or subsets) is to load them as |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
144 matrices or lists of examples. This format is more convenient to work with at |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
145 an ipython shell, for example. It is not good to provide only the "dataset |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
146 as a function" view of a dataset. Even if a dataset is very large, it is nice |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
147 to have a standard way to get some representative examples in a convenient |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
148 structure, to be able to play with them in ipython. |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
149 |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
150 |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
151 Another thing to consider related to datasets is that there are a number of |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
152 other efforts to have standard ML datasets, and we should be aware of them, |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
153 and compatible with them when it's easy: |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
154 - mldata.org (they have a file format, not sure how many use it) |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
155 - weka (ARFF file format) |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
156 - scikits.learn |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
157 - hdf5 / pytables |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
158 |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
159 |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
160 pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
161 folder that is assumed to have a standard form across different installations. |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
162 That's where the data files are. The correct format of this folder is currently |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
163 defined implicitly by the contents of /data/lisa/data at DIRO, but it would be |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
164 better to document in pylearn what the contents of this folder should be as |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
165 much as possible. It should be possible to rebuild this tree from information |
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
166 found in pylearn. |
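As an illustration, a loader might resolve files under DATA_ROOT like this. The fallback path is the DIRO convention mentioned above; the layout underneath it (e.g. a "mnist" subfolder) is only a placeholder, since documenting that layout is exactly the open task:

```python
import os

def dataset_path(*parts, default_root="/data/lisa/data"):
    """Resolve a path under the DATA_ROOT folder.

    Falls back to the DIRO location when DATA_ROOT is not set. The
    folder layout joined onto the root is hypothetical.
    """
    root = os.environ.get("DATA_ROOT", default_root)
    return os.path.join(root, *parts)
```

Usage would look like `dataset_path("mnist", "mnist.pkl.gz")`, returning a path inside whatever tree the installation points DATA_ROOT at.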

Yoshua (about ideas proposed by Pascal Vincent a while ago):

- we may want to distinguish between datasets and tasks: a task defines
  not just the data but also things like what the input is and what the
  target is (for supervised learning), and *importantly* a set of performance metrics
  that make sense for this task (e.g. those used by papers solving a particular
  task, or reported for a particular benchmark)

- we should discuss a few "standards" that datasets and tasks may comply to, such as

  - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
    (with a convention for the semi-supervised case when only the input or only the target is observed)
  - "input" for unsupervised learning
  - conventions for missing-valued components inside input or target
  - how examples that are sequences are treated (e.g. the input or the target is a sequence)
  - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
  - how error metrics are specified

    * example-level statistics (e.g. classification error)
    * dataset-level statistics (e.g. ROC curve, mean and standard error of error)
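One way these example and metric conventions could be sketched; all names here are tentative, and a `target` of `None` stands in for the unobserved half of a semi-supervised example:

```python
# Each example is a dict with "input" and (optionally) "target" fields.
# A Task bundles examples with the error metrics that make sense for it,
# distinguishing example-level metrics from dataset-level statistics.

def classification_error(example, prediction):
    """Example-level metric: 1 if the predicted label is wrong, else 0."""
    return 0.0 if prediction == example["target"] else 1.0

class Task:
    def __init__(self, examples, metrics):
        self.examples = examples   # list of {"input": ..., "target": ...}
        self.metrics = metrics     # dict: name -> example-level metric

    def mean_metric(self, name, predictions):
        """Dataset-level statistic: average an example-level metric over
        the labeled examples (examples with target=None are skipped)."""
        scores = [self.metrics[name](ex, p)
                  for ex, p in zip(self.examples, predictions)
                  if ex["target"] is not None]
        return sum(scores) / len(scores)
```

A benchmark task would then ship its metrics alongside its data, so results reported through it are comparable across models.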

Model Selection & Hyper-Parameter Optimization
----------------------------------------------

Driving a distributed computing job for a long time to optimize
hyper-parameters using one or more clusters is the goal here.
Although there might be some library-type code to write here, I think of this
more as an application template. The user would use python code to describe
the experiment to run and the hyper-parameter space to search. Then this
application driver would take control of scheduling jobs and running them on
various computers... I'm imagining a potentially ugly brute of a hack that's
not necessarily something we will want to expose at a low level for reuse.

Yoshua: We want both the library-defined driver that takes instructions about how to generate
new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which
to sample them), and examples showing how to use it in typical cases.
Note that sometimes we just want to find the best configuration of hyper-parameters,
but sometimes we want to do more subtle analysis. Often a combination of both.
In this respect it could be useful for the user to distinguish hyper-parameters over
which scientific questions are asked (e.g. the depth of an architecture) from
hyper-parameters that we would like to marginalize/maximize over (e.g. the learning rate).
This can influence both the sampling of configurations (we want to make sure that all
combinations of question-driving hyper-parameters are covered) and the analysis
of results (we may want to estimate ANOVAs, or compute averages or quantiles over
the non-question-driving hyper-parameters).
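A sketch of how a driver might treat the two kinds of hyper-parameters: fully cross the question-driving ones so every combination is covered, and randomly sample the nuisance ones within each cell for later marginalization. The function name and argument structure are assumptions, not jobman's or any existing driver's API:

```python
import itertools
import random

def sample_configs(question_params, nuisance_params, n_per_cell, seed=0):
    """Yield experiment configurations.

    question_params: dict name -> list of values; every combination
        (e.g. every depth) is covered, so scientific questions about
        these hyper-parameters can be answered from the results.
    nuisance_params: dict name -> sampler(rng) -> value; drawn at random
        within each cell, to be marginalized/maximized over at analysis
        time (e.g. the learning rate).
    """
    rng = random.Random(seed)
    names = sorted(question_params)
    for combo in itertools.product(*(question_params[n] for n in names)):
        cell = dict(zip(names, combo))
        for _ in range(n_per_cell):
            cfg = dict(cell)
            for pname, sampler in sorted(nuisance_params.items()):
                cfg[pname] = sampler(rng)
            yield cfg
```

For example, `sample_configs({"depth": [1, 2, 3]}, {"lr": lambda rng: 10 ** rng.uniform(-4, -1)}, n_per_cell=10)` covers every depth while sampling learning rates log-uniformly within each depth cell.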

Python scripts for common ML algorithms
---------------------------------------

The script aspect of this feature request makes me think that what would be
good here is more tutorial-type scripts. And the existing tutorials could
potentially be rewritten to use some of the pylearn.nnet expressions. More
tutorials / demos would be great.

Yoshua: agreed that we could write them as tutorials, but note how the
spirit would be different from the current deep learning tutorials: we would
not mind using library code as much as possible, instead of trying to flatten
everything out in the interest of pedagogical simplicity. Instead, these
tutorials should be meant to illustrate not the algorithms but *how to take
advantage of the library*. They could also be used as *BLACK BOX* implementations
by people who don't want to dig lower and just want to run experiments.

Functional Specifications
=========================

TODO:
Put these into different text files so that this one does not become a monster.
For each thing with a functional spec (e.g. the datasets library, the optimization library), make a
separate file.

Indexing Convention
~~~~~~~~~~~~~~~~~~~

Something to decide on: Fortran-style or C-style indexing. Although we have
often used C-style indexing in the past (for efficiency in C!), this is no
longer an issue with numpy, because the physical layout is independent of the
indexing order. The fact remains that Fortran-style indexing follows linear
algebra conventions, while C-style indexing does not. If a global function
includes a lot of math derivations, it would be *really* nice if the code used
the same convention for the orientation of matrices, and it is endlessly
annoying to have to transpose everything all the time.
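The claim about numpy can be checked directly: memory layout and indexing order are decoupled, so either math convention can sit on top of either physical layout. For example:

```python
import numpy as np

# The same logical matrix can be stored row-major (C) or column-major
# (Fortran); indexing a[i, j] means the same thing either way, so the
# choice of math convention is independent of memory layout.
a_c = np.array([[1, 2], [3, 4]], order='C')
a_f = np.asfortranarray(a_c)

assert a_c[1, 0] == a_f[1, 0] == 3            # identical indexing semantics
assert a_c.flags['C_CONTIGUOUS'] and a_f.flags['F_CONTIGUOUS']
assert a_c.tobytes('A') != a_f.tobytes('A')   # but different physical layout
```

So the decision is purely about which orientation convention reads most naturally in the math, not about efficiency.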