doc/v2_planning.txt @ 949:d944e1c26a57
author: gdesjardins
date: Mon, 16 Aug 2010 10:39:36 -0400

Motivation
==========

Yoshua:
-------

We are missing a *Theano Machine Learning library*.

The deep learning tutorials do a good job, but they lack the following features, which I would like to see in an ML library:
11
12 - a well-organized collection of Theano symbolic expressions (formulas) for handling most of
13 what is needed either in implementing existing well-known ML and deep learning algorithms or
14 for creating new variants (without having to start from scratch each time), that is the
15 mathematical core,
16
17 - a well-organized collection of python modules to help with the following:
18 - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
19 - generic utility code for optimization
20 - stochastic gradient descent variants
21 - early stopping variants
22 - interfacing to generic 2nd order optimization methods
23 - 2nd order methods tailored to work on minibatches
24 - optimizers for sparse coefficients / parameters
25 - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
26 - generic code for performance estimation and experimental statistics
27 - visualization tools (using existing python libraries) and examples for all of the above
28 - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them
29
30 [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.]
31
- a well-documented set of Python scripts using the above library to show how to run the most
  common ML algorithms (possibly with examples showing how to run multiple experiments with
  many different models and collect statistical comparative results). This is particularly
  important for pure users adopting Theano in ML application work.

Ideally, there would be one person in charge of this project, making sure a coherent and
easy-to-read design is developed, along with many helping hands (to implement the various
helper modules, formulae, and learning algorithms).


James:
-------

I am interested in the design and implementation of the "well-organized collection of Theano
symbolic expressions...".

I would like to explore algorithms for hyper-parameter optimization, following up on some
"high-throughput" work. I'm most interested in the "generic code for model selection and
hyper-parameter optimization..." and "generic code for performance estimation...".

I have some experience with the data-access requirements, and some lessons I'd like to share
on that, but no time to work on that aspect of things.

I will continue to contribute to the "well-documented set of Python scripts using the above to
showcase common ML algorithms...". I have an Olshausen & Field-style sparse coding script that
could be polished up. I am also implementing the mcRBM, and I'll be able to add that when it's
done.



Suggestions for how to tackle various desiderata
================================================


Theano Symbolic Expressions for ML
----------------------------------

We could make this a submodule of pylearn: ``pylearn.nnet``.

Yoshua: I would use a different name, e.g. "pylearn.formulas", to emphasize that it is not just
about neural nets, and that this is a collection of formulas (expressions) rather than
completely self-contained classes for learners. We could still have an "nnet.py" file for
neural nets, though.

There are a number of ideas floating around for how to handle classes /
modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn), so let's implement as much
math as possible in global functions with no classes. There are no models on
the wish list that require more than a few vectors and matrices to parametrize.
Global functions are more reusable than classes.


Data access
-----------

A general interface to datasets, from the perspective of an experiment driver
(e.g. kfold), is to see them as a function that maps an index (typically an integer)
to an example (whose type and nature depend on the dataset; it could, for
instance, be an (image, label) pair). This interface permits iterating over
the dataset, shuffling the dataset, and splitting it into folds. For
efficiency, it is nice if the dataset interface supports looking up several
index values at once, because looking up many examples at once can sometimes
be faster than looking each one up in turn. In particular, looking up
a consecutive block of indices, or a slice, should be well supported.
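A minimal sketch of this "dataset as a function" interface (the class and its method names are hypothetical illustrations, not an existing pylearn API):

```python
import numpy as np

class ArrayDataset:
    """Hypothetical dataset wrapper: maps index -> example.

    Supports single-index, slice, and fancy-index lookup, so an
    experiment driver can fetch a block of examples in one call."""

    def __init__(self, inputs, targets):
        self.inputs = np.asarray(inputs)
        self.targets = np.asarray(targets)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        # idx may be an int, a slice, or an array of indices;
        # numpy indexing handles all three uniformly.
        return self.inputs[idx], self.targets[idx]

# An experiment driver can then shuffle and split into folds
# simply by permuting indices:
rng = np.random.RandomState(0)
ds = ArrayDataset(np.arange(10).reshape(5, 2), np.arange(5))
perm = rng.permutation(len(ds))
train_x, train_y = ds[perm[:4]]   # block lookup of several indices at once
test_x, test_y = ds[perm[4:]]
```

A dataset without random access would simply raise NotImplementedError from ``__getitem__``, which the driver can catch and fall back on iteration.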

Some datasets may not support random access (e.g. a random number stream), and
that's fine as long as an exception is raised. The user will see a NotImplementedError
or similar, and can try something else. We might want a way to test
whether a dataset supports random access without having to load an example.


A more intuitive interface for many datasets (or subsets) is to load them as
matrices or lists of examples. This format is more convenient to work with in
an ipython shell, for example. It is not good to provide only the "dataset
as a function" view of a dataset. Even if a dataset is very large, it is nice
to have a standard way to get some representative examples in a convenient
structure, to be able to play with them in ipython.


Another thing to consider related to datasets is that there are a number of
other efforts to provide standard ML datasets; we should be aware of them,
and compatible with them when it's easy:

- mldata.org (they have a file format; not sure how many use it)
- weka (ARFF file format)
- scikits.learn
- hdf5 / pytables


pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
folder that is assumed to have a standard layout across different installations;
that's where the data files live. The correct format of this folder is currently
defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
better to document in pylearn, as much as possible, what the contents of this
folder should be. It should be possible to rebuild this tree from information
found in pylearn.
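A hedged sketch of how such a lookup might work (only the DATA_ROOT variable comes from the text above; the helper name and error messages are hypothetical):

```python
import os

def locate_dataset(relative_path):
    """Hypothetical helper: resolve a dataset file under DATA_ROOT.

    Fails with an explicit error when DATA_ROOT is unset or the file
    is missing, so every installation diverges loudly, not silently."""
    root = os.environ.get("DATA_ROOT")
    if root is None:
        raise EnvironmentError(
            "DATA_ROOT is not set; point it at your local copy "
            "of the standard pylearn data folder.")
    path = os.path.join(root, relative_path)
    if not os.path.exists(path):
        raise IOError("missing dataset file: %s" % path)
    return path
```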

Yoshua (about ideas proposed by Pascal Vincent a while ago):

- we may want to distinguish between datasets and tasks: a task defines
  not just the data but also things like what the input is and what the
  target is (for supervised learning), and, *importantly*, a set of performance metrics
  that make sense for this task (e.g. those used by papers solving a particular
  task, or reported for a particular benchmark)

- we should discuss a few "standards" that datasets and tasks may comply with, such as:

  - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
    (with a convention for the semi-supervised case when only the input or only the target is observed)
  - "input" for unsupervised learning
  - conventions for missing-valued components inside the input or target
  - how examples that are sequences are treated (e.g. the input or the target is a sequence)
  - how time-stamps are specified when appropriate (e.g. the sequences are asynchronous)
  - how error metrics are specified:

    * example-level statistics (e.g. classification error)
    * dataset-level statistics (e.g. ROC curve, mean and standard error of the error)
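As a purely illustrative sketch of these conventions (only the "input"/"target" field names come from the list above; the task structure and metric names are hypothetical):

```python
import numpy as np

# Each example is a dict with "input" and "target" fields; a missing
# target (None) marks the unlabeled part of a semi-supervised task.
examples = [
    {"input": np.array([0.2, 0.7]), "target": 1},
    {"input": np.array([0.9, 0.1]), "target": 0},
    {"input": np.array([0.5, 0.5]), "target": None},  # unlabeled
]

def classification_error(predictions, examples):
    """Example-level metric: fraction of labeled examples misclassified."""
    labeled = [(p, ex["target"]) for p, ex in zip(predictions, examples)
               if ex["target"] is not None]
    return sum(p != t for p, t in labeled) / float(len(labeled))

# A task bundles the data with the metrics that make sense for it.
task = {
    "name": "toy-binary",
    "examples": examples,
    "metrics": {"classification_error": classification_error},
}
```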


Model Selection & Hyper-Parameter Optimization
----------------------------------------------

Driving a distributed computing job for a long time to optimize
hyper-parameters using one or more clusters is the goal here.
Although there might be some library-type code to write here, I think of this
more as an application template. The user would use Python code to describe
the experiment to run and the hyper-parameter space to search. Then this
application driver would take control of scheduling jobs and running them on
various computers... I'm imagining a potentially ugly brute of a hack that's
not necessarily something we will want to expose at a low level for reuse.

Yoshua: We want both a library-defined driver that takes instructions about how to generate
new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which
to sample them), and examples showing how to use it in typical cases.
Note that sometimes we just want to find the best configuration of hyper-parameters,
but sometimes we want to do a more subtle analysis; often we want a combination of both.
In this respect it could be useful for the user to distinguish hyper-parameters over
which scientific questions are asked (e.g. the depth of an architecture) from
hyper-parameters that we would like to marginalize/maximize over (e.g. the learning rate).
This can influence both the sampling of configurations (we want to make sure that all
combinations of question-driving hyper-parameters are covered) and the analysis
of results (we may want to estimate ANOVAs, averages, or quantiles over
the non-question-driving hyper-parameters).
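One hedged sketch of this distinction (all names are hypothetical, not an existing jobman or pylearn API): enumerate every combination of the question-driving hyper-parameters, and sample the marginalized ones from a prior.

```python
import itertools
import random

def sample_configurations(question_grid, marginal_priors, n_samples_each, seed=0):
    """Cover all combinations of question-driving hyper-parameters;
    for each combination, draw the remaining hyper-parameters from
    their prior distributions n_samples_each times."""
    rng = random.Random(seed)
    keys = sorted(question_grid)
    configs = []
    for values in itertools.product(*(question_grid[k] for k in keys)):
        base = dict(zip(keys, values))
        for _ in range(n_samples_each):
            cfg = dict(base)
            for name, prior in marginal_priors.items():
                cfg[name] = prior(rng)
            configs.append(cfg)
    return configs

# Depth is question-driving: every value must appear in the design.
# Learning rate is marginalized: sampled log-uniformly from a prior.
configs = sample_configurations(
    question_grid={"depth": [1, 2, 3]},
    marginal_priors={"lr": lambda rng: 10 ** rng.uniform(-4, -1)},
    n_samples_each=5,
)
```

The analysis step can then group results by the question-driving keys and average or take quantiles over the rest.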

Python scripts for common ML algorithms
---------------------------------------

The script aspect of this feature request makes me think that what would be
good here is more tutorial-type scripts. The existing tutorials could
potentially be rewritten to use some of the pylearn.nnet expressions. More
tutorials / demos would be great.

Yoshua: agreed that we could write them as tutorials, but note how the
spirit would differ from the current deep learning tutorials: we would
not mind using library code as much as possible, instead of flattening
everything out in the interest of pedagogical simplicity. These
tutorials should be meant to illustrate not the algorithms but *how to take
advantage of the library*. They could also be used as *BLACK BOX* implementations
by people who don't want to dig lower and just want to run experiments.

Functional Specifications
=========================

TODO:
Put these into different text files so that this one does not become a monster.
For each thing with a functional spec (e.g. the datasets library, the optimization library), make a
separate file.



pylearn.formulas
----------------

A directory with functions for building layers, calculating classification
errors, cross-entropies with various distributions, free energies, etc. This
module would consist mostly of global functions, Theano Ops, and Theano
optimizations.

Yoshua: I would break it down into module files, e.g.:

pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error,
    absolute error, various sparsity penalties (L1, Student)

pylearn.formulas.linear: formulas for linear classifiers, linear regression, factor analysis, PCA

pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
    layers which could be plugged with various costs & penalties, and stacked

pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants

pylearn.formulas.noise: formulas for corruption processes

pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling

pylearn.formulas.trees: formulas for decision trees

pylearn.formulas.boosting: formulas for boosting variants

etc.

Fred: It seems that the DeepANN git repository by Xavier G. already has part of this in the form of functions.

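To make the flavour of such formula modules concrete, here is a numpy sketch of two pylearn.formulas.costs-style global functions (the function names are hypothetical; a Theano version would apply the same expressions to symbolic tensor variables):

```python
import numpy as np

def binary_crossentropy(output, target):
    """Element-wise cross-entropy of a Bernoulli output vs. a 0/1 target.

    Written as a stateless global function, per the "formulas, not
    classes" approach argued for above."""
    output = np.clip(output, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -(target * np.log(output) + (1 - target) * np.log(1 - output))

def l1_penalty(weights):
    """Sparsity penalty: sum of absolute values of the parameters."""
    return np.abs(weights).sum()
```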
Indexing Convention
~~~~~~~~~~~~~~~~~~~

Something to decide on: Fortran-style or C-style indexing. Although we have
often used C-style indexing in the past (for efficiency in C!), this is no
longer an issue with numpy, because the physical layout is independent of the
indexing order. The fact remains that Fortran-style indexing follows linear
algebra conventions, while C-style indexing does not. If a global function
includes a lot of math derivations, it would be *really* nice if the code used
the same convention for the orientation of matrices, and endlessly annoying to
have to be always transposing everything.
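The layout/indexing independence claimed above can be checked directly in numpy: the memory order is a storage detail, while the indexing semantics are unchanged.

```python
import numpy as np

# Same logical matrix stored two ways: C order (row-major) and
# Fortran order (column-major).
a_c = np.array([[1, 2, 3], [4, 5, 6]], order='C')
a_f = np.array([[1, 2, 3], [4, 5, 6]], order='F')

# Indexing semantics are identical regardless of physical layout...
assert a_c[1, 2] == a_f[1, 2] == 6

# ...only the flags describing memory layout differ.
assert a_c.flags['C_CONTIGUOUS'] and not a_c.flags['F_CONTIGUOUS']
assert a_f.flags['F_CONTIGUOUS'] and not a_f.flags['C_CONTIGUOUS']
```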