Motivation
==========

Yoshua:
-------

We are missing a *Theano Machine Learning library*.

The deep learning tutorials do a good job, but they lack the following features, which I would like to see in an ML library:

- a well-organized collection of Theano symbolic expressions (formulas) for handling most of
  what is needed either for implementing existing well-known ML and deep learning algorithms or
  for creating new variants (without having to start from scratch each time); that is, the
  mathematical core,

- a well-organized collection of python modules to help with the following:
    - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.)
    - generic utility code for optimization
        - stochastic gradient descent variants
        - early stopping variants
        - interfacing with generic 2nd-order optimization methods
        - 2nd-order methods tailored to work on minibatches
        - optimizers for sparse coefficients / parameters
    - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman)
    - generic code for performance estimation and experimental statistics
    - visualization tools (using existing python libraries) and examples for all of the above
    - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them

[Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel, and newbies don't benefit from a shared knowledge base.]

- a well-documented set of python scripts using the above library to show how to run the most
  common ML algorithms (possibly with examples showing how to run multiple experiments with
  many different models and collect statistical comparative results). This is particularly
  important for pure users to adopt Theano in their ML application work.

Ideally, there would be one person in charge of this project, making sure a coherent and
easy-to-read design is developed, along with many helping hands (to implement the various
helper modules, formulae, and learning algorithms).


James:
-------

I am interested in the design and implementation of the "well-organized collection of Theano
symbolic expressions...".

I would like to explore algorithms for hyper-parameter optimization, following up on some
"high-throughput" work. I'm most interested in the "generic code for model selection and
hyper-parameter optimization..." and "generic code for performance estimation...".

I have some experience with the data-access requirements, and some lessons I'd like to share
on that, but no time to work on that aspect of things.

I will continue to contribute to the "well-documented set of python scripts using the above to
showcase common ML algorithms...". I have an Olshausen & Field-style sparse coding script that
could be polished up. I am also implementing the mcRBM, and I'll be able to add that when it's
done.


Suggestions for how to tackle various desiderata
================================================


Theano Symbolic Expressions for ML
----------------------------------

We could make this a submodule of pylearn: ``pylearn.nnet``.

Yoshua: I would use a different name, e.g., "pylearn.formulas", to emphasize that it is not just
about neural nets, and that this is a collection of formulas (expressions) rather than
completely self-contained classes for learners. We could have a "nnet.py" file for
neural nets, though.

There are a number of ideas floating around for how to handle classes /
modules (LeDeepNet, pylearn.shared.layers, pynnet, DeepAnn), so let's implement as much
math as possible as global functions with no classes. There are no models on
the wish list that require more than a few vectors and matrices to parametrize.
Global functions are more reusable than classes.
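
For illustration, here is a minimal sketch of the "global function" style using the
Theano tensor API; the function names are hypothetical, not an agreed-upon interface::

    import theano.tensor as T

    def logistic_layer(x, w, b):
        """Affine transform followed by a sigmoid, as a reusable formula."""
        return T.nnet.sigmoid(T.dot(x, w) + b)

    def binary_cross_entropy(output, target):
        """Mean binary cross-entropy of `output` w.r.t. a binary `target`."""
        return -T.mean(target * T.log(output) + (1 - target) * T.log(1 - output))

Because these are plain functions over symbolic variables, they compose freely:
``binary_cross_entropy(logistic_layer(x, w, b), y)`` is already a usable training cost.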


Data access
-----------

A general interface to datasets, from the perspective of an experiment driver
(e.g. kfold), is to see them as a function that maps an index (typically an integer)
to an example (whose type and nature depend on the dataset; it could, for
instance, be an (image, label) pair). This interface permits iterating over
the dataset, shuffling the dataset, and splitting it into folds. For
efficiency, it is nice if the dataset interface supports looking up several
index values at once, because looking up many examples at once can sometimes
be faster than looking each one up in turn. In particular, looking up
a consecutive block of indices, or a slice, should be well supported.
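
As a concrete illustration, here is a minimal sketch (with hypothetical names) of
such an index-to-example interface in plain Python::

    class MinimalDataset(object):
        """Wraps an in-memory sequence of examples behind the
        index -> example interface."""

        def __init__(self, examples):
            self._examples = examples

        def __len__(self):
            return len(self._examples)

        def __getitem__(self, idx):
            # idx may be an int, a slice, or a sequence of ints;
            # supporting slices makes consecutive-block lookups cheap.
            if isinstance(idx, (int, slice)):
                return self._examples[idx]
            return [self._examples[i] for i in idx]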

Some datasets may not support random access (e.g. a random number stream), and
that's fine as long as an exception is raised. The user will see a NotImplementedError
or similar, and try something else. We might want to have a way to test
whether a dataset is random-access or not without having to load an example.
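One possible convention (a sketch, not a settled design) is a class-level flag that
advertises random access, so callers can test for it without loading anything::

    class StreamDataset(object):
        """A dataset that can only be consumed sequentially."""
        has_random_access = False  # hypothetical attribute name

        def __getitem__(self, idx):
            raise NotImplementedError("stream dataset: no random access")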


A more intuitive interface for many datasets (or subsets of them) is to load them as
matrices or lists of examples. This format is more convenient to work with at
an ipython shell, for example. It is not good to provide only the "dataset
as a function" view of a dataset. Even if a dataset is very large, it is nice
to have a standard way to get some representative examples in a convenient
structure, to be able to play with them in ipython.
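For instance, a small (hypothetical) helper in this spirit could pull the first few
examples into a numpy array for interactive inspection::

    import numpy as np

    def head(dataset, n=10):
        """Return up to the first `n` examples as an ndarray."""
        return np.asarray([dataset[i] for i in range(min(n, len(dataset)))])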


Another thing to consider related to datasets is that there are a number of
other efforts to have standard ML datasets, and we should be aware of them,
and compatible with them when it's easy:

- mldata.org (they have a file format; not sure how many use it)
- weka (ARFF file format)
- scikits.learn
- hdf5 / pytables


pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem
folder that is assumed to have a standard form across different installations.
That's where the data files are. The correct format of this folder is currently
defined implicitly by the contents of /data/lisa/data at DIRO, but it would be
better to document in pylearn, as much as possible, what the contents of this
folder should be. It should be possible to rebuild this tree from information
found in pylearn.
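A minimal sketch of how code might resolve files under that tree (the helper name
and the fallback path are illustrative, not part of any current API)::

    import os

    def dataset_path(*relative_parts):
        """Join a dataset-relative path onto the DATA_ROOT tree."""
        root = os.environ.get('DATA_ROOT', '/data/lisa/data')
        return os.path.join(root, *relative_parts)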

Yoshua (about ideas proposed by Pascal Vincent a while ago):

- we may want to distinguish between datasets and tasks: a task defines
  not just the data but also things like what the input is and what the
  target is (for supervised learning), and *importantly* a set of performance metrics
  that make sense for this task (e.g. those used by papers solving a particular
  task, or reported for a particular benchmark)

- we should discuss a few "standards" that datasets and tasks may comply with (a sketch
  follows this list), such as
    - "input" and "target" fields inside each example, for supervised or semi-supervised learning tasks
      (with a convention for the semi-supervised case when only the input or only the target is observed)
    - "input" for unsupervised learning
    - conventions for missing-valued components inside input or target
    - how examples that are sequences are treated (e.g. the input or the target is a sequence)
    - how time-stamps are specified when appropriate (e.g., the sequences are asynchronous)
    - how error metrics are specified
        * example-level statistics (e.g. classification error)
        * dataset-level statistics (e.g. ROC curve, mean and standard error of the error)
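
As a sketch of the first two conventions (the field names are as proposed above;
everything else is hypothetical), one example of a semi-supervised task might look like::

    example = {
        'input': [0.2, 0.7, 0.1],  # observed features
        'target': None,            # unobserved label in the semi-supervised case
    }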


Model Selection & Hyper-Parameter Optimization
----------------------------------------------

Driving a distributed computing job for a long time to optimize
hyper-parameters using one or more clusters is the goal here.
Although there might be some library-type code to write here, I think of this
more as an application template. The user would use python code to describe
the experiment to run and the hyper-parameter space to search. Then this
application driver would take control of scheduling jobs and running them on
various computers... I'm imagining a potentially ugly brute of a hack that's
not necessarily something we will want to expose at a low level for reuse.

Yoshua: We want both the library-defined driver that takes instructions about how to generate
new hyper-parameter combinations (e.g. implicitly providing a prior distribution from which
to sample them), and examples showing how to use it in typical cases.
Note that sometimes we just want to find the best configuration of hyper-parameters,
but sometimes we want to do a more subtle analysis; often we want a combination of both.
In this respect it could be useful for the user to distinguish hyper-parameters over
which scientific questions are asked (e.g. depth of an architecture) from
hyper-parameters that we would like to marginalize/maximize over (e.g. learning rate).
This can influence both the sampling of configurations (we want to make sure that all
combinations of question-driving hyper-parameters are covered) and the analysis
of results (we may want to estimate ANOVAs, averages, or quantiles over
the non-question-driving hyper-parameters).
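
A minimal sketch of that distinction (all names hypothetical): enumerate every
combination of the question-driving hyper-parameters, and draw the nuisance ones
from a prior::

    import itertools
    import random

    question_hparams = {'depth': [1, 2, 3]}                  # cover exhaustively
    nuisance_priors = {'learning_rate':
                       lambda: 10 ** random.uniform(-4, -1)} # marginalize over

    def sample_configs(n_per_question=5):
        for combo in itertools.product(*question_hparams.values()):
            fixed = dict(zip(question_hparams.keys(), combo))
            for _ in range(n_per_question):
                drawn = dict((name, draw()) for name, draw in nuisance_priors.items())
                drawn.update(fixed)
                yield drawn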

Python scripts for common ML algorithms
---------------------------------------

The script aspect of this feature request makes me think that what would be
good here is more tutorial-type scripts. The existing tutorials could also
potentially be rewritten to use some of the pylearn.nnet expressions. More
tutorials / demos would be great.

Yoshua: agreed that we could write them as tutorials, but note how their
spirit would differ from the current deep learning tutorials: we would
not mind using library code as much as possible, instead of flattening
everything out in the interest of pedagogical simplicity. These
tutorials should be meant to illustrate not the algorithms but *how to take
advantage of the library*. They could also be used as *BLACK BOX* implementations
by people who don't want to dig lower and just want to run experiments.

Functional Specifications
=========================

TODO:
Put these into different text files so that this one does not become a monster.
For each thing with a functional spec (e.g. datasets library, optimization library), make a
separate file.



pylearn.formulas
----------------

A directory with functions for building layers, calculating classification
errors, cross-entropies with various distributions, free energies, etc. This
module would consist mostly of global functions, Theano Ops, and Theano
optimizations.

Yoshua: I would break it down into module files, e.g.:

pylearn.formulas.costs: generic / common cost functions, e.g. various cross-entropies, squared error,
    absolute error, various sparsity penalties (L1, Student)

pylearn.formulas.linear: formulas for linear classifiers, linear regression, factor analysis, PCA

pylearn.formulas.nnet: formulas for building layers of various kinds, various activation functions,
    layers which could be plugged with various costs & penalties, and stacked

pylearn.formulas.ae: formulas for auto-encoders and denoising auto-encoder variants

pylearn.formulas.noise: formulas for corruption processes

pylearn.formulas.rbm: energies, free energies, conditional distributions, Gibbs sampling

pylearn.formulas.trees: formulas for decision trees

pylearn.formulas.boosting: formulas for boosting variants

etc.

Fred: It seems that the DeepANN git repository by Xavier G. already has part of this
implemented as functions.
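
To make the intent concrete, here is a sketch of what entries in
``pylearn.formulas.costs`` might look like (the function names are hypothetical)::

    import theano.tensor as T

    def squared_error(output, target):
        """Mean squared error over a minibatch."""
        return T.mean((output - target) ** 2)

    def l1_penalty(params):
        """L1 sparsity penalty summed over a list of parameter tensors."""
        return sum(T.sum(T.abs_(p)) for p in params)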

Indexing Convention
~~~~~~~~~~~~~~~~~~~

Something to decide on: Fortran-style or C-style indexing. Although we have
often used C-style indexing in the past (for efficiency in C!), this is no
longer an issue with numpy, because the physical layout is independent of the
indexing order. The fact remains that Fortran-style indexing follows linear
algebra conventions, while C-style indexing does not. If a global function
includes a lot of math derivations, it would be *really* nice if the code used
the same convention for the orientation of matrices, and endlessly annoying to
have to be transposing everything all the time.
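
A small numpy demonstration of the layout/indexing decoupling::

    import numpy as np

    a = np.array([[1., 2.], [3., 4.]], order='C')  # row-major storage
    b = np.asfortranarray(a)                       # column-major storage
    assert (a == b).all()                          # same indexing semantics
    assert a.flags['C_CONTIGUOUS'] and b.flags['F_CONTIGUOUS']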