annotate doc/v2_planning.txt @ 952:5f80351bc762
Moving sgd to a new 'gd' pylearn module, where it should be joined by TONGA
and Hessian-Free.
author | James Bergstra <bergstrj@iro.umontreal.ca> |
---|---|
date | Thu, 19 Aug 2010 11:53:19 -0400 |
parents | cafa16bfc7df |
children | 7c4504a4ce1a |
rev | line source |
---|---|
941 | 1 |
2 Motivation | |
3 ========== | |
4 | |
5 Yoshua: | |
6 ------- | |
7 | |
8 We are missing a *Theano Machine Learning library*. | |
9 | |
10 The deep learning tutorials do a good job but they lack the following features, which I would like to see in an ML library: | |
11 | |
12 - a well-organized collection of Theano symbolic expressions (formulas) for handling most of | |
13 what is needed either in implementing existing well-known ML and deep learning algorithms or | |
14 for creating new variants (without having to start from scratch each time), that is the | |
15 mathematical core, | |
16 | |
17 - a well-organized collection of python modules to help with the following: | |
18 - several data-access models that wrap around learning algorithms for interfacing with various types of data (static vectors, images, sound, video, generic time-series, etc.) | |
19 - generic utility code for optimization | |
20 - stochastic gradient descent variants | |
21 - early stopping variants | |
22 - interfacing to generic 2nd order optimization methods | |
23 - 2nd order methods tailored to work on minibatches | |
24 - optimizers for sparse coefficients / parameters | |
25 - generic code for model selection and hyper-parameter optimization (including the use and coordination of multiple jobs running on different machines, e.g. using jobman) | |
26 - generic code for performance estimation and experimental statistics | |
27 - visualization tools (using existing python libraries) and examples for all of the above | |
28 - learning algorithm conventions and meta-learning algorithms (bagging, boosting, mixtures of experts, etc.) which use them | |
29 | |
30 [Note that many of us already use some instance of all the above, but each one tends to reinvent the wheel and newbies don't benefit from a knowledge base.] | |
31 | |
32 - a well-documented set of python scripts using the above library to show how to run the most | |
33 common ML algorithms (possibly with examples showing how to run multiple experiments with | |
34 many different models and collect statistical comparative results). This is particularly | |
35 important for pure users to adopt Theano in their ML application work. | |
36 | |
37 Ideally, there would be one person in charge of this project, making sure a coherent and | |
38 easy-to-read design is developed, along with many helping hands (to implement the various | |
39 helper modules, formulae, and learning algorithms). | |
40 | |
41 | |
42 James: | |
43 ------- | |
44 | |
45 I am interested in the design and implementation of the "well-organized collection of Theano | |
46 symbolic expressions..." | |
47 | |
48 I would like to explore algorithms for hyper-parameter optimization, following up on some | |
49 "high-throughput" work. I'm most interested in the "generic code for model selection and | |
50 hyper-parameter optimization..." and "generic code for performance estimation...". | |
51 | |
52 I have some experience with the data-access requirements, and some lessons I'd like to share | |
53 on that, but no time to work on that aspect of things. | |
54 | |
55 I will continue to contribute to the "well-documented set of python scripts using the above to | |
56 showcase common ML algorithms...". I have an Olshausen&Field-style sparse coding script that | |
57 could be polished up. I am also implementing the mcRBM and I'll be able to add that when it's | |
58 done. | |
59 | |
60 | |
61 | |
62 Suggestions for how to tackle various desiderata | |
63 ================================================ | |
64 | |
65 | |
945
cafa16bfc7df
additions to v2_planning
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
941
diff
changeset
|
66 Theano Symbolic Expressions for ML |
67 ---------------------------------- |
68 |
69 We could make this a submodule of pylearn: ``pylearn.nnet``. |
70 |
71 There are a number of ideas floating around for how to handle classes / |
72 modules (LeDeepNet, pylearn.shared.layers, pynnet) so let's implement as much |
73 math as possible in global functions with no classes. There are no models in |
74 the wish list that require more than a few vectors and matrices to parametrize. |
75 Global functions are more reusable than classes. |
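
A minimal sketch of the "global functions, no classes" idea follows. Plain Python lists stand in for Theano symbolic variables so that the example is self-contained; the function names (``affine``, ``sigmoid``, ``sigmoid_layer``) are illustrative assumptions, not an agreed pylearn.nnet API.

```python
import math


def affine(x, W, b):
    """Return x W + b for a single example x (a list of floats).

    W is a list of rows with shape (len(x), n_out); b has length n_out.
    """
    n_out = len(b)
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(n_out)]


def sigmoid(v):
    """Element-wise logistic sigmoid."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def sigmoid_layer(x, W, b):
    """A 'layer' is just a composition of global functions."""
    return sigmoid(affine(x, W, b))


# The model is fully described by one matrix and one vector.
W = [[0.0, 1.0], [0.0, -1.0]]
b = [0.0, 0.0]
hidden = sigmoid_layer([2.0, 2.0], W, b)
```

Because the parameters are passed in explicitly, the same functions can be reused by any class or module design that is eventually chosen.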
76 |
77 |
78 Data access |
79 ----------- |
80 |
81 A general interface to datasets from the perspective of an experiment driver |
82 (e.g. kfold) is to see them as a function that maps index (typically integer) |
83 to example (whose type and nature depend on the dataset; it could for |
84 instance be an (image, label) pair). This interface permits iterating over |
85 the dataset, shuffling the dataset, and splitting it into folds. For |
86 efficiency, it is nice if the dataset interface supports looking up several |
87 index values at once, because looking up many examples at once can sometimes |
88 be faster than looking each one up in turn. |
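
The "dataset as a function from index to example" view could be sketched as below. ``ListDataset`` and ``kfold_indices`` are hypothetical names chosen for illustration, and the in-memory list stands in for whatever storage a real dataset would use.

```python
import random


class ListDataset(object):
    """Hypothetical dataset: maps an integer index to an example."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __len__(self):
        return len(self._examples)

    def __getitem__(self, index):
        # A list of indices returns several examples in one call, which a
        # disk- or database-backed dataset could serve more efficiently.
        if isinstance(index, (list, tuple)):
            return [self._examples[i] for i in index]
        return self._examples[index]


def kfold_indices(n_examples, n_folds, seed=0):
    """Shuffle the indices and yield (train, test) index lists per fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test = indices[k::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


data = ListDataset([("img%d" % i, i % 2) for i in range(6)])
folds = list(kfold_indices(len(data), n_folds=3))
train_idx, test_idx = folds[0]
test_examples = data[test_idx]  # batched lookup of one fold
```

The experiment driver only ever touches ``len`` and indexing, so iteration, shuffling, and fold splitting all reduce to manipulating lists of indices.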
89 |
90 Some datasets may not support random access (e.g. a random number stream) and |
91 that's fine if an exception is raised. The user will see a NotImplementedError |
92 or similar, and try something else. |
93 |
94 |
95 A more intuitive interface for many datasets (or subsets) is to load them as |
96 matrices or lists of examples. This format is more convenient to work with at |
97 an ipython shell, for example. It is not good to provide only the "dataset |
98 as a function" view of a dataset. Even if a dataset is very large, it is nice |
99 to have a standard way to get some representative examples in a convenient |
100 structure, to be able to play with them in ipython. |
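
One possible shape for such a "give me a few examples" helper is sketched here. The name ``preview`` and the choice of evenly spaced indices as "representative" are assumptions made for illustration only.

```python
def preview(dataset, n=5):
    """Return up to n evenly spaced examples as a plain list.

    Works with anything that supports len() and integer indexing.
    """
    size = len(dataset)
    n = min(n, size)
    if n == 0:
        return []
    step = size // n
    return [dataset[i * step] for i in range(n)]


examples = preview(list(range(100)), n=5)
```

A plain list like this is easy to inspect at an interactive prompt even when the full dataset would not fit in memory.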
101 |
102 |
103 Another thing to consider related to datasets is that there are a number of |
104 other efforts to have standard ML datasets, and we should be aware of them, |
105 and compatible with them when it's easy: |
106 - mldata.org (they have a file format, not sure how many use it) |
107 - weka (ARFF file format) |
108 - scikits.learn |
109 - hdf5 / pytables |
110 |
111 |
112 pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem |
113 folder that is assumed to have a standard form across different installations. |
114 That's where the data files are. The correct format of this folder is currently |
115 defined implicitly by the contents of /data/lisa/data at DIRO, but it would be |
116 better to document in pylearn what the contents of this folder should be as |
117 much as possible. It should be possible to rebuild this tree from information |
118 found in pylearn. |
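
A sketch of how the DATA_ROOT lookup and a documented folder layout might fit together is given below. The helper names and the ``EXPECTED_FILES`` manifest are hypothetical, and its entries are placeholders rather than the real layout of any installation.

```python
import os


def data_root(environ=None):
    """Return the folder named by the DATA_ROOT environment variable."""
    environ = os.environ if environ is None else environ
    try:
        return environ["DATA_ROOT"]
    except KeyError:
        raise RuntimeError("set DATA_ROOT to the folder holding the data files")


# Hypothetical manifest: relative paths that a given dataset expects to find.
EXPECTED_FILES = {
    "example_dataset": ["example_dataset/train.npy",
                        "example_dataset/test.npy"],
}


def missing_files(name, root):
    """List the expected files of dataset `name` that are absent under root."""
    return [rel for rel in EXPECTED_FILES[name]
            if not os.path.exists(os.path.join(root, rel))]
```

Keeping such a manifest inside pylearn would make the expected tree explicit, so a new installation could check itself instead of relying on one reference copy.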
119 |
120 |
121 |
122 Model Selection & Hyper-Parameter Optimization |
123 ---------------------------------------------- |
124 |
125 Driving a distributed computing job for a long time to optimize |
126 hyper-parameters using one or more clusters is the goal here. |
127 Although there might be some library-type code to write here, I think of this |
128 more as an application template. The user would use python code to describe |
129 the experiment to run and the hyper-parameter space to search. Then this |
130 application-driver would take control of scheduling jobs and running them on |
131 various computers... I'm imagining a potentially ugly brute of a hack that's |
132 not necessarily something we will want to expose at a low-level for reuse. |
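
To make the "application template" idea concrete, here is a toy sketch in which the user describes the search space and the experiment in Python and a driver runs the jobs. Random search is swapped in as the simplest possible strategy, jobs run one after another in the local process, and every name (``sample_point``, ``random_search``, ``toy_experiment``) is a hypothetical stand-in. Distributing jobs over clusters is deliberately left out.

```python
import random


def sample_point(space, rng):
    """Draw one configuration; `space` maps a name to a sampling function."""
    return dict((name, draw(rng)) for name, draw in sorted(space.items()))


def random_search(experiment, space, n_jobs, seed=0):
    """Run n_jobs configurations and return (best_score, best_config).

    Jobs run sequentially here; a real driver would hand each
    configuration to a scheduler and collect results asynchronously.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_jobs):
        config = sample_point(space, rng)
        results.append((experiment(**config), config))
    return min(results, key=lambda r: r[0])


# Hypothetical search space and a toy experiment standing in for training.
space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-4, -1),
    "n_hidden": lambda rng: rng.choice([50, 100, 200]),
}


def toy_experiment(learning_rate, n_hidden):
    # Pretend validation error, smallest at lr=0.01 and n_hidden=100.
    return abs(learning_rate - 0.01) + abs(n_hidden - 100) / 1000.0


best_score, best_config = random_search(toy_experiment, space, n_jobs=20)
```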
133 |
134 |
135 Python scripts for common ML algorithms |
136 --------------------------------------- |
137 |
138 The script aspect of this feature request makes me think that what would be |
139 good here is more tutorial-type scripts. And the existing tutorials could |
140 potentially be rewritten to use some of the pylearn.nnet expressions. More |
141 tutorials / demos would be great. |
142 |
941 | 143 |
144 Functional Specifications | |
145 ========================= | |
146 | |
147 TODO: |
941 | 148 Put these into different text files so that this one does not become a monster. |
149 For each thing with a functional spec (e.g. datasets library, optimization library) make a | |
150 separate file. | |
151 | |
152 |
153 |
154 pylearn.nnet |
155 ------------ |
156 |
157 Submodule with functions for building layers, calculating classification |
158 errors, cross-entropies with various distributions, free energies. This |
159 module would include for the most part global functions, Theano Ops and Theano |
160 optimizations. |
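
Two of the functions this submodule might hold are sketched below as plain Python on lists, to show the intended granularity. In the real module they would presumably be Theano expressions; the names and signatures here are assumptions.

```python
import math


def classification_error(predicted, targets):
    """Fraction of examples whose predicted label differs from the target."""
    wrong = sum(1 for p, t in zip(predicted, targets) if p != t)
    return wrong / float(len(targets))


def binary_cross_entropy(probabilities, targets):
    """Mean cross-entropy between Bernoulli outputs and 0/1 targets."""
    total = 0.0
    for p, t in zip(probabilities, targets):
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


err = classification_error([0, 1, 1, 0], [0, 1, 0, 0])
nll = binary_cross_entropy([0.5, 0.5], [1, 0])
```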
161 |
162 Indexing Convention |
163 ~~~~~~~~~~~~~~~~~~~ |
164 |
165 Something to decide on - Fortran-style or C-style indexing. Although we have |
166 often used C-style indexing in the past (for efficiency in C!) this is no |
167 longer an issue with numpy because the physical layout is independent of the |
168 indexing order. The fact remains that Fortran-style indexing follows linear |
169 algebra conventions, while C-style indexing does not. If a global function |
170 includes a lot of math derivations, it would be *really* nice if the code used |
171 the same convention for the orientation of matrices, and endlessly annoying to |
172 have to be always transposing everything. |
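
Reading the concern as one about matrix orientation (an interpretation, not something the text spells out), the small sketch below contrasts the two conventions. It uses pure Python lists of rows rather than numpy, and the helper names are illustrative.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


# Linear-algebra convention: examples are columns, h = W x.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # shape (n_out=2, n_in=3)
X_cols = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]          # shape (n_in=3, n_examples=2)
H_cols = matmul(W, X_cols)     # shape (n_out, n_examples)

# Examples-as-rows convention: the same computation needs transposes.
X_rows = transpose(X_cols)     # shape (n_examples, n_in)
H_rows = matmul(X_rows, transpose(W))  # shape (n_examples, n_out)
```

Both results hold the same numbers, but code written in the second convention no longer matches a derivation written as ``h = W x`` term for term, which is the annoyance described above.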
173 |