comparison doc/v2_planning.txt @ 945:cafa16bfc7df
additions to v2_planning
author | James Bergstra <bergstrj@iro.umontreal.ca> |
---|---|
date | Wed, 11 Aug 2010 14:35:57 -0400 |
parents | 939806d33183 |
children | 7c4504a4ce1a |
944:1529c84e460f | 945:cafa16bfc7df |
---|---|
61 | 61 |
62 Suggestions for how to tackle various desiderata | 62 Suggestions for how to tackle various desiderata |
63 ================================================ | 63 ================================================ |
64 | 64 |
65 | 65 |
66 Theano Symbolic Expressions for ML | |
67 ---------------------------------- | |
68 | |
69 We could make this a submodule of pylearn: ``pylearn.nnet``. | |
70 | |
71 There are a number of ideas floating around for how to handle classes / | |
72 modules (LeDeepNet, pylearn.shared.layers, pynnet) so let's implement as much | |
73 math as possible in global functions with no classes. There are no models in | |
74 the wish list that require more than a few vectors and matrices to parametrize. | |
75 Global functions are more reusable than classes. | |
76 | |
77 | |
78 Data access | |
79 ----------- | |
80 | |
81 A general interface to datasets from the perspective of an experiment driver | |
82 (e.g. kfold) is to see them as a function that maps index (typically integer) | |
83 to example (whose type and nature depend on the dataset; it could for | |
84 instance be an (image, label) pair). This interface permits iterating over | |
85 the dataset, shuffling the dataset, and splitting it into folds. For | |
86 efficiency, it is nice if the dataset interface supports looking up several | |
87 index values at once, because looking up many examples at once can sometimes | |
88 be faster than looking each one up in turn. | |
89 | |
90 Some datasets may not support random access (e.g. a random number stream) and | |
91 that's fine if an exception is raised. The user will see a NotImplementedError | |
92 or similar, and try something else. | |
93 | |
94 | |
95 A more intuitive interface for many datasets (or subsets) is to load them as | |
96 matrices or lists of examples. This format is more convenient to work with at | |
97 an ipython shell, for example. It is not good to provide only the "dataset | |
98 as a function" view of a dataset. Even if a dataset is very large, it is nice | |
99 to have a standard way to get some representative examples in a convenient | |
100 structure, to be able to play with them in ipython. | |
101 | |
102 | |
103 Another thing to consider related to datasets is that there are a number of | |
104 other efforts to have standard ML datasets, and we should be aware of them, | |
105 and compatible with them when it's easy: | |
106 - mldata.org (they have a file format, not sure how many use it) | |
107 - weka (ARFF file format) | |
108 - scikits.learn | |
109 - hdf5 / pytables | |
110 | |
111 | |
112 pylearn.datasets uses a DATA_ROOT environment variable to locate a filesystem | |
113 folder that is assumed to have a standard form across different installations. | |
114 That's where the data files are. The correct format of this folder is currently | |
115 defined implicitly by the contents of /data/lisa/data at DIRO, but it would be | |
116 better to document in pylearn what the contents of this folder should be as | |
117 much as possible. It should be possible to rebuild this tree from information | |
118 found in pylearn. | |
119 | |
120 | |
121 | |
122 Model Selection & Hyper-Parameter Optimization | |
123 ---------------------------------------------- | |
124 | |
125 Driving a distributed computing job for a long time to optimize | |
126 hyper-parameters using one or more clusters is the goal here. | |
127 Although there might be some library-type code to write here, I think of this | |
128 more as an application template. The user would use python code to describe | |
129 the experiment to run and the hyper-parameter space to search. Then this | |
130 application-driver would take control of scheduling jobs and running them on | |
131 various computers... I'm imagining a potentially ugly brute of a hack that's | |
132 not necessarily something we will want to expose at a low-level for reuse. | |
133 | |
134 | |
135 Python scripts for common ML algorithms | |
136 --------------------------------------- | |
137 | |
138 The script aspect of this feature request makes me think that what would be | |
139 good here is more tutorial-type scripts. And the existing tutorials could | |
140 potentially be rewritten to use some of the pylearn.nnet expressions. More | |
141 tutorials / demos would be great. | |
142 | |
66 | 143 |
67 Functional Specifications | 144 Functional Specifications |
68 ========================= | 145 ========================= |
69 | 146 |
147 TODO: | |
70 Put these into different text files so that this one does not become a monster. | 148 Put these into different text files so that this one does not become a monster. |
71 For each thing with a functional spec (e.g. datasets library, optimization library) make a | 149 For each thing with a functional spec (e.g. datasets library, optimization library) make a |
72 separate file. | 150 separate file. |
73 | 151 |
152 | |
153 | |
154 pylearn.nnet | |
155 ------------ | |
156 | |
157 Submodule with functions for building layers, calculating classification | |
158 errors, cross-entropies with various distributions, free energies. This | |
159 module would include for the most part global functions, Theano Ops and Theano | |
160 optimizations. | |
161 | |
162 Indexing Convention | |
163 ~~~~~~~~~~~~~~~~~~~ | |
164 | |
165 Something to decide on - Fortran-style or C-style indexing. Although we have | |
166 often used C-style indexing in the past (for efficiency in C!) this is no | |
167 longer an issue with numpy because the physical layout is independent of the | |
168 indexing order. The fact remains that Fortran-style indexing follows linear | |
169 algebra conventions, while C-style indexing does not. If a global function | |
170 includes a lot of math derivations, it would be *really* nice if the code used | |
171 the same convention for the orientation of matrices, and endlessly annoying to | |
172 have to be always transposing everything. | |
173 |
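
The Indexing Convention note says that with numpy the physical layout is independent of the indexing order. A minimal sketch of that claim follows. The variable names are hypothetical and nothing here is pylearn API; it uses only standard numpy calls. It stores the same logical matrix in row-major and column-major layouts and shows that indexing and matrix-vector products give the same results, while only the strides differ.

```python
import numpy as np

# One logical 2x3 matrix, stored with two different physical layouts.
a_c = np.arange(6, dtype=np.float64).reshape(2, 3)  # row-major (C) storage
a_f = np.asfortranarray(a_c)                        # column-major (Fortran) storage

# Indexing is the same regardless of storage order: a[i, j] is row i, column j.
same_element = bool(a_c[1, 2] == a_f[1, 2])
same_values = bool((a_c == a_f).all())

# Only the memory strides (in bytes) reveal the physical layout.
strides_c = a_c.strides
strides_f = a_f.strides

# Linear algebra written against the logical shape gives the same result.
x = np.ones(3)
y_c = np.dot(a_c, x)
y_f = np.dot(a_f, x)
```

Since the layout does not leak into indexing, the choice of convention can be made on readability grounds (matching the orientation used in the math derivations) rather than on memory-access grounds.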