comparison doc/v2_planning.txt @ 945:cafa16bfc7df

additions to v2_planning
author James Bergstra <bergstrj@iro.umontreal.ca>
date Wed, 11 Aug 2010 14:35:57 -0400
parents 939806d33183
children 7c4504a4ce1a

Suggestions for how to tackle various desiderata
================================================


Theano Symbolic Expressions for ML
----------------------------------

We could make this a submodule of pylearn: ``pylearn.nnet``.

There are a number of ideas floating around for how to handle classes /
modules (LeDeepNet, pylearn.shared.layers, pynnet), so let's implement as
much math as possible in global functions with no classes. No model on the
wish list requires more than a few vectors and matrices to parametrize.
Global functions are more reusable than classes.


Data access
-----------

A general interface to datasets, from the perspective of an experiment
driver (e.g. kfold), is to see them as a function that maps an index
(typically an integer) to an example (whose type and nature depend on the
dataset; it could for instance be an (image, label) pair). This interface
permits iterating over the dataset, shuffling it, and splitting it into
folds. For efficiency, it is nice if the dataset interface supports looking
up several index values at once, because looking up many examples at once
can sometimes be faster than looking each one up in turn.

Some datasets may not support random access (e.g. a random number stream),
and that's fine as long as an exception is raised. The user will see a
NotImplementedError or similar, and try something else.
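
As a sketch, the interface above might look like the following. The class
and method names here are illustrative assumptions, not existing pylearn
API:

```python
# Hypothetical sketch of the "dataset as a function" interface described
# above.  An in-memory dataset supports single and batched index lookup;
# a stream-like dataset raises NotImplementedError on random access.

class ArrayDataset:
    """Wraps an in-memory sequence of examples; supports random access."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __len__(self):
        return len(self._examples)

    def __getitem__(self, idx):
        # Accept a single index or a sequence of indices, so that a
        # driver (e.g. kfold) can look up many examples at once.
        if isinstance(idx, (list, tuple)):
            return [self._examples[i] for i in idx]
        return self._examples[idx]


class StreamDataset:
    """A dataset with no random access (e.g. a random number stream)."""

    def __init__(self, make_iterator):
        self._it = make_iterator()

    def __iter__(self):
        return self._it

    def __getitem__(self, idx):
        raise NotImplementedError("this dataset does not support random access")
```

With this shape, shuffling and fold-splitting reduce to permuting and
partitioning index lists, then calling ``dataset[indices]``.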


A more intuitive interface for many datasets (or subsets) is to load them
as matrices or lists of examples. This format is more convenient to work
with at an ipython shell, for example. It is not good to provide only the
"dataset as a function" view of a dataset. Even if a dataset is very large,
it is nice to have a standard way to get some representative examples in a
convenient structure, to be able to play with them in ipython.
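
A minimal sketch of such a convenience view, assuming the index-based
dataset interface discussed above (the helper names are hypothetical, not
pylearn API):

```python
# Hypothetical helpers for the "dataset as matrices/lists" view: pull a
# few representative examples into plain lists that are easy to inspect
# at an ipython shell, however large the underlying dataset is.

def head(dataset, n=10):
    """Return up to n examples as a list, using index-based access."""
    return [dataset[i] for i in range(min(n, len(dataset)))]

def as_matrices(pairs):
    """Split a list of (input, label) pairs into two parallel lists,
    which could then be passed to numpy.asarray if desired."""
    inputs = [x for x, _ in pairs]
    labels = [y for _, y in pairs]
    return inputs, labels
```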


Another thing to consider related to datasets is that there are a number of
other efforts to have standard ML datasets, and we should be aware of them,
and compatible with them when it's easy:

- mldata.org (they have a file format, not sure how many use it)
- weka (ARFF file format)
- scikits.learn
- hdf5 / pytables

pylearn.datasets uses a DATA_ROOT environment variable to locate a
filesystem folder that is assumed to have a standard form across different
installations; that's where the data files live. The correct format of this
folder is currently defined implicitly by the contents of /data/lisa/data
at DIRO, but it would be better to document in pylearn, as completely as
possible, what the contents of this folder should be. It should be possible
to rebuild this tree from information found in pylearn.


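A minimal sketch of DATA_ROOT-style lookup; the helper names are
assumptions for illustration, not the actual pylearn.datasets code:

```python
# Hypothetical helpers around the DATA_ROOT environment variable
# described above.  Real code would also validate the folder layout.
import os

def data_root():
    """Return the dataset root folder from the DATA_ROOT env variable."""
    root = os.environ.get("DATA_ROOT")
    if root is None:
        raise EnvironmentError("DATA_ROOT is not set; point it at your "
                               "local copy of the standard data folder")
    return root

def dataset_path(*parts):
    """Build a path under DATA_ROOT from path components."""
    return os.path.join(data_root(), *parts)
```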

Model Selection & Hyper-Parameter Optimization
----------------------------------------------

The goal here is to drive a long-running distributed computing job, on one
or more clusters, to optimize hyper-parameters. Although there might be
some library-type code to write here, I think of this more as an
application template. The user would use python code to describe the
experiment to run and the hyper-parameter space to search. Then this
application-driver would take control of scheduling jobs and running them
on various computers... I'm imagining a potentially ugly brute of a hack
that's not necessarily something we will want to expose at a low level for
reuse.

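As a toy sketch of "python code to describe the hyper-parameter space",
here is one way a user might declare a space and a driver might enumerate
grid points to hand out as jobs. Every name is hypothetical, and a real
driver would do much more (scheduling, fault tolerance, result
collection):

```python
# Hypothetical hyper-parameter space description plus grid enumeration.
# A driver application could hand each yielded dict to a cluster job.
import itertools

space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "n_hidden": [100, 500],
}

def grid_points(space):
    """Yield one dict per point in the Cartesian product of the space."""
    names = sorted(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))

jobs = list(grid_points(space))  # 3 * 2 = 6 candidate configurations
```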
Python scripts for common ML algorithms
---------------------------------------

The script aspect of this feature request makes me think that what would be
good here is more tutorial-type scripts. The existing tutorials could
potentially be rewritten to use some of the pylearn.nnet expressions. More
tutorials / demos would be great.


Functional Specifications
=========================

TODO:
Put these into different text files so that this one does not become a
monster. For each thing with a functional spec (e.g. datasets library,
optimization library) make a separate file.


pylearn.nnet
------------

Submodule with functions for building layers, calculating classification
errors, cross-entropies with various distributions, and free energies. This
module would consist mostly of global functions, Theano Ops, and Theano
optimizations.

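To make the "global functions, not classes" idea concrete, here is a
sketch of two such functions, written with plain python/math for
illustration; the real module would build Theano symbolic expressions
instead, but the signatures and the math would look much the same:

```python
# Illustrative global functions of the kind pylearn.nnet could provide.
# Plain-python stand-ins for what would be Theano symbolic expressions.
import math

def sigmoid_layer(x, w, b):
    """Affine transform followed by a logistic sigmoid, for one input
    vector x, a weight matrix w (list of rows), and a bias vector b."""
    out = []
    for row, bias in zip(w, b):
        a = sum(wi * xi for wi, xi in zip(row, x)) + bias
        out.append(1.0 / (1.0 + math.exp(-a)))
    return out

def binary_crossentropy(p, t):
    """Cross-entropy between predicted probabilities p and binary
    targets t, summed over units."""
    return -sum(ti * math.log(pi) + (1 - ti) * math.log(1 - pi)
                for pi, ti in zip(p, t))
```

Because these are free functions, any model (MLP, autoencoder, RBM head)
can reuse them without inheriting from a framework class.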
Indexing Convention
~~~~~~~~~~~~~~~~~~~

Something to decide on: Fortran-style or C-style indexing. Although we have
often used C-style indexing in the past (for efficiency in C!), this is no
longer an issue with numpy, because the physical layout is independent of
the indexing order. The fact remains that Fortran-style indexing follows
linear algebra conventions, while C-style indexing does not. If a global
function includes a lot of math derivations, it would be *really* nice if
the code used the same convention for the orientation of matrices, and
endlessly annoying to have to transpose everything all the time.
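
The layout-independence claim can be checked directly in numpy: the same
array can live in C or Fortran memory order while indexing behaves
identically, so the orientation convention is purely notational:

```python
# numpy separates indexing order from physical memory layout: the same
# values, indexed the same way, can be stored C- or Fortran-contiguously.
import numpy as np

a_c = np.arange(6).reshape(2, 3)   # C-contiguous layout
a_f = np.asfortranarray(a_c)       # same values, Fortran layout

assert a_c.flags['C_CONTIGUOUS']
assert a_f.flags['F_CONTIGUOUS']
assert (a_c == a_f).all()          # indexing is unaffected by layout
assert a_c[1, 2] == a_f[1, 2] == 5
```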