annotate doc/v2_planning/requirements.txt @ 1221:699ed5f5f188

answer to James comment
author Razvan Pascanu <r.pascanu@gmail.com>
date Wed, 22 Sep 2010 14:02:37 -0400
parents 5525cf3faaa2
children 31b72defb680
rev   line source
1192
ab80ba052d32 refactored the index page of the v2_planning stuff.
Frederic Bastien <nouiz@nouiz.org>
parents: 1187
diff changeset
1 .. _requirements:
2
1093
a65598681620 v2planning - initial commit of use_cases, requirements
James Bergstra <bergstrj@iro.umontreal.ca>
parents:
diff changeset
3 ============
4 Requirements
5 ============
6
7
8 Application Requirements
9 ========================
10
11 Terminology and Abbreviations:
12 ------------------------------
13
14 MLA - machine learning algorithm
15
16 learning problem - a machine learning application typically characterized by a
17 dataset (possibly dataset folds), one or more functions to be learned from the
18 data, and one or more metrics to evaluate those functions. Learning problems
19 are the benchmarks for empirical model comparison.
20
21 n. of - number of
22
23 SGD - stochastic gradient descent
24
25 Users:
26 ------
27
28 - New master's and PhD students in the lab should be able to quickly move into
29 'production' mode without having to reinvent the wheel.
30
31 - Students in the two ML classes should be able to play with the library to explore new
32 ML variants. This means some APIs (e.g. Experiment level) must be really well
33 documented and conceptually simple.
34
35 - Researchers outside the lab (who might study and experiment with our
36 algorithms)
37
38 - Partners outside the lab (e.g. Bell, Ubisoft) with closed-source commercial
39 projects.
40
41 Uses:
42 -----
43
44 R1. reproduce previous work (our own and others')
45
46 R2. explore MLA variants by swapping components (e.g. optimization algo, dataset,
1096
2bbc294fa5ac requirements: Added a use case
Olivier Delalleau <delallea@iro>
parents: 1093
diff changeset
47 hyper-parameters)
1093
48
49 R3. analyze experimental results (e.g. plotting training curves, finding best
50 models, marginalizing across hyper-parameter choices)
51
52 R4. disseminate (or serve as platform for disseminating) our own published algorithms
53
54 R5. provide implementations of common MLA components (e.g. classifiers, datasets,
55 optimization algorithms, meta-learning algorithms)
56
57 R6. drive large-scale parallelizable computations (e.g. grid search, bagging,
58 random search)
59
60 R7. provide implementations of standard pre-processing algorithms (e.g. PCA,
61 stemming, Mel-scale spectrograms, GIST features, etc.)
62
1096
63 R8. provide high performance suitable for large-scale experiments
1093
64
65 R9. be able to use the most efficient algorithms in special-case combinations of
66 learning algorithm components (e.g. when there is a fast k-fold validation
67 algorithm for a particular model family, the library should not require users
68 to rewrite their standard k-fold validation script to use it)
69
70 R10. support experiments on a variety of datasets (e.g. movies, images, text,
1096
71 sound, reinforcement learning?)
1093
72
73 R11. support efficient computations on datasets larger than RAM and GPU memory
74
75 R12. support infinite datasets (i.e. generated on the fly)
76
1098
4eda3f52ebef v2planning - revs to requirements, added architecture
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1096
diff changeset
77 R13. apply trained models "in production".
78 - e.g. say you try many combinations of preprocessing, models and associated
79 hyper-parameters, and want to easily be able to recover the full "processing
80 pipeline" that performs best, and use it on real/test data later.
1093
81
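To make R12 concrete, here is a minimal Python sketch of a dataset generated on the fly and consumed in minibatches (as SGD would). It is an illustration only, not a proposed API: the names ``infinite_dataset`` and ``minibatches`` and the linear target function are assumptions made for this example.

```python
import itertools
import random


def infinite_dataset(seed=0):
    """Yield (x, y) examples forever; nothing is stored in RAM or on disk."""
    rng = random.Random(seed)
    while True:
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0  # hypothetical target function


def minibatches(examples, batch_size):
    """Group a (possibly endless) stream of examples into lists of batch_size."""
    it = iter(examples)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return  # only reached for finite streams
        yield batch


# The consumer, not the dataset, decides how many minibatches to draw.
first_batches = list(itertools.islice(minibatches(infinite_dataset(), 4), 3))
```

Because the same ``minibatches`` helper also works on a finite iterable, a training loop written against it would not need to know whether the dataset is finite.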
1121
1f5465622394 requirements: Added comment about potentially conflicting requirements
Olivier Delalleau <delallea@iro>
parents: 1098
diff changeset
82 OD comments: Note that R9 and R13 may conflict with each other. Some
83 optimizations performed by R9 may modify the input "symbolic graph" in such a
84 way that extracting the required components for "production purpose" (R13)
85 could be made more difficult (or even impossible). Imagine for instance that
86 the graph is modified to take advantage of the fact that k-fold validation can
87 be performed efficiently internally by some specific algorithm. Then it may
88 not be obvious anymore how to remove the k-fold split in the saved model you
89 want to use in production.
90
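One possible reading of R9 is sketched below in Python: the user always calls the same k-fold routine, and that routine uses a model-specific shortcut when one exists. This is a sketch under stated assumptions, not the planned design. The ``fast_kfold`` method name, the ``kfold_scores`` function, and both toy model classes are hypothetical. Note that this dispatch happens at call time and does not rewrite a symbolic graph, so it does not by itself address the R9/R13 conflict described above.

```python
def kfold_scores(model_factory, data, k):
    """Return one validation score per fold.

    If the model family offers a specialised routine (hypothetical
    ``fast_kfold`` method), use it; otherwise run the generic loop.
    """
    fast = getattr(model_factory(), "fast_kfold", None)
    if callable(fast):
        return fast(data, k)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = model_factory()
        model.fit(train)
        scores.append(model.score(held_out))
    return scores


class MeanPredictor:
    """Toy model: predicts the training mean, scored by mean squared error."""

    def fit(self, data):
        self.mean = sum(data) / len(data)

    def score(self, data):
        return sum((x - self.mean) ** 2 for x in data) / len(data)


class FastMeanPredictor(MeanPredictor):
    """Same model, but derives each fold's training mean from one global sum."""

    def fast_kfold(self, data, k):
        total, n = sum(data), len(data)
        scores = []
        for i in range(k):
            held_out = data[i::k]
            mean = (total - sum(held_out)) / (n - len(held_out))
            scores.append(sum((x - mean) ** 2 for x in held_out) / len(held_out))
        return scores
```

The user's script calls ``kfold_scores`` in both cases, so switching to the faster model family requires no change to the validation code.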
1187
7d34edde029d added serializability requiremnt
James Bergstra <bergstrj@iro.umontreal.ca>
parents: 1121
diff changeset
91
92 Requirements for component architecture
93 =======================================
94
95
96 R14. Serializability of experiments. (essentially in pursuit of R6)
97
98 Jobs that are running a learning algorithm with our components (datasets,
99 models, algorithms) must be able to serialize the experiment's state to a string
100 (typically written to disk) and be able to restart it from such a string. There
101 must be a mechanism to tell a job to serialize the experiment as soon as
102 possible, and a latency of up to 10 seconds should be acceptable. It must also
103 be possible to deserialize the experiment for introspection (inspect the state
104 of individual components), not just for continuing the experiment. The
105 experiment can assume that resources on disk that were present when the
106 experiment started will be present when the experiment resumes. The experiment
107 cannot assume that resources written by the experiment will still be there (e.g.
108 in /tmp or cwd). Implementations should make an effort to make the serialized
109 representation compact whenever data can be recomputed or reloaded from disk
110 at deserialization time.
111
112 This requirement is aimed at enabling process migration and job control as well
113 as post-hoc analysis of experiment results.
114
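The serialize/resume/introspect cycle that R14 describes could look roughly like the following Python sketch. It is an illustration under assumptions, not a proposed interface: the ``Experiment`` class, its ``dumps``/``loads`` methods, and the ``should_checkpoint`` callback are hypothetical, and JSON stands in for whatever string format is eventually chosen. The checkpoint request is polled between steps, so the latency is bounded by the duration of one step.

```python
import json


class Experiment:
    """Toy experiment whose entire state fits in a small dict."""

    def __init__(self, n_steps, step=0, total=0.0):
        self.n_steps = n_steps
        self.step = step
        self.total = total

    def dumps(self):
        """Serialize the experiment state to a string."""
        return json.dumps(
            {"n_steps": self.n_steps, "step": self.step, "total": self.total}
        )

    @classmethod
    def loads(cls, text):
        """Rebuild an experiment from a string produced by dumps()."""
        return cls(**json.loads(text))

    def run(self, should_checkpoint=lambda: False):
        """Run to completion, or return the serialized state when asked to stop."""
        while self.step < self.n_steps:
            if should_checkpoint():  # polled once per step
                return self.dumps()
            self.total += self.step  # stand-in for one unit of real work
            self.step += 1
        return None
```

A job controller would write the returned string to disk, and later either call ``Experiment.loads(text).run()`` to continue, or simply parse the string to inspect the state without resuming.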
1205
5525cf3faaa2 requirements: Question about the serialization requirement
Olivier Delalleau <delallea@iro>
parents: 1192
diff changeset
115 OD asks: When you say "The experiment cannot assume that resources written by
116 the experiment will still be there", do you mean we should be able to recover
117 the exact same output after interrupting an experiment, wiping its expdir, and
118 restarting it? This would mean that any output saved on disk by the experiment
119 also has to be serialized within the experiment, which may lead to very big
120 serialization files (and possibly memory issues?)
121 A less constraining interpretation of your statement (which I like better) is
122 that we allow "previous" output to be lost: we only ask that the experiment
123 should be able to produce the "new" outputs after a wipe+restart.