annotate pylearn/datasets/utlc.py @ 1432:8661f8ad407a
Add a cast in the chain of transformation of initial data set
author | Pascal Lamblin <lamblinp@iro.umontreal.ca> |
---|---|
date | Mon, 14 Feb 2011 19:27:37 -0500 |
parents | dce602150b5f |
children | 08beb6f28809 |
rev | line source |
---|---|
1427
a36d3a406c59
fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents:
1426
diff
changeset
|
1 """ |
1402
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
2 Users should use the load_ndarray_dataset or load_sparse_dataset function. |
1404
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
3 |
4 See the file ${PYLEARN_DATA_ROOT}/UTLC/README for details on the datasets. |
5 |
6 See the end of this file for an example. |
1402 | 7 """ |
8 |
9 import cPickle |
10 import gzip |
11 import os |
12 |
1404 | 13 import numpy |
14 import theano |
15 |
1402 | 16 import pylearn.io.filetensor as ft |
17 import config |
18 |
1428
3823dbfff6cf
add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1427
diff
changeset
|
19 def load_ndarray_dataset(name, normalize=True, transfer=False, |
20                          normalize_on_the_fly=False, randomize_valid=False, |
21                          randomize_test=False): |
1408
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
22     """ Load the train,valid,test data for the dataset `name` |
23         and return it in ndarray format. |
1427 | 24 |
1431 | 25         We suppose the data was created with ift6266h11/pretraitement/to_npy.py, which |
26         shuffles the training set. So the training set should already be shuffled. |
27 |
1408 | 28     :param normalize: If True, we normalize the train dataset |
29                       before returning it |
1428 | 30     :param transfer: If True, also return the transfer label (currently only available for ule) |
1426
4988f8ea0836
in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents:
1411
diff
changeset
|
31     :param normalize_on_the_fly: If True, we return a Theano Variable that gives |
32                                  the normalized value as output. If the user only |
33                                  takes a subtensor of that variable, Theano optimizations |
34                                  should ensure that only the normalized subtensor |
35                                  portion is held in memory. We store |
36                                  the original data in shared memory in its original dtype. |
37 |
38                                  Keeping the original data in its original |
39                                  dtype in memory saves memory. This is especially useful |
40                                  for using rita and harry with 1G per job. |
1428 | 41     :param randomize_valid: Do we randomize the order of the valid set? |
42                             We always use the same random order |
43                             If False, return in the same order as downloaded on the web |
44     :param randomize_test: Do we randomize the order of the test set? |
45                            We always use the same random order |
46                            If False, return in the same order as downloaded on the web |
1408 | 47     """ |
1426 | 48     assert not (normalize and normalize_on_the_fly), "Can't normalize in two ways at the same time!" |
1427 | 49 |
1402 | 50     assert name in ['avicenna','harry','rita','sylvester','ule'] |
1426 | 51     common = os.path.join('UTLC','filetensor',name+'_') |
52     trname,vname,tename = [config.get_filepath_in_roots(common+subset+'.ft.gz', |
53                                                         common+subset+'.ft') |
1402 | 54                            for subset in ['train','valid','test']] |
1426 | 55 |
1402 | 56     train = load_filetensor(trname) |
57     valid = load_filetensor(vname) |
58     test = load_filetensor(tename) |
1428 | 59     if randomize_valid: |
60         rng = numpy.random.RandomState([1,2,3,4]) |
61         perm = rng.permutation(valid.shape[0]) |
62         valid = valid[perm] |
63     if randomize_test: |
64         rng = numpy.random.RandomState([1,2,3,4]) |
65         perm = rng.permutation(test.shape[0]) |
66         test = test[perm] |
1426 | 67 |
68     if normalize or normalize_on_the_fly: |
69         if normalize_on_the_fly: |
1432
8661f8ad407a
Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents:
1431
diff
changeset
|
70             # Shared variables of the original type |
1426 | 71             train = theano.shared(train, borrow=True, name=name+"_train") |
72             valid = theano.shared(valid, borrow=True, name=name+"_valid") |
73             test = theano.shared(test, borrow=True, name=name+"_test") |
1432 | 74             # Symbolic variables cast into floatX |
75             train = theano.tensor.cast(train, theano.config.floatX) |
76             valid = theano.tensor.cast(valid, theano.config.floatX) |
77             test = theano.tensor.cast(test, theano.config.floatX) |
1426 | 78         else: |
1404 | 79             train = numpy.asarray(train, theano.config.floatX) |
80             valid = numpy.asarray(valid, theano.config.floatX) |
81             test = numpy.asarray(test, theano.config.floatX) |
1426 | 82 |
83         if name == "ule": |
84             train /= 255 |
85             valid /= 255 |
86             test /= 255 |
87         elif name in ["avicenna", "sylvester"]: |
88             if name == "avicenna": |
89                 train_mean = 514.62154022835455 |
90                 train_std = 6.829096494224145 |
91             else: |
92                 train_mean = 403.81889927027686 |
93                 train_std = 96.43841050784053 |
94             train -= train_mean |
95             valid -= train_mean |
96             test -= train_mean |
97             train /= train_std |
98             valid /= train_std |
99             test /= train_std |
1404 | 100         elif name == "harry": |
101             std = 0.69336046033925791  # train.std() is slow to compute |
1410
e7844692e6e2
normalize the utlc ndarray dataset inplace to use less memory.
Frederic Bastien <nouiz@nouiz.org>
parents:
1408
diff
changeset
|
102             train /= std |
103             valid /= std |
1427 | 104             test /= std |
1404 | 105         elif name == "rita": |
1426 | 106             v = numpy.asarray(230, dtype=theano.config.floatX) |
107             train /= v |
108             valid /= v |
109             test /= v |
1404 | 110         else: |
111             raise Exception("This dataset doesn't have its normalization defined") |
1408 | 112     if transfer: |
113         transfer = load_filetensor(os.path.join(config.data_root(),"UTLC","filetensor",name+"_transfer.ft")) |
114         return train, valid, test, transfer |
115     else: |
116         return train, valid, test |
1402 | 117 |
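The normalization branch of load_ndarray_dataset casts the data to floating point, then applies constants measured on the training set to all three splits in place, so no second copy of the data is allocated. A minimal NumPy-only sketch of that idea follows. The helper name `normalize_inplace` and the toy values are assumptions made for illustration; they are not part of pylearn.

```python
import numpy


def normalize_inplace(train, valid, test, mean, std):
    """Subtract `mean` and divide by `std` in place on every split.

    `mean` and `std` are assumed to come from the training set only,
    mirroring the hard-coded train_mean / train_std constants in the loader.
    """
    for subset in (train, valid, test):
        subset -= mean  # in-place update: the array buffer is reused
        subset /= std
    return train, valid, test


# Toy data standing in for a real UTLC dataset (hypothetical values).
# The arrays are already floating point, as they are after the floatX cast.
train = numpy.asarray([[0.0, 2.0], [4.0, 2.0]], dtype="float32")
valid = numpy.asarray([[2.0, 6.0]], dtype="float32")
test = numpy.asarray([[-2.0, 2.0]], dtype="float32")

normalize_inplace(train, valid, test, mean=2.0, std=2.0)
```

The cast has to come before the in-place operations: current NumPy versions refuse an in-place true division on an integer array, which is presumably why the loader converts to floatX first.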
1430
931a19eeab5a
'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents:
1429
diff
changeset
|
118 def load_sparse_dataset(name, normalize=True, transfer=False, |
119                         randomize_valid=False, |
120                         randomize_test=False): |
1408 | 121     """ Load the train,valid,test data for the dataset `name` |
122         and return it in sparse format. |
1427 | 123 |
1431 | 124         We suppose the data was created with ift6266h11/pretraitement/to_npy.py, which |
125         shuffles the training set. So the training set should already be shuffled. |
126 |
1408 | 127     :param normalize: If True, we normalize the train dataset |
128                       before returning it |
129     :param transfer: If True, also return the transfer label |
1430 | 130     :param randomize_valid: see same option for load_ndarray_dataset |
131     :param randomize_test: see same option for load_ndarray_dataset |
132 |
1408 | 133     """ |
1402 | 134     assert name in ['harry','terry','ule'] |
1429
b0141efbf6a2
fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents:
1428
diff
changeset
|
135     common = os.path.join('UTLC','sparse',name+'_') |
136     trname,vname,tename = [config.get_filepath_in_roots(common+subset+'.npy.gz', |
137                                                         common+subset+'.npy') |
1402 | 138                            for subset in ['train','valid','test']] |
139     train = load_sparse(trname) |
140     valid = load_sparse(vname) |
141     test = load_sparse(tename) |
1430 | 142 |
143     # Data should already be in csr format, which supports |
144     # this type of indexing. |
145     if randomize_valid: |
146         rng = numpy.random.RandomState([1,2,3,4]) |
147         perm = rng.permutation(valid.shape[0]) |
148         valid = valid[perm] |
149     if randomize_test: |
150         rng = numpy.random.RandomState([1,2,3,4]) |
151         perm = rng.permutation(test.shape[0]) |
152         test = test[perm] |
153 |
1404 | 154     if normalize: |
155         if name == "ule": |
156             train = train.astype(theano.config.floatX) / 255 |
157             valid = valid.astype(theano.config.floatX) / 255 |
158             test = test.astype(theano.config.floatX) / 255 |
159         elif name == "harry": |
160             train = train.astype(theano.config.floatX) |
161             valid = valid.astype(theano.config.floatX) |
162             test = test.astype(theano.config.floatX) |
163             std = 0.69336046033925791  # train.std() is slow to compute |
164             train = (train) / std |
165             valid = (valid) / std |
1427 | 166             test = (test) / std |
1406
6003f733a994
added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents:
1404
diff
changeset
|
167         elif name == "terry": |
168             train = train.astype(theano.config.floatX) |
169             valid = valid.astype(theano.config.floatX) |
170             test = test.astype(theano.config.floatX) |
171             train = (train) / 300 |
172             valid = (valid) / 300 |
173             test = (test) / 300 |
1404
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
174 else: |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
175 raise Exception("This dataset doesn't have its normalization defined") |
1408
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
176 if transfer: |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
177 transfer = load_sparse(os.path.join(config.data_root(),"UTLC","sparse",name+"_transfer.npy")) |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
178 return train, valid, test, transfer |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
179 else: |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
180 return train, valid, test |
1427
a36d3a406c59
fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents:
1426
diff
changeset
|
181 |
1411
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
182 def load_ndarray_label(name): |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
183 """ Load the train, valid, test labels for the dataset `name` |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
184 and return them in ndarray format. |
1427
a36d3a406c59
fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents:
1426
diff
changeset
|
185 |
1411
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
186 This is only available for the toy dataset ule. |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
187 """ |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
188 assert name in ['ule'] |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
189 trname,vname,tename = [os.path.join(config.data_root(), |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
190 'UTLC','filetensor', |
1427
a36d3a406c59
fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents:
1426
diff
changeset
|
191 name+'_'+subset+'.ft') |
1411
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
192 for subset in ['trainl','validl','testl']] |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
193 trainl = load_filetensor(trname) |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
194 validl = load_filetensor(vname) |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
195 testl = load_filetensor(tename) |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
196 return trainl, validl, testl |
68fdb895f53f
allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents:
1410
diff
changeset
|
197 |
1402
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
198 def load_filetensor(fname): |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
199 f = None |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
200 try: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
201 if not os.path.exists(fname): |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
202 fname = fname+'.gz' |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
203 assert os.path.exists(fname) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
204 f = gzip.open(fname) |
1426
4988f8ea0836
in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents:
1411
diff
changeset
|
205 elif fname.endswith('.gz'): |
4988f8ea0836
in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents:
1411
diff
changeset
|
206 f = gzip.open(fname) |
1402
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
207 else: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
208 f = open(fname, 'rb') |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
209 d = ft.read(f) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
210 finally: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
211 if f: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
212 f.close() |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
213 |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
214 return d |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
215 |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
216 def load_sparse(fname): |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
217 f = None |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
218 try: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
219 if not os.path.exists(fname): |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
220 fname = fname+'.gz' |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
221 assert os.path.exists(fname) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
222 f = gzip.open(fname) |
1429
b0141efbf6a2
fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents:
1428
diff
changeset
|
223 elif fname.endswith('.gz'): |
b0141efbf6a2
fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents:
1428
diff
changeset
|
224 f = gzip.open(fname) |
1402
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
225 else: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
226 f = open(fname, 'rb') |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
227 d = cPickle.load(f) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
228 finally: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
229 if f: |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
230 f.close() |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
231 return d |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
232 |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
233 if __name__ == '__main__': |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
234 import numpy |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
235 import scipy.sparse |
1408
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
236 |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
237 # Test loading of transfer data |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
238 train, valid, test, transfer = load_ndarray_dataset("ule", normalize=True, transfer=True) |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
239 assert train.shape[0]==transfer.shape[0] |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
240 |
1402
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
241 for name in ['avicenna','harry','rita','sylvester','ule']: |
1404
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
242 train, valid, test = load_ndarray_dataset(name, normalize=True) |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
243 print name,"dtype, max, min, mean, std" |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
244 print train.dtype, train.max(), train.min(), train.mean(), train.std() |
1402
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
245 assert isinstance(train, numpy.ndarray) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
246 assert isinstance(valid, numpy.ndarray) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
247 assert isinstance(test, numpy.ndarray) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
248 assert train.shape[1]==test.shape[1]==valid.shape[1] |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
249 |
1408
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
250 # Test loading of transfer data |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
251 train, valid, test, transfer = load_sparse_dataset("ule", normalize=True, transfer=True) |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
252 assert train.shape[0]==transfer.shape[0] |
2993b2a5c1af
allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents:
1406
diff
changeset
|
253 |
1406
6003f733a994
added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents:
1404
diff
changeset
|
254 for name in ['harry','terry','ule']: |
6003f733a994
added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents:
1404
diff
changeset
|
255 train, valid, test = load_sparse_dataset(name, normalize=True) |
1404
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
256 nb_elem = numpy.prod(train.shape) |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
257 mi = train.data.min() |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
258 ma = train.data.max() |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
259 mi = min(0, mi) |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
260 ma = max(0, ma) |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
261 su = train.data.sum() |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
262 mean = float(su)/nb_elem |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
263 print name,"dtype, max, min, mean, nb non-zero, nb element, %sparse" |
89017617ab36
normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
1402
diff
changeset
|
264 print train.dtype, ma, mi, mean, train.nnz, nb_elem, (nb_elem-float(train.nnz))/nb_elem |
1406
6003f733a994
added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents:
1404
diff
changeset
|
265 print name,"max, min, mean, std (all stats on non-zero element)" |
6003f733a994
added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents:
1404
diff
changeset
|
266 print train.data.max(), train.data.min(), train.data.mean(), train.data.std() |
1402
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
267 assert scipy.sparse.issparse(train) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
268 assert scipy.sparse.issparse(valid) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
269 assert scipy.sparse.issparse(test) |
b14f3d6f5cd4
first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff
changeset
|
270 assert train.shape[1]==test.shape[1]==valid.shape[1] |
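The two loaders in this file, `load_filetensor` and `load_sparse`, repeat the same logic for choosing between a plain file and its `.gz` counterpart. The sketch below shows one way that shared step could be factored out. It is not part of pylearn: the names `open_maybe_gzip` and `load_pickled` are hypothetical, and the standard `pickle` module stands in for `cPickle` so the example also runs on Python 3.

```python
import gzip
import os
import pickle


def open_maybe_gzip(fname):
    """Open `fname` for binary reading, falling back to `fname + '.gz'`.

    Hypothetical helper (an assumption for illustration, not pylearn API).
    """
    if not os.path.exists(fname):
        # Same fallback as the loaders above: try the compressed name.
        fname = fname + '.gz'
        if not os.path.exists(fname):
            # Raise instead of asserting, so the check survives `python -O`.
            raise IOError("no such file: %s" % fname)
    if fname.endswith('.gz'):
        return gzip.open(fname, 'rb')
    return open(fname, 'rb')


def load_pickled(fname):
    """Unpickle one object from a plain or gzip-compressed file."""
    f = open_maybe_gzip(fname)
    try:
        return pickle.load(f)
    finally:
        # Always release the file handle, even if unpickling fails.
        f.close()
```

Under these assumptions, `load_sparse(fname)` would reduce to `load_pickled(fname)`, and `load_filetensor` could call `ft.read(f)` on the handle returned by `open_maybe_gzip`.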