annotate pylearn/datasets/utlc.py @ 1428:3823dbfff6cf

add parameter to randomize the valid and test data.
author Frederic Bastien <nouiz@nouiz.org>
date Tue, 08 Feb 2011 12:57:15 -0500
parents a36d3a406c59
children b0141efbf6a2
rev   line source
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
1 """
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
2 Users should use the load_ndarray_dataset or load_sparse_dataset function.
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
3
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
4 See the file ${PYLEARN_DATA_ROOT}/UTLC/README for details on the datasets.
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
5
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
6 See the end of this file for an example.
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
7 """
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
8
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
9 import cPickle
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
10 import gzip
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
11 import os
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
12
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
13 import numpy
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
14 import theano
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
15
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
16 import pylearn.io.filetensor as ft
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
17 import config
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
18
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
19 def load_ndarray_dataset(name, normalize=True, transfer=False,
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
20 normalize_on_the_fly=False, randomize_valid=False,
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
21 randomize_test=False):
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
22 """ Load the train,valid,test data for the dataset `name`
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
23 and return it in ndarray format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
24
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
25 :param normalize: If True, we normalize the train dataset
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
26 before returning it
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
27 :param transfer: If True, also return the transfer labels (currently only available for ule)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
28 :param normalize_on_the_fly: If True, we return a Theano Variable that will give
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
29 as output the normalized value. If the user only
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
30 takes a subtensor of that variable, Theano optimization
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
31 should ensure that we keep in memory only the subtensor
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
32 portion that is computed in normalized form. We store
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
33 the original data in shared memory in its original dtype.
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
34
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
35 It is useful to keep the original data in its original
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
36 dtype in memory to save memory. Especially useful to
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
37 be able to use rita and harry with 1G per job.
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
38 :param randomize_valid: Do we randomize the order of the valid set?
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
39 We always use the same random order
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
40 If False, return in the same order as downloaded on the web
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
41 :param randomize_test: Do we randomize the order of the test set?
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
42 We always use the same random order
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
43 If False, return in the same order as downloaded on the web
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
44 """
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
45 assert not (normalize and normalize_on_the_fly), "Can't normalize in two ways at the same time!"
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
46
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
47 assert name in ['avicenna','harry','rita','sylvester','ule']
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
48 common = os.path.join('UTLC','filetensor',name+'_')
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
49 trname,vname,tename = [config.get_filepath_in_roots(common+subset+'.ft.gz',
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
50 common+subset+'.ft')
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
51 for subset in ['train','valid','test']]
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
52
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
53 train = load_filetensor(trname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
54 valid = load_filetensor(vname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
55 test = load_filetensor(tename)
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
56 if randomize_valid:
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
57 rng = numpy.random.RandomState([1,2,3,4])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
58 perm = rng.permutation(valid.shape[0])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
59 valid = valid[perm]
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
60 if randomize_test:
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
61 rng = numpy.random.RandomState([1,2,3,4])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
62 perm = rng.permutation(test.shape[0])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
63 test = test[perm]
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
64
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
65 if normalize or normalize_on_the_fly:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
66 if normalize_on_the_fly:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
67 train = theano.shared(train, borrow=True, name=name+"_train")
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
68 valid = theano.shared(valid, borrow=True, name=name+"_valid")
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
69 test = theano.shared(test, borrow=True, name=name+"_test")
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
70 else:
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
71 train = numpy.asarray(train, theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
72 valid = numpy.asarray(valid, theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
73 test = numpy.asarray(test, theano.config.floatX)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
74
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
75 if name == "ule":
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
76 train /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
77 valid /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
78 test /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
79 elif name in ["avicenna", "sylvester"]:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
80 if name == "avicenna":
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
81 train_mean = 514.62154022835455
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
82 train_std = 6.829096494224145
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
83 else:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
84 train_mean = 403.81889927027686
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
85 train_std = 96.43841050784053
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
86 train -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
87 valid -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
88 test -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
89 train /= train_std
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
90 valid /= train_std
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
91 test /= train_std
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
92 elif name == "harry":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
93 std = 0.69336046033925791  # train.std() is slow to compute
1410
e7844692e6e2 normalize the utlc ndarray dataset inplace to use less memory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1408
diff changeset
94 train /= std
e7844692e6e2 normalize the utlc ndarray dataset inplace to use less memory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1408
diff changeset
95 valid /= std
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
96 test /= std
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
97 elif name == "rita":
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
98 v = numpy.asarray(230, dtype=theano.config.floatX)
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
99 train /= v
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
100 valid /= v
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
101 test /= v
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
102 else:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
103 raise Exception("This dataset doesn't have its normalization defined")
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
104 if transfer:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
105 transfer = load_filetensor(os.path.join(config.data_root(),"UTLC","filetensor",name+"_transfer.ft"))
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
106 return train, valid, test, transfer
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
107 else:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
108 return train, valid, test
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
109
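The two ideas in load_ndarray_dataset that are easiest to get wrong are the fixed-seed shuffle and normalizing only the slice that is actually used. The following is a minimal NumPy-only sketch of both, not the module's own code: the helper names fixed_permutation and normalize_batch, the toy array, and the mean/std values are all hypothetical, and plain slicing stands in for what the Theano subtensor optimization is described as doing.

```python
import numpy


def fixed_permutation(n, seed=(1, 2, 3, 4)):
    # Hypothetical helper: a fresh RandomState built from the same seed
    # returns the same permutation on every call, so the "random" order
    # is reproducible across runs.
    rng = numpy.random.RandomState(list(seed))
    return rng.permutation(n)


def normalize_batch(raw, start, stop, mean, std, dtype="float32"):
    # Hypothetical helper: copy only the requested rows to a float dtype,
    # then normalize that copy in place. The full array keeps its original
    # (smaller) dtype and is never modified.
    batch = numpy.array(raw[start:stop], dtype=dtype)
    batch -= mean
    batch /= std
    return batch


# Toy stand-in for a validation set stored as uint8.
raw_valid = numpy.arange(20, dtype="uint8").reshape(10, 2)

perm = fixed_permutation(raw_valid.shape[0])
shuffled = raw_valid[perm]          # still uint8, rows reordered

# Illustrative statistics only; the real code hard-codes per-dataset values.
batch = normalize_batch(shuffled, 0, 4, mean=9.5, std=2.0)
```

Keeping the stored array in uint8 and converting per batch is the memory argument made in the docstring: the float32 version of the same array would take four times as many bytes.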
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
110 def load_sparse_dataset(name, normalize=True, transfer=False):
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
111 """ Load the train,valid,test data for the dataset `name`
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
112 and return it in sparse format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
113
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
114 :param normalize: If True, we normalize the train dataset
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
115 before returning it
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
116 :param transfer: If True also return the transfer label
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
117 """
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
118 assert name in ['harry','terry','ule']
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
119 trname,vname,tename = [os.path.join(config.data_root(),
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
120 'UTLC','sparse',
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
121 name+'_'+subset+'.npy')
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
122 for subset in ['train','valid','test']]
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
123 train = load_sparse(trname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
124 valid = load_sparse(vname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
125 test = load_sparse(tename)
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
126 if normalize:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
127 if name == "ule":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
128 train = train.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
129 valid = valid.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
130 test = test.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
131 elif name == "harry":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
132 train = train.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
133 valid = valid.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
134 test = test.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
135 std = 0.69336046033925791  # train.std() is slow to compute
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
136 train = (train) / std
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
137 valid = (valid) / std
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
138 test = (test) / std
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
139 elif name == "terry":
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
140 train = train.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
141 valid = valid.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
142 test = test.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
143 train = (train) / 300
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
144 valid = (valid) / 300
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
145 test = (test) / 300
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
146 else:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
147 raise Exception("This dataset doesn't have its normalization defined")
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
148 if transfer:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
149 transfer = load_sparse(os.path.join(config.data_root(),"UTLC","sparse",name+"_transfer.npy"))
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
150 return train, valid, test, transfer
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
151 else:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
152 return train, valid, test
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
153
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
154 def load_ndarray_label(name):
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
155 """ Load the train,valid,test labels for the dataset `name`
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
156 and return it in ndarray format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
157
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
158 This is only available for the toy dataset ule.
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
159 """
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
160 assert name in ['ule']
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
161 trname,vname,tename = [os.path.join(config.data_root(),
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
162 'UTLC','filetensor',
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
163 name+'_'+subset+'.ft')
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
164 for subset in ['trainl','validl','testl']]
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
165 trainl = load_filetensor(trname)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
166 validl = load_filetensor(vname)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
167 testl = load_filetensor(tename)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
168 return trainl, validl, testl
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
169
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
170 def load_filetensor(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
171 f = None
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
172 try:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
173 if not os.path.exists(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
174 fname = fname+'.gz'
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
175 assert os.path.exists(fname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
176 f = gzip.open(fname)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
177 elif fname.endswith('.gz'):
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
178 f = gzip.open(fname)
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
179 else:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
180 f = open(fname, 'rb')
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
181 d = ft.read(f)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
182 finally:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
183 if f:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
184 f.close()
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
185
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
186 return d
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
187
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
188 def load_sparse(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
189 f = None
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
190 try:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
191 if not os.path.exists(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
192 fname = fname+'.gz'
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
193 assert os.path.exists(fname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
194 f = gzip.open(fname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
195 else:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
196 f = open(fname, 'rb')
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
197 d = cPickle.load(f)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
198 finally:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
199 if f:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
200 f.close()
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
201 return d
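The two loaders above repeat the same "plain file, else fall back to the .gz copy" logic, and load_sparse has no branch for a path that already ends in '.gz'. A minimal sketch of how that logic could be factored into one helper follows. The names `open_maybe_gzip` and `load_pickled` are hypothetical and not part of utlc.py, and the sketch uses `pickle` rather than `cPickle` so that it also runs on Python 3.

```python
import gzip
import os
import pickle


def open_maybe_gzip(fname):
    """Open fname for binary reading, falling back to fname + '.gz'.

    Hypothetical helper, not part of utlc.py.
    """
    if not os.path.exists(fname):
        # Same fallback as the loaders above: try the compressed copy.
        fname = fname + '.gz'
        if not os.path.exists(fname):
            raise IOError("no such file: %s" % fname)
    if fname.endswith('.gz'):
        return gzip.open(fname, 'rb')
    return open(fname, 'rb')


def load_pickled(fname):
    """Unpickle one object from a plain or gzip-compressed file."""
    f = open_maybe_gzip(fname)
    try:
        return pickle.load(f)
    finally:
        # Always release the file handle, even if unpickling fails.
        f.close()
```

Raising IOError instead of using assert keeps the missing-file check active when Python runs with -O, which strips assert statements.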
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
202
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
203 if __name__ == '__main__':
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
204 import numpy
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
205 import scipy.sparse
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
206
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
207 # Test loading of transfer data (dense ndarray format)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
208 train, valid, test, transfer = load_ndarray_dataset("ule", normalize=True, transfer=True)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
209 assert train.shape[0]==transfer.shape[0]
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
210
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
211 for name in ['avicenna','harry','rita','sylvester','ule']:
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
212 train, valid, test = load_ndarray_dataset(name, normalize=True)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
213 print name,"dtype, max, min, mean, std"
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
214 print train.dtype, train.max(), train.min(), train.mean(), train.std()
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
215 assert isinstance(train, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
216 assert isinstance(valid, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
217 assert isinstance(test, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
218 assert train.shape[1]==test.shape[1]==valid.shape[1]
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
219
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
220 # Test loading of transfer data (sparse format)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
221 train, valid, test, transfer = load_sparse_dataset("ule", normalize=True, transfer=True)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
222 assert train.shape[0]==transfer.shape[0]
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
223
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
224 for name in ['harry','terry','ule']:
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
225 train, valid, test = load_sparse_dataset(name, normalize=True)
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
226 nb_elem = numpy.prod(train.shape)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
227 mi = train.data.min()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
228 ma = train.data.max()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
229 mi = min(0, mi)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
230 ma = max(0, ma)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
231 su = train.data.sum()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
232 mean = float(su)/nb_elem
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
233 print name,"dtype, max, min, mean, nb non-zero, nb element, fraction of zeros"
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
234 print train.dtype, ma, mi, mean, train.nnz, nb_elem, (nb_elem-float(train.nnz))/nb_elem
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
235 print name,"max, min, mean, std (all stats on non-zero element)"
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
236 print train.data.max(), train.data.min(), train.data.mean(), train.data.std()
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
237 assert scipy.sparse.issparse(train)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
238 assert scipy.sparse.issparse(valid)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
239 assert scipy.sparse.issparse(test)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
240 assert train.shape[1]==test.shape[1]==valid.shape[1]
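The sparse loop above folds the implicit zeros back into the statistics: min and max are clamped against 0, and the mean divides the sum of the stored values by the total number of elements. A minimal sketch of the same arithmetic on a plain Python list, so it can be checked without loading a dataset, might look like this. `sparse_summary` is a hypothetical name, and `data` is assumed to hold only the explicitly stored values (a stand-in for train.data) of a matrix with n_rows * n_cols entries.

```python
def sparse_summary(data, n_rows, n_cols):
    """Return (max, min, mean, fraction_of_zeros) over all matrix entries.

    Hypothetical helper: `data` lists only the explicitly stored values.
    """
    nb_elem = n_rows * n_cols
    nnz = len(data)
    # Entries that are not stored are zeros, so 0 takes part in min and max.
    mi = min([0] + list(data))
    ma = max([0] + list(data))
    # Implicit zeros add nothing to the sum but still count in the denominator.
    mean = float(sum(data)) / nb_elem
    # Fraction (not percentage) of entries that are not stored.
    sparsity = (nb_elem - float(nnz)) / nb_elem
    return ma, mi, mean, sparsity
```

For three stored values 2, 4 and 6 in a 3 by 4 matrix, this gives a mean of 1.0 and a zero fraction of 0.75.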