annotate pylearn/datasets/utlc.py @ 1432:8661f8ad407a

Add a cast in the chain of transformation of initial data set
author Pascal Lamblin <lamblinp@iro.umontreal.ca>
date Mon, 14 Feb 2011 19:27:37 -0500
parents dce602150b5f
children 08beb6f28809
rev   line source
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
1 """
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
2 Users should use the load_ndarray_dataset or load_sparse_dataset function.
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
3
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
4 See the file ${PYLEARN_DATA_ROOT}/UTLC/README for details on the datasets.
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
5
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
6 See the end of this file for an example.
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
7 """
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
8
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
9 import cPickle
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
10 import gzip
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
11 import os
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
12
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
13 import numpy
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
14 import theano
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
15
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
16 import pylearn.io.filetensor as ft
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
17 import config
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
18
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
19 def load_ndarray_dataset(name, normalize=True, transfer=False,
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
20 normalize_on_the_fly=False, randomize_valid=False,
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
21 randomize_test=False):
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
22 """ Load the train,valid,test data for the dataset `name`
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
23 and return it in ndarray format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
24
1431
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
25 We assume the data was created with ift6266h11/pretraitement/to_npy.py, which
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
26 shuffles the train set, so the train set should already be shuffled.
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
27
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
28 :param normalize: If True, we normalize the data (train, valid and test)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
29 before returning it
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
30 :param transfer: If True, also return the transfer labels (currently only available for ule)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
31 :param normalize_on_the_fly: If True, we return a Theano Variable that will give
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
32 as output the normalized value. If the user only
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
33 takes a subtensor of that variable, Theano optimizations
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
34 should ensure that we only keep in memory the subtensor
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
35 portion that is computed in normalized form. We store
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
36 the original data in shared memory in its original dtype.
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
37
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
38 It is useful to keep the original data in its original
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
39 dtype in memory to save memory. Especially useful to
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
40 be able to use rita and harry with 1G per job.
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
41 :param randomize_valid: Do we randomize the order of the valid set?
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
42 We always use the same random order
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
43 If False, return in the same order as downloaded on the web
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
44 :param randomize_test: Do we randomize the order of the test set?
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
45 We always use the same random order
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
46 If False, return in the same order as downloaded on the web
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
47 """
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
48 assert not (normalize and normalize_on_the_fly), "Can't normalize in 2 ways at the same time!"
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
49
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
50 assert name in ['avicenna','harry','rita','sylvester','ule']
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
51 common = os.path.join('UTLC','filetensor',name+'_')
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
52 trname,vname,tename = [config.get_filepath_in_roots(common+subset+'.ft.gz',
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
53 common+subset+'.ft')
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
54 for subset in ['train','valid','test']]
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
55
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
56 train = load_filetensor(trname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
57 valid = load_filetensor(vname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
58 test = load_filetensor(tename)
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
59 if randomize_valid:
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
60 rng = numpy.random.RandomState([1,2,3,4])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
61 perm = rng.permutation(valid.shape[0])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
62 valid = valid[perm]
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
63 if randomize_test:
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
64 rng = numpy.random.RandomState([1,2,3,4])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
65 perm = rng.permutation(test.shape[0])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
66 test = test[perm]
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
67
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
68 if normalize or normalize_on_the_fly:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
69 if normalize_on_the_fly:
1432
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
70 # Shared variables of the original type
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
71 train = theano.shared(train, borrow=True, name=name+"_train")
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
72 valid = theano.shared(valid, borrow=True, name=name+"_valid")
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
73 test = theano.shared(test, borrow=True, name=name+"_test")
1432
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
74 # Symbolic variables cast into floatX
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
75 train = theano.tensor.cast(train, theano.config.floatX)
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
76 valid = theano.tensor.cast(valid, theano.config.floatX)
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
77 test = theano.tensor.cast(test, theano.config.floatX)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
78 else:
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
79 train = numpy.asarray(train, theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
80 valid = numpy.asarray(valid, theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
81 test = numpy.asarray(test, theano.config.floatX)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
82
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
83 if name == "ule":
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
84 train /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
85 valid /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
86 test /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
87 elif name in ["avicenna", "sylvester"]:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
88 if name == "avicenna":
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
89 train_mean = 514.62154022835455
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
90 train_std = 6.829096494224145
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
91 else:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
92 train_mean = 403.81889927027686
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
93 train_std = 96.43841050784053
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
94 train -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
95 valid -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
96 test -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
97 train /= train_std
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
98 valid /= train_std
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
99 test /= train_std
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
100 elif name == "harry":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
101 std = 0.69336046033925791  # train.std() is slow to compute
1410
e7844692e6e2 normalize the utlc ndarray dataset inplace to use less memory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1408
diff changeset
102 train /= std
e7844692e6e2 normalize the utlc ndarray dataset inplace to use less memory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1408
diff changeset
103 valid /= std
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
104 test /= std
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
105 elif name == "rita":
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
106 v = numpy.asarray(230, dtype=theano.config.floatX)
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
107 train /= v
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
108 valid /= v
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
109 test /= v
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
110 else:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
111 raise Exception("This dataset doesn't have its normalization defined")
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
112 if transfer:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
113 transfer = load_filetensor(os.path.join(config.data_root(),"UTLC","filetensor",name+"_transfer.ft"))
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
114 return train, valid, test, transfer
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
115 else:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
116 return train, valid, test
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
117
1430
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
118 def load_sparse_dataset(name, normalize=True, transfer=False,
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
119 randomize_valid=False,
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
120 randomize_test=False):
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
121 """ Load the train,valid,test data for the dataset `name`
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
122 and return it in sparse format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
123
1431
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
124 We assume the data was created with ift6266h11/pretraitement/to_npy.py, which
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
125 shuffles the train set, so the train set should already be shuffled.
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
126
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
127 :param normalize: If True, we normalize the data (train, valid and test)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
128 before returning it
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
129 :param transfer: If True, also return the transfer labels
1430
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
130 :param randomize_valid: see same option for load_ndarray_dataset
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
131 :param randomize_test: see same option for load_ndarray_dataset
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
132
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
133 """
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
134 assert name in ['harry','terry','ule']
1429
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
135 common = os.path.join('UTLC','sparse',name+'_')
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
136 trname,vname,tename = [config.get_filepath_in_roots(common+subset+'.npy.gz',
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
137 common+subset+'.npy')
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
138 for subset in ['train','valid','test']]
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
139 train = load_sparse(trname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
140 valid = load_sparse(vname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
141 test = load_sparse(tename)
1430
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
142
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
143 # Data should already be in csr format, which supports
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
144 # this type of indexing.
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
145 if randomize_valid:
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
146 rng = numpy.random.RandomState([1,2,3,4])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
147 perm = rng.permutation(valid.shape[0])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
148 valid = valid[perm]
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
149 if randomize_test:
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
150 rng = numpy.random.RandomState([1,2,3,4])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
151 perm = rng.permutation(test.shape[0])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
152 test = test[perm]
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
153
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
154 if normalize:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
155 if name == "ule":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
156 train = train.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
157 valid = valid.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
158 test = test.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
159 elif name == "harry":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
160 train = train.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
161 valid = valid.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
162 test = test.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
163 std = 0.69336046033925791  # train.std() is slow to compute
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
164 train = (train) / std
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
165 valid = (valid) / std
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
166 test = (test) / std
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
167 elif name == "terry":
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
168 train = train.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
169 valid = valid.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
170 test = test.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
171 train = (train) / 300
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
172 valid = (valid) / 300
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
173 test = (test) / 300
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
174 else:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
175 raise Exception("This dataset doesn't have its normalization defined")
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
176 if transfer:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
177 transfer = load_sparse(os.path.join(config.data_root(),"UTLC","sparse",name+"_transfer.npy"))
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
178 return train, valid, test, transfer
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
179 else:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
180 return train, valid, test
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
181
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
182 def load_ndarray_label(name):
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
183 """ Load the train, valid and test labels for the dataset `name`
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
184 and return it in ndarray format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
185
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
186 This is only available for the toy dataset ule.
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
187 """
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
188 assert name in ['ule']
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
189 trname,vname,tename = [os.path.join(config.data_root(),
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
190 'UTLC','filetensor',
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
191 name+'_'+subset+'.ft')
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
192 for subset in ['trainl','validl','testl']]
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
193 trainl = load_filetensor(trname)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
194 validl = load_filetensor(vname)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
195 testl = load_filetensor(tename)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
196 return trainl, validl, testl
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
197
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
198 def load_filetensor(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
199 f = None
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
200 try:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
201 if not os.path.exists(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
202 fname = fname+'.gz'
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
203 assert os.path.exists(fname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
204 f = gzip.open(fname)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
205 elif fname.endswith('.gz'):
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
206 f = gzip.open(fname)
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
207 else:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
208 f = open(fname, 'rb')
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
209 d = ft.read(f)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
210 finally:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
211 if f:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
212 f.close()
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
213
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
214 return d
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
215
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
216 def load_sparse(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
217 f = None
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
218 try:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
219 if not os.path.exists(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
220 fname = fname+'.gz'
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
221 assert os.path.exists(fname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
222 f = gzip.open(fname)
1429
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
223 elif fname.endswith('.gz'):
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
224 f = gzip.open(fname)
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
225 else:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
226 f = open(fname, 'rb')
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
227 d = cPickle.load(f)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
228 finally:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
229 if f:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
230 f.close()
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
231 return d
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
232
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
233 if __name__ == '__main__':
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
234 import numpy
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
235 import scipy.sparse
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
236
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
237 # Test loading of transfer data
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
238 train, valid, test, transfer = load_ndarray_dataset("ule", normalize=True, transfer=True)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
239 assert train.shape[0]==transfer.shape[0]
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
240
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
241 for name in ['avicenna','harry','rita','sylvester','ule']:
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
242 train, valid, test = load_ndarray_dataset(name, normalize=True)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
243 print name,"dtype, max, min, mean, std"
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
244 print train.dtype, train.max(), train.min(), train.mean(), train.std()
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
245 assert isinstance(train, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
246 assert isinstance(valid, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
247 assert isinstance(test, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
248 assert train.shape[1]==test.shape[1]==valid.shape[1]
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
249
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
250 # Test loading of sparse transfer data
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
251 train, valid, test, transfer = load_sparse_dataset("ule", normalize=True, transfer=True)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
252 assert train.shape[0]==transfer.shape[0]
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
253
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
254 for name in ['harry','terry','ule']:
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
255 train, valid, test = load_sparse_dataset(name, normalize=True)
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
256 nb_elem = numpy.prod(train.shape)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
257 mi = train.data.min()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
258 ma = train.data.max()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
259 mi = min(0, mi)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
260 ma = max(0, ma)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
261 su = train.data.sum()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
262 mean = float(su)/nb_elem
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
263 print name,"dtype, max, min, mean, nb non-zero, nb element, %sparse"
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
264 print train.dtype, ma, mi, mean, train.nnz, nb_elem, (nb_elem-float(train.nnz))/nb_elem
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
265 print name,"max, min, mean, std (all stats on non-zero elements)"
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
266 print train.data.max(), train.data.min(), train.data.mean(), train.data.std()
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
267 assert scipy.sparse.issparse(train)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
268 assert scipy.sparse.issparse(valid)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
269 assert scipy.sparse.issparse(test)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
270 assert train.shape[1]==test.shape[1]==valid.shape[1]
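
Both load_filetensor and load_sparse above repeat the same open logic: use the path as given if it exists, otherwise fall back to the same path with '.gz' appended, and read through gzip whenever the final name ends in '.gz'. The sketch below factors that logic into one helper. It is an illustration only, not part of pylearn: the names open_maybe_gzipped and load_pickled are hypothetical, the standard pickle module stands in for cPickle, and a missing file raises IOError instead of tripping an assert.

```python
import gzip
import os
import pickle
import tempfile


def open_maybe_gzipped(fname):
    """Open fname for binary reading, falling back to fname + '.gz'.

    Hypothetical helper mirroring the open logic shared by
    load_filetensor and load_sparse.
    """
    if not os.path.exists(fname):
        gz_name = fname + '.gz'
        if not os.path.exists(gz_name):
            raise IOError("neither %s nor %s exists" % (fname, gz_name))
        fname = gz_name
    if fname.endswith('.gz'):
        return gzip.open(fname, 'rb')
    return open(fname, 'rb')


def load_pickled(fname):
    """Unpickle one object from fname (or fname + '.gz') and close the file."""
    f = open_maybe_gzipped(fname)
    try:
        return pickle.load(f)
    finally:
        f.close()


if __name__ == '__main__':
    # Write a small gzipped pickle, then load it through the uncompressed name.
    tmp_dir = tempfile.mkdtemp()
    base = os.path.join(tmp_dir, 'toy_transfer.npy')
    out = gzip.open(base + '.gz', 'wb')
    try:
        pickle.dump({'rows': 3}, out)
    finally:
        out.close()
    assert load_pickled(base) == {'rows': 3}
```

Opening in binary mode in both branches keeps the plain and gzipped cases consistent, which matters for pickled and filetensor data on platforms that translate line endings in text mode.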