annotate pylearn/datasets/utlc.py @ 1477:48efafaaf7fb

Add function for loading the transfer labels of utlc
author Pascal Lamblin <lamblinp@iro.umontreal.ca>
date Sat, 21 May 2011 01:03:10 -0400
parents 2aa80f5b5bbc
children
rev   line source
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
1 """
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
2 Users should call the load_ndarray_dataset or load_sparse_dataset function.
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
3
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
4 See the file ${PYLEARN_DATA_ROOT}/UTLC/README for details on the datasets.
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
5
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
6 See the end of this file for an example.
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
7 """
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
8
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
9 import cPickle
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
10 import gzip
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
11 import os
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
12
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
13 import numpy
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
14 import theano
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
15
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
16 import pylearn.io.filetensor as ft
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
17 import config
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
18
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
19 def load_ndarray_dataset(name, normalize=True, transfer=False,
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
20 normalize_on_the_fly=False, randomize_valid=False,
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
21 randomize_test=False):
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
22 """ Load the train,valid,test data for the dataset `name`
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
23 and return it in ndarray format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
24
1431
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
25 We assume the data was created with ift6266h11/pretraitement/to_npy.py, which
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
26 shuffles the training set, so the training set should already be shuffled.
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
27
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
28 :param normalize: If True, we normalize the train, valid and test sets
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
29 before returning it
1442
08beb6f28809 Update inline doc.
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1432
diff changeset
30 :param transfer: If True also return the transfer labels
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
31 :param normalize_on_the_fly: If True, we return a Theano Variable that will give
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
32 as output the normalized value. If the user only
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
33 takes a subtensor of that variable, Theano optimizations
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
34 should ensure that we only keep in memory the subtensor
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
35 portion that is computed in normalized form. We store
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
36 the original data in shared memory in its original dtype.
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
37
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
38 This is useful to keep the original data in its original
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
39 dtype in memory to save memory. Especially useful to
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
40 be able to use rita and harry with 1G per job.
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
41 :param randomize_valid: Do we randomize the order of the valid set?
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
42 If True, we always use the same random order.
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
43 If False, return in the same order as downloaded on the web
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
44 :param randomize_test: Do we randomize the order of the test set?
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
45 If True, we always use the same random order.
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
46 If False, return in the same order as downloaded on the web
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
47 """
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
48 assert not (normalize and normalize_on_the_fly), "Can't normalize in 2 ways at the same time!"
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
49
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
50 assert name in ['avicenna','harry','rita','sylvester','ule']
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
51 common = os.path.join('UTLC','filetensor',name+'_')
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
52 trname,vname,tename = [config.get_filepath_in_roots(common+subset+'.ft.gz',
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
53 common+subset+'.ft')
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
54 for subset in ['train','valid','test']]
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
55
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
56 train = load_filetensor(trname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
57 valid = load_filetensor(vname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
58 test = load_filetensor(tename)
1428
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
59 if randomize_valid:
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
60 rng = numpy.random.RandomState([1,2,3,4])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
61 perm = rng.permutation(valid.shape[0])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
62 valid = valid[perm]
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
63 if randomize_test:
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
64 rng = numpy.random.RandomState([1,2,3,4])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
65 perm = rng.permutation(test.shape[0])
3823dbfff6cf add parameter to randomize the valid and test data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1427
diff changeset
66 test = test[perm]
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
67
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
68 if normalize or normalize_on_the_fly:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
69 if normalize_on_the_fly:
1432
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
70 # Shared variables of the original type
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
71 train = theano.shared(train, borrow=True, name=name+"_train")
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
72 valid = theano.shared(valid, borrow=True, name=name+"_valid")
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
73 test = theano.shared(test, borrow=True, name=name+"_test")
1432
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
74 # Symbolic variables cast into floatX
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
75 train = theano.tensor.cast(train, theano.config.floatX)
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
76 valid = theano.tensor.cast(valid, theano.config.floatX)
8661f8ad407a Add a cast in the chain of transformation of initial data set
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1431
diff changeset
77 test = theano.tensor.cast(test, theano.config.floatX)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
78 else:
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
79 train = numpy.asarray(train, theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
80 valid = numpy.asarray(valid, theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
81 test = numpy.asarray(test, theano.config.floatX)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
82
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
83 if name == "ule":
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
84 train /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
85 valid /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
86 test /= 255
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
87 elif name in ["avicenna", "sylvester"]:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
88 if name == "avicenna":
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
89 train_mean = 514.62154022835455
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
90 train_std = 6.829096494224145
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
91 else:
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
92 train_mean = 403.81889927027686
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
93 train_std = 96.43841050784053
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
94 train -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
95 valid -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
96 test -= train_mean
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
97 train /= train_std
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
98 valid /= train_std
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
99 test /= train_std
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
100 elif name == "harry":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
101 std = 0.69336046033925791  # train.std() is slow to compute
1410
e7844692e6e2 normalize the utlc ndarray dataset inplace to use less memory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1408
diff changeset
102 train /= std
e7844692e6e2 normalize the utlc ndarray dataset inplace to use less memory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1408
diff changeset
103 valid /= std
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
104 test /= std
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
105 elif name == "rita":
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
106 v = numpy.asarray(230, dtype=theano.config.floatX)
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
107 train /= v
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
108 valid /= v
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
109 test /= v
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
110 else:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
111 raise Exception("This dataset doesn't have its normalization defined")
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
112 if transfer:
1477
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
113 transfer = load_ndarray_transfer(name)
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
114 return train, valid, test, transfer
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
115 else:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
116 return train, valid, test
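
Note on normalize_on_the_fly: the docstring of load_ndarray_dataset motivates keeping the data in its original dtype and producing the normalized values only for the part that is actually used. The block below is a minimal NumPy-only sketch of that idea, not the Theano code path of this file. The array `raw`, the helper `normalized_minibatch`, the scale of 255 (borrowed from the "ule" branch) and the float32 dtype are all illustrative assumptions.

```python
import numpy

# Hypothetical stand-in for a dataset stored in a compact dtype (uint8),
# similar in spirit to the "ule" pixel data. Not the real UTLC data.
raw = numpy.arange(12, dtype="uint8").reshape(4, 3)


def normalized_minibatch(data, start, stop, scale=255.0, dtype="float32"):
    """Cast and scale only rows [start, stop) of `data`.

    The full array stays in its original dtype; only the requested
    slice is copied to floating point and normalized.
    """
    # asarray with a different dtype copies just the slice, not `data`.
    batch = numpy.asarray(data[start:stop], dtype=dtype)
    batch /= scale  # in-place division on the small float copy
    return batch


batch = normalized_minibatch(raw, 0, 2)
```

In this sketch `raw` remains uint8 (1 byte per value) while only the requested rows exist as float32 (4 bytes per value), which is the kind of saving the docstring refers to when it mentions running with 1G per job.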
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
117
1430
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
118 def load_sparse_dataset(name, normalize=True, transfer=False,
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
119 randomize_valid=False,
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
120 randomize_test=False):
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
121 """ Load the train,valid,test data for the dataset `name`
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
122 and return it in sparse format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
123
1431
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
124 We suppose the data was created with ift6266h11/pretraitement/to_npy.py that
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
125 shuffle the train. So the train should already be shuffled.
dce602150b5f added comment.
Frederic Bastien <nouiz@nouiz.org>
parents: 1430
diff changeset
126
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
127 :param normalize: If True, we normalize the train, valid and test sets
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
128 before returning it
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
129 :param transfer: If True also return the transfer label
1430
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
130 :param randomize_valid: see same option for load_ndarray_dataset
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
131 :param randomize_test: see same option for load_ndarray_dataset
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
132
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
133 """
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
134 assert name in ['harry','terry','ule']
1429
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
135 common = os.path.join('UTLC','sparse',name+'_')
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
136 trname,vname,tename = [config.get_filepath_in_roots(common+subset+'.npy.gz',
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
137 common+subset+'.npy')
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
138 for subset in ['train','valid','test']]
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
139 train = load_sparse(trname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
140 valid = load_sparse(vname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
141 test = load_sparse(tename)
1430
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
142
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
143 # Data should already be in CSR format, which supports
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
144 # this type of indexing.
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
145 if randomize_valid:
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
146 rng = numpy.random.RandomState([1,2,3,4])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
147 perm = rng.permutation(valid.shape[0])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
148 valid = valid[perm]
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
149 if randomize_test:
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
150 rng = numpy.random.RandomState([1,2,3,4])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
151 perm = rng.permutation(test.shape[0])
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
152 test = test[perm]
931a19eeab5a 'allow to randomize the sparse valid/test utlc dataset at load time'
Frederic Bastien <nouiz@nouiz.org>
parents: 1429
diff changeset
153
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
154 if normalize:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
155 if name == "ule":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
156 train = train.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
157 valid = valid.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
158 test = test.astype(theano.config.floatX) / 255
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
159 elif name == "harry":
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
160 train = train.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
161 valid = valid.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
162 test = test.astype(theano.config.floatX)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
163 std = 0.69336046033925791  # train.std() is slow to compute
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
164 train = (train) / std
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
165 valid = (valid) / std
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
166 test = (test) / std
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
167 elif name == "terry":
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
168 train = train.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
169 valid = valid.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
170 test = test.astype(theano.config.floatX)
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
171 train = (train) / 300
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
172 valid = (valid) / 300
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
173 test = (test) / 300
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
174 else:
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
175 raise Exception("This dataset doesn't have its normalization defined")
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
176 if transfer:
1461
2aa80f5b5bbc When loading sparse UTLC datasets, do not load sparse labels.
Philippe Serhal <philippe.serhal@umontreal.ca>
parents: 1442
diff changeset
177 transfer = load_filetensor(os.path.join(config.data_root(),"UTLC","filetensor",name+"_transfer.ft"))
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
178 return train, valid, test, transfer
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
179 else:
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
180 return train, valid, test
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
181
1477
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
182 def load_ndarray_transfer(name):
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
183 """
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
184 Load the transfer labels for the training set of data set `name`.
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
185
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
186 It will be returned in ndarray format.
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
187 """
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
188 assert name in ['avicenna','harry','rita','sylvester','terry','ule']
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
189 transfer = load_filetensor(os.path.join(config.data_root(), 'UTLC',
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
190 'filetensor', name+'_transfer.ft'))
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
191 return transfer
48efafaaf7fb Add function for loading the transfer labels of utlc
Pascal Lamblin <lamblinp@iro.umontreal.ca>
parents: 1461
diff changeset
192
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
193 def load_ndarray_label(name):
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
194 """ Load the train, valid and test labels for the dataset `name`
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
195 and return it in ndarray format.
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
196
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
197 This is only available for the toy dataset ule.
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
198 """
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
199 assert name in ['ule']
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
200 trname,vname,tename = [os.path.join(config.data_root(),
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
201 'UTLC','filetensor',
1427
a36d3a406c59 fix whitespace/indentation.
Frederic Bastien <nouiz@nouiz.org>
parents: 1426
diff changeset
202 name+'_'+subset+'.ft')
1411
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
203 for subset in ['trainl','validl','testl']]
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
204 trainl = load_filetensor(trname)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
205 validl = load_filetensor(vname)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
206 testl = load_filetensor(tename)
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
207 return trainl, validl, testl
68fdb895f53f allow to load the ule labels.
Frederic Bastien <nouiz@nouiz.org>
parents: 1410
diff changeset
208
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
209 def load_filetensor(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
210 f = None
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
211 try:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
212 if not os.path.exists(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
213 fname = fname+'.gz'
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
214 assert os.path.exists(fname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
215 f = gzip.open(fname)
1426
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
216 elif fname.endswith('.gz'):
4988f8ea0836 in utlc.py, implement a parameter that return a Theano variable that represent the normalized data. Usefull for on the fly normalization.
Frederic Bastien <nouiz@nouiz.org>
parents: 1411
diff changeset
217 f = gzip.open(fname)
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
218 else:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
219 f = open(fname, 'rb')  # filetensor data is binary
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
220 d = ft.read(f)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
221 finally:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
222 if f:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
223 f.close()
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
224
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
225 return d
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
226
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
227 def load_sparse(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
228 f = None
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
229 try:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
230 if not os.path.exists(fname):
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
231 fname = fname+'.gz'
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
232 assert os.path.exists(fname)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
233 f = gzip.open(fname)
1429
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
234 elif fname.endswith('.gz'):
b0141efbf6a2 fix loading of sparse utlc dataset when PYLEARN_DATA_ROOT have more then 1 directory.
Frederic Bastien <nouiz@nouiz.org>
parents: 1428
diff changeset
235 f = gzip.open(fname)
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
236 else:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
237 f = open(fname, 'rb')  # pickled data is binary
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
238 d = cPickle.load(f)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
239 finally:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
240 if f:
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
241 f.close()
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
242 return d
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
243
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
244 if __name__ == '__main__':
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
245 import numpy
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
246 import scipy.sparse
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
247
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
248 # Test loading of transfer data (ndarray version)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
249 train, valid, test, transfer = load_ndarray_dataset("ule", normalize=True, transfer=True)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
250 assert train.shape[0]==transfer.shape[0]
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
251
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
252 for name in ['avicenna','harry','rita','sylvester','ule']:
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
253 train, valid, test = load_ndarray_dataset(name, normalize=True)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
254 print name,"dtype, max, min, mean, std"
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
255 print train.dtype, train.max(), train.min(), train.mean(), train.std()
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
256 assert isinstance(train, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
257 assert isinstance(valid, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
258 assert isinstance(test, numpy.ndarray)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
259 assert train.shape[1]==test.shape[1]==valid.shape[1]
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
260
1408
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
261 # Test loading of transfer data (sparse version)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
262 train, valid, test, transfer = load_sparse_dataset("ule", normalize=True, transfer=True)
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
263 assert train.shape[0]==transfer.shape[0]
2993b2a5c1af allow to load UTLC transfer label data.
Frederic Bastien <nouiz@nouiz.org>
parents: 1406
diff changeset
264
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
265 for name in ['harry','terry','ule']:
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
266 train, valid, test = load_sparse_dataset(name, normalize=True)
1404
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
267 nb_elem = numpy.prod(train.shape)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
268 mi = train.data.min()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
269 ma = train.data.max()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
270 mi = min(0, mi)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
271 ma = max(0, ma)
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
272 su = train.data.sum()
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
273 mean = float(su)/nb_elem
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
274 print name,"dtype, max, min, mean, nb non-zero, nb element, %sparse"
89017617ab36 normalize 5 of the UTLC datasets.
Frederic Bastien <nouiz@nouiz.org>
parents: 1402
diff changeset
275 print train.dtype, ma, mi, mean, train.nnz, nb_elem, (nb_elem-float(train.nnz))/nb_elem
1406
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
276 print name,"max, min, mean, std (all stats on non-zero element)"
6003f733a994 added the normalization of the last UTLC dataset
Frederic Bastien <nouiz@nouiz.org>
parents: 1404
diff changeset
277 print train.data.max(), train.data.min(), train.data.mean(), train.data.std()
1402
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
278 assert scipy.sparse.issparse(train)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
279 assert scipy.sparse.issparse(valid)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
280 assert scipy.sparse.issparse(test)
b14f3d6f5cd4 first version of a script to load the utlc datasets.
Frederic Bastien <nouiz@nouiz.org>
parents:
diff changeset
281 assert train.shape[1]==test.shape[1]==valid.shape[1]
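
Note: load_filetensor and load_sparse above repeat the same "try the plain file, fall back to fname + '.gz'" logic and differ only in the reader they apply (ft.read versus cPickle.load). The following is a minimal sketch of how that shared step could be factored out. The names open_maybe_gzip and load_with are hypothetical and are not part of utlc.py. Unlike the original, which uses an assert, this sketch raises IOError when neither file exists.

```python
import gzip
import os


def open_maybe_gzip(fname):
    """Open `fname` for binary reading, falling back to `fname + '.gz'`.

    Hypothetical helper (not in utlc.py). Mirrors the branch order used by
    load_filetensor and load_sparse: missing file -> try the .gz variant,
    existing .gz file -> gzip, anything else -> plain binary open.
    """
    if not os.path.exists(fname):
        gz_name = fname + '.gz'
        if not os.path.exists(gz_name):
            raise IOError("neither %s nor %s exists" % (fname, gz_name))
        return gzip.open(gz_name, 'rb')
    if fname.endswith('.gz'):
        return gzip.open(fname, 'rb')
    return open(fname, 'rb')


def load_with(reader, fname):
    """Apply `reader` (any callable taking a file object) to the opened file.

    The file is always closed, even if `reader` raises.
    """
    f = open_maybe_gzip(fname)
    try:
        return reader(f)
    finally:
        f.close()
```

Under these assumptions, load_filetensor(fname) would reduce to load_with(ft.read, fname) and load_sparse(fname) to load_with(cPickle.load, fname).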