annotate datasets/defs.py @ 613:5e481b224117

fix the reading of PNIST dataset following Dumi compression of the data.
author Frederic Bastien <nouiz@nouiz.org>
date Thu, 06 Jan 2011 13:57:05 -0500
parents 22efb4968054
children
rev   line source
211
476da2ba6a12 Add nist_P07 datasets to the predefs.
Arnaud Bergeron <abergeron@gmail.com>
parents: 181
diff changeset
1 __all__ = ['nist_digits', 'nist_lower', 'nist_upper', 'nist_all', 'ocr',
349
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
2 'nist_P07', 'PNIST07', 'mnist']
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
3
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
4 from ftfile import FTDataSet
222
4cfd0eb438af Add mnist to datasets (and supporting code).
Arnaud Bergeron <abergeron@gmail.com>
parents: 211
diff changeset
5 from gzpklfile import GzpklDataSet
180
76bc047df5ee Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents: 175
diff changeset
6 import theano
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
7 import os
175
224321bf043a Define the ocr dataset and use the existing split for nist.
Arnaud Bergeron <abergeron@gmail.com>
parents: 164
diff changeset
8
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
9 # if the environmental variables exist, get the path from them,
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
10 # otherwise fall back on the default
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
11 NIST_PATH = os.getenv('NIST_PATH','/data/lisa/data/nist/by_class/')
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
12 DATA_PATH = os.getenv('DATA_PATH','/data/lisa/data/ift6266h10/')
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
13
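The two path definitions above read a directory from an environment variable and fall back to a hard-coded default. A minimal sketch of that pattern follows; the helper `resolve_data_path` and its `environ` parameter are hypothetical names introduced here for illustration and are not part of defs.py.

```python
import os


def resolve_data_path(var_name, default, environ=None):
    """Return the directory named by an environment variable, or a default.

    `environ` is a hypothetical hook so the lookup can be exercised without
    touching the real process environment; it defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    # Same behaviour as os.getenv(var_name, default) when environ is os.environ.
    return environ.get(var_name, default)


# With the variable unset, the default directory is used.
default_dir = resolve_data_path('NIST_PATH', '/data/lisa/data/nist/by_class/',
                                environ={})

# With the variable set, its value takes precedence (example value only).
override_dir = resolve_data_path('NIST_PATH', '/data/lisa/data/nist/by_class/',
                                 environ={'NIST_PATH': '/tmp/nist/'})

# File names are then joined onto whichever directory was chosen.
digits_train = os.path.join(override_dir, 'digits/digits_train_data.ft')
```

Setting `NIST_PATH` or `DATA_PATH` before the module is imported is what changes the paths; defs.py evaluates `os.getenv` once at import time.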
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
14 nist_digits = lambda maxsize=None: FTDataSet(train_data = [os.path.join(NIST_PATH,'digits/digits_train_data.ft')],
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
15 train_lbl = [os.path.join(NIST_PATH,'digits/digits_train_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
16 test_data = [os.path.join(NIST_PATH,'digits/digits_test_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
17 test_lbl = [os.path.join(NIST_PATH,'digits/digits_test_labels.ft')],
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
18 indtype=theano.config.floatX, inscale=255., maxsize=maxsize)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
19 nist_lower = lambda maxsize=None: FTDataSet(train_data = [os.path.join(NIST_PATH,'lower/lower_train_data.ft')],
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
20 train_lbl = [os.path.join(NIST_PATH,'lower/lower_train_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
21 test_data = [os.path.join(NIST_PATH,'lower/lower_test_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
22 test_lbl = [os.path.join(NIST_PATH,'lower/lower_test_labels.ft')],
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
23 indtype=theano.config.floatX, inscale=255., maxsize=maxsize)
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
24 nist_upper = lambda maxsize=None: FTDataSet(train_data = [os.path.join(NIST_PATH,'upper/upper_train_data.ft')],
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
25 train_lbl = [os.path.join(NIST_PATH,'upper/upper_train_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
26 test_data = [os.path.join(NIST_PATH,'upper/upper_test_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
27 test_lbl = [os.path.join(NIST_PATH,'upper/upper_test_labels.ft')],
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
28 indtype=theano.config.floatX, inscale=255., maxsize=maxsize)
163
4b28d7382dbf Add inital implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff changeset
29
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
30 nist_all = lambda maxsize=None: FTDataSet(train_data = [os.path.join(DATA_PATH,'train_data.ft')],
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
31 train_lbl = [os.path.join(DATA_PATH,'train_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
32 test_data = [os.path.join(DATA_PATH,'test_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
33 test_lbl = [os.path.join(DATA_PATH,'test_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
34 valid_data = [os.path.join(DATA_PATH,'valid_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
35 valid_lbl = [os.path.join(DATA_PATH,'valid_labels.ft')],
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
36 indtype=theano.config.floatX, inscale=255., maxsize=maxsize)
175
224321bf043a Define the ocr dataset and use the existing split for nist.
Arnaud Bergeron <abergeron@gmail.com>
parents: 164
diff changeset
37
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
38 ocr = lambda maxsize=None: FTDataSet(train_data = [os.path.join(DATA_PATH,'ocr_train_data.ft')],
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
39 train_lbl = [os.path.join(DATA_PATH,'ocr_train_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
40 test_data = [os.path.join(DATA_PATH,'ocr_test_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
41 test_lbl = [os.path.join(DATA_PATH,'ocr_test_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
42 valid_data = [os.path.join(DATA_PATH,'ocr_valid_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
43 valid_lbl = [os.path.join(DATA_PATH,'ocr_valid_labels.ft')],
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
44 indtype=theano.config.floatX, inscale=255., maxsize=maxsize)
211
476da2ba6a12 Add nist_P07 datasets to the predefs.
Arnaud Bergeron <abergeron@gmail.com>
parents: 181
diff changeset
45
269
4533350d7361 Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents: 257
diff changeset
46 #There are 2 more arguments here (min_file and max_file) for choosing smaller datasets based on the file number.
4533350d7361 Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents: 257
diff changeset
47 #This is useful for getting different data for pre-training and fine-tuning
4533350d7361 Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents: 257
diff changeset
48 nist_P07 = lambda maxsize=None, min_file=0, max_file=100: FTDataSet(train_data = [os.path.join(DATA_PATH,'data/P07_train'+str(i)+'_data.ft') for i in range(min_file, max_file)],
4533350d7361 Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents: 257
diff changeset
49 train_lbl = [os.path.join(DATA_PATH,'data/P07_train'+str(i)+'_labels.ft') for i in range(min_file, max_file)],
231
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
50 test_data = [os.path.join(DATA_PATH,'data/P07_test_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
51 test_lbl = [os.path.join(DATA_PATH,'data/P07_test_labels.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
52 valid_data = [os.path.join(DATA_PATH,'data/P07_valid_data.ft')],
6f4e3719a3cc Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents: 222
diff changeset
53 valid_lbl = [os.path.join(DATA_PATH,'data/P07_valid_labels.ft')],
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
54 indtype=theano.config.floatX, inscale=255., maxsize=maxsize)
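The `min_file` and `max_file` arguments of `nist_P07` select a half-open range of numbered training files, so `max_file` itself is excluded. The sketch below reproduces only the file-name construction; the helper `p07_train_files`, its `suffix` parameter, and the `/tmp/ift6266h10` directory are assumptions made for illustration, and no `FTDataSet` is built.

```python
import os


def p07_train_files(data_path, min_file=0, max_file=100, suffix='data'):
    """List P07 training file paths for file numbers min_file..max_file-1.

    Hypothetical helper: mirrors the list comprehension used in nist_P07,
    with `suffix` being 'data' or 'labels'.
    """
    return [os.path.join(data_path,
                         'data/P07_train' + str(i) + '_' + suffix + '.ft')
            for i in range(min_file, max_file)]


# Example directory only; the real default comes from DATA_PATH.
root = '/tmp/ift6266h10'

# Disjoint file ranges give different data for the two training phases.
pretrain_files = p07_train_files(root, min_file=0, max_file=3)
finetune_files = p07_train_files(root, min_file=3, max_file=5)
finetune_labels = p07_train_files(root, min_file=3, max_file=5,
                                  suffix='labels')
```

Here `pretrain_files` covers files 0 to 2 and `finetune_files` covers files 3 and 4, so the two phases never share a training file.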
349
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
55
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
56 #Added PNIST07
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
57 PNIST07 = lambda maxsize=None, min_file=0, max_file=100: FTDataSet(train_data = [os.path.join(DATA_PATH,'data/PNIST07_train'+str(i)+'_data.ft') for i in range(min_file, max_file)],
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
58 train_lbl = [os.path.join(DATA_PATH,'data/PNIST07_train'+str(i)+'_labels.ft') for i in range(min_file, max_file)],
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
59 test_data = [os.path.join(DATA_PATH,'data/PNIST07_test_data.ft')],
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
60 test_lbl = [os.path.join(DATA_PATH,'data/PNIST07_test_labels.ft')],
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
61 valid_data = [os.path.join(DATA_PATH,'data/PNIST07_valid_data.ft')],
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
62 valid_lbl = [os.path.join(DATA_PATH,'data/PNIST07_valid_labels.ft')],
22efb4968054 added pnist support, will check in code for data set iterator later
xaviermuller
parents: 269
diff changeset
63 indtype=theano.config.floatX, inscale=255., maxsize=maxsize)
222
4cfd0eb438af Add mnist to datasets (and supporting code).
Arnaud Bergeron <abergeron@gmail.com>
parents: 211
diff changeset
64
257
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
65 mnist = lambda maxsize=None: GzpklDataSet(os.path.join(DATA_PATH,'mnist.pkl.gz'),
966272e7f14b Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents: 231
diff changeset
66 maxsize=maxsize)
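Every predefined dataset in this file is a lambda, so importing the module builds nothing; a dataset object is only constructed when the name is called. A minimal sketch of that lazy-loading idea is shown below, assuming a stand-in class: `FakeDataSet`, `toy`, and the file name are hypothetical and do not reflect the actual `FTDataSet` constructor.

```python
class FakeDataSet(object):
    """Hypothetical stand-in for FTDataSet; it only records its arguments."""

    instances = 0  # counts how many objects have been constructed

    def __init__(self, train_data, maxsize=None):
        FakeDataSet.instances += 1
        self.train_data = train_data
        self.maxsize = maxsize


# Defining the factory does not construct a dataset.
toy = lambda maxsize=None: FakeDataSet(train_data=['toy_train_data.ft'],
                                       maxsize=maxsize)
built_before_call = FakeDataSet.instances

# Calling it constructs one, forwarding the optional maxsize.
ds = toy(maxsize=1000)
built_after_call = FakeDataSet.instances
```

Usage with the real definitions would follow the same shape, for example calling `mnist()` or `nist_digits(maxsize=...)`; what `maxsize` limits exactly is determined by the dataset classes, which are not shown on this page.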