Mercurial > ift6266
annotate datasets/defs.py @ 613:5e481b224117
fix the reading of PNIST dataset following Dumi compression of the data.
author | Frederic Bastien <nouiz@nouiz.org> |
---|---|
date | Thu, 06 Jan 2011 13:57:05 -0500 |
parents | 22efb4968054 |
children |
rev | line source |
---|---|
211
476da2ba6a12
Add nist_P07 datasets to the predefs.
Arnaud Bergeron <abergeron@gmail.com>
parents:
181
diff
changeset
|
1 __all__ = ['nist_digits', 'nist_lower', 'nist_upper', 'nist_all', 'ocr', |
349
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
2 'nist_P07', 'PNIST07', 'mnist'] |
163
4b28d7382dbf
Add initial implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
3 |
4b28d7382dbf
Add initial implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
4 from ftfile import FTDataSet |
222
4cfd0eb438af
Add mnist to datasets (and supporting code).
Arnaud Bergeron <abergeron@gmail.com>
parents:
211
diff
changeset
|
5 from gzpklfile import GzpklDataSet |
180
76bc047df5ee
Add dtype conversion and rescaling to the read path.
Arnaud Bergeron <abergeron@gmail.com>
parents:
175
diff
changeset
|
6 import theano |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
7 import os |
175
224321bf043a
Define the ocr dataset and use the existing split for nist.
Arnaud Bergeron <abergeron@gmail.com>
parents:
164
diff
changeset
|
8 |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
9 # if the environmental variables exist, get the path from them, |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
10 # otherwise fall back on the default |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
11 NIST_PATH = os.getenv('NIST_PATH','/data/lisa/data/nist/by_class/') |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
12 DATA_PATH = os.getenv('DATA_PATH','/data/lisa/data/ift6266h10/') |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
13 |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
14 nist_digits = lambda maxsize=None: FTDataSet(train_data = [os.path.join(NIST_PATH,'digits/digits_train_data.ft')], |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
15 train_lbl = [os.path.join(NIST_PATH,'digits/digits_train_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
16 test_data = [os.path.join(NIST_PATH,'digits/digits_test_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
17 test_lbl = [os.path.join(NIST_PATH,'digits/digits_test_labels.ft')], |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
18 indtype=theano.config.floatX, inscale=255., maxsize=maxsize) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
19 nist_lower = lambda maxsize=None: FTDataSet(train_data = [os.path.join(NIST_PATH,'lower/lower_train_data.ft')], |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
20 train_lbl = [os.path.join(NIST_PATH,'lower/lower_train_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
21 test_data = [os.path.join(NIST_PATH,'lower/lower_test_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
22 test_lbl = [os.path.join(NIST_PATH,'lower/lower_test_labels.ft')], |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
23 indtype=theano.config.floatX, inscale=255., maxsize=maxsize) |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
24 nist_upper = lambda maxsize=None: FTDataSet(train_data = [os.path.join(NIST_PATH,'upper/upper_train_data.ft')], |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
25 train_lbl = [os.path.join(NIST_PATH,'upper/upper_train_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
26 test_data = [os.path.join(NIST_PATH,'upper/upper_test_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
27 test_lbl = [os.path.join(NIST_PATH,'upper/upper_test_labels.ft')], |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
28 indtype=theano.config.floatX, inscale=255., maxsize=maxsize) |
163
4b28d7382dbf
Add initial implementation of datasets.
Arnaud Bergeron <abergeron@gmail.com>
parents:
diff
changeset
|
29 |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
30 nist_all = lambda maxsize=None: FTDataSet(train_data = [os.path.join(DATA_PATH,'train_data.ft')], |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
31 train_lbl = [os.path.join(DATA_PATH,'train_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
32 test_data = [os.path.join(DATA_PATH,'test_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
33 test_lbl = [os.path.join(DATA_PATH,'test_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
34 valid_data = [os.path.join(DATA_PATH,'valid_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
35 valid_lbl = [os.path.join(DATA_PATH,'valid_labels.ft')], |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
36 indtype=theano.config.floatX, inscale=255., maxsize=maxsize) |
175
224321bf043a
Define the ocr dataset and use the existing split for nist.
Arnaud Bergeron <abergeron@gmail.com>
parents:
164
diff
changeset
|
37 |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
38 ocr = lambda maxsize=None: FTDataSet(train_data = [os.path.join(DATA_PATH,'ocr_train_data.ft')], |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
39 train_lbl = [os.path.join(DATA_PATH,'ocr_train_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
40 test_data = [os.path.join(DATA_PATH,'ocr_test_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
41 test_lbl = [os.path.join(DATA_PATH,'ocr_test_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
42 valid_data = [os.path.join(DATA_PATH,'ocr_valid_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
43 valid_lbl = [os.path.join(DATA_PATH,'ocr_valid_labels.ft')], |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
44 indtype=theano.config.floatX, inscale=255., maxsize=maxsize) |
211
476da2ba6a12
Add nist_P07 datasets to the predefs.
Arnaud Bergeron <abergeron@gmail.com>
parents:
181
diff
changeset
|
45 |
269
4533350d7361
Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents:
257
diff
changeset
|
46 # Two extra arguments (min_file, max_file) select a smaller dataset by training-file number. |
4533350d7361
Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents:
257
diff
changeset
|
47 # This is useful for getting different data for pre-training and fine-tuning. |
4533350d7361
Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents:
257
diff
changeset
|
48 nist_P07 = lambda maxsize=None, min_file=0, max_file=100: FTDataSet(train_data = [os.path.join(DATA_PATH,'data/P07_train'+str(i)+'_data.ft') for i in range(min_file, max_file)], |
4533350d7361
Added a feature to define the range of P07 training files used. Useful for pre-training and fine-tuning with different data
SylvainPL <sylvain.pannetier.lebeuf@umontreal.ca>
parents:
257
diff
changeset
|
49 train_lbl = [os.path.join(DATA_PATH,'data/P07_train'+str(i)+'_labels.ft') for i in range(min_file, max_file)], |
231
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
50 test_data = [os.path.join(DATA_PATH,'data/P07_test_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
51 test_lbl = [os.path.join(DATA_PATH,'data/P07_test_labels.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
52 valid_data = [os.path.join(DATA_PATH,'data/P07_valid_data.ft')], |
6f4e3719a3cc
Added the possibility to get the paths from an env. variable + cleaned up the way we build the paths
Dumitru Erhan <dumitru.erhan@gmail.com>
parents:
222
diff
changeset
|
53 valid_lbl = [os.path.join(DATA_PATH,'data/P07_valid_labels.ft')], |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
54 indtype=theano.config.floatX, inscale=255., maxsize=maxsize) |
349
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
55 |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
56 #Added PNIST07 |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
57 PNIST07 = lambda maxsize=None, min_file=0, max_file=100: FTDataSet(train_data = [os.path.join(DATA_PATH,'data/PNIST07_train'+str(i)+'_data.ft') for i in range(min_file, max_file)], |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
58 train_lbl = [os.path.join(DATA_PATH,'data/PNIST07_train'+str(i)+'_labels.ft') for i in range(min_file, max_file)], |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
59 test_data = [os.path.join(DATA_PATH,'data/PNIST07_test_data.ft')], |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
60 test_lbl = [os.path.join(DATA_PATH,'data/PNIST07_test_labels.ft')], |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
61 valid_data = [os.path.join(DATA_PATH,'data/PNIST07_valid_data.ft')], |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
62 valid_lbl = [os.path.join(DATA_PATH,'data/PNIST07_valid_labels.ft')], |
22efb4968054
added pnist support, will check in code for data set iterator later
xaviermuller
parents:
269
diff
changeset
|
63 indtype=theano.config.floatX, inscale=255., maxsize=maxsize) |
222
4cfd0eb438af
Add mnist to datasets (and supporting code).
Arnaud Bergeron <abergeron@gmail.com>
parents:
211
diff
changeset
|
64 |
257
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
65 mnist = lambda maxsize=None: GzpklDataSet(os.path.join(DATA_PATH,'mnist.pkl.gz'), |
966272e7f14b
Make the datasets lazy-loading and add a maxsize parameter.
Arnaud Bergeron <abergeron@gmail.com>
parents:
231
diff
changeset
|
66 maxsize=maxsize) |
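
The path handling used by `nist_P07` and `PNIST07` can be sketched on its own, without `FTDataSet`. This is a minimal illustration, not code from the repository: the helper `p07_train_files` and its `suffix` and `root` parameters are hypothetical names introduced here. It only reproduces the environment-variable fallback for `DATA_PATH` and the `range(min_file, max_file)` list comprehension from the listing above.

```python
import os

# Same fallback pattern as defs.py: use the environment variable if it is set,
# otherwise fall back to the default location.
DATA_PATH = os.getenv('DATA_PATH', '/data/lisa/data/ift6266h10/')


def p07_train_files(min_file=0, max_file=100, suffix='data', root=DATA_PATH):
    """Hypothetical helper: list P07 training files with indices
    min_file .. max_file - 1 (max_file is exclusive, as with range())."""
    return [os.path.join(root, 'data/P07_train' + str(i) + '_' + suffix + '.ft')
            for i in range(min_file, max_file)]


# Disjoint file ranges give different data for pre-training and fine-tuning.
pretrain_files = p07_train_files(0, 50)
finetune_files = p07_train_files(50, 100)
```

Because `max_file` is exclusive, `min_file=0, max_file=50` and `min_file=50, max_file=100` split the training files into two halves with no overlap. Assuming the lambdas behave as written in the listing, the equivalent calls would be `nist_P07(min_file=0, max_file=50)` and `nist_P07(min_file=50, max_file=100)`.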