annotate data_generation/pipeline/filter_nist.py @ 635:d2d7ce0f0942

merge
author Yoshua Bengio <bengioy@iro.umontreal.ca>
date Sat, 19 Mar 2011 22:58:06 -0400
parents 75dbbe409578
children
rev   line source
626
75dbbe409578 Added code for deep mlp, experiment code to go along with it. Also added code I used to filter the P07 / PNIST07 datasets to keep only digits.
fsavard
parents:
diff changeset
import numpy
from pylearn.io import filetensor as ft
from ift6266 import datasets
from ift6266.datasets.ftfile import FTDataSet

dataset_str = 'P07_' # NISTP # 'P07safe_'

#base_path = '/data/lisatmp/ift6266h10/data/'+dataset_str
#base_output_path = '/data/lisatmp/ift6266h10/data/transformed_digits/'+dataset_str+'train'

base_path = '/data/lisa/data/ift6266h10/data/'+dataset_str
base_output_path = '/data/lisatmp/ift6266h10/data/transformed_digits/'+dataset_str+'train'

# Each of the 100 training files is filtered independently.
for fileno in range(100):
    print "Processing file no ", fileno

    output_data_file = base_output_path+str(fileno)+'_data.ft'
    output_labels_file = base_output_path+str(fileno)+'_labels.ft'

    print "Reading from ",base_path+'train'+str(fileno)+'_data.ft'

    # Only ds.train() is used below.
    dataset = lambda maxsize=None, min_file=0, max_file=100: \
                    FTDataSet(train_data = [base_path+'train'+str(fileno)+'_data.ft'],
                              train_lbl = [base_path+'train'+str(fileno)+'_labels.ft'],
                              test_data = [base_path+'_test_data.ft'],
                              test_lbl = [base_path+'_test_labels.ft'],
                              valid_data = [base_path+'_valid_data.ft'],
                              valid_lbl = [base_path+'_valid_labels.ft'])
                              # no conversion or scaling... keep data as is
                              #indtype=theano.config.floatX, inscale=255., maxsize=maxsize)

    ds = dataset()

    all_x = []
    all_y = []

    all_count = 0

    # Minibatch size 1: examine one example at a time and keep labels 0-9.
    for mb_x,mb_y in ds.train(1):
        if mb_y[0] <= 9:
            all_x.append(mb_x[0])
            all_y.append(mb_y[0])

        if (all_count+1) % 100000 == 0:
            print "Done next 100k"

        all_count += 1

    # data is stored as uint8 on 0-255
    merged_x = numpy.asarray(all_x, dtype=numpy.uint8)
    merged_y = numpy.asarray(all_y, dtype=numpy.int32)

    print "Kept", len(all_x), "(shape ", merged_x.shape, ") examples from", all_count

    f = open(output_data_file, 'wb')
    ft.write(f, merged_x)
    f.close()

    f = open(output_labels_file, 'wb')
    ft.write(f, merged_y)
    f.close()
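
The script above keeps an example only when its label is at most 9, checking one example per iteration. A minimal sketch of the same filter written as a vectorized NumPy mask follows. It assumes the images and labels of one file are already loaded as arrays; `filter_digits` and the toy arrays are hypothetical names for illustration and are not part of the ift6266 code. The reading that labels 0-9 are the digit classes is taken from the `mb_y[0] <= 9` test and the changeset description.

```python
import numpy


def filter_digits(x, y, max_label=9):
    """Keep only the rows whose label is <= max_label.

    x: 2-D array, one flattened image per row (hypothetical input).
    y: 1-D array of integer class labels, same length as x.
    """
    x = numpy.asarray(x)
    y = numpy.asarray(y)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of examples")
    mask = y <= max_label
    # Same dtypes the script uses when it writes the filtered files.
    return x[mask].astype(numpy.uint8), y[mask].astype(numpy.int32)


# Toy data: four 2-pixel "images" with labels 3, 12, 9 and 40.
toy_x = numpy.array([[0, 255], [10, 20], [30, 40], [50, 60]], dtype=numpy.uint8)
toy_y = numpy.array([3, 12, 9, 40])
kept_x, kept_y = filter_digits(toy_x, toy_y)
print("Kept %d examples from %d" % (len(kept_y), len(toy_y)))
```

The mask avoids the Python-level loop but needs the whole file in memory at once, which the script also requires once it has collected the kept examples in lists before writing.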