annotate scripts/ocr_divide.py @ 167:1f5937e9e530

More moves - transformations into data_generation, added "deep" folder
author Dumitru Erhan <dumitru.erhan@gmail.com>
date Fri, 26 Feb 2010 14:15:38 -0500
parents 728e232eaf45
children 2b6a28e4cadc
rev   line source
137
728e232eaf45 Added script to separate OCR data in train, validation and test sets (raw data)
boulanni <nicolas_boulanger@hotmail.com>
parents:
diff changeset
#!/usr/bin/env python

'''
Create the OCR train, valid and test sets.
The valid set is trainorig[:20000].
The test set is trainorig[20000:40000].
The train set is trainorig[40000:].
trainorig is already shuffled.
'''

import os
import numpy
from pylearn.io import filetensor as ft

# dir1 holds the original shuffled filetensor files; dir2 receives the splits.
dir1 = '/data/lisa/data/ocr_breuel/filetensor/'
dir2 = "/data/lisa/data/ift6266h10/"

# Read the shuffled image data (binary mode) and write the three splits.
with open(dir1 + 'unlv-corrected-2010-02-01-shuffled.ft', 'rb') as f:
    d = ft.read(f)
with open(dir2 + "ocr_valid_data.ft", 'wb') as f:
    ft.write(f, d[:20000])
with open(dir2 + "ocr_test_data.ft", 'wb') as f:
    ft.write(f, d[20000:40000])
with open(dir2 + "ocr_train_data.ft", 'wb') as f:
    ft.write(f, d[40000:])

# Split the labels with the same boundaries so they stay aligned with the data.
with open(dir1 + 'unlv-corrected-2010-02-01-labels-shuffled.ft', 'rb') as f:
    d = ft.read(f)
with open(dir2 + "ocr_valid_labels.ft", 'wb') as f:
    ft.write(f, d[:20000])
with open(dir2 + "ocr_test_labels.ft", 'wb') as f:
    ft.write(f, d[20000:40000])
with open(dir2 + "ocr_train_labels.ft", 'wb') as f:
    ft.write(f, d[40000:])

# Set permissions on the generated files (owner rwx, group and others read-only).
# 0o744 is the octal spelling accepted by both Python 2.6+ and Python 3.
for i in ["train", "valid", "test"]:
    os.chmod(dir2 + "ocr_" + i + "_data.ft", 0o744)
    os.chmod(dir2 + "ocr_" + i + "_labels.ft", 0o744)
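
The script applies the same three slices to the data and then to the labels. As a minimal sketch of that split rule, the hypothetical helper below (`split_rows` is not part of the original script or of pylearn) takes any sliceable, already-shuffled sequence and returns the three parts. It is shown on small plain lists so it can run without the OCR files; the same slicing should work on the arrays returned by `ft.read`.

```python
def split_rows(rows, n_valid=20000, n_test=20000):
    """Split an already-shuffled sequence into (train, valid, test).

    Hypothetical helper mirroring the script's slices:
    valid = rows[:n_valid], test = rows[n_valid:n_valid + n_test],
    train = everything after that.
    """
    if len(rows) < n_valid + n_test:
        raise ValueError("not enough rows for the requested split")
    valid = rows[:n_valid]
    test = rows[n_valid:n_valid + n_test]
    train = rows[n_valid + n_test:]
    return train, valid, test


# Toy stand-ins for the image data and labels (10 examples each).
images = [[i] * 3 for i in range(10)]
labels = list(range(10))

# Using the same boundaries for both keeps example i paired with label i.
train_x, valid_x, test_x = split_rows(images, n_valid=2, n_test=2)
train_y, valid_y, test_y = split_rows(labels, n_valid=2, n_test=2)
```

Calling one helper for both files is a design choice, not something the original does: it keeps the split boundaries in a single place, so the data and label files cannot drift apart if the sizes are ever changed.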