annotate scripts/ocr_divide.py @ 239:42005ec87747

Merged (manually) Sylvain's changes to use Arnaud's dataset code, with the difference that I don't use the givens. I probably also have a different approach to limiting the dataset size while debugging.
author fsavard
date Mon, 15 Mar 2010 18:30:21 -0400
parents 2b6a28e4cadc
children
rev   line source
137
728e232eaf45 Added script to separate OCR data in train, validation and test sets (raw data)
boulanni <nicolas_boulanger@hotmail.com>
parents:
diff changeset
1 #!/usr/bin/env python
2
3 '''
4 Create the OCR train, valid and test sets.
182
2b6a28e4cadc I re-split the pure NIST/OCR data to get test and validation sets of 80000 rather than 20000, as we discussed in class
boulanni <nicolas_boulanger@hotmail.com>
parents: 137
diff changeset
5 The valid set is trainorig[:80000]
6 The test set is trainorig[80000:160000]
7 The train set is trainorig[160000:]
137
8 trainorig is already shuffled
9 '''
10
11 from pylearn.io import filetensor as ft
12 import numpy, os
13
14 dir1 = '/data/lisa/data/ocr_breuel/filetensor/'
15 dir2 = "/data/lisa/data/ift6266h10/"
16
17 f = open(dir1 + 'unlv-corrected-2010-02-01-shuffled.ft', 'rb')  # binary file: read in binary mode
18 d = ft.read(f)
19 f = open(dir2 + "ocr_valid_data.ft", 'wb')
182
20 ft.write(f, d[:80000])
137
21 f = open(dir2 + "ocr_test_data.ft", 'wb')
182
22 ft.write(f, d[80000:160000])
137
23 f = open(dir2 + "ocr_train_data.ft", 'wb')
182
24 ft.write(f, d[160000:])
137
25
26 f = open(dir1 + 'unlv-corrected-2010-02-01-labels-shuffled.ft', 'rb')  # binary file: read in binary mode
27 d = ft.read(f)
28 f = open(dir2 + "ocr_valid_labels.ft", 'wb')
182
29 ft.write(f, d[:80000])
137
30 f = open(dir2 + "ocr_test_labels.ft", 'wb')
182
31 ft.write(f, d[80000:160000])
137
32 f = open(dir2 + "ocr_train_labels.ft", 'wb')
182
33 ft.write(f, d[160000:])
137
34
35 for i in ["train", "valid", "test"]:
36     os.chmod(dir2 + "ocr_" + i + "_data.ft", 0744)
37     os.chmod(dir2 + "ocr_" + i + "_labels.ft", 0744)
38
39
40
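
The script above applies the same three slices to the data file and to the labels file, which keeps example i paired with label i only because both files were shuffled in the same order. The following is a minimal sketch of that positional split, not part of the original script: the helper name `split_shuffled` and the toy lists are hypothetical, and plain Python lists stand in for the arrays returned by `ft.read` (slicing along the first axis behaves the same way on NumPy arrays).

```python
def split_shuffled(seq, n_valid=80000, n_test=80000):
    """Split an already-shuffled sequence into (valid, test, train) by position.

    Hypothetical helper: mirrors the slices used in ocr_divide.py,
    i.e. [:80000], [80000:160000] and [160000:] with the default sizes.
    """
    valid = seq[:n_valid]
    test = seq[n_valid:n_valid + n_test]
    train = seq[n_valid + n_test:]
    return valid, test, train


# Toy stand-ins for the data and labels files (assumed to share one shuffle order).
data = [[2 * i, 2 * i + 1] for i in range(10)]
labels = list(range(10))

valid_x, test_x, train_x = split_shuffled(data, n_valid=3, n_test=3)
valid_y, test_y, train_y = split_shuffled(labels, n_valid=3, n_test=3)

# Each example still lines up with its label after the split.
pairs_aligned = all(x[0] // 2 == y for x, y in zip(train_x, train_y))
```

Passing the same `n_valid` and `n_test` to both calls is what preserves the pairing; if the two files were shuffled independently, no positional split could recover it.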