Mercurial > ift6266
annotate scripts/ocr_divide.py @ 335:5ddb1878dfbc
noisyness -> noise
author | Arnaud Bergeron <abergeron@gmail.com> |
---|---|
date | Thu, 15 Apr 2010 12:53:03 -0400 |
parents | 2b6a28e4cadc |
children |
rev | changeset | author | description | annotated lines |
---|---|---|---|---|
137 | 728e232eaf45 | boulanni <nicolas_boulanger@hotmail.com> | Added script to separate OCR data in train, validation and test sets (raw data) | all lines not listed under rev 182 |
182 | 2b6a28e4cadc | boulanni <nicolas_boulanger@hotmail.com> | Re-split the pure NIST/OCR data to get test and validation sets of 80000 rather than 20000, as discussed in class (parent: 137) | the valid/test/train slice lines of the docstring and the six `ft.write` calls |

#!/usr/bin/env python

'''
Creates the OCR train, valid and test sets.
The valid set is trainorig[:80000]
The test set is trainorig[80000:160000]
The train set is trainorig[160000:]
trainorig is already shuffled
'''

from pylearn.io import filetensor as ft
import numpy, os

dir1 = '/data/lisa/data/ocr_breuel/filetensor/'
dir2 = "/data/lisa/data/ift6266h10/"

# Images: first 80000 rows -> valid, next 80000 -> test, remainder -> train
f = open(dir1 + 'unlv-corrected-2010-02-01-shuffled.ft', 'rb')
d = ft.read(f)
f.close()
f = open(dir2 + "ocr_valid_data.ft", 'wb')
ft.write(f, d[:80000])
f.close()
f = open(dir2 + "ocr_test_data.ft", 'wb')
ft.write(f, d[80000:160000])
f.close()
f = open(dir2 + "ocr_train_data.ft", 'wb')
ft.write(f, d[160000:])
f.close()

# Labels: same boundaries as the images, so each image keeps its label
f = open(dir1 + 'unlv-corrected-2010-02-01-labels-shuffled.ft', 'rb')
d = ft.read(f)
f.close()
f = open(dir2 + "ocr_valid_labels.ft", 'wb')
ft.write(f, d[:80000])
f.close()
f = open(dir2 + "ocr_test_labels.ft", 'wb')
ft.write(f, d[80000:160000])
f.close()
f = open(dir2 + "ocr_train_labels.ft", 'wb')
ft.write(f, d[160000:])
f.close()

# Set permissions to rwxr--r-- on every generated file.
# 0o744 is the octal literal form accepted by Python 2.6+ and Python 3.
for i in ["train", "valid", "test"]:
    os.chmod(dir2 + "ocr_" + i + "_data.ft", 0o744)
    os.chmod(dir2 + "ocr_" + i + "_labels.ft", 0o744)
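
The slicing scheme described in the script's docstring can be tried out without pylearn or the original data files. The sketch below is only an illustration under stated assumptions: `split_valid_test_train` is a hypothetical helper that does not exist in the repository, and small Python lists stand in for the image and label tensors. It shows why applying the same boundaries to both arrays keeps every example paired with its label.

```python
def split_valid_test_train(rows, n_valid, n_test):
    """Slice an already-shuffled sequence into (valid, test, train) parts.

    Hypothetical helper, not part of the original script. The first
    n_valid rows go to valid, the next n_test rows to test, and the
    remainder to train, mirroring the [:80000], [80000:160000] and
    [160000:] slices of the script.
    """
    valid = rows[:n_valid]
    test = rows[n_valid:n_valid + n_test]
    train = rows[n_valid + n_test:]
    return valid, test, train


# Toy stand-ins (assumption): 10 examples, where row i is [2*i, 2*i + 1]
# and its label is i.
data = [[2 * i, 2 * i + 1] for i in range(10)]
labels = list(range(10))

# Use identical boundaries for data and labels so that the pairs stay aligned.
valid_x, test_x, train_x = split_valid_test_train(data, 3, 3)
valid_y, test_y, train_y = split_valid_test_train(labels, 3, 3)

print(len(valid_x), len(test_x), len(train_x))  # prints: 3 3 4
```

Because the input is assumed to be shuffled beforehand, taking contiguous slices is enough; no further randomisation is done here.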