Jan. 25, 2024


有本寛 (一橋大学経済研究所)

Textizing Statistical Tables using OCR at Scale

Yutaka Arimoto (Institute of Economic Research, Hitotsubashi University)



This study describes the requirements and methods for textizing statistical tables using OCR (optical character recognition) at scale. A major challenge of textizing statistical tables using OCR is analyzing the table layout with high accuracy. I develop a Python toolkit, ocrstats, which supports the task by providing batch processing, automation of routine processes, use of external OCR, and table layout analysis with practical accuracy. I also explain practical tips learnt from textizing the Japan Imperial Statistical Yearbook using ocrstats.

Bibliographic information

Vol. 73, No. 1, 2022, pp. 15-28
HERMES-IR (Hitotsubashi University Institutional Repository): https://hdl.handle.net/10086/72558
JEL Classification Codes: Y1, N01