Optical Character Recognition for the International Phonetic Alphabet

Abstract

Grammar books are increasingly used as additional reference resources, particularly for very low-resource languages. A significant portion of them comes from scans, so their usability depends on the quality of the Optical Character Recognition (OCR) tool. We focus here on a particular script used in linguistics to transcribe sounds: the International Phonetic Alphabet (IPA). We consider two data sources: actual grammar book PDFs for two languages under documentation, Japhug and Kagayanen, and a synthetically generated dataset based on Wiktionary. We compare two neural OCR frameworks, Tesseract and Calamari, and a recent large vision-language model, Qwen2.5-VL-7B, evaluating all three both off the shelf and after fine-tuning. While their zero-shot performance is relatively poor for IPA characters in general due to character set mismatch, fine-tuning on the synthetic dataset leads to notable improvements.


EACL 2026

19th Conference of the European Chapter of the Association for Computational Linguistics. Rabat, Morocco, Mar 24-29, 2026.
A Conference

Authors

S. Okabe • D. Zelo • A. Fraser

Research Area

 B2 | Natural Language Processing

BibTeXKey: OZF26
