Abstract
Before portable electronics became widely available, researching endangered languages required recording voices and writing lexicon cards. A lexicon card describes a word and gives phonetic symbols depicting its pronunciation. In a typical study, multiple researchers with different handwriting styles may produce these cards. The variety of writing styles and the assortment of symbols used often make optical character recognition difficult. This research addresses these data capture challenges with a multi-phase process for accurately digitizing handwritten lexicon cards. First, lexicon cards were scanned into images and submitted to Google Cloud Vision for processing. Google Cloud Vision returned the recognized characters along with bounding boxes giving the physical locations of all text on the cards. Next, deep learning was employed to decode the phonetic symbols. These symbols were extracted manually using the bounding boxes provided by Google. A convolutional neural network-based application then processed the extracted images and stored the most promising prediction of the symbol matching each image. Because the automated processes up to this point were not 100 % accurate, a further step involved human review and editing. Manual editing is widely regarded as tedious and error-prone, so on its own it did not meet the accuracy goals either. An essential final step was therefore creating a game to encourage review of the digital results. The game not only encourages an additional review but also provides practice and training to linguists studying the language. Through this process, the digitization of lexicon cards reached 100 % accuracy. This new approach can significantly help revitalize dormant language studies.
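
The bounding-box step can be illustrated with a minimal sketch. It assumes that each box arrives as a list of four vertices with integer `x` and `y` pixel coordinates (the general shape of a Cloud Vision bounding polygon) and that the scanned card is held as a row-major grid of pixel values. The names `to_axis_aligned_box` and `crop_region` are hypothetical and are not taken from the study, which performed this extraction manually.

```python
from typing import Dict, List, Tuple

Vertex = Dict[str, int]   # e.g. {"x": 12, "y": 40}
Image = List[List[int]]   # row-major grid of pixel values


def to_axis_aligned_box(vertices: List[Vertex]) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) enclosing all vertices.

    Assumption: a missing coordinate means 0, since JSON responses
    may omit zero-valued fields.
    """
    xs = [v.get("x", 0) for v in vertices]
    ys = [v.get("y", 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def crop_region(image: Image, vertices: List[Vertex]) -> Image:
    """Cut the sub-image covered by a bounding polygon.

    The right and bottom edges are treated as exclusive, and the box
    is clamped to the image so out-of-range vertices cannot fail.
    """
    left, top, right, bottom = to_axis_aligned_box(vertices)
    height = len(image)
    width = len(image[0]) if image else 0
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, width), min(bottom, height)
    return [row[left:right] for row in image[top:bottom]]
```

Each cropped region could then be handed to a symbol classifier such as the convolutional network described above. Taking the enclosing axis-aligned rectangle is a simplification: a slightly rotated polygon yields a crop that includes some surrounding background.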