Teaching Computers to Decipher Old Newsprint—in Gaelic

April 29, 2023

What is Irish type? And why does it make digitization more difficult?
The closest analogy that you might be familiar with would be the old Gothic fonts used in printing in German up until the early 20th century. In the case of Irish, the old sean-chló, as it’s called, or the old type that was used, was derived from the manuscripts that Irish was written in before that, and that in turn had roots in medieval handwriting. There’s some correspondence between the characters in these fonts and what you would see in contemporary Irish, so it’s not as though a completely different system was used. But there are some aspects of the Irish language that don’t correspond directly with a typical Romance language system or English language system, and those sounds are conveyed through characters that aren’t present in typical contemporary Latin-based alphabets. And so to do that in print, writers in Irish had to make slight adjustments to Latin-alphabet characters to convey the additional letters needed for Irish. It’s not impossible to train a machine to learn this, but it does take a little more work.
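
As one illustration of the "slight adjustments" described above: the interview does not name specific characters, but a well-known feature of Irish type is a dot placed over certain consonants, which modern Irish spelling writes as the consonant followed by "h". The Python sketch below is only an illustration of that one convention. The function name `modernize` and the consonant list are assumptions made for this example, and the project's actual handling of such characters is not described in the article.

```python
import unicodedata

# Assumption for this sketch: only the overdot convention is handled.
# Other features of Irish type are ignored.
DOT_ABOVE = "\u0307"  # Unicode COMBINING DOT ABOVE
LENITABLE = set("bcdfgmpstBCDFGMPST")  # consonants that can carry the dot


def modernize(text: str) -> str:
    """Rewrite dotted consonants (e.g. d with a dot above) as consonant + 'h'."""
    # NFD splits a precomposed dotted letter into base letter + combining dot.
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        if ch == DOT_ABOVE and out and out[-1] in LENITABLE:
            out.append("h")  # replace the dot with a following 'h'
        else:
            out.append(ch)
    # NFC recombines the remaining accents, such as the acute on vowels.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    # "Gao" + dotted d + "al", written with an escape so the source stays ASCII
    print(modernize("Gao\u1e0bal"))
```

A recognition model has to treat the dotted and undotted forms as distinct characters, which is part of why training material made for modern print does not transfer directly.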

So how do you train computers to read this font?
We have to produce the training data, which is where the extra work comes in. So we’ll look at a page and transcribe it. Once we have about 70 pages transcribed, we can integrate that into an optical character recognition machine learning process, and try to gauge its success in reading pages it hasn’t seen before. What typically happens is that about 2 to 5% of the transcribed pages are held back, and at the end of every cycle the computer will try to adjust its knowledge or ability to recognize the characters by checking against our validation set. The bottom line is we have to do those transcriptions to give it what is called a “ground truth” to work from—otherwise it would have no way of knowing if it was getting the answer right. Printed text recognition is one of the oldest machine learning challenges, so this is a typical process that’s been around quite a while. But it has not been done extensively with this particular font for this language compared to, say, the English language.
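
The hold-back step described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the function names, the 5% default, and the choice of character error rate (edit distance divided by the length of the ground truth) as the success measure are all illustrative choices that the interview does not specify.

```python
import random


def split_pages(pages, holdout_fraction=0.05, seed=0):
    """Hold back a small share of transcribed pages as a validation set."""
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_val = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # (training, validation)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def character_error_rate(predicted: str, ground_truth: str) -> float:
    """Share of ground-truth characters the model's output got wrong."""
    if not ground_truth:
        return 0.0 if not predicted else 1.0
    return edit_distance(predicted, ground_truth) / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical page identifiers standing in for 70 transcribed pages.
    pages = [f"page_{n:03d}" for n in range(70)]
    train, val = split_pages(pages)
    print(len(train), "training pages,", len(val), "validation pages")
```

After each training cycle, the model's output for every held-back page would be compared with the human transcription using a measure like `character_error_rate`. Without those transcriptions there is nothing to compare against, which is why the ground truth has to be produced first.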

Who will benefit from being able to access the digital, searchable An Gaodhal files, once the project is complete?
There are definitely a few categories of folks who would benefit. On the one hand, here’s a newspaper that gives us a window into the history of the Irish language, Irish song, or Irish folklore. That’s the media history component for academics: we’re trying to understand the history of certain minority language media. And then linguists might want to look up a certain word in Irish and see all the historical contexts in which it appears, which can be more precise than a modern dictionary. But certainly this could also be useful to the general community who might be looking into genealogy, or even the local history of New York City. If you look up an address in New York, and it happens to be mentioned in this newspaper, you get a sense of what was happening at that location, whether that’s a place where a Celtic society met or the address of a person who sent in a snippet of an Irish language poem. And for genealogy, someone could search for a great-grandparent and see if the name comes up, giving a sense of where they were at a given point in time and their touchpoints with the Irish immigrant community.
