Teaching Computers to Decipher Old Newsprint—in Gaelic

April 29, 2023

What is Irish type? And why does it make digitization more difficult?
The closest analogy that you might be familiar with would be the old Gothic fonts used in printing in German up until the early 20th century. In the case of Irish, the old sean-chló, as it’s called, or the old type that was used, was derived from the manuscripts that Irish was written in before that, and that in turn had roots in medieval handwriting. There’s some correspondence between the characters in these fonts and what you would see in contemporary Irish, so it’s not as though a completely different system was used. But there are some aspects of the Irish language that don’t correspond directly with a typical Romance language system or English language system, and those sounds are conveyed through characters that aren’t present in typical contemporary Latin-character-based alphabets. And so to do that in print, writers in Irish had to make slight adjustments to Latin-alphabet-based characters to convey the additional letters needed for Irish. It’s not impossible to train a machine to learn this, but it does take a little more work.

So how do you train computers to read this font?
We have to produce the training data, which is where the extra work comes in. So we’ll look at a page and transcribe it. Once we have about 70 pages transcribed, we can integrate that into an optical character recognition machine learning process, and try to gauge its success in reading pages it hasn’t seen before. What typically happens is that about 2 to 5% of the transcribed pages are held back, and at the end of every cycle the computer will try to adjust its knowledge or ability to recognize the characters by checking against our validation set. The bottom line is we have to do those transcriptions to give it what is called a “ground truth” to work from—otherwise it would have no way of knowing if it was getting the answer right. Printed text recognition is one of the oldest machine learning challenges, so this is a typical process that’s been around quite a while. But it has not been done extensively with this particular font for this language compared to, say, the English language.
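The hold-back step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the function names (`split_pages`, `edit_distance`, `character_error_rate`), the 5% default, and the use of character error rate as the success measure are all assumptions made for the example.

```python
import random


def split_pages(pages, holdout_fraction=0.05, seed=0):
    """Shuffle transcribed pages and hold back a small validation set.

    Hypothetical helper: the 2 to 5% range comes from the interview,
    but the exact split procedure is an assumption.
    """
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(pages)
    random.Random(seed).shuffle(shuffled)
    # Always keep at least one page for validation.
    n_val = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def character_error_rate(predicted, ground_truth):
    """Edits needed to turn the OCR output into the human transcription,
    divided by the length of the transcription."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return edit_distance(predicted, ground_truth) / len(ground_truth)


# Illustrative usage with placeholder page identifiers.
pages = [f"page_{n:03d}" for n in range(70)]
train_pages, validation_pages = split_pages(pages)
```

After each training cycle, the model's output for every held-back page would be compared with the human transcription, for example with `character_error_rate`; a falling rate on pages the model has never seen suggests it is learning the typeface rather than memorizing the training pages.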

Who will benefit from being able to access the digital, searchable An Gaodhal files, once the project is complete?
There are definitely a few categories of folks who would benefit. On the one hand, here’s a newspaper that gives us a window into the history of the Irish language, Irish song, or Irish folklore. That’s the media history component for academics: we’re trying to understand the history of certain minority language media. And then linguists might want to look up a certain word in Irish and see all the historical contexts in which it appears, which can be more precise than a modern dictionary. But certainly this could also be useful to the general community who might be looking into genealogy, or even the local history of New York City. If you look up an address in New York, and it happens to be mentioned in this newspaper, you get a sense of what was happening at that location, whether that’s a place where a Celtic society met or the address of a person who sent in a snippet of an Irish language poem. And for genealogy, someone could search for a great-grandparent and see if the name comes up, giving a sense of where they were at a given point in time and their touchpoints with the Irish immigrant community.

Source: New York University
