The contents of billions of ancient texts written in a now-obsolete Japanese script have long puzzled researchers struggling to decode the secrets they might hold.
Known as Kuzushiji, the ancient cursive script was used from the 8th century to the start of the 20th, but less than 0.01 per cent of the world’s population can currently read it.
Only a fraction of Kuzushiji texts have been converted to modern Kanji characters, and transcribing the rest by hand would take hundreds of years.
However, Tarin Clanuwat, a project researcher at Japan’s ROIS-DS Centre for Open Data in the Humanities, is on the brink of a breakthrough.
She is currently working on the development of a deep learning optical character recognition system to transcribe Kuzushiji writing into searchable Kanji characters.
“Everything we know about the Japanese culture and history from literature has been done by hand,” she told nine.com.au.
“People just refer to what other people did before them and that’s how research has been done.
“If we can transcribe [the texts] and create [a] search engine that can find a specific word, we will know what is in other texts more quickly.”
Montreal Institute for Learning Algorithms research scientist Alex Lamb has been helping with the project. He said that, with Google’s AI technical support and a labelled dataset of 17th to 19th century books from the National Institute of Japanese Literature, the machine learning program can now decode more than 4000 different characters.
“Whenever you start a machine learning program you want to start with something really simple so you can make sure you are doing the right thing and not something grossly difficult,” he told nine.com.au.
“In our first iteration, we literally just detected the one character and it gradually moved up to 400 characters and then 4000. It’s constantly improving and there’s still a lot of room for progress.”
Ms Clanuwat said the model can decipher one page of text in two seconds, with an average accuracy of 85 per cent.
“The problem with the Japanese language is there are so many different characters out there, we can’t find enough samples for the machine to learn [enough for 100 per cent accuracy],” she said.
“One thing that has impressed me is the model can distinguish if something is a character or an image – illustrations look close to characters and I’m surprised it’s smart enough to skip it.
“I think with more collaborations between humans and machines, it’s feasible [that] accuracy will go much higher.”