Over the past few decades, the Internet has removed a huge barrier between people and massive amounts of information. Every day, more becomes available, to anyone, anywhere, anytime, with the click of a mouse or the touch of a screen. And yet, just as that barrier fades into history, another reveals itself more starkly than ever: language.
It’s great to make information available. But what if it’s not in your language? Who will translate everything, and how could human translators possibly keep up? Without the help of technology, they simply can’t. Yet to be of real help, technology must grapple with how the human mind handles language. Researchers across the world are working in the field of Natural Language Processing (NLP) to help computers learn languages and process them for translation, education and many other purposes.
“This requires a full understanding of human intelligence … how human intelligence works, not just language,” said Kemal Oflazer, a computer science professor at Carnegie Mellon University in Qatar and lead PI on several QNRF-funded research projects focused on NLP. “Understanding and making this technology work is as hard as understanding how human intelligence works.”
Prof. Oflazer and his research colleagues - based in Qatar and the US - are looking specifically at technologies that can greatly enhance translation between Arabic and English, as well as technologies that can improve Arabic NLP.
“Ours is more of an engineering discipline, so we’re building the plane and not getting caught up in how the bird is flying,” he said. “In fact, most likely these two ideas will not talk to each other for a long while. This is not linguistics either. Just telling me something about the language is not telling me how I can make use of that to make my system work properly. Breaking down the flapping of the bird’s wings doesn’t tell me how planes fly.”
Across thousands of languages, the structure, definitions and implications of any given word, phrase or idea can vary greatly. Some languages have not been studied extensively, and Arabic is definitely one of them. Written Arabic, for instance, omits the letters for most vowels, even though those vowels are pronounced in speech. The particularities of every language present unique challenges in the field of natural language processing.
“So basically that’s an idiosyncrasy of Arabic,” Prof. Oflazer explained. “Chinese sentences and even paragraphs are one chunk. They don’t tell you ‘I have a word here I have a word here’ … you have to figure it out as you read. The Japanese have three alphabets - they have the Kanji, the Katakana and the Hiragana - and they use each for different purposes. So every language has its own strange thing—if you want to engineer systems around them, you have to understand this.”
Prof. Oflazer’s team, and NLP researchers worldwide, are working to create tools that are as general as possible, so they can be tailored to work with any language. The systems are engineered using what is called machine learning, in which millions of words are drawn from news services, proceedings and other formal texts, and algorithms extract the patterns in how humans use the language.
“Until 1990 or so, people really tried to build systems by hand, which didn’t work. Now they use machine learning techniques which extract patterns from massive amounts of language data,” Prof. Oflazer said.
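The pattern-extraction idea the professor describes can be illustrated with a deliberately tiny sketch: counting which words follow which in a corpus and turning the counts into probabilities. The three-sentence "corpus" below is invented; a real system would use millions of words of news text.

```python
from collections import Counter

# Toy stand-in for the "massive amounts of language data" the article
# describes. In practice this would be millions of words of formal text.
corpus = (
    "the minister met the president . "
    "the president visited the university . "
    "the minister visited the museum ."
).split()

# Count bigrams: how often each word follows each other word.
bigrams = Counter(zip(corpus, corpus[1:]))

# Turn raw counts into conditional probabilities P(next | current).
totals = Counter(corpus[:-1])
prob = {(w1, w2): c / totals[w1] for (w1, w2), c in bigrams.items()}

# One "learned" pattern: the distribution of words following "the".
following_the = {w2: p for (w1, w2), p in prob.items() if w1 == "the"}
```

Even this trivial model captures a regularity of the data ("the" is most often followed by "minister" or "president") without any hand-written rule, which is the core shift from pre-1990 rule-based systems.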
Arabic and English are both essential to online life in the Middle East.
Some applications of machine learning to Arabic are complicated by the fact that written Arabic uses neither capitalization nor, in most cases, vowel letters. Context then becomes essential to the machine learning process, filling in what is missing or not explicitly available, he said.
“You have to figure out what the extent of the phrase is. And then you have to tell me if this is a person, or an organization, or a political entity, or a geographical entity, and it’s tricky. Because there’s Clinton, PA, and then there’s Clinton the President.”
In the case of English, a great deal of time and resources have gone into analyzing the language, so it is far more tractable. In fact, some NLP research leans on English resources to bootstrap an understanding of Arabic.
“There’s a process called word alignment,” Prof. Oflazer explained. “Take two translations of the same ideas. We can ask, ‘How does the name translate on the Arabic side?’ One example isn’t sufficient, so we have to look at evidence from multiple sentences. We can say ‘this is a name and a country.’ We can use projection to alleviate the lack of resources and the lack of effort. If we had a hundred million words of Arabic annotated like that, we probably could learn it as well as we’ve learned English. But we don’t have that, it’s very expensive. We did a bunch of other similar things in that project to basically bootstrap Arabic capability from English capability.”
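Word alignment can be sketched in miniature. The snippet below scores English/Arabic word pairs by how often they co-occur in parallel sentence pairs, using the Dice coefficient; this is a simplified stand-in for the statistical alignment models real systems use, and the transliterated "Arabic" sentences are invented placeholders.

```python
from collections import Counter
from itertools import product

# Tiny invented parallel corpus (English, transliterated-Arabic placeholder).
pairs = [
    ("the king visited qatar", "zara almalik qatar"),
    ("the king spoke", "takallama almalik"),
    ("qatar hosted the summit", "istadafat qatar alqimma"),
]

co = Counter()        # how often (e, f) occur in the same sentence pair
e_count = Counter()   # how often each English word occurs
f_count = Counter()   # how often each foreign word occurs

for e_sent, f_sent in pairs:
    es, fs = set(e_sent.split()), set(f_sent.split())
    e_count.update(es)
    f_count.update(fs)
    for e, f in product(es, fs):
        co[e, f] += 1

def dice(e, f):
    """Dice coefficient: 2 * joint count / sum of marginals."""
    return 2 * co[e, f] / (e_count[e] + f_count[f])

# The foreign word that aligns best with "king".
best_for_king = max(f_count, key=lambda f: dice("king", f))
```

Because "king" and "almalik" appear together in every sentence where either occurs, their Dice score is 1.0 and the alignment falls out of co-occurrence alone; with millions of sentence pairs, the same evidence-accumulation idea supports the projection of annotations from English onto Arabic.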
The team is involved in six projects on English and Arabic NLP. One of them was the first of its kind to allow the recognition of proper names in texts that are not formal news sources. That research focused on Wikipedia, whose text is less formal than newswire, and in which names are harder to identify because of the variability in style and content.
Other projects involve: building tools that help non-native speakers read English more easily through interaction with the text; identifying errors in Arabic text; and “supersense” tagging, whereby all nouns and verbs are placed into one of 15 supersense categories. Yet just as critical as the computer science and programming progress is the work with humans who can bring more definition to Arabic.
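Supersense tagging can be pictured as attaching a coarse semantic label to every noun and verb. The sketch below uses a tiny hand-made lexicon purely for illustration; the category names echo WordNet-style supersenses, and a real tagger would learn from annotated data rather than a lookup table.

```python
# Invented mini-lexicon of word -> supersense (WordNet-style labels).
SUPERSENSE = {
    "doctor": "noun.person",
    "hospital": "noun.location",
    "walked": "verb.motion",
    "said": "verb.communication",
    "idea": "noun.cognition",
}

def tag(tokens):
    """Attach a supersense where the lexicon knows the word, else 'O'."""
    return [(t, SUPERSENSE.get(t.lower(), "O")) for t in tokens]

tagged = tag("The doctor walked to the hospital".split())
```

The output pairs each token with its coarse category ("doctor" as a person, "walked" as motion), which is far less fine-grained than full word-sense disambiguation but much cheaper to annotate and learn.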
“Another important focus is annotation,” Prof. Oflazer said. “Here we have native speakers of Arabic who sit down on a task in front of a computer screen, there’s a text and we ask him or her to tell us which words are names, what the subject of the sentence is, what the verb is, etc. Machine learning systems rely on these annotations to achieve more accurate learning, so the more annotation the better. There’s no data like more data but this is expensive … you’re training people, making sure that they agree—annotation is the crucial component here … for the whole field, not just Arabic.”
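"Making sure that they agree" is itself quantified in annotation work, commonly with Cohen's kappa: observed agreement between two annotators, corrected for the agreement expected by chance. The label sequences below are invented for illustration.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    # Fraction of items the two annotators labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected if both labeled at random from their own
    # label distributions.
    ca, cb = Counter(a), Counter(b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

ann1 = ["NAME", "NAME", "OTHER", "NAME", "OTHER", "OTHER"]
ann2 = ["NAME", "OTHER", "OTHER", "NAME", "OTHER", "OTHER"]
```

Here the annotators agree on 5 of 6 items, but since chance alone would produce 50% agreement on these label distributions, kappa comes out to about 0.67 rather than 0.83; that correction is why kappa, not raw agreement, is the standard report in annotation projects.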
In the end, Oflazer says that none of this work would have been possible without QNRF, and that its impact will extend to the translation of other languages and to making information available to people around the world. His team is grateful for the support.
“If this kind of research is going to be done somewhere, it has to be in this region,” he said. “Until now, research in Arabic was mainly done in the US for many reasons … but if this is going to help the citizens of this part of the world, it has to be done here.
“QNRF have made this possible … we have gained significant exposure in the world as a source of publications and contributions to the field … we made the annotated data produced here available to the community at large, and this is very precious data. And if researchers use this and improve upon us, using some clever technique, it’s good; they’ll just advance the state of the art.”