ENGLISH TO LUGANDA MACHINE
TRANSLATION
FINAL TERM PAPER
KISEJJERE RASHID
BACHELOR OF SCIENCE IN SOFTWARE ENGINEERING
MAKERERE UNIVERSITY
COURSE INSTRUCTOR: MR. GALIWANGO MARVIN
KAMPALA, UGANDA
rashidkisejjere0784@gmail.com
Abstract—English to Luganda translation using different machine learning classification models and deep learning. There are many machine translation systems right now, but there isn't any public English to Luganda translation paper showing how the translation process occurs. Luganda is a very common language and the mother tongue of Uganda. It has a very big vocabulary, which means that working on it requires a very big dataset. In this paper, I go through different machine learning models that can be implemented to translate a given English text to Luganda. Because Luganda is a new field in translation, I experiment with multiple machine learning classification models such as SVMs and logistic regression, and with deep learning models such as RNNs and LSTMs, also incorporating the advanced mechanisms of attention and Transformers plus the additional technique of transfer learning.
Index Terms—Artificial Intelligence, machine translation, classification task, RNNs, transfer learning, LSTMs, attention
I. INTRODUCTION
Machine translation is a field that was and still is under active research, and so far researchers have come up with multiple machine translation approaches. These techniques mainly include Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT). A detailed explanation of these approaches is given in the next chapter. Machine translation is one of the major subcategories of NLP, as it involves a proper understanding of two different languages. This is always challenging because languages tend to have very large vocabularies, so a lot of computing resources are needed for a machine translation system to come out as accurately as possible. Also, the data used in the process has to be very accurate, and this in turn affects the accuracy of these models, so coming up with a very accurate model is very tricky. Throughout this paper, I explain how I came up with a couple of translation models using different machine learning strategies.
II. BACKGROUND AND MOTIVATION
The background of machine translation comes majorly from three translation processes, i.e. Rule-Based Machine Translation (RBMT), Statistical Machine Translation (SMT), and Neural Machine Translation (NMT).
A. Rule-Based Machine Translation (RBMT)
Rule-based machine translation (RBMT), as the name states, is mainly about researchers coming up with different rules that a text in a given language can follow to produce its respective translation. It is the oldest machine translation technique and was used in the 1970s.
B. Statistical Machine Translation (SMT)
Statistical Machine Translation (SMT) is an old translation technique that uses a statistical model to create a representation of the relationships between sentences, words, and phrases in a given text. This representation is then applied to a second language to convert these elements into the new language. One of the main advantages of this technique is that it can improve on rule-based MT, while still sharing some of the same problems.
C. Neural Machine Translation(NMT)
Neural network translation was developed using deep learning techniques. This method is faster and more accurate than
other methods. Neural MT is rapidly becoming the standard
in MT engine development.
The translation of one language into another is a machine classification problem. This type of problem can predict only values within a known domain. The domain in this case could be either the set of characters that make up the vocabulary of a given language or the set of words in that vocabulary. This shows that the classification model could be either word-based or character-based; I elaborate more on this issue in the later sections, and a small example follows below.
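The sketch below contrasts the two choices on a toy corpus; the sentences are illustrative and not taken from any dataset.

```python
# A minimal sketch contrasting word-based and character-based
# vocabularies (the prediction domains) for a toy corpus.
english = ["how are you", "i am fine"]

# Word-based: the model predicts indices over the set of words.
word_vocab = sorted({word for sent in english for word in sent.split()})
print(word_vocab)  # ['am', 'are', 'fine', 'how', 'i', 'you']

# Character-based: the domain is the much smaller set of characters.
char_vocab = sorted({ch for sent in english for ch in sent})
print(char_vocab)  # [' ', 'a', 'e', 'f', 'h', 'i', 'm', 'n', 'o', 'r', 'u', 'w', 'y']
```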
III. LITERATURE REVIEW
Translation is a crucial aspect of communication for
individuals who speak different languages. With the advent
of Artificial Intelligence (AI), translation has become more
efficient and accurate, making it possible to communicate
with individuals in other languages in real-time. There are
basically two major learning techniques that can be used:
Supervised learning is a type of machine learning where the
model is trained on a labeled dataset and makes predictions
based on the input data and the labeled output. Supervised
learning algorithms have been used to train AI-powered
English to Luganda translation systems. The model is trained
on a large corpus of bilingual text data, which helps it learn
the relationships between English and Luganda words and
phrases. This allows the model to make predictions about the
Luganda translation of an English text based on the input
data. This is the most common type of machine learning, and it involves the well-known deep neural networks.
Unsupervised learning is a type of machine learning
where the model is not trained on labeled data but instead
learns from the input data. Unsupervised learning algorithms
can also be used to develop AI-powered English to
Luganda translation systems. The model uses techniques
such as clustering and dimensionality reduction to learn
the relationships between English and Luganda words and
phrases. This allows the model to make predictions about the
Luganda translation of an English text based on the input data.
In conclusion, AI-powered English to Luganda translation
has the potential to greatly improve the speed and accuracy of
translations.
IV. RESEARCH GAPS
Below are some of the major research gaps in the field of
machine translation.
• Limited Training Data: The quality of AI-powered translations is heavily dependent on the amount and quality of training data used to train the model. Further research is needed to explore methods for obtaining high-quality training data.
• Lack of Cultural Sensitivity: AI-powered translation systems can produce translations that are grammatically correct but lack the cultural sensitivity of human translations. This can result in translations that are culturally inappropriate or that do not accurately convey the original message.
• Vulnerability to Errors: AI can only understand what it has been trained on, so in cases where the input is not similar to the data it was trained on, AI can easily produce undesired results.
V. CONTRIBUTIONS OF THIS PAPER
One of the major aims of this paper is to lay a foundation for further and much more detailed research into the translation of large-vocabulary languages like Luganda, by showing the different machine learning techniques that can be used to achieve this.
VI. METHODOLOGY
The problem being investigated in this project is to develop
an AI-powered English to Luganda translation system. The
significance of this problem lies in the growing demand for
high-quality and culturally sensitive translations, particularly
in the field of commerce and communication between English
and Luganda-speaking communities.
The scope of the project is to develop an AI system that is
capable of accurately translating English text into Luganda
text, while also preserving the meaning and cultural context
of the original text.
To address this problem, the proposed AI approach is to
develop a neural machine translation (NMT) model. The
NMT model will be trained on the English and Luganda
parallel corpus dataset, and will use this data to learn the
relationship between the two languages. The AI process can
be summarized as follows:
Data Collection: Collect a large corpus of parallel text data
in English and Luganda.
Pre-processing: Pre-process the data to remove irrelevant
information and standardize the text.
Model Selection: Choose the neural machine translation
model that is best suited for the problem.
Model Training: Train the NMT model on the pre-processed
data.
Model Evaluation: Evaluate the trained model on a held-out
set of data to determine its performance.
Deployment: Deploy the trained model for use in a
real-world setting.
Continuous Improvement: Continuously evaluate the performance of the model and make improvements as needed. A rough skeleton of these steps is sketched below.
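As an illustration of these steps, here is a minimal Python skeleton; the file name parallel_corpus.csv and its English/Luganda columns are hypothetical placeholders, not the project's actual layout.

```python
# A minimal skeleton of the AI process above. The file name
# "parallel_corpus.csv" and its columns are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

# Data Collection: load a parallel English-Luganda corpus.
data = pd.read_csv("parallel_corpus.csv")  # columns: English, Luganda

# Pre-processing: standardize the text and drop unusable rows.
data["English"] = data["English"].str.lower().str.strip()
data["Luganda"] = data["Luganda"].str.lower().str.strip()
data = data.dropna()

# Model Evaluation: hold out part of the data to measure performance.
train, held_out = train_test_split(data, test_size=0.2, random_state=42)
print(len(train), "training pairs,", len(held_out), "held-out pairs")

# Model Selection/Training/Deployment happen downstream: train an NMT
# model on `train`, evaluate it on `held_out`, deploy it, and keep
# monitoring its accuracy for continuous improvement.
```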
The AI evaluation framework used in this project is mainly the accuracy metric. This is a measure of how well the model is able to translate a given text correctly.
In conclusion, the proposed AI approach for this project is to
develop a neural machine translation model that can accurately
translate English text into Luganda text while preserving the
meaning and cultural context of the original text.
VII. DATASET DESCRIPTION
The dataset [15] I used was created by Makerere University and it contains approximately 15k English sentences with their respective Luganda translations. Below are the factors for considering this dataset:
• Scarcity of Luganda datasets. Luganda isn't a famous language worldwide and it is mainly used in Uganda, so the only major dataset I could find was this one.
• Cost. The dataset is available for free for anyone to use and edit.
• Accuracy. The accuracy of the dataset isn't bad at all, so it is the best option to use.
• Size. The dataset is relatively large and diverse enough to create a very good model from.
VIII. DATA PREPARATION AND EXPLORATORY
DATA ANALYSIS.
A. DATA PREPARATION
Data preparation refers to the steps taken to prepare raw
data into improved data which can be used to train a machine
learning model. The data preparation process for my model
was as follows;
• Removal of any punctuation plus any unnecessary spaces. This is necessary to prevent the model from training on a large amount of unnecessary data.
• Converting the case of words in the dataset to lowercase. Since Python is case-sensitive, a word like “Hello” is different from “hello”. To avoid this dilemma I had to change the case.
• Vectorization of the dataset. Vectorization is the process of converting a given text into numerical indices. This is necessary because the machine learning pipeline can only be trained on numerical data.
• Removal of null values. Here all the rows that had null
data had to be dropped because for textual data it is very
difficult to estimate the value in the null spot.
Those were the data preparation processes I used in the model creation process; a minimal sketch of the pipeline follows below.
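The sketch below walks through these preparation steps; the toy two-row DataFrame and its column names are illustrative assumptions, not the actual corpus.

```python
# A minimal sketch of the data preparation steps above; the toy
# DataFrame and its column names are illustrative assumptions.
import re
import pandas as pd
import tensorflow as tf

data = pd.DataFrame({"English": ["Hello, how are you?", None],
                     "Luganda": ["Gyebale, oli otya?", None]})

# Removal of null values: drop rows where either side is missing.
data = data.dropna()

def clean(text):
    # Lowercase, strip punctuation, and collapse unnecessary spaces.
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

data["English"] = data["English"].map(clean)
data["Luganda"] = data["Luganda"].map(clean)

# Vectorization: map each word to a numerical index, padding every
# sentence to a fixed length chosen from the sentence-length plots.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=10000,
                                               output_sequence_length=20)
vectorizer.adapt(data["English"].tolist())
print(vectorizer(tf.constant(data["English"].tolist())))
```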
B. DATA ANALYSIS
Exploratory data analysis is the process of performing initial investigations on data to discover anomalies and patterns. It is mainly carried out through the visualization of the data. Below are the visualizations and their meanings:
1) Word Cloud: A word cloud is a graphical representation of the words that are used most frequently in the dataset. This is important as it shows which particular words the model will highly depend on. [Figures: word clouds for the English sentences and their respective Luganda translations]
2) Correlation matrix: This is a matrix showing the correlation of the different values with each other. Plotting a 2D correlation matrix for the entire dataset is almost impossible, but what is possible is the plot for a particular sentence. Here the model will have to pay a lot of attention to the words that are highly correlated with each other. [Figure: correlation matrix for one of the sentences in the dataset]
3) Sentence Length plots: Through these plots, we are able to determine the length to which all sentences in the dataset should be padded, because during the training process they are all supposed to be of the same length. [Figures: maximum sentence lengths for the English and the Luganda sentences respectively]
4) Box Plot: A box plot is a visual representation that can be used to show the major outliers in the dataset. Plotting a box plot for the entire dataset is also almost impossible, but what is possible is plotting one for a particular sentence. This shows the possible outliers in the sentence, so that during training the model ends up not paying a lot of attention to those particular words. [Figure: box plot for one of the sentences in the dataset]
A sketch of how the sentence-length analysis can be produced follows below.
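```python
# A small sketch of the sentence-length analysis used to choose the
# padding length; the DataFrame and column name are assumptions.
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({"English": ["hello how are you", "i am fine"]})

# Sentence lengths in words; the maximum guides the padding length.
lengths = data["English"].str.split().map(len)
print("max English sentence length:", lengths.max())

lengths.hist(bins=30)
plt.xlabel("words per sentence")
plt.ylabel("number of sentences")
plt.title("English sentence lengths")
plt.show()
```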
In conclusion, data preparation and exploratory data analysis are key steps in the creation of a very accurate model.
IX. AI MODEL SELECTION AND OPTIMIZATION
Throughout the project, I created three models: one with recurrent neural networks, another with the attention mechanism, and finally one using transfer learning on a pre-trained Hugging Face transformer model.
• The recurrent neural network model was a simple model that uses RNNs to translate the text. Its accuracy was very bad because the vocabulary of the two languages is very big; these types of RNNs are best for simple vocabularies.
• The attention mechanism model. This happened to be much better compared to the RNN model. Attention is a mechanism used in deep neural networks where the model can focus on only the important parts of a given text by assigning them more weight.
• The other model I created used transformers. Transformers are also deep learning models that are built on top of attention layers. This makes them much more efficient when it comes to NLP tasks.
A sketch of the attention-based approach is given below.
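Below is a minimal Keras sketch of a sequence-to-sequence model with an attention layer, in the spirit of the second model; the vocabulary sizes and dimensions are illustrative assumptions, not the exact configuration used in the project.

```python
# A minimal sketch of a sequence-to-sequence translator with an
# attention layer; vocabulary sizes and dimensions are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

src_vocab, tgt_vocab, units = 8000, 12000, 256

# Encoder: embed the English tokens and run them through an LSTM.
enc_in = layers.Input(shape=(None,), name="english_tokens")
enc_emb = layers.Embedding(src_vocab, units, mask_zero=True)(enc_in)
enc_out, state_h, state_c = layers.LSTM(units, return_sequences=True,
                                        return_state=True)(enc_emb)

# Decoder: embed the Luganda tokens (teacher forcing) and attend
# over the encoder outputs at every decoding step.
dec_in = layers.Input(shape=(None,), name="luganda_tokens")
dec_emb = layers.Embedding(tgt_vocab, units, mask_zero=True)(dec_in)
dec_out = layers.LSTM(units, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
context = layers.Attention()([dec_out, enc_out])  # [query, value]
merged = layers.Concatenate()([dec_out, context])
logits = layers.Dense(tgt_vocab, activation="softmax")(merged)

model = Model([enc_in, dec_in], logits)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```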
X. ACCOUNTABILITY
In the context of AI, “accountability” refers to the expectation that organizations or individuals will ensure the proper functioning, throughout their lifecycle, of the AI systems that they design, develop, operate, or deploy, in accordance with their roles and applicable regulatory frameworks, and will demonstrate this through their actions and decision-making processes (for example, by providing documentation on key decisions throughout the AI system lifecycle, or by conducting or allowing auditing where justified).
AI accountability is very important because it is a means of safeguarding against unintended uses. Most AI systems are designed for a specific use case; using them for a different use case would produce incorrect results. I also apply accountability to my model: since the model mainly depends on the dataset, it is best to make sure that the quality of the dataset is constantly improved and filtered, because any slight modification in the spelling of the words will decrease the model's accuracy.
XI. RESULTS AND DISCUSSION
I split the data into a training set and a validation set; the results are presented below. The training accuracy is about 92%.
A. Validation and Accuracy plot
[Figure: training and validation accuracy curves]
It is clear that the model is overfitting the dataset, but its accuracy is still fairly good.
B. Attention plot
An attention plot is a figure showing how the model was able to predict the given output. Words that were predicted with a very high probability are coloured more strongly.
[Figure: attention plot for a sample translation]
XII. CONCLUSION AND FUTURE WORKS
I hope this paper will give a basic understanding of the
different machine learning methods that can be used to create
a deep learning model capable of translating a given English
text into Luganda. The same idea can be used to translate
different languages.
The model currently overfits the dataset. One way to overcome this is to increase the size of the data, because the dataset contains only about 15k sentences. For the model to become much more accurate, increasing the dataset to about a million sentences would tremendously improve its accuracy.
Another direction is the usage of other machine learning techniques like transformers. The model illustrated above was based on the attention mechanism of neural networks; using transformers would improve the quality of the model even more. Though transformers are usually complicated to train from scratch, fine-tuning an already trained model is what I would recommend instead; this is called transfer learning, and a rough sketch is given below.
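The snippet below fine-tunes a pre-trained Hugging Face translation model on a toy parallel pair; the checkpoint name and hyperparameters are assumptions, not the project's exact setup.

```python
# A sketch of transfer learning: fine-tuning a pre-trained translation
# model. The checkpoint name and hyperparameters are assumptions.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)
from datasets import Dataset

checkpoint = "Helsinki-NLP/opus-mt-en-lg"  # assumed English-to-Luganda checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Toy parallel data standing in for the 15k-sentence corpus.
pairs = Dataset.from_dict({"en": ["How are you?"],
                           "lg": ["Oli otya?"]})

def tokenize(batch):
    # Tokenize the English source and the Luganda target of each pair.
    enc = tokenizer(batch["en"], truncation=True)
    enc["labels"] = tokenizer(text_target=batch["lg"],
                              truncation=True)["input_ids"]
    return enc

train = pairs.map(tokenize, batched=True, remove_columns=["en", "lg"])

args = Seq2SeqTrainingArguments(output_dir="en-lg-finetuned",
                                num_train_epochs=3,
                                per_device_train_batch_size=16)
Seq2SeqTrainer(model=model, args=args, train_dataset=train,
               data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
               tokenizer=tokenizer).train()
```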
A. Dataset and Python source code
LINK to the final Python source code: https://colab.research.google.com/drive/1NsRAxdftGIzqzeIMw3NFY9xClLk49f
LINK to the used dataset: https://zenodo.org/record/5855017
LINK to the YouTube video: https://youtu.be/RLXfM0iLQag
XIII. ACKNOWLEDGMENT
Special thanks to Mr. Ggaliwango Marvin for his never-ending support towards my research on this project. I also want to appreciate Dr. Rose Nakibuule for providing the foundational knowledge needed for this project. [4]
REFERENCES
[1] M. Singh, R. Kumar, and I. Chana, "Neural-Based Machine Translation System Outperforming Statistical Phrase-Based Machine Translation for Low-Resource Languages," 2019 Twelfth International Conference on Contemporary Computing (IC3), 2019, pp. 1-7, doi: 10.1109/IC3.2019.8844915; V. Bakarola and J. Nasriwala, "Attention based Neural Machine Translation with Sequence to Sequence Learning on Low Resourced Indic Languages," 2021 2nd International Conference on Advances in Computing, Communication, Embedded and Secure Systems (ACCESS), 2021, pp. 178-182, doi: 10.1109/ACCESS51619.2021.9563317.
[2] Academy, E. (2022) How to Write a Research Hypothesis — Enago Academy. Available at: https://www.enago.com/academy/how-to-develop-a-good-research-hypothesis/ (Accessed: 17 November 2022); What is the project scope? (2022). Available at: https://www.techtarget.com/searchcio/definition/project-scope (Accessed: 17 November 2022).
[3] Machine translation – Wikipedia (2022). Available at: https://en.wikipedia.org/wiki/Machine_translation (Accessed: 17 November 2022).
[4] K. Chen et al., ”Towards More Diverse Input Representation for
Neural Machine Translation,” in IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 28, pp. 1586-1597, 2020, doi:
10.1109/TASLP.2020.2996077.
[5] O. Mekpiroon, P. Tammarattananont, N. Apitiwongmanit, N. Buasroung, T. Charoenporn and T. Supnithi, "Integrating Translation Feature Using Machine Translation in Open Source LMS," 2009 Ninth IEEE International Conference on Advanced Learning Technologies, 2009, pp. 403-404, doi: 10.1109/ICALT.2009.136.
[6] J. -W. Hung, J. -R. Lin and L. -Y. Zhuang, ”The Evaluation Study of
the Deep Learning Model Transformer in Speech Translation,” 2021 7th
International Conference on Applied System Innovation (ICASI), 2021,
pp. 30-33, doi: 10.1109/ICASI52993.2021.9568450.
[7] V. Alves, J. Ribeiro, P. Faria and L. Romero, ”Neural Machine Translation Approach in Automatic Translations between Portuguese Language
and Portuguese Sign Language Glosses,” 2022 17th Iberian Conference
on Information Systems and Technologies (CISTI), 2022, pp. 1-7, doi:
10.23919/CISTI54924.2022.9820212.
[8] Machine Translation – Towards Data Science. (2022). Retrieved 24 November 2022, from https://towardsdatascience.com/tagged/machine-translation
[9] H. Sun, R. Wang, K. Chen, M. Utiyama, E. Sumita and T. Zhao, ”Unsupervised Neural Machine Translation With Cross-Lingual Language
Representation Agreement,” in IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 28, pp. 1170-1182, 2020, doi:
10.1109/TASLP.2020.2982282.
[10] Y. Wu, ”A Chinese-English Machine Translation Model Based on
Deep Neural Network,” 2020 International Conference on Intelligent
Transportation, Big Data and Smart City (ICITBS), 2020, pp. 828-831,
doi: 10.1109/ICITBS49701.2020.00182.
[11] L. Wang, ”Adaptability of English Literature Translation from the
Perspective of Machine Learning Linguistics,” 2020 International Conference on Computers, Information Processing and Advanced Education
(CIPAE), 2020, pp. 130-133, doi: 10.1109/CIPAE51077.2020.00042.
[12] S. P. Singh, H. Darbari, A. Kumar, S. Jain and A. Lohan, ”Overview of
Neural Machine Translation for English-Hindi,” 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques
(ICICT), 2019, pp. 1-4, doi: 10.1109/ICICT46931.2019.8977715
[13] R. F. Gibadullin, M. Y. Perukhin and A. V. Ilin, ”Speech
Recognition and Machine Translation Using Neural Networks,”
2021 International Conference on Industrial Engineering, Applications and Manufacturing (ICIEAM), 2021, pp. 398-403, doi:
10.1109/ICIEAM51226.2021.9446474.
[14] How to Build Accountability into Your AI. (2021). Retrieved 24 November 2022, from https://hbr.org/2021/08/how-to-build-accountability-into-your-ai
[15] Mukiibi, J., Hussein, A., Meyer, J., Katumba, A., and Nakatumba-Nabende, J. (2022). The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition. Retrieved 24 November 2022, from https://zenodo.org/record/5855017