A Dual-Task Large Language Model for Adding Diacritics and Translating Jordanian Arabic to Modern Standard Arabic
Paper No. 18 · Access: attendees only · Updated: 2025-11-19 09:12:30


Abstract
The Arabic language presents unique challenges for natural language processing due to its complex grammar, diverse dialects, and frequent omission of diacritics. This paper proposes a unified token-free model based on ByT5 that simultaneously performs spelling correction (including Jordanian dialect-to-Modern Standard Arabic (MSA) translation) and diacritization. Our approach uses task-specific prefixes (“correct:” for correction and “diacritize:” for combined correction and diacritization) to enable flexible multi-task learning. The model was fine-tuned on the JODA dataset (Jordanian dialect/MSA pairs) and high-quality Tashkeela subsets (Clean-50 and Clean-400), with synthetic error injection to enhance robustness. Automatic evaluation yielded an overall score of 78.06% on JODA and 92.45% on the combined test set of JODA and Tashkeela. Manual evaluation of 200 JODA samples revealed a character error rate of 4.41% and a diacritic error rate of 1.32%, demonstrating practical efficacy in handling Arabic’s complexities.
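The prefix-based, token-free setup described in the abstract can be sketched as follows. This is a minimal illustration, assuming ByT5's standard convention of mapping each UTF-8 byte to an id shifted up by 3 (ids 0–2 are reserved for pad/eos/unk); the helper name `encode_byt5` and the sample input are illustrative, not taken from the paper.

```python
def encode_byt5(text: str) -> list[int]:
    """Token-free encoding: each UTF-8 byte becomes one id.

    ByT5 reserves ids 0 (pad), 1 (</s>) and 2 (<unk>), so every
    byte value is offset by 3; a final id 1 marks end of sequence.
    """
    return [b + 3 for b in text.encode("utf-8")] + [1]


# Task-specific prefixes select the behavior at inference time:
#   "correct:"    -> spelling correction / dialect-to-MSA translation
#   "diacritize:" -> combined correction and diacritization
source = "مرحبا"  # illustrative Arabic input, not from the paper's data
ids = encode_byt5("correct: " + source)

# ASCII prefix bytes map directly (e.g. 'c' -> 99 + 3), while each
# Arabic letter occupies two UTF-8 bytes, so the id sequence is
# longer than the character count of the input string.
```

In practice these ids would be fed to a fine-tuned ByT5 checkpoint (e.g. via the Hugging Face `transformers` library); swapping the prefix switches tasks without changing the model, which is what makes the single-model multi-task design possible.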
Keywords
Arabic NLP, Dialect Translation, Jordanian Dialect, Diacritization, Spelling Correction, ByT5, Transformer Models, Multi-Task Learning
Speaker
Rabie Otoum
RAN Optimization and University of Jordan

Authors
Rabie Otoum, University of Jordan
Gheith Abandah, University of Jordan
Mohammad Abdel-Majeed, University of Jordan
Important Dates
  • Conference dates: December 29–31, 2025
  • Initial draft submission deadline: November 30, 2025
  • Presentation submission deadline: December 30, 2025
  • Registration deadline: December 30, 2025

Organizer
International Science Union (国际科学联合会)
Host
Zarqa University