Adaptation and Development of Universal Dependencies for Punjabi (Shahmukhi) Script: Challenges and Linguistic Insights
Abstract
This paper applies the Universal Dependencies (UD) framework to Punjabi written in the Shahmukhi script through the development of a treebank, addressing both theoretical and practical aspects. UD is a standardized annotation scheme designed to support the development of multilingual parsers, facilitate cross-linguistic research, and promote consistent syntactic annotation across languages. It evolved from the Stanford Dependencies, incorporates principles from Google’s universal part-of-speech tagset and Interset, and aims to provide a common set of grammatical categories applicable across languages. Punjabi, spoken primarily in Pakistan, where it is written in the Shahmukhi script, presents particular challenges because its orthography is less standardized and the script introduces ambiguities. This paper discusses how UD is adapted for Punjabi to address these orthographic issues, such as the inconsistent use of diacritics, which can lead to significant semantic ambiguity. The project aims to offer a comprehensive linguistic resource for Punjabi in the Shahmukhi script, supporting research and applications in Natural Language Processing (NLP) and contributing to the broader UD ecosystem. By detailing the specific adaptations made for Punjabi, including the handling of diacritics and of grammatical features such as word classes, gender rules, and tonal characteristics, this work seeks to improve the utility and accuracy of the UD framework for low-resource languages. The paper also highlights why such adaptation matters for advancing multilingual parsing and syntactic analysis in less-resourced linguistic contexts.