AI Breakthrough Bridges the Digital Gap for the Kashmiri Language

Despite its rich cultural heritage, the Kashmiri language is considered 'low-resource' in the field of Natural Language Processing (NLP), meaning too little digital text and annotated data exists for models to learn it effectively. To address this gap, researchers have developed a pioneering dataset comprising 15,036 news snippets designed specifically for text classification tasks.

The team constructed this dataset by translating English news into Kashmiri using digital tools, followed by a rigorous manual refinement process to ensure accuracy. The data spans ten diverse categories, including Politics, Technology, Medical, and Art and Craft. This effort represents the first known attempt to build a manually labelled corpus (a structured collection of text) for Kashmiri news classification.

Once the data was prepared, the researchers experimented with various machine learning algorithms and Large Language Models (LLMs). The standout performer was a fine-tuned transformer model known as ParsBERT-Uncased, which achieved an F1 score of 0.98. The F1 score is the harmonic mean of precision and recall, so a value this close to 1 indicates that the model both labels snippets correctly and misses very few in each category. This work establishes a critical foundation for future AI development in underrepresented languages.
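
For readers unfamiliar with the metric, the sketch below shows one common way an F1 score is computed for a multi-class task such as news classification. It is an illustration only: the article does not say whether the reported 0.98 is a macro, micro, or weighted average, so the macro average used here is an assumption, and the function name macro_f1 and the four toy labels are hypothetical, not drawn from the researchers' dataset or code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives and false negatives per class.
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class received a wrong item
            fn[truth] += 1  # true class lost one of its items

    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores.append(2 * precision * recall / total if total else 0.0)

    return sum(scores) / len(scores)


# Toy example (hypothetical labels, not real data from the corpus).
y_true = ["Politics", "Politics", "Technology", "Medical"]
y_pred = ["Politics", "Technology", "Technology", "Medical"]
print(round(macro_f1(y_true, y_pred), 4))
```

In this toy case one Politics snippet is mislabelled as Technology, which lowers recall for Politics and precision for Technology, so the score falls well below 1 even though three of four predictions are correct. A score of 0.98 therefore implies very few such confusions across all ten categories.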