Unsupervised Learning for Extractive Urdu Document Text Summarization: A Clustering Approach
Atif Khan, Department of Computer Science, Islamia College, Peshawar, Pakistan.
Obaid Ajmal, Department of Computer Science, Islamia College, Peshawar, Pakistan.
Farhan Ahmad Awan, Department of Computer Science, Islamia College, Peshawar, Pakistan.
Fazal Wahid, Department of Computer Science, Islamia College, Peshawar, Pakistan.
Corresponding Author:
Atif Khan (atifkhan@icp.edu.pk)
Abstract:
As digital media continues to grow, the volume of textual data from diverse sources such as documents, entertainment content, books, and articles is expanding rapidly. In recent years, Urdu linguistics has seen significant progress, and numerous Urdu portals and news websites now produce substantial amounts of data daily. However, much of this text is redundant, insignificant, or of little value, which calls for efficient tools that automatically condense long texts into concise summaries. Text summarization is the process of creating a brief yet meaningful version of a document, and it falls into two categories: abstractive and extractive. Abstractive summarization generates new sentences that paraphrase the source content, while extractive summarization selects essential sentences directly from the original document. Various techniques have been proposed for both types of summarization. A major challenge in automatic extractive summarization is accurately identifying the most relevant information in the source document. In this study, we propose a domain-independent unsupervised learning algorithm for automatic extractive summarization of Urdu text. Our approach uses Hierarchical Agglomerative Clustering to group similar sentences and then selects the most representative sentence from each cluster to compose the summary. The quality of each system-generated summary is evaluated against human-written reference summaries using the ROUGE-1 and ROUGE-2 metrics. We conducted extensive experiments on an Urdu-language dataset derived from BBC news articles to assess the performance of the proposed method.
Keywords:
Unsupervised Learning; Text Summarization; Hierarchical Agglomerative Clustering; Urdu Language; News Articles
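
To make the pipeline outlined in the abstract concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes whitespace tokenization, bag-of-words term counts, cosine similarity, and average linkage, none of which are specified in the abstract. The function names (vectorize, agglomerative_clusters, representative, summarize, rouge_n_recall) and the English toy sentences are hypothetical and serve only as an illustration; for Urdu text one would presumably split sentences on the Urdu full stop (۔) and use an Urdu-aware tokenizer. The last function computes a simple recall-oriented n-gram overlap in the spirit of ROUGE-N.

```python
import math
from collections import Counter


def vectorize(sentence):
    """Bag-of-words term counts (whitespace tokenization is an assumption)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def agglomerative_clusters(vectors, k):
    """Merge the two most similar clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[x] for j in clusters[y]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def representative(cluster, vectors):
    """Pick the member with the highest mean similarity to its cluster mates."""
    def avg_sim(i):
        others = [j for j in cluster if j != i]
        if not others:
            return 0.0
        return sum(cosine(vectors[i], vectors[j]) for j in others) / len(others)
    return max(cluster, key=avg_sim)


def summarize(sentences, k):
    """Return one representative sentence per cluster, in document order."""
    vectors = [vectorize(s) for s in sentences]
    k = max(1, min(k, len(sentences)))
    clusters = agglomerative_clusters(vectors, k)
    chosen = sorted(representative(c, vectors) for c in clusters)
    return [sentences[i] for i in chosen]


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams (with multiplicity) found in the candidate."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total if total else 0.0


# Toy input (illustrative only, not taken from the BBC Urdu dataset).
sentences = [
    "the flood damaged many houses in the city",
    "many houses in the city were damaged by the flood",
    "the cricket team won the final match",
    "the team won the match after a tense final",
]
summary = summarize(sentences, 2)
print(summary)
print(rouge_n_recall(" ".join(summary), "the flood damaged houses and the team won", n=1))
```

With two clusters requested, the sketch groups the two flood sentences and the two cricket sentences and keeps one sentence from each group. The number of clusters k acts as the summary length, which is a design choice of this sketch and not a setting reported here.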