10.5120/17701-8680
Diab Abuaiadah, Jihad El Sana and Walid Abusalah. On the Impact of Dataset Characteristics on Arabic Document Classification. International Journal of Computer Applications 101(7):31-38, September 2014.
BibTeX
@article{key:article,
  author  = {Diab Abuaiadah and Jihad El Sana and Walid Abusalah},
  title   = {On the Impact of Dataset Characteristics on Arabic Document Classification},
  journal = {International Journal of Computer Applications},
  year    = {2014},
  volume  = {101},
  number  = {7},
  pages   = {31-38},
  month   = {September},
  note    = {Full text available}
}
Abstract
This paper describes how dataset characteristics affect the results of Arabic document classification algorithms that use TF-IDF representations. The experiments compared different stemmers, different categories, and different training set sizes, and found that varying these characteristics produced widely differing results, in one case attaining 99% recall (accuracy). Using a standard dataset would eliminate this variability and let researchers draw comparable conclusions from published results.
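For readers unfamiliar with the representation named in the abstract, the following is a minimal sketch of one common TF-IDF weighting (raw term frequency multiplied by log(N/df)). It is an illustration only and not necessarily the variant used in the paper; the function name `tf_idf` and the toy corpus of transliterated tokens are hypothetical, and the documents are assumed to be already tokenized and stemmed.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term frequency times log(N / df).
    Other TF-IDF formulations (smoothing, normalization) exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus of transliterated tokens (not from the paper).
docs = [
    ["kitab", "madrasa", "kitab"],
    ["madrasa", "riyada"],
    ["riyada", "kura"],
]
vectors = tf_idf(docs)
```

Under this weighting, a term that occurs in every document receives weight zero, while a term concentrated in a few documents is weighted more heavily. This is one reason the choice of stemmer and category set can shift classification results: both change which tokens are counted as the same term.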