Text classification is the process of assigning documents to a predefined set of categories based on their content. A variety of classifiers have been used to classify Arabic text documents. The main objective of this research is to improve the classification of Arabic text documents by combining different classification algorithms. To achieve this objective, we built four models using different combination methods.
The first combined model was built using fixed combination rules. We used five fixed rules to combine different classifiers, and for each rule we varied the number of classifiers. The best classification accuracy, 95.3%, was achieved with the majority voting rule using seven classifiers; building this model took 835.94 seconds.
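The majority voting rule mentioned above can be illustrated with a minimal sketch. This is not the WEKA or RapidMiner implementation used in the experiments; the function names, the category labels, and the three sets of classifier outputs are hypothetical, and the tie-breaking choice is an assumption made only for this example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by the most classifiers for one document.

    Ties are broken in favour of the label that appears first in
    `predictions` (an assumption made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def combine(all_predictions):
    """all_predictions[i][j] is classifier i's label for document j."""
    return [majority_vote(column) for column in zip(*all_predictions)]

# Hypothetical outputs of three classifiers on four documents.
votes = [
    ["sports", "economy", "sports", "health"],
    ["sports", "sports", "economy", "health"],
    ["economy", "economy", "sports", "health"],
]
combined = combine(votes)
print(combined)  # ['sports', 'economy', 'sports', 'health']
```

Each document receives the label that at least two of the three hypothetical classifiers agree on, so an error made by a single classifier is outvoted.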
The second combination approach was stacking, which consists of two stages of classification: the first is performed by the base classifiers and the second by a meta-classifier. In our experiments we used two different meta-classifiers, Naïve Bayes and Linear Regression, with different numbers of base classifiers. Stacking achieved very high classification accuracy: 99.2% with Naïve Bayes as the meta-classifier and 99.4% with Linear Regression. Because stacking involves two stages of learning, building the models took a long time: 1962.73 seconds with Naïve Bayes and 3718.07 seconds with Linear Regression.
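The two-stage structure of stacking can be sketched as follows. This is an illustration under stated assumptions, not the experimental setup: the two base classifiers are hypothetical fixed keyword rules (in practice they would be trained, and the meta-classifier would be fitted on held-out base predictions), the documents are invented English toy strings, and the meta-classifier is a small categorical Naïve Bayes written for this example.

```python
import math
from collections import Counter, defaultdict

# Stage 1: hypothetical base classifiers (fixed keyword rules for illustration).
def base_a(doc):
    return "sports" if "match" in doc else "economy"

def base_b(doc):
    return "sports" if "goal" in doc else "economy"

def meta_features(doc):
    """The base classifiers' predictions become the meta-classifier's input."""
    return (base_a(doc), base_b(doc))

class NaiveBayesMeta:
    """Stage 2: categorical Naive Bayes over the base predictions."""

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.total = len(labels)
        self.values = sorted({v for row in rows for v in row})
        # feature_counts[(position, class)][value] = number of occurrences
        self.feature_counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            for i, v in enumerate(row):
                self.feature_counts[(i, y)][v] += 1
        return self

    def predict_one(self, row):
        def log_posterior(c):
            score = math.log(self.class_counts[c] / self.total)
            for i, v in enumerate(row):
                num = self.feature_counts[(i, c)][v] + 1  # Laplace smoothing
                den = self.class_counts[c] + len(self.values)
                score += math.log(num / den)
            return score
        return max(self.classes, key=log_posterior)

# Toy training data (invented for this sketch).
train_docs = [
    "the match ended with a late goal",
    "match report from the stadium",
    "the bank set a growth goal",
    "market prices fell",
]
train_labels = ["sports", "sports", "economy", "economy"]

meta = NaiveBayesMeta().fit([meta_features(d) for d in train_docs], train_labels)

def stacked_predict(doc):
    return meta.predict_one(meta_features(doc))
```

On this toy data the meta-classifier learns that the first base classifier is the more reliable one, so `stacked_predict("inflation goal missed by the central bank")` follows it even though the second base classifier votes for "sports".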
The third experiment used AdaBoost: we boosted the C4.5 classifier with different numbers of iterations. Boosting improved the classification accuracy of C4.5. With 5 iterations the accuracy was 95.3% and the model took 1174.58 seconds to build, while with 10 iterations the accuracy was 99.5% and the model took 1965.72 seconds to build.
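The effect of the number of boosting iterations can be illustrated with a minimal AdaBoost sketch. Several simplifications are assumed here: one-split decision stumps stand in for C4.5, the labels are binary (+1/-1) rather than multi-class document categories, and the six one-dimensional data points are invented for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    """A one-split weak learner: `polarity` on the left side, its opposite on the right."""
    return polarity if x <= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, iterations):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(iterations):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this weak learner
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single stump can separate these labels.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

one_round = [predict(adaboost(xs, ys, 1), x) for x in xs]
three_rounds = [predict(adaboost(xs, ys, 3), x) for x in xs]
print(one_round)     # [1, 1, -1, -1, -1, -1]  (4 of 6 correct)
print(three_rounds)  # [1, 1, -1, -1, 1, 1]    (6 of 6 correct)
```

A single stump misclassifies two of the six points, whereas the weighted vote of three stumps classifies all of them correctly, which mirrors in miniature the accuracy gain reported between 5 and 10 iterations.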
The fourth approach was bagging, which was designed to improve the stability and accuracy of machine learning algorithms. We applied bagging to a decision tree: with 5 iterations the accuracy was 93.7% and the model took 295.85 seconds to build, while with 10 iterations the accuracy was 99.4% and the model took 470.99 seconds to build. We used three datasets to test the combined models: BBC Arabic, CNN Arabic, and OSAC. The experiments were carried out using the WEKA and RapidMiner data mining tools.
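The bagging procedure described above (train one model per bootstrap sample, then combine by vote) can be sketched as follows. The sketch rests on assumptions: a one-split stump stands in for the decision tree, the data are six invented one-dimensional points with two classes, and the function names and the fixed random seed are hypothetical choices for this example rather than settings from the experiments.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split tree on (x, label) pairs by minimising training errors."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def bagging(data, iterations, seed=0):
    """Train one stump per bootstrap sample (n draws with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(iterations):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def predict(models, x):
    """Combine the individual models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: (feature value, class label).
data = [(1, 0), (2, 0), (3, 0), (10, 1), (11, 1), (12, 1)]
models = bagging(data, iterations=5)
```

Because each stump sees a slightly different resample of the data, their individual split points differ, and the vote in `predict` averages out that variation; this is the stability benefit the paragraph refers to.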