Extending the CI-Bayes categorizer with a revoke function

CI-Bayes is a Java library for classifying texts. Joseph Ottinger demonstrates the use of the library very well in his article entitled Using CI-Bayes.

Joseph Ottinger's article uses version 1.0.4, whereas I used version 2.0.0.RELEASE; I did not notice any differences in the API.
The CI-Bayes library is available from the Maven repository: CI-Bayes Maven repository.

The Problem

We wish to create a document store where users can classify documents into pre-defined categories, and can also delete categories assigned to the documents.
The aim is to train the system through this feedback, so that it can use the information for automatic categorization.

The difficulty lies in deleting a category assigned to a document: doing so means that the document's contribution has to be removed from the categorizing system, that is, from the information acquired about that category. Unfortunately, the CI-Bayes API does not support this.

The Idea

As the delete function was necessary, I began to study the source code of CI-Bayes, and explore how its training works. The method com.enigmastation.classifier.impl.ClassifierImpl.train(Object item, String category) is responsible for the training, where the item is the text of the document, and the category is the category into which the user has classified the given document. The implementation of the train() method is as follows:


public void train(Object item, String category) {
    Set<String> features = wordLister.getUniqueWords(item);

    for (String f : features) {
        incf(f, category);
    }
    incc(category);
}

This algorithm does the following:

  • it extracts the distinct words from the text
  • for each of these words, it increases the word's multiplicity in the knowledge base of the given category by one; a word that was not yet included for that category is added with a multiplicity of one
  • it increases the multiplicity of the given category by one; a category that did not exist before is added to the knowledge base with a multiplicity of one

That is all the training consists of.
A document is then classified into a category using this acquired information. The CI-Bayes library provides three Bayesian classifiers for this: the “simple Bayesian” classifier, the naïve classifier and, the most useful of the three, the Fisher classifier.
As shown above, the classifier is trained by incrementing counters, so the acquired information can be revoked by decrementing the same counters.
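To illustrate why decrementing undoes training, here is a minimal, self-contained sketch that does not use CI-Bayes at all. The class CounterSketch, its methods and the simple whitespace tokenizer are hypothetical stand-ins, and the probability shown (documents of a category containing a word, divided by all documents of that category) is only the most basic estimate that can be read off such counters, not necessarily what the CI-Bayes classifiers compute.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the classifier's knowledge base; not part of CI-Bayes.
final class CounterSketch {
    // feature -> (category -> number of documents containing the feature)
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    // category -> number of documents trained into it
    private final Map<String, Integer> categoryCounts = new HashMap<>();

    // Simplified word lister: lower-case, split on whitespace, keep unique words.
    private static Set<String> uniqueWords(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    void train(String text, String category) {
        update(text, category, 1);
    }

    // Assumes the same text was trained into the category earlier.
    void revoke(String text, String category) {
        update(text, category, -1);
    }

    private void update(String text, String category, int delta) {
        for (String feature : uniqueWords(text)) {
            featureCounts.computeIfAbsent(feature, f -> new HashMap<>())
                         .merge(category, delta, Integer::sum);
        }
        categoryCounts.merge(category, delta, Integer::sum);
    }

    // Share of the category's documents that contain the feature.
    double featureProbability(String feature, String category) {
        int docs = categoryCounts.getOrDefault(category, 0);
        if (docs == 0) {
            return 0.0;
        }
        Map<String, Integer> perCategory =
                featureCounts.getOrDefault(feature, Collections.emptyMap());
        return (double) perCategory.getOrDefault(category, 0) / docs;
    }

    public static void main(String[] args) {
        CounterSketch model = new CounterSketch();
        model.train("invoice payment due", "finance");
        double before = model.featureProbability("payment", "finance");

        model.train("payment reminder", "finance");
        model.revoke("payment reminder", "finance");
        double after = model.featureProbability("payment", "finance");

        System.out.println(before == after); // prints true
    }
}
```

Because every estimate is derived from the counters alone, bringing the counters back to their earlier values brings the estimates back as well. The sketch leaves zero-valued entries in the maps; cleaning those up is a separate concern.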

Realization

To revoke a piece of information previously taught to the classifier (that a given document belongs to a given category), I implemented a revokeTrain(Object item, String category) method modelled on the train(Object item, String category) method:

  • it extracts the distinct words from the text
  • for each of these words, it decreases the word's multiplicity in the knowledge base of the given category by one, and if the multiplicity of a word drops to zero, it deletes that word from the knowledge base
  • it decreases the multiplicity of the given category in its knowledge base by one, and if the multiplicity drops to zero, it deletes the category from the knowledge base.
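The decrement-and-delete rule in the list above can be tried out on a plain map, independently of CI-Bayes. The following is only a sketch: the Counters class and its decrement helper are hypothetical names, and the map stands in for whatever structure holds the counts. It relies on the standard Map.computeIfPresent contract, which removes the entry when the remapping function returns null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of CI-Bayes.
final class Counters {
    private Counters() {
    }

    // Lowers the counter stored under key by one. The entry is dropped once it
    // would reach zero, and a missing key is left untouched.
    static void decrement(Map<String, Integer> counts, String key) {
        // computeIfPresent removes the mapping when the function returns null.
        counts.computeIfPresent(key, (k, count) -> count > 1 ? count - 1 : null);
    }

    public static void main(String[] args) {
        Map<String, Integer> categoryDocCount = new HashMap<>();
        categoryDocCount.put("finance", 2);

        decrement(categoryDocCount, "finance");
        System.out.println(categoryDocCount); // {finance=1}

        decrement(categoryDocCount, "finance");
        System.out.println(categoryDocCount); // {}

        decrement(categoryDocCount, "finance"); // nothing left to revoke
        System.out.println(categoryDocCount); // {}
    }
}
```

Ignoring a missing key is a design choice made for this sketch: revoking something that was never trained then becomes a no-op instead of producing a negative or spurious count.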

For the implementation I made a subclass of com.enigmastation.classifier.impl.FisherClassifierImpl (naturally, NaiveClassifierImpl or ClassifierImpl would work as well, depending on which one we intend to use).

My implementation is the following:


import com.enigmastation.classifier.CategoryIncrement;
import com.enigmastation.classifier.ClassifierListener;
import com.enigmastation.classifier.FeatureIncrement;
import com.enigmastation.classifier.impl.FisherClassifierImpl;
import com.google.common.collect.Sets;
import java.util.Map;
import java.util.Set;

public class MyFisherClassifier extends FisherClassifierImpl {

    private Set<ClassifierListener> trainingListeners;

    @Override
    public void addListener(ClassifierListener listener) {
        super.addListener(listener);
        if (trainingListeners == null) {
            trainingListeners = Sets.newHashSet();
        }
        trainingListeners.add(listener);
    }

    public void revokeTrain(Object item, String category) {
        Set<String> features = wordLister.getUniqueWords(item);

        for (String feature : features) {
            decf(feature, category);
        }
        decc(category);
    }

    void decf(String feature, String category) {
        Map<String, Integer> fm = getClassifierDataModelFactory().getFeatureMap(feature);
        if (fm == null) {
            throw new IllegalStateException("You must be able to create a feature map");
        }

        decrementCategory(fm, category);

        final Integer count = fm.get(category);

        if (trainingListeners != null) {
            FeatureIncrement fi = new FeatureIncrement(feature, category, count);
            for (ClassifierListener listener : trainingListeners) {
                listener.handleFeatureUpdate(fi);
            }
        }

        if (count == 0) {
            fm.remove(category);
        }
    }

    void decc(String category) {
        decrementCategory(getCategoryDocCount(), category);

        final Integer categoryDocCount = getCategoryDocCount().get(category);

        if (trainingListeners != null) {
            CategoryIncrement ci = new CategoryIncrement(category, categoryDocCount);
            for (ClassifierListener listener : trainingListeners) {
                listener.handleCategoryUpdate(ci);
            }
        }

        if (categoryDocCount == 0) {
            getCategoryDocCount().remove(category);
        }
    }

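    // Decrements the counter stored for the category without letting it drop
    // below zero. A missing entry is treated as zero, so the callers can compare
    // the stored value with 0 and remove the entry afterwards.
    private void decrementCategory(Map<String, Integer> map, String category) {
        Integer val = map.get(category);
        map.put(category, (val == null || val <= 0) ? 0 : val - 1);
    }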
}

Resources

What is Text Classification – Stanford NLP

Naive Bayes – Stanford NLP

Naive Bayes classifier

CI-Bayes
