# Machine Learning in Automated Text Categorization 2002



Text Categorization (TC) is the task of assigning a boolean value to each pair `(dj, ci)` in `D x C`, where `D` is a domain of documents and `C = {c1, ..., cn}`, `n = |C|`, is a set of predefined categories. A value of `T` means document `dj` is filed under category `ci`; `F` means it is not.

More formally, the task is to approximate the unknown target function `fi: D x C -> {T, F}`, which describes how documents ought to be classified, by means of a function `fi': D x C -> {T, F}` called the classifier.

• categories are symbolic labels
• no exogenous knowledge (data provided from external sources) is available
• documents must be classified based only on endogenous knowledge (data extracted from the documents themselves)
• category membership is inferred from the semantics of a document, which is subjective in nature
• because it lacks objectivity, membership cannot be decided deterministically
• the result depends on the judgement of whoever is teaching the algorithm
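To make the definition concrete, here is a minimal sketch in which both the target function and the classifier are plain functions from a `(document, category)` pair to a boolean. The documents, the expert labels, the keyword rule, and the `agreement` helper are all hypothetical and only serve as illustration; measuring the fraction of matching pairs is one simple way to see how closely a classifier approximates the target.

```python
from typing import Callable, Dict, List, Tuple

# A classifier (and the target function) maps a (document, category) pair to True/False.
Decision = Callable[[str, str], bool]

# Hypothetical documents and expert judgements (the "unknown" target function).
documents: Dict[str, str] = {
    "d1": "the match ended with a late goal",
    "d2": "the goal of this patch is to fix a bug",
}
expert_labels: Dict[Tuple[str, str], bool] = {
    ("d1", "sports"): True,
    ("d1", "tech"): False,
    ("d2", "sports"): False,
    ("d2", "tech"): True,
}

def target(doc_id: str, category: str) -> bool:
    """Target function: what the expert says the right answer is."""
    return expert_labels[(doc_id, category)]

# Toy classifier that uses only endogenous knowledge (the document text).
category_keywords: Dict[str, str] = {"sports": "goal", "tech": "bug"}

def keyword_classifier(doc_id: str, category: str) -> bool:
    return category_keywords[category] in documents[doc_id]

def agreement(classifier: Decision, target_fn: Decision,
              pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (document, category) pairs where classifier and target agree."""
    matches = sum(classifier(d, c) == target_fn(d, c) for d, c in pairs)
    return matches / len(pairs)

pairs = list(expert_labels)
score = agreement(keyword_classifier, target, pairs)
```

The keyword rule wrongly files `d2` under `sports` because the word "goal" appears in it, which illustrates why membership depends on semantics rather than surface text.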

### Single-label versus multi-label TC

Different constraints may be enforced on the TC task.

• if each document must be assigned to exactly one category, the task is single-label
• if each document may be assigned any number of categories from 0 to |C|, the task is multi-label
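The two constraints can be checked mechanically. The sketch below assumes an assignment is stored as a mapping from document id to a set of category names; the function names and the toy data are hypothetical.

```python
from typing import Dict, Set

# Each document id maps to the set of categories assigned to it.
Assignment = Dict[str, Set[str]]

def is_single_label(assignment: Assignment) -> bool:
    """True if every document has exactly one category."""
    return all(len(cats) == 1 for cats in assignment.values())

def is_valid_multi_label(assignment: Assignment, categories: Set[str]) -> bool:
    """True if every document has between 0 and |C| categories, all drawn from C."""
    return all(cats <= categories for cats in assignment.values())

categories = {"sports", "politics", "tech"}
single = {"d1": {"sports"}, "d2": {"tech"}}
multi = {"d1": {"sports", "politics"}, "d2": set()}
```

Every single-label assignment also passes the multi-label check, since exactly one category falls within the range 0 to |C|.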

### Bayes Rule

`P(a|b) = P(b|a) * P(a) / P(b)`
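As a quick worked example, let `a` be "the document is in category sports" and `b` be "the document contains the word goal". The probabilities below are made-up numbers chosen only to illustrate the arithmetic, and `P(b)` is obtained from the law of total probability.

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes Rule: P(a|b) = P(b|a) * P(a) / P(b)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, for illustration only.
p_a = 0.2               # P(a): prior probability that a document is about sports
p_b_given_a = 0.5       # P(b|a): "goal" appears in half of the sports documents
p_b_given_not_a = 0.05  # P(b|not a): "goal" is rare elsewhere

# Law of total probability: P(b) = P(b|a) P(a) + P(b|not a) P(not a)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(a|b): probability the document is about sports given it contains "goal"
posterior = bayes(p_b_given_a, p_a, p_b)
```

Seeing the word raises the probability of the category from the prior of 0.2 to roughly 0.71.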

### Naive Bayesian Algorithm

Given the TC model and two assumptions, the algorithm picks the category that maximizes `P(ci | d)`, the probability of category `ci` given document `d`, expressed as `c = argmax( P(ci | d) )`.

• Assumption 1 (bag of words): the order and position of words in the document are ignored
• Assumption 2 (conditional independence): given the category, the words are independent of one another, so the probability of the document given a category is the product of each word's probability given that category, `P(d | ci) = Π( P(xj | ci) )`
• Laplace smoothing is applied to each word probability so that a word never seen in a category cannot force the whole product to 0
• for each `P(xj | ci)`, where `xj` is some word in `d`, add 1 to the numerator and `V` to the denominator, where `V` is the number of distinct words in the vocabulary: `P(xj | ci) = (count(xj, ci) + 1) / (count(ci) + V)`, with `count(xj, ci)` the occurrences of `xj` in category `ci` and `count(ci)` the total word occurrences in `ci`
• Apply Bayes Rule to the original expression: `c = argmax( P(d | ci) * P(ci) / P(d) )`. Since `P(d)` is the same for every `ci`, it can be ignored
• `c = argmax( Π( P(xj | ci) ) * P(ci) )`, where `P(ci)` is the fraction of all documents that are in category `ci`
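The final expression can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, argument layout, and toy counts are assumptions, and the code sums log-probabilities instead of multiplying raw probabilities, a common substitution that avoids floating-point underflow on long documents and leaves the argmax unchanged.

```python
import math
from collections import Counter
from typing import Dict, List

def classify(
    words: List[str],
    docs_per_class: Dict[str, int],
    word_counts: Dict[str, Counter],
    vocab_size: int,
) -> str:
    """Return argmax over ci of P(ci) * product of P(xj | ci), computed in log space."""
    total_docs = sum(docs_per_class.values())

    def log_score(category: str) -> float:
        total_words = sum(word_counts[category].values())
        # log P(ci): fraction of all documents that belong to this category
        score = math.log(docs_per_class[category] / total_docs)
        for word in words:
            # Laplace smoothing: add 1 to the numerator and V to the denominator
            smoothed = (word_counts[category][word] + 1) / (total_words + vocab_size)
            score += math.log(smoothed)
        return score

    return max(docs_per_class, key=log_score)

# Hypothetical toy counts, for illustration only.
docs_per_class = {"sports": 2, "tech": 2}
word_counts = {
    "sports": Counter({"goal": 3, "match": 2}),
    "tech": Counter({"code": 3, "bug": 2}),
}
vocab_size = 4  # distinct words overall: goal, match, code, bug

label = classify(["goal", "bug", "goal"], docs_per_class, word_counts, vocab_size)
```

Without smoothing, the word "bug" would give the sports category a probability of 0 even though "goal" appears twice; with smoothing, sports still wins.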

### Persistent Data

• total number of docs per class
• total number of docs
• hashmap of word occurrences per class
• total number of word occurrences per class
• total number of distinct words per class
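One possible way to hold these counts is a small container that is updated once per training document. The class and method names below are hypothetical. Two choices here are assumptions rather than part of the list above: the per-class totals are derived from the per-class hashmap instead of being stored separately, and a shared `vocabulary` set is kept because the smoothing step uses the overall vocabulary size `V`.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, Iterable, Set

@dataclass
class NaiveBayesCounts:
    total_docs: int = 0                                  # total number of docs
    docs_per_class: Counter = field(default_factory=Counter)  # docs per class
    word_counts: DefaultDict[str, Counter] = field(      # word occurrences per class
        default_factory=lambda: defaultdict(Counter)
    )
    vocabulary: Set[str] = field(default_factory=set)    # assumption: shared vocabulary for V

    def add_document(self, words: Iterable[str], label: str) -> None:
        """Update every count with one labelled training document."""
        words = list(words)
        self.total_docs += 1
        self.docs_per_class[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def total_words(self, label: str) -> int:
        """Total number of word occurrences in a class (derived, not stored)."""
        return sum(self.word_counts[label].values())

    def distinct_words(self, label: str) -> int:
        """Number of distinct words seen in a class (derived, not stored)."""
        return len(self.word_counts[label])

# Hypothetical training data, for illustration only.
counts = NaiveBayesCounts()
counts.add_document(["goal", "match", "goal"], "sports")
counts.add_document(["code", "bug"], "tech")
```

Because only counts are stored, new training documents can be added incrementally without revisiting earlier ones.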