datascience

Text Categorization is the task of assigning a boolean value to each pair (dj, ci) in D x C, where D is domain of documents and C = {c1, ..., cn}, n = |C|, is a set of predefined categories.

More formally, the task is to approximate the unknown target function fi: D x C -> {T, F}, where fi is the classifier.

categories are symbolic labels
- no exogenous knowledge (data provided from external sources) is available
categories must be classified based on endogenous knowledge (data extracted from the documents)
- inferred from the semantics of a document, subjective in nature
- because it lacks objectivity, it is non-deterministic
- subject to judgement of one's teaching the algorithm

Single-label versus multi-label TC

Different constraints may be enforced on the TC task.

Perhaps it should map to exactly one category (single-label categories)
if mapped to 0 to |C| categories, it is multi-label

Bayes Rule

P(a|b) = P(b|a) * P(a) / P(b)

Naive Baysian Algorithm

Uses the TC model, and two assumptions, to calculate the category that maximizes P(ci | d) (probability of each category for a given document), expressed as c = argmax( P(ci | d) ).

Assumption 1: Conditionally Independent, we assume the words in the document have no order.
Assumption 2: Assume that the probability that document is in a category is equal to the product of each word's probability of being in the category
- Laplace Smoothening is applied to this assumption to remove the multiply by 0 possibility
- for each P(xi | c), where xi is some word in d, add 1 to the numerator and V to denominator, V is the total number of words in the vocabulary
Apply Bayes Rule to original expression, c = argmax( P(d | ci) * P(ci) / P(d) ), since P(d) is constant for all ci, it can be ignored
c = argmax( Π( P(xj|ci) ) * P(ci) ), where P(ci) is the percentage of total documents that are in the category ci

Persistant Data

total number of docs per class
total number of docs
hashmap of word occurrances per class
total number of word occurances per class
total number of distinct words per class

Machine Learning in Automated Text Categorization 2002

datascience

Single-label versus multi-label TC

Bayes Rule

Naive Baysian Algorithm

Persistant Data