Machine Learning in Automated Text Categorization 2002

datascience

Text categorization (TC) is the task of assigning a boolean value to each pair (dj, ci) in D x C, where D is a domain of documents and C = {c1, ..., cn}, n = |C|, is a set of predefined categories. A value of T means that dj is filed under ci; a value of F means that it is not.

More formally, the task is to approximate an unknown target function from D x C to {T, F}, which describes how documents ought to be classified, by means of a function fi: D x C -> {T, F}, called the classifier.

Single-label versus multi-label TC

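Different constraints may be enforced on the TC task, depending on the application. In the single-label case, exactly one category must be assigned to each document dj. In the multi-label case, any number of categories from 0 to |C| may be assigned to the same document.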
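
As a minimal sketch of what such a constraint looks like in practice, the T/F assignments can be held in a decision matrix with one row per document and one column per category. Single-label here means exactly one category per document; multi-label means any number, including none. The documents, the categories, and the helper names `labels_of` and `is_single_label` are all made up for illustration.

```python
# Hypothetical decision matrix: decisions[d][c] is True when document d
# is filed under category c. All names and values are illustrative only.
categories = ["sports", "politics", "tech"]
decisions = {
    "d1": {"sports": True, "politics": False, "tech": False},
    "d2": {"sports": False, "politics": True, "tech": True},
    "d3": {"sports": False, "politics": False, "tech": False},
}


def labels_of(doc):
    """Return the categories assigned to a document, in category order."""
    return [c for c in categories if decisions[doc][c]]


def is_single_label(doc):
    """Check the single-label constraint: exactly one category assigned."""
    return len(labels_of(doc)) == 1


for doc in decisions:
    print(doc, labels_of(doc), is_single_label(doc))
```

In this toy matrix only d1 satisfies the single-label constraint; d2 and d3 are valid only under the multi-label setting.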

Bayes Rule

P(a|b) = P(b|a) * P(a) / P(b)
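
A small worked example may help make the rule concrete. The numbers below are invented for illustration: a is "the document belongs to category c" and b is "the document contains a given word". The function name `bayes_posterior` is hypothetical.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Apply Bayes' rule: P(a|b) = P(b|a) * P(a) / P(b)."""
    if p_b <= 0:
        raise ValueError("P(b) must be positive")
    return p_b_given_a * p_a / p_b


# Made-up values: 20% of documents are in the category, the word appears
# in 50% of those documents, and in 16% of all documents.
p = bayes_posterior(p_b_given_a=0.5, p_a=0.2, p_b=0.16)
print(round(p, 3))  # about 0.625
```

Seeing the word raises the probability of the category from the prior of 0.2 to a posterior of about 0.625.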

Naive Bayesian Algorithm

Uses the TC model and two assumptions to find the category that maximizes P(ci | d), the probability of category ci given document d. The chosen category is c = argmax over ci of P(ci | d).
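
The sketch below shows one common way to compute this argmax. By Bayes' rule, P(ci | d) is proportional to P(d | ci) * P(ci), because P(d) is the same for every category. These notes do not name the two assumptions, so the sketch assumes the usual ones: words are conditionally independent given the category, and word order is ignored. Add-one smoothing and log probabilities are additions not taken from the notes, used so that unseen words and long documents do not break the arithmetic. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count documents per category and words per category.

    examples: list of (list_of_words, category) pairs.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, cat in examples:
        doc_counts[cat] += 1
        word_counts[cat].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab


def classify(words, model):
    """Return argmax over ci of log P(ci) + sum of log P(w | ci)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_cat, best_score = None, -math.inf
    for cat in doc_counts:
        score = math.log(doc_counts[cat] / total_docs)  # log prior
        total_words = sum(word_counts[cat].values())
        for w in words:
            if w not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing, an assumption of this sketch.
            p = (word_counts[cat][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat


# Toy training set, invented for illustration.
examples = [
    (["goal", "match", "team"], "sports"),
    (["team", "win", "score"], "sports"),
    (["vote", "election", "party"], "politics"),
    (["party", "policy", "vote"], "politics"),
]
model = train(examples)
print(classify(["team", "score"], model))
print(classify(["vote", "party"], model))
```

Working in log space turns the product of word probabilities into a sum, which avoids floating-point underflow on longer documents without changing which category wins.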

Persistent Data