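Text classification is one of the most common families of NLP tasks: the input is usually a piece of text, and the output is a predicted class label. Typical text classification tasks include topic classification, sentiment analysis, authorship attribution, and authenticity detection. Many other problems can also be recast as classification problems and solved the same way.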
1. Naive Bayes
Pros: Fast to train and classify; robust, low variance; good for low-data situations; the optimal classifier if the independence assumption holds; extremely simple to implement.
Cons: The independence assumption rarely holds; lower accuracy than comparable methods in most situations; smoothing is required for unseen class/feature combinations.
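To make the smoothing point concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-alpha (Laplace) smoothing. The function names `train_nb` and `predict_nb`, the toy documents, and the choice to skip out-of-vocabulary tokens are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Count class and word frequencies for a multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> token -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, tokens):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # assumption: ignore tokens never seen in training
            # Add-alpha smoothing keeps unseen class/word pairs from giving log(0).
            numerator = word_counts[label][tok] + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for this example.
docs = [["great", "film"], ["awful", "plot"], ["great", "acting"], ["awful", "film"]]
labels = ["pos", "neg", "pos", "neg"]
model = train_nb(docs, labels)
prediction = predict_nb(model, ["great", "plot"])
print(prediction)
```

Without smoothing, the word "plot" (never seen with the "pos" label) would give that class a probability of zero, no matter how strong the other evidence is.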
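2. Logistic Regression
Logistic regression is a small modification of linear regression: a link function maps the linear score to a probability between 0 and 1. In a sense it "turns a curve into a straight line", since the log-odds of the output are linear in the features.
Pros: Unlike Naive Bayes, not confounded by diverse, correlated features.
Cons: High bias; slow to train; some feature scaling issues; often needs a lot of data to work well; choosing the regularisation is a nuisance but important, since overfitting is a big problem.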
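The link function and the role of regularisation can be sketched in a few lines. This is an illustrative batch gradient descent implementation, not a production one; the name `train_logreg`, the learning rate, the epoch count, and the L2 strength are assumptions chosen for the toy data.

```python
import math

def sigmoid(z):
    """Logistic link: maps a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500, l2=0.01):
    """Fit weights by batch gradient descent on the L2-regularised log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # The l2 term shrinks the weights and is what limits overfitting.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy 1-D data, invented for this example: small x -> class 0, large x -> class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train_logreg(X, y)
p_low = sigmoid(w[0] * 0.0 + b)
p_high = sigmoid(w[0] * 3.0 + b)
print(p_low, p_high)
```

The output is always a probability, so a class decision is usually made by thresholding it at 0.5.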
3. Support Vector Machines (SVM)
Pros: Fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets.
Cons: Multiclass classification is awkward; feature scaling can be tricky; deals poorly with class imbalance; uninterpretable.
4. K-Nearest Neighbour (KNN)
Pros: Simple, effective; no training required; inherently multiclass; optimal with infinite data.
Cons: k has to be selected; issues with unbalanced classes; often slow (the k nearest neighbours have to be found for every query); features must be selected carefully.
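As a rough illustration of why KNN needs no training but is slow at prediction time, the sketch below scans every stored example for each query and takes a majority vote. The name `knn_predict`, the Euclidean distance, and the toy points are assumptions for this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just storing the data; all the work happens here.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-D data, invented for this example.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
far_corner = knn_predict(train_X, train_y, (5.5, 5.5))
print(near_origin, far_corner)
```

Because the vote is over labels, any number of classes works without modification, which is what "inherently multiclass" refers to.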
5. Decision Tree
Pros: In theory, very interpretable; fast to build and test; feature representation/scaling is irrelevant; good for small feature sets; handles non-linearly-separable problems.
Cons: In practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
6. Random Forest
Pros: Usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelised.
Cons: Same negatives as decision trees; too slow with large feature sets.
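7. Neural Network
Main idea: layers of nodes are connected to one another. Each node takes a weighted sum of the previous layer's outputs, applies an activation function, and passes the result on to the next layer. The details are not covered here, but each individual node is essentially a linear model followed by a non-linearity.
Pros: Extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.
Cons: Not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to overfitting.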
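The layer-by-layer idea can be sketched as a forward pass through a tiny two-layer network. The weights below are hand-picked for illustration rather than learned, and the names `dense`, `relu`, `forward`, and `params` are assumptions. With these particular weights the network solves an XOR-style problem that no single linear model can separate.

```python
import math

def dense(inputs, weights, biases):
    """One layer: each output node is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Non-linear activation; without it, stacked layers collapse to one linear map."""
    return [max(0.0, v) for v in values]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Hidden layer with ReLU, then a single sigmoid output node."""
    hidden = relu(dense(x, params["W1"], params["b1"]))
    (logit,) = dense(hidden, params["W2"], params["b2"])
    return sigmoid(logit)

# Hand-picked (not trained) weights: the output is high when the inputs differ.
params = {
    "W1": [[1.0, -1.0], [-1.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, 1.0]],
    "b2": [-0.5],
}
outputs = {pair: forward(list(pair), params)
           for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]}
print(outputs)
```

In practice the weights are learned by gradient descent, which is where the cost of training and the sensitivity to hyperparameters come from.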
After fitting a model on the training set, we can tune its hyperparameters on a validation set. Common tuning methods are k-fold cross-validation and grid search.
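The two tuning methods are often combined: every candidate hyperparameter value in a grid is scored by k-fold cross-validation, and the best-scoring value is kept. Below is a minimal sketch that tunes the k of a nearest-neighbour classifier. The helper names, the interleaved fold assignment, the grid `[1, 3]`, and the toy data (which contains one deliberately noisy label) are all assumptions for illustration.

```python
import math
from collections import Counter

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; fold i holds out every n_folds-th sample."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        val_idx = indices[fold::n_folds]
        held_out = set(val_idx)
        yield [i for i in indices if i not in held_out], val_idx

def knn_predict(train_X, train_y, query, k):
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(X, y, k, n_folds=3):
    """Average accuracy over the folds; each sample is validated exactly once."""
    correct = 0
    for train_idx, val_idx in k_fold_indices(len(X), n_folds):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                       for i in val_idx)
    return correct / len(X)

# Toy data, invented for this example; the last point is a mislabelled outlier.
X = [(0, 0), (5, 5), (0, 1), (5, 6), (1, 0), (6, 5), (1, 1), (6, 6), (0.4, 0.4)]
y = ["a", "b", "a", "b", "a", "b", "a", "b", "b"]

grid = [1, 3]  # candidate values for the hyperparameter k
scores = {k: cross_val_accuracy(X, y, k) for k in grid}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With more than one hyperparameter, the grid becomes the Cartesian product of the candidate values, so the cost grows quickly with each added dimension.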
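Accuracy = number correct / total number
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
F1-score = 2 * precision * recall / (precision + recall)
Here tp, fp, and fn are the numbers of true positives, false positives, and false negatives.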
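The formulas above translate directly into code. This is a small sketch for the binary case; the function name `binary_metrics`, the convention of returning 0.0 when a denominator is zero, and the example labels are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example labels, invented for this example: tp = 3, fp = 1, fn = 1, tn = 3.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```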
There are also the macro F-score and the micro F-score, which extend the F-score to problems with more than two classes.
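The difference between the two averages is easiest to see in code. The sketch below assumes single-label multiclass data: the macro score averages the per-class F1 values, so every class counts equally, while the micro score pools the tp, fp, and fn counts over all classes before computing a single F1. The function names and example labels are assumptions.

```python
from collections import Counter

def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative
    classes = sorted(set(y_true) | set(y_pred))
    # Macro: mean of per-class F1 scores.
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: one F1 score computed from the pooled counts.
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Example labels, invented for this example.
y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c", "a"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(macro, micro)
```

In this single-label setting the micro F-score equals plain accuracy, so the macro score is usually the more informative of the two when classes are imbalanced.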