Kaggle Competition: Sentiment Analysis on Movie Reviews

Date: 2015-01-18 14:25:37

Classify the sentiment of sentences from the Rotten Tomatoes dataset

Competition link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

I am growing more and more fond of the IPython notebook. All of the work below can be done on a single page, and Firefox supports it better than Chrome.

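The dataset is split into train.tsv and test.tsv. Fields are separated by \t. Each row of train.tsv has four fields: PhraseId, SentenceId, Phrase, Sentiment. test.tsv has the first three and omits Sentiment.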

Sentiment labels:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
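
For reading predictions later on, the label list above can be kept as a small lookup table. This is only a convenience sketch; the names SENTIMENT_LABELS and label_name are hypothetical and do not appear in the competition materials.

```python
# Label names copied from the list above.
SENTIMENT_LABELS = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}


def label_name(code):
    """Return the readable name for a sentiment code (hypothetical helper)."""
    return SENTIMENT_LABELS[int(code)]


print(label_name(3))  # somewhat positive
```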

import pandas as pd
df = pd.read_csv('train.tsv',header=0,delimiter='\t')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)

df.head()

Out[6]:

	PhraseId	SentenceId	Phrase	Sentiment
0	1	1	A series of escapades demonstrating the adage ...	1
1	2	1	A series of escapades demonstrating the adage ...	2
2	3	1	A series	2
3	4	1	A	2
4	5	1	series	2

In [13]:
df.Sentiment.value_counts()/df.Sentiment.count()
Out[13]:
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
dtype: float64
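
The distribution above is heavily skewed toward the neutral class, which gives a useful reference point: a model that always predicts the most frequent label is correct exactly as often as that label occurs. A minimal sketch of that arithmetic, using the proportions printed above (the variable names are illustrative):

```python
# Class proportions copied from the value_counts() output above.
class_share = {2: 0.509945, 3: 0.210989, 1: 0.174760, 4: 0.058990, 0: 0.045316}

# Always predicting the most frequent class yields an accuracy equal
# to that class's share of the training data.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

print(majority_class)     # 2
print(baseline_accuracy)  # 0.509945
```

Any classifier trained on this data should beat roughly 51% accuracy on the training distribution to be worth more than this constant guess.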

Test classification accuracy directly on the first 5 rows of the training set:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X_train = df['Phrase']
y_train = df['Sentiment']
# Bag-of-words counts -> TF-IDF weights -> logistic regression
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression()),
                     ])
text_clf = text_clf.fit(X_train, y_train)
# Score on the first 5 training rows (data the model has already seen)
X_test = df.head()['Phrase']
predicted = text_clf.predict(X_test)
print(np.mean(predicted == df.head()['Sentiment']))
for phrase, sentiment in zip(X_test, predicted):
    print('%r => %s' % (phrase, sentiment))

Classification accuracy and results:

0.8
'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .' => 3
'A series of escapades demonstrating the adage that what is good for the goose' => 2
'A series' => 2
'A' => 2
'series' => 2

df.head()['Sentiment']
0    1
1    2
2    2
3    2
4    2

The first phrase is misclassified (predicted 3, actual 1).
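
Because those five rows were also used for fitting, the 0.8 figure says little about unseen phrases. One common alternative, sketched here as an assumption rather than as part of the original workflow, is to hold out a share of the labeled data and score on it. The helper holdout_accuracy and the toy phrases are hypothetical; the toy data only stands in for train.tsv so the sketch runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def holdout_accuracy(phrases, labels, test_size=0.2, seed=0):
    """Fit on one part of the labeled data, score on the held-out rest.

    Hypothetical helper, not from the original post.
    """
    X_fit, X_val, y_fit, y_val = train_test_split(
        list(phrases), list(labels), test_size=test_size, random_state=seed)
    clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', LogisticRegression(max_iter=1000))])
    clf.fit(X_fit, y_fit)
    return clf.score(X_val, y_val)  # mean accuracy on the held-out phrases


# Toy stand-in data so the sketch runs without train.tsv.
toy_phrases = ['good film', 'great fun', 'bad film', 'awful plot',
               'good plot', 'great film', 'bad acting', 'awful film',
               'good acting', 'awful fun']
toy_labels = [3, 3, 1, 1, 3, 3, 1, 1, 3, 1]
acc = holdout_accuracy(toy_phrases, toy_labels)
print(acc)
```

With the real data the call would be holdout_accuracy(df['Phrase'], df['Sentiment']). Since phrases from the same sentence overlap heavily, a split grouped by SentenceId (for instance with scikit-learn's GroupShuffleSplit) would likely be a stricter check.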
Test dataset:

test_df = pd.read_csv('test.tsv',header=0,delimiter='\t')
test_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66292 entries, 0 to 66291
Data columns (total 3 columns):
PhraseId      66292 non-null int64
SentenceId    66292 non-null int64
Phrase        66292 non-null object
dtypes: int64(2), object(1)

Classify the test dataset with the trained model:

from numpy import savetxt
X_test = test_df['Phrase']
phraseIds = test_df['PhraseId']
predicted = text_clf.predict(X_test)
# Pair each PhraseId with its predicted label (test ids start at 156061)
pred = [[pid, label] for pid, label in zip(phraseIds, predicted)]
# The ../Submissions directory must already exist
savetxt('../Submissions/lr_benchmark.csv', pred, delimiter=',', fmt='%d,%d', header='PhraseId,Sentiment', comments='')
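
The same two-column file can also be written with pandas, which keeps the ids and predictions in named columns. The sketch below is one possible variant, not the method used above; write_submission and the demo path are hypothetical, and the two sample rows are made-up values for illustration only.

```python
import os
import tempfile

import pandas as pd


def write_submission(phrase_ids, predictions, path):
    """Write a PhraseId,Sentiment CSV file (hypothetical helper)."""
    submission = pd.DataFrame({'PhraseId': list(phrase_ids),
                               'Sentiment': list(predictions)},
                              columns=['PhraseId', 'Sentiment'])
    submission.to_csv(path, index=False)  # index=False: no extra row-number column
    return submission


# Demo with two made-up rows, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_submission.csv')
write_submission([156061, 156062], [2, 3], demo_path)
with open(demo_path) as f:
    demo_lines = f.read().splitlines()
print(demo_lines)
```

With the objects from this post the call would be write_submission(test_df['PhraseId'], predicted, '../Submissions/lr_benchmark.csv').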

Submission result:

Reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Tags: machine learning, kaggle

Original article: http://blog.csdn.net/laozhaokun/article/details/42807241
