码迷,mamicode.com
首页 > 系统相关 > 详细

[kaggle入门] Titanic Machine Learning from Disaster

时间:2017-06-12 23:52:50      阅读:598      评论:0      收藏:0      [点我收藏+]

标签:div   gradient   adr   sign   dai   dex   wfs   sort   ctr   

 

Titanic Data Science Solutions

https://www.kaggle.com/startupsci/titanic-data-science-solutions

数据挖掘竞赛七个步骤:

  1. Question or problem definition.
  2. Acquire training and testing data.
  3. Wrangle, prepare, cleanse the data.
  4. Analyze, identify patterns, and explore the data.
  5. Model, predict and solve the problem.
  6. Visualize, report, and present the problem solving steps and final solution.
  7. Supply or submit the results.

数据挖掘竞赛的七种目标:

  1. Classifying: classify or categorize our samples and may also want to understand the implications or correlation of different classes with our solution goal.
  2. Correlating: Correlating certain features may help in creating, completing, or correcting features.
  3. Converting: For instance converting text categorical values to numeric values.
  4. Completing: Estimate any missing values within a feature.
  5. Correcting: Detect any outliers among our samples or features and may discard a feature if it is not contribting to the analysis or may significantly skew the results.
  6. Creating: Create new features based on an existing feature or a set of features.(correlation, conversion, completeness..)
  7. Charting: Select the right visualization plots and charts
 

Question or problem definition

https://www.kaggle.com/c/titanic

  1. The question or problem definition for Titanic Survival competition Knowing from a training set of samples listing passengers who survived or did not survive the Titanic disaster, can our model determine based on a given test dataset not containing the survival information, if these passengers in the test dataset survived or not.
  2. Some early understanding about the domain of our problem. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. Translated 32% survival rate. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In [1]:
# data analysis and wrangling 数据分析和清洗工具
import pandas as pd
import numpy as np
import random as rnd

# visualization 数据可视化工具
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning 机器学习模型
from sklearn.linear_model import LogisticRegression # 逻辑回归
from sklearn.svm import SVC, LinearSVC # 支持向量机
from sklearn.ensemble import RandomForestClassifier # 随机森林
from sklearn.neighbors import KNeighborsClassifier # K近邻
from sklearn.naive_bayes import GaussianNB # 贝叶斯算法
from sklearn.linear_model import Perceptron # 感知机
from sklearn.linear_model import SGDClassifier # 随机梯度下降分类器
from sklearn.tree import DecisionTreeClassifier # 决策树
 

Acquire training and testing data

In [2]:
train_df = pd.read_csv(‘data/train.csv‘) # 用pandas的read_csv方法读出DataFrame数据
test_df = pd.read_csv(‘data/test.csv‘)
combine = [train_df, test_df] # combine为一个数据集,方便对训练集和测试集做相同的数据清洗操作
 

Analyze by describing data

https://www.kaggle.com/c/titanic/data

In [3]:
print(train_df.columns.values) # 导出列名:features的名字
 
[‘PassengerId‘ ‘Survived‘ ‘Pclass‘ ‘Name‘ ‘Sex‘ ‘Age‘ ‘SibSp‘ ‘Parch‘
 ‘Ticket‘ ‘Fare‘ ‘Cabin‘ ‘Embarked‘]
In [4]:
# preview the data
train_df.head()  # 默认前5行
Out[4]:
 
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [5]:
train_df.tail() # 默认后5行
Out[5]:
 
 PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [6]:
train_df.info()
print(‘_‘*40)
test_df.info()
 
<class ‘pandas.core.frame.DataFrame‘>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class ‘pandas.core.frame.DataFrame‘>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
 
  1. Which features are categorical?
    Categorical: Survived, Sex, and Embarked.
    Ordinal: Pclass.
  2. Which features are numerical?
    Continous: Age, Fare.
    Discrete: SibSp, Parch.
  3. Which features are mixed data types?
    Ticket is a mix of numeric and alphanumeric data types.
    Cabin is alphanumeric.
  4. Which features may contain errors or typos?
    Name feature may contain errors or typos as there are several ways used to describe a name including titles, round brackets, and quotes used for alternative or short names.
  5. Which features contain blank, null or empty values?
    Cabin > Age > Embarked features contain a number of null values in that order for the training dataset.
    Cabin > Age are incomplete in case of test dataset.
  6. What are the data types for various features?
    Seven features are integer or floats. Six in case of test dataset.
    Five features are strings (object).
In [7]:
train_df.describe()  # 数据的描述(总数、均值、标准差、最大、最小、25%、50%、75%)
# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.
# Review Parch distribution using `percentiles=[.75, .8]`
# SibSp distribution `[.68, .69]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
Out[7]:
 
 PassengerIdSurvivedPclassAgeSibSpParchFare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [8]:
train_df.describe(include=[‘O‘])  # 找出特征中几个出现的不同值和频率最高
Out[8]:
 
 NameSexTicketCabinEmbarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Caldwell, Mrs. Albert Francis (Sylvia Mae Harb... male 347082 B96 B98 S
freq 1 577 7 4 644
 
  1. What is the distribution of numerical feature values across the samples?
    Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
    Survived is a categorical feature with 0 or 1 values.
    Around 38% samples survived representative of the actual survival rate at 32%.
    Most passengers (> 75%) did not travel with parents or children.
    Nearly 30% of the passengers had siblings and/or spouse aboard.
    Fares varied significantly with few passengers (\<1%) paying as high as 512.
    Few elderly passengers (\<1%) within age range 65-80.
  2. What is the distribution of categorical features?
    Names are unique across the dataset (count=unique=891).
    Sex variable as two possible values with 65% male (top=male, freq=577/count=891).
    Cabin values have several dupicates across samples. Alternatively several passengers shared a cabin.
    Embarked takes three possible values. S port used by most passengers (top=S).
    Ticket feature has high ratio (22%) of duplicate values (unique=681).
 

Assumtions based on data analysis

Correlating
Completing
Correcting
Creating
Classifying

In [9]:
# 通过groupby找出该特征与目标之间的关联
train_df[[‘Pclass‘, ‘Survived‘]].groupby([‘Pclass‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[9]:
 
 PclassSurvived
0 1 0.629630
1 2 0.472826
2 3 0.242363
In [10]:
train_df[["Sex", "Survived"]].groupby([‘Sex‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[10]:
 
 SexSurvived
0 female 0.742038
1 male 0.188908
In [11]:
train_df[["SibSp", "Survived"]].groupby([‘SibSp‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[11]:
 
 SibSpSurvived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000
In [12]:
train_df[["Parch", "Survived"]].groupby([‘Parch‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[12]:
 
 ParchSurvived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000
 

Analyze by visualizing data

In [13]:
g = sns.FacetGrid(train_df, col=‘Survived‘)
g.map(plt.hist, ‘Age‘, bins=20)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x2a742a46828>
 
技术分享
In [14]:
# grid = sns.FacetGrid(train_df, col=‘Pclass‘, hue=‘Survived‘)
grid = sns.FacetGrid(train_df, col=‘Survived‘, row=‘Pclass‘, size=2.2, aspect=1.6)
grid.map(plt.hist, ‘Age‘, alpha=.5, bins=20)
grid.add_legend();
 
技术分享
In [15]:
# grid = sns.FacetGrid(train_df, col=‘Embarked‘)
grid = sns.FacetGrid(train_df, row=‘Embarked‘, size=2.2, aspect=1.6)
grid.map(sns.pointplot, ‘Pclass‘, ‘Survived‘, ‘Sex‘, palette=‘deep‘)
grid.add_legend()
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x2a7435e7198>
 
技术分享
In [16]:
# grid = sns.FacetGrid(train_df, col=‘Embarked‘, hue=‘Survived‘, palette={0: ‘k‘, 1: ‘w‘})
grid = sns.FacetGrid(train_df, row=‘Embarked‘, col=‘Survived‘, size=2.2, aspect=1.6)
grid.map(sns.barplot, ‘Sex‘, ‘Fare‘, alpha=.5, ci=None)
grid.add_legend()
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x2a7435e7978>
 
技术分享
 

Wrangle, prepare, cleanse the data

Correcting by dropping features
drop the Cabin (correcting #2) and Ticket (correcting #1) features

In [17]:
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

train_df = train_df.drop([‘Ticket‘, ‘Cabin‘], axis=1)
test_df = test_df.drop([‘Ticket‘, ‘Cabin‘], axis=1)
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)
 
Before (891, 12) (418, 11) (891, 12) (418, 11)
After (891, 10) (418, 9) (891, 10) (418, 9)
 

Creating new feature extracting from existing

In [18]:
for dataset in combine:
    dataset[‘Title‘] = dataset.Name.str.extract(‘ ([A-Za-z]+)\.‘, expand=False)

pd.crosstab(train_df[‘Title‘], train_df[‘Sex‘])
Out[18]:
 
Sexfemalemale
Title  
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
In [19]:
for dataset in combine:
    dataset[‘Title‘] = dataset[‘Title‘].replace([‘Lady‘, ‘Countess‘,‘Capt‘, ‘Col‘, 	‘Don‘, ‘Dr‘, ‘Major‘, ‘Rev‘, ‘Sir‘, ‘Jonkheer‘, ‘Dona‘], ‘Rare‘)

    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Mlle‘, ‘Miss‘)
    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Ms‘, ‘Miss‘)
    dataset[‘Title‘] = dataset[‘Title‘].replace(‘Mme‘, ‘Mrs‘)
    
train_df[[‘Title‘, ‘Survived‘]].groupby([‘Title‘], as_index=False).mean()
Out[19]:
 
 TitleSurvived
0 Master 0.575000
1 Miss 0.702703
2 Mr 0.156673
3 Mrs 0.793651
4 Rare 0.347826
In [20]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset[‘Title‘] = dataset[‘Title‘].map(title_mapping)
    dataset[‘Title‘] = dataset[‘Title‘].fillna(0)

train_df.head()
Out[20]:
 
 PassengerIdSurvivedPclassNameSexAgeSibSpParchFareEmbarkedTitle
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 S 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C 3
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 S 2
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 S 3
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 S 1
In [21]:
train_df = train_df.drop([‘Name‘, ‘PassengerId‘], axis=1)
test_df = test_df.drop([‘Name‘], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
Out[21]:
((891, 9), (418, 9))
In [22]:
for dataset in combine:
    dataset[‘Sex‘] = dataset[‘Sex‘].map( {‘female‘: 1, ‘male‘: 0} ).astype(int)

train_df.head()
Out[22]:
 
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
0 0 3 0 22.0 1 0 7.2500 S 1
1 1 1 1 38.0 1 0 71.2833 C 3
2 1 3 1 26.0 0 0 7.9250 S 2
3 1 1 1 35.0 1 0 53.1000 S 3
4 0 3 0 35.0 0 0 8.0500 S 1
In [23]:
# grid = sns.FacetGrid(train_df, col=‘Pclass‘, hue=‘Gender‘)
grid = sns.FacetGrid(train_df, row=‘Pclass‘, col=‘Sex‘, size=2.2, aspect=1.6)
grid.map(plt.hist, ‘Age‘, alpha=.5, bins=20)
grid.add_legend()
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x2a74330acf8>
 
技术分享
In [24]:
guess_ages = np.zeros((2,3))
guess_ages
Out[24]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
In [25]:
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset[‘Sex‘] == i) &                                   (dataset[‘Pclass‘] == j+1)][‘Age‘].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),                    ‘Age‘] = guess_ages[i,j]

    dataset[‘Age‘] = dataset[‘Age‘].astype(int)

train_df.head()
Out[25]:
 
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
0 0 3 0 22 1 0 7.2500 S 1
1 1 1 1 38 1 0 71.2833 C 3
2 1 3 1 26 0 0 7.9250 S 2
3 1 1 1 35 1 0 53.1000 S 3
4 0 3 0 35 0 0 8.0500 S 1
In [26]:
train_df[‘AgeBand‘] = pd.cut(train_df[‘Age‘], 5)
train_df[[‘AgeBand‘, ‘Survived‘]].groupby([‘AgeBand‘], as_index=False).mean().sort_values(by=‘AgeBand‘, ascending=True)
Out[26]:
 
 AgeBandSurvived
0 (-0.08, 16.0] 0.550000
1 (16.0, 32.0] 0.337374
2 (32.0, 48.0] 0.412037
3 (48.0, 64.0] 0.434783
4 (64.0, 80.0] 0.090909
In [27]:
for dataset in combine:    
    dataset.loc[ dataset[‘Age‘] <= 16, ‘Age‘] = 0
    dataset.loc[(dataset[‘Age‘] > 16) & (dataset[‘Age‘] <= 32), ‘Age‘] = 1
    dataset.loc[(dataset[‘Age‘] > 32) & (dataset[‘Age‘] <= 48), ‘Age‘] = 2
    dataset.loc[(dataset[‘Age‘] > 48) & (dataset[‘Age‘] <= 64), ‘Age‘] = 3
    dataset.loc[ dataset[‘Age‘] > 64, ‘Age‘]
train_df.head()
Out[27]:
 
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitleAgeBand
0 0 3 0 1 1 0 7.2500 S 1 (16.0, 32.0]
1 1 1 1 2 1 0 71.2833 C 3 (32.0, 48.0]
2 1 3 1 1 0 0 7.9250 S 2 (16.0, 32.0]
3 1 1 1 2 1 0 53.1000 S 3 (32.0, 48.0]
4 0 3 0 2 0 0 8.0500 S 1 (32.0, 48.0]
In [28]:
train_df = train_df.drop([‘AgeBand‘], axis=1)
combine = [train_df, test_df]
train_df.head()
Out[28]:
 
 SurvivedPclassSexAgeSibSpParchFareEmbarkedTitle
0 0 3 0 1 1 0 7.2500 S 1
1 1 1 1 2 1 0 71.2833 C 3
2 1 3 1 1 0 0 7.9250 S 2
3 1 1 1 2 1 0 53.1000 S 3
4 0 3 0 2 0 0 8.0500 S 1
In [29]:
for dataset in combine:
    dataset[‘FamilySize‘] = dataset[‘SibSp‘] + dataset[‘Parch‘] + 1

train_df[[‘FamilySize‘, ‘Survived‘]].groupby([‘FamilySize‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[29]:
 
 FamilySizeSurvived
3 4 0.724138
2 3 0.578431
1 2 0.552795
6 7 0.333333
0 1 0.303538
4 5 0.200000
5 6 0.136364
7 8 0.000000
8 11 0.000000
In [30]:
for dataset in combine:
    dataset[‘IsAlone‘] = 0
    dataset.loc[dataset[‘FamilySize‘] == 1, ‘IsAlone‘] = 1

train_df[[‘IsAlone‘, ‘Survived‘]].groupby([‘IsAlone‘], as_index=False).mean()
Out[30]:
 
 IsAloneSurvived
0 0 0.505650
1 1 0.303538
In [31]:
train_df = train_df.drop([‘Parch‘, ‘SibSp‘, ‘FamilySize‘], axis=1)
test_df = test_df.drop([‘Parch‘, ‘SibSp‘, ‘FamilySize‘], axis=1)
combine = [train_df, test_df]

train_df.head()
Out[31]:
 
 SurvivedPclassSexAgeFareEmbarkedTitleIsAlone
0 0 3 0 1 7.2500 S 1 0
1 1 1 1 2 71.2833 C 3 0
2 1 3 1 1 7.9250 S 2 1
3 1 1 1 2 53.1000 S 3 0
4 0 3 0 2 8.0500 S 1 1
In [32]:
for dataset in combine:
    dataset[‘Age*Class‘] = dataset.Age * dataset.Pclass

train_df.loc[:, [‘Age*Class‘, ‘Age‘, ‘Pclass‘]].head(10)
Out[32]:
 
 Age*ClassAgePclass
0 3 1 3
1 2 2 1
2 3 1 3
3 2 2 1
4 6 2 3
5 3 1 3
6 3 3 1
7 0 0 3
8 3 1 3
9 0 0 2
In [33]:
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port
Out[33]:
‘S‘
In [34]:
for dataset in combine:
    dataset[‘Embarked‘] = dataset[‘Embarked‘].fillna(freq_port)
    
train_df[[‘Embarked‘, ‘Survived‘]].groupby([‘Embarked‘], as_index=False).mean().sort_values(by=‘Survived‘, ascending=False)
Out[34]:
 
 EmbarkedSurvived
0 C 0.553571
1 Q 0.389610
2 S 0.339009
In [35]:
for dataset in combine:
    dataset[‘Embarked‘] = dataset[‘Embarked‘].map( {‘S‘: 0, ‘C‘: 1, ‘Q‘: 2} ).astype(int)

train_df.head()
Out[35]:
 
 SurvivedPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
0 0 3 0 1 7.2500 0 1 0 3
1 1 1 1 2 71.2833 1 3 0 2
2 1 3 1 1 7.9250 0 2 1 3
3 1 1 1 2 53.1000 0 3 0 2
4 0 3 0 2 8.0500 0 1 1 6
In [36]:
test_df[‘Fare‘].fillna(test_df[‘Fare‘].dropna().median(), inplace=True)
test_df.head()
Out[36]:
 
 PassengerIdPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
0 892 3 0 2 7.8292 2 1 1 6
1 893 3 1 2 7.0000 0 3 0 6
2 894 2 0 3 9.6875 2 1 1 6
3 895 3 0 1 8.6625 0 1 1 3
4 896 3 1 1 12.2875 0 3 0 3
In [37]:
train_df[‘FareBand‘] = pd.qcut(train_df[‘Fare‘], 4)
train_df[[‘FareBand‘, ‘Survived‘]].groupby([‘FareBand‘], as_index=False).mean().sort_values(by=‘FareBand‘, ascending=True)
Out[37]:
 
 FareBandSurvived
0 (-0.001, 7.91] 0.197309
1 (7.91, 14.454] 0.303571
2 (14.454, 31.0] 0.454955
3 (31.0, 512.329] 0.581081
In [38]:
for dataset in combine:
    dataset.loc[ dataset[‘Fare‘] <= 7.91, ‘Fare‘] = 0
    dataset.loc[(dataset[‘Fare‘] > 7.91) & (dataset[‘Fare‘] <= 14.454), ‘Fare‘] = 1
    dataset.loc[(dataset[‘Fare‘] > 14.454) & (dataset[‘Fare‘] <= 31), ‘Fare‘]   = 2
    dataset.loc[ dataset[‘Fare‘] > 31, ‘Fare‘] = 3
    dataset[‘Fare‘] = dataset[‘Fare‘].astype(int)

train_df = train_df.drop([‘FareBand‘], axis=1)
combine = [train_df, test_df]
    
train_df.head(10)
Out[38]:
 
 SurvivedPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
0 0 3 0 1 0 0 1 0 3
1 1 1 1 2 3 1 3 0 2
2 1 3 1 1 1 0 2 1 3
3 1 1 1 2 3 0 3 0 2
4 0 3 0 2 1 0 1 1 6
5 0 3 0 1 1 2 1 1 3
6 0 1 0 3 3 0 1 1 3
7 0 3 0 0 2 0 4 0 0
8 1 3 1 1 1 0 3 0 3
9 1 2 1 0 2 1 3 0 0
In [39]:
test_df.head(10)
Out[39]:
 
 PassengerIdPclassSexAgeFareEmbarkedTitleIsAloneAge*Class
0 892 3 0 2 0 2 1 1 6
1 893 3 1 2 0 0 3 0 6
2 894 2 0 3 1 2 1 1 6
3 895 3 0 1 1 0 1 1 3
4 896 3 1 1 1 0 3 0 3
5 897 3 0 0 1 0 1 1 0
6 898 3 1 1 0 2 2 1 3
7 899 2 0 1 2 0 1 0 2
8 900 3 1 1 0 1 3 1 3
9 901 3 0 1 2 0 1 0 3
In [40]:
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
Out[40]:
((891, 8), (891,), (418, 8))
In [41]:
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Out[41]:
80.359999999999999
In [42]:
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = [‘Feature‘]
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by=‘Correlation‘, ascending=False)
Out[42]:
 
 FeatureCorrelation
1 Sex 2.201527
5 Title 0.398234
2 Age 0.287164
4 Embarked 0.261762
6 IsAlone 0.129140
3 Fare -0.085150
7 Age*Class -0.311199
0 Pclass -0.749006
In [43]:
# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
Out[43]:
83.840000000000003
In [44]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
Out[44]:
84.739999999999995
In [45]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
Out[45]:
72.280000000000001
In [46]:
# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
Out[46]:
78.0
In [47]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
Out[47]:
79.120000000000005
In [48]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
Out[48]:
76.879999999999995
In [49]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
Out[49]:
86.760000000000005
In [50]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
Out[50]:
86.760000000000005
In [51]:
models = pd.DataFrame({
    ‘Model‘: [‘Support Vector Machines‘, ‘KNN‘, ‘Logistic Regression‘, 
              ‘Random Forest‘, ‘Naive Bayes‘, ‘Perceptron‘, 
              ‘Stochastic Gradient Decent‘, ‘Linear SVC‘, 
              ‘Decision Tree‘],
    ‘Score‘: [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by=‘Score‘, ascending=False)
Out[51]:
 
 ModelScore
3 Random Forest 86.76
8 Decision Tree 86.76
1 KNN 84.74
0 Support Vector Machines 83.84
2 Logistic Regression 80.36
7 Linear SVC 79.12
5 Perceptron 78.00
6 Stochastic Gradient Decent 76.88
4 Naive Bayes 72.28
In [52]:
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
# submission.to_csv(‘../output/submission.csv‘, index=False)
In [ ]:
 

[kaggle入门] Titanic Machine Learning from Disaster

标签:div   gradient   adr   sign   dai   dex   wfs   sort   ctr   

原文地址:http://www.cnblogs.com/daigz1224/p/6995349.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!