{
"cells": [
{
"cell_type": "markdown",
"source": [
"# fk\n",
"\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Contents\n",
"1. Define the problem (Business Understanding)\n",
"2. Understand the data (Data Understanding)\n",
" * Collect the data\n",
" * Import the data\n",
" * Inspect the dataset\n",
"3. Clean the data (Data Preparation)\n",
" * Data preprocessing\n",
" * Feature engineering\n",
"4. Build the model (Modeling)\n",
"5. Evaluate the model (Evaluation)\n",
"6. Deploy the solution (Deployment)\n",
" * Submit results to Kaggle\n",
" * Write the report\n",
"\n",
"\n",
"\n"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"# 2. Data Understanding"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 2.1 Collecting the data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Download the data from the Kaggle Titanic competition page: https://www.kaggle.com/c/titanic"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 2.2 Importing the data"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 776,
"outputs": [],
"source": [
"# Suppress warning messages\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"\n",
"# Import data-processing packages\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 777,
"outputs": [
{
"name": "stdout",
"text": [
"Training set: (891, 12) Test set: (418, 11)\n"
],
"output_type": "stream"
}
],
"source": [
"# Load the data\n",
"# Training set\n",
"train = pd.read_csv(\"C:/Users/Administrator/Desktop/ml/file/train.csv\", sep=',', encoding = \"gbk\")\n",
"# Test set\n",
"test = pd.read_csv(\"C:/Users/Administrator/Desktop/ml/file/test.csv\", sep=',', encoding = \"gbk\")\n",
"# Note that the training set has 891 rows; this count is used later to split the Kaggle test rows back out of the merged data for submission\n",
"print('Training set:', train.shape, 'Test set:', test.shape)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 778,
"outputs": [
{
"name": "stdout",
"text": [
"Rows in Kaggle training set: 891 , rows in Kaggle test set: 418\n"
],
"output_type": "stream"
}
],
"source": [
"rowNum_train=train.shape[0]\n",
"rowNum_test=test.shape[0]\n",
"print('Rows in Kaggle training set:',rowNum_train,\n",
" ', rows in Kaggle test set:',rowNum_test,)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 779,
"outputs": [
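{
"name": "stdout",
"text": [
"Merged dataset: (1309, 12)\n"
],
"output_type": "stream"
}
],
"source": [
"# Merge the two datasets so both can be cleaned together\n",
"# DataFrame.append was removed in pandas 2.0; pd.concat with sort=True keeps the alphabetical column order shown in the outputs\n",
"full = pd.concat([train, test], ignore_index=True, sort=True)\n",
"\n",
"print('Merged dataset:', full.shape)"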
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 2.3 Inspecting the dataset"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 780,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\\n0 22.0 NaN S 7.2500 \n1 38.0 C85 C 71.2833 \n2 26.0 NaN S 7.9250 \n3 35.0 C123 S 53.1000 \n4 35.0 NaN S 8.0500 \n\n Name Parch PassengerId \\\n0 Braund, Mr. Owen Harris 0 1 \n1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 \n2 Heikkinen, Miss. Laina 0 3 \n3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 \n4 Allen, Mr. William Henry 0 5 \n\n Pclass Sex SibSp Survived Ticket \n0 3 male 1 0.0 A/5 21171 \n1 1 female 1 1.0 PC 17599 \n2 3 female 0 1.0 STON/O2. 3101282 \n3 1 female 1 1.0 113803 \n4 3 male 0 0.0 373450 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Cabin</th>\n <th>Embarked</th>\n <th>Fare</th>\n <th>Name</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Pclass</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>NaN</td>\n <td>S</td>\n <td>7.2500</td>\n <td>Braund, Mr. Owen Harris</td>\n <td>0</td>\n <td>1</td>\n <td>3</td>\n <td>male</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>C85</td>\n <td>C</td>\n <td>71.2833</td>\n <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n <td>0</td>\n <td>2</td>\n <td>1</td>\n <td>female</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>NaN</td>\n <td>S</td>\n <td>7.9250</td>\n <td>Heikkinen, Miss. Laina</td>\n <td>0</td>\n <td>3</td>\n <td>3</td>\n <td>female</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 3101282</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>C123</td>\n <td>S</td>\n <td>53.1000</td>\n <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n <td>0</td>\n <td>4</td>\n <td>1</td>\n <td>female</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>NaN</td>\n <td>S</td>\n <td>8.0500</td>\n <td>Allen, Mr. William Henry</td>\n <td>0</td>\n <td>5</td>\n <td>3</td>\n <td>male</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 780
}
],
"source": [
"# Preview the first rows\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 781,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Pclass \\\ncount 1046.000000 1308.000000 1309.000000 1309.000000 1309.000000 \nmean 29.881138 33.295479 0.385027 655.000000 2.294882 \nstd 14.413493 51.758668 0.865560 378.020061 0.837836 \nmin 0.170000 0.000000 0.000000 1.000000 1.000000 \n25% 21.000000 7.895800 0.000000 328.000000 2.000000 \n50% 28.000000 14.454200 0.000000 655.000000 3.000000 \n75% 39.000000 31.275000 0.000000 982.000000 3.000000 \nmax 80.000000 512.329200 9.000000 1309.000000 3.000000 \n\n SibSp Survived \ncount 1309.000000 891.000000 \nmean 0.498854 0.383838 \nstd 1.041658 0.486592 \nmin 0.000000 0.000000 \n25% 0.000000 0.000000 \n50% 0.000000 0.000000 \n75% 1.000000 1.000000 \nmax 8.000000 1.000000 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Fare</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Pclass</th>\n <th>SibSp</th>\n <th>Survived</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>count</td>\n <td>1046.000000</td>\n <td>1308.000000</td>\n <td>1309.000000</td>\n <td>1309.000000</td>\n <td>1309.000000</td>\n <td>1309.000000</td>\n <td>891.000000</td>\n </tr>\n <tr>\n <td>mean</td>\n <td>29.881138</td>\n <td>33.295479</td>\n <td>0.385027</td>\n <td>655.000000</td>\n <td>2.294882</td>\n <td>0.498854</td>\n <td>0.383838</td>\n </tr>\n <tr>\n <td>std</td>\n <td>14.413493</td>\n <td>51.758668</td>\n <td>0.865560</td>\n <td>378.020061</td>\n <td>0.837836</td>\n <td>1.041658</td>\n <td>0.486592</td>\n </tr>\n <tr>\n <td>min</td>\n <td>0.170000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <td>25%</td>\n <td>21.000000</td>\n <td>7.895800</td>\n <td>0.000000</td>\n <td>328.000000</td>\n <td>2.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <td>50%</td>\n <td>28.000000</td>\n <td>14.454200</td>\n <td>0.000000</td>\n <td>655.000000</td>\n <td>3.000000</td>\n <td>0.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <td>75%</td>\n <td>39.000000</td>\n <td>31.275000</td>\n <td>0.000000</td>\n <td>982.000000</td>\n <td>3.000000</td>\n <td>1.000000</td>\n <td>1.000000</td>\n </tr>\n <tr>\n <td>max</td>\n <td>80.000000</td>\n <td>512.329200</td>\n <td>9.000000</td>\n <td>1309.000000</td>\n <td>3.000000</td>\n <td>8.000000</td>\n <td>1.000000</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 781
}
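],
"source": [
"'''\n",
"describe only reports descriptive statistics for numeric columns; columns of other types, such as the string columns Name and Cabin, are not shown.\n",
"This makes sense: descriptive statistics are computed from numbers, so the column must hold numeric data.\n",
"'''\n",
"# Descriptive statistics for the numeric columns\n",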
"full.describe()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 782,
"outputs": [
{
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1309 entries, 0 to 1308\n",
"Data columns (total 12 columns):\n",
"Age 1046 non-null float64\n",
"Cabin 295 non-null object\n",
"Embarked 1307 non-null object\n",
"Fare 1308 non-null float64\n",
"Name 1309 non-null object\n",
"Parch 1309 non-null int64\n",
"PassengerId 1309 non-null int64\n",
"Pclass 1309 non-null int64\n",
"Sex 1309 non-null object\n",
"SibSp 1309 non-null int64\n",
"Survived 891 non-null float64\n",
"Ticket 1309 non-null object\n",
"dtypes: float64(3), int64(4), object(5)\n",
"memory usage: 122.8+ KB\n"
],
"output_type": "stream"
},
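{
"data": {
"text/plain": "'\\nThe dataset has 1309 rows in total.\\nNumeric columns with missing values: Age and Fare:\\n1) Age has 1046 non-null values, so 1309-1046=263 are missing, a missing rate of 263/1309=20%\\n2) Fare has 1308 non-null values, so 1 value is missing\\n\\nString columns:\\n1) Embarked has 1307 non-null values; only 2 are missing, which is few\\n2) Cabin has 295 non-null values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is large\\nThis tells us where to focus in data cleaning: only by knowing which columns have missing values can we handle them in a targeted way.\\n'"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 782
}
],
"source": [
"# Check each column's data type and non-null count\n",
"full.info()\n",
"'''\n",
"The dataset has 1309 rows in total.\n",
"Numeric columns with missing values: Age and Fare:\n",
"1) Age has 1046 non-null values, so 1309-1046=263 are missing, a missing rate of 263/1309=20%\n",
"2) Fare has 1308 non-null values, so 1 value is missing\n",
"\n",
"String columns:\n",
"1) Embarked has 1307 non-null values; only 2 are missing, which is few\n",
"2) Cabin has 295 non-null values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is large\n",
"This tells us where to focus in data cleaning: only by knowing which columns have missing values can we handle them in a targeted way.\n",
"'''"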
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"# 3. Data Cleaning (Data Preparation)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 3.1 Data preprocessing"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Handling missing values"
],
"metadata": {
"collapsed": false
}
},
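{
"cell_type": "markdown",
"source": [
"In the data-understanding stage above, we found that the dataset has 1309 rows in total.\n",
"Numeric columns with missing values: Age and Fare.\n",
"String columns with missing values: Embarked and Cabin.\n",
"\n",
"This tells us where to focus in data cleaning: only by knowing which columns have missing values can we handle them in a targeted way.\n",
"\n",
"Many machine learning algorithms require that the features passed in for training contain no null values.\n",
"\n",
"Common ways to fill missing values:\n",
"\n",
"1. For numeric data, replace with the mean\n",
"2. For categorical data, replace with the most frequent category\n",
"3. Predict the missing values with a model, for example K-NN"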
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 783,
"outputs": [
{
"name": "stdout",
"text": [
"Before:\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1309 entries, 0 to 1308\n",
"Data columns (total 12 columns):\n",
"Age 1046 non-null float64\n",
"Cabin 295 non-null object\n",
"Embarked 1307 non-null object\n",
"Fare 1308 non-null float64\n",
"Name 1309 non-null object\n",
"Parch 1309 non-null int64\n",
"PassengerId 1309 non-null int64\n",
"Pclass 1309 non-null int64\n",
"Sex 1309 non-null object\n",
"SibSp 1309 non-null int64\n",
"Survived 891 non-null float64\n",
"Ticket 1309 non-null object\n",
"dtypes: float64(3), int64(4), object(5)\n",
"memory usage: 122.8+ KB\n",
"After:\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1309 entries, 0 to 1308\n",
"Data columns (total 12 columns):\n",
"Age 1309 non-null float64\n",
"Cabin 295 non-null object\n",
"Embarked 1307 non-null object\n",
"Fare 1309 non-null float64\n",
"Name 1309 non-null object\n",
"Parch 1309 non-null int64\n",
"PassengerId 1309 non-null int64\n",
"Pclass 1309 non-null int64\n",
"Sex 1309 non-null object\n",
"SibSp 1309 non-null int64\n",
"Survived 891 non-null float64\n",
"Ticket 1309 non-null object\n",
"dtypes: float64(3), int64(4), object(5)\n",
"memory usage: 122.8+ KB\n"
],
"output_type": "stream"
}
],
"source": [
"'''\n",
"The dataset has 1309 rows in total.\n",
"Numeric columns with missing values: Age and Fare:\n",
"1) Age has 1046 non-null values, so 1309-1046=263 are missing, a missing rate of 263/1309=20%\n",
"2) Fare has 1308 non-null values, so 1 value is missing\n",
"\n",
"For numeric columns, the simplest way to handle missing values is to fill them with the mean\n",
"'''\n",
"print('Before:')\n",
"full.info()\n",
"# Age\n",
"full['Age']=full['Age'].fillna( full['Age'].mean() )\n",
"# Fare\n",
"full['Fare'] = full['Fare'].fillna( full['Fare'].mean() )\n",
"print('After:')\n",
"full.info()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 784,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\\n0 22.0 NaN S 7.2500 \n1 38.0 C85 C 71.2833 \n2 26.0 NaN S 7.9250 \n3 35.0 C123 S 53.1000 \n4 35.0 NaN S 8.0500 \n\n Name Parch PassengerId \\\n0 Braund, Mr. Owen Harris 0 1 \n1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 \n2 Heikkinen, Miss. Laina 0 3 \n3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 \n4 Allen, Mr. William Henry 0 5 \n\n Pclass Sex SibSp Survived Ticket \n0 3 male 1 0.0 A/5 21171 \n1 1 female 1 1.0 PC 17599 \n2 3 female 0 1.0 STON/O2. 3101282 \n3 1 female 1 1.0 113803 \n4 3 male 0 0.0 373450 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Cabin</th>\n <th>Embarked</th>\n <th>Fare</th>\n <th>Name</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Pclass</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>NaN</td>\n <td>S</td>\n <td>7.2500</td>\n <td>Braund, Mr. Owen Harris</td>\n <td>0</td>\n <td>1</td>\n <td>3</td>\n <td>male</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>C85</td>\n <td>C</td>\n <td>71.2833</td>\n <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n <td>0</td>\n <td>2</td>\n <td>1</td>\n <td>female</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>NaN</td>\n <td>S</td>\n <td>7.9250</td>\n <td>Heikkinen, Miss. Laina</td>\n <td>0</td>\n <td>3</td>\n <td>3</td>\n <td>female</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 3101282</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>C123</td>\n <td>S</td>\n <td>53.1000</td>\n <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n <td>0</td>\n <td>4</td>\n <td>1</td>\n <td>female</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>NaN</td>\n <td>S</td>\n <td>8.0500</td>\n <td>Allen, Mr. William Henry</td>\n <td>0</td>\n <td>5</td>\n <td>3</td>\n <td>male</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 784
}
],
"source": [
"# Check that the fill worked as expected\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 785,
"outputs": [
{
"data": {
"text/plain": "0 S\n1 C\n2 S\n3 S\n4 S\nName: Embarked, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 785
}
],
"source": [
"'''\n",
"The dataset has 1309 rows in total.\n",
"String columns:\n",
"1) Embarked has 1307 non-null values; only 2 are missing, which is few\n",
"2) Cabin has 295 non-null values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is large\n",
"'''\n",
"# Embarked: look at what the values look like\n",
"'''\n",
"Departure port: S = Southampton, England\n",
"Stop 1: C = Cherbourg, France\n",
"Stop 2: Q = Queenstown, Ireland\n",
"'''\n",
"full['Embarked'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 786,
"outputs": [
{
"data": {
"text/plain": "S 914\nC 270\nQ 123\nName: Embarked, dtype: int64"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 786
}
],
"source": [
"'''\n",
"Embarked is a categorical variable: find its most frequent category and use it to fill the gaps\n",
"'''\n",
"full['Embarked'].value_counts()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 787,
"outputs": [],
"source": [
"'''\n",
"The counts show that S is the most common category, so we fill the missing values with it:\n",
"S = Southampton, England\n",
"'''\n",
"full['Embarked'] = full['Embarked'].fillna( 'S' )"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 788,
"outputs": [
{
"data": {
"text/plain": "0 NaN\n1 C85\n2 NaN\n3 C123\n4 NaN\nName: Cabin, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 788
}
],
"source": [
"# Cabin: look at what the values look like\n",
"full['Cabin'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 789,
"outputs": [],
"source": [
"# Cabin has many missing values, so fill them with U for Unknown\n",
"full['Cabin'] = full['Cabin'].fillna( 'U' )"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 790,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\\n0 22.0 U S 7.2500 \n1 38.0 C85 C 71.2833 \n2 26.0 U S 7.9250 \n3 35.0 C123 S 53.1000 \n4 35.0 U S 8.0500 \n\n Name Parch PassengerId \\\n0 Braund, Mr. Owen Harris 0 1 \n1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 \n2 Heikkinen, Miss. Laina 0 3 \n3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 \n4 Allen, Mr. William Henry 0 5 \n\n Pclass Sex SibSp Survived Ticket \n0 3 male 1 0.0 A/5 21171 \n1 1 female 1 1.0 PC 17599 \n2 3 female 0 1.0 STON/O2. 3101282 \n3 1 female 1 1.0 113803 \n4 3 male 0 0.0 373450 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Cabin</th>\n <th>Embarked</th>\n <th>Fare</th>\n <th>Name</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Pclass</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>U</td>\n <td>S</td>\n <td>7.2500</td>\n <td>Braund, Mr. Owen Harris</td>\n <td>0</td>\n <td>1</td>\n <td>3</td>\n <td>male</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>C85</td>\n <td>C</td>\n <td>71.2833</td>\n <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n <td>0</td>\n <td>2</td>\n <td>1</td>\n <td>female</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>U</td>\n <td>S</td>\n <td>7.9250</td>\n <td>Heikkinen, Miss. Laina</td>\n <td>0</td>\n <td>3</td>\n <td>3</td>\n <td>female</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 3101282</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>C123</td>\n <td>S</td>\n <td>53.1000</td>\n <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n <td>0</td>\n <td>4</td>\n <td>1</td>\n <td>female</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>U</td>\n <td>S</td>\n <td>8.0500</td>\n <td>Allen, Mr. William Henry</td>\n <td>0</td>\n <td>5</td>\n <td>3</td>\n <td>male</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 790
}
],
"source": [
"# Check that the fill worked as expected\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 791,
"outputs": [
{
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1309 entries, 0 to 1308\n",
"Data columns (total 12 columns):\n",
"Age 1309 non-null float64\n",
"Cabin 1309 non-null object\n",
"Embarked 1309 non-null object\n",
"Fare 1309 non-null float64\n",
"Name 1309 non-null object\n",
"Parch 1309 non-null int64\n",
"PassengerId 1309 non-null int64\n",
"Pclass 1309 non-null int64\n",
"Sex 1309 non-null object\n",
"SibSp 1309 non-null int64\n",
"Survived 891 non-null float64\n",
"Ticket 1309 non-null object\n",
"dtypes: float64(3), int64(4), object(5)\n",
"memory usage: 122.8+ KB\n"
],
"output_type": "stream"
}
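],
"source": [
"# Check the final state of missing-value handling. Survived is our label, the column the model will predict, so its 418 missing values (the Kaggle test rows) are left as they are\n",
"full.info()"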
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 3.2 Feature extraction"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### 3.2.1 Classifying the data types"
],
"metadata": {
"collapsed": false
}
},
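{
"cell_type": "markdown",
"source": [
"Review the column types, which fall into three kinds (numeric, time series, categorical). Then process the categorical data: replace categories with numbers and apply one-hot encoding."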
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 792,
"outputs": [
{
"name": "stdout",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 1309 entries, 0 to 1308\n",
"Data columns (total 12 columns):\n",
"Age 1309 non-null float64\n",
"Cabin 1309 non-null object\n",
"Embarked 1309 non-null object\n",
"Fare 1309 non-null float64\n",
"Name 1309 non-null object\n",
"Parch 1309 non-null int64\n",
"PassengerId 1309 non-null int64\n",
"Pclass 1309 non-null int64\n",
"Sex 1309 non-null object\n",
"SibSp 1309 non-null int64\n",
"Survived 891 non-null float64\n",
"Ticket 1309 non-null object\n",
"dtypes: float64(3), int64(4), object(5)\n",
"memory usage: 122.8+ KB\n"
],
"output_type": "stream"
}
],
"source": [
"'''\n",
"1. Numeric data:\n",
"PassengerId, Age, Fare, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch)\n",
"2. Time series: none\n",
"3. Categorical data:\n",
"1) Columns with explicit categories\n",
"Sex: male, female\n",
"Embarked: departure port S = Southampton, England; stop 1: C = Cherbourg, France; stop 2: Q = Queenstown, Ireland\n",
"Pclass (cabin class): 1 = 1st class, 2 = 2nd class, 3 = 3rd class\n",
"2) String columns: features may be extracted from these, so they are also treated as categorical data\n",
"Name\n",
"Cabin\n",
"Ticket\n",
"'''\n",
"full.info()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### 3.2.1 Categorical data with explicit categories"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"1. Sex: male, female\n",
"2. Embarked: departure port S = Southampton, England; stop 1: C = Cherbourg, France; stop 2: Q = Queenstown, Ireland\n",
"3. Pclass (cabin class): 1 = 1st class, 2 = 2nd class, 3 = 3rd class"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Sex"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 793,
"outputs": [
{
"data": {
"text/plain": "0 male\n1 female\n2 female\n3 female\n4 male\nName: Sex, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 793
}
],
"source": [
"# Look at the Sex column\n",
"full['Sex'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 794,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Embarked Fare \\\n0 22.0 U S 7.2500 \n1 38.0 C85 C 71.2833 \n2 26.0 U S 7.9250 \n3 35.0 C123 S 53.1000 \n4 35.0 U S 8.0500 \n\n Name Parch PassengerId \\\n0 Braund, Mr. Owen Harris 0 1 \n1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 \n2 Heikkinen, Miss. Laina 0 3 \n3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 4 \n4 Allen, Mr. William Henry 0 5 \n\n Pclass Sex SibSp Survived Ticket \n0 3 1 1 0.0 A/5 21171 \n1 1 0 1 1.0 PC 17599 \n2 3 0 0 1.0 STON/O2. 3101282 \n3 1 0 1 1.0 113803 \n4 3 1 0 0.0 373450 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Cabin</th>\n <th>Embarked</th>\n <th>Fare</th>\n <th>Name</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Pclass</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>U</td>\n <td>S</td>\n <td>7.2500</td>\n <td>Braund, Mr. Owen Harris</td>\n <td>0</td>\n <td>1</td>\n <td>3</td>\n <td>1</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>C85</td>\n <td>C</td>\n <td>71.2833</td>\n <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n <td>0</td>\n <td>2</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>U</td>\n <td>S</td>\n <td>7.9250</td>\n <td>Heikkinen, Miss. Laina</td>\n <td>0</td>\n <td>3</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 3101282</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>C123</td>\n <td>S</td>\n <td>53.1000</td>\n <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n <td>0</td>\n <td>4</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>U</td>\n <td>S</td>\n <td>8.0500</td>\n <td>Allen, Mr. William Henry</td>\n <td>0</td>\n <td>5</td>\n <td>3</td>\n <td>1</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 794
}
],
"source": [
"'''\n",
"Map the Sex values to numbers:\n",
"male maps to 1, female maps to 0\n",
"'''\n",
"sex_mapDict={'male':1,\n",
" 'female':0}\n",
"# Series.map replaces each value using the given mapping (here a dict; a function also works)\n",
"full['Sex']=full['Sex'].map(sex_mapDict)\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"#### Port of embarkation (Embarked)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 795,
"outputs": [
{
"data": {
"text/plain": "0 S\n1 C\n2 S\n3 S\n4 S\nName: Embarked, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 795
}
],
"source": [
"'''\n",
"Embarked takes these values:\n",
"Departure port: S = Southampton, England\n",
"Stop 1: C = Cherbourg, France\n",
"Stop 2: Q = Queenstown, Ireland\n",
"'''\n",
"# Look at this column\n",
"full['Embarked'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 796,
"outputs": [
{
"data": {
"text/plain": " Embarked_C Embarked_Q Embarked_S\n0 0 0 1\n1 1 0 0\n2 0 0 1\n3 0 0 1\n4 0 0 1",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Embarked_C</th>\n <th>Embarked_Q</th>\n <th>Embarked_S</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>4</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 796
}
],
"source": [
"# Holds the extracted features\n",
"embarkedDf = pd.DataFrame()\n",
"\n",
"'''\n",
"Use get_dummies for one-hot encoding to produce dummy variables; the column-name prefix is Embarked\n",
"'''\n",
"embarkedDf = pd.get_dummies( full['Embarked'] , prefix='Embarked' )\n",
"embarkedDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 797,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Fare Name \\\n0 22.0 U 7.2500 Braund, Mr. Owen Harris \n1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... \n2 26.0 U 7.9250 Heikkinen, Miss. Laina \n3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) \n4 35.0 U 8.0500 Allen, Mr. William Henry \n\n Parch PassengerId Pclass Sex SibSp Survived Ticket \\\n0 0 1 3 1 1 0.0 A/5 21171 \n1 0 2 1 0 1 1.0 PC 17599 \n2 0 3 3 0 0 1.0 STON/O2. 3101282 \n3 0 4 1 0 1 1.0 113803 \n4 0 5 3 1 0 0.0 373450 \n\n Embarked_C Embarked_Q Embarked_S \n0 0 0 1 \n1 1 0 0 \n2 0 0 1 \n3 0 0 1 \n4 0 0 1 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Cabin</th>\n <th>Fare</th>\n <th>Name</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Pclass</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n <th>Embarked_C</th>\n <th>Embarked_Q</th>\n <th>Embarked_S</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>U</td>\n <td>7.2500</td>\n <td>Braund, Mr. Owen Harris</td>\n <td>0</td>\n <td>1</td>\n <td>3</td>\n <td>1</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>C85</td>\n <td>71.2833</td>\n <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n <td>0</td>\n <td>2</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>U</td>\n <td>7.9250</td>\n <td>Heikkinen, Miss. Laina</td>\n <td>0</td>\n <td>3</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 3101282</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>C123</td>\n <td>53.1000</td>\n <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n <td>0</td>\n <td>4</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>U</td>\n <td>8.0500</td>\n <td>Allen, Mr. William Henry</td>\n <td>0</td>\n <td>5</td>\n <td>3</td>\n <td>1</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 797
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset full\n",
"full = pd.concat([full,embarkedDf],axis=1)\n",
"\n",
"'''\n",
"Embarked has already been one-hot encoded into dummy variables,\n",
"so the original Embarked column is dropped here\n",
"'''\n",
"full.drop('Embarked',axis=1,inplace=True)\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"#### Cabin class (Pclass)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 798,
"outputs": [
{
"data": {
"text/plain": " Pclass_1 Pclass_2 Pclass_3\n0 0 0 1\n1 1 0 0\n2 0 0 1\n3 1 0 0\n4 0 0 1",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Pclass_1</th>\n <th>Pclass_2</th>\n <th>Pclass_3</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>3</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 798
}
],
"source": [
"# Passenger class (Pclass): 1 = first class, 2 = second class, 3 = third class\n",
"\n",
"# One-hot encode with get_dummies; the column-name prefix is Pclass\n",
"pclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass')\n",
"pclassDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 799,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Fare Name \\\n0 22.0 U 7.2500 Braund, Mr. Owen Harris \n1 38.0 C85 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... \n2 26.0 U 7.9250 Heikkinen, Miss. Laina \n3 35.0 C123 53.1000 Futrelle, Mrs. Jacques Heath (Lily May Peel) \n4 35.0 U 8.0500 Allen, Mr. William Henry \n\n Parch PassengerId Sex SibSp Survived Ticket Embarked_C \\\n0 0 1 1 1 0.0 A/5 21171 0 \n1 0 2 0 1 1.0 PC 17599 1 \n2 0 3 0 0 1.0 STON/O2. 3101282 0 \n3 0 4 0 1 1.0 113803 0 \n4 0 5 1 0 0.0 373450 0 \n\n Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3 \n0 0 1 0 0 1 \n1 0 0 1 0 0 \n2 0 1 0 0 1 \n3 0 1 1 0 0 \n4 0 1 0 0 1 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Cabin</th>\n <th>Fare</th>\n <th>Name</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n <th>Embarked_C</th>\n <th>Embarked_Q</th>\n <th>Embarked_S</th>\n <th>Pclass_1</th>\n <th>Pclass_2</th>\n <th>Pclass_3</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>U</td>\n <td>7.2500</td>\n <td>Braund, Mr. Owen Harris</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>C85</td>\n <td>71.2833</td>\n <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>U</td>\n <td>7.9250</td>\n <td>Heikkinen, Miss. Laina</td>\n <td>0</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 3101282</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>C123</td>\n <td>53.1000</td>\n <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n <td>0</td>\n <td>4</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>U</td>\n <td>8.0500</td>\n <td>Allen, Mr. 
William Henry</td>\n <td>0</td>\n <td>5</td>\n <td>1</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 799
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset `full`\n",
"full = pd.concat([full, pclassDf], axis=1)\n",
"\n",
"# Drop the original Pclass column\n",
"full.drop('Pclass', axis=1, inplace=True)\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### 3.2.1 Categorical data: string type"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
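"String-typed columns: features can often be extracted from them, so they are also treated as categorical data. In this dataset they are:\n",
"\n",
"1. Passenger name (Name)\n",
"2. Cabin number (Cabin)\n",
"3. Ticket number (Ticket)"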
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"### Extracting the title from the name"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 800,
"outputs": [
{
"data": {
"text/plain": "0 Braund, Mr. Owen Harris\n1 Cumings, Mrs. John Bradley (Florence Briggs Th...\n2 Heikkinen, Miss. Laina\n3 Futrelle, Mrs. Jacques Heath (Lily May Peel)\n4 Allen, Mr. William Henry\nName: Name, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 800
}
],
"source": [
"# Look at the Name column.\n",
"# Every passenger name contains a form of address (a title). Extracting it\n",
"# gives a new variable that is likely to help with prediction. For example:\n",
"#   Braund, Mr. Owen Harris\n",
"#   Heikkinen, Miss. Laina\n",
"#   Oliva y Ocana, Dona. Fermina\n",
"#   Peter, Master. Michael J\n",
"full['Name'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 801,
"outputs": [],
"source": [
"# Practice extracting the title (for example, Mr) from a single string.\n",
"# In 'Braund, Mr. Owen Harris' the surname comes before the comma,\n",
"# and 'title. given names' comes after it.\n",
"name1 = 'Braund, Mr. Owen Harris'\n",
"\n",
"# split() breaks a string on a separator and returns a list.\n",
"# Splitting on ',' gives ['Braund', ' Mr. Owen Harris']; element 1 holds the title.\n",
"str1 = name1.split(',')[1]  # ' Mr. Owen Harris'\n",
"\n",
"# Splitting that part on '.' gives [' Mr', ' Owen Harris']; element 0 is the title.\n",
"str2 = str1.split('.')[0]  # ' Mr'\n",
"\n",
"# strip() removes the given characters (whitespace by default) from both ends.\n",
"str3 = str2.strip()  # 'Mr'"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 802,
"outputs": [],
"source": [
"# Extract the title from a passenger name\n",
"def getTitle(name):\n",
"    str1 = name.split(',')[1]  # ' Mr. Owen Harris'\n",
"    str2 = str1.split('.')[0]  # ' Mr'\n",
"    # strip() removes the given characters (whitespace by default) from both ends\n",
"    str3 = str2.strip()\n",
"    return str3"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 803,
"outputs": [
{
"data": {
"text/plain": " Title\n0 Mr\n1 Mrs\n2 Miss\n3 Mrs\n4 Mr",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Title</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>Mr</td>\n </tr>\n <tr>\n <td>1</td>\n <td>Mrs</td>\n </tr>\n <tr>\n <td>2</td>\n <td>Miss</td>\n </tr>\n <tr>\n <td>3</td>\n <td>Mrs</td>\n </tr>\n <tr>\n <td>4</td>\n <td>Mr</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 803
}
],
"source": [
"# DataFrame that holds the extracted feature\n",
"titleDf = pd.DataFrame()\n",
"# Series.map applies the given function to every element of the Series\n",
"titleDf['Title'] = full['Name'].map(getTitle)\n",
"titleDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 804,
"outputs": [
{
"data": {
"text/plain": " Master Miss Mr Mrs Officer Royalty\n0 0 0 1 0 0 0\n1 0 0 0 1 0 0\n2 0 1 0 0 0 0\n3 0 0 0 1 0 0\n4 0 0 1 0 0 0",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Master</th>\n <th>Miss</th>\n <th>Mr</th>\n <th>Mrs</th>\n <th>Officer</th>\n <th>Royalty</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 804
}
],
"source": [
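"# Title categories used here:\n",
"#   Officer: military officers and professionals (Capt, Col, Major, Dr, Rev)\n",
"#   Royalty: nobility and royalty\n",
"#   Mr:      adult man\n",
"#   Mrs:     married woman\n",
"#   Miss:    unmarried woman or girl\n",
"#   Master:  young boy\n",
"\n",
"# Mapping from the raw title string in the name to a title category\n",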
"title_mapDict = {\n",
" \"Capt\": \"Officer\",\n",
" \"Col\": \"Officer\",\n",
" \"Major\": \"Officer\",\n",
" \"Jonkheer\": \"Royalty\",\n",
" \"Don\": \"Royalty\",\n",
" \"Sir\" : \"Royalty\",\n",
" \"Dr\": \"Officer\",\n",
" \"Rev\": \"Officer\",\n",
" \"the Countess\":\"Royalty\",\n",
" \"Dona\": \"Royalty\",\n",
" \"Mme\": \"Mrs\",\n",
" \"Mlle\": \"Miss\",\n",
" \"Ms\": \"Mrs\",\n",
" \"Mr\" : \"Mr\",\n",
" \"Mrs\" : \"Mrs\",\n",
" \"Miss\" : \"Miss\",\n",
" \"Master\" : \"Master\",\n",
" \"Lady\" : \"Royalty\"\n",
" }\n",
"\n",
"# With a dict, Series.map replaces each value by its dict entry (values not in the dict become NaN)\n",
"titleDf['Title'] = titleDf['Title'].map(title_mapDict)\n",
"\n",
"# One-hot encode with get_dummies\n",
"titleDf = pd.get_dummies(titleDf['Title'])\n",
"titleDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 805,
"outputs": [
{
"data": {
"text/plain": " Age Cabin Fare Parch PassengerId Sex SibSp Survived \\\n0 22.0 U 7.2500 0 1 1 1 0.0 \n1 38.0 C85 71.2833 0 2 0 1 1.0 \n2 26.0 U 7.9250 0 3 0 0 1.0 \n3 35.0 C123 53.1000 0 4 0 1 1.0 \n4 35.0 U 8.0500 0 5 1 0 0.0 \n\n Ticket Embarked_C ... Embarked_S Pclass_1 Pclass_2 \\\n0 A/5 21171 0 ... 1 0 0 \n1 PC 17599 1 ... 0 1 0 \n2 STON/O2. 3101282 0 ... 1 0 0 \n3 113803 0 ... 1 1 0 \n4 373450 0 ... 1 0 0 \n\n Pclass_3 Master Miss Mr Mrs Officer Royalty \n0 1 0 0 1 0 0 0 \n1 0 0 0 0 1 0 0 \n2 1 0 1 0 0 0 0 \n3 0 0 0 0 1 0 0 \n4 1 0 0 1 0 0 0 \n\n[5 rows x 21 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Cabin</th>\n <th>Fare</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n <th>Embarked_C</th>\n <th>...</th>\n <th>Embarked_S</th>\n <th>Pclass_1</th>\n <th>Pclass_2</th>\n <th>Pclass_3</th>\n <th>Master</th>\n <th>Miss</th>\n <th>Mr</th>\n <th>Mrs</th>\n <th>Officer</th>\n <th>Royalty</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>U</td>\n <td>7.2500</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n <td>0</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>C85</td>\n <td>71.2833</td>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>U</td>\n <td>7.9250</td>\n <td>0</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 
3101282</td>\n <td>0</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>C123</td>\n <td>53.1000</td>\n <td>0</td>\n <td>4</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n <td>0</td>\n <td>...</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>U</td>\n <td>8.0500</td>\n <td>0</td>\n <td>5</td>\n <td>1</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n <td>0</td>\n <td>...</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 21 columns</p>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 805
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset `full`\n",
"full = pd.concat([full, titleDf], axis=1)\n",
"\n",
"# Drop the original Name column\n",
"full.drop('Name', axis=1, inplace=True)\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### Extracting the cabin category from the cabin number"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 806,
"outputs": [
{
"name": "stdout",
"text": [
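"Sum: 30\n"
],
"output_type": "stream"
}
],
"source": [
"# Background: anonymous functions\n",
"# Python uses lambda to create anonymous functions, i.e. functions defined\n",
"# without a standard def statement. The syntax is:\n",
"#   lambda arg1, arg2: expression\n",
"\n",
"# Define a lambda that adds two numbers (named add to avoid shadowing the built-in sum)\n",
"add = lambda a, b: a + b\n",
"\n",
"# Call the lambda\n",
"print(\"Sum:\", add(10, 20))"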
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 807,
"outputs": [
{
"data": {
"text/plain": "0 U\n1 C85\n2 U\n3 C123\n4 U\nName: Cabin, dtype: object"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 807
}
],
"source": [
"# The first letter of the cabin number is the cabin category\n",
"# Look at the cabin numbers\n",
"full['Cabin'].head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 808,
"outputs": [
{
"data": {
"text/plain": " Cabin_A Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T \\\n0 0 0 0 0 0 0 0 0 \n1 0 0 1 0 0 0 0 0 \n2 0 0 0 0 0 0 0 0 \n3 0 0 1 0 0 0 0 0 \n4 0 0 0 0 0 0 0 0 \n\n Cabin_U \n0 1 \n1 0 \n2 1 \n3 0 \n4 1 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Cabin_A</th>\n <th>Cabin_B</th>\n <th>Cabin_C</th>\n <th>Cabin_D</th>\n <th>Cabin_E</th>\n <th>Cabin_F</th>\n <th>Cabin_G</th>\n <th>Cabin_T</th>\n <th>Cabin_U</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 808
}
],
"source": [
"# The cabin category is the first letter of the cabin number,\n",
"# e.g. C85 maps to category C\n",
"full['Cabin'] = full['Cabin'].map(lambda c: c[0])\n",
"\n",
"# One-hot encode with get_dummies; the column-name prefix is Cabin\n",
"cabinDf = pd.get_dummies(full['Cabin'], prefix='Cabin')\n",
"\n",
"cabinDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 809,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Sex SibSp Survived Ticket \\\n0 22.0 7.2500 0 1 1 1 0.0 A/5 21171 \n1 38.0 71.2833 0 2 0 1 1.0 PC 17599 \n2 26.0 7.9250 0 3 0 0 1.0 STON/O2. 3101282 \n3 35.0 53.1000 0 4 0 1 1.0 113803 \n4 35.0 8.0500 0 5 1 0 0.0 373450 \n\n Embarked_C Embarked_Q ... Royalty Cabin_A Cabin_B Cabin_C Cabin_D \\\n0 0 0 ... 0 0 0 0 0 \n1 1 0 ... 0 0 0 1 0 \n2 0 0 ... 0 0 0 0 0 \n3 0 0 ... 0 0 0 1 0 \n4 0 0 ... 0 0 0 0 0 \n\n Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U \n0 0 0 0 0 1 \n1 0 0 0 0 0 \n2 0 0 0 0 1 \n3 0 0 0 0 0 \n4 0 0 0 0 1 \n\n[5 rows x 29 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Fare</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n <th>Embarked_C</th>\n <th>Embarked_Q</th>\n <th>...</th>\n <th>Royalty</th>\n <th>Cabin_A</th>\n <th>Cabin_B</th>\n <th>Cabin_C</th>\n <th>Cabin_D</th>\n <th>Cabin_E</th>\n <th>Cabin_F</th>\n <th>Cabin_G</th>\n <th>Cabin_T</th>\n <th>Cabin_U</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>7.2500</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>71.2833</td>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n <td>1</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>7.9250</td>\n <td>0</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 
3101282</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>53.1000</td>\n <td>0</td>\n <td>4</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>8.0500</td>\n <td>0</td>\n <td>5</td>\n <td>1</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 29 columns</p>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 809
}
],
"source": [
"# Add the dummy variables produced by one-hot encoding to the Titanic dataset `full`\n",
"full = pd.concat([full, cabinDf], axis=1)\n",
"\n",
"# Drop the original Cabin column\n",
"full.drop('Cabin', axis=1, inplace=True)\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"### Building family size and family category"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 810,
"outputs": [
{
"data": {
"text/plain": " FamilySize Family_Single Family_Small Family_Large\n0 2 0 1 0\n1 2 0 1 0\n2 1 1 0 0\n3 2 0 1 0\n4 1 1 0 0",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>FamilySize</th>\n <th>Family_Single</th>\n <th>Family_Small</th>\n <th>Family_Large</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>1</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>3</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 810
}
],
"source": [
"# DataFrame that holds the family features\n",
"familyDf = pd.DataFrame()\n",
"\n",
"# Family size = parents/children aboard (Parch) + siblings/spouses aboard (SibSp) + 1\n",
"# (the 1 counts the passenger, who is also a member of the family)\n",
"familyDf['FamilySize'] = full['Parch'] + full['SibSp'] + 1\n",
"\n",
"# Family categories:\n",
"#   Family_Single: family size == 1\n",
"#   Family_Small:  2 <= family size <= 4\n",
"#   Family_Large:  family size >= 5\n",
"# A conditional expression `x if cond else y` returns x when cond is true, otherwise y\n",
"familyDf['Family_Single'] = familyDf['FamilySize'].map(lambda s: 1 if s == 1 else 0)\n",
"familyDf['Family_Small'] = familyDf['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)\n",
"familyDf['Family_Large'] = familyDf['FamilySize'].map(lambda s: 1 if 5 <= s else 0)\n",
"\n",
"familyDf.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 811,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Sex SibSp Survived Ticket \\\n0 22.0 7.2500 0 1 1 1 0.0 A/5 21171 \n1 38.0 71.2833 0 2 0 1 1.0 PC 17599 \n2 26.0 7.9250 0 3 0 0 1.0 STON/O2. 3101282 \n3 35.0 53.1000 0 4 0 1 1.0 113803 \n4 35.0 8.0500 0 5 1 0 0.0 373450 \n\n Embarked_C Embarked_Q ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T \\\n0 0 0 ... 0 0 0 0 0 \n1 1 0 ... 0 0 0 0 0 \n2 0 0 ... 0 0 0 0 0 \n3 0 0 ... 0 0 0 0 0 \n4 0 0 ... 0 0 0 0 0 \n\n Cabin_U FamilySize Family_Single Family_Small Family_Large \n0 1 2 0 1 0 \n1 0 2 0 1 0 \n2 1 1 1 0 0 \n3 0 2 0 1 0 \n4 1 1 1 0 0 \n\n[5 rows x 33 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Fare</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Ticket</th>\n <th>Embarked_C</th>\n <th>Embarked_Q</th>\n <th>...</th>\n <th>Cabin_D</th>\n <th>Cabin_E</th>\n <th>Cabin_F</th>\n <th>Cabin_G</th>\n <th>Cabin_T</th>\n <th>Cabin_U</th>\n <th>FamilySize</th>\n <th>Family_Single</th>\n <th>Family_Small</th>\n <th>Family_Large</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>22.0</td>\n <td>7.2500</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>0.0</td>\n <td>A/5 21171</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>1</td>\n <td>38.0</td>\n <td>71.2833</td>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>PC 17599</td>\n <td>1</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>26.0</td>\n <td>7.9250</td>\n <td>0</td>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>1.0</td>\n <td>STON/O2. 
3101282</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>3</td>\n <td>35.0</td>\n <td>53.1000</td>\n <td>0</td>\n <td>4</td>\n <td>0</td>\n <td>1</td>\n <td>1.0</td>\n <td>113803</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>35.0</td>\n <td>8.0500</td>\n <td>0</td>\n <td>5</td>\n <td>1</td>\n <td>0</td>\n <td>0.0</td>\n <td>373450</td>\n <td>0</td>\n <td>0</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 33 columns</p>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 811
}
],
"source": [
"# Add the family features to the Titanic dataset `full`\n",
"full = pd.concat([full, familyDf], axis=1)\n",
"full.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 812,
"outputs": [
{
"data": {
"text/plain": "(1309, 33)"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 812
}
],
"source": [
"# Number of rows and columns (features) in the dataset so far\n",
"full.shape"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 3.3 Feature selection"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Feature selection methods are covered in later lessons. If you already know several machine learning algorithms and want to read ahead, see these resources:\n",
"\n",
"* [How to do feature engineering](http://www.csuldw.com/2015/10/24/2015-10-24%20feature%20engineering/)\n",
"* [How to do feature engineering with sklearn](http://www.cnblogs.com/jasonfreak/p/5448385.html)\n",
"* [How to select features for the Titanic dataset](https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Correlation method: compute the correlation coefficients between the features"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 813,
"outputs": [
{
"data": {
"text/plain": " Age Fare Parch PassengerId Sex SibSp \\\nAge 1.000000 0.171521 -0.130872 0.025731 0.057397 -0.190747 \nFare 0.171521 1.000000 0.221522 0.031416 -0.185484 0.160224 \nParch -0.130872 0.221522 1.000000 0.008942 -0.213125 0.373587 \nPassengerId 0.025731 0.031416 0.008942 1.000000 0.013406 -0.055224 \nSex 0.057397 -0.185484 -0.213125 0.013406 1.000000 -0.109609 \nSibSp -0.190747 0.160224 0.373587 -0.055224 -0.109609 1.000000 \nSurvived -0.070323 0.257307 0.081629 -0.005007 -0.543351 -0.035322 \nEmbarked_C 0.076179 0.286241 -0.008635 0.048101 -0.066564 -0.048396 \nEmbarked_Q -0.012718 -0.130054 -0.100943 0.011585 -0.088651 -0.048678 \nEmbarked_S -0.059153 -0.169894 0.071881 -0.049836 0.115193 0.073709 \nPclass_1 0.362587 0.599956 -0.013033 0.026495 -0.107371 -0.034256 \nPclass_2 -0.014193 -0.121372 -0.010057 0.022714 -0.028862 -0.052419 \nPclass_3 -0.302093 -0.419616 0.019521 -0.041544 0.116562 0.072610 \nMaster -0.363923 0.011596 0.253482 0.002254 0.164375 0.329171 \nMiss -0.254146 0.092051 0.066473 -0.050027 -0.672819 0.077564 \nMr 0.165476 -0.192192 -0.304780 0.014116 0.870678 -0.243104 \nMrs 0.198091 0.139235 0.213491 0.033299 -0.571176 0.061643 \nOfficer 0.162818 0.028696 -0.032631 0.002231 0.087288 -0.013813 \nRoyalty 0.059466 0.026214 -0.030197 0.004400 -0.020408 -0.010787 \nCabin_A 0.125177 0.020094 -0.030707 -0.002831 0.047561 -0.039808 \nCabin_B 0.113458 0.393743 0.073051 0.015895 -0.094453 -0.011569 \nCabin_C 0.167993 0.401370 0.009601 0.006092 -0.077473 0.048616 \nCabin_D 0.132886 0.072737 -0.027385 0.000549 -0.057396 -0.015727 \nCabin_E 0.106600 0.073949 0.001084 -0.008136 -0.040340 -0.027180 \nCabin_F -0.072644 -0.037567 0.020481 0.000306 -0.006655 -0.008619 \nCabin_G -0.085977 -0.022857 0.058325 -0.045949 -0.083285 0.006015 \nCabin_T 0.032461 0.001179 -0.012304 -0.023049 0.020558 -0.013247 \nCabin_U -0.271918 -0.507197 -0.036806 0.000208 0.137396 0.009064 \nFamilySize -0.196996 0.226465 0.792296 -0.031437 -0.188583 0.861952 \nFamily_Single 
0.116675 -0.274826 -0.549022 0.028546 0.284537 -0.591077 \nFamily_Small -0.038189 0.197281 0.248532 0.002975 -0.255196 0.253590 \nFamily_Large -0.161210 0.170853 0.624627 -0.063415 -0.077748 0.699681 \n\n Survived Embarked_C Embarked_Q Embarked_S ... Cabin_D \\\nAge -0.070323 0.076179 -0.012718 -0.059153 ... 0.132886 \nFare 0.257307 0.286241 -0.130054 -0.169894 ... 0.072737 \nParch 0.081629 -0.008635 -0.100943 0.071881 ... -0.027385 \nPassengerId -0.005007 0.048101 0.011585 -0.049836 ... 0.000549 \nSex -0.543351 -0.066564 -0.088651 0.115193 ... -0.057396 \nSibSp -0.035322 -0.048396 -0.048678 0.073709 ... -0.015727 \nSurvived 1.000000 0.168240 0.003650 -0.149683 ... 0.150716 \nEmbarked_C 0.168240 1.000000 -0.164166 -0.778262 ... 0.107782 \nEmbarked_Q 0.003650 -0.164166 1.000000 -0.491656 ... -0.061459 \nEmbarked_S -0.149683 -0.778262 -0.491656 1.000000 ... -0.056023 \nPclass_1 0.285904 0.325722 -0.166101 -0.181800 ... 0.275698 \nPclass_2 0.093349 -0.134675 -0.121973 0.196532 ... -0.037929 \nPclass_3 -0.322308 -0.171430 0.243706 -0.003805 ... -0.207455 \nMaster 0.085221 -0.014172 -0.009091 0.018297 ... -0.042192 \nMiss 0.332795 -0.014351 0.198804 -0.113886 ... -0.012516 \nMr -0.549199 -0.065538 -0.080224 0.108924 ... -0.030261 \nMrs 0.344935 0.098379 -0.100374 -0.022950 ... 0.080393 \nOfficer -0.031316 0.003678 -0.003212 -0.001202 ... 0.006055 \nRoyalty 0.033391 0.077213 -0.021853 -0.054250 ... -0.012950 \nCabin_A 0.022287 0.094914 -0.042105 -0.056984 ... -0.024952 \nCabin_B 0.175095 0.161595 -0.073613 -0.095790 ... -0.043624 \nCabin_C 0.114652 0.158043 -0.059151 -0.101861 ... -0.053083 \nCabin_D 0.150716 0.107782 -0.061459 -0.056023 ... 1.000000 \nCabin_E 0.145321 0.027566 -0.042877 0.002960 ... -0.034317 \nCabin_F 0.057935 -0.020010 -0.020282 0.030575 ... -0.024369 \nCabin_G 0.016040 -0.031566 -0.019941 0.040560 ... -0.011817 \nCabin_T -0.026456 -0.014095 -0.008904 0.018111 ... -0.005277 \nCabin_U -0.316912 -0.258257 0.142369 0.137351 ... 
-0.353822 \nFamilySize 0.016639 -0.036553 -0.087190 0.087771 ... -0.025313 \nFamily_Single -0.203367 -0.107874 0.127214 0.014246 ... -0.074310 \nFamily_Small 0.279855 0.159594 -0.122491 -0.062909 ... 0.102432 \nFamily_Large -0.125147 -0.092825 -0.018423 0.093671 ... -0.049336 \n\n Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U FamilySize \\\nAge 0.106600 -0.072644 -0.085977 0.032461 -0.271918 -0.196996 \nFare 0.073949 -0.037567 -0.022857 0.001179 -0.507197 0.226465 \nParch 0.001084 0.020481 0.058325 -0.012304 -0.036806 0.792296 \nPassengerId -0.008136 0.000306 -0.045949 -0.023049 0.000208 -0.031437 \nSex -0.040340 -0.006655 -0.083285 0.020558 0.137396 -0.188583 \nSibSp -0.027180 -0.008619 0.006015 -0.013247 0.009064 0.861952 \nSurvived 0.145321 0.057935 0.016040 -0.026456 -0.316912 0.016639 \nEmbarked_C 0.027566 -0.020010 -0.031566 -0.014095 -0.258257 -0.036553 \nEmbarked_Q -0.042877 -0.020282 -0.019941 -0.008904 0.142369 -0.087190 \nEmbarked_S 0.002960 0.030575 0.040560 0.018111 0.137351 0.087771 \nPclass_1 0.242963 -0.073083 -0.035441 0.048310 -0.776987 -0.029656 \nPclass_2 -0.050210 0.127371 -0.032081 -0.014325 0.176485 -0.039976 \nPclass_3 -0.169063 -0.041178 0.056964 -0.030057 0.527614 0.058430 \nMaster 0.001860 0.058311 -0.013690 -0.006113 0.041178 0.355061 \nMiss 0.008700 -0.003088 0.061881 -0.013832 -0.004364 0.087350 \nMr -0.032953 -0.026403 -0.072514 0.023611 0.131807 -0.326487 \nMrs 0.045538 0.013376 0.042547 -0.011742 -0.162253 0.157233 \nOfficer -0.024048 -0.017076 -0.008281 -0.003698 -0.067030 -0.026921 \nRoyalty -0.012202 -0.008665 -0.004202 -0.001876 -0.071672 -0.023600 \nCabin_A -0.023510 -0.016695 -0.008096 -0.003615 -0.242399 -0.042967 \nCabin_B -0.041103 -0.029188 -0.014154 -0.006320 -0.423794 0.032318 \nCabin_C -0.050016 -0.035516 -0.017224 -0.007691 -0.515684 0.037226 \nCabin_D -0.034317 -0.024369 -0.011817 -0.005277 -0.353822 -0.025313 \nCabin_E 1.000000 -0.022961 -0.011135 -0.004972 -0.333381 -0.017285 \nCabin_F -0.022961 1.000000 -0.007907 
-0.003531 -0.236733 0.005525 \nCabin_G -0.011135 -0.007907 1.000000 -0.001712 -0.114803 0.035835 \nCabin_T -0.004972 -0.003531 -0.001712 1.000000 -0.051263 -0.015438 \nCabin_U -0.333381 -0.236733 -0.114803 -0.051263 1.000000 -0.014155 \nFamilySize -0.017285 0.005525 0.035835 -0.015438 -0.014155 1.000000 \nFamily_Single -0.042535 0.004055 -0.076397 0.022411 0.175812 -0.688864 \nFamily_Small 0.068007 0.012756 0.087471 -0.019574 -0.211367 0.302640 \nFamily_Large -0.046485 -0.033009 -0.016008 -0.007148 0.056438 0.801623 \n\n Family_Single Family_Small Family_Large \nAge 0.116675 -0.038189 -0.161210 \nFare -0.274826 0.197281 0.170853 \nParch -0.549022 0.248532 0.624627 \nPassengerId 0.028546 0.002975 -0.063415 \nSex 0.284537 -0.255196 -0.077748 \nSibSp -0.591077 0.253590 0.699681 \nSurvived -0.203367 0.279855 -0.125147 \nEmbarked_C -0.107874 0.159594 -0.092825 \nEmbarked_Q 0.127214 -0.122491 -0.018423 \nEmbarked_S 0.014246 -0.062909 0.093671 \nPclass_1 -0.126551 0.165965 -0.067523 \nPclass_2 -0.035075 0.097270 -0.118495 \nPclass_3 0.138250 -0.223338 0.155560 \nMaster -0.265355 0.120166 0.301809 \nMiss -0.023890 -0.018085 0.083422 \nMr 0.386262 -0.300872 -0.194207 \nMrs -0.354649 0.361247 0.012893 \nOfficer 0.013303 0.003966 -0.034572 \nRoyalty 0.008761 -0.000073 -0.017542 \nCabin_A 0.045227 -0.029546 -0.033799 \nCabin_B -0.087912 0.084268 0.013470 \nCabin_C -0.137498 0.141925 0.001362 \nCabin_D -0.074310 0.102432 -0.049336 \nCabin_E -0.042535 0.068007 -0.046485 \nCabin_F 0.004055 0.012756 -0.033009 \nCabin_G -0.076397 0.087471 -0.016008 \nCabin_T 0.022411 -0.019574 -0.007148 \nCabin_U 0.175812 -0.211367 0.056438 \nFamilySize -0.688864 0.302640 0.801623 \nFamily_Single 1.000000 -0.873398 -0.318944 \nFamily_Small -0.873398 1.000000 -0.183007 \nFamily_Large -0.318944 -0.183007 1.000000 \n\n[32 rows x 32 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Age</th>\n <th>Fare</th>\n <th>Parch</th>\n <th>PassengerId</th>\n <th>Sex</th>\n <th>SibSp</th>\n <th>Survived</th>\n <th>Embarked_C</th>\n <th>Embarked_Q</th>\n <th>Embarked_S</th>\n <th>...</th>\n <th>Cabin_D</th>\n <th>Cabin_E</th>\n <th>Cabin_F</th>\n <th>Cabin_G</th>\n <th>Cabin_T</th>\n <th>Cabin_U</th>\n <th>FamilySize</th>\n <th>Family_Single</th>\n <th>Family_Small</th>\n <th>Family_Large</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>Age</td>\n <td>1.000000</td>\n <td>0.171521</td>\n <td>-0.130872</td>\n <td>0.025731</td>\n <td>0.057397</td>\n <td>-0.190747</td>\n <td>-0.070323</td>\n <td>0.076179</td>\n <td>-0.012718</td>\n <td>-0.059153</td>\n <td>...</td>\n <td>0.132886</td>\n <td>0.106600</td>\n <td>-0.072644</td>\n <td>-0.085977</td>\n <td>0.032461</td>\n <td>-0.271918</td>\n <td>-0.196996</td>\n <td>0.116675</td>\n <td>-0.038189</td>\n <td>-0.161210</td>\n </tr>\n <tr>\n <td>Fare</td>\n <td>0.171521</td>\n <td>1.000000</td>\n <td>0.221522</td>\n <td>0.031416</td>\n <td>-0.185484</td>\n <td>0.160224</td>\n <td>0.257307</td>\n <td>0.286241</td>\n <td>-0.130054</td>\n <td>-0.169894</td>\n <td>...</td>\n <td>0.072737</td>\n <td>0.073949</td>\n <td>-0.037567</td>\n <td>-0.022857</td>\n <td>0.001179</td>\n <td>-0.507197</td>\n <td>0.226465</td>\n <td>-0.274826</td>\n <td>0.197281</td>\n <td>0.170853</td>\n </tr>\n <tr>\n <td>Parch</td>\n <td>-0.130872</td>\n <td>0.221522</td>\n <td>1.000000</td>\n <td>0.008942</td>\n <td>-0.213125</td>\n <td>0.373587</td>\n <td>0.081629</td>\n <td>-0.008635</td>\n <td>-0.100943</td>\n <td>0.071881</td>\n <td>...</td>\n <td>-0.027385</td>\n <td>0.001084</td>\n <td>0.020481</td>\n <td>0.058325</td>\n 
<td>-0.012304</td>\n <td>-0.036806</td>\n <td>0.792296</td>\n <td>-0.549022</td>\n <td>0.248532</td>\n <td>0.624627</td>\n </tr>\n <tr>\n <td>PassengerId</td>\n <td>0.025731</td>\n <td>0.031416</td>\n <td>0.008942</td>\n <td>1.000000</td>\n <td>0.013406</td>\n <td>-0.055224</td>\n <td>-0.005007</td>\n <td>0.048101</td>\n <td>0.011585</td>\n <td>-0.049836</td>\n <td>...</td>\n <td>0.000549</td>\n <td>-0.008136</td>\n <td>0.000306</td>\n <td>-0.045949</td>\n <td>-0.023049</td>\n <td>0.000208</td>\n <td>-0.031437</td>\n <td>0.028546</td>\n <td>0.002975</td>\n <td>-0.063415</td>\n </tr>\n <tr>\n <td>Sex</td>\n <td>0.057397</td>\n <td>-0.185484</td>\n <td>-0.213125</td>\n <td>0.013406</td>\n <td>1.000000</td>\n <td>-0.109609</td>\n <td>-0.543351</td>\n <td>-0.066564</td>\n <td>-0.088651</td>\n <td>0.115193</td>\n <td>...</td>\n <td>-0.057396</td>\n <td>-0.040340</td>\n <td>-0.006655</td>\n <td>-0.083285</td>\n <td>0.020558</td>\n <td>0.137396</td>\n <td>-0.188583</td>\n <td>0.284537</td>\n <td>-0.255196</td>\n <td>-0.077748</td>\n </tr>\n <tr>\n <td>SibSp</td>\n <td>-0.190747</td>\n <td>0.160224</td>\n <td>0.373587</td>\n <td>-0.055224</td>\n <td>-0.109609</td>\n <td>1.000000</td>\n <td>-0.035322</td>\n <td>-0.048396</td>\n <td>-0.048678</td>\n <td>0.073709</td>\n <td>...</td>\n <td>-0.015727</td>\n <td>-0.027180</td>\n <td>-0.008619</td>\n <td>0.006015</td>\n <td>-0.013247</td>\n <td>0.009064</td>\n <td>0.861952</td>\n <td>-0.591077</td>\n <td>0.253590</td>\n <td>0.699681</td>\n </tr>\n <tr>\n <td>Survived</td>\n <td>-0.070323</td>\n <td>0.257307</td>\n <td>0.081629</td>\n <td>-0.005007</td>\n <td>-0.543351</td>\n <td>-0.035322</td>\n <td>1.000000</td>\n <td>0.168240</td>\n <td>0.003650</td>\n <td>-0.149683</td>\n <td>...</td>\n <td>0.150716</td>\n <td>0.145321</td>\n <td>0.057935</td>\n <td>0.016040</td>\n <td>-0.026456</td>\n <td>-0.316912</td>\n <td>0.016639</td>\n <td>-0.203367</td>\n <td>0.279855</td>\n <td>-0.125147</td>\n </tr>\n <tr>\n <td>Embarked_C</td>\n 
<td>0.076179</td>\n <td>0.286241</td>\n <td>-0.008635</td>\n <td>0.048101</td>\n <td>-0.066564</td>\n <td>-0.048396</td>\n <td>0.168240</td>\n <td>1.000000</td>\n <td>-0.164166</td>\n <td>-0.778262</td>\n <td>...</td>\n <td>0.107782</td>\n <td>0.027566</td>\n <td>-0.020010</td>\n <td>-0.031566</td>\n <td>-0.014095</td>\n <td>-0.258257</td>\n <td>-0.036553</td>\n <td>-0.107874</td>\n <td>0.159594</td>\n <td>-0.092825</td>\n </tr>\n <tr>\n <td>Embarked_Q</td>\n <td>-0.012718</td>\n <td>-0.130054</td>\n <td>-0.100943</td>\n <td>0.011585</td>\n <td>-0.088651</td>\n <td>-0.048678</td>\n <td>0.003650</td>\n <td>-0.164166</td>\n <td>1.000000</td>\n <td>-0.491656</td>\n <td>...</td>\n <td>-0.061459</td>\n <td>-0.042877</td>\n <td>-0.020282</td>\n <td>-0.019941</td>\n <td>-0.008904</td>\n <td>0.142369</td>\n <td>-0.087190</td>\n <td>0.127214</td>\n <td>-0.122491</td>\n <td>-0.018423</td>\n </tr>\n <tr>\n <td>Embarked_S</td>\n <td>-0.059153</td>\n <td>-0.169894</td>\n <td>0.071881</td>\n <td>-0.049836</td>\n <td>0.115193</td>\n <td>0.073709</td>\n <td>-0.149683</td>\n <td>-0.778262</td>\n <td>-0.491656</td>\n <td>1.000000</td>\n <td>...</td>\n <td>-0.056023</td>\n <td>0.002960</td>\n <td>0.030575</td>\n <td>0.040560</td>\n <td>0.018111</td>\n <td>0.137351</td>\n <td>0.087771</td>\n <td>0.014246</td>\n <td>-0.062909</td>\n <td>0.093671</td>\n </tr>\n <tr>\n <td>Pclass_1</td>\n <td>0.362587</td>\n <td>0.599956</td>\n <td>-0.013033</td>\n <td>0.026495</td>\n <td>-0.107371</td>\n <td>-0.034256</td>\n <td>0.285904</td>\n <td>0.325722</td>\n <td>-0.166101</td>\n <td>-0.181800</td>\n <td>...</td>\n <td>0.275698</td>\n <td>0.242963</td>\n <td>-0.073083</td>\n <td>-0.035441</td>\n <td>0.048310</td>\n <td>-0.776987</td>\n <td>-0.029656</td>\n <td>-0.126551</td>\n <td>0.165965</td>\n <td>-0.067523</td>\n </tr>\n <tr>\n <td>Pclass_2</td>\n <td>-0.014193</td>\n <td>-0.121372</td>\n <td>-0.010057</td>\n <td>0.022714</td>\n <td>-0.028862</td>\n <td>-0.052419</td>\n <td>0.093349</td>\n 
<td>-0.134675</td>\n <td>-0.121973</td>\n <td>0.196532</td>\n <td>...</td>\n <td>-0.037929</td>\n <td>-0.050210</td>\n <td>0.127371</td>\n <td>-0.032081</td>\n <td>-0.014325</td>\n <td>0.176485</td>\n <td>-0.039976</td>\n <td>-0.035075</td>\n <td>0.097270</td>\n <td>-0.118495</td>\n </tr>\n <tr>\n <td>Pclass_3</td>\n <td>-0.302093</td>\n <td>-0.419616</td>\n <td>0.019521</td>\n <td>-0.041544</td>\n <td>0.116562</td>\n <td>0.072610</td>\n <td>-0.322308</td>\n <td>-0.171430</td>\n <td>0.243706</td>\n <td>-0.003805</td>\n <td>...</td>\n <td>-0.207455</td>\n <td>-0.169063</td>\n <td>-0.041178</td>\n <td>0.056964</td>\n <td>-0.030057</td>\n <td>0.527614</td>\n <td>0.058430</td>\n <td>0.138250</td>\n <td>-0.223338</td>\n <td>0.155560</td>\n </tr>\n <tr>\n <td>Master</td>\n <td>-0.363923</td>\n <td>0.011596</td>\n <td>0.253482</td>\n <td>0.002254</td>\n <td>0.164375</td>\n <td>0.329171</td>\n <td>0.085221</td>\n <td>-0.014172</td>\n <td>-0.009091</td>\n <td>0.018297</td>\n <td>...</td>\n <td>-0.042192</td>\n <td>0.001860</td>\n <td>0.058311</td>\n <td>-0.013690</td>\n <td>-0.006113</td>\n <td>0.041178</td>\n <td>0.355061</td>\n <td>-0.265355</td>\n <td>0.120166</td>\n <td>0.301809</td>\n </tr>\n <tr>\n <td>Miss</td>\n <td>-0.254146</td>\n <td>0.092051</td>\n <td>0.066473</td>\n <td>-0.050027</td>\n <td>-0.672819</td>\n <td>0.077564</td>\n <td>0.332795</td>\n <td>-0.014351</td>\n <td>0.198804</td>\n <td>-0.113886</td>\n <td>...</td>\n <td>-0.012516</td>\n <td>0.008700</td>\n <td>-0.003088</td>\n <td>0.061881</td>\n <td>-0.013832</td>\n <td>-0.004364</td>\n <td>0.087350</td>\n <td>-0.023890</td>\n <td>-0.018085</td>\n <td>0.083422</td>\n </tr>\n <tr>\n <td>Mr</td>\n <td>0.165476</td>\n <td>-0.192192</td>\n <td>-0.304780</td>\n <td>0.014116</td>\n <td>0.870678</td>\n <td>-0.243104</td>\n <td>-0.549199</td>\n <td>-0.065538</td>\n <td>-0.080224</td>\n <td>0.108924</td>\n <td>...</td>\n <td>-0.030261</td>\n <td>-0.032953</td>\n <td>-0.026403</td>\n <td>-0.072514</td>\n 
<td>0.023611</td>\n <td>0.131807</td>\n <td>-0.326487</td>\n <td>0.386262</td>\n <td>-0.300872</td>\n <td>-0.194207</td>\n </tr>\n <tr>\n <td>Mrs</td>\n <td>0.198091</td>\n <td>0.139235</td>\n <td>0.213491</td>\n <td>0.033299</td>\n <td>-0.571176</td>\n <td>0.061643</td>\n <td>0.344935</td>\n <td>0.098379</td>\n <td>-0.100374</td>\n <td>-0.022950</td>\n <td>...</td>\n <td>0.080393</td>\n <td>0.045538</td>\n <td>0.013376</td>\n <td>0.042547</td>\n <td>-0.011742</td>\n <td>-0.162253</td>\n <td>0.157233</td>\n <td>-0.354649</td>\n <td>0.361247</td>\n <td>0.012893</td>\n </tr>\n <tr>\n <td>Officer</td>\n <td>0.162818</td>\n <td>0.028696</td>\n <td>-0.032631</td>\n <td>0.002231</td>\n <td>0.087288</td>\n <td>-0.013813</td>\n <td>-0.031316</td>\n <td>0.003678</td>\n <td>-0.003212</td>\n <td>-0.001202</td>\n <td>...</td>\n <td>0.006055</td>\n <td>-0.024048</td>\n <td>-0.017076</td>\n <td>-0.008281</td>\n <td>-0.003698</td>\n <td>-0.067030</td>\n <td>-0.026921</td>\n <td>0.013303</td>\n <td>0.003966</td>\n <td>-0.034572</td>\n </tr>\n <tr>\n <td>Royalty</td>\n <td>0.059466</td>\n <td>0.026214</td>\n <td>-0.030197</td>\n <td>0.004400</td>\n <td>-0.020408</td>\n <td>-0.010787</td>\n <td>0.033391</td>\n <td>0.077213</td>\n <td>-0.021853</td>\n <td>-0.054250</td>\n <td>...</td>\n <td>-0.012950</td>\n <td>-0.012202</td>\n <td>-0.008665</td>\n <td>-0.004202</td>\n <td>-0.001876</td>\n <td>-0.071672</td>\n <td>-0.023600</td>\n <td>0.008761</td>\n <td>-0.000073</td>\n <td>-0.017542</td>\n </tr>\n <tr>\n <td>Cabin_A</td>\n <td>0.125177</td>\n <td>0.020094</td>\n <td>-0.030707</td>\n <td>-0.002831</td>\n <td>0.047561</td>\n <td>-0.039808</td>\n <td>0.022287</td>\n <td>0.094914</td>\n <td>-0.042105</td>\n <td>-0.056984</td>\n <td>...</td>\n <td>-0.024952</td>\n <td>-0.023510</td>\n <td>-0.016695</td>\n <td>-0.008096</td>\n <td>-0.003615</td>\n <td>-0.242399</td>\n <td>-0.042967</td>\n <td>0.045227</td>\n <td>-0.029546</td>\n <td>-0.033799</td>\n </tr>\n <tr>\n <td>Cabin_B</td>\n 
<td>0.113458</td>\n <td>0.393743</td>\n <td>0.073051</td>\n <td>0.015895</td>\n <td>-0.094453</td>\n <td>-0.011569</td>\n <td>0.175095</td>\n <td>0.161595</td>\n <td>-0.073613</td>\n <td>-0.095790</td>\n <td>...</td>\n <td>-0.043624</td>\n <td>-0.041103</td>\n <td>-0.029188</td>\n <td>-0.014154</td>\n <td>-0.006320</td>\n <td>-0.423794</td>\n <td>0.032318</td>\n <td>-0.087912</td>\n <td>0.084268</td>\n <td>0.013470</td>\n </tr>\n <tr>\n <td>Cabin_C</td>\n <td>0.167993</td>\n <td>0.401370</td>\n <td>0.009601</td>\n <td>0.006092</td>\n <td>-0.077473</td>\n <td>0.048616</td>\n <td>0.114652</td>\n <td>0.158043</td>\n <td>-0.059151</td>\n <td>-0.101861</td>\n <td>...</td>\n <td>-0.053083</td>\n <td>-0.050016</td>\n <td>-0.035516</td>\n <td>-0.017224</td>\n <td>-0.007691</td>\n <td>-0.515684</td>\n <td>0.037226</td>\n <td>-0.137498</td>\n <td>0.141925</td>\n <td>0.001362</td>\n </tr>\n <tr>\n <td>Cabin_D</td>\n <td>0.132886</td>\n <td>0.072737</td>\n <td>-0.027385</td>\n <td>0.000549</td>\n <td>-0.057396</td>\n <td>-0.015727</td>\n <td>0.150716</td>\n <td>0.107782</td>\n <td>-0.061459</td>\n <td>-0.056023</td>\n <td>...</td>\n <td>1.000000</td>\n <td>-0.034317</td>\n <td>-0.024369</td>\n <td>-0.011817</td>\n <td>-0.005277</td>\n <td>-0.353822</td>\n <td>-0.025313</td>\n <td>-0.074310</td>\n <td>0.102432</td>\n <td>-0.049336</td>\n </tr>\n <tr>\n <td>Cabin_E</td>\n <td>0.106600</td>\n <td>0.073949</td>\n <td>0.001084</td>\n <td>-0.008136</td>\n <td>-0.040340</td>\n <td>-0.027180</td>\n <td>0.145321</td>\n <td>0.027566</td>\n <td>-0.042877</td>\n <td>0.002960</td>\n <td>...</td>\n <td>-0.034317</td>\n <td>1.000000</td>\n <td>-0.022961</td>\n <td>-0.011135</td>\n <td>-0.004972</td>\n <td>-0.333381</td>\n <td>-0.017285</td>\n <td>-0.042535</td>\n <td>0.068007</td>\n <td>-0.046485</td>\n </tr>\n <tr>\n <td>Cabin_F</td>\n <td>-0.072644</td>\n <td>-0.037567</td>\n <td>0.020481</td>\n <td>0.000306</td>\n <td>-0.006655</td>\n <td>-0.008619</td>\n <td>0.057935</td>\n 
<td>-0.020010</td>\n <td>-0.020282</td>\n <td>0.030575</td>\n <td>...</td>\n <td>-0.024369</td>\n <td>-0.022961</td>\n <td>1.000000</td>\n <td>-0.007907</td>\n <td>-0.003531</td>\n <td>-0.236733</td>\n <td>0.005525</td>\n <td>0.004055</td>\n <td>0.012756</td>\n <td>-0.033009</td>\n </tr>\n <tr>\n <td>Cabin_G</td>\n <td>-0.085977</td>\n <td>-0.022857</td>\n <td>0.058325</td>\n <td>-0.045949</td>\n <td>-0.083285</td>\n <td>0.006015</td>\n <td>0.016040</td>\n <td>-0.031566</td>\n <td>-0.019941</td>\n <td>0.040560</td>\n <td>...</td>\n <td>-0.011817</td>\n <td>-0.011135</td>\n <td>-0.007907</td>\n <td>1.000000</td>\n <td>-0.001712</td>\n <td>-0.114803</td>\n <td>0.035835</td>\n <td>-0.076397</td>\n <td>0.087471</td>\n <td>-0.016008</td>\n </tr>\n <tr>\n <td>Cabin_T</td>\n <td>0.032461</td>\n <td>0.001179</td>\n <td>-0.012304</td>\n <td>-0.023049</td>\n <td>0.020558</td>\n <td>-0.013247</td>\n <td>-0.026456</td>\n <td>-0.014095</td>\n <td>-0.008904</td>\n <td>0.018111</td>\n <td>...</td>\n <td>-0.005277</td>\n <td>-0.004972</td>\n <td>-0.003531</td>\n <td>-0.001712</td>\n <td>1.000000</td>\n <td>-0.051263</td>\n <td>-0.015438</td>\n <td>0.022411</td>\n <td>-0.019574</td>\n <td>-0.007148</td>\n </tr>\n <tr>\n <td>Cabin_U</td>\n <td>-0.271918</td>\n <td>-0.507197</td>\n <td>-0.036806</td>\n <td>0.000208</td>\n <td>0.137396</td>\n <td>0.009064</td>\n <td>-0.316912</td>\n <td>-0.258257</td>\n <td>0.142369</td>\n <td>0.137351</td>\n <td>...</td>\n <td>-0.353822</td>\n <td>-0.333381</td>\n <td>-0.236733</td>\n <td>-0.114803</td>\n <td>-0.051263</td>\n <td>1.000000</td>\n <td>-0.014155</td>\n <td>0.175812</td>\n <td>-0.211367</td>\n <td>0.056438</td>\n </tr>\n <tr>\n <td>FamilySize</td>\n <td>-0.196996</td>\n <td>0.226465</td>\n <td>0.792296</td>\n <td>-0.031437</td>\n <td>-0.188583</td>\n <td>0.861952</td>\n <td>0.016639</td>\n <td>-0.036553</td>\n <td>-0.087190</td>\n <td>0.087771</td>\n <td>...</td>\n <td>-0.025313</td>\n <td>-0.017285</td>\n <td>0.005525</td>\n 
<td>0.035835</td>\n <td>-0.015438</td>\n <td>-0.014155</td>\n <td>1.000000</td>\n <td>-0.688864</td>\n <td>0.302640</td>\n <td>0.801623</td>\n </tr>\n <tr>\n <td>Family_Single</td>\n <td>0.116675</td>\n <td>-0.274826</td>\n <td>-0.549022</td>\n <td>0.028546</td>\n <td>0.284537</td>\n <td>-0.591077</td>\n <td>-0.203367</td>\n <td>-0.107874</td>\n <td>0.127214</td>\n <td>0.014246</td>\n <td>...</td>\n <td>-0.074310</td>\n <td>-0.042535</td>\n <td>0.004055</td>\n <td>-0.076397</td>\n <td>0.022411</td>\n <td>0.175812</td>\n <td>-0.688864</td>\n <td>1.000000</td>\n <td>-0.873398</td>\n <td>-0.318944</td>\n </tr>\n <tr>\n <td>Family_Small</td>\n <td>-0.038189</td>\n <td>0.197281</td>\n <td>0.248532</td>\n <td>0.002975</td>\n <td>-0.255196</td>\n <td>0.253590</td>\n <td>0.279855</td>\n <td>0.159594</td>\n <td>-0.122491</td>\n <td>-0.062909</td>\n <td>...</td>\n <td>0.102432</td>\n <td>0.068007</td>\n <td>0.012756</td>\n <td>0.087471</td>\n <td>-0.019574</td>\n <td>-0.211367</td>\n <td>0.302640</td>\n <td>-0.873398</td>\n <td>1.000000</td>\n <td>-0.183007</td>\n </tr>\n <tr>\n <td>Family_Large</td>\n <td>-0.161210</td>\n <td>0.170853</td>\n <td>0.624627</td>\n <td>-0.063415</td>\n <td>-0.077748</td>\n <td>0.699681</td>\n <td>-0.125147</td>\n <td>-0.092825</td>\n <td>-0.018423</td>\n <td>0.093671</td>\n <td>...</td>\n <td>-0.049336</td>\n <td>-0.046485</td>\n <td>-0.033009</td>\n <td>-0.016008</td>\n <td>-0.007148</td>\n <td>0.056438</td>\n <td>0.801623</td>\n <td>-0.318944</td>\n <td>-0.183007</td>\n <td>1.000000</td>\n </tr>\n </tbody>\n</table>\n<p>32 rows × 32 columns</p>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 813
}
],
"source": [
"#相关性矩阵\n",
"corrDf = full.corr() \n",
"corrDf"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 814,
"outputs": [
{
"data": {
"text/plain": "Survived 1.000000\nMrs 0.344935\nMiss 0.332795\nPclass_1 0.285904\nFamily_Small 0.279855\nFare 0.257307\nCabin_B 0.175095\nEmbarked_C 0.168240\nCabin_D 0.150716\nCabin_E 0.145321\nCabin_C 0.114652\nPclass_2 0.093349\nMaster 0.085221\nParch 0.081629\nCabin_F 0.057935\nRoyalty 0.033391\nCabin_A 0.022287\nFamilySize 0.016639\nCabin_G 0.016040\nEmbarked_Q 0.003650\nPassengerId -0.005007\nCabin_T -0.026456\nOfficer -0.031316\nSibSp -0.035322\nAge -0.070323\nFamily_Large -0.125147\nEmbarked_S -0.149683\nFamily_Single -0.203367\nCabin_U -0.316912\nPclass_3 -0.322308\nSex -0.543351\nMr -0.549199\nName: Survived, dtype: float64"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 814
}
],
"source": [
"‘‘‘\n",
"查看各个特征与生成情况(Survived)的相关系数,\n",
"ascending=False表示按降序排列\n",
"‘‘‘\n",
"corrDf[‘Survived‘].sort_values(ascending =False)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"根据各个特征与生成情况(Survived)的相关系数大小,我们选择了这几个特征作为模型的输入:\n",
"\n",
"头衔(前面所在的数据集titleDf)、客舱等级(pclassDf)、家庭大小(familyDf)、船票价格(Fare)、船舱号(cabinDf)、登船港口(embarkedDf)、性别(Sex)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 815,
"outputs": [
{
"data": {
"text/plain": " Master Miss Mr Mrs Officer Royalty Pclass_1 Pclass_2 Pclass_3 \\\n0 0 0 1 0 0 0 0 0 1 \n1 0 0 0 1 0 0 1 0 0 \n2 0 1 0 0 0 0 0 0 1 \n3 0 0 0 1 0 0 1 0 0 \n4 0 0 1 0 0 0 0 0 1 \n\n FamilySize ... Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U \\\n0 2 ... 0 0 0 0 0 1 \n1 2 ... 0 0 0 0 0 0 \n2 1 ... 0 0 0 0 0 1 \n3 2 ... 0 0 0 0 0 0 \n4 1 ... 0 0 0 0 0 1 \n\n Embarked_C Embarked_Q Embarked_S Sex \n0 0 0 1 1 \n1 1 0 0 0 \n2 0 0 1 0 \n3 0 0 1 0 \n4 0 0 1 1 \n\n[5 rows x 27 columns]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>Master</th>\n <th>Miss</th>\n <th>Mr</th>\n <th>Mrs</th>\n <th>Officer</th>\n <th>Royalty</th>\n <th>Pclass_1</th>\n <th>Pclass_2</th>\n <th>Pclass_3</th>\n <th>FamilySize</th>\n <th>...</th>\n <th>Cabin_D</th>\n <th>Cabin_E</th>\n <th>Cabin_F</th>\n <th>Cabin_G</th>\n <th>Cabin_T</th>\n <th>Cabin_U</th>\n <th>Embarked_C</th>\n <th>Embarked_Q</th>\n <th>Embarked_S</th>\n <th>Sex</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>2</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n </tr>\n <tr>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>2</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n </tr>\n <tr>\n <td>2</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>3</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>2</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n </tr>\n <tr>\n <td>4</td>\n <td>0</td>\n 
<td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n <td>...</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>0</td>\n <td>0</td>\n <td>1</td>\n <td>1</td>\n </tr>\n </tbody>\n</table>\n<p>5 rows × 27 columns</p>\n</div>"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 815
}
],
"source": [
"#特征选择\n",
"full_X = pd.concat( [titleDf,#头衔\n",
" pclassDf,#客舱等级\n",
" familyDf,#家庭大小\n",
" full[‘Fare‘],#船票价格\n",
" cabinDf,#船舱号\n",
" embarkedDf,#登船港口\n",
" full[‘Sex‘]#性别\n",
" ] , axis=1 )\n",
"full_X.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"# 4.构建模型"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"用训练数据和某个机器学习算法得到机器学习模型,用测试数据评估模型"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 4.1 建立训练数据集和测试数据集"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 816,
"outputs": [],
"source": [
"‘‘‘\n",
"1)坦尼克号测试数据集因为是我们最后要提交给Kaggle的,里面没有生存情况的值,所以不能用于评估模型。\n",
"我们将Kaggle泰坦尼克号项目给我们的测试数据,叫做预测数据集(记为pred,也就是预测英文单词predict的缩写)。\n",
"也就是我们使用机器学习模型来对其生存情况就那些预测。\n",
"2)我们使用Kaggle泰坦尼克号项目给的训练数据集,做为我们的原始数据集(记为source),\n",
"从这个原始数据集中拆分出训练数据集(记为train:用于模型训练)和测试数据集(记为test:用于模型评估)。\n",
"\n",
"‘‘‘\n",
"#原始数据集有891行\n",
"sourceRow=891\n",
"\n",
"‘‘‘\n",
"sourceRow是我们在最开始合并数据前知道的,原始数据集有总共有891条数据\n",
"从特征集合full_X中提取原始数据集提取前891行数据时,我们要减去1,因为行号是从0开始的。\n",
"‘‘‘\n",
"#原始数据集:特征\n",
"source_X = full_X.loc[0:sourceRow-1,:]\n",
"#原始数据集:标签\n",
"source_y = full.loc[0:sourceRow-1,‘Survived‘] \n",
"\n",
"#预测数据集:特征\n",
"pred_X = full_X.loc[sourceRow:,:]"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 817,
"outputs": [
{
"name": "stdout",
"text": [
"原始数据集有多少行: 891\n",
"原始数据集有多少行: 418\n"
],
"output_type": "stream"
}
],
"source": [
"‘‘‘\n",
"确保这里原始数据集取的是前891行的数据,不然后面模型会有错误\n",
"‘‘‘\n",
"#原始数据集有多少行\n",
"print(‘原始数据集有多少行:‘,source_X.shape[0])\n",
"#预测数据集大小\n",
"print(‘原始数据集有多少行:‘,pred_X.shape[0])"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 818,
"outputs": [
{
"name": "stdout",
"text": [
"原始数据集特征: (891, 27) 训练数据集特征: (712, 27) 测试数据集特征: (179, 27)\n",
"原始数据集标签: (891,) 训练数据集标签: (712,) 测试数据集标签: (179,)\n"
],
"output_type": "stream"
}
],
"source": [
"‘‘‘\n",
"从原始数据集(source)中拆分出训练数据集(用于模型训练train),测试数据集(用于模型评估test)\n",
"train_test_split是交叉验证中常用的函数,功能是从样本中随机的按比例选取train data和test data\n",
"train_data:所要划分的样本特征集\n",
"train_target:所要划分的样本结果\n",
"test_size:样本占比,如果是整数的话就是样本的数量\n",
"‘‘‘\n",
"\n",
"#建立模型用的训练数据集和测试数据集\n",
"train_X, test_X, train_y, test_y = train_test_split(source_X ,\n",
" source_y,\n",
" train_size=.8)\n",
"\n",
"#输出数据集大小\n",
"print (‘原始数据集特征:‘,source_X.shape, \n",
" ‘训练数据集特征:‘,train_X.shape ,\n",
" ‘测试数据集特征:‘,test_X.shape)\n",
"\n",
"print (‘原始数据集标签:‘,source_y.shape, \n",
" ‘训练数据集标签:‘,train_y.shape ,\n",
" ‘测试数据集标签:‘,test_y.shape)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 819,
"outputs": [
{
"data": {
"text/plain": "0 0.0\n1 1.0\n2 1.0\n3 1.0\n4 0.0\nName: Survived, dtype: float64"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 819
}
],
"source": [
"#原始数据查看\n",
"source_y.head()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 4.2 选择机器学习算法"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"选择一个机器学习算法,用于模型的训练。如果你是新手,建议从逻辑回归算法开始"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 820,
"outputs": [],
"source": [
"#第1步:导入算法\n",
"from sklearn.linear_model import LogisticRegression\n",
"#第2步:创建模型:逻辑回归(logisic regression)\n",
"model = LogisticRegression()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 821,
"outputs": [],
"source": [
"#随机森林Random Forests Model\n",
"#from sklearn.ensemble import RandomForestClassifier\n",
"#model = RandomForestClassifier(n_estimators=100)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 822,
"outputs": [],
"source": [
"#支持向量机Support Vector Machines\n",
"#from sklearn.svm import SVC, LinearSVC\n",
"#model = SVC()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 823,
"outputs": [],
"source": [
"#Gradient Boosting Classifier\n",
"#from sklearn.ensemble import GradientBoostingClassifier\n",
"#model = GradientBoostingClassifier()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 824,
"outputs": [],
"source": [
"#K-nearest neighbors\n",
"#from sklearn.neighbors import KNeighborsClassifier\n",
"#model = KNeighborsClassifier(n_neighbors = 3)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "code",
"execution_count": 825,
"outputs": [],
"source": [
"# Gaussian Naive Bayes\n",
"#from sklearn.naive_bayes import GaussianNB\n",
"#model = GaussianNB()"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 4.3 训练模型"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 826,
"outputs": [
{
"data": {
"text/plain": "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n intercept_scaling=1, l1_ratio=None, max_iter=1,\n multi_class=‘warn‘, n_jobs=None, penalty=‘l2‘,\n random_state=None, solver=‘warn‘, tol=0.0001, verbose=0,\n warm_start=False)"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 826
}
],
"source": [
"#第3步:训练模型\n",
"model.fit( train_X , train_y)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 5.评估模型"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 827,
"outputs": [
{
"data": {
"text/plain": "0.6927374301675978"
},
"metadata": {},
"output_type": "execute_result",
"execution_count": 827
}
],
"source": [
"# 分类问题,score得到的是模型的正确率\n",
"model.score(test_X , test_y )"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"# 6.方案实施(Deployment)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"## 6.1 得到预测结果上传到Kaggle"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"使用预测数据集到底预测结果,并保存到csv文件中,上传到Kaggle中,就可以看到排名。"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 828,
"outputs": [],
"source": [
"#使用机器学习模型,对预测数据集中的生存情况进行预测\n",
"pred_Y = model.predict(pred_X)\n",
"\n",
"‘‘‘\n",
"生成的预测值是浮点数(0.0,1,0)\n",
"但是Kaggle要求提交的结果是整型(0,1)\n",
"所以要对数据类型进行转换\n",
"‘‘‘\n",
"pred_Y=pred_Y.astype(int)\n",
"#乘客id\n",
"passenger_id = full.loc[sourceRow:,‘PassengerId‘]\n",
"#数据框:乘客id,预测生存情况的值\n",
"predDf = pd.DataFrame( \n",
" { ‘PassengerId‘: passenger_id , \n",
" ‘Survived‘: pred_Y } )\n",
"predDf.shape\n",
"predDf.head()\n",
"#保存结果\n",
"predDf.to_csv( ‘titanic_pred.csv‘ , index = False )"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
},
{
"cell_type": "markdown",
"source": [
"## 6.2 报告撰写"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"下次课程我们通过《数据可视化》来详细聊聊如何做一份数据分析报告"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 828,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n",
"is_executing": false
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
},
"pycharm": {
"stem_cell": {
"cell_type": "raw",
"source": [],
"metadata": {
"collapsed": false
}
}
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Original article: https://www.cnblogs.com/SmartCat994/p/12399016.html
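
The hold-out split from section 4.1 and the accuracy score from section 5 can be sketched with the standard library alone. This is only an illustration under stated assumptions: `holdout_split` and `accuracy` are hypothetical helper names, the toy rows merely stand in for `source_X` and `source_y`, and the seed is arbitrary. It is not how scikit-learn implements `train_test_split` or `model.score`.

```python
import random


def holdout_split(rows, labels, train_size=0.8, seed=0):
    """Shuffle row indices and cut them into a training part and a test part.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    cut = int(len(indices) * train_size)  # number of training rows
    train_idx, test_idx = indices[:cut], indices[cut:]
    train_X = [rows[i] for i in train_idx]
    test_X = [rows[i] for i in test_idx]
    train_y = [labels[i] for i in train_idx]
    test_y = [labels[i] for i in test_idx]
    return train_X, test_X, train_y, test_y


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Toy data standing in for source_X / source_y (891 rows in the notebook).
rows = [[i] for i in range(891)]
labels = [i % 2 for i in range(891)]
train_X, test_X, train_y, test_y = holdout_split(rows, labels)
print(len(train_X), len(test_X))
```

With 891 rows and `train_size=0.8`, the sketch yields 712 training rows and 179 test rows, the same sizes the notebook reports. Fixing the seed is a design choice made here so that repeated runs give the same split; the notebook's call does not fix one, so its accuracy can vary from run to run.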