Feature Engineering

Date: 2020-03-11 12:43:54

select_dtypes
Selects the columns of a DataFrame that have the specified dtypes.

# Create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])
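
To try select_dtypes without the survey data, here is a minimal sketch on a made-up DataFrame. The name demo_df and its contents are assumptions for illustration only; 'number' is the pandas selector that matches all integer and float dtypes.

```python
import pandas as pd

# Hypothetical stand-in for so_survey_df with mixed column types
demo_df = pd.DataFrame({
    'Country': ['USA', 'Spain', 'France'],
    'Age': [25, 31, 47],
    'ConvertedSalary': [70841.0, 21426.0, 41671.0],
})

# Keep only the numeric columns; 'number' matches every integer and float dtype
numeric_df = demo_df.select_dtypes(include='number')
print(numeric_df.columns.tolist())

# exclude= works the other way round: keep everything that is not numeric
non_numeric_df = demo_df.select_dtypes(exclude='number')
print(non_numeric_df.columns.tolist())
```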

Handling categorical features

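pd.get_dummies()
Converts a categorical column into one indicator (0/1) column per category.
value_counts()
Counts how many times each distinct value occurs in a column.
An example from DataCamp: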

# Create a series out of the Country column
countries = so_survey_df['Country']

# Get the counts of each category
country_counts = countries.value_counts()

# Print the count values for each category
print(country_counts)

<script.py> output:
    South Africa    166
    USA             164
    Spain           134
    Sweeden         119
    France          115
    Russia           97
    UK               95
    India            95
    Ukraine           9
    Ireland           5
    Name: Country, dtype: int64

isin()
Takes a list of values and checks whether each element of the column is in that list.
The result is a boolean mask.

# Create a series out of the Country column
countries = so_survey_df['Country']

# Get the counts of each category
country_counts = countries.value_counts()

# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)

# Print the top 5 rows in the mask series
print(mask.head())

<script.py> output:
    0    False
    1    False
    2    False
    3    False
    4    False
    Name: Country, dtype: bool

A mask like this can pick out the categories we do not need.
In the DataCamp example, the rows selected by the mask are relabeled as 'Other'.

# Create a series out of the Country column
countries = so_survey_df['Country']

# Get the counts of each category
country_counts = countries.value_counts()

# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)

# Label all other categories as Other
countries[mask] = 'Other'

# Print the updated category counts
print(countries.value_counts())

<script.py> output:
    South Africa    166
    USA             164
    Spain           134
    Sweeden         119
    France          115
    Russia           97
    UK               95
    India            95
    Other            14
    Name: Country, dtype: int64
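
pd.get_dummies() is listed above without an example, so here is a minimal sketch of how the grouped categories might then be encoded. The toy demo_df and the prefix 'C' are assumptions, not part of the DataCamp exercise.

```python
import pandas as pd

# Hypothetical toy data; 'Other' plays the role of the grouped rare categories
demo_df = pd.DataFrame({'Country': ['USA', 'Spain', 'Other', 'USA']})

# One-hot encoding: one indicator column per category, named <prefix>_<category>
one_hot = pd.get_dummies(demo_df, columns=['Country'], prefix='C')
print(one_hot.columns.tolist())

# Dummy encoding: drop_first=True omits the first category, which is then
# implied when all remaining indicator columns are 0
dummy = pd.get_dummies(demo_df, columns=['Country'], prefix='C', drop_first=True)
print(dummy.columns.tolist())
```

Grouping rare categories first keeps the number of indicator columns small.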

Handling numeric features

Numeric variables
Binning numeric variables: grouping continuous values into discrete intervals (bins).

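df['column_name'] = value
Creates a new column.
This can also be used to derive or filter features, but I can never think of the concise way to write it; my fundamentals are probably still a bit shaky.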
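
One concise pattern for this, sketched on made-up data: create the column with a default value, then overwrite the rows that match a condition via .loc. The names demo_df and Paid_Job are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for so_survey_df
demo_df = pd.DataFrame({'ConvertedSalary': [0.0, 70841.0, 0.0, 21426.0]})

# Assigning to a new column name creates the column; a scalar is broadcast
demo_df['Paid_Job'] = 0

# Overwrite the rows that satisfy a condition using .loc[row_mask, column]
demo_df.loc[demo_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1
print(demo_df)
```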
pandas.cut
Splits a set of values into discrete intervals. For example, given a set of ages, pandas.cut can divide them into age groups and attach a label to each group.

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

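x: the array-like data to be binned; it must be 1-dimensional (a DataFrame is not accepted).
bins: the intervals (also called "buckets" or "bins") to cut into. It takes one of 3 forms: an int scalar, a sequence of scalars, or a pandas.IntervalIndex.
- An int scalar: x is split into that many equal-width bins. The range of x is extended by 0.1% on each side so that the minimum and maximum of x are included.
- A sequence of scalars: defines the edges of each bin; the range of x is not extended in this case.
- pandas.IntervalIndex: defines the exact intervals to use.
right: bool, default True; whether each interval includes its right edge. For example, with bins=[1,2,3], right=True gives the intervals (1,2], (2,3]; right=False gives [1,2), [2,3).
labels: labels for the resulting bins. For example, after cutting ages x into age groups, the groups can be labeled "young", "middle-aged" and so on. The number of labels must equal the number of intervals: bins=[1,2,3] produces 2 intervals, (1,2] and (2,3], so labels must have length 2. With labels=False, the result is the index of the bin each value falls into (starting from 0).
retbins: bool, default False; whether to also return the bin edges. This is mostly useful when bins is an int scalar, because it shows which intervals were actually computed.
precision: the number of decimal places used for the interval edges; default 3.
include_lowest: bool, default False; whether the first interval includes its left edge. With the default, the lowest bin edge itself is not included.
duplicates: what to do when the bin edges are not unique. There are two options: raise (the default) raises an error, drop removes the duplicate edges.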
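
The effect of right, include_lowest, labels=False and retbins is easiest to see on values that sit exactly on the bin edges. The following is a small sketch with made-up numbers (ages and edges are assumptions), not taken from the DataCamp exercise.

```python
import pandas as pd

# Hypothetical ages, chosen so that every value sits exactly on a bin edge
ages = pd.Series([1, 2, 3])
edges = [1, 2, 3]

# right=True (default): intervals (1, 2] and (2, 3]; the value 1 falls outside
print(pd.cut(ages, edges).isna().tolist())

# include_lowest=True closes the first interval on the left, so 1 is kept
print(pd.cut(ages, edges, include_lowest=True).isna().tolist())

# right=False: intervals [1, 2) and [2, 3); now the value 3 falls outside
print(pd.cut(ages, edges, right=False, labels=False).isna().tolist())

# Integer bins plus retbins=True: also return the edges pandas computed
codes, computed_edges = pd.cut(ages, 2, labels=False, retbins=True)
print(codes.tolist(), computed_edges)
```

Values that fall outside every interval become NaN, which is why the DataCamp example below uses -np.inf and np.inf as the outer edges.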

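This DataCamp example splits the values into custom intervals and then attaches a label to each one; just pay attention to labels, which needs exactly one entry per interval.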

# Import numpy
import numpy as np

# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]

# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']

# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'], 
                                         bins, labels=labels)

# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())


<script.py> output:
      boundary_binned  ConvertedSalary
    0        Very low              0.0
    1          Medium          70841.0
    2        Very low              0.0
    3             Low          21426.0
    4             Low          41671.0

Handling missing values

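A small side note first: I just noticed that so_survey_df[['Gender']].info() works the same as so_survey_df.loc[:, ['Gender']].info(); the double brackets [[]] select a one-column DataFrame instead of a Series. Neat, haha.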

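info(): shows the number of non-missing values per feature, along with other summary information.
isna(): returns a boolean mask marking the missing values.
notnull(): returns a boolean mask marking the non-missing values.
dropna(): drops missing values; its subset argument restricts the check to the given columns.
fillna(): fills missing values with a given value, for example a string: so_survey_df['Gender'] = so_survey_df['Gender'].fillna('Not Given')
round(): returns a floating-point value rounded to the given number of decimal places.
When dropping missing values, note that they should not simply be dropped from the training set.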
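
The methods above can be tried on a tiny frame. This is a sketch on made-up data; demo_df and its values are assumptions standing in for so_survey_df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for so_survey_df with a few missing entries
demo_df = pd.DataFrame({
    'Gender': ['Male', np.nan, 'Female', np.nan],
    'Age': [25.0, 31.0, np.nan, 47.0],
})

# Count the missing values in each column
missing_counts = demo_df.isna().sum()
print(missing_counts)

# Drop only the rows whose 'Age' is missing; gaps in 'Gender' are ignored
age_known = demo_df.dropna(subset=['Age'])
print(age_known.shape)

# Fill the gaps in 'Gender' with a placeholder, assigning the result back
demo_df['Gender'] = demo_df['Gender'].fillna('Not Given')
print(demo_df['Gender'].tolist())
```

Assigning the result of fillna() back to the column avoids relying on inplace=True applied to a column selection.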

Handling special characters

replace()
Replaces one substring with another.

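# Remove the commas in the column
# Note: the column must hold strings first, because the .str accessor needs them
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '', regex=False)
# regex=False treats '$' as a literal character rather than a regex anchor
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '', regex=False)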

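pd.to_numeric()
Converts values directly to a numeric dtype; with errors='coerce', values that cannot be parsed become NaN.
numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')
astype('dtype')
Casts a column to the given dtype.
Printed results usually show the dtype at the end of the output.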
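
A minimal sketch of how errors='coerce' can be used to find the entries that still fail to convert. The Series raw and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw salary strings; the last entry cannot be parsed as a number
raw = pd.Series(['1,000', '$2,500', 'unknown'])

# Strip the literal ',' and '$' characters (regex=False: no regex interpretation)
cleaned = raw.str.replace(',', '', regex=False).str.replace('$', '', regex=False)

# errors='coerce' turns unparseable entries into NaN instead of raising
numeric_vals = pd.to_numeric(cleaned, errors='coerce')
print(numeric_vals)

# The rows that failed to convert can then be located with isna()
print(raw[numeric_vals.isna()].tolist())
```

Inspecting the failed rows this way shows which stray characters still need to be removed before calling astype('float').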

Several replacements can be chained in a single statement

# Use method chaining
so_survey_df['RawSalary'] = (so_survey_df['RawSalary']
                             .str.replace(',', '', regex=False)
                             .str.replace('$', '', regex=False)
                             .str.replace('£', '', regex=False)
                             .astype('float'))
 
# Print the RawSalary column
print(so_survey_df['RawSalary'])

np.clip()

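In other words, clip limits the elements of an array to the range between a_min and a_max: anything greater than a_max is set to a_max, and anything less than a_min is set to a_min.
This is what a senior classmate mentioned: when a predicted sentiment score comes out as exactly 1 or 0, those values need to be replaced. The results can be limited to the range [0.0001, 0.9999], which prevents inf from appearing when the log loss is computed. It is an approximation; I am just noting it down here for now.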

x=np.array([[1,2,3,5,6,7,8,9],[1,2,3,5,6,7,8,9]])
np.clip(x,3,8)

Out[90]:
array([[3, 3, 3, 5, 6, 7, 8, 8],
       [3, 3, 3, 5, 6, 7, 8, 8]])
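
The log loss use case mentioned above can be sketched as follows. The helper log_loss, the labels and the predicted probabilities are all made up for illustration; this is a plain NumPy version of binary cross-entropy, not a particular library's implementation.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, including exact 0 and 1
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1.0, 0.0, 0.0, 0.3])

def log_loss(y, p):
    # Binary cross-entropy averaged over the samples
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Keep every probability inside [0.0001, 0.9999] before taking logarithms
clipped = np.clip(y_pred, 0.0001, 0.9999)
print(log_loss(y_true, clipped))
```

Without the clipping step, np.log(0) evaluates to -inf and the loss is no longer a finite number.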

Data distribution

Visualization can be used to inspect the distribution of the data.

Original post: https://www.cnblogs.com/gaowenxingxing/p/12461437.html