# Create subset of only the numeric columns
so_numeric_df = so_survey_df.select_dtypes(include=['int', 'float'])
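As a small, self-contained sketch of the same idea, the toy DataFrame below (its column names and values are made up for illustration and are not the survey data) suggests that include='number' may be a convenient alternative to listing 'int' and 'float', since it matches every numeric dtype at once.

```python
import pandas as pd

# Hypothetical toy data, not the survey DataFrame
toy_df = pd.DataFrame({
    'Country': ['USA', 'UK', 'India'],
    'Age': [25, 31, 47],
    'ConvertedSalary': [70841.0, 21426.0, 41671.0],
})

# 'number' matches all integer and floating-point columns
toy_numeric_df = toy_df.select_dtypes(include='number')
numeric_columns = list(toy_numeric_df.columns)
print(numeric_columns)
```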
# Create a series out of the Country column
countries = so_survey_df['Country']
# Get the counts of each category
country_counts = countries.value_counts()
# Print the count values for each category
print(country_counts)
<script.py> output:
South Africa 166
USA 164
Spain 134
Sweeden 119
France 115
Russia 97
UK 95
India 95
Ukraine 9
Ireland 5
Name: Country, dtype: int64
# Create a series out of the Country column
countries = so_survey_df['Country']
# Get the counts of each category
country_counts = countries.value_counts()
# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)
# Print the top 5 rows in the mask series
print(mask.head())
<script.py> output:
0 False
1 False
2 False
3 False
4 False
Name: Country, dtype: bool
# Create a series out of the Country column
countries = so_survey_df['Country']
# Get the counts of each category
country_counts = countries.value_counts()
# Create a mask for only categories that occur less than 10 times
mask = countries.isin(country_counts[country_counts < 10].index)
# Label all other categories as Other
countries[mask] = 'Other'
# Print the updated category counts
print(countries.value_counts())
<script.py> output:
South Africa 166
USA 164
Spain 134
Sweeden 119
France 115
Russia 97
UK 95
India 95
Other 14
Name: Country, dtype: int64
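The grouping step above can also be sketched without in-place assignment. This is only an illustration on a made-up Series: the data, the threshold min_count and the variable names are assumptions, not part of the exercise. Series.where returns a new Series, which avoids modifying a slice of the original DataFrame.

```python
import pandas as pd

# Hypothetical toy data standing in for so_survey_df['Country']
toy_countries = pd.Series(['USA'] * 5 + ['Spain'] * 4 + ['Ukraine'] + ['Ireland'] * 2)

# Threshold chosen only for this toy example
min_count = 3
counts = toy_countries.value_counts()
rare = counts[counts < min_count].index

# Series.where keeps values where the condition holds and replaces the rest,
# returning a new Series instead of modifying the original in place
grouped = toy_countries.where(~toy_countries.isin(rare), 'Other')
grouped_counts = grouped.value_counts()
print(grouped_counts)
```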
Numeric variables
Binning numeric variables
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')
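This DataCamp example splits a continuous variable into intervals and then attaches a label to each interval. Be careful with labels: there must be exactly one label per bin, that is, one fewer than the number of bin edges.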
# Import numpy
import numpy as np
# Specify the boundaries of the bins
bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]
# Bin labels
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']
# Bin the continuous variable ConvertedSalary using these boundaries
so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],
                                         bins, labels=labels)
# Print the first 5 rows of the boundary_binned column
print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())
<script.py> output:
boundary_binned ConvertedSalary
0 Very low 0.0
1 Medium 70841.0
2 Very low 0.0
3 Low 21426.0
4 Low 41671.0
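To see how values that fall exactly on a bin edge are treated, here is a minimal sketch using the same bins and labels on a few made-up salaries (the salaries Series is hypothetical). With the default right=True each interval is closed on the right, so 10000 lands in 'Very low' rather than 'Low'.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries chosen to sit on and around the bin edges
salaries = pd.Series([0.0, 10000.0, 10000.01, 50000.0, 200000.0])

bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]       # 6 edges
labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']  # 5 labels

# With the default right=True each bin is (left, right]
binned = pd.cut(salaries, bins, labels=labels)
binned_list = binned.astype(str).tolist()
print(binned_list)
```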
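A small side note first: I just noticed what double brackets do. so_survey_df[['Gender']] returns a one-column DataFrame rather than a Series, so so_survey_df[['Gender']].info() gives the same result as so_survey_df.loc[:, ['Gender']].info().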
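# Remove the commas in the column
# Note: the column must hold strings first, since .str methods only work on string data
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace(',', '', regex=False)
# regex=False so that '$' is a literal dollar sign, not a regex end-of-string anchor
so_survey_df['RawSalary'] = so_survey_df['RawSalary'].str.replace('$', '', regex=False)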
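pd.to_numeric()
Converts values directly to a numeric type; with errors='coerce', values that cannot be parsed become NaN.
numeric_vals = pd.to_numeric(so_survey_df['RawSalary'], errors='coerce')
astype('type')
Forces a conversion to the given type.
When output is displayed, the dtype is usually shown as well.
Several values can be replaced in one statement by chaining calls: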
# Use method chaining
so_survey_df['RawSalary'] = (so_survey_df['RawSalary']
    .str.replace(',', '', regex=False)
    .str.replace('$', '', regex=False)
    .str.replace('£', '', regex=False)
    .astype('float'))
# Print the RawSalary column
print(so_survey_df['RawSalary'])
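One way to find the values that still cannot be converted after cleaning is to combine the replacements with pd.to_numeric(errors='coerce') and look at the resulting NaN mask. The sketch below uses made-up strings (raw and the other names are hypothetical, not from the survey data).

```python
import pandas as pd

# Hypothetical raw strings, not taken from the survey data
raw = pd.Series(['1,000', '$2,500.50', '£300', 'unknown'])

cleaned = raw
for symbol in [',', '$', '£']:
    # regex=False treats '$' as a literal character, not an end-of-string anchor
    cleaned = cleaned.str.replace(symbol, '', regex=False)

# errors='coerce' turns values that cannot be parsed into NaN
numeric_vals = pd.to_numeric(cleaned, errors='coerce')

# Rows that failed conversion can be inspected via the NaN mask
problem_rows = raw[numeric_vals.isna()]
print(numeric_vals)
print(problem_rows)
```

Unlike astype('float'), which raises an error on the first unparseable value, this approach keeps going and leaves the problem rows visible for inspection.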
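In other words, clip limits the elements of an array to the range between a_min and a_max: anything greater than a_max is set to a_max, and anything less than a_min is set to a_min.
This is what a senior classmate was referring to: when a predicted sentiment score is exactly 1 or 0, those values need to be replaced. Limiting the results to (0.0001, 0.9999) prevents inf from appearing when the log loss is computed. It is an approximation, and I am noting it down here for now.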
x=np.array([[1,2,3,5,6,7,8,9],[1,2,3,5,6,7,8,9]])
np.clip(x,3,8)
Out[90]:
array([[3, 3, 3, 5, 6, 7, 8, 8],
[3, 3, 3, 5, 6, 7, 8, 8]])
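The log loss use case mentioned above could look roughly like the sketch below. The labels and predictions are made up, and the binary log loss is written out by hand with NumPy rather than taken from any particular library; the clipping bounds are the ones from the text.

```python
import numpy as np

# Hypothetical labels and predicted probabilities, including exact 0 and 1
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1.0, 0.0, 0.0, 0.3])

# Bounds taken from the range mentioned in the text
clipped = np.clip(y_pred, 0.0001, 0.9999)

# Binary log loss; without clipping, log(0) would give -inf
log_loss = -np.mean(y_true * np.log(clipped) + (1 - y_true) * np.log(1 - clipped))
print(clipped, log_loss)
```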
The distribution of the data can be inspected by visualizing it.
Original article: https://www.cnblogs.com/gaowenxingxing/p/12461437.html