Table of Contents
- 0. Preface
- 1. EDA
  - 1.1 A first look at the data
  - 1.2 Handling missing values
  - 1.3 Mining hidden information in the data for the model
- 2. Deep Feature Engineering
- 3. Feature selection and dimensionality reduction (experiment log)
- 4. LightGBM best_parameters
- 5. Internal blend
- 6. Final results
0. Preface
Kaggle competition: IEEE-CIS Fraud Detection
- Competition description: In this competition, you’ll benchmark machine learning models on a challenging large-scale dataset. The data comes from Vesta’s real-world e-commerce transactions and contains a wide range of features from device type to product features. You also have the opportunity to create new features to improve your results. In this competition you are predicting the probability that an online transaction is fraudulent, as denoted by the binary target isFraud. The data is broken into two files identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.
- LB (public leaderboard): AUC score computed on the first 20% of the test set.
- Private Leaderboard (final score): AUC score computed on the remaining 80% of the test set.
- Two final submissions are allowed in this competition.
Having tried a few entry-level Kaggle competitions before, this time I entered this binary-classification competition hosted by the IEEE-CIS and Vesta. The prediction model was built with LightGBM in Python on Jupyter Notebook; the key lies in data mining and in the strategy chosen for processing the data and generating features, which requires very detailed EDA and FE. Result: a bronze medal, 373/6381 (top 6%), Private Leaderboard score 0.928512.
The ideas presented here are meant to help understand and explain the topic; the Python code is not necessarily the best way to do it. Both the ideas and the code are for reference only. For the methods and detailed steps involved, please follow the reference links. Variable naming, comments, and test records in the code are messy and are provided for reference only.
1. EDA
See the following Kaggle kernel. Nanashi: Fraud complete EDA_Nanashi
1.1 A first look at the data
Official data description and related Q&A: Data Description (Details and Discussion)
First, the Transaction table:
- TransactionDT: not a real timestamp, but a timedelta in seconds from some reference point.
- TransactionAmt: transaction payment amount in USD; the decimal part is worth attention.
- ProductCD: product code, one of W/H/C/S/R. Not necessarily a physical product; it may also refer to a service.
- card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.
- addr1 - addr2: billing region and billing country.
- dist: distances between (not limited to) billing address, mailing address, zip code, IP address, phone area, etc.
- P_ and R_ emaildomain: purchaser and recipient email domain; some transactions have no recipient, in which case R_emaildomain is missing.
- C1 - C14: counting features, such as how many addresses are found to be associated with the payment card, plus counts for device, IP address, billing address, etc. These exist for both purchaser and recipient, which doubles the number.
- D1 - D15: timedelta features, such as days between previous transactions, etc.
- M1 - M9: match features, such as names on card and address matching; all are binary (0/1) variables.
- Vxxx: rich features engineered by Vesta, including ranking, counting, and other entity relations. Different blocks of V features have different missing ratios; their exact meaning and the best way to handle them remain unclear.
Next, the Identity table:
- id_01 to id_11 are identity features collected by Vesta and security partners, such as device rating, ip_domain rating, proxy rating, etc. They also record behavioral fingerprints such as account login times, failed login attempts, and how long an account stayed on the page. None of these can be elaborated on due to security partner T&C.
- DeviceType, DeviceInfo, and id_12 - id_38 are categorical features.
Many EDA kernels reveal characteristics of the data, in particular how it drifts over time and how the train and test distributions differ.
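As the competition description notes, the transaction and identity files are joined on TransactionID, and not every transaction has a matching identity row. A minimal sketch of that left join, assuming the raw CSVs live in ../input as in the loading code later in this post:
import pandas as pd

train_df = pd.read_csv('../input/train_transaction.csv')
train_identity = pd.read_csv('../input/train_identity.csv')

# Left join keeps transactions without identity rows; their identity columns become NaN
train_full = train_df.merge(train_identity, how='left', on='TransactionID')
print(train_full.shape)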
1.2 Handling missing values
- Missing ratio: see the EDA kernels for each column's missing-value ratio (a quick way to compute it is sketched below).
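A minimal sketch for checking those ratios, assuming train_df / test_df are the loaded transaction tables:
# Fraction of missing values per column, largest first
print(train_df.isnull().mean().sort_values(ascending=False).head(20))
print(test_df.isnull().mean().sort_values(ascending=False).head(20))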
- Use dependencies between features to find related columns and fill missing values with them. See: Gunes Evitan: IEEE-CIS Fraud Detection Dependency Check
# For each unique value of independent_var, count how many distinct dependent_var values it maps to
def check_dependency(independent_var, dependent_var):
    independent_uniques = []
    temp_df = pd.concat([train_df[[independent_var, dependent_var]], test_df[[independent_var, dependent_var]]])
    for value in temp_df[independent_var].unique():
        independent_uniques.append(temp_df[temp_df[independent_var] == value][dependent_var].value_counts().shape[0])
    values = pd.Series(data=independent_uniques, index=temp_df[independent_var].unique())

    N = len(values)
    N_dependent = len(values[values == 1])
    N_notdependent = len(values[values > 1])
    N_null = len(values[values == 0])

    print(f'In {independent_var}, there are {N} unique values')
    print(f'{N_dependent}/{N} have one unique {dependent_var} value')
    print(f'{N_notdependent}/{N} have more than one unique {dependent_var} values')
    print(f'{N_null}/{N} have only missing {dependent_var} values\n')
For example:
check_dependency('R_emaildomain', 'C5')
print(train_df['C10'].isnull().sum()/train_df.shape[0])
print(test_df['C10'].isnull().sum()/test_df.shape[0])
print(test_df[~test_df['R_emaildomain'].isnull()]['C5'].value_counts())
In R_emaildomain, there are 61 unique values
60/61 have one unique C5 value
0/61 have more than one unique C5 values
1/61 have only missing C5 values
0.0
5.920768278891869e-06
0.0 135867
Name: C5, dtype: int64
We can see that R_emaildomain and C5 are highly dependent. C5 has a small number of missing values in the test set only, and they are missing exactly where R_emaildomain is present; wherever R_emaildomain is present, C5 is always 0, so filling the missing C5 values with 0 is reasonable. Following this idea, several other highly dependent feature pairs were found and used to fill missing values in the test set:
# 1.1 find dependency and fillna
# 'dist1', 'C3': only the test set has missing C3, and only where dist1 is present; wherever dist1 is present, C3 is always 0
test_df['C3'] = test_df['C3'].fillna(0)
# 'R_emaildomain', 'C5': only the test set has missing C5, almost always where R_emaildomain is present; only 3 missing C5 where R_emaildomain is also missing
test_df['C5'] = test_df['C5'].fillna(0)
# 'id_30', 'C7': only the test set has missing C7, only where id_30 (device) is present; only 3 such rows, the rest are 0
test_df['C7'] = test_df['C7'].fillna(0)
# 'id_31', 'C9': only the test set has missing C9, only where id_31 (browser) is present; only 3 such rows, the rest are 0
test_df['C9'] = test_df['C9'].fillna(0)
- Use the other card features associated with each card1 value to fill the missing values of card2 - card6.
# 1. More interaction between card features + fill nans
i_cols = ['TransactionID','card1','card2','card3','card4','card5','card6']
full_df = pd.concat([train_df[i_cols], test_df[i_cols]])

## I've used frequency encoding before so we have ints here
## we will drop very rare cards
full_df['card6'] = np.where(full_df['card6']==30, np.nan, full_df['card6'])
full_df['card6'] = np.where(full_df['card6']==16, np.nan, full_df['card6'])

i_cols = ['card2','card3','card4','card5','card6']

## We will find the best match for nan values and fill with it
## (filling card2 - card6 this way helps a lot)
for col in i_cols:
    temp_df = full_df.groupby(['card1',col])[col].agg(['count']).reset_index()
    temp_df = temp_df.sort_values(by=['card1','count'], ascending=False).reset_index(drop=True)
    del temp_df['count']
    temp_df = temp_df.drop_duplicates(keep='first').reset_index(drop=True)
    temp_df.index = temp_df['card1'].values
    temp_df = temp_df[col].to_dict()
    full_df[col] = np.where(full_df[col].isna(), full_df['card1'].map(temp_df), full_df[col])

i_cols = ['card1','card2','card3','card4','card5','card6']
for col in i_cols:
    train_df[col] = full_df[full_df['TransactionID'].isin(train_df['TransactionID'])][col].values
    test_df[col] = full_df[full_df['TransactionID'].isin(test_df['TransactionID'])][col].values
1.3 Mining hidden information in the data for the model
To protect user privacy, the organizers transformed many features and concealed their true meaning. Careful observation and analysis of the data is therefore needed to work out what each feature means and what information it carries, so that a sensible processing strategy can be chosen.
- Date. Kevin: TransactionDT startdate. Black Friday and Cyber Monday line up best when 2017-11-30 is taken as the start date; adding the TransactionDT timedelta to this start date gives the date of each transaction (a sketch of the conversion follows).
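A minimal sketch of that conversion, assuming train_df is the loaded transaction table and 2017-11-30 is used as the reference start date:
import datetime

START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')

# TransactionDT is a timedelta in seconds, so adding it to the chosen start date gives a calendar date
train_df['DT'] = train_df['TransactionDT'].apply(lambda x: START_DATE + datetime.timedelta(seconds=int(x)))
train_df['DT_month'] = train_df['DT'].dt.month
train_df['DT_dow'] = train_df['DT'].dt.dayofweek
train_df['DT_hour'] = train_df['DT'].dt.hour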
- D-series features. Akasyanama: EDA what's behind D features?; A Humphrey: Understanding the D features (updated); tuttifrutti: Creating features from D columns (guessing userID). A few with reasonably clear meanings:
  - D1: timedelta (days, rounded down) since first transaction for .
  - D2: appears to be the same as D1, except that D1 = 0 values have been replaced by NaN.
  - D3: timedelta since the previous transaction for . As with D1 and D2, this feature appears to count different cards separately.
  - D4: timedelta since first transaction for . Using the example of a husband and wife each using their own card on a joint credit card account, this feature would not distinguish which card was used.
  - D5: timedelta since the previous transaction for .
  - D6 and D7: some combination or transform of D4 and D5; dropping either one lowers the AUC.
  - D8: timedelta (float) since some event.
  - D9: the fractional part of D8, i.e. the hour of day. The fraud rate (mean of isFraud) varies very little across hours, so this feature adds little to the model and is planned to be dropped.
  - D10: some kind of timedelta for domestic transactions.
Chosen processing strategy:
- Since the D features are time-dependent and drift with TransactionDT, it helps to take some of them (e.g. D1, D4) together with TransactionDT and compute their difference. The difference exposes things such as the card-opening date and the exact date of the previous transaction, whereas the raw D features only reflect an accumulated time difference since some event and also introduce time drift. The resulting "D minus DT" features can then be used to build synthetic user IDs (uid) and card IDs (cardid), which identify users more precisely. Although the "D minus DT" features may carry some risk of overfitting, this model keeps them (a sketch follows after the normalization lines below).
- The D features can also be normalized with min-max scaling and standard scores over different time windows, implemented in a custom value_normalization function:
dt_df[new_col+'_min_max'] = (dt_df[col]-dt_df['temp_min'])/(dt_df['temp_max']-dt_df['temp_min'])
dt_df[new_col+'_std_score'] = (dt_df[col]-dt_df['temp_mean'])/(dt_df['temp_std'])
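A minimal sketch of the "D minus DT" idea and of a synthetic uid, assuming train_df / test_df are the transaction tables; the exact uid recipe used in the final model may differ:
import numpy as np

for df in [train_df, test_df]:
    # Day index derived from TransactionDT (seconds since the reference point)
    df['DT_day'] = df['TransactionDT'] / (24 * 60 * 60)
    # Anchoring D1/D4 to the timeline turns "days since first transaction" into a fixed reference date
    df['D1_minus_DT'] = df['D1'] - df['DT_day']
    df['D4_minus_DT'] = df['D4'] - df['DT_day']
    # Hypothetical synthetic user id: card info plus the anchored D1
    df['uid'] = (df['card1'].astype(str) + '_' + df['addr1'].astype(str)
                 + '_' + np.floor(df['D1_minus_DT']).astype(str))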
- C-series features. For their distributions see the EDA kernels. As mentioned earlier, the C features count entities (such as billing addresses and email addresses) associated with the payer and recipient, and some of them are highly dependent on other features, which can be used to fill their missing values in the test set. The train and test distributions differ considerably, so removing outliers to improve the distribution is worth considering (a clipping sketch follows).
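A minimal sketch of one way to tame the outliers, clipping each C column at a high train-set quantile; the 99th-percentile threshold is an assumption, not necessarily what the final model used:
c_cols = ['C' + str(i) for i in range(1, 15)]
for col in c_cols:
    # Cap extreme counts at the 99th percentile of the training distribution
    upper = train_df[col].quantile(0.99)
    train_df[col] = train_df[col].clip(upper=upper)
    test_df[col] = test_df[col].clip(upper=upper)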
- V-series features. See: Laevatein: Interesting finding about the V columns. The V features can be split into blocks by missing ratio; within each block the features were presumably generated from the same data:
  - V1 ~ V11
  - V12 ~ V34
  - V35 ~ V52
  - V53 ~ V74
  - V75 ~ V94
  - V95 ~ V137 (V126 - V138 highly correlated)
  - V138 ~ V166 (high null ratio)
  - V167 ~ V216 (high null ratio)
  - V217 ~ V278 (high null ratio, 2 different null ratios)
  - V279 ~ V321 (2 different null ratios; V289 - V318 and V319 - V321 highly correlated)
  - V322 ~ V339 (high null ratio)
The numerical V features among them are:
'V126' 'V127' 'V128' 'V130' 'V131' 'V132' 'V133' 'V134' 'V136' 'V137'
'V143' 'V144' 'V145' 'V150' 'V159' 'V160' 'V164' 'V165' 'V166' 'V202'
'V203' 'V204' 'V205' 'V206' 'V207' 'V208' 'V209' 'V210' 'V211' 'V212'
'V213' 'V214' 'V215' 'V216' 'V263' 'V264' 'V265' 'V266' 'V267' 'V268'
'V270' 'V271' 'V272' 'V273' 'V274' 'V275' 'V276' 'V277' 'V278' 'V306'
'V307' 'V308' 'V309' 'V310' 'V312' 'V313' 'V314' 'V315' 'V316' 'V317'
'V318' 'V320' 'V321' 'V331' 'V332' 'V333' 'V335'
Chosen processing approach:
- Apply scaling and PCA to the numerical V features (see the sketch after this list).
- Group PCA and some other treatments of the V features were also tried, but abandoned since they did not improve the LB score.
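A minimal sketch of scaling plus PCA on the numerical V columns listed above; the number of components and the -1 fill value are assumptions:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

v_num_cols = ['V126', 'V127', 'V128', 'V130', 'V131', 'V132', 'V133', 'V134', 'V136', 'V137']
# extend v_num_cols with the remaining numerical V columns listed above

# Fit the scaler and PCA on train + test together so both get the same projection
full_v = pd.concat([train_df[v_num_cols], test_df[v_num_cols]]).fillna(-1)
scaled = StandardScaler().fit_transform(full_v)
pca = PCA(n_components=5, random_state=0)
components = pca.fit_transform(scaled)

for i in range(components.shape[1]):
    train_df[f'V_pca_{i}'] = components[:len(train_df), i]
    test_df[f'V_pca_{i}'] = components[len(train_df):, i]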
2. Deep Feature Engineering
For the initial feature-engineering ideas (LB -> 0.9487) see: Konstantin Yakovlev: IEEE - Internal Blend; David Cairuz: Feature Engineering & LightGBM. For the later feature-engineering ideas (LB: 0.9487 -> 0.9526) see the other experiment records. The finally adopted feature-engineering code follows:
import numpy as np
import pandas as pd
import gc
import os, sys, random, datetime
Shrink the dataset so it takes less memory and is processed more efficiently; see: Konstantin Yakovlev: IEEE Data minification.
def seed_everything(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
## Memory Reducer
# :df       pandas dataframe to reduce size   # type: pd.DataFrame()
# :verbose                                    # type: bool
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
Load the training and test sets and reduce their memory footprint.
print('Load Data')
train_df = pd.read_csv('../input/train_transaction.csv')
test_df = pd.read_csv('../input/test_transaction.csv')
test_df['isFraud'] = 0
train_identity = pd.read_csv('../input/train_identity.csv')
test_identity = pd.read_csv('../input/test_identity.csv')
print('Reduce Memory')
train_df = reduce_mem_usage(train_df)
test_df = reduce_mem_usage(test_df)
train_identity = reduce_mem_usage(train_identity)
test_identity = reduce_mem_usage(test_identity)
Load Data
Reduce Memory
Mem. usage decreased to 542.35 Mb (69.4% reduction)
Mem. usage decreased to 473.07 Mb (68.9% reduction)
Mem. usage decreased to 25.86 Mb (42.7% reduction)
Mem. usage decreased to 25.44 Mb (42.7% reduction)
Initial processing of the identity data: split string features such as DeviceInfo, id_30 (OS information) and id_31 (browser information) to generate new features, use id_33 (screen resolution) to build device features, convert the remaining categorical features from strings to numerical values, and bin part of the information:
def id_split(dataframe):
    dataframe['device_name'] = dataframe['DeviceInfo'].str.split('/', expand=True)[0]
    dataframe['device_version'] = dataframe['DeviceInfo'].str.split('/', expand=True)[1]
    dataframe['OS_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[0]
    dataframe['version_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[1]
    dataframe['browser_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[0]
    dataframe['version_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[1]
    dataframe['screen_width'] = dataframe['id_33'].str.split('x', expand=True)[0]
    dataframe['screen_height'] = dataframe['id_33'].str.split('x', expand=True)[1]
    dataframe['id_12'] = dataframe['id_12'].map({'Found': 1, 'NotFound': 0})
    dataframe['id_15'] = dataframe['id_15'].map({'New': 2, 'Found': 1, 'Unknown': 0})
    dataframe['id_16'] = dataframe['id_16'].map({'Found': 1, 'NotFound': 0})
    dataframe['id_23'] = dataframe['id_23'].map({'TRANSPARENT': 4, 'IP_PROXY': 3, 'IP_PROXY:ANONYMOUS': 2, 'IP_PROXY:HIDDEN': 1})
    dataframe['id_27'] = dataframe['id_27'].map({'Found': 1, 'NotFound': 0})
    dataframe['id_28'] = dataframe['id_28'].map({'New': 2, 'Found': 1})
    dataframe['id_29'] = dataframe['id_29'].map({'Found': 1, 'NotFound': 0})
    dataframe['id_35'] = dataframe['id_35'].map({'T': 1, 'F': 0})
    dataframe['id_36'] = dataframe['id_36'].map({'T': 1, 'F': 0})
    dataframe['id_37'] = dataframe['id_37'].map({'T': 1, 'F': 0})
    dataframe['id_38'] = dataframe['id_38'].map({'T': 1, 'F': 0})
    dataframe['id_34'] = dataframe
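A hedged usage sketch, assuming the remainder of id_split follows the same mapping pattern and returns the dataframe; it would be applied to both identity tables:
# Hypothetical usage: derive the split and encoded features on both identity tables
train_identity = id_split(train_identity)
test_identity = id_split(test_identity)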