
[Dacon] Income Prediction Competition Based on Population Data


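This post was written while participating in Dacon's "Income Prediction Competition Based on Population Data". It walks through simple data preprocessing and EDA, followed by the code for an ensemble built from LightGBM and XGBoost models.

The code was run on Google Colab with a GPU and Standard RAM.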


     

0. Import Packages

  • Import the libraries
!pip install -U pandas-profiling
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
import pandas_profiling
import seaborn as sns
import random as rn
import os
import scipy.stats as stats

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn import metrics

import xgboost as xgb
import lightgbm as lgb

from collections import Counter
import warnings

%matplotlib inline
warnings.filterwarnings(action='ignore')

     

  • Check the versions of the main libraries
print("numpy version: {}".format(np.__version__))
print("pandas version: {}".format(pd.__version__))
print("matplotlib version: {}".format(matplotlib.__version__))
print("scikit-learn version: {}".format(sklearn.__version__))
print("xgboost version: {}".format(xgb.__version__))
print("lightgbm version: {}".format(lgb.__version__))
    numpy version: 1.21.6
    pandas version: 1.3.5
    matplotlib version: 3.2.2
    scikit-learn version: 1.0.2
    xgboost version: 0.90
    lightgbm version: 2.2.3

     

  • Fix the random seeds
# reproducibility
seed_num = 42
np.random.seed(seed_num)
rn.seed(seed_num)
os.environ['PYTHONHASHSEED']=str(seed_num)

     

1. Load and Check Dataset

Variable Description

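Variable          Description
age               Age
workclass         Type of work
fnlwgt            CPS (Current Population Survey) weight
education         Education level
education.num     Education level number
marital.status    Marital status
occupation        Occupation
relationship      Family relationship
race              Race
sex               Sex
capital.gain      Capital gain
capital.loss      Capital loss
hours.per.week    Hours worked per week
native.country    Native country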

     

  • Load the data
train = pd.read_csv('/content/drive/MyDrive/Forecasting_income/dataset/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Forecasting_income/dataset/test.csv')

# regex=False makes '.' a literal dot rather than a regex wildcard
train.columns = train.columns.str.replace('.','_', regex=False)
test.columns = test.columns.str.replace('.','_', regex=False)

train.head()
       id  age workclass  fnlwgt     education  education_num      marital_status  \
    0   0   32   Private  309513    Assoc-acdm             12  Married-civ-spouse   
    1   1   33   Private  205469  Some-college             10  Married-civ-spouse   
    2   2   46   Private  149949  Some-college             10  Married-civ-spouse   
    3   3   23   Private  193090     Bachelors             13       Never-married   
    4   4   55   Private   60193       HS-grad              9            Divorced   
    
            occupation   relationship   race     sex  capital_gain  capital_loss  \
    0     Craft-repair        Husband  White    Male             0             0   
    1  Exec-managerial        Husband  White    Male             0             0   
    2     Craft-repair        Husband  White    Male             0             0   
    3     Adm-clerical      Own-child  White  Female             0             0   
    4     Adm-clerical  Not-in-family  White  Female             0             0   
    
       hours_per_week native_country  target  
    0              40  United-States       0  
    1              40  United-States       1  
    2              40  United-States       0  
    3              30  United-States       0  
    4              40  United-States       0  

     

  • Generate a Pandas Profiling report
pr=train.profile_report()
pr.to_file('/content/drive/MyDrive/Forecasting_income/pr_report.html')

     

  • With Pandas Profiling, you can explore a DataFrame easily and efficiently, as shown below.

[Figure: two screenshots of the Pandas Profiling report]

     

Using the Alerts in the Pandas Profiling Report

Variable Pairs with the High Correlation

  1. relationship - sex
  2. age - marital.status
  3. workclass - occupation
  4. education - education.num
  5. relationship - marital.status
  6. race - native.country
  7. sex - occupation
  8. target - relationship

Data Type

  • Numeric (7) : id, age, fnlwgt, education.num, capital.gain, capital.loss, hours.per.week
  • Categorical (9) : workclass, education, marital.status, occupation, relationship, race, sex, native.country, target

Note

  • workclass and occupation have nearly the same share of missing values (about 10.5%).
  • native.country has 583 (3.3%) missing values, so the corresponding rows will be dropped.
  • capital.gain and capital.loss are highly skewed. We will check for outliers and, if needed, remove them or apply a transformation.
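
The figures in the notes above come from the profiling report, but they can also be checked directly with pandas. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(df) * 100).round(1),
    })
    return summary[summary["n_missing"] > 0]

# Toy frame standing in for the competition data (values are made up).
toy = pd.DataFrame({
    "workclass": ["Private", None, "State-gov", None],
    "occupation": ["Sales", None, "Adm-clerical", None],
    "capital_gain": [0, 0, 0, 99999],
})

report = missing_summary(toy)
print(report)

# A large positive skew indicates a long right tail, as with capital_gain.
print(toy["capital_gain"].skew())
```

On the real data, `missing_summary(train)` would be the corresponding call.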

     

train.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 17480 entries, 0 to 17479
    Data columns (total 16 columns):
     #   Column          Non-Null Count  Dtype 
    ---  ------          --------------  ----- 
     0   id              17480 non-null  int64 
     1   age             17480 non-null  int64 
     2   workclass       15644 non-null  object
     3   fnlwgt          17480 non-null  int64 
     4   education       17480 non-null  object
     5   education_num   17480 non-null  int64 
     6   marital_status  17480 non-null  object
     7   occupation      15637 non-null  object
     8   relationship    17480 non-null  object
     9   race            17480 non-null  object
     10  sex             17480 non-null  object
     11  capital_gain    17480 non-null  int64 
     12  capital_loss    17480 non-null  int64 
     13  hours_per_week  17480 non-null  int64 
     14  native_country  16897 non-null  object
     15  target          17480 non-null  int64 
    dtypes: int64(8), object(8)
    memory usage: 2.1+ MB

     

2. Data Preprocessing

(1) Missing Value

train.columns[train.isnull().any()]
    Index(['workclass', 'occupation', 'native_country'], dtype='object')
train[train["workclass"].isnull()]
              id  age workclass  fnlwgt     education  education_num  \
    15081  15081   90       NaN   77053       HS-grad              9   
    15082  15082   66       NaN  186061  Some-college             10   
    15084  15084   51       NaN  172175     Doctorate             16   
    15086  15086   61       NaN  135285       HS-grad              9   
    15087  15087   71       NaN  100820       HS-grad              9   
    ...      ...  ...       ...     ...           ...            ...   
    17475  17475   35       NaN  320084     Bachelors             13   
    17476  17476   30       NaN   33811     Bachelors             13   
    17477  17477   71       NaN  287372     Doctorate             16   
    17478  17478   41       NaN  202822       HS-grad              9   
    17479  17479   72       NaN  129912       HS-grad              9   
    
               marital_status occupation   relationship                race  \
    15081             Widowed        NaN  Not-in-family               White   
    15082             Widowed        NaN      Unmarried               Black   
    15084       Never-married        NaN  Not-in-family               White   
    15086  Married-civ-spouse        NaN        Husband               White   
    15087  Married-civ-spouse        NaN        Husband               White   
    ...                   ...        ...            ...                 ...   
    17475  Married-civ-spouse        NaN           Wife               White   
    17476       Never-married        NaN  Not-in-family  Asian-Pac-Islander   
    17477  Married-civ-spouse        NaN        Husband               White   
    17478           Separated        NaN  Not-in-family               Black   
    17479  Married-civ-spouse        NaN        Husband               White   
    
              sex  capital_gain  capital_loss  hours_per_week native_country  \
    15081  Female             0          4356              40  United-States   
    15082  Female             0          4356              40  United-States   
    15084    Male             0          2824              40  United-States   
    15086    Male             0          2603              32  United-States   
    15087    Male             0          2489              15  United-States   
    ...       ...           ...           ...             ...            ...   
    17475  Female             0             0              55  United-States   
    17476  Female             0             0              99  United-States   
    17477    Male             0             0              10  United-States   
    17478  Female             0             0              32  United-States   
    17479    Male             0             0              25  United-States   
    
           target  
    15081       0  
    15082       0  
    15084       1  
    15086       0  
    15087       0  
    ...       ...  
    17475       1  
    17476       0  
    17477       1  
    17478       0  
    17479       0  
    
    [1836 rows x 16 columns]

     

train['workclass'].unique()
    array(['Private', 'State-gov', 'Local-gov', 'Self-emp-not-inc',
           'Self-emp-inc', 'Federal-gov', 'Without-pay', nan, 'Never-worked'],
          dtype=object)
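  • Rows with missing values in the workclass or occupation columns are dropped.
  • In most cases both columns are missing at the same time, so filling only the missing workclass values with an existing category such as Never-worked would not be meaningful.
  • I also tried assigning a new category to the missing workclass and occupation values, but because of the resulting mismatch with the test data after one-hot encoding, I think a different approach is worth considering 😔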
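
One possible way around the mismatch mentioned in the last bullet is to fix the set of categories before encoding, so that train and test always produce the same dummy columns. This is only a sketch under assumptions: the "Unknown" label and the `encode` helper are hypothetical, the frames are toy data, and it is not the approach used in the rest of this post.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy train/test frames (made-up values).
train_toy = pd.DataFrame({"workclass": ["Private", None, "State-gov"]})
test_toy = pd.DataFrame({"workclass": ["Private", None]})

# Fix the category set up front: every level seen in train,
# plus a hypothetical "Unknown" level for missing values.
levels = sorted(train_toy["workclass"].dropna().unique()) + ["Unknown"]
workclass_dtype = CategoricalDtype(categories=levels)

def encode(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode workclass using the fixed category set."""
    filled = df["workclass"].fillna("Unknown").astype(workclass_dtype)
    # get_dummies emits one column per declared category,
    # including categories that do not occur in this frame.
    return pd.get_dummies(filled, prefix="workclass")

train_enc = encode(train_toy)
test_enc = encode(test_toy)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Because the columns come from the declared categories rather than from the values present, a level that is absent from the test set still gets an all-zero column in the same position.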

     

print(sum(train['workclass'].isna()))
print(sum(train['occupation'].isna()))

fill_na = train['workclass'].isna()
    1836
    1843
df_train = train.dropna()  

print(sum(df_train['workclass'].isna()))
print(sum(df_train['occupation'].isna()))
print(sum(df_train['native_country'].isna()))
    0
    0
    0

     

df_train
              id  age     workclass  fnlwgt     education  education_num  \
    0          0   32       Private  309513    Assoc-acdm             12   
    1          1   33       Private  205469  Some-college             10   
    2          2   46       Private  149949  Some-college             10   
    3          3   23       Private  193090     Bachelors             13   
    4          4   55       Private   60193       HS-grad              9   
    ...      ...  ...           ...     ...           ...            ...   
    15076  15076   35       Private  337286       Masters             14   
    15077  15077   36       Private  182074  Some-college             10   
    15078  15078   50  Self-emp-inc  175070   Prof-school             15   
    15079  15079   39       Private  202937  Some-college             10   
    15080  15080   33       Private   96245    Assoc-acdm             12   
    
               marital_status       occupation   relationship                race  \
    0      Married-civ-spouse     Craft-repair        Husband               White   
    1      Married-civ-spouse  Exec-managerial        Husband               White   
    2      Married-civ-spouse     Craft-repair        Husband               White   
    3           Never-married     Adm-clerical      Own-child               White   
    4                Divorced     Adm-clerical  Not-in-family               White   
    ...                   ...              ...            ...                 ...   
    15076       Never-married  Exec-managerial  Not-in-family  Asian-Pac-Islander   
    15077            Divorced     Adm-clerical  Not-in-family               White   
    15078  Married-civ-spouse   Prof-specialty        Husband               White   
    15079            Divorced     Tech-support  Not-in-family               White   
    15080  Married-civ-spouse   Prof-specialty        Husband               White   
    
              sex  capital_gain  capital_loss  hours_per_week native_country  \
    0        Male             0             0              40  United-States   
    1        Male             0             0              40  United-States   
    2        Male             0             0              40  United-States   
    3      Female             0             0              30  United-States   
    4      Female             0             0              40  United-States   
    ...       ...           ...           ...             ...            ...   
    15076    Male             0             0              40  United-States   
    15077    Male             0             0              45  United-States   
    15078    Male             0             0              45  United-States   
    15079  Female             0             0              40         Poland   
    15080    Male             0             0              50  United-States   
    
           target  
    0           0  
    1           1  
    2           0  
    3           0  
    4           0  
    ...       ...  
    15076       0  
    15077       0  
    15078       1  
    15079       0  
    15080       0  
    
    [15081 rows x 16 columns]

     

(2) Outlier

fig, ax = plt.subplots(1, 2, figsize=(12,3))
g = sns.distplot(df_train['capital_gain'], color='b', label='Skewness : {:.2f}'.format(df_train['capital_gain'].skew()), ax=ax[0])
g = g.legend(loc='best')

g = sns.distplot(df_train['capital_loss'], color='b', label='Skewness : {:.2f}'.format(df_train['capital_loss'].skew()), ax=ax[1])
g = g.legend(loc='best')
plt.show()

[Figure: distribution plots of capital_gain and capital_loss, with skewness shown in the legend]

     

numeric_fts = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

outlier_ind = []
for i in numeric_fts:
  Q1 = np.percentile(df_train[i],25)
  Q3 = np.percentile(df_train[i],75)
  IQR = Q3-Q1
  outlier_list = df_train[(df_train[i] < Q1 - IQR * 1.5) | (df_train[i] > Q3 + IQR * 1.5)].index
  outlier_ind.extend(outlier_list)

outlier_ind = Counter(outlier_ind)
multi_outliers = list(k for k,j in outlier_ind.items() if j > 2)

# Drop outliers
train_df = df_train.drop(multi_outliers, axis = 0).reset_index(drop = True)
train_df
              id  age     workclass  fnlwgt     education  education_num  \
    0          0   32       Private  309513    Assoc-acdm             12   
    1          1   33       Private  205469  Some-college             10   
    2          2   46       Private  149949  Some-college             10   
    3          3   23       Private  193090     Bachelors             13   
    4          4   55       Private   60193       HS-grad              9   
    ...      ...  ...           ...     ...           ...            ...   
    15043  15076   35       Private  337286       Masters             14   
    15044  15077   36       Private  182074  Some-college             10   
    15045  15078   50  Self-emp-inc  175070   Prof-school             15   
    15046  15079   39       Private  202937  Some-college             10   
    15047  15080   33       Private   96245    Assoc-acdm             12   
    
               marital_status       occupation   relationship                race  \
    0      Married-civ-spouse     Craft-repair        Husband               White   
    1      Married-civ-spouse  Exec-managerial        Husband               White   
    2      Married-civ-spouse     Craft-repair        Husband               White   
    3           Never-married     Adm-clerical      Own-child               White   
    4                Divorced     Adm-clerical  Not-in-family               White   
    ...                   ...              ...            ...                 ...   
    15043       Never-married  Exec-managerial  Not-in-family  Asian-Pac-Islander   
    15044            Divorced     Adm-clerical  Not-in-family               White   
    15045  Married-civ-spouse   Prof-specialty        Husband               White   
    15046            Divorced     Tech-support  Not-in-family               White   
    15047  Married-civ-spouse   Prof-specialty        Husband               White   
    
              sex  capital_gain  capital_loss  hours_per_week native_country  \
    0        Male             0             0              40  United-States   
    1        Male             0             0              40  United-States   
    2        Male             0             0              40  United-States   
    3      Female             0             0              30  United-States   
    4      Female             0             0              40  United-States   
    ...       ...           ...           ...             ...            ...   
    15043    Male             0             0              40  United-States   
    15044    Male             0             0              45  United-States   
    15045    Male             0             0              45  United-States   
    15046  Female             0             0              40         Poland   
    15047    Male             0             0              50  United-States   
    
           target  
    0           0  
    1           1  
    2           0  
    3           0  
    4           0  
    ...       ...  
    15043       0  
    15044       0  
    15045       1  
    15046       0  
    15047       0  
    
    [15048 rows x 16 columns]

     

print(train_df['capital_gain'].skew(), train_df['capital_loss'].skew())
    12.004940559585881 4.607122286739042
  • Even after removing the outliers, both variables remain highly skewed, so a log transformation was applied.
# log transformation
train_df['capital_gain'] = train_df['capital_gain'].map(lambda i: np.log(i) if i > 0 else 0)
test['capital_gain'] = test['capital_gain'].map(lambda i: np.log(i) if i > 0 else 0)

train_df['capital_loss'] = train_df['capital_loss'].map(lambda i: np.log(i) if i > 0 else 0)
test['capital_loss'] = test['capital_loss'].map(lambda i: np.log(i) if i > 0 else 0)

print(train_df['capital_gain'].skew(), train_df['capital_loss'].skew())
    3.0945787119106676 4.390015583095806
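
One detail of the transform above: since log(1) = 0, an input of 1 and an input of 0 both map to 0. A common alternative, not the one used in this post, is `np.log1p`, which computes log(1 + x), is defined at 0, and keeps distinct inputs distinct. A minimal comparison on made-up values:

```python
import numpy as np
import pandas as pd

capital_gain = pd.Series([0, 1, 2174, 99999])  # toy values

# Transform used in the post: log for positive values, 0 otherwise.
log_or_zero = capital_gain.map(lambda v: np.log(v) if v > 0 else 0)

# Alternative: log(1 + x), which needs no special case for 0.
log1p = np.log1p(capital_gain)

print(log_or_zero.tolist())
print(log1p.tolist())

# np.expm1 inverts np.log1p, which is handy when values must be mapped back.
restored = np.expm1(log1p)
```

Whether this changes the model's score would have to be checked on the actual data.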

     

3. Feature Engineering

(1) Correlation

  • Categorical data is converted to numerical form with a label encoder (LabelEncoder) before checking correlations.
  • Categorical : workclass, education, marital.status, occupation, relationship, race, sex, native.country
la_train = train_df.copy()

cat_fts = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
for i in range(len(cat_fts)):
  encoder = LabelEncoder()
  la_train[cat_fts[i]] = encoder.fit_transform(la_train[cat_fts[i]])

la_train.head()
       id  age  workclass  fnlwgt  education  education_num  marital_status  \
    0   0   32          2  309513          7             12               2   
    1   1   33          2  205469         15             10               2   
    2   2   46          2  149949         15             10               2   
    3   3   23          2  193090          9             13               4   
    4   4   55          2   60193         11              9               0   
    
       occupation  relationship  race  sex  capital_gain  capital_loss  \
    0           2             0     4    1           0.0           0.0   
    1           3             0     4    1           0.0           0.0   
    2           2             0     4    1           0.0           0.0   
    3           0             3     4    0           0.0           0.0   
    4           0             1     4    0           0.0           0.0   
    
       hours_per_week  native_country  target  
    0              40              38       0  
    1              40              38       1  
    2              40              38       0  
    3              30              38       0  
    4              40              38       0  

     

  • The correlation coefficients were computed with reference to the Alerts section of the Pandas Profiling Report generated earlier.
  • The variable pairs that seem to have a meaningful correlation are relationship-sex, occupation-workclass, and education-education.num.
# Pearson
la_train[['age','marital_status', 'relationship', 'sex', 'occupation', 'workclass']].corr()
                         age  marital_status  relationship       sex  occupation  \
    age             1.000000       -0.271955     -0.240331  0.087515   -0.007994   
    marital_status -0.271955        1.000000      0.180281 -0.124481    0.023856   
    relationship   -0.240331        0.180281      1.000000 -0.590077   -0.052109   
    sex             0.087515       -0.124481     -0.590077  1.000000    0.061443   
    occupation     -0.007994        0.023856     -0.052109  0.061443    1.000000   
    workclass       0.081100       -0.044000     -0.070512  0.078764    0.010194   
    
                    workclass  
    age              0.081100  
    marital_status  -0.044000  
    relationship    -0.070512  
    sex              0.078764  
    occupation       0.010194  
    workclass        1.000000 
la_train[['education', 'education_num', 'race', 'native_country']].corr()
                    education  education_num      race  native_country
    education        1.000000       0.348614  0.011236        0.079063
    education_num    0.348614       1.000000  0.034686        0.097485
    race             0.011236       0.034686  1.000000        0.126654
    native_country   0.079063       0.097485  0.126654        1.000000

     

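  • For the categorical variables, the association was checked with Cramér's V.

    \(V = \sqrt{\frac{\chi^2}{N \cdot \min(k-1, r-1)}}\)

    • $N$: total number of observations
    • $k$: number of rows of the contingency table
    • $r$: number of columns of the contingency table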
# chi2_contingency expects a table of counts, so cross-tabulate the two columns first
contingency = pd.crosstab(la_train['race'], la_train['native_country'])
stat = stats.chi2_contingency(contingency, correction=False)[0]  # chi-square statistic
obs = contingency.values.sum()      # N: total number of observations
mini = min(contingency.shape) - 1   # min(k-1, r-1)

# Cramer's V 
V = np.sqrt((stat/(obs*mini)))
print(V)

     

(2) String to numerical

  • To use categorical data as model input, it has to be converted to numerical data. Because label encoding can introduce spurious relationships (an artificial ordering) between categories, one-hot encoding was used instead.
  • Categorical : workclass, education, marital.status, occupation, relationship, race, sex, native.country
train_dataset = train_df.copy()
test_dataset = test.copy()

     

  • One-hot encoding was done with get_dummies.
train_dataset = pd.get_dummies(train_dataset)
test_dataset = pd.get_dummies(test_dataset)

print(train_dataset.columns)
print(test_dataset.columns)
    Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
           'hours_per_week', 'target', 'workclass_Federal-gov',
           'workclass_Local-gov',
           ...
           'native_country_Portugal', 'native_country_Puerto-Rico',
           'native_country_Scotland', 'native_country_South',
           'native_country_Taiwan', 'native_country_Thailand',
           'native_country_Trinadad&Tobago', 'native_country_United-States',
           'native_country_Vietnam', 'native_country_Yugoslavia'],
          dtype='object', length=106)
    Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
           'hours_per_week', 'workclass_Federal-gov', 'workclass_Local-gov',
           'workclass_Private',
           ...
           'native_country_Portugal', 'native_country_Puerto-Rico',
           'native_country_Scotland', 'native_country_South',
           'native_country_Taiwan', 'native_country_Thailand',
           'native_country_Trinadad&Tobago', 'native_country_United-States',
           'native_country_Vietnam', 'native_country_Yugoslavia'],
          dtype='object', length=104)

     

  • Match the number of columns between the train and test data.
test_col = []
add_test = []

for i in test_dataset.columns:
    test_col.append(i)
for j in train_dataset.columns:
    if j not in test_col:
        add_test.append(j)
add_test.remove('target')
  • Could it be that the native.country column of the test data has no 'Holand-Netherlands' value?
print(add_test)
    ['native_country_Holand-Netherlands']

     

for d in add_test:
    test_dataset[d] = 0

print(train_dataset.columns)
print(test_dataset.columns)
    Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
           'hours_per_week', 'target', 'workclass_Federal-gov',
           'workclass_Local-gov',
           ...
           'native_country_Portugal', 'native_country_Puerto-Rico',
           'native_country_Scotland', 'native_country_South',
           'native_country_Taiwan', 'native_country_Thailand',
           'native_country_Trinadad&Tobago', 'native_country_United-States',
           'native_country_Vietnam', 'native_country_Yugoslavia'],
          dtype='object', length=106)
    Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
           'hours_per_week', 'workclass_Federal-gov', 'workclass_Local-gov',
           'workclass_Private',
           ...
           'native_country_Puerto-Rico', 'native_country_Scotland',
           'native_country_South', 'native_country_Taiwan',
           'native_country_Thailand', 'native_country_Trinadad&Tobago',
           'native_country_United-States', 'native_country_Vietnam',
           'native_country_Yugoslavia', 'native_country_Holand-Netherlands'],
          dtype='object', length=105)
  • Excluding the target column of the train data, the number of columns now matches.
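
The column counts match, but the printed indexes show that the added column sits at the end of the test frame while it sits in alphabetical position in the train frame. Since the models below are fit on `.values`, column order matters when predicting on the test set. A minimal sketch of aligning both count and order with `DataFrame.reindex`, using made-up toy frames:

```python
import pandas as pd

# Toy frames mimicking the situation above (values are made up).
train_toy = pd.DataFrame({
    "age": [32, 33],
    "target": [0, 1],
    "country_Holand-Netherlands": [1, 0],
    "country_United-States": [0, 1],
})
test_toy = pd.DataFrame({"age": [23], "country_United-States": [1]})

# Feature columns in train order, without the target.
feature_cols = train_toy.columns.drop("target")

# reindex adds any missing column filled with 0 and reorders to match train.
test_aligned = test_toy.reindex(columns=feature_cols, fill_value=0)

print(list(test_aligned.columns))
```

On the real data the corresponding call would be `test_dataset.reindex(columns=train_dataset.columns.drop('target'), fill_value=0)`; note that this would also keep the id column, which is dropped later before training.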

     

4. Modeling

  • First, split the data into train and validation sets with the train_test_split function.
test_size =0.15

train_data, val_data = train_test_split(train_dataset, test_size = test_size, random_state = seed_num)

drop_col = ['target', 'id']

train_x = train_data.drop(drop_col, axis = 1)
train_y = pd.DataFrame(train_data['target'])

val_x = val_data.drop(drop_col, axis = 1)
val_y = pd.DataFrame(val_data['target'])

print(train_x.shape, train_y.shape)
print(val_x.shape, val_y.shape)
    (12790, 104) (12790, 1)
    (2258, 104) (2258, 1)

     

  • A simple ensemble pipeline was built by soft voting LGBM and XGBoost.
  • Soft voting averages the predicted probabilities of the LGBM and XGBoost models to decide the final class.
LGBClassifier = lgb.LGBMClassifier(random_state = seed_num)
lgbm = LGBClassifier.fit(train_x.values,
                       train_y.values.ravel(),
                       eval_set = [(train_x.values, train_y), (val_x.values, val_y)], 
                       eval_metric ='auc', early_stopping_rounds = 1000,
                       verbose = True)

XGBClassifier = xgb.XGBClassifier(max_depth = 6, learning_rate = 0.01, n_estimators = 10000, random_state = seed_num)
# Note: this rebinds the name xgb from the xgboost module to the fitted model
xgb = XGBClassifier.fit(train_x.values,
                       train_y.values.ravel(),
                       eval_set = [(train_x.values, train_y), (val_x.values, val_y)], 
                       eval_metric = 'auc', early_stopping_rounds = 1000,
                       verbose = True)

# VotingClassifier clones the estimators and refits them on (X, y) only,
# so the eval_set and early stopping used above do not apply inside the ensemble
voting = VotingClassifier(estimators=[('xgb', xgb),('lgbm', lgbm)], voting='soft')
vot = voting.fit(train_x.values, train_y.values.ravel())  # ravel: labels as a 1-D array
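
The averaging that soft voting performs can also be written out by hand, which makes the mechanism easier to see. The sketch below uses made-up probability arrays in place of real `predict_proba` outputs; the function name `soft_vote` is hypothetical.

```python
import numpy as np

def soft_vote(probas, weights=None):
    """Average class-probability arrays of shape (n_samples, n_classes).

    Returns the predicted class indices and the averaged probabilities.
    """
    stacked = np.stack(probas)  # shape: (n_models, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg.argmax(axis=1), avg

# Made-up probabilities from two hypothetical models for three samples.
p_a = np.array([[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]])
p_b = np.array([[0.80, 0.20], [0.70, 0.30], [0.30, 0.70]])

labels, avg = soft_vote([p_a, p_b])
print(labels)
print(avg)
```

With already fitted models, the inputs would be something like `lgbm.predict_proba(val_x.values)` and `xgb.predict_proba(val_x.values)`, which averages the early-stopped models directly instead of refitting them.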

     

5. Evaluation & Submission

l_val_y_pred = lgbm.predict(val_x.values)
x_val_y_pred = xgb.predict(val_x.values)
v_val_y_pred = vot.predict(val_x.values)

# accuracy_score takes (y_true, y_pred); accuracy is the same in either order
print(metrics.accuracy_score(val_y, l_val_y_pred))
print(metrics.accuracy_score(val_y, x_val_y_pred))
print(metrics.accuracy_score(val_y, v_val_y_pred))
    0.8702391496899912
    0.8680248007085917
    0.8596102745792737

     

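# Note: scikit-learn expects classification_report(y_true, y_pred). The call below passes
# the predictions first, so in the printed report the precision and recall columns are
# swapped and support counts predicted (not true) labels.
print(metrics.classification_report(v_val_y_pred, val_y))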
                  precision    recall  f1-score   support
    
               0       0.93      0.89      0.91      1800
               1       0.63      0.72      0.68       458
        
        accuracy                           0.86      2258
       macro avg       0.78      0.81      0.79      2258
    weighted avg       0.87      0.86      0.86      2258
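
The report above was produced with the predictions as the first argument, whereas scikit-learn's `classification_report` takes `y_true` first. Read in the usual sense, class 1 therefore has a precision of about 0.72 and a recall of about 0.63, the reverse of the printed columns. The effect of the argument order can be seen in a small NumPy sketch on made-up labels; `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of the positive class, computed from counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    precision = tp / np.sum(y_pred == positive)  # of predicted positives, how many are right
    recall = tp / np.sum(y_true == positive)     # of actual positives, how many are found
    return precision, recall

# Made-up labels: 4 actual positives, 3 predicted positives, 2 in common.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

correct = precision_recall(y_true, y_pred)
swapped = precision_recall(y_pred, y_true)

print(correct)
print(swapped)
```

Swapping the arguments exchanges the two values, while the F1 score and accuracy stay the same.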

     

# l_val_y_pred holds the LightGBM predictions and x_val_y_pred the XGBoost predictions
val_lgbm = pd.Series(l_val_y_pred, name="LGBM")
val_xgb = pd.Series(x_val_y_pred, name="XGB")

ensemble_results = pd.concat([val_xgb,val_lgbm],axis=1)
sns.heatmap(ensemble_results.corr(), annot=True)
plt.show()

[Figure: heatmap of the correlation between the XGB and LGBM validation predictions]

     

  • Soft voting did not improve performance.
  • My guess is that because the two models' predictions are highly correlated, ensembling them does not improve performance.
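
To make the guess above more concrete: combining two models can only change the outcome on samples where they disagree, so a high agreement rate leaves little room for improvement. A minimal sketch with made-up prediction vectors (not the actual validation predictions):

```python
import numpy as np

# Hypothetical class predictions from two models on eight samples.
pred_a = np.array([0, 0, 1, 1, 0, 1, 0, 0])
pred_b = np.array([0, 0, 1, 1, 0, 0, 0, 0])

# Share of samples on which both models predict the same class.
agreement = np.mean(pred_a == pred_b)

# Pearson correlation between the two prediction vectors.
corr = np.corrcoef(pred_a, pred_b)[0, 1]

# Indices where the models disagree: the only samples an ensemble can change.
disagree_idx = np.flatnonzero(pred_a != pred_b)

print(agreement, corr, disagree_idx)
```

Applied to `l_val_y_pred` and `x_val_y_pred`, the size of `disagree_idx` would give an upper bound on how many validation predictions voting could alter.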
This post is licensed under CC BY 4.0 by the author.