[AIB] One Hot Encoding, Feature Selection, Ridge Regression

minzeros 2022. 3. 30. 17:10

๐Ÿ’ก One-hot Encoding

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({
	'City' : ['Seoul', 'Seoul', 'Seoul', 'Busan', 'Busan', 'Busan', 'Incheon', 'Incheon', 'Seoul', 'Busan', 'Incheon'],
    'Room' : [3, 4, 3, 2, 3, 3, 3, 3, 3, 3, 2],
    'Price' : [55000, 61000, 44000, 35000, 53000, 45000, 32000, 51000, 50000, 40000, 30000]
})

# Inspect the data
df

output :

 

City ์ปฌ๋Ÿผ์— ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋Š” ๋„์‹œ ์ง€์—ญ์„ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜ (Categorical variable)์ด๋‹ค.

๋ฒ”์ฃผํ˜• ์ž๋ฃŒ๋Š” ์ˆœ์„œ๊ฐ€ ์—†๋Š” ๋ช…๋ชฉํ˜•(Nominal)๊ณผ ์ˆœ์„œ๊ฐ€ ์žˆ๋Š” ์ˆœ์„œํ˜•(Ordinal)๋กœ ๋‚˜๋‰œ๋‹ค.

City ์ปฌ๋Ÿผ์€ ๋ช…๋ชฉํ˜• ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

When a string variable like the City column has to be processed, an encoding scheme is used.

The mapping that one-hot encoding applies to the categorical data looks like this:

City        Seoul  Busan  Incheon
Seoul         1      0      0
Busan         0      1      0
Incheon       0      0      1

 

์›ํ•ซ์ธ์ฝ”๋”ฉ์„ ์ˆ˜ํ–‰ํ•˜๋ฉด ๊ฐ ์นดํ…Œ๊ณ ๋ฆฌ์— ํ•ด๋‹นํ•˜๋Š” ๋ณ€์ˆ˜๋“ค์ด ๋ชจ๋‘ ์ฐจ์›(์ปฌ๋Ÿผ)์— ๋”ํ•ด์ง€๊ฒŒ ๋œ๋‹ค.

๋”ฐ๋ผ์„œ ์นดํ…Œ๊ณ ๋ฆฌ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์€ ๊ฒฝ์šฐ(high cardinality)์—๋Š” ์‚ฌ์šฉํ•˜๊ธฐ ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค.

 

 

 

โœจ ์›ํ•ซ์ธ์ฝ”๋”ฉ ์˜ˆ์ œ

# Use the category_encoders library
!pip install category_encoders
from category_encoders import OneHotEncoder

features = ['City', 'Room']
target = 'Price'

# Split into training/test data
X_train = df[features][:8]
y_train = df[target][:8]
X_test = df[features][8:]
y_test = df[target][8:]

# ์›ํ•ซ์ธ์ฝ”๋”ฉ
encoder = OneHotEncoder(use_cat_names=True)
X_train = encoder.fit_transform(X_train)
X_test = encoder.fit_transform(X_test)

 

X_train.head()

output :

X_test

output :

 

 

๐Ÿ’ก Feature Selection

ํŠน์„ฑ๊ณตํ•™์€ ๊ณผ์ œ์— ์ ํ•ฉํ•œ ํŠน์„ฑ์„ ๋งŒ๋“ค์–ด ๋‚ด๋Š” ๊ณผ์ •์ด๋‹ค.

๊ทธ ์ค‘์—์„œ ํŠน์„ฑ ์„ ํƒ์ด ์กด์žฌํ•˜๋Š”๋ฐ ์ข‹์€ ํŠน์„ฑ์„ ์„ ํƒํ•˜๋Š” ๋ฐฉ๋ฒ•์€

ํŠน์„ฑ๋ผ๋ฆฌ์˜ ์ƒ๊ด€๋„๋Š” ๋‚ฎ์œผ๋ฉด์„œ, ํƒ€๊ฒŸ๊ณผ์˜ ์ƒ๊ด€๋„๊ฐ€ ํฐ ํŠน์„ฑ ์กฐํ•ฉ์„ ์„ ํƒํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

 

 

โœจ Feature selection example

This example uses the King County house price data.
# King County house price data
df = pd.read_csv('kc_house_data.csv')

# Drop rows whose price is in the bottom 5% or the top 5%
# (np.percentile takes percentages, i.e. 5 and 95)
df = df[(df['price'] >= np.percentile(df['price'], 5)) &
        (df['price'] <= np.percentile(df['price'], 95))]
    
# Convert the date column to datetime64
df['date'] = pd.to_datetime(df['date'])

# Split into training/test data using 2015-03-01 as the cutoff date
cutoff = pd.to_datetime('2015-03-01')

train = df[df['date'] < cutoff]
test = df[df['date'] >= cutoff]
# Before selecting features, create and drop some features = "feature engineering"

def engineer_features(X):
    # Work on a copy (pandas.DataFrame.copy) so the original DataFrame is untouched
    X = X.copy()

    # Convert the bathroom count to an integer
    X['bathrooms'] = X['bathrooms'].round(0).astype(int)

    # Combine bedroom and bathroom counts into a single rooms column
    X['rooms'] = X['bedrooms'] + X['bathrooms']

    # Drop features that will not be used
    X = X.drop(['id', 'date', 'waterfront'], axis=1)

    return X
    
train = engineer_features(train)
test = engineer_features(test)
# ๋ณ€๊ฒฝ๋œ ํ…Œ์ด๋ธ” ํ™•์ธ
train.head()

output :

from math import factorial

# Number of ways to choose k features out of n
# (integer division keeps the result an int; math.comb(n, k) would also work)
def n_choose_k(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))

n = len(train.columns)
combinations = sum(n_choose_k(n, k) for k in range(1, n + 1))

print(combinations)
>>> 524287

k๊ฐœ์˜ ํŠน์„ฑ์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ์˜ ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์„ ๋•Œ,

์ข‹์€ ํŠน์„ฑ๋งŒ ๋ฝ‘์•„์ฃผ๋Š” SelectKBest ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŠน์„ฑ์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ๋‹ค.

target = 'price'

X_train = train.drop(columns=target)
y_train = train[target]
X_test = test.drop(columns=target)
y_test = test[target]

# target ํŠน์„ฑ์ธ price ์ปฌ๋Ÿผ๊ณผ ๊ฐ€์žฅ ์ƒ๊ด€๋„๊ฐ€ ๋†’์€ feature k๊ฐœ๋ฅผ ์„ ํƒ
from sklearn.feature_selection import f_regression, SelectKBest

# SelectKBest๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ score ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ
# ํšŒ๊ท€ ๋ฌธ์ œ์—์„œ๋Š” f_reression ์„ ์ž์ฃผ ์‚ฌ์šฉํ•œ๋‹ค.
selector = SelectKBest(score_func=f_regression, k=10)

# ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์— fit_transform
X_train_selected = selector.fit_transform(X_train, y_train)

# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์—๋Š” transform
X_test_selected = selector.transform(X_test)
# ์„ ํƒ๋œ ํŠน์„ฑ ํ™•์ธ

all_names = X_train.columns

selected_mask = selector.get_support()

# ์„ ํƒ๋œ ์ปฌ๋Ÿผ๋“ค
selected_names = all_names[selected_mask]

# ์„ ํƒ๋˜์ง€ ์•Š์€ ์ปฌ๋Ÿผ๋“ค
unselected_names = all_names[~selected_mask]

 

์„ ํƒํ•  ํŠน์„ฑ ๊ฐœ์ˆ˜ k๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

training = []
testing = []

# Build a model for each k, from 1 up to the total number of features, and compare the MAE values.
for k in range(1, len(X_train.columns) + 1):
    print(f'{k} features')
    
    selector = SelectKBest(score_func=f_regression, k=k)
    
    X_train_selected = selector.fit_transform(X_train, y_train)
    X_test_selected = selector.transform(X_test)
    
    all_names = X_train.columns
    selected_mask = selector.get_support()
    selected_names = all_names[selected_mask]
    print('Selected names: ', selected_names)

    
    model = LinearRegression()
    model.fit(X_train_selected, y_train)
    y_pred = model.predict(X_train_selected)
    mae = mean_absolute_error(y_train, y_pred)
    training.append(mae)
    
    y_pred = model.predict(X_test_selected)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    testing.append(mae)
    print(f'Test MAE: ${mae:,.0f}')
    print(f'Test R2: {r2} \n')

output :


ks = range(1, len(X_train.columns)+1)

plt.plot(ks, training, label='Training Score', color='b')
plt.plot(ks, testing, label='Testing Score', color='g')
plt.ylabel("MAE ($)")
plt.xlabel("Number of Features")
plt.title('Validation Curve')
plt.legend()
plt.show()

output :

On the graph, the largest drop in MAE happens when the number of features k goes from 5 to 6.

Using all of the features does give the lowest MAE, but the dimensionality grows so much that it burdens the model.

Therefore k = 6 can be considered the most effective choice here.

 

 

 

๐Ÿ’ก Ridge Regression

Ridge regression makes the usual multiple regression fit the training data less tightly.

The cost function of the ridge regression model is as follows.

n: number of samples,   p: number of features,   λ (lambda): tuning parameter,   β: regression coefficients
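Written out in standard form, consistent with the legend above and the SSE-plus-penalty description below:

$$ \text{cost} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 \;+\; \lambda\sum_{j=1}^{p}\beta_j^2 $$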

 

๋ฆฟ์ง€ ํšŒ๊ท€๋ชจ๋ธ์˜ ๋น„์šฉํ•จ์ˆ˜๋ฅผ ๋ณด๋ฉด ๊ธฐ์กด ๋น„์šฉํ•จ์ˆ˜์ธ SSE์— ํšŒ๊ท€๊ณ„์ˆ˜์ œ๊ณฑํ•ฉ ํŒŒํŠธ๊ฐ€ ์ถ”๊ฐ€๋˜์—ˆ๋‹ค.

ํšŒ๊ท€๊ณ„์ˆ˜์ œ๊ณฑํ•ฉ์—์„œ ๋žŒ๋‹ค๋Š” ํŠœ๋‹ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ, ๋žŒ๋‹ค ๊ฐ’์ด ์ปค์งˆ์ˆ˜๋ก ํšŒ๊ท€๊ณ„์ˆ˜๋“ค์„ 0์œผ๋กœ ์ˆ˜๋ ด์‹œํ‚จ๋‹ค.

ํšŒ๊ท€๊ณ„์ˆ˜๊ฐ’์„ 0์œผ๋กœ ์ˆ˜๋ ด์‹œํ‚ด์œผ๋กœ์จ ๋œ ์ค‘์š”ํ•œ ํŠน์„ฑ์˜ ๊ฐฏ์ˆ˜๋ฅผ ์ค„์ด๋Š” ํšจ๊ณผ๋ฅผ ๋‚ธ๋‹ค. ์ฆ‰ ๊ณผ์ ํ•ฉ์„ ์ค„์ธ๋‹ค.

๋ฐ˜๋Œ€๋กœ ๋žŒ๋‹ค๊ฐ€ 0์— ๊ฐ€๊นŒ์›Œ์ง€๋ฉด ๋ฆฟ์ง€ ํšŒ๊ท€๋Š” ๋‹ค์ค‘ํšŒ๊ท€๋ชจ๋ธ ๋ฌธ์ œ๊ฐ€ ๋œ๋‹ค.

 

Ridge ํšŒ๊ท€๋Š” ๊ณผ์ ํ•ฉ์„ ์ค„์ด๊ธฐ ์œ„ํ•ด์„œ ์‚ฌ์šฉํ•œ๋‹ค.

๊ณผ์ ํ•ฉ์„ ์ค„์ด๋Š” ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ ๋ชจ๋ธ์˜ ๋ณต์žก๋„๋ฅผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

ํŠน์„ฑ์˜ ๊ฐœ์ˆ˜๋ฅผ ์ค„์ด๊ฑฐ๋‚˜ ๋ชจ๋ธ์„ ๋‹จ์ˆœํ•œ ๋ชจ์–‘์œผ๋กœ ์ ํ•ฉ์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค.

Ridge ํšŒ๊ท€๋Š” ํŽธํ–ฅ(Bias)์„ ์กฐ๊ธˆ ๋”ํ•˜๊ณ , ๋ถ„์‚ฐ(Variance)์„ ์ค„์ด๋Š” ๋ฐฉ์‹์œผ๋กœ ์ •๊ทœํ™”(Regularization)๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค.

์—ฌ๊ธฐ์„œ ๋งํ•˜๋Š” ์ •๊ทœํ™”๋Š” ๋ชจ๋ธ์„ ๋ณ€ํ˜•ํ•˜์—ฌ ๊ณผ์ ํ•ฉ์„ ์™„ํ™”ํ•ด ์ผ๋ฐ˜ํ™” ์„ฑ๋Šฅ์„ ๋†’์—ฌ์ฃผ๊ธฐ ์œ„ํ•œ ๊ธฐ๋ฒ•์„ ๋งํ•œ๋‹ค.

 

 

โœจ Comparing a multiple regression model and a ridge regression model

OLS vs Ridge

OLS (Ordinary Least Squares) is the method of least squares:

it finds the weight vector that minimizes the residual sum of squares (RSS).
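In the same notation as the ridge cost above, the quantity OLS minimizes is:

$$ \text{RSS} = \sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2 = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 $$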

Using the Anscombe's quartet data:
import seaborn as sns
ans = sns.load_dataset('anscombe').query('dataset=="III"')
ans.plot.scatter('x', 'y')

 

OLS

%matplotlib inline

ax = ans.plot.scatter('x', 'y')

# OLS
ols = LinearRegression()
ols.fit(ans[['x']], ans['y'])

# Check the regression coefficient and the intercept
m = ols.coef_[0].round(2)
b = ols.intercept_.round(2)
title = f'Linear Regression \n y = {m}x + {b}'

# Predict on the training data
ans['y_pred'] = ols.predict(ans[['x']])

ans.plot('x', 'y_pred', ax=ax, title=title)

output :

 

Ridge Regression

λ ๊ฐ’์„ ์ฆ๊ฐ€์‹œํ‚ค๋ฉฐ ๊ทธ๋ž˜ํ”„๋ฅผ ํ†ตํ•ด ํšŒ๊ท€๊ณ„์ˆ˜์˜ ๋ณ€ํ™”๋ฅผ ํ™•์ธํ•œ๋‹ค.
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

def ridge_anscombe(alpha):
    # alpha corresponds to lambda in the cost function
    ans = sns.load_dataset('anscombe').query('dataset=="III"')

    ax = ans.plot.scatter('x', 'y')

    # Note: the `normalize` argument was removed from Ridge in scikit-learn 1.2;
    # on newer versions drop it (or standardize x beforehand).
    ridge = Ridge(alpha=alpha, normalize=True)
    ridge.fit(ans[['x']], ans['y'])

    # Regression coefficient and intercept
    m = ridge.coef_[0].round(2)
    b = ridge.intercept_.round(2)
    title = f'Ridge Regression, alpha={alpha} \n y = {m}x + {b}'

    # Predict on the training data
    ans['y_pred'] = ridge.predict(ans[['x']])

    ans.plot('x', 'y_pred', ax=ax, title=title)
    plt.show()


# Draw the graph for several alpha values
alphas = np.arange(0, 2, 0.4)
for alpha in alphas:
    ridge_anscombe(alpha)

output :

Looking at the graphs, when alpha = 0 the line is identical to the OLS fit, confirming that it is the same model,

and as alpha grows the slope of the line approaches 0, so the fit comes to resemble the baseline model that simply predicts the mean.

With a well-chosen penalty value, the predicted ridge regression line can be less affected by outliers.

 

์ตœ์  ํŒจ๋„ํ‹ฐ ๊ฐ’์ธ alpha ๊ฐ’์„ ํšจ์œจ์ ์œผ๋กœ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š”, ์—ฌ๋Ÿฌ ํŒจ๋„ํ‹ฐ ๊ฐ’์„ ๊ฐ€์ง€๊ณ  ๊ต์ฐจ๊ฒ€์ฆ (cross-validation)์„

์‚ฌ์šฉํ•˜์—ฌ ์ฐพ์•„๋‚˜๊ฐ€์•ผ ํ•œ๋‹ค.

sklearn์˜ RidgeCV ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๊ต์ฐจ๊ฒ€์ฆ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ž์ฒด์ ์œผ๋กœ ์ง„ํ–‰ํ•ด ์ตœ์ ์˜ ํŒจ๋„ํ‹ฐ ๊ฐ’์„ ์ฐพ๋Š”๋‹ค.

 

 

โœจ Finding the optimal penalty with RidgeCV

from sklearn.linear_model import RidgeCV

alphas = [0.01, 0.05, 0.1, 0.2, 1.0, 10.0, 100.0]

ridge = RidgeCV(alphas=alphas, normalize=True, cv=3)
ridge.fit(ans[['x']], ans['y'])
print("alpha: ", ridge.alpha_)
print("best score: ", ridge.best_score_)

output :

 

cf.

sklearn.linear_model.RidgeCV

: cv ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ†ตํ•ด์„œ K-fold ๊ต์ฐจ๊ฒ€์ฆ์„ ์ง€์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

 

 

reference

Ridge regression: a brief explanation and its advantages
https://modern-manual.tistory.com/21

Linear Regression - Ridge Regression, RidgeCV notes
https://velog.io/@dlskawns/Linear-Regression-Ridge-Regression-RidgeCV-%EC%A0%95%EB%A6%AC

 
