
[AIB] Bias, Variance, R-Square, Multiple Regression, Evaluation Metrics

minzeros 2022. 3. 25. 15:37

💡 Bias

The inability of a machine learning method (like linear regression) to capture the true relationship (how far the model falls short of the actual data).

A straight line can't curve the way the 'true' relationship does.

A squiggly line may do a great job fitting the training set, but a terrible job fitting the testing set.

(= Overfitting)

 

 

💡 Variance

The difference in fits between data sets.

When the predictions are, on the whole, far from the observed values, we say the result has high bias;

when the predictions are widely scattered from one another, we say it has high variance.
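The "difference in fits between data sets" can be made concrete with a small sketch (the curved truth, noise level, and polynomial degrees below are all arbitrary choices for illustration): fit the same kind of model to many data sets drawn from one true relationship, and measure how much the fitted prediction at a fixed point spreads.

```python
# Illustrative sketch (made-up data): fit a straight line and a squiggly line
# to several data sets sampled from the same curved truth, then compare how
# much the fitted value at x = 0.5 varies from one data set to the next.
import numpy as np

rng = np.random.default_rng(0)

def predictions_at(x0, degree, n_datasets=50):
    """Fit a degree-`degree` polynomial to each of several noisy samples of
    the same true curve; return the fitted values at x0 across data sets."""
    preds = []
    for _ in range(n_datasets):
        x = np.sort(rng.uniform(0, 1, 30))
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)
        poly = np.polynomial.Polynomial.fit(x, y, degree)
        preds.append(poly(x0))
    return np.array(preds)

line_preds = predictions_at(0.5, degree=1)       # straight line
squiggle_preds = predictions_at(0.5, degree=12)  # squiggly line

# the squiggly line's fits differ far more between data sets (high variance)
print(f'straight-line spread: {line_preds.std():.3f}')
print(f'squiggly-line spread: {squiggle_preds.std():.3f}')
```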

 

cf.

Three commonly used methods for finding the sweet spot (the ideal balance) between simple and complicated models are regularization, boosting, and bagging.
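As a sketch of the first of the three, here is what regularization looks like with scikit-learn's Ridge (the toy data and the alpha value are arbitrary choices): the penalty shrinks the coefficients, trading a little bias for lower variance.

```python
# Regularization sketch: Ridge adds a penalty on coefficient size, pulling the
# fit toward a simpler model (alpha=10.0 is an arbitrary choice here).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = X[:, 0] + rng.normal(0, 1.0, 30)  # only the first feature truly matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# the penalized coefficients are smaller overall than the unpenalized ones
print(f'OLS   coefficient norm: {np.linalg.norm(ols.coef_):.3f}')
print(f'Ridge coefficient norm: {np.linalg.norm(ridge.coef_):.3f}')
```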

 

 

✨ How to calculate R-squared using regression analysis

Compare the (observed − mean) distances with the (predicted − mean) distances.

The closer the R-squared value is to 1, the closer the predictions are to the observed values.
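That comparison can be written out directly. A minimal sketch with made-up numbers (for an ordinary least-squares fit, SSR/SST equals 1 − SSE/SST):

```python
# R-squared from the two kinds of distances (toy numbers)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = np.array([2.0, 4.0, 5.0, 4.0, 5.0])  # observed values

slope, intercept = np.polyfit(x, y_obs, 1)   # least-squares line
y_hat = slope * x + intercept                # predicted values

ss_total = np.sum((y_obs - y_obs.mean()) ** 2)  # (observed - mean)^2
ss_reg = np.sum((y_hat - y_obs.mean()) ** 2)    # (predicted - mean)^2

r2 = ss_reg / ss_total
print(f'R-squared: {r2:.3f}')  # → R-squared: 0.600
```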

 

✨ Standard Error of the Estimate

y = actual y value

ŷ = predicted y value

n = number of predictions
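In symbols this is commonly √(Σ(y − ŷ)² / (n − 2)) for simple linear regression, where the 2 accounts for the two fitted parameters (slope and intercept); some texts divide by n instead. A small sketch under that assumption:

```python
# Standard error of the estimate, assuming the common n - 2 denominator
# for simple linear regression (slope and intercept both estimated).
import numpy as np

def standard_error_of_estimate(y, y_hat, n_params=2):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.sum((y - y_hat) ** 2) / (y.size - n_params))

y = [2.0, 4.0, 5.0, 4.0, 5.0]      # actual y values
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]  # predicted y values
print(f'{standard_error_of_estimate(y, y_hat):.3f}')  # sqrt(2.4 / 3) ≈ 0.894
```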

 

💡 Evaluation metrics for regression models

  • MSE (Mean Squared Error)
  • MAE (Mean Absolute Error)
  • RMSE (Root Mean Squared Error)
  • R-squared 


  • SSE (Sum of Squares Error: squared differences between observed and predicted values)
  • SSR (Sum of Squares due to Regression: squared differences between predicted values and the mean)
  • SST (Sum of Squares Total: squared differences between observed values and the mean) = SSE + SSR
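The decomposition SST = SSE + SSR holds for an ordinary least-squares fit with an intercept; a quick check on synthetic data:

```python
# Verifying SST = SSE + SSR on an OLS fit (synthetic data)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, 100)

y_hat = LinearRegression().fit(X, y).predict(X)

sse = np.sum((y - y_hat) ** 2)         # observed vs predicted
ssr = np.sum((y_hat - y.mean()) ** 2)  # predicted vs mean
sst = np.sum((y - y.mean()) ** 2)      # observed vs mean

print(np.isclose(sst, sse + ssr))  # → True
```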
# Check mean_squared_error, mean_absolute_error, RMSE, and R-squared
# (y_test, y_pred: observed and predicted values from a fitted model)
import pandas as pd
from IPython.display import display
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)

display(pd.DataFrame([['MSE', mse],['MAE', mae],['RMSE', rmse],['R2', r2]], columns=['Metric', 'Score']))

Reference:

https://partrita.github.io/posts/regression-error/

 


 

 

 

๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ๋‚˜๋ˆ„์–ด์•ผ ํ• ๊นŒ?

๋ฐ์ดํ„ฐ๋ฅผ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•ด ๋‚˜๋ˆ„๋Š” ๋ฐฉ๋ฒ•์ด ์ผ๋ฐ˜์ ์ด์ง€๋งŒ, ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๊ณผ๊ฑฐ์—์„œ ๋ฏธ๋ž˜๋ฅผ ์˜ˆ์ธกํ•˜๋ ค๊ณ  ํ•˜๋Š” ๊ฒฝ์šฐ ๋ฌด์ž‘์œ„๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์„ž์œผ๋ฉด ์ ˆ๋Œ€ ์•ˆ๋œ๋‹ค. ์ด๋•Œ๋Š” ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ณด๋‹ค ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฏธ๋ž˜์˜ ๊ฒƒ์ด์–ด์•ผ ํ•œ๋‹ค.

# Ways to split the data randomly
# 1
train = df.sample(frac=0.75, random_state=1)	# frac: fraction of rows to sample
test = df.drop(train.index)

# 2
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, train_size=0.75, random_state=1)

 

💡 Simple linear regression example

import pandas as pd
df = pd.read_csv('https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/house-prices/house_prices_train.csv')
# Build a baseline model:
# use the mean of the target, SalePrice, as the baseline prediction

# define the target label
target = 'SalePrice'
y_train = train[target]
y_test = test[target]

predict = y_train.mean()
>>> 180327.24200913243
# compute the training error (MAE) of the baseline model
from sklearn.metrics import mean_absolute_error
y_pred = [predict] * len(y_train)
mae = mean_absolute_error(y_train, y_pred)
>>> 57775.57
# Using GrLivArea (above-ground living area, sqft) and SalePrice,
# draw a scatterplot with an OLS line

import seaborn as sns
sns.regplot(x=train['GrLivArea'], y=train['SalePrice']).set_title('Housing Prices')

cf.

seaborn.regplot()

: plots a scatter plot together with a fitted regression line.

 

 

from sklearn.linear_model import LinearRegression

model = LinearRegression()

features = ['GrLivArea']
X_train = train[features]
X_test = test[features]

# fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Training error: {mae:.2f}')
>>> Training error: 38327.78
# apply to the test data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test error: {mae:.2f}')
>>> Test error: 35476.63

 

💡 Multiple linear regression example (two or more features)

features = ['GrLivArea', 'OverallQual']
target = 'SalePrice'

X_train = train[features]
X_test = test[features]

# fit the model
model.fit(X_train, y_train)
y_pred = model.predict(X_train)
mae = mean_absolute_error(y_train, y_pred)
print(f'Training error: {mae:.2f}')
>>> Training error: 29129.58
# apply to the test data
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print(f'Test error: {mae:.2f}')
>>> Test error: 27598.31

 

Overfitting and Underfitting

Generalization

The error a model produces on the test data is called the generalization error.

A model that performs as well on the test data as it does on the training data is said to generalize well.

There are many generalization techniques for keeping a model from learning the training data too closely (overfitting).

 

Overfitting

Overfitting is when a model learns properties peculiar to the training data so closely that it fails to generalize, so its error on the test data ends up large.

 

Underfitting

Underfitting is when a model fails to fit even the training data and never learns the generalizable structure, so the error is large on both the training and test data.

 

 

Variance/bias trade-off

Overfitting and underfitting are tied to the bias and variance of a model's error.

High variance means the model has fit the noise in the training data too sensitively and generalizes poorly to the test data; that is, it is overfitting.

High bias means the model has failed to capture the relationship between the features and the target in the training data; that is, it is underfitting.
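A small synthetic sweep makes the trade-off visible (the data, split sizes, and degrees below are made up for illustration): training error keeps shrinking as the model gets more flexible, while test error stops improving once the extra flexibility only fits noise.

```python
# Train vs test error as model complexity (polynomial degree) grows
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

idx = rng.permutation(x.size)  # not time-series data, so shuffling is fine
tr, te = idx[:150], idx[150:]

results = {}
for degree in [1, 3, 9, 15]:
    poly = np.polynomial.Polynomial.fit(x[tr], y[tr], degree)
    results[degree] = (np.mean((y[tr] - poly(x[tr])) ** 2),  # train MSE
                       np.mean((y[te] - poly(x[te])) ** 2))  # test MSE
    print(f'degree {degree:2d}: train MSE {results[degree][0]:.3f}, '
          f'test MSE {results[degree][1]:.3f}')
```

The degree-1 model (high bias) does badly on both sets; as the degree grows, training error only goes down, while test error is best at an intermediate degree.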