Codelog

[AIB] OLS, MAE, RSS, Simple Regression


minzeros 2022. 1. 7. 18:46

💡 Supervised Learning: Comparing Classification & Regression

Purpose

  • Supervised Classification → Is this A or B?
  • Regression → How much / How many?

Output type

  • Supervised Classification → discrete output (a class or label)
  • Regression → continuous output (a number)

What are you trying to find?

  • Supervised Classification → decision boundary
  • Regression → best fit line

Evaluation

  • Supervised Classification → accuracy
  • Regression → sum of squared errors (SSE) or R-squared

 

💡 Baseline Model

Before building a concrete prediction model, the simplest, most intuitive model that provides a minimum level of performance is called the baseline model.

The baseline model is set differently for each type of problem.

  • Classification: the most frequent class (mode) of the target
  • Regression: the mean of the target
  • Time-series regression: the value at the previous timestamp
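These three baselines can be sketched with pandas; the toy DataFrame below is hypothetical, just to show the pattern:

```python
import pandas as pd

# Hypothetical toy data: a classification target and a regression target
df = pd.DataFrame({
    "label": ["cat", "dog", "cat", "cat"],
    "price": [100.0, 200.0, 300.0, 400.0],
})

# Classification baseline: the most frequent class (mode) of the target
baseline_class = df["label"].mode()[0]

# Regression baseline: the mean of the target
baseline_reg = df["price"].mean()

# Time-series baseline: the value at the previous timestamp
baseline_ts = df["price"].shift(1)

print(baseline_class, baseline_reg)  # cat 250.0
```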

 

💡 Least Squares Method (Ordinary Least Squares, OLS)

A method for approximately solving the equations of a system: it finds the solution that minimizes the sum of the squared errors between the approximate solution and the actual values.

 

How to compute a linear regression model from a set of (x, y) data points
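One way to sketch this: for simple regression, the least-squares slope and intercept have closed forms, slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄ (the data points below are made up):

```python
import numpy as np

# Hypothetical (x, y) data lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: covariance of x and y divided by the variance of x
slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()

# Intercept: forces the line through the point of means (x̄, ȳ)
intercept = y_mean - slope * x_mean

print(slope, intercept)  # 2.0 1.0
```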

 

 

๐Ÿ’ก ํ‰๊ท ์ ˆ๋Œ€์˜ค์ฐจ (mean absolute error, MAE)

์—๋Ÿฌ๊ฐ’์— ์ ˆ๋Œ€๊ฐ’์„ ์ทจํ•œ ํ›„ ํ‰๊ท ์„ ๋‚ธ ๊ฐ’

mae๋Š” ๋‹ค๋ฅธ ์˜ค์ฐจ๊ณ„์‚ฐ๋ฒ•๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ ์—๋Ÿฌ๊ฐ’๊ณผ ์‹ค์ œ ํƒ€๊ฒŸ ๋ฐ์ดํ„ฐ์˜ ๊ฐ’์˜ ๋‹จ์œ„๊ฐ€ ๊ฐ™์•„์„œ ์‰ฝ๊ฒŒ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋‹ค.

# df : dataset used to predict house sale prices

# Baseline for the regression problem
predict = df['SalePrice'].mean()

# Compute the errors against the baseline
errors = df['SalePrice'] - predict

# Compute the MAE
mae = errors.abs().mean()

 

 

✨ Using a Regression Model for Prediction

Drawing the line that best fits the scatterplot of the regression data gives the regression prediction model.

Two important concepts in regression analysis are the predicted value and the residual. The predicted value is the value the fitted model estimates,

and the residual is the difference between the predicted value and the observed value. (The error is the same difference measured in the population.)

 

The regression line is the line that minimizes RSS (Residual Sum of Squares), the sum of the squared residuals.

RSS is also called SSE (Sum of Squared Errors), and it serves as the cost function of the regression model.

In machine learning, the process of finding the model that minimizes this cost function is called 'training'.

Here the coefficients α and β are obtained through model training as the values that minimize RSS.

In other words, we need to find the α and β that minimize RSS.

This method of minimizing the residual sum of squares (RSS) is called least squares regression, or Ordinary Least Squares (OLS).
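As a quick illustration (toy numbers, not from the housing data), RSS can be computed for any candidate line, and a near-optimal line scores lower than a worse one:

```python
import numpy as np

# Hypothetical data scattered around y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.9, 5.1, 6.9, 9.1])

def rss(alpha, beta):
    """Residual sum of squares for the candidate line y = alpha * x + beta."""
    residuals = y - (alpha * x + beta)
    return (residuals ** 2).sum()

print(rss(2.0, 1.0))  # near-optimal line, small RSS
print(rss(2.5, 0.0))  # worse candidate line, larger RSS
```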

์œ„์˜ ๋…ธํŠธํ•„๊ธฐ ๋‚ด์šฉ

Linear regression helps predict the function value (output) at inputs that are not in the data by interpolation.

So even for values that fall in gaps between existing data points, a linear regression model can produce rough estimates.

A linear regression model also supports extrapolation: predicting values beyond the range of the existing data.
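A small sketch of both ideas with a fitted line (toy data assumed; `np.polyfit` with `deg=1` is a least-squares line fit):

```python
import numpy as np

# Hypothetical observations on y = 2x + 1, with x between 1 and 4
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Fit a straight line by least squares
slope, intercept = np.polyfit(x, y, deg=1)

# Interpolation: x = 2.5 lies inside the observed range
print(slope * 2.5 + intercept)   # ≈ 6.0

# Extrapolation: x = 10 lies outside the observed range [1, 4]
print(slope * 10.0 + intercept)  # ≈ 21.0
```

Extrapolation is generally less reliable than interpolation, since the linear pattern may not hold outside the observed range.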

 

 

💡 Simple Linear Regression Example

# Build a model that predicts house prices (housing dataset = df)

from sklearn.linear_model import LinearRegression

model = LinearRegression()

feature = ['GrLivArea']
target = ['SalePrice']
X_train = df[feature]
y_train = df[target]

# Train the model
model.fit(X_train, y_train)

# Predict on a new data sample with the trained model
X_test = [[4000]]
y_pred = model.predict(X_test)

print(f'The predicted price of a house with {X_test[0][0]} sqft of GrLivArea is ${int(y_pred[0][0])}.')

Output : The predicted price of a house with 4000 sqft of GrLivArea is $447090.

 

# Predict on the entire test set with the trained model (housing test dataset = df_t)
X_test = [[x] for x in df_t['GrLivArea']]
y_pred = model.predict(X_test)

 

Coefficients of a Linear Regression Model

Coefficient (regression coefficient) : the slope of the regression line (α)

Intercept : the y-intercept of the regression line (β)

 

sklearn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ LinearRegression ๊ฐ์ฒด์˜ coef_, intercept_ ์†์„ฑ์„ ํ†ตํ•ด์„œ ๊ฐ’์„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค.

model.coef_	# array([[107.13035897]])
model.intercept_	# array([18569.02585649])
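Given these attributes, a prediction is just coef_ × feature + intercept_. A stand-in numpy sketch (the numbers are hypothetical, since df is not reproduced here):

```python
import numpy as np

# Hypothetical stand-in for the housing data: price = 100 * sqft
sqft = np.array([1000.0, 2000.0, 3000.0])
price = np.array([100000.0, 200000.0, 300000.0])

# Least-squares line fit; slope and intercept play the roles of
# LinearRegression's coef_ and intercept_
coef, intercept = np.polyfit(sqft, price, deg=1)

# A prediction is slope * feature + intercept
print(round(coef * 4000 + intercept))  # 400000
```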

 

cf.

LinearRegression module → uses ordinary least squares (OLS)

SGDRegressor module → uses gradient descent
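For contrast, a toy sketch of the gradient-descent idea behind SGDRegressor (pure numpy, full-batch updates rather than the per-sample updates SGD actually uses; data hypothetical):

```python
import numpy as np

# Hypothetical data on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

alpha, beta = 0.0, 0.0  # slope and intercept, initialized at zero
lr = 0.05               # learning rate

# Gradient descent on the mean squared error
for _ in range(2000):
    pred = alpha * x + beta
    grad_alpha = -2 * ((y - pred) * x).mean()
    grad_beta = -2 * (y - pred).mean()
    alpha -= lr * grad_alpha
    beta -= lr * grad_beta

print(round(alpha, 2), round(beta, 2))  # 2.0 1.0
```

Instead of solving for α and β in closed form as OLS does, this iteratively nudges them in the direction that reduces the cost function.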
