Codelog


Boot Camp/section2

[AIB] Logistic Regression

minzeros 2022. 3. 30. 18:34

💡 Classification Problems

In regression problems, the mean of the target variable is usually used as the baseline model.

In classification problems, the most frequent class of the target variable is usually used as the baseline model.

For time-series data, the value from the previous time step, relative to some reference point, usually serves as the baseline.
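These baselines can be sketched in a few lines of Python (the toy target values below are hypothetical, for illustration only):

```python
from statistics import mean, mode

# Hypothetical toy targets, for illustration only
y_regression = [3.0, 5.0, 4.0, 8.0]   # continuous target
y_classification = [0, 1, 0, 0, 1]    # binary target

# Regression baseline: predict the target mean for every observation
baseline_reg = mean(y_regression)

# Classification baseline: predict the most frequent class (the mode)
baseline_clf = mode(y_classification)

print(baseline_reg, baseline_clf)  # 5.0 0
```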

 

Classification targets often have a skewed class distribution, so you should always check the class ratio of the target first.

Classification also uses different evaluation metrics from regression.

Accuracy, the fraction of predictions that are correct, is a common evaluation metric for classification problems.

 

💡 Logistic Regression

Logistic regression passes a linear combination of the features through the logistic (sigmoid) function.

As a result, each observation's output is the estimated probability that it belongs to a particular class.

To classify, this probability is compared against a threshold: if it exceeds the threshold, the model predicts 1; otherwise it predicts 0.

 

Logit Transformation

The coefficients of logistic regression sit inside a nonlinear function, which makes them hard to interpret directly.

However, using the odds, the model can be rewritten as a linear combination, which makes interpretation much easier.

The odds are the ratio of the probability of success to the probability of failure; odds of 4 mean the probability of success is 4 times the probability of failure.

Taking the logarithm of the odds, as in the expression below, is called the logit transformation:

ln(odds) = ln(p / (1 - p)) = β0 + β1x1 + ... + βpxp

The logit transformation turns the nonlinear logistic function into a linear form, making the regression coefficients easy to interpret.

An increase in a particular feature can then be read directly as an increase or decrease in the logit, ln(odds).
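Concretely, if a feature's logit coefficient is β, a one-unit increase in that feature adds β to ln(odds), i.e. multiplies the odds by exp(β). A minimal sketch with a hypothetical coefficient:

```python
import math

# Hypothetical logit coefficient for one feature
beta = 0.7

# A one-unit increase in the feature adds beta to ln(odds),
# which multiplies the odds themselves by exp(beta)
odds_multiplier = math.exp(beta)  # about 2.01: the odds roughly double

old_odds = 1.5
new_odds = old_odds * odds_multiplier
print(round(odds_multiplier, 2))  # 2.01
```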

 

✨ Logistic Regression Example

Using Kaggle's Titanic: Machine Learning from Disaster dataset
import pandas as pd
train = pd.read_csv('titanic_train.csv')
test = pd.read_csv('titanic_test.csv')
# Split the training data again into train/validation sets

from sklearn.model_selection import train_test_split
train, val = train_test_split(train, random_state=2)

 

๋ถ„๋ฅ˜ ๋ฌธ์ œ์˜ ๊ธฐ์ค€๋ชจ๋ธ (major class)๋กœ ์˜ˆ์ธก ์ˆ˜ํ–‰
# ํƒ€๊ฒŸ ์„ค์ •
target = 'Survived'
y_train = train[target]

# mode() : Return the highest frequency value in a Series
major = y_train.mode()[0]

# Predict with the baseline: a list of the majority class, one entry per target sample
y_pred = [major] * len(y_train)
from sklearn.metrics import accuracy_score

# ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์—์„œ์˜ ๊ธฐ์ค€๋ชจ๋ธ ์ •ํ™•๋„
print("training accuracy: ", accuracy_score(y_train, y_pred))
>>> training accuracy:  0.625748502994012
# Baseline accuracy on the validation data
y_val = val[target]
y_pred = [major] * len(y_val)
print("validation accuracy: ", accuracy_score(y_val, y_pred))
>>> validation accuracy:  0.5874439461883408

 

Results of training a linear regression model

 

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()

# Use only numeric features
features = ['Pclass', 'Age', 'Fare']
X_train = train[features]
X_val = val[features]

# Use SimpleImputer to replace missing values in Age with the mean
from sklearn.impute import SimpleImputer

# default, imputing 'mean' value
imputer = SimpleImputer() 
X_train_imputed = imputer.fit_transform(X_train)
X_val_imputed = imputer.transform(X_val)

# Train
linear_model.fit(X_train_imputed, y_train)

# Predict
pred = linear_model.predict(X_val_imputed)
# Inspect the regression coefficients
pd.Series(linear_model.coef_, features)

Output:

  • The higher the Pclass value (2nd, 3rd class), the lower the survival rate
  • The older the passenger (higher Age), the lower the survival rate
  • The higher the Fare, the higher the survival rate, though the coefficient is small
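One caveat with this approach: a linear regression's output is unbounded, so the values it produces cannot generally be read as probabilities. A toy sketch (hypothetical intercept and coefficient, not the fit above):

```python
# Hypothetical intercept and coefficient (not the Titanic fit above)
intercept, coef = 0.9, -0.3

def linear_predict(x):
    # A linear model's output is unbounded
    return intercept + coef * x

preds = [linear_predict(x) for x in (0, 1, 5)]
# The last value is negative, which is not a valid probability
print(preds[2] < 0)  # True
```

This is one motivation for the logistic function above: it forces every prediction into (0, 1).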

 

Results of training a logistic regression model
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()

# Train
logistic.fit(X_train_imputed, y_train)

# Predict
pred = logistic.predict(X_val_imputed)

print('Validation set accuracy:', logistic.score(X_val_imputed, y_val))
>>> Validation set accuracy: 0.7130044843049327

print(features)
print(logistic.coef_)

Output: (coefficient array for each feature; original screenshot omitted)

 

ํƒ€์ดํƒ€๋‹‰ ๋ฐ์ดํ„ฐ์˜ ๋ชจ๋“  ํŠน์„ฑ์„ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ ํ•™์Šต ๊ฒฐ๊ณผ
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
  • ์นดํ…Œ๊ณ ๋ฆฌ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ OneHotEncoder
  • ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ SimpleImputer
  • ํŠน์„ฑ๋“ค์˜ ์ฒ™๋„๋ฅผ ๋งž์ถ”๊ธฐ๋ฅผ ์œ„ํ•ด ํ‘œ์ค€์ •๊ทœ๋ถ„ํฌ๋กœ ํ‘œ์ค€ํ™”ํ•˜๋Š” StandardScaler
from category_encoders import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
target = 'Survived'

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]
# Encode the categorical data
encoder = OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)
model = LogisticRegression(random_state=1)

# Train
model.fit(X_train_scaled, y_train)

# Predict
y_pred = model.predict(X_val_scaled)

accuracy_score(y_val, y_pred)
>>> 0.7892376681614349
coefficients = pd.Series(model.coef_[0], X_train_encoded.columns)
coefficients

Output:

  • Survival rates are higher for higher cabin classes, for younger passengers, and for women rather than men.
coefficients.sort_values().plot.barh()

Output: (horizontal bar chart of the sorted coefficients; original screenshot omitted)

 
