Can several less accurate classifiers be combined into a single, more accurate one?
Random Forest: an ensemble of decision trees, built by combining two ideas (see the sketch below):
Diversity - Bagging: each tree is trained on a bootstrap sample (rows drawn with replacement) of the training data
Randomness - Random Subspace: each split considers only a random subset of the features (max_features)
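To make the two ideas concrete, here is a minimal hand-rolled sketch of how a forest could be assembled; the names X, y, and trees are illustrative and not from the original post, and the real RandomForestClassifier used below handles all of this internally (it also averages class probabilities rather than taking a raw majority vote).

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y = True)
rng = np.random.default_rng(2045)
trees = []
for _ in range(10):
    # Diversity (bagging): bootstrap sample of the rows
    idx = rng.integers(0, len(X), size = len(X))
    # Randomness (random subspace): each split looks at only 2 random features
    tree = DecisionTreeClassifier(max_features = 2)
    trees.append(tree.fit(X[idx], y[idx]))

# The ensemble predicts by majority vote across the trees
votes = np.array([tree.predict(X) for tree in trees])
y_hat = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)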
1. Exploratory Data Analysis
1-1. Frequency Analysis
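The examples assume a pandas DataFrame named DF holding the iris data; a minimal way to reproduce it, using seaborn's built-in copy of the dataset:

import seaborn as sns
DF = sns.load_dataset('iris')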
DF.species.value_counts()
virginica 50
setosa 50
versicolor 50
Name: species, dtype: int64
1-2. Distribution Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.pairplot(hue = 'species', data = DF)
plt.show()
2. Data Preprocessing
2-1. Data Set
X = DF[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = DF['species']
2-2. Train & Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size = 0.3,
random_state = 2045)
print('Train Data : ', X_train.shape, y_train.shape)
print('Test Data : ', X_test.shape, y_test.shape)
Train Data : (105, 4) (105,)
Test Data : (45, 4) (45,)
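One optional refinement not used in the original split: passing stratify = y keeps the 50/50/50 class proportions identical in the train and test sets, which matters more on imbalanced data:

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.3,
                                                    stratify = y,
                                                    random_state = 2045)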
3. Modeling
3-1. Build the Model on the Train Data
from sklearn.ensemble import RandomForestClassifier
Model_rf = RandomForestClassifier(n_estimators = 10,
max_features = 2,
random_state = 2045,
n_jobs = -1)
Model_rf.fit(X_train, y_train)
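Because each bootstrap sample leaves out roughly a third of the rows, the forest can evaluate itself on those out-of-bag (OOB) rows without a separate validation set. A sketch using the standard oob_score option (with more trees than above, since 10 trees may leave some rows never out-of-bag):

Model_oob = RandomForestClassifier(n_estimators = 100,
                                   max_features = 2,
                                   oob_score = True,
                                   random_state = 2045,
                                   n_jobs = -1)
Model_oob.fit(X_train, y_train)
print('OOB Score : ', Model_oob.oob_score_)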
3-2. Apply the Model to the Test Data
y_hat = Model_rf.predict(X_test)
3-3. Model Evaluation
from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_hat))
print(accuracy_score(y_test, y_hat))
0.9555555555555556
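Accuracy alone hides per-class behavior; classification_report adds per-species precision, recall, and F1:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_hat))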
3-4. Feature Importance
Model_rf.feature_importances_
array([0.1571031 , 0.03897972, 0.45102744, 0.35288974])
plt.figure(figsize = (9, 6))
sns.barplot(x = Model_rf.feature_importances_,
            y = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
plt.show()
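Impurity-based importances like feature_importances_ can be biased toward features with many possible split points; permutation importance computed on the test set is a common cross-check, using scikit-learn's standard API:

from sklearn.inspection import permutation_importance
result = permutation_importance(Model_rf, X_test, y_test,
                                n_repeats = 10,
                                random_state = 2045)
print(result.importances_mean)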
4. Hyperparameter Tuning
4-1. Create a RandomForestClassifier Object
from sklearn.ensemble import RandomForestClassifier
Model_rf = RandomForestClassifier()
4-2. Set the GridSearchCV Hyperparameters
params = {'n_estimators':[100, 300, 500, 700],
'max_features':[1, 2, 3, 4],
'max_depth':[1, 2, 3, 4, 5],
'random_state':[2045]}
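This grid spans 4 × 4 × 5 × 1 = 80 parameter combinations, and with the 5-fold cross-validation below GridSearchCV fits 80 × 5 = 400 forests, which is why n_jobs = -1 is worth setting. Note that random_state is not really being searched; it is pinned to a single value so every candidate is evaluated under the same randomness.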
4-3. Create a GridSearchCV Object
from sklearn.model_selection import GridSearchCV, KFold
grid_cv = GridSearchCV(Model_rf,
param_grid = params,
scoring = 'accuracy',
cv = KFold(n_splits = 5,
           shuffle = True,
           random_state = 2045),
refit = True,
n_jobs = -1)
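For a classifier, simply passing cv = 5 would make GridSearchCV default to stratified folds (StratifiedKFold), which preserve the class balance inside every fold; the explicit KFold above splits purely by position, which is why shuffling it matters.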
4-4. Run GridSearchCV
from datetime import datetime
start_time = datetime.now()
grid_cv.fit(X_train, y_train)
end_time = datetime.now()
print('Elapsed Time : ', end_time - start_time)
4-5. Check the Best Hyperparameters
grid_cv.best_score_
0.9523809523809523
grid_cv.best_params_
{'max_depth': 3, 'max_features': 1, 'n_estimators': 100, 'random_state': 2045}
4-6. Build and Evaluate the Best Model
Model_CV = grid_cv.best_estimator_
y_hat = Model_CV.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_hat))
print(accuracy_score(y_test, y_hat))
0.9555555555555556
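On larger grids an exhaustive search becomes expensive; RandomizedSearchCV is scikit-learn's standard alternative, sampling a fixed number of combinations from the same grid (n_iter = 20 below is illustrative):

from sklearn.model_selection import RandomizedSearchCV
rand_cv = RandomizedSearchCV(RandomForestClassifier(),
                             param_distributions = params,
                             n_iter = 20,
                             scoring = 'accuracy',
                             cv = 5,
                             random_state = 2045,
                             n_jobs = -1)
rand_cv.fit(X_train, y_train)
print(rand_cv.best_params_, rand_cv.best_score_)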