[AI] Why do we use train_test_split?


Suppose we build a doghouse that fits one particular dog's body exactly. Would it fit other dogs just as well?

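Of course not. We call this 'overfitting': the house fits the one dog it was measured on so closely that it does not fit any other. When we train a model, we want good results, and not only on the data it was trained on. So what should we do to get good results (predictions)?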

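To avoid overfitting and get good results, we perform cross validation. As part of this process we use the familiar train_test_split to divide the data into training data and evaluation data, then train on one and evaluate on the other.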
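Why a held-out set matters can be shown with a minimal sketch. The synthetic data from make_classification and the unconstrained DecisionTreeClassifier are assumptions chosen for illustration, not something from the original post. The tree is free to memorize the training set, so the accuracy measured on the training data says little about how it does on data it has never seen.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (assumption for illustration only).
x, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)

# Hold out 20% of the samples; the model never sees them during training.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# A tree with no depth limit can memorize the training set.
model = DecisionTreeClassifier(random_state=0)
model.fit(x_train, y_train)

train_acc = model.score(x_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(x_test, y_test)     # accuracy on data the model has not seen
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The gap between the two numbers is the doghouse problem in practice: a perfect fit to one dog, a poor fit to the next.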

1. Cross validation: 3-way split

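1. Train (learn) the model on 60% of the data, the train data

2. Tune the model (or its hyperparameters) on 20% of the data, the validation data, to optimize or select it

3. Evaluate the model on the remaining 20%, the test data (test only, no more tuning)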

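Validation vs. Test
- Validation: the process of selecting, among several candidate models, the one that gives the best results
- Test: evaluating the actual accuracy of the selected model
(If there is no need to choose among several models or hyperparameters, validation and test are sometimes not separated.)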

# Split the dataset into training / test sets for learning
from sklearn.model_selection import train_test_split

# x: features, y: labels (assumed to be loaded already)
# Split into train / test (80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0xC0FFEE)

# Split train again into train / validation (0.25 of the remaining 80% = 20% of the whole)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=0xC0FFEE)

# 6 : 2 : 2 = train : validation : test
print(x_train.shape, x_val.shape, x_test.shape, y_train.shape, y_val.shape, y_test.shape)
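How the three parts are then used can be sketched as follows. The synthetic data, the KNeighborsClassifier and the candidate n_neighbors values are assumptions chosen for illustration. The point is only the division of labor: the validation set picks the candidate, and the test set is touched once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for x, y (assumption for illustration only).
x, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# 80% / 20%, then 25% of the 80% -> 60% / 20% / 20% overall.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.25, random_state=0)

# Tune: fit each candidate on train, compare the candidates on validation.
best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 11, 21):  # hypothetical candidate hyperparameter values
    candidate = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    val_acc = candidate.score(x_val, y_val)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Test: evaluate the selected model once, with no further tuning.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(x_train, y_train)
test_acc = final_model.score(x_test, y_test)
print(f"selected n_neighbors={best_k}, validation acc={best_val_acc:.3f}, test acc={test_acc:.3f}")
```

Because the validation score was used to choose best_k, it is an optimistic estimate; the test score is the one to report.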


2. Cross validation: 2-way split

# Split the dataset into training / validation sets for learning
from sklearn.model_selection import train_test_split

# Split into train / validation (x: features, y: labels, assumed to be loaded already)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=0)

# 8 : 2 = train : validation
print(x_train.shape, x_val.shape, y_train.shape, y_val.shape)

The 6:2:2 and 8:2 ratios are not absolute. Many people split this way, though, and 7:3 is also widely used.
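Other ratios only change the arithmetic. As a rough sketch, the helper below (three_way_split is a hypothetical name, not part of scikit-learn) takes the validation and test shares of the whole dataset and converts the validation share into the fraction that the second train_test_split call expects, since that call only sees what is left after the test set is removed.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(x, y, val_ratio=0.2, test_ratio=0.2, random_state=0):
    """Split into train / validation / test (hypothetical helper for illustration)."""
    x_rest, x_test, y_rest, y_test = train_test_split(
        x, y, test_size=test_ratio, random_state=random_state
    )
    # The second split is relative to what remains after removing the test set,
    # e.g. 0.2 / (1 - 0.2) = 0.25 for a 6:2:2 split.
    relative_val = val_ratio / (1.0 - test_ratio)
    x_train, x_val, y_train, y_val = train_test_split(
        x_rest, y_rest, test_size=relative_val, random_state=random_state
    )
    return x_train, x_val, x_test, y_train, y_val, y_test


# Toy data: 1000 samples with 4 features each (assumption for illustration only).
x = np.arange(4000).reshape(1000, 4)
y = np.arange(1000) % 2

x_train, x_val, x_test, y_train, y_val, y_test = three_way_split(x, y, val_ratio=0.2, test_ratio=0.2)
print(len(x_train), len(x_val), len(x_test))

# A plain 7:3 train / validation split only needs a different test_size.
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.3, random_state=0)
print(len(x_tr), len(x_va))
```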

Other techniques in use

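- K-Fold cross validation: a procedure for comparing and selecting among candidate models

- Adding a regularization term to the cost function (L1 or L2; larger weights mean a higher cost)

- Dropout and Batch Normalization (for neural networks), among others

- Collecting more training data or reducing the number of features in the model is also a good approach!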
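Two of the items above can be sketched together: K-Fold cross validation used to compare candidate models, where the candidates differ only in how strongly they are regularized. The synthetic data, the choice of LogisticRegression and the C values are assumptions for illustration. In scikit-learn's LogisticRegression, C is the inverse of the regularization strength, so a smaller C means a heavier penalty on large weights.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (assumption for illustration only).
x, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 5 folds: every sample is used for validation exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for c in (0.01, 0.1, 1.0, 10.0):  # hypothetical candidate values
    model = LogisticRegression(C=c, max_iter=1000)  # L2 regularization by default
    scores = cross_val_score(model, x, y, cv=kfold)  # one accuracy value per fold
    mean_scores[c] = scores.mean()
    print(f"C={c}: mean accuracy over {len(scores)} folds = {scores.mean():.3f}")

# Select the candidate with the best average validation accuracy.
best_c = max(mean_scores, key=mean_scores.get)
print("selected C:", best_c)
```

Compared with a single train / validation split, averaging over folds makes the comparison less dependent on which samples happened to land in the validation set, at the cost of training each candidate five times.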
