Data Analysis - Time Series Data

2^7 2025. 2. 13. 21:30

What Is Time Series Data?

Time series data is data collected over time, usually with a date or time attached to each observation. Analyzing it means accounting for components such as trend, seasonality, cyclic patterns, and irregular variation (noise).

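1. Key Concepts in Time Series Data

Trend: a long-term tendency for the data to increase or decrease
Seasonality: a pattern that repeats at a fixed, known interval (e.g., monthly temperature changes)
Cyclic patterns: rises and falls that recur, but not at a fixed interval (e.g., business cycles)
Noise (irregularity): random variation that cannot be predicted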
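To make the four components concrete, here is a minimal sketch that builds a synthetic monthly series by adding them together. All amplitudes, periods, and the random seed are assumptions chosen for illustration, and the "cycle" uses a fixed 45-month period only to keep the code short; real cycles vary in length.

```python
import numpy as np
import pandas as pd

# Hypothetical example: 120 months of synthetic data built from named components
n = 120
t = np.arange(n)
index = pd.date_range("2015-01-01", periods=n, freq="MS")  # month-start dates

trend = 0.5 * t                                  # steady long-term increase
seasonality = 10 * np.sin(2 * np.pi * t / 12)    # repeats every 12 months
cycle = 5 * np.sin(2 * np.pi * t / 45)           # slower swing (fixed period here for simplicity)
rng = np.random.default_rng(seed=0)
noise = rng.normal(loc=0.0, scale=2.0, size=n)   # unpredictable variation

series = pd.Series(trend + seasonality + cycle + noise, index=index, name="value")
print(series.head())
```

Plotting `series` next to each component is a quick way to see how they combine.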

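2. Preprocessing Time Series Data

2-1. Converting date columns (pd.to_datetime(), setting the index)

Time series analysis depends on dates being parsed correctly, and the rows must be sorted in time order for patterns to be read accurately.
Use pd.to_datetime() to convert date strings to a datetime type.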

import pandas as pd

# Create example data
data = {'date': ['2023-01-01', '2023-01-02', '2023-01-03'], 'value': [10, 20, 30]}
df = pd.DataFrame(data)

# Convert the date strings to datetime
df['date'] = pd.to_datetime(df['date'])

# Use the date column as the index
df.set_index('date', inplace=True)

print(df.head())
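The example above happens to be in time order already. When rows arrive out of order, sorting the index after conversion is one way to restore it. The sketch below uses a hypothetical `raw` frame with shuffled dates.

```python
import pandas as pd

# Hypothetical input whose rows arrive out of chronological order
raw = pd.DataFrame({
    "date": ["2023-01-03", "2023-01-01", "2023-01-02"],
    "value": [30, 10, 20],
})

raw["date"] = pd.to_datetime(raw["date"])
ts = raw.set_index("date").sort_index()  # sort rows by timestamp

print(ts.index.is_monotonic_increasing)  # True once the index is in time order
print(ts)
```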

Passing parse_dates=['date'] to pd.read_csv() converts the column to datetime as the CSV file is loaded.

df = pd.read_csv('data.csv', parse_dates=['date'])
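To try this without a file on disk, the sketch below reads from an in-memory string instead of data.csv. The column names and values are assumptions, and `index_col='date'` is added so the parsed dates become the index in the same call.

```python
import io
import pandas as pd

# Stand-in for data.csv; the column names are assumptions for this sketch
csv_text = "date,value\n2023-01-01,10\n2023-01-02,20\n2023-01-03,30\n"

loaded = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(loaded.index.dtype)  # a datetime64 dtype rather than object
print(loaded)
```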

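2-2. Handling missing values (fillna(), interpolate())

Missing values lower the reliability of both analysis and forecasting models, so they should be filled in appropriately.

✅ Checking for missing values

print(df.isnull().sum())  # number of missing values per column

✅ Filling missing values

The three options below are alternatives; choose one rather than running them in sequence.

# Fill with the previous value (forward fill)
# fillna(method='ffill') is deprecated since pandas 2.1; ffill() is the equivalent call
df = df.ffill()

# Fill with the next value (backward fill)
df = df.bfill()

# Fill by interpolation
df = df.interpolate(method='linear')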
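The difference between the three fills is easiest to see side by side. This is a small sketch on a made-up series with a two-day gap.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=5, freq="D")
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0], index=idx)

comparison = pd.DataFrame({
    "original": s,
    "ffill": s.ffill(),                        # carries 1.0 forward into the gap
    "bfill": s.bfill(),                        # pulls 4.0 backward into the gap
    "linear": s.interpolate(method="linear"),  # straight line from 1.0 to 4.0
})
print(comparison)
```

Forward fill gives 1, 1, 1, 4, 5; backward fill gives 1, 4, 4, 4, 5; linear interpolation gives 1, 2, 3, 4, 5.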

2-3. Removing outliers (IQR, Z-score)

Extreme values can distort averages and pattern analysis, so outliers need to be handled before drawing conclusions.

✅ IQR (interquartile range) method

Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1

# Outlier bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only the rows inside the bounds
df = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]

✅ Z-score method

from scipy import stats

df['z_score'] = stats.zscore(df['value'])
# Rows with |Z-score| >= 3 are treated as outliers; drop them and the helper column
df = df[df['z_score'].abs() < 3].drop(columns=['z_score'])
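As a rough check of both rules, the sketch below applies them to a made-up series of 30 ordinary readings plus one extreme value. The Z-score is computed directly in pandas with `ddof=0`, which matches the default of `scipy.stats.zscore`, so the example needs no extra dependency.

```python
import pandas as pd

# Hypothetical data: 30 ordinary readings plus one extreme value
values = [10, 11, 12, 13, 12, 11] * 5 + [100]
s = pd.Series(values, dtype="float64")

# IQR rule
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Z-score rule; ddof=0 matches the default of scipy.stats.zscore
z = (s - s.mean()) / s.std(ddof=0)
z_outliers = s[z.abs() >= 3]

print(iqr_outliers)
print(z_outliers)
```

Both rules flag only the value 100 here. On very short series the Z-score rule can miss outliers, because a single extreme value also inflates the standard deviation.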

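2-4. Adjusting the time interval (resample())

Comparisons and pattern analysis need evenly spaced data, so resample to the time unit that fits the analysis.

✅ Converting to daily data

df_resampled = df.resample('D').mean()  # daily mean

✅ Converting to other intervals

# pandas 2.2+ deprecates the 'H' and 'M' aliases in favor of 'h' and 'ME'
df_hourly = df.resample('H').sum()  # hourly sum
df_weekly = df.resample('W').mean()  # weekly mean
df_monthly = df.resample('M').sum()  # monthly sum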
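A small worked example of downsampling, assuming 48 made-up hourly readings that cover exactly two days. The hourly index is built with a `pd.Timedelta` frequency to sidestep alias differences between pandas versions.

```python
import numpy as np
import pandas as pd

# 48 hourly readings covering two full days (values 0..47 are placeholders)
idx = pd.date_range("2023-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"value": np.arange(48, dtype="float64")}, index=idx)

daily_mean = hourly.resample("D").mean()  # one row per day, averaged
daily_sum = hourly.resample("D").sum()    # one row per day, totaled

print(daily_mean)
print(daily_sum)
```

The daily means are 11.5 and 35.5, and the daily sums are 276 and 852.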

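2-5. Scaling (MinMaxScaler, StandardScaler)

When features differ widely in magnitude, models can be harder to train; scaling them to a common range often speeds up training and can improve accuracy.

✅ Min-Max scaling (rescales to the 0-1 range)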

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['scaled_value'] = scaler.fit_transform(df[['value']])

✅ Standard scaling (mean 0, standard deviation 1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['scaled_value'] = scaler.fit_transform(df[['value']])
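For reference, both scalers reduce to simple formulas. The sketch below reproduces them in NumPy on the three example values, using the population standard deviation, which is also what StandardScaler uses.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

# Min-max scaling: (x - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: (x - mean) / std, using the population standard deviation
standard = (x - x.mean()) / x.std()

print(minmax)
print(standard)
```

When a model is evaluated on held-out data, a common practice is to compute the min, max, mean, and standard deviation on the training portion only and reuse them for the test portion.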

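2-6. Feature creation (adding year, month, weekday, and similar variables)

Extracting meaningful features from the date makes it easier to analyze seasonality and cyclic patterns.

✅ Creating date-based features

df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
df['weekday'] = df.index.weekday  # 0 = Monday, 6 = Sunday
df['hour'] = df.index.hour  # hour of day (useful when the data is hourly or finer)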
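Here is a short sketch of what these attributes return for known dates. The `is_weekend` column is a hypothetical extra feature derived from `weekday`, not something from the steps above.

```python
import pandas as pd

idx = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-07"])
feat = pd.DataFrame({"value": [10, 20, 30]}, index=idx)

feat["month"] = feat.index.month
feat["weekday"] = feat.index.weekday       # 2023-01-01 was a Sunday, so 6
feat["is_weekend"] = feat["weekday"] >= 5  # hypothetical extra feature: Sat/Sun flag

print(feat)
```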

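2-7. Splitting the data (chronological train/test split)

Time series data changes over time, so the data must be split into training and test sets with the order preserved; otherwise the evaluation does not reflect real forecasting performance.

train_size = int(len(df) * 0.8)  # first 80% for training, last 20% for testing

train, test = df.iloc[:train_size], df.iloc[train_size:]
print("Train set size:", len(train))
print("Test set size:", len(test))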
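A positional split only respects time if the rows are already sorted. The sketch below, on ten made-up daily rows, sorts first and then checks that no test timestamp precedes the training data.

```python
import pandas as pd

idx = pd.date_range("2023-01-01", periods=10, freq="D")
data = pd.DataFrame({"value": range(10)}, index=idx).sort_index()

split_at = int(len(data) * 0.8)
train_part, test_part = data.iloc[:split_at], data.iloc[split_at:]

# Every training timestamp should precede every test timestamp
assert train_part.index.max() < test_part.index.min()
print(len(train_part), len(test_part))
```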

3. Visualizing Time Series Data

3-1. Basic line plot

Time series analysis is mostly about how values change over time, so the line plot is the most common choice.

import matplotlib.pyplot as plt

plt.figure(figsize=(10,5))
plt.plot(df.index, df['value'], marker='o', linestyle='-', color='b', label='Value')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series Line Plot')
plt.legend()
plt.grid()
plt.show()

3-2. Moving average plot

A moving average reduces noise in the data and makes the trend easier to see.

df['rolling_mean'] = df['value'].rolling(window=7).mean()

plt.figure(figsize=(10,5))
plt.plot(df.index, df['value'], alpha=0.5, label='Original Data')
plt.plot(df.index, df['rolling_mean'], color='red', label='7-Day Moving Average')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series with Moving Average')
plt.legend()
plt.grid()
plt.show()
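One detail worth knowing: a rolling window returns NaN until it has a full window of values, so a 7-day average leaves the first 6 rows empty. The sketch below shows the same effect with a window of 3 on a made-up series, along with `min_periods=1` as one way to fill the leading rows.

```python
import pandas as pd

s = pd.Series(range(1, 11), dtype="float64")

strict = s.rolling(window=3).mean()                  # NaN until 3 values are available
lenient = s.rolling(window=3, min_periods=1).mean()  # averages whatever is available

print(pd.DataFrame({"value": s, "strict": strict, "lenient": lenient}))
```

The lenient version starts with averages over fewer points, so its first values are less smoothed than the rest.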

3-3. Seasonality analysis (seasonal decomposition)

When the data has a repeating (seasonal) pattern, decompose it into trend, seasonality, and residual components and analyze each one.

from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(df['value'], model='additive', period=7)

plt.figure(figsize=(10,8))
plt.subplot(411)
plt.plot(df['value'], label='Original')
plt.legend()
plt.subplot(412)
plt.plot(result.trend, label='Trend', color='red')
plt.legend()
plt.subplot(413)
plt.plot(result.seasonal, label='Seasonality', color='green')
plt.legend()
plt.subplot(414)
plt.plot(result.resid, label='Residual', color='gray')
plt.legend()
plt.tight_layout()
plt.show()
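To show roughly what an additive decomposition does, here is a simplified sketch in plain pandas. It is not the statsmodels implementation; it assumes an odd period (7), a synthetic series with no noise, and a made-up weekly pattern.

```python
import numpy as np
import pandas as pd

period = 7
n = 8 * period
idx = pd.date_range("2023-01-01", periods=n, freq="D")

# Synthetic series: linear trend plus a weekly pattern that sums to zero
weekly = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])
y = pd.Series(0.5 * np.arange(n) + np.tile(weekly, n // period), index=idx)

# 1) Trend: centered moving average over one full period
trend = y.rolling(window=period, center=True).mean()

# 2) Seasonality: average the detrended values at each position in the period
detrended = y - trend
position = np.arange(n) % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means -= seasonal_means.mean()  # center so the pattern sums to zero
seasonal = pd.Series(seasonal_means.to_numpy()[position], index=idx)

# 3) Residual: what is left after removing trend and seasonality
resid = y - trend - seasonal

print(seasonal.iloc[:period].round(2).tolist())
print(resid.abs().max())
```

Because the synthetic series contains no noise, the residual is close to zero wherever the trend is defined; the first and last three rows are NaN because the centered window is incomplete there.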

3-4. Time series heatmap

A heatmap shows how the values vary across periods at a glance.

import seaborn as sns

df['year'] = df.index.year
df['month'] = df.index.month

pivot_table = df.pivot_table(values='value', index='month', columns='year', aggfunc='mean')

plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, cmap='coolwarm', annot=True, fmt=".1f")
plt.title("Time Series Heatmap")
plt.show()
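The heatmap is only as good as the pivot table behind it, so it can help to inspect that table first. The sketch below builds two years of made-up daily data, where each value is simply its month number, and checks the resulting month-by-year grid.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2022-01-01", "2023-12-31", freq="D")
daily = pd.DataFrame({"value": np.asarray(idx.month, dtype="float64")}, index=idx)  # placeholder values
daily["year"] = daily.index.year
daily["month"] = daily.index.month

grid = daily.pivot_table(values="value", index="month", columns="year", aggfunc="mean")
print(grid.shape)  # 12 months by 2 years
print(grid.head())
```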


License: Attribution, NonCommercial, NoDerivatives