Why You Should Use PyTorch Buffers, via Attention

Attention, Please!!!

G3LU 2025. 6. 23. 19:02

When you build a large language model (LLM) from scratch or work with any complex deep learning model, you almost always end up using a GPU. PyTorch provides the convenient .to(device) method for moving a model's parameters to the device you want (CPU or GPU). Yet after moving the model to the GPU, you may still hit a RuntimeError such as "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" when you run it. The error occurs because a tensor that is not trained but is essential to the model's behavior was stored outside the model's parameters (nn.Parameter) and therefore did not move to the GPU. PyTorch's buffers are designed to solve this problem. This post uses the attention mechanism to explain what a PyTorch buffer is and what advantages it has over a plain tensor.

Without a Buffer

First, let's see what goes wrong when causal self-attention is implemented without a buffer. The code below includes a mask that prevents the current token from attending to future tokens during the attention computation.

import torch
import torch.nn as nn

class CausalAttentionWithoutBuffers(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

For the example, create some simple data as follows.

torch.manual_seed(123)

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

batch = torch.stack((inputs, inputs), dim=0)
context_length = batch.shape[1]
d_in = inputs.shape[1]
d_out = 2

ca_without_buffer = CausalAttentionWithoutBuffers(d_in, d_out, context_length, 0.0)

with torch.no_grad():
    context_vecs = ca_without_buffer(batch)

print(context_vecs)

Next, move batch and the module to CUDA and run the forward pass again. This time the following RuntimeError appears.

# Move the input batch and the module to CUDA
batch = batch.to("cuda")
ca_without_buffer.to("cuda")

# Run the forward pass
with torch.no_grad():
    context_vecs = ca_without_buffer(batch)

print(context_vecs)

RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0

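The error message says that the attn_scores tensor is on the GPU while the mask tensor is on the CPU, so the operation cannot run. Checking directly, the weights (nn.Parameter) of nn.Linear layers such as W_query are tracked automatically by PyTorch and moved to the GPU by the .to(device) call, but self.mask is still on the CPU.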
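The check described above can be sketched on any machine, even one without a GPU. This is a minimal sketch, not the post's original code: PlainAttributeDemo is a hypothetical toy module, and the dtype conversion is used as a stand-in because .to() with a dtype walks the same set of tracked tensors as .to(device).

```python
import torch
import torch.nn as nn


class PlainAttributeDemo(nn.Module):
    """Hypothetical toy module: one tracked layer plus one plain tensor attribute."""

    def __init__(self, context_length=4):
        super().__init__()
        self.linear = nn.Linear(3, 2)  # weight and bias are nn.Parameter objects
        # Plain tensor attribute: not registered with the module.
        self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)


demo = PlainAttributeDemo()

# Use the GPU when one is present; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
demo.to(device)
print("linear.weight.device:", demo.linear.weight.device)
print("mask.device:", demo.mask.device)  # remains on the CPU either way

# Converting the dtype only touches tracked tensors, so it works as a CPU-only check.
demo.to(torch.float64)
print("linear.weight.dtype:", demo.linear.weight.dtype)  # converted
print("mask.dtype:", demo.mask.dtype)                    # left untouched
```

The weight follows every .to() call, while the plain attribute keeps its original device and dtype.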

The cause is that self.mask is just a plain PyTorch tensor attribute created in __init__. A PyTorch module tracks and manages only the tensors registered with it, such as nn.Parameter objects; a plain tensor attribute is not tracked, so it is not moved to the GPU. A natural next thought is to wrap self.mask in nn.Parameter. But wrapping a tensor in nn.Parameter explicitly declares that "this tensor is a model weight that should be updated through training." self.mask, in contrast, is generated from a clear, fixed rule (a token must not see future tokens) and must never change during training. PyTorch's buffers address exactly this case.

With a Buffer

To fix the RuntimeError above, we use register_buffer. The code is almost identical to the previous version, except that in __init__ the mask is now registered through self.register_buffer(). Here self.register_buffer() is called with two arguments: the first is the buffer's name ('mask') and the second is the tensor to store. It also accepts an optional persistent flag, which defaults to True.
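To make the difference concrete, here is a small sketch comparing the two ways of registering the mask. MaskAsParameter and MaskAsBuffer are hypothetical toy classes introduced only for this comparison; register_buffer itself is covered in the next section.

```python
import torch
import torch.nn as nn


class MaskAsParameter(nn.Module):
    """Hypothetical: the mask is wrapped in nn.Parameter."""

    def __init__(self, context_length=3):
        super().__init__()
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.mask = nn.Parameter(mask)  # requires_grad=True by default


class MaskAsBuffer(nn.Module):
    """Hypothetical: the mask is registered as a buffer."""

    def __init__(self, context_length=3):
        super().__init__()
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask)


as_param = MaskAsParameter()
as_buffer = MaskAsBuffer()

# Everything returned by named_parameters() is what an optimizer would receive.
param_names = [name for name, _ in as_param.named_parameters()]
buffer_param_names = [name for name, _ in as_buffer.named_parameters()]
buffer_names = [name for name, _ in as_buffer.named_buffers()]

print(param_names, as_param.mask.requires_grad)
print(buffer_param_names, buffer_names, as_buffer.mask.requires_grad)
```

In the parameter version the mask shows up among the trainable parameters and tracks gradients, so an optimizer built from model.parameters() would be handed it. In the buffer version the mask is tracked by the module but stays out of the parameter list.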

import torch
import torch.nn as nn

class CausalAttentionWithBuffer(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Old:
        # self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)

        # New:
        self.register_buffer("mask", torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)

        context_vec = attn_weights @ values
        return context_vec

Checking again, both tensors are now on CUDA, as shown below.

ca_with_buffer = CausalAttentionWithBuffer(d_in, d_out, context_length, 0.0)
ca_with_buffer.to("cuda")

print("W_query.device:", ca_with_buffer.W_query.weight.device)
print("mask.device:", ca_with_buffer.mask.device)

W_query.device: cuda:0
mask.device: cuda:0

Running it as before now works.

with torch.no_grad():
    context_vecs = ca_with_buffer(batch)

print(context_vecs)

--------- RESULT ---------
tensor([[[0.4772, 0.1063],
         [0.5891, 0.3257],
         [0.6202, 0.3860],
         [0.5478, 0.3589],
         [0.5321, 0.3428],
         [0.5077, 0.3493]],

        [[0.4772, 0.1063],
         [0.5891, 0.3257],
         [0.6202, 0.3860],
         [0.5478, 0.3589],
         [0.5321, 0.3428],
         [0.5077, 0.3493]]], device='cuda:0')

Another Advantage of Buffers: state_dict

PyTorch's state_dict is a Python dictionary that holds a model's learnable parameters (weights and biases) together with the buffers that carry its state, and it is the main way to save and load a trained model. The following example shows what state_dict contains and why buffers matter.

Without a buffer:

ca_without_buffer.state_dict()

----------- RESULT -----------
OrderedDict([('W_query.weight',
              tensor([[-0.2354,  0.0191, -0.2867],
                      [ 0.2177, -0.4919,  0.4232]], device='cuda:0')),
             ('W_key.weight',
              tensor([[-0.4196, -0.4590, -0.3648],
                      [ 0.2615, -0.2133,  0.2161]], device='cuda:0')),
             ('W_value.weight',
              tensor([[-0.4900, -0.3503, -0.2120],
                      [-0.1135, -0.4404,  0.3780]], device='cuda:0'))])

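Because the mask tensor was only a plain attribute of the module, state_dict() contains just the trainable weights (W_query and so on) and the mask is missing. So if you change the mask values from 1 to 2, save the model, and load it again, the saved state_dict has no mask information, and the newly created model can only have the original, freshly initialized mask.

With a buffer: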
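The save-and-load scenario just described can be sketched as follows. This is an illustration under assumptions: PlainMask is a hypothetical toy module, and the checkpoint is written to an in-memory io.BytesIO object instead of a file so that nothing is left on disk.

```python
import io

import torch
import torch.nn as nn


class PlainMask(nn.Module):
    """Hypothetical toy module whose mask is a plain tensor attribute."""

    def __init__(self, context_length=3):
        super().__init__()
        self.linear = nn.Linear(2, 2)
        self.mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)


model = PlainMask()
model.mask[model.mask == 1.0] = 2.0  # change the mask values from 1 to 2

# Save the state_dict to an in-memory buffer and rewind it for reading.
checkpoint = io.BytesIO()
torch.save(model.state_dict(), checkpoint)
checkpoint.seek(0)

# A new instance starts from the freshly initialized mask; loading cannot change it.
restored = PlainMask()
restored.load_state_dict(torch.load(checkpoint))

print("mask" in model.state_dict())
print(restored.mask)
```

The modified mask never reaches the checkpoint, so the restored model silently falls back to the 0/1 mask built in __init__.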

ca_with_buffer.state_dict()

----------- RESULT -----------
OrderedDict([('mask',
              tensor([[0., 1., 1., 1., 1., 1.],
                      [0., 0., 1., 1., 1., 1.],
                      [0., 0., 0., 1., 1., 1.],
                      [0., 0., 0., 0., 1., 1.],
                      [0., 0., 0., 0., 0., 1.],
                      [0., 0., 0., 0., 0., 0.]], device='cuda:0')),
             ('W_query.weight',
              tensor([[-0.1362,  0.1853,  0.4083],
                      [ 0.1076,  0.1579,  0.5573]], device='cuda:0')),
             ('W_key.weight',
              tensor([[-0.2604,  0.1829, -0.2569],
                      [ 0.4126,  0.4611, -0.5323]], device='cuda:0')),
             ('W_value.weight',
              tensor([[ 0.4929,  0.2757,  0.2516],
                      [ 0.2377,  0.4800, -0.0762]], device='cuda:0'))])

With register_buffer(), state_dict() includes the mask tensor as well as the weights. As a result, even if the mask values are changed, that state is recorded in state_dict, and loading the model restores the changed mask values too.
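A round trip with a registered buffer might look like the sketch below. BufferedMask and its scratch buffer are hypothetical names used only for illustration. The scratch buffer is registered with persistent=False, an option of register_buffer that keeps a tensor tracked by the module while leaving it out of state_dict, which can suit values that are cheap to rebuild.

```python
import io

import torch
import torch.nn as nn


class BufferedMask(nn.Module):
    """Hypothetical toy module with one persistent and one non-persistent buffer."""

    def __init__(self, context_length=3):
        super().__init__()
        mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
        self.register_buffer("mask", mask)
        # persistent=False: still follows .to(device), but is left out of state_dict.
        self.register_buffer("scratch", torch.zeros(1), persistent=False)


model = BufferedMask()
model.mask[model.mask == 1.0] = 2.0  # change the mask values from 1 to 2

# Save the state_dict to an in-memory buffer and rewind it for reading.
checkpoint = io.BytesIO()
torch.save(model.state_dict(), checkpoint)
checkpoint.seek(0)

restored = BufferedMask()
restored.load_state_dict(torch.load(checkpoint))

print(list(model.state_dict().keys()))
print(restored.mask)
```

Only the persistent buffer appears in the saved keys, and the restored model carries the modified mask values instead of the initial ones.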
