[Experiment] Understanding the basic concepts and usage of LoRA, with reference to Predibase's LoRA Land site

안정민 2024. 7. 4. 09:15

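(1) Concrete procedure and parameter settings for applying LoRA

[1] LoRA configuration

task_type: the type of task the model performs (e.g. TaskType.CAUSAL_LM).
inference_mode: whether to run in inference mode (False for training).
r: the rank of the low-rank update matrices (e.g. 8).
lora_alpha: scaling factor for the LoRA update; the low-rank update is scaled by lora_alpha / r before it is added to the frozen weights (e.g. 32).
lora_dropout: dropout rate applied in the LoRA layers to reduce overfitting (e.g. 0.1).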
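
To make the roles of r and lora_alpha concrete, here is a minimal pure-Python sketch of a LoRA forward pass. The toy sizes and the helper names matvec and lora_forward are assumptions made for illustration only; this is not the PEFT implementation, just the formula y = W x + (lora_alpha / r) * B (A x) written out.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, r, lora_alpha):
    """y = W x + (lora_alpha / r) * B (A x); W stays frozen, only A and B train."""
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, lora_alpha = 6, 4, 2, 8  # toy sizes, chosen only for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weight
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # d_out x r, zero init
x = [1.0] * d_in

# With B initialised to zero the adapter is a no-op, so training starts from the base model.
assert lora_forward(W, A, B, x, r, lora_alpha) == matvec(W, x)
```

Only A and B (r * d_in + d_out * r values) would be updated during training, which is why a small r keeps the trainable parameter count low.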

from peft import LoraConfig
config = LoraConfig()
config

from peft import LoraConfig, TaskType
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # task type: causal language modeling
    inference_mode=False,  # False means training mode
    r=8,  # rank of the LoRA low-rank matrices
    lora_alpha=32,  # LoRA alpha: the low-rank update is scaled by lora_alpha / r
    lora_dropout=0.1,  # dropout in the LoRA layers, to reduce overfitting
)
peft_config

After setting the hyperparameters as above, apply them to the model.

from transformers import AutoModelForCausalLM  
from peft import get_peft_model  # PEFT function that applies a LoRA config to a model
# Load the pretrained model "EleutherAI/polyglot-ko-1.3b"
model = AutoModelForCausalLM.from_pretrained("EleutherAI/polyglot-ko-1.3b")
# Wrap the model with the LoRA config (peft_config)
model = get_peft_model(model, peft_config)
# Print the number of trainable parameters after applying LoRA
model.print_trainable_parameters()

>> trainable params: 1,572,864 || all params: 1,333,383,168 || trainable%: 0.11796039111242178

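The model has about 1.33 billion parameters in total, but only about 1.57 million of them (roughly 0.12%) are trainable once LoRA is applied. In other words, the number of parameters to be trained drops from about 1.3 billion to about 1.6 million. As this shows, LoRA freezes the original weights and trains only the small low-rank layers added alongside them, which sharply reduces the number of trainable parameters and makes fine-tuning far more efficient.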
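
The trainable count of 1,572,864 can be reproduced with a short calculation. The architecture values below (hidden size 2048, 24 layers) and the assumption that LoRA is attached only to the fused query_key_value projection in each layer are not stated in this post; they are assumptions that should be checked against the model's config.json and the PEFT target modules actually in use.

```python
# Assumed architecture values for EleutherAI/polyglot-ko-1.3b (GPT-NeoX style);
# they are not stated in the post, so verify them against the model's config.json.
hidden_size = 2048
num_layers = 24
r = 8

# Assumption: LoRA is attached only to the fused query_key_value projection
# (hidden_size -> 3 * hidden_size) of each layer.
in_features = hidden_size
out_features = 3 * hidden_size

params_A = r * in_features     # A has shape r x in_features
params_B = out_features * r    # B has shape out_features x r
trainable = num_layers * (params_A + params_B)

all_params = 1_333_383_168     # "all params" from the printed output above
trainable_pct = 100 * trainable / all_params

print(trainable)               # 1572864
print(trainable_pct)
```

Under these assumptions each layer adds 65,536 trainable values, and 24 layers give exactly the number printed by print_trainable_parameters().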

After this step, training as shown below yields an upgraded model with the LoRA adapter attached.

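from transformers import AutoTokenizer, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
# Assumption: the tokenizer matches the base model loaded above (this step is not shown in the original post)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
# mlm=False selects causal (next-token) language modeling rather than masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)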
training_args = TrainingArguments(
    output_dir="./polyglot_model_peft",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets,  # assumed: a tokenized training split prepared beforehand (not shown)
)
trainer.train()

Reference: https://m.blog.naver.com/se2n/223375998670

Polyglot ko 1.3b Q/A model - applying PEFT LoRA (Low Rank Adaptation) and quantization to make LLMs lighter

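[2] Specific parameter settings

temperature: controls the diversity of the generated text. Higher values make the output more creative but also more likely to contain unfounded statements.
top_p: restricts sampling to the smallest set of tokens whose cumulative probability reaches this value. Lower values keep only the most likely tokens, so answers become more predictable; higher values allow more unexpected answers.
top_k: limits the number of candidate tokens considered at each step. Higher values allow more varied answers.
repetition_penalty: penalizes repetition. Higher values make the model less likely to repeat itself.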
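
The effect of these four parameters can be sketched on a toy logit vector. The function next_token_probs below is a hypothetical helper written for this post, and the order in which the filters are applied is one common choice rather than the exact behaviour of any particular library.

```python
import math


def next_token_probs(logits, prev_tokens=(), temperature=1.0, top_k=0,
                     top_p=1.0, repetition_penalty=1.0):
    """Return {token_id: probability} after applying the four sampling knobs."""
    scores = list(logits)
    # repetition_penalty > 1 lowers the score of tokens that were already generated
    for t in set(prev_tokens):
        if scores[t] > 0:
            scores[t] /= repetition_penalty
        else:
            scores[t] *= repetition_penalty
    # temperature < 1 sharpens the distribution, temperature > 1 flattens it
    scores = [s / temperature for s in scores]
    # softmax, then rank tokens from most to least likely
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # top_k keeps only the k most likely tokens (0 means no limit)
    if top_k > 0:
        ranked = ranked[:top_k]
        kept_mass = sum(p for p, _ in ranked)
        ranked = [(p / kept_mass, i) for p, i in ranked]
    # top_p keeps the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]
print(next_token_probs(toy_logits, top_p=0.5))   # only the most likely token survives
print(next_token_probs(toy_logits, top_k=2))     # two candidates remain
```

With these toy logits, lowering top_p shrinks the candidate set, which is why low values give more predictable answers rather than more surprising ones.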

[3] LoRA Land

You can download a model and attach a LoRA adapter to a model you already have in a local environment, but the following site also offers a simple demo.

https://predibase.com/lora-land

LoRA Land: Fine-Tuned Open-Source LLMs

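(2) Experiment: how model responses change with and without LoRA

Using the LoRA Land demo, this experiment compares the base model response with the LoRA fine-tuned model response.
Several categories and their corresponding adapters are selected, and the differences between the responses are compared.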
Used Base Model : Mistral 7B Instruct

[Case 1]

Category : Sentiment Detection
Adapter : Sentiment Detection (SST2)
Adapter info

Feature	Description
Definition and System Prompt setup	Given a sentence, classify the sentiment it expresses: return label 0 for negative sentiment and label 1 for positive sentiment
Adapter Accuracy	0.95 (Mistral 7B: 0, GPT-4: 0.95)
Used Dataset	https://huggingface.co/datasets/nyu-mll/glue/viewer/sst2
Prompt Template	Given the following sentence: "{sentence}" Respond with 0 if the sentiment of the sentence is negative and 1 if the sentiment of the sentence is positive.
Huggingface	https://huggingface.co/predibase/glue_sst2
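
When comparing responses programmatically, the prompt template can be filled in and the label extracted from whatever text comes back. The sketch below is an illustration only: build_prompt and parse_label are hypothetical helpers, and no call to the demo or to any model API is shown.

```python
import re

# Prompt template copied from the adapter card above.
SST2_TEMPLATE = (
    'Given the following sentence: "{sentence}" '
    "Respond with 0 if the sentiment of the sentence is negative "
    "and 1 if the sentiment of the sentence is positive."
)


def build_prompt(sentence: str) -> str:
    """Fill the {sentence} slot of the template."""
    return SST2_TEMPLATE.format(sentence=sentence)


def parse_label(response: str):
    """Hypothetical helper: return the first standalone 0 or 1 in a response, or None."""
    match = re.search(r"\b[01]\b", response)
    return int(match.group()) if match else None


print(build_prompt("a gripping film"))
print(parse_label("1"))
print(parse_label("0 (negative sentiment)"))
```

A parser like this accepts both a bare digit and a digit followed by an explanation, so the two kinds of response can be scored on the same footing.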

Experiment 1

{sentence} = “this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath .”

FINE-TUNED MODEL RESPONSE (Sentiment Detection (SST2))	BASE MODEL RESPONSE (Mistral 7B Instruct)
1	1

Experiment 2

{sentence} = “So we beat on, boats against the current, borne back ceaselessly into the past.”

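This sentence is the closing line of The Great Gatsby, and it has been interpreted in many different ways.

The aim here is to see how the models classify the sentiment of a sentence that has no single obvious meaning and depends on the reader's interpretation.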

FINE-TUNED MODEL RESPONSE (Sentiment Detection (SST2))	BASE MODEL RESPONSE (Mistral 7B Instruct)
0	0 (negative sentiment)

Below is the GPT-4o model's answer for the same sentence.

Query

tell me if this sentence conveys a positive emotion or a negative one.

Answer

The last sentence of The Great Gatsby — "So we beat on, boats against the current, borne back ceaselessly into the past." — expresses a complex blend of emotions, both positive and negative.
Positive aspects:

Resilience and perseverance: The imagery of "beating on" against the current suggests a sense of determination and persistence despite challenges.

Negative aspects:

Struggle and futility: The phrase "against the current" and being "borne back ceaselessly into the past" conveys a sense of inevitability and the struggle against forces that are difficult, if not impossible, to overcome.

Overall, the sentence encapsulates a poignant reflection on the human condition, balancing the hopeful persistence against the often overwhelming pull of the past and life's challenges. This duality gives it a richly ambivalent emotional tone.

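Conclusion for Case 1: Apart from the output format, there was no major difference between the LoRA-equipped model and the base LLM. However, for a sentence like the one in the second experiment, whose sentiment depends on context and has no fixed correct answer, neither output can be called correct, so an objective evaluation was not possible. Extending this a little, for sentences that are ambiguous rather than straightforward, that is, sentences of high complexity, a correct answer cannot be guaranteed, and a reader's own intuitive judgment may well lead them to regard the model's output as an error.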

[Case 2]

Category : Topic Identification
Adapter : News Topic Identification (dbpedia)
Adapter info

Feature	Description
Definition and System Prompt setup	Given an article as input, return its category. classes: 0 - Company, 1 - Educational Institution, 2 - Artist, 3 - Athlete, 4 - Office Holder, 5 - Mean Of Transportation, 6 - Building, 7 - Natural Place, 8 - Village, 9 - Animal, 10 - Plant, 11 - Album, 12 - Film, 13 - Written Work
Adapter Accuracy	0.99 (Mistral 7B: 0, GPT-4: 0.04)
Used Dataset	https://huggingface.co/datasets/fancyzhx/dbpedia_14
Prompt Template	You are given the title and the body of an article below. Please determine the type of the article. Your possible options are: ["Company", "EducationalInstitution", "Artist", "Athlete", "OfficeHolder", "MeanOfTransportation", "Building", "NaturalPlace", "Village", "Animal", "Plant", "Album", "Film", "WrittenWork"] ### Title: {title} ### Body: {content} ### Article Type:
Huggingface	https://huggingface.co/predibase/dbpedia

Experiment

{title} = “Kosmos Energy”

{content} = “Kosmos Energy is an American international oil company founded and based in Dallas Texas. It also maintains offices in the Bermudas Morocco Suriname and Ghana. It has been involved in the discovery and development of the Jubilee oil field off the coast of Ghana. Kosmos was a focus of the BBC documentary Storyville: Power Money Greed and Oil.”

FINE-TUNED MODEL RESPONSE (News Topic Identification (dbpedia))	BASE MODEL RESPONSE (Mistral 7B Instruct)
11	The article is about the debut album of a Canadian rock band, Secret and Whisper.

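The base model, which has not been tuned to output labels, does not return only the required information; it produces a full sentence with additional explanation. The fine-tuned model with the LoRA adapter attached, in contrast, outputs a single integer.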
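
To read an integer-only response, the label has to be mapped back to a class name. A minimal sketch follows, using the class order from the prompt template above; label_to_class is a hypothetical helper. Under this mapping the response 11 corresponds to Album, whereas an article about an oil company would be expected to map to 0 (Company), so the recorded responses may not belong to this input and are best read as a comparison of output format rather than of accuracy.

```python
# Class names in the order given in the prompt template above; the index is the label.
DBPEDIA_CLASSES = [
    "Company", "EducationalInstitution", "Artist", "Athlete", "OfficeHolder",
    "MeanOfTransportation", "Building", "NaturalPlace", "Village", "Animal",
    "Plant", "Album", "Film", "WrittenWork",
]


def label_to_class(response: str) -> str:
    """Hypothetical helper: map an integer-only response to its class name."""
    return DBPEDIA_CLASSES[int(response.strip())]


print(len(DBPEDIA_CLASSES))   # 14
print(label_to_class("11"))   # Album
print(label_to_class("0"))    # Company
```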

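Conclusion for Case 2: LLMs are not used only as chatbots. They are also used to automate whatever parts of a workflow can be automated. For example, when building training data from a large amount of text, labeling can be automated with an LLM to make the work more efficient. In the case above, this would normally mean retraining a model on training data covering all 14 labels; the more labels there are, the more training data is needed, and high accuracy cannot be guaranteed either. Attaching a LoRA layer that strips away the extra information and outputs only an integer therefore looks like a remarkably simple way to solve this problem.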

[Case 3]

Category : STEM (Science, Technology, Engineering, Mathematics)
Adapter : Code Generation (magicoder)
Adapter info

Feature	Description
Definition and System Prompt setup	Solve a given coding problem in the programming language requested by the user and return the code
Adapter Accuracy	0.11 (Mistral 7B: 0.01, GPT-4: 0.82)
Used Dataset	https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K
Prompt Template	Below is a programming problem, paired with a language in which the solution should be written. Write a solution in the provided that appropriately solves the programming problem. ### Problem: {problem} ### Language: {lang} ### Solution:
Huggingface	https://huggingface.co/predibase/magicoder

Notable point: accuracy is lower than that of the other adapters

Experiment 1

{problem} =

def strlen(string: str) -> int:
    """ Return length of given string
    >>> strlen('')
    0
    >>> strlen('abc')
    3
    """

{lang} = “python”

FINE-TUNED MODEL RESPONSE (Code Generation (magicoder))	BASE MODEL RESPONSE (Mistral 7B Instruct)
def strlen(string: str) -> int: return len(string)	Here's a solution in Python: def strlen(string: str) -> int: return len(string) This function takes a string as input and returns its length using the built-in len() function.

Experiment 2 (language changed: Python → Java)

{problem} =

class Solution {
    public double findMedianSortedArrays(int[] nums1, int[] nums2) {
        // Given two sorted arrays nums1 and nums2 of size m and n respectively,
        // return the median of the two sorted arrays.
        // The overall run time complexity should be O(log (m+n)).
        // Constraints:
        // nums1.length == m
        // nums2.length == n
        // 0 <= m <= 1000
        // 0 <= n <= 1000
        // 1 <= m + n <= 2000
        // -10^6 <= nums1[i], nums2[i] <= 10^6
        // Below is the example:
        // Input: nums1 = [1,3], nums2 = [2]
        // Output: 2.00000
        // Explanation: merged array = [1,2,3] and median is 2.
    }
}

{lang} = “Java”

FINE-TUNED MODEL RESPONSE (Code Generation (magicoder))

public class Solution {
   public double findMedianSortedArrays(int[] nums1, int[] nums2) {
       int m = nums1.length;
       int n = nums2.length;
       int[] merged = new int[m + n];
       int i = 0, j = 0, k = 0;
       while (i < m && j < n) {
           if (nums1[i] < nums2[j]) {
               merged[k++] = nums1[i++];
           } else {
               merged[k++] = nums2[j++];
           }
       }
       while (i < m) {
           merged[k++] = nums1[i++];
       }
       while (j < n) {
           merged[k++] = nums2[j++];
       }
       if ((m + n) % 2 == 0) {
           return (merged[m + n / 2 - 1] + merged[m + n / 2]) / 2.0;
       } else {
           return merged[m + n / 2];
       }
   }
}

BASE MODEL RESPONSE (Mistral 7B Instruct)

Here's a Java solution to the problem:

public class Solution {
    public double findMedianSortedArrays(int[] nums1, int[] nums2) {
        int m = nums1.length;
        int n = nums2.length;
        int total = m + n;
        int mid = total / 2;
        int left = 0;
        int right = total - 1;
        while (left <= right) {
            int mid = (left + right) / 2;
            if (total % 2 == 1) {
                if (nums1[mid] <= nums2[mid]) {
                    return (double) nums1[mid];
                } else {
                    return (double) nums2[mid];
                }
            } else {
                if (nums1[mid] <= nums2[mid - 1]) {
                    return (double) (nums1[mid] + nums2[mid - 1]) / 2;
                } else {

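Because of the output token limit, the base model's output is cut off partway through.

For Python, error-free code was generated without trouble, whereas after switching the target language to Java the generated code failed to pass all of the test cases.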
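
One concrete defect is visible in the fine-tuned model's Java output: the index expression m + n / 2 is evaluated as m + (n / 2) because division binds tighter than addition, so it does not point at the middle of the merged array. The sketch below is a Python port of the same merge approach with the index written as (m + n) / 2; median_by_merge is a name chosen for this illustration, and the approach is still O(m + n) rather than the O(log (m+n)) the problem asks for.

```python
def median_by_merge(nums1, nums2):
    """Merge two sorted lists and return the median of the merged list."""
    merged, i, j = [], 0, 0
    while i < len(nums1) and j < len(nums2):
        if nums1[i] < nums2[j]:
            merged.append(nums1[i])
            i += 1
        else:
            merged.append(nums2[j])
            j += 1
    merged.extend(nums1[i:])
    merged.extend(nums2[j:])
    total = len(merged)
    mid = total // 2  # (m + n) / 2, with the parentheses the generated code lacks
    if total % 2 == 0:
        return (merged[mid - 1] + merged[mid]) / 2.0
    return float(merged[mid])


m, n = 2, 1                          # sizes for nums1 = [1, 3], nums2 = [2]
print(m + n // 2)                    # 2: the index the generated code would use
print((m + n) // 2)                  # 1: the index of the actual median
print(median_by_merge([1, 3], [2]))  # 2.0
```

For the example in the problem statement, the generated code would read merged[2] = 3 instead of the expected median 2.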

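On inspection, the dataset turned out to be imbalanced, which leads to the conclusion that a LoRA adapter trained on an imbalanced dataset can show low accuracy in the same way as a full model trained on an imbalanced dataset.

This imbalance also answers the question raised above about why this adapter's accuracy is lower than that of the others. If the adapter were retrained on a balanced dataset, for example through oversampling, it would be expected to produce good results regardless of programming language. Note that a LoRA adapter is trained against a specific base model, so this retraining would have to be repeated for each base model it is attached to.

As for performance after attaching LoRA, what stood out, as with the adapters in the earlier experiments, is that simply attaching the adapter strips away the extra information and yields only the required output, without retraining the base model itself. This is both impressive and efficient.

Furthermore, as the second experiment shows, when the base model has an output token limit (as when the Max Tokens (num_predict) parameter in Open Web UI is set low), the code is cut off unless the model outputs only the code, concisely. If such a model's output were fed automatically into another program, the rate of output errors could be expected to rise. From this point of view as well, a LoRA adapter that produces only the desired output, without retraining the base model, seems efficient.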
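
If model output is fed into another program, a cheap guard against truncated code can help. The following is only a rough heuristic sketch; looks_truncated is a hypothetical helper, and it ignores brackets that appear inside strings or comments.

```python
def looks_truncated(code: str) -> bool:
    """Heuristic: unbalanced (), [] or {} suggests the code was cut off mid-output."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            # a closing bracket with no matching opener also counts as broken
            if not stack or stack.pop() != pairs[ch]:
                return True
    return bool(stack)


print(looks_truncated("def strlen(string: str) -> int: return len(string)"))
print(looks_truncated("public class Solution { public double f() { if (x) {"))
```

A check of this kind would flag a response that stops in the middle of a block, as the base model's Java response above does, before it reaches the downstream program.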
