Building an AI Agent with llama.cpp

Image by Author

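llama.cpp is the original, high-performance framework that powers many popular local AI tools, including Ollama, local chatbots, and other on-device LLM solutions. By working directly with llama.cpp, you can minimize overhead, gain fine-grained control, and optimize performance for your specific hardware, making your local AI agents and applications faster and more configurable.

In this tutorial, I will guide you through building AI applications with llama.cpp, a powerful C/C++ library for running large language models (LLMs) efficiently. We will cover setting up a llama.cpp server, integrating it with LangChain, and building a ReAct agent that can use tools such as web search and a Python REPL.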

1. Setting up the llama.cpp Server

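This section covers installing llama.cpp and its dependencies, configuring the build for CUDA support, building the necessary binaries, and running the server.

Note: I am using an NVIDIA RTX 4090 graphics card on a Linux operating system with the CUDA toolkit pre-configured. If you do not have access to similar local hardware, you can rent a GPU instance from Vast.ai at a low price.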

Screenshot from Vast.ai | Console

Update your system's package list and install essential tools such as build-essential, cmake, curl, and git. pciutils is included for hardware information, and libcurl4-openssl-dev is required for llama.cpp to download models from Hugging Face.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git -y
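
Before moving on to the build, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch, not part of the original setup: the list of command names (for example gcc from build-essential and lspci from pciutils) is an assumption about what the packages above provide, and missing_tools is a hypothetical helper.

```python
import shutil

# Assumption: these are the commands provided by the packages installed above.
REQUIRED_TOOLS = ["cmake", "curl", "git", "gcc", "lspci"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All build tools found.")
```

If anything is reported as missing, rerun the apt-get install step before configuring the build.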

Clone the official llama.cpp repository from GitHub and use CMake to configure the build.

# Clone llama.cpp repository
git clone 

# Configure build with CUDA support
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON

Compile llama.cpp and all of its tools, including the server. For convenience, copy all compiled binaries from the llama.cpp/build/bin/ directory to the main llama.cpp/ directory.

# Build all necessary binaries including server
cmake --build llama.cpp/build --config Release -j --clean-first

# Copy all binaries to main directory
cp llama.cpp/build/bin/* llama.cpp/

Start the llama.cpp server with the unsloth/gemma-3-4b-it-GGUF model.

./llama.cpp/llama-server \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
    --host 0.0.0.0 \
    --port 8000 \
    --n-gpu-layers 999 \
    --ctx-size 8192 \
    --threads $(nproc) \
    --temp 0.6 \
    --cache-type-k q4_0 \
    --jinja
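
Downloading and loading the model can take a while, so the server may not answer requests immediately. The sketch below polls until the server reports ready. It assumes the server is reachable on localhost at the port given above and that your llama-server build exposes a /health endpoint; HEALTH_URL, http_probe, and wait_until_ready are illustrative names, not part of llama.cpp.

```python
import time
import urllib.error
import urllib.request

# Assumption: server runs on this machine at the port passed via --port.
HEALTH_URL = "http://localhost:8000/health"


def http_probe(url=HEALTH_URL, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(probe=http_probe, attempts=30, delay=2.0):
    """Call `probe` until it returns True or `attempts` are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_until_ready() before sending the first request avoids confusing connection errors while the model is still loading. The probe is passed in as a parameter so that it can be swapped for a different check.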

You can test whether the server is running correctly by sending a POST request with curl.

curl -X POST  \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 150,
        "temperature": 0.7
    }'

Output:

{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"\nOkay, user greeted me with a simple "Hello! How are you today?" \n\nHmm, this seems like a casual opening. The user might be testing the waters to see if I respond naturally, or maybe they genuinely want to know how an AI assistant conceptualizes \"being\" but in a friendly way. \n\nI notice they used an exclamation mark, which feels warm and possibly playful. Maybe they're in a good mood or just trying to make conversation feel less robotic. \n\nSince I don't have emotions, I should clarify that gently but still keep it warm. The response should acknowledge their greeting while explaining my nature as an AI. \n\nI wonder if they're asking because they're curious about AI consciousness, or just being polite"}}],"created":1749319250,"model":"gpt-3.5-turbo","system_fingerprint":"b5605-5787b5da","object":"chat.completion","usage":{"completion_tokens":150,"prompt_tokens":9,"total_tokens":159},"id":"chatcmpl-jNfif9mcYydO2c6nK0BYkrtpNXSnseV1","timings":{"prompt_n":9,"prompt_ms":65.502,"prompt_per_token_ms":7.278,"prompt_per_second":137.40038472107722,"predicted_n":150,"predicted_ms":1207.908,"predicted_per_token_ms":8.052719999999999,"predicted_per_second":124.1816429728092}}
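
The raw JSON is hard to read at a glance. As a rough sketch, the fields that matter most can be pulled out as follows. The sample below is a trimmed copy of the response above with the long content string shortened, and summarize is a hypothetical helper name.

```python
import json

# Trimmed copy of the response above; the "content" string is shortened.
raw = """
{"choices": [{"finish_reason": "length", "index": 0,
              "message": {"role": "assistant", "content": "Okay, user greeted me..."}}],
 "usage": {"completion_tokens": 150, "prompt_tokens": 9, "total_tokens": 159},
 "timings": {"predicted_n": 150, "predicted_ms": 1207.908}}
"""


def summarize(response: dict) -> dict:
    """Extract the reply text, truncation flag, and generation speed."""
    choice = response["choices"][0]
    timings = response.get("timings", {})
    tokens_per_second = None
    if timings.get("predicted_ms"):
        tokens_per_second = timings["predicted_n"] / (timings["predicted_ms"] / 1000)
    return {
        "text": choice["message"]["content"],
        # "length" means generation stopped because max_tokens was reached.
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": response["usage"]["total_tokens"],
        "tokens_per_second": tokens_per_second,
    }


summary = summarize(json.loads(raw))
print(summary["truncated"], round(summary["tokens_per_second"], 1))
```

The computed speed matches the predicted_per_second value reported by the server (about 124 tokens per second), and the truncated flag explains why the reply above stops mid-sentence: the request capped generation at 150 tokens.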

2. Building an AI Agent with LangGraph and llama.cpp

Now, let's use LangGraph and LangChain to interact with the llama.cpp server and build a multi-tool AI agent.

Set your Tavily API key for search capabilities.
For LangChain to work with the local llama.cpp server (which emulates the OpenAI API), you can set OPENAI_API_KEY to local or any other non-empty string, because base_url will direct requests to the local server.

export TAVILY_API_KEY="your_api_key_here"
export OPENAI_API_KEY=local
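
A missing key tends to surface later as a confusing authentication error, so a quick check up front can save time. This is a minimal sketch; missing_env is a hypothetical helper, and the two variable names are taken from the export commands above.

```python
import os

REQUIRED_ENV = ("TAVILY_API_KEY", "OPENAI_API_KEY")


def missing_env(names=REQUIRED_ENV, env=None):
    """Return the variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        print("Set these variables first:", ", ".join(missing))
    else:
        print("Environment looks good.")
```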

Install the necessary Python libraries: langgraph for creating agents, tavily-python for the Tavily search tool, and several langchain packages for LLM interaction and tools.

%%capture
!pip install -U \
    langgraph tavily-python langchain langchain-community langchain-experimental langchain-openai

Configure ChatOpenAI from LangChain to communicate with your local llama.cpp server.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="unsloth/gemma-3-4b-it-GGUF:Q4_K_XL",   
    temperature=0.6,
    base_url="         
)

Set up the tools that your agent will be able to use.
- TavilySearchResults: Allows the agent to search the web.
- PythonREPLTool: Gives the agent a Python read-eval-print loop for executing code.

from langchain_community.tools import TavilySearchResults
from langchain_experimental.tools.python.tool import PythonREPLTool

search_tool = TavilySearchResults(max_results=5, include_answer=True)
code_tool   = PythonREPLTool()

tools = [search_tool, code_tool]

Use LangGraph's prebuilt create_react_agent function to create an agent that can reason and act (the ReAct framework) using the LLM and the tools defined above.

from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=tools,
)
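
To make the reason-and-act cycle concrete, here is a toy illustration in plain Python. It is not how LangGraph implements the agent: scripted_model stands in for the LLM, TOOLS for the tool list, and react_loop for the agent loop, and all three names are hypothetical.

```python
# Toy illustration of the reason -> act -> observe cycle.
# Nothing here is a LangGraph API; all names are made up for this sketch.

def add_numbers(text: str) -> str:
    """Toy tool: sum whitespace-separated integers."""
    return str(sum(int(part) for part in text.split()))


TOOLS = {"add_numbers": add_numbers}


def scripted_model(messages):
    """Stand-in for an LLM: request a tool first, then answer from its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"role": "assistant",
                "content": f"The sum is {last['content']}.",
                "tool_call": None}
    return {"role": "assistant",
            "content": "",
            "tool_call": {"name": "add_numbers", "input": "2 3 5"}}


def react_loop(question, model, tools, max_steps=5):
    """Alternate between model replies and tool executions until an answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:  # no tool requested, so this is the final answer
            return reply["content"], messages
        observation = tools[call["name"]](call["input"])
        messages.append({"role": "tool", "content": observation})
    raise RuntimeError("agent did not finish within max_steps")


answer, history = react_loop("What is 2 + 3 + 5?", scripted_model, TOOLS)
print(answer)  # The sum is 10.
```

The real agent follows the same shape: the model either requests a tool or answers, and each tool result is appended to the message history before the model is called again.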

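3. Test the AI Agent with Example Queries

Now, we will test the AI agent and also display which tools it uses.

This helper function extracts the names of the tools the agent used from the conversation history. This is useful for understanding the agent's decision-making process.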

def extract_tool_names(conversation: dict) -> list[str]:
    """Return the sorted, de-duplicated tool names called in a conversation."""
    tool_names = set()
    for msg in conversation.get('messages', []):
        calls = []
        if hasattr(msg, 'tool_calls'):
            calls = msg.tool_calls or []
        elif isinstance(msg, dict):
            calls = msg.get('tool_calls') or []
            if not calls and isinstance(msg.get('additional_kwargs'), dict):
                calls = msg['additional_kwargs'].get('tool_calls', [])
        else:
            ak = getattr(msg, 'additional_kwargs', None)
            if isinstance(ak, dict):
                calls = ak.get('tool_calls', [])
        for call in calls:
            if isinstance(call, dict):
                if 'name' in call:
                    tool_names.add(call['name'])
                elif 'function' in call and isinstance(call['function'], dict):
                    fn = call['function']
                    if 'name' in fn:
                        tool_names.add(fn['name'])
    return sorted(tool_names)
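
If you also want to know how often each tool was called, a counter works well. The sketch below is a simplified variant that assumes every message is a plain dict; count_tool_calls and the sample messages are hypothetical and only mirror the two call shapes handled above.

```python
from collections import Counter


def count_tool_calls(messages):
    """Count calls per tool name in dict-shaped messages."""
    counts = Counter()
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            # A call carries its name at the top level or under "function".
            name = call.get("name") or call.get("function", {}).get("name")
            if name:
                counts[name] += 1
    return counts


# Hypothetical conversation, for illustration only.
sample = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "tool_calls": [{"name": "Python_REPL", "args": {}}]},
    {"role": "tool", "content": "..."},
    {"role": "assistant", "tool_calls": [
        {"function": {"name": "Python_REPL"}},
        {"name": "tavily_search_results_json"},
    ]},
    {"role": "assistant", "content": "done"},
]

print(count_tool_calls(sample))
```

Repeated calls to the same tool can be a hint that a small model is retrying after a failed attempt.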

Define a function that runs the agent with a given question and returns the tools used along with the final answer.

def run_agent(question: str):
    result = agent.invoke({"messages": [{"role": "user", "content": question}]})
    raw_answer = result["messages"][-1].content
    tools_used = extract_tool_names(result)
    return tools_used, raw_answer

Let's ask the agent for the top 5 breaking news stories. It should use the tavily_search_results_json tool.

# Use a separate name so the global `tools` list passed to the agent is not overwritten
tools_used, answer = run_agent("What are the top 5 breaking news stories?")
print("Tools used ➡️", tools_used)
print(answer)

Output:

Tools used ➡️ ['tavily_search_results_json']
Here are the top 5 breaking news stories based on the provided sources:

1.  **Gaza Humanitarian Crisis:** Ongoing conflict and challenges in Gaza, including the Eid al-Adha holiday, and the retrieval of a Thai hostage's body.
2.  **Russian Drone Attacks on Kharkiv:** Russia continues to target Ukrainian cities with drone and missile strikes.
3.  **Wagner Group Departure from Mali:** The Wagner Group is leaving Mali after heavy losses, but Russia's Africa Corps remains.
4.  **Trump-Musk Feud:** A dispute between former President Trump and Elon Musk could have implications for Tesla stock and the U.S. space program.
5.  **Education Department Staffing Cuts:** The Biden administration is seeking Supreme Court intervention to block planned staffing cuts at the Education Department.

Let's ask the agent to write and execute Python code for the Fibonacci series. It should use the Python_REPL tool.

# Use a separate name so the global `tools` list passed to the agent is not overwritten
tools_used, answer = run_agent(
    "Write a code for the Fibonacci series and execute it using Python REPL."
)
print("Tools used ➡️", tools_used)
print(answer)

Output:

Tools used ➡️ ['Python_REPL']
The Fibonacci series up to 10 terms is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
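
The code the agent generated is not shown in the output. For reference, a program along the following lines produces the same ten terms. This is a sketch of what the agent might have run, not a record of what it actually wrote.

```python
def fibonacci(n_terms: int) -> list[int]:
    """Return the first `n_terms` Fibonacci numbers, starting from 0."""
    series = []
    a, b = 0, 1
    for _ in range(n_terms):
        series.append(a)
        a, b = b, a + b
    return series


print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```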

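Final Thoughts

In this guide, I used a small quantized LLM, which sometimes struggles with accuracy, especially when it comes to selecting the right tool. If your goal is to build production-ready AI agents, I recommend running the latest full-size models with llama.cpp. Larger and more recent models generally provide better results and more reliable outputs.

Setting up llama.cpp can be more challenging than using user-friendly tools like Ollama. However, if you are willing to invest the time to debug, optimize, and tailor llama.cpp to your specific hardware, the performance gains and flexibility are well worth it.

One of the biggest advantages of llama.cpp is its efficiency: you do not need high-end hardware to get started. It runs well on regular CPUs and on laptops without dedicated GPUs, making local AI accessible to almost everyone. And if you ever need more power, you can always rent an affordable GPU instance from a cloud provider.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.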
