Chat & Generate#

Learn how to chat with LLMs in Xinference.

Introduction#

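Models with the chat or generate ability are commonly called large language models (LLMs) or text generation models. They are designed to respond with text output to the input they receive, which is usually called a "prompt". You can typically steer such a model toward a task by giving it specific instructions or concrete examples.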

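Models with the generate ability are typically pre-trained large language models. Models with the chat ability, by contrast, are LLMs that have been fine-tuned and aligned to perform well in conversational scenarios. In most cases, models whose names end in "chat" (e.g. llama-2-chat, qwen-chat) have the chat ability.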

The Chat API and the Generate API offer two different ways to interact with LLMs:

The Chat API (similar to OpenAI's `Chat Completion API <https://platform.openai.com/docs/api-reference/chat/create>`__) supports multi-turn conversations.
The Generate API (similar to OpenAI's Completions API) generates text from a text prompt.

Model ability	API endpoint	OpenAI-compatible endpoint
chat	Chat API	/v1/chat/completions
generate	Generate API	/v1/completions
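
The mapping in the table can be sketched as a small lookup. This is only an illustration: `ENDPOINTS` and `endpoint_url` are hypothetical names, not part of Xinference, and the host and port in the usage line are placeholders for your own deployment.

```python
# Mapping taken from the table above: model ability -> OpenAI-compatible path.
ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "generate": "/v1/completions",
}


def endpoint_url(base_url: str, ability: str) -> str:
    """Return the full endpoint URL for a model ability (hypothetical helper)."""
    try:
        path = ENDPOINTS[ability]
    except KeyError:
        raise ValueError(f"unsupported ability: {ability!r}") from None
    # Strip a trailing slash so the result never contains "//v1".
    return base_url.rstrip("/") + path


# Placeholder host and port; substitute your own Xinference address.
print(endpoint_url("http://localhost:9997", "chat"))
# http://localhost:9997/v1/chat/completions
```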

Supported models#

You can look up the abilities of every built-in LLM in Xinference.

Chat models#

Chat API#

Try the Chat API using cURL, the OpenAI client, or Xinference's Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {
            "content": "What is the largest animal?",
            "role": "user",
        }
    ],
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the largest animal?"}]
model.chat(
    messages,
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
)

{
  "id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "chat.completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life."
      },
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
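
Because the Chat API takes the whole message list on every call, a multi-turn conversation can be kept going by appending the assistant's reply and the next user message before the following request. The sketch below assumes the response shape shown above; `next_turn` is a hypothetical helper, and the response dictionary is abbreviated sample data rather than real server output.

```python
from typing import Dict, List

Message = Dict[str, str]


def next_turn(messages: List[Message], response: dict, user_input: str) -> List[Message]:
    """Return a new message list extended with the assistant reply and the next user message."""
    reply = response["choices"][0]["message"]
    return messages + [
        {"role": reply["role"], "content": reply["content"]},
        {"role": "user", "content": user_input},
    ]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the largest animal?"},
]
# Abbreviated sample response in the same shape as the JSON above.
response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "The blue whale."}}
    ]
}
history = next_turn(history, response, "How long can it grow?")
print([m["role"] for m in history])  # ['system', 'user', 'assistant', 'user']
```

The extended `history` list can then be passed as `messages` in the next Chat API call.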

You can find more Chat API examples in the tutorial notebook.

Gradio Chat

An example that shows how to use Xinference's Chat API and Python client.

https://github.com/xorbitsai/inference/blob/main/examples/gradio_chatinterface.py

Mixed Thinking Model#

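Some large language models are marked as ``hybrid``, which means you can choose whether they run in thinking mode.

Added in version v1.17.0: The request-level enable_thinking switch is supported as of v1.17.0.

Xinference provides a request-level enable_thinking switch and applies it across the different model templates (for example, Qwen uses ``enable_thinking``, while some DeepSeek templates use ``thinking``).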

Usage examples:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {"role": "user", "content": "What is the largest animal?"}
    ],
    "enable_thinking": false
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {"role": "user", "content": "What is the largest animal?"}
    ],
    extra_body={"enable_thinking": False}
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    enable_thinking=False,
)

model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    generate_config={"chat_template_kwargs": {"enable_thinking": False}},
)
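
When the request body is assembled by hand, for example to send it with an HTTP library of your choice, the switch is just one more top-level field, as in the cURL example. The following is a minimal sketch: `build_chat_body` is a hypothetical helper, and leaving the field out when no preference is given is an assumption about how you may want to defer to the model's default behaviour.

```python
import json
from typing import Dict, List, Optional


def build_chat_body(
    model_uid: str,
    messages: List[Dict[str, str]],
    enable_thinking: Optional[bool] = None,
) -> dict:
    """Build a /v1/chat/completions request body (hypothetical helper)."""
    body = {"model": model_uid, "messages": messages}
    # Only include the switch when the caller states a preference (assumption).
    if enable_thinking is not None:
        body["enable_thinking"] = enable_thinking
    return body


body = build_chat_body(
    "<MODEL_UID>",
    [{"role": "user", "content": "What is the largest animal?"}],
    enable_thinking=False,
)
# json.dumps renders Python's False as JSON false, matching the cURL example.
print(json.dumps(body, indent=2))
```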

Generate models#

Generate API#

The Generate API mirrors OpenAI's `Completions API <https://platform.openai.com/docs/api-reference/completions/create>`__.

The main difference between the Generate API and the Chat API is the input format. The Chat API takes a list of messages as input, while the Generate API takes a free-text string named `prompt`.

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "prompt": "What is the largest animal?",
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
# The Generate API uses the completions endpoint and takes a plain `prompt` string.
client.completions.create(
    model="<MODEL_UID>",
    prompt="What is the largest animal?",
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
print(model.generate(
    prompt="What is the largest animal?",
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
))

{
  "id": "cmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "text_completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "text": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life.",
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
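
The two APIs also differ in where the generated text sits in the response: the Chat API example nests it under `message.content`, while the Generate API example returns it in `text`. A minimal sketch of reading either shape, based only on the two sample responses in this document, might look like this. `extract_text` is a hypothetical helper, the dictionaries are abbreviated sample data, and streaming responses are not handled.

```python
def extract_text(response: dict) -> str:
    """Return the generated text from a Chat API or Generate API response."""
    choice = response["choices"][0]
    kind = response.get("object")
    if kind == "chat.completion":
        return choice["message"]["content"]
    if kind == "text_completion":
        return choice["text"]
    raise ValueError(f"unexpected response object: {kind!r}")


# Abbreviated sample responses in the shapes shown in this document.
chat_response = {
    "object": "chat.completion",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "The blue whale."}}
    ],
}
generate_response = {
    "object": "text_completion",
    "choices": [{"index": 0, "text": "The blue whale."}],
}
print(extract_text(chat_response))      # The blue whale.
print(extract_text(generate_response))  # The blue whale.
```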

FAQ#

Do Xinference's LLMs offer integration with LangChain or LlamaIndex?#

Yes. Refer to the relevant sections of their respective official Xinference documentation. The links are as follows: