Custom Models#

Xinference provides a flexible and comprehensive way to integrate, manage, and use custom models.

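Launch a custom model directly without registering it#

Since v0.14.0, if the model you want to use belongs to a model family that Xinference supports natively, you can launch it directly through the model_path parameter of the launch interface and skip the registration process. This approach is now strongly recommended.

For example: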

xinference launch --model-path <model_file_path> --model-engine <engine> -n qwen1.5-chat

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model_engine": "<engine>",
  "model_name": "qwen1.5-chat",
  "model_path": "<model_file_path>"
}'

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
  model_engine="<inference_engine>",
  model_name="qwen1.5-chat",
  model_path="<model_file_path>"
)
print('Model uid: ' + model_uid)

The examples above show how to launch the qwen1.5-chat model directly when you already have its model files.

In a distributed deployment, place the model files on a specific worker, then pass the worker_ip and model_path parameters to the launch interface to launch the model directly on that worker.
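As a rough sketch of the distributed case, the request body for the launch endpoint could be assembled as follows. It assumes the REST endpoint accepts a `worker_ip` field next to `model_path`, as the paragraph above suggests; the helper name `build_launch_payload`, the engine choice, the path, and the IP address are all hypothetical.

```python
import json


def build_launch_payload(model_name, model_engine, model_path, worker_ip=None):
    """Assemble a JSON body for POST /v1/models (a sketch, not the official client)."""
    payload = {
        "model_name": model_name,
        "model_engine": model_engine,
        "model_path": model_path,
    }
    # Assumption: worker_ip pins the launch to the worker that holds the files.
    if worker_ip is not None:
        payload["worker_ip"] = worker_ip
    return payload


payload = build_launch_payload(
    model_name="qwen1.5-chat",
    model_engine="transformers",              # hypothetical engine choice
    model_path="/data/models/qwen1.5-chat",   # hypothetical path on the worker
    worker_ip="192.168.0.12",                 # hypothetical worker address
)
print(json.dumps(payload, indent=2))
```

The printed body can be sent with the same curl command shown earlier; leaving `worker_ip` out keeps the payload identical to the single-node example.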

Note

When using the CLI (command-line interface), prefer ``--model-path`` (the hyphen-separated, kebab-case form). ``--model_path`` is still accepted for backward compatibility but is not recommended.

Define a custom model#

Web UI: automatic parsing of LLM configuration#

Added in version v2.0.0.

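When you register a custom LLM through the Web UI, Xinference automatically parses the model configuration and pre-fills the key fields.

You only need to provide:

Model path / model ID (where the model lives: a local path or a model hub ID)
Model family

After parsing, the UI can fill in the following fields automatically:

Context length
Model language
Model abilities
Model specs

You can review and edit these fields before saving the custom model.

Define a custom model based on the following templates: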

{
    "version": 2,
    "context_length": 32768,
    "model_name": "custom-qwen-2.5",
    "model_lang": [
        "en",
        "zh"
    ],
    "model_ability": [
        "generate"
    ],
    "model_description": "This is a custom model description.",
    "model_family": "my-custom-qwen-2.5",
    "model_specs": [
        {
            "model_format": "pytorch",
            "model_size_in_billions": "0_5",
            "quantization": "none",
            "model_id": null,
            "model_hub": "huggingface",
            "model_uri": "file:///path/to/models--Qwen--Qwen2.5-0.5B",
            "model_revision": null,
            "activated_size_in_billions": null
        }
    ],
    "chat_template": null,
    "stop_token_ids": null,
    "stop": null,
    "reasoning_start_tag": null,
    "reasoning_end_tag": null,
    "cache_config": null,
    "virtualenv": {
        "packages": [],
        "inherit_pip_config": true,
        "index_url": null,
        "extra_index_url": null,
        "find_links": null,
        "trusted_host": null,
        "no_build_isolation": null
    },
    "is_builtin": false
}

{
   "version": 2,
   "model_name": "my-bge-large-zh-v1.5",
   "dimensions": 1024,
   "max_tokens": 512,
   "language": [
       "zh"
   ],
   "model_specs": [
      {
          "model_format": "pytorch",
          "model_hub": "huggingface",
          "model_id": null,
          "model_uri": "file:///path/to/my-bge-large-zh-v1.5",
          "model_revision": null,
          "quantization": "none"
      }
   ],
   "cache_config": null,
   "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
   },
   "is_builtin": false
}

{
  "version": 2,
  "model_name": "my-bge-reranker-base",
  "model_specs": [
      {
          "model_format": "pytorch",
          "model_hub": "huggingface",
          "model_id": null,
          "model_revision": null,
          "model_uri": "file:///path/to/my-bge-reranker-base",
          "quantization": "none"
      }
  ],
  "language": [
      "en",
      "zh"
  ],
  "type": "unknown",
  "max_tokens": 512,
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "is_builtin": false
}

{
  "model_name": "my-qwen-image",
  "model_id": null,
  "model_revision": null,
  "model_hub": "huggingface",
  "cache_config": null,
  "version": 2,
  "model_family": "stable_diffusion",
  "model_ability": null,
  "controlnet": [],
  "default_model_config": {},
  "default_generate_config": {},
  "gguf_model_id": null,
  "gguf_quantizations": null,
  "gguf_model_file_name_template": null,
  "lightning_model_id": null,
  "lightning_versions": null,
  "lightning_model_file_name_template": null,
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "model_uri": "file:///path/to/my-qwen-image",
  "is_builtin": false
}

{
  "model_name": "my-ChatTTS",
  "model_id": null,
  "model_revision": null,
  "model_hub": "huggingface",
  "cache_config": null,
  "version": 2,
  "model_family": "ChatTTS",
  "multilingual": false,
  "language": null,
  "model_ability": [
      "text2audio"
  ],
  "default_model_config": null,
  "default_transcription_config": null,
  "engine": null,
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "model_uri": "file:///path/to/my-ChatTTS",
  "is_builtin": false
}

{
  "model_name": "my-flexible-model",
  "model_id": null,
  "model_revision": null,
  "model_hub": "huggingface",
  "cache_config": null,
  "version": 2,
  "model_description": "This is a model description.",
  "model_uri": "file:///path/to/my-flexible-model",
  "launcher": "xinference.model.flexible.launchers.transformers",
  "launcher_args": "{}",
  "virtualenv": {
      "packages": [],
      "inherit_pip_config": true,
      "index_url": null,
      "extra_index_url": null,
      "find_links": null,
      "trusted_host": null,
      "no_build_isolation": null
  },
  "is_builtin": false
}
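Instead of editing a template by hand, a definition can also be generated from Python and written to `model.json`. The sketch below mirrors the first (LLM) template above; the helper name `make_llm_definition` is hypothetical, and the file goes to a temporary directory purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def make_llm_definition(model_name, model_family, model_dir, size_in_billions):
    """Return a dict mirroring the LLM template, with the variable parts filled in."""
    return {
        "version": 2,
        "context_length": 32768,
        "model_name": model_name,
        "model_lang": ["en", "zh"],
        "model_ability": ["generate"],
        "model_description": "This is a custom model description.",
        "model_family": model_family,
        "model_specs": [
            {
                "model_format": "pytorch",
                "model_size_in_billions": size_in_billions,
                "quantization": "none",
                "model_id": None,
                "model_hub": "huggingface",
                # pytorch format: the URI points at the directory holding the files
                "model_uri": f"file://{model_dir}",
                "model_revision": None,
                "activated_size_in_billions": None,
            }
        ],
        "chat_template": None,
        "stop_token_ids": None,
        "stop": None,
        "reasoning_start_tag": None,
        "reasoning_end_tag": None,
        "cache_config": None,
        "virtualenv": {
            "packages": [],
            "inherit_pip_config": True,
            "index_url": None,
            "extra_index_url": None,
            "find_links": None,
            "trusted_host": None,
            "no_build_isolation": None,
        },
        "is_builtin": False,
    }


definition = make_llm_definition(
    "custom-qwen-2.5",
    "my-custom-qwen-2.5",
    "/path/to/models--Qwen--Qwen2.5-0.5B",
    "0_5",
)
out_path = Path(tempfile.mkdtemp()) / "model.json"
out_path.write_text(json.dumps(definition, indent=4), encoding="utf-8")

# Reload the file to confirm it round-trips as valid JSON.
reloaded = json.loads(out_path.read_text(encoding="utf-8"))
print(reloaded["model_specs"][0]["model_uri"])
```

Python's `None`, `True`, and `False` serialize to JSON `null`, `true`, and `false`, so the written file has the same shape as the template.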

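model_name: The name of the model. It must start with a letter or a digit and may contain only letters, digits, underscores, or dashes.
context_length: An optional integer giving the maximum context length the model supports, covering both input and output. If it is not defined, the default is 2048 tokens (roughly 1,500 words).
dimensions: An integer defining the size of the vector an embedding model outputs.
max_tokens: An integer defining the maximum number of input tokens an embedding model can process in a single request.
model_lang: A list of strings indicating the languages the model supports. For example, ['en'] means the model supports English.
model_ability: A list of strings defining the model's abilities. It may include options such as 'embed', 'generate', and 'chat'. The example template declares the 'generate' ability.
model_family: A required string naming the model family to register. Its value must not conflict with any built-in model name.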
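model_specs: An array of objects defining the model's specifications. Each specification includes:
- model_format: A string defining the model format, either 'pytorch' or 'ggufv2'.
- model_size_in_billions: An integer defining the model's parameter count in billions.
- quantizations: A list of strings defining the model's quantization methods. For PyTorch models, it can be "4-bit", "8-bit", or "none". For ggufv2 models, the quantization values must match those used in model_file_name_template. Some engines also support the fp4 / fp8 / bnb formats (see the installation guide for backend support details).
- model_id: A string giving the model ID, which can be the Hugging Face repository ID of the model. If the model_uri field is missing, Xinference tries to download the model from the Hugging Face repository this ID points to.
- model_hub: An optional string indicating where to download the model from, for example HuggingFace or modelscope.
- model_uri: A string giving the location of the model files, for example a local directory: "file:///path/to/llama-2-7b". When model_format is ggufv2, this field must be the path to the specific model file. When model_format is pytorch, it must be a directory containing all the model files.
- model_revision: A string giving the specific version of the model files, or the commit hash to use from the repository.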
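chat_template: If ``model_ability`` includes ``chat``, you must configure this option so that the correct full prompt can be generated. It is a Jinja template string, usually found in the ``tokenizer_config.json`` file in the model directory.
stop_token_ids: If ``model_ability`` includes ``chat``, configuring this option is recommended so that the conversation stops correctly. It is a list of integers, and the values can be extracted from the ``generation_config.json`` and ``tokenizer_config.json`` files in the model directory.
stop: If ``model_ability`` includes ``chat``, configuring this option is recommended so that the conversation stops correctly. It is a list of strings; the strings corresponding to the stop token values can be found in the ``tokenizer_config.json`` file in the model directory.
reasoning_start_tag: A special token or prompt that tells the LLM to clearly mark where its chain of thought or reasoning process begins in the output.
reasoning_end_tag: A special token or prompt that tells the LLM to clearly mark where its chain of thought or reasoning process ends in the output.
cache_config: A string representing the parameters the system uses to store and manage temporary data (the cache).
virtualenv: A settings object for model dependency isolation. Please refer to this document for details.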
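The chat-related fields above can often be copied out of the model directory rather than typed by hand. The sketch below is one possible way to do that: it assumes a Hugging Face style directory in which `tokenizer_config.json` carries a `chat_template` key and `generation_config.json` carries `eos_token_id` as either an integer or a list. The helper name `read_chat_settings` and the demo files are hypothetical, and real models may store these values differently.

```python
import json
import tempfile
from pathlib import Path


def read_chat_settings(model_dir):
    """Collect chat_template and stop token IDs from a model directory."""
    model_dir = Path(model_dir)
    tokenizer_cfg = json.loads(
        (model_dir / "tokenizer_config.json").read_text(encoding="utf-8")
    )

    # generation_config.json is optional; fall back to an empty config.
    generation_cfg = {}
    generation_cfg_path = model_dir / "generation_config.json"
    if generation_cfg_path.exists():
        generation_cfg = json.loads(generation_cfg_path.read_text(encoding="utf-8"))

    # eos_token_id may be a single integer or a list of integers.
    eos = generation_cfg.get("eos_token_id")
    if eos is None:
        stop_token_ids = []
    elif isinstance(eos, int):
        stop_token_ids = [eos]
    else:
        stop_token_ids = list(eos)

    return {
        "chat_template": tokenizer_cfg.get("chat_template"),
        "stop_token_ids": stop_token_ids,
    }


# Build a tiny stand-in model directory so the sketch runs without a real model.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "tokenizer_config.json").write_text(
    json.dumps({"chat_template": "{{ messages }}"}), encoding="utf-8"
)
(demo_dir / "generation_config.json").write_text(
    json.dumps({"eos_token_id": [151645, 151643]}), encoding="utf-8"
)

settings = read_chat_settings(demo_dir)
print(settings)
```

The returned values can then be placed into the `chat_template` and `stop_token_ids` fields of the definition before it is registered.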

Register a custom model#

Register a custom model in code:

import json
from xinference.client import Client

with open('model.json') as fd:
    model = fd.read()

# replace with real xinference endpoint
endpoint = 'http://localhost:9997'
client = Client(endpoint)
client.register_model(model_type="<model_type>", model=model, persist=False)

Or from the command line:

xinference register --model-type <model_type> --file model.json --persist

Replace ``<model_type>`` in these examples with ``LLM``, ``embedding``, or ``rerank``.
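A malformed `model.json` is easier to diagnose before it reaches the server. The following sketch wraps the registration call with a local JSON check; `register_from_file` and `RecordingClient` are hypothetical names, and the stub client only stands in for `xinference.client.Client` so that the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def register_from_file(client, path, model_type, persist=False):
    """Check that the file is valid JSON with a model_name, then register it."""
    raw = Path(path).read_text(encoding="utf-8")
    definition = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not definition.get("model_name"):
        raise ValueError("model definition has no model_name")
    # register_model receives the JSON text itself, as in the example above.
    client.register_model(model_type=model_type, model=raw, persist=persist)
    return definition["model_name"]


class RecordingClient:
    """Hypothetical stand-in for the real client; it only records calls."""

    def __init__(self):
        self.calls = []

    def register_model(self, model_type, model, persist):
        self.calls.append((model_type, json.loads(model)["model_name"], persist))


definition_path = Path(tempfile.mkdtemp()) / "model.json"
definition_path.write_text(
    json.dumps({"version": 2, "model_name": "custom-qwen-2.5"}), encoding="utf-8"
)

stub = RecordingClient()
registered = register_from_file(stub, definition_path, model_type="LLM")
print(registered, stub.calls)
```

Against a running server, passing a real `Client(endpoint)` in place of the stub should behave the same way, since the helper only relies on the `register_model` call already shown.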

List built-in and custom models#

List built-in and custom models in code:

registrations = client.list_model_registrations(model_type="<model_type>")

Or from the command line:

xinference registrations --model-type <model_type>
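To pick out only the custom entries from such a listing, a small filter is usually enough. The sketch below assumes each registration is a dictionary with `model_name` and `is_builtin` keys; that shape is an assumption rather than something this page states, and the sample data is made up for illustration.

```python
def custom_model_names(registrations):
    """Return the sorted names of registrations that are not built in."""
    return sorted(
        entry["model_name"]
        for entry in registrations
        if not entry.get("is_builtin", False)
    )


# Hypothetical sample mimicking the assumed return shape.
sample_registrations = [
    {"model_name": "qwen1.5-chat", "is_builtin": True},
    {"model_name": "custom-qwen-2.5", "is_builtin": False},
]

print(custom_model_names(sample_registrations))
```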

Launch a custom model#

Launch the custom model in code:

uid = client.launch_model(model_name='custom-llama-2', model_format='pytorch')

Or from the command line:

xinference launch --model-name custom-llama-2 --model-format pytorch

Use a custom model#

Call the model in code:

model = client.get_model(model_uid=uid)
model.generate('What is the largest animal in the world?')

The result is:

{
   "id":"cmpl-a4a9d9fc-7703-4a44-82af-fce9e3c0e52a",
   "object":"text_completion",
   "created":1692024624,
   "model":"43e1f69a-3ab0-11ee-8f69-fa163e74fa2d",
   "choices":[
      {
         "text":"\nWhat does an octopus look like?\nHow many human hours has an octopus been watching you for?",
         "index":0,
         "logprobs":"None",
         "finish_reason":"stop"
      }
   ],
   "usage":{
      "prompt_tokens":10,
      "completion_tokens":23,
      "total_tokens":33
   }
}

Or use the command line, replacing ``${UID}`` with the actual model UID:

xinference generate --model-uid ${UID}
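The completion result shown above is a plain dictionary, so the generated text and token usage can be pulled out directly. This is a minimal sketch using a trimmed copy of that sample response; the helper name `summarize_completion` is hypothetical.

```python
def summarize_completion(response):
    """Extract the first choice's text and check the token accounting."""
    choice = response["choices"][0]
    usage = response["usage"]
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice["finish_reason"],
        # total_tokens is expected to equal prompt plus completion tokens
        "tokens_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"]
            == usage["total_tokens"]
        ),
    }


# Trimmed copy of the sample response from this section.
sample_response = {
    "object": "text_completion",
    "choices": [
        {
            "text": "\nWhat does an octopus look like?",
            "index": 0,
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 23, "total_tokens": 33},
}

summary = summarize_completion(sample_response)
print(summary["text"])
```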

Unregister a custom model#

Unregister the custom model in code:

model = client.unregister_model(model_type="<model_type>", model_name='custom-llama-2')

Or from the command line:

xinference unregister --model-type <model_type> --model-name custom-llama-2