Using Xinference#

Run Xinference Locally#

Let's use qwen2.5-instruct, a classic large language model, as an example of how to run a large model locally with Xinference.

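After this quick start, you can go on to learn how to deploy Xinference in a distributed cluster environment.

Start Local Service#

First, make sure Xinference is installed locally by following the instructions in the installation documentation. Then start a local Xinference service with the following command: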

xinference-local --host 0.0.0.0 --port 9997

INFO     Xinference supervisor 0.0.0.0:64570 started
INFO     Xinference worker 0.0.0.0:64570 started
INFO     Starting Xinference at endpoint: http://0.0.0.0:9997
INFO     Uvicorn running on http://0.0.0.0:9997 (Press CTRL+C to quit)

Note

By default, Xinference uses ``<HOME>/.xinference`` as its home directory for storing essential files such as logs and model files, where ``<HOME>`` is the current user's home directory.

You can modify the home directory by configuring the environment variable XINFERENCE_HOME, for example:

XINFERENCE_HOME=/tmp/xinference xinference-local --host 0.0.0.0 --port 9997
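The lookup rule described in this note can be sketched in Python. This is only an illustration of the documented behavior, not Xinference's own implementation, and the helper name `resolve_xinference_home` is hypothetical.

```python
import os
from pathlib import Path


def resolve_xinference_home(environ=None):
    """Return the home directory implied by the rule in this note.

    Hypothetical helper for illustration; not part of the Xinference API.
    """
    env = os.environ if environ is None else environ
    override = env.get("XINFERENCE_HOME")
    if override:
        # An explicit XINFERENCE_HOME takes precedence.
        return Path(override)
    # Documented default: <HOME>/.xinference
    return Path.home() / ".xinference"
```

Passing a dictionary instead of reading `os.environ` directly makes it easy to check both cases without changing the real environment.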

Congratulations! You now have Xinference running locally. Once the service is running, you can use it in several ways: through the web UI, cURL, the command line, or Xinference's Python SDK.

The web UI is available at http://127.0.0.1:9997/ui and the API documentation at http://127.0.0.1:9997/docs.

After installing Xinference with the following command, you can use it through the command-line tool or Python code:

pip install xinference

The command-line tool is ``xinference``. List the available commands by running:

xinference --help

Usage: xinference [OPTIONS] COMMAND [ARGS]...

Options:
  -v, --version       Show the version and exit.
  --log-level TEXT
  -H, --host TEXT
  -p, --port INTEGER
  --help              Show this message and exit.

Commands:
  cached
  cal-model-mem
  chat
  engine
  generate
  launch
  list
  login
  register
  registrations
  remove-cache
  stop-cluster
  terminate
  unregister
  vllm-models

If you only need Xinference's Python SDK, you can install it with minimal dependencies using the following command. The client version must match the version of the Xinference service.

pip install xinference-client==${SERVER_VERSION}
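A small check like the following can guard against a client/server version mismatch. It is a sketch under assumptions: the server version is supplied by the caller (for example, copied from `xinference --version` on the server), the helper names are hypothetical, and treating a leading `v` as insignificant is a convenience of this example rather than documented behavior.

```python
from importlib import metadata


def normalize_version(version):
    """Strip whitespace and a leading 'v' so 'v1.2.3' compares equal to '1.2.3'."""
    return version.strip().lstrip("vV")


def client_matches_server(server_version, client_version=None):
    """Return True when the client version equals the server version.

    When client_version is omitted, the installed xinference-client
    distribution is looked up; a missing installation counts as a mismatch.
    """
    if client_version is None:
        try:
            client_version = metadata.version("xinference-client")
        except metadata.PackageNotFoundError:
            return False
    return normalize_version(client_version) == normalize_version(server_version)
```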

Model Inference Engines#

Since v0.11.0, you must specify an inference engine before loading an LLM. Xinference currently supports the following inference engines:

vllm
sglang
llama.cpp
transformers
MLX

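For details about these inference engines, see here.

Note that when you load an LLM, the engines that can run it depend closely on the model_format and quantization parameters.

Xinference provides the xinference engine command to help you query the valid parameter combinations.

For example: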

I want to query the parameter combinations related to the qwen-chat model to determine how it can run on various inference engines.

xinference engine -e <xinference_endpoint> --model-name qwen-chat

I want to run qwen-chat on the vLLM inference engine, but I don't know which other parameters meet this requirement.

xinference engine -e <xinference_endpoint> --model-name qwen-chat --model-engine vllm

I want to load the qwen-chat model in GGUF format and need to know the remaining parameter combinations.

xinference engine -e <xinference_endpoint> --model-name qwen-chat -f ggufv2

In summary, unlike earlier versions, you must now also pass the model_engine parameter when loading an LLM. Use the xinference engine command to see how the engine you want to run relates to the other parameter combinations.

Note

Here are some suggestions on when to use which engine:

Linux
- When available, prefer **vLLM** or **SGLang**, since they offer better performance.
- If resources are limited, consider **llama.cpp**, which offers more quantization options.
- Otherwise, consider **Transformers**, which supports almost all models.
Windows
- WSL is recommended; in that case, choose as you would on Linux.
- Otherwise, **llama.cpp** is recommended; for models it does not support, use **Transformers**.
Mac
- If the model is supported, the **MLX** engine is recommended, as it delivers the best performance.
- Otherwise, **llama.cpp** is recommended; for models it does not support, use **Transformers**.
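The suggestions above can be written down as a small lookup. This is a hypothetical helper that only encodes the list as read here (the ordering for the resource-constrained case is an interpretation); it does not check whether a given model actually supports the engine, which is what `xinference engine` is for.

```python
def suggest_engines(os_name, use_wsl=False, resource_constrained=False):
    """Return engine names in rough preference order for a platform.

    Hypothetical helper: it mirrors the suggestions in this note and
    performs no model-compatibility check.
    """
    os_name = os_name.lower()
    if os_name == "linux" or (os_name == "windows" and use_wsl):
        if resource_constrained:
            # llama.cpp offers more quantization options.
            return ["llama.cpp", "transformers"]
        return ["vllm", "sglang", "llama.cpp", "transformers"]
    if os_name == "windows":
        return ["llama.cpp", "transformers"]
    if os_name in ("mac", "macos", "darwin"):
        return ["MLX", "llama.cpp", "transformers"]
    raise ValueError(f"unknown operating system: {os_name!r}")
```

For example, `suggest_engines("windows", use_wsl=True)` returns the same list as `suggest_engines("linux")`.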

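Run qwen2.5-instruct#

Let's run the built-in qwen2.5-instruct model. The first time you launch a model, Xinference downloads its parameters from HuggingFace, which typically takes 10 to 30 minutes depending on the model size. The downloaded files are cached locally by Xinference, so later launches of the same model do not download them again.

Note

Xinference can also download models from other model hosting platforms. To do so, set an environment variable when starting Xinference. For example, to download models from ModelScope, use the following command: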

XINFERENCE_MODEL_SRC=modelscope xinference-local --host 0.0.0.0 --port 9997

You can set the model's UID with the --model-uid or -u option. If you do not specify one, Xinference generates a UID for you; by default it is the same as the model name.

xinference launch --model-engine <inference_engine> -n qwen2.5-instruct -s 0_5 -f pytorch

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/models' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model_engine": "<inference_engine>",
  "model_name": "qwen2.5-instruct",
  "model_format": "pytorch",
  "size_in_billions": "0_5"
}'

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
model_uid = client.launch_model(
  model_engine="<inference_engine>",
  model_name="qwen2.5-instruct",
  model_format="pytorch",
  size_in_billions="0_5"
)
print('Model uid: ' + model_uid)

Model uid: qwen2.5-instruct
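If neither the CLI nor the SDK is available, the same launch request can be assembled with the Python standard library. The sketch below only builds the request shown in the cURL example (same path and JSON fields) and does not send it; the function name `build_launch_request` is hypothetical.

```python
import json
import urllib.request


def build_launch_request(base_url, model_engine, model_name, model_format, size_in_billions):
    """Build (without sending) the POST /v1/models request from the cURL example."""
    payload = {
        "model_engine": model_engine,
        "model_name": model_name,
        "model_format": model_format,
        "size_in_billions": size_in_billions,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/models",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )
```

With a running service, the request can be passed to `urllib.request.urlopen`; the response format is not covered here.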

Note

Some inference engines, such as vllm, require engine-specific parameters when a model is launched. In that case, pass the parameter names and values directly on the command line. For example:

xinference launch --model-engine vllm -n qwen2.5-instruct -s 0_5 -f pytorch --gpu_memory_utilization 0.9

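When the model is launched, `gpu_memory_utilization=0.9` is passed to the vllm backend.

Note

For more tips on launching models, see :ref:`launch`.

If you have made it this far, congratulations! You have successfully launched ``qwen2.5-instruct`` with Xinference. While the model is running, you can interact with it through the command line, cURL, or Python code: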

curl -X 'POST' \
  'http://127.0.0.1:9997/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "qwen2.5-instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ]
  }'

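from xinference.client import RESTfulClient

# Connect to the local Xinference service and look up the running model by its UID.
client = RESTfulClient("http://127.0.0.1:9997")
model = client.get_model("qwen2.5-instruct")

# Ask the same question as the cURL example and print the chat-completion response.
response = model.chat(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)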

{
  "id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "qwen2.5-instruct",
  "object": "chat.completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life."
      },
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
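A response of this shape can be unpacked with a couple of small helpers. This is a sketch based only on the sample above; the helper names are hypothetical, and reading `-1` as "not reported" is an assumption drawn from that sample rather than from the API documentation.

```python
def extract_reply(response):
    """Return the assistant's text from a chat-completion response dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


def token_usage(response):
    """Return the usage counts, mapping negative placeholders to None.

    Assumption: a negative count such as -1 means the value was not reported.
    """
    usage = response.get("usage", {})
    return {key: (value if value >= 0 else None) for key, value in usage.items()}
```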

Xinference provides an OpenAI-compatible API, so a model running on Xinference can serve as a local drop-in replacement for OpenAI. For example:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not used actually")

response = client.chat.completions.create(
    model="qwen2.5-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest animal?"}
    ]
)
print(response)

The following OpenAI APIs are supported:

Chat completions: https://platform.openai.com/docs/api-reference/chat
Completions: https://platform.openai.com/docs/api-reference/completions
Embeddings: https://platform.openai.com/docs/api-reference/embeddings

Xinference can also serve Anthropic API calls through the base URL ``http://127.0.0.1:9997/anthropic``, which lets you use Xinference in environments such as Claude Code. For details, see :ref:`anthropic client <anthropic_client>`.

Manage Models#

In addition to launching models, Xinference lets you manage a model's entire lifecycle. As before, you can use the command line, cURL, or Python code:

The following lists all models of a given type that Xinference supports:

xinference registrations -t LLM

curl http://127.0.0.1:9997/v1/model_registrations/LLM

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
print(client.list_model_registrations(model_type='LLM'))

The following commands list all running models:

xinference list

curl http://127.0.0.1:9997/v1/models

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
print(client.list_models())

When you no longer need a running model, you can stop it and release its resources in any of the following ways:

xinference terminate --model-uid "qwen2.5-instruct"

curl -X DELETE http://127.0.0.1:9997/v1/models/qwen2.5-instruct

from xinference.client import RESTfulClient
client = RESTfulClient("http://127.0.0.1:9997")
client.terminate_model(model_uid="qwen2.5-instruct")
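The three management operations above map onto plain HTTP requests, which the following sketch collects in one place. It only builds `urllib.request.Request` objects for the endpoints used in the cURL examples and sends nothing; the function name and the `action` labels are hypothetical.

```python
import urllib.request


def management_request(base_url, action, model_uid=None, model_type="LLM"):
    """Build (without sending) a request for one of the management endpoints."""
    base = base_url.rstrip("/")
    if action == "registrations":
        # Lists all supported models of the given type.
        return urllib.request.Request(f"{base}/v1/model_registrations/{model_type}", method="GET")
    if action == "list":
        # Lists all running models.
        return urllib.request.Request(f"{base}/v1/models", method="GET")
    if action == "terminate":
        if not model_uid:
            raise ValueError("terminate requires a model_uid")
        # Stops a running model and releases its resources.
        return urllib.request.Request(f"{base}/v1/models/{model_uid}", method="DELETE")
    raise ValueError(f"unknown action: {action!r}")
```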

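Deploy Xinference in a Cluster#

To deploy Xinference in a cluster, start a supervisor node on one machine, then start worker nodes on the same machine or on other machines.

First, make sure Xinference is installed on every server by following the :ref:`installation documentation <installation>`. Then follow the steps below:

Start the Supervisor#

Run the following command on the server to start the supervisor node: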

xinference-supervisor -H "${supervisor_host}"

Replace ``${supervisor_host}`` with the IP address of the current node.

The web UI is available at http://${supervisor_host}:9997/ui and the API documentation at http://${supervisor_host}:9997/docs.

Start the Workers#

On each machine that should run a Xinference worker, run the following command:

xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"

Note

Be sure to replace ``${worker_host}`` with the IP address of the current worker node.

Note

When you interact with the cluster from the command line, you must specify the supervisor's address with the -e or --endpoint option. For example:

xinference launch --model-engine <inference_engine> -n qwen2.5-instruct -s 0_5 -f pytorch -e "http://${supervisor_host}:9997"
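When scripting a cluster rollout, it can help to generate the supervisor and worker commands from one place so the endpoint stays consistent. The sketch below only assembles the argument lists shown in this section; the function names are hypothetical, the default port 9997 is taken from the URLs above, and the addresses in the usage note are placeholders.

```python
def supervisor_command(supervisor_host):
    """Argument list for starting the supervisor on its own host."""
    return ["xinference-supervisor", "-H", supervisor_host]


def worker_command(supervisor_host, worker_host, port=9997):
    """Argument list for starting a worker that registers with the supervisor."""
    endpoint = f"http://{supervisor_host}:{port}"
    return ["xinference-worker", "-e", endpoint, "-H", worker_host]
```

For instance, `worker_command("192.0.2.10", "192.0.2.11")` yields a list that could be handed to `subprocess.run` on the worker machine.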

Deploy Xinference with Docker#

Run Xinference in a container with the following commands:

Run on a Machine with NVIDIA GPUs#

For CUDA 12.4:

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug

For CUDA 12.8:

Added in version v1.8.1: The CUDA 12.8 version is experimental; feedback for improvement is welcome.

Changed in version v1.16.0: The CUDA 12.8 version was removed in v1.16.0.

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version>-cu128 xinference-local -H 0.0.0.0 --log-level debug

For CUDA 12.9:

Added in version v1.16.0: Once Xinference v2.0.0 is released, CUDA 12.9 becomes the default version.

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version>-cu129 xinference-local -H 0.0.0.0 --log-level debug

Run on a CPU-Only Machine#

docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug

Replace ``<your_version>`` with the Xinference version, for example ``v0.10.3``. To get the most recent version, use ``latest``.
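The image tags in this section follow one pattern: the version, optionally followed by a variant suffix. The sketch below composes the `docker run` argument list from that pattern. The function names are hypothetical, the variant set is limited to the suffixes that appear above, and whether a particular version/variant combination is actually published is not verified here.

```python
KNOWN_VARIANTS = (None, "cu128", "cu129", "cpu")


def image_tag(version, variant=None):
    """Return the tag '<version>' or '<version>-<variant>'."""
    if variant not in KNOWN_VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return version if variant is None else f"{version}-{variant}"


def docker_run_command(version, variant=None, host_port=9998):
    """Assemble the docker run argument list used in this section."""
    cmd = ["docker", "run", "-e", "XINFERENCE_MODEL_SRC=modelscope",
           "-p", f"{host_port}:9997"]
    if variant != "cpu":
        # GPU images are started with access to all GPUs.
        cmd += ["--gpus", "all"]
    cmd += [f"xprobe/xinference:{image_tag(version, variant)}",
            "xinference-local", "-H", "0.0.0.0", "--log-level", "debug"]
    return cmd
```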

For more on using Docker, see :ref:`Using the Docker image <using_docker_image>`.

What's Next?#

Congratulations! You now know the basics of using Xinference. To help you get more out of it, here are some additional documents and guides: