Catalog

Overview

The Deepseek-R1 1.5B Llama.cpp on NVIDIA Jetson™ delivers a plug-and-play AI runtime for NVIDIA Jetson™ devices, featuring the DeepSeek R1 1.5B model served locally using llama-cpp-python (LlamaCPP python binding).

This container is optimized for offline, edge AI applications and includes:

  • On-device LLM inference using DeepSeek R1 1.5B via llama-cpp-python—no internet needed after setup
  • Support for GGUF-quantized models (e.g., Q4_0, Q6_K) for optimal performance on resource-constrained Jetson devices
  • FastAPI middleware for serving REST endpoints and building modular AI workflows
  • Streaming chat UI via OpenWebUI
  • OpenAI-compatible API endpoints for seamless integration
  • Customizable model parameters via API payload
  • Simplified integration using LlamaCpp Python binding for developers building AI pipelines in Python

Container Demo


Use Cases

  • Private LLM Inference on Local Devices: Run large language models locally with no internet requirement—ideal for privacy-critical environments
  • Lightweight Backend for LLM APIs: Use LlamaCpp to expose models via its local API for fast integration with tools like LangChain, FastAPI, or custom UIs.
  • Document-Based Q&A Systems: Combine LlamaCpp with a vector database to create offline RAG (Retrieval-Augmented Generation) systems for querying internal documents or manuals.
  • Multilingual Assistants: Deploy multilingual chatbots using local models that can translate, summarize, or interact in different languages without depending on cloud services.
  • LLM Evaluation and Benchmarking Easily swap and test different quantized models (e.g., Mistral, LLaMA, DeepSeek) to compare performance, output quality, and memory usage across devices.
  • Custom Offline Agents: Use LlamaCpp as the reasoning core of intelligent agents that interact with other local tools (e.g., databases, APIs, sensors)—especially powerful when paired with LangChain
  • Edge AI for Industrial Use: Deploy LlamaCpp on Edge to enable intelligent interfaces, command parsing, or decision-support tools at the edge.

Key Features

  • LlamaCPP Engine: High-performance C++ backend optimized for fast, quantized large language model (LLM) inference on edge devices. Supports GGUF models and utilizes CPU/GPU acceleration
  • Python Bindings: Integrated via llama-cpp-python, a lightweight Python wrapper that provides seamless access to LlamaCPP’s capabilities through Python and REST APIs—ideal for building custom applications, pipelines, or microservices
  • Quantized Model Support: Compatible with GGUF quantized models (e.g., Q4_0, Q5_K_M, Q6_K), enabling efficient inference with reduced memory and compute footprint on Jetson-class hardware
  • Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT™
  • Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
  • Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
  • Performance Optimized: Tuned specifically for NVIDIA® Jetson Orin™ NX 8GB

Host Device Prerequisites

Item Specification
Compatible Hardware Advantech devices accelerated by NVIDIA Jetson™—refer to Compatible hardware
NVIDIA Jetson™ Version 5.x
Host OS Ubuntu 20.04
Required Software Packages Refer to Below
Software Installation NVIDIA Jetson™ Software Package Installation

Container Environment Overview

Software Components on Container Image

Component Version Description
CUDA® 11.4.315 GPU computing platform
cuDNN 8.6.0 Deep Neural Network library
TensorRT™ 8.5.2.2 Inference optimizer and runtime
PyTorch 2.0.0+nv23.02 Deep learning framework
TensorFlow 2.12.0 Machine learning framework
ONNX Runtime 1.16.3 Cross-platform inference engine
OpenCV 4.5.0 Computer vision library with CUDA®
GStreamer 1.16.2 Multimedia framework
FastAPI 0.115.12 API service exposing LangChain interface
OpenWebUI 0.6.5 Web interface for chat interactions
LlamaCpp 0.2.0 LLM inference engine
LlamaCpp-Python 0.3.9 Python wrapper for LlamaCPP

Quick Start Guide

For container quick start, including the docker-compose file and more, please refer to README.


Supported AI Capabilities

Language Models Recommendation

Model Family Parameters Quantization Size Performance
DeepSeek R1 1.5 B Q4_K_M 1.1 GB ~15-17 tokens/sec
DeepSeek R1 7 B Q4_K_M 4.7 GB ~5-7 tokens/sec
DeepSeek Coder 1.3 B Q4_0 776 MB ~20-25 tokens/sec
Llama 3.2 1 B Q8_0 1.3 GB ~17-20 tokens/sec
Llama 3.2 Instruct 1 B Q4_0 ~0.8 GB ~17-20 tokens/sec
Llama 3.2 3 B Q4_K_M 2 GB ~10-12 tokens/sec
Llama 2 7 B Q4_0 3.8 GB ~5-7 tokens/sec
Tinyllama 1.1 B Q4_0 637 MB ~22-27 tokens/sec
Qwen 2.5 0.5 B Q4_K_M 398 MB ~25-30 tokens/sec
Qwen 2.5 1.5 B Q4_K_M 986 MB ~15-17 tokens/sec
Qwen 2.5 Coder 0.5 B Q8_0 531 MB ~25-30 tokens/sec
Qwen 2.5 Coder 1.5 B Q4_K_M 986 MB ~15-17 tokens/sec
Qwen 0.5 B Q4_0 395 MB ~25-30 tokens/sec
Qwen 1.8 B Q4_0 1.1 GB ~15-20 tokens/sec
Gemma 2 2 B Q4_0 1.6 GB ~10-12 tokens/sec
Mistral 7 B Q4_0 4.1 GB ~5-7 tokens/sec

Best Practices and Recommendations

  • Ensure models are fully loaded into GPU memory for best results.
  • Use quantized GGUF models for the best performance & accuracy.
  • Batch inference for better throughput
  • Use stream processing for continuous data
  • Enable Jetson™ Clocks for better inference speed
  • Increase swap size if models loaded are large
  • Use lesser context & batch size to avoid high memory utilization
  • Set max-tokens in API payloads to avoid unnecessarily long response generations, which may affect memory utilization.
  • It is recommended to use models with parameters <2B and Q4 quantization.

Hardware Acceleration Support

Accelerator Support Level Compatible Libraries Notes
CUDA® Full PyTorch, TensorFlow, OpenCV, ONNX Runtime Primary acceleration method
TensorRT™ Full ONNX, TensorFlow, PyTorch (via export) Recommended for inference optimization
cuDNN Full PyTorch, TensorFlow Accelerates deep learning primitives
NVDEC Full GStreamer, FFmpeg Hardware video decoding
NVENC Full GStreamer, FFmpeg Hardware video encoding
DLA Partial TensorRT™ Requires specific model optimization

Copyright © Advantech Corporation. All rights reserved.