Overview
The Deepseek-R1 1.5B Llama.cpp on NVIDIA Jetson™ delivers a plug-and-play AI runtime for NVIDIA Jetson™ devices, featuring the DeepSeek R1 1.5B model served locally using llama-cpp-python (LlamaCPP python binding).
This container is optimized for offline, edge AI applications and includes:
- On-device LLM inference using DeepSeek R1 1.5B via llama-cpp-python—no internet needed after setup
- Support for GGUF-quantized models (e.g., Q4_0, Q6_K) for optimal performance on resource-constrained Jetson devices
- FastAPI middleware for serving REST endpoints and building modular AI workflows
- Streaming chat UI via OpenWebUI
- OpenAI-compatible API endpoints for seamless integration
- Customizable model parameters via API payload
- Simplified integration using LlamaCpp Python binding for developers building AI pipelines in Python
Container Demo

Use Cases
- Private LLM Inference on Local Devices: Run large language models locally with no internet requirement—ideal for privacy-critical environments
- Lightweight Backend for LLM APIs: Use LlamaCpp to expose models via its local API for fast integration with tools like LangChain, FastAPI, or custom UIs.
- Document-Based Q&A Systems: Combine LlamaCpp with a vector database to create offline RAG (Retrieval-Augmented Generation) systems for querying internal documents or manuals.
- Multilingual Assistants: Deploy multilingual chatbots using local models that can translate, summarize, or interact in different languages without depending on cloud services.
- LLM Evaluation and Benchmarking Easily swap and test different quantized models (e.g., Mistral, LLaMA, DeepSeek) to compare performance, output quality, and memory usage across devices.
- Custom Offline Agents: Use LlamaCpp as the reasoning core of intelligent agents that interact with other local tools (e.g., databases, APIs, sensors)—especially powerful when paired with LangChain
- Edge AI for Industrial Use: Deploy LlamaCpp on Edge to enable intelligent interfaces, command parsing, or decision-support tools at the edge.
Key Features
- LlamaCPP Engine: High-performance C++ backend optimized for fast, quantized large language model (LLM) inference on edge devices. Supports GGUF models and utilizes CPU/GPU acceleration
- Python Bindings: Integrated via llama-cpp-python, a lightweight Python wrapper that provides seamless access to LlamaCPP’s capabilities through Python and REST APIs—ideal for building custom applications, pipelines, or microservices
- Quantized Model Support: Compatible with GGUF quantized models (e.g., Q4_0, Q5_K_M, Q6_K), enabling efficient inference with reduced memory and compute footprint on Jetson-class hardware
- Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT™
- Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
- Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
- Performance Optimized: Tuned specifically for NVIDIA® Jetson Orin™ NX 8GB
Host Device Prerequisites
| Item | Specification |
|---|---|
| Compatible Hardware | Advantech devices accelerated by NVIDIA Jetson™—refer to Compatible hardware |
| NVIDIA Jetson™ Version | 5.x |
| Host OS | Ubuntu 20.04 |
| Required Software Packages | Refer to Below |
| Software Installation | NVIDIA Jetson™ Software Package Installation |
Container Environment Overview
Software Components on Container Image
| Component | Version | Description |
|---|---|---|
| CUDA® | 11.4.315 | GPU computing platform |
| cuDNN | 8.6.0 | Deep Neural Network library |
| TensorRT™ | 8.5.2.2 | Inference optimizer and runtime |
| PyTorch | 2.0.0+nv23.02 | Deep learning framework |
| TensorFlow | 2.12.0 | Machine learning framework |
| ONNX Runtime | 1.16.3 | Cross-platform inference engine |
| OpenCV | 4.5.0 | Computer vision library with CUDA® |
| GStreamer | 1.16.2 | Multimedia framework |
| FastAPI | 0.115.12 | API service exposing LangChain interface |
| OpenWebUI | 0.6.5 | Web interface for chat interactions |
| LlamaCpp | 0.2.0 | LLM inference engine |
| LlamaCpp-Python | 0.3.9 | Python wrapper for LlamaCPP |
Quick Start Guide
For container quick start, including the docker-compose file and more, please refer to README.
Supported AI Capabilities
Language Models Recommendation
| Model Family | Parameters | Quantization | Size | Performance |
|---|---|---|---|---|
| DeepSeek R1 | 1.5 B | Q4_K_M | 1.1 GB | ~15-17 tokens/sec |
| DeepSeek R1 | 7 B | Q4_K_M | 4.7 GB | ~5-7 tokens/sec |
| DeepSeek Coder | 1.3 B | Q4_0 | 776 MB | ~20-25 tokens/sec |
| Llama 3.2 | 1 B | Q8_0 | 1.3 GB | ~17-20 tokens/sec |
| Llama 3.2 Instruct | 1 B | Q4_0 | ~0.8 GB | ~17-20 tokens/sec |
| Llama 3.2 | 3 B | Q4_K_M | 2 GB | ~10-12 tokens/sec |
| Llama 2 | 7 B | Q4_0 | 3.8 GB | ~5-7 tokens/sec |
| Tinyllama | 1.1 B | Q4_0 | 637 MB | ~22-27 tokens/sec |
| Qwen 2.5 | 0.5 B | Q4_K_M | 398 MB | ~25-30 tokens/sec |
| Qwen 2.5 | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen 2.5 Coder | 0.5 B | Q8_0 | 531 MB | ~25-30 tokens/sec |
| Qwen 2.5 Coder | 1.5 B | Q4_K_M | 986 MB | ~15-17 tokens/sec |
| Qwen | 0.5 B | Q4_0 | 395 MB | ~25-30 tokens/sec |
| Qwen | 1.8 B | Q4_0 | 1.1 GB | ~15-20 tokens/sec |
| Gemma 2 | 2 B | Q4_0 | 1.6 GB | ~10-12 tokens/sec |
| Mistral | 7 B | Q4_0 | 4.1 GB | ~5-7 tokens/sec |
Best Practices and Recommendations
- Ensure models are fully loaded into GPU memory for best results.
- Use quantized GGUF models for the best performance & accuracy.
- Batch inference for better throughput
- Use stream processing for continuous data
- Enable Jetson™ Clocks for better inference speed
- Increase swap size if models loaded are large
- Use lesser context & batch size to avoid high memory utilization
- Set max-tokens in API payloads to avoid unnecessarily long response generations, which may affect memory utilization.
- It is recommended to use models with parameters <2B and Q4 quantization.
Hardware Acceleration Support
| Accelerator | Support Level | Compatible Libraries | Notes |
|---|---|---|---|
| CUDA® | Full | PyTorch, TensorFlow, OpenCV, ONNX Runtime | Primary acceleration method |
| TensorRT™ | Full | ONNX, TensorFlow, PyTorch (via export) | Recommended for inference optimization |
| cuDNN | Full | PyTorch, TensorFlow | Accelerates deep learning primitives |
| NVDEC | Full | GStreamer, FFmpeg | Hardware video decoding |
| NVENC | Full | GStreamer, FFmpeg | Hardware video encoding |
| DLA | Partial | TensorRT™ | Requires specific model optimization |
Copyright © Advantech Corporation. All rights reserved.