Overview

The Deepseek-R1 1.5B Llama.cpp on NVIDIA Jetson™ delivers a plug-and-play AI runtime for NVIDIA Jetson™ devices, featuring the DeepSeek R1 1.5B model served locally using llama-cpp-python (LlamaCPP python binding).

This container is optimized for offline, edge AI applications and includes:

On-device LLM inference using DeepSeek R1 1.5B via llama-cpp-python—no internet needed after setup
Support for GGUF-quantized models (e.g., Q4_0, Q6_K) for optimal performance on resource-constrained Jetson devices
FastAPI middleware for serving REST endpoints and building modular AI workflows
Streaming chat UI via OpenWebUI
OpenAI-compatible API endpoints for seamless integration
Customizable model parameters via API payload
Simplified integration using LlamaCpp Python binding for developers building AI pipelines in Python

Container Demo

Use Cases

Private LLM Inference on Local Devices: Run large language models locally with no internet requirement—ideal for privacy-critical environments
Lightweight Backend for LLM APIs: Use LlamaCpp to expose models via its local API for fast integration with tools like LangChain, FastAPI, or custom UIs.
Document-Based Q&A Systems: Combine LlamaCpp with a vector database to create offline RAG (Retrieval-Augmented Generation) systems for querying internal documents or manuals.
Multilingual Assistants: Deploy multilingual chatbots using local models that can translate, summarize, or interact in different languages without depending on cloud services.
LLM Evaluation and Benchmarking Easily swap and test different quantized models (e.g., Mistral, LLaMA, DeepSeek) to compare performance, output quality, and memory usage across devices.
Custom Offline Agents: Use LlamaCpp as the reasoning core of intelligent agents that interact with other local tools (e.g., databases, APIs, sensors)—especially powerful when paired with LangChain
Edge AI for Industrial Use: Deploy LlamaCpp on Edge to enable intelligent interfaces, command parsing, or decision-support tools at the edge.

Key Features

LlamaCPP Engine: High-performance C++ backend optimized for fast, quantized large language model (LLM) inference on edge devices. Supports GGUF models and utilizes CPU/GPU acceleration
Python Bindings: Integrated via llama-cpp-python, a lightweight Python wrapper that provides seamless access to LlamaCPP’s capabilities through Python and REST APIs—ideal for building custom applications, pipelines, or microservices
Quantized Model Support: Compatible with GGUF quantized models (e.g., Q4_0, Q5_K_M, Q6_K), enabling efficient inference with reduced memory and compute footprint on Jetson-class hardware
Complete AI Framework Stack: PyTorch, TensorFlow, ONNX Runtime, and TensorRT™
Industrial Vision Support: Accelerated OpenCV and GStreamer pipelines
Edge AI Capabilities: Support for computer vision, LLMs, and time-series analysis
Performance Optimized: Tuned specifically for NVIDIA® Jetson Orin™ NX 8GB

Host Device Prerequisites

Item	Specification
Compatible Hardware	Advantech devices accelerated by NVIDIA Jetson™—refer to Compatible hardware
NVIDIA Jetson™ Version	5.x
Host OS	Ubuntu 20.04
Required Software Packages	Refer to Below
Software Installation	NVIDIA Jetson™ Software Package Installation

Container Environment Overview

Software Components on Container Image

Component	Version	Description
CUDA®	11.4.315	GPU computing platform
cuDNN	8.6.0	Deep Neural Network library
TensorRT™	8.5.2.2	Inference optimizer and runtime
PyTorch	2.0.0+nv23.02	Deep learning framework
TensorFlow	2.12.0	Machine learning framework
ONNX Runtime	1.16.3	Cross-platform inference engine
OpenCV	4.5.0	Computer vision library with CUDA®
GStreamer	1.16.2	Multimedia framework
FastAPI	0.115.12	API service exposing LangChain interface
OpenWebUI	0.6.5	Web interface for chat interactions
LlamaCpp	0.2.0	LLM inference engine
LlamaCpp-Python	0.3.9	Python wrapper for LlamaCPP

Quick Start Guide

For container quick start, including the docker-compose file and more, please refer to README.

Supported AI Capabilities

Language Models Recommendation

Model Family	Parameters	Quantization	Size	Performance
DeepSeek R1	1.5 B	Q4_K_M	1.1 GB	~15-17 tokens/sec
DeepSeek R1	7 B	Q4_K_M	4.7 GB	~5-7 tokens/sec
DeepSeek Coder	1.3 B	Q4_0	776 MB	~20-25 tokens/sec
Llama 3.2	1 B	Q8_0	1.3 GB	~17-20 tokens/sec
Llama 3.2 Instruct	1 B	Q4_0	~0.8 GB	~17-20 tokens/sec
Llama 3.2	3 B	Q4_K_M	2 GB	~10-12 tokens/sec
Llama 2	7 B	Q4_0	3.8 GB	~5-7 tokens/sec
Tinyllama	1.1 B	Q4_0	637 MB	~22-27 tokens/sec
Qwen 2.5	0.5 B	Q4_K_M	398 MB	~25-30 tokens/sec
Qwen 2.5	1.5 B	Q4_K_M	986 MB	~15-17 tokens/sec
Qwen 2.5 Coder	0.5 B	Q8_0	531 MB	~25-30 tokens/sec
Qwen 2.5 Coder	1.5 B	Q4_K_M	986 MB	~15-17 tokens/sec
Qwen	0.5 B	Q4_0	395 MB	~25-30 tokens/sec
Qwen	1.8 B	Q4_0	1.1 GB	~15-20 tokens/sec
Gemma 2	2 B	Q4_0	1.6 GB	~10-12 tokens/sec
Mistral	7 B	Q4_0	4.1 GB	~5-7 tokens/sec

Best Practices and Recommendations

Ensure models are fully loaded into GPU memory for best results.
Use quantized GGUF models for the best performance & accuracy.
Batch inference for better throughput
Use stream processing for continuous data
Enable Jetson™ Clocks for better inference speed
Increase swap size if models loaded are large
Use lesser context & batch size to avoid high memory utilization
Set max-tokens in API payloads to avoid unnecessarily long response generations, which may affect memory utilization.
It is recommended to use models with parameters <2B and Q4 quantization.

Hardware Acceleration Support

Accelerator	Support Level	Compatible Libraries	Notes
CUDA®	Full	PyTorch, TensorFlow, OpenCV, ONNX Runtime	Primary acceleration method
TensorRT™	Full	ONNX, TensorFlow, PyTorch (via export)	Recommended for inference optimization
cuDNN	Full	PyTorch, TensorFlow	Accelerates deep learning primitives
NVDEC	Full	GStreamer, FFmpeg	Hardware video decoding
NVENC	Full	GStreamer, FFmpeg	Hardware video encoding
DLA	Partial	TensorRT™	Requires specific model optimization