| # CLIP Inference with AMD Ryzen AI |
|
|
| This repository contains a Python script that runs CLIP (Contrastive Language-Image Pre-training) inference on the CIFAR-100 dataset using either the AMD Ryzen AI NPU or the CPU. This version targets Ryzen AI (RAI) 1.5. The script demonstrates the zero-shot image classification capabilities of the CLIP model. |
|
|
| ### Installation Instructions |
|
|
| You must have the RAI 1.5 environment set up before running the script. Follow the [Ryzen AI Installation Guide](https://ryzenai.docs.amd.com/en/latest/inst.html) to prepare your environment. |
|
|
| 1. Activate your conda environment: |
| ```bash |
| conda activate ryzen-ai-1.5.0 |
| ``` |
|
|
| 2. Unzip both cache archives (one for the vision model, one for the text model). Make sure the extracted directories are in the same location as the inference script. |
|
|
| 3. Install the required Python packages: |
| ```bash |
| pip install -r requirements.txt |
| ``` |
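
Step 2 can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module. It assumes the cache archives are `.zip` files placed next to `clip_inference.py`; because the archive names are not given here, the hypothetical helper `extract_cache_archives` simply extracts every `.zip` file it finds in that directory.

```python
import zipfile
from pathlib import Path


def extract_cache_archives(script_dir="."):
    """Extract every .zip archive found in script_dir into script_dir.

    Returns the file names of the archives that were extracted.
    (Hypothetical helper; not part of clip_inference.py.)
    """
    script_dir = Path(script_dir)
    extracted = []
    for archive in sorted(script_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(script_dir)
        extracted.append(archive.name)
    return extracted


# Example usage, run from the directory that holds the archives:
#   for name in extract_cache_archives():
#       print(f"Extracted {name}")
```
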
|
|
| ### Required Files |
|
|
| Ensure the following files are present in the same directory as `clip_inference.py`: |
|
|
| #### ONNX Model Files |
| - `clip_text_model.onnx` - ONNX text encoder model |
| - `clip_vision_model.onnx` - ONNX vision encoder model |
|
|
| #### Configuration Files (for NPU execution) |
| - `vitisai_config.json` - VitisAI configuration |
|
|
| #### Model Cache Directories |
| - `clip_text_model_cache/` - Cached text model artifacts |
| - `clip_vision_model_cache/` - Cached vision model artifacts |
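
A quick pre-flight check can catch a missing file before the script starts. This is a minimal sketch, not part of `clip_inference.py`: the file and directory names come from the lists above, and the helper name `find_missing` is hypothetical. It checks everything listed, including the NPU-only configuration file.

```python
from pathlib import Path

# Names taken from the "Required Files" lists above.
REQUIRED_FILES = [
    "clip_text_model.onnx",
    "clip_vision_model.onnx",
    "vitisai_config.json",
]
REQUIRED_DIRS = [
    "clip_text_model_cache",
    "clip_vision_model_cache",
]


def find_missing(script_dir="."):
    """Return the required files and directories absent from script_dir."""
    base = Path(script_dir)
    missing = [name for name in REQUIRED_FILES if not (base / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (base / name).is_dir()]
    return missing


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required files are present.")
```
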
|
|
| ### Cache Directory Structure |
|
|
| The cache directories contain pre-compiled model artifacts and optimization files. They eliminate the need to compile the models at startup, which can be time-consuming. |
|
|
| CLIP uses two models (a text encoder and a vision encoder), so two caches are provided, each as a zip file. |
|
|
| #### Cache Directory Descriptions |
|
|
| - **Root Level Files**: Contain compilation metadata, graph analysis, and performance summaries |
| - **`cache/`**: Hash-based cache storage for model artifacts |
| - **`vaiml_par_0/`**: Contains compiled model artifacts, MLIR representations, and native libraries |
| - **`vaiml_partition_fe.flexml/`**: Contains optimized ONNX models and visualization files |
|
|
| **Note**: These cache directories are automatically generated during the first NPU compilation and significantly reduce subsequent startup times. |
|
|
| ## Usage |
|
|
| ### Command Line Interface |
|
|
| ```bash |
| python clip_inference.py [-h] (--npu | --cpu) [--num_images NUM_IMAGES] |
| ``` |
|
|
| ### Arguments |
|
|
| **Required (mutually exclusive):** |
| - `--cpu`: Run inference on CPU using CPUExecutionProvider |
| - `--npu`: Run inference on NPU using VitisAIExecutionProvider |
|
|
| **Optional:** |
| - `--num_images`: Number of images to process from CIFAR-100 test set (default: 50, max: 10,000) |
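
For readers who want to see how such an interface can be declared, here is a minimal `argparse` sketch that reproduces the usage line above. It is an illustration, not the script's actual source: the function names `build_parser` and `parse_args` are hypothetical, and the lower bound of 1 image is an assumption (only the default of 50 and the maximum of 10,000 are stated above).

```python
import argparse

MAX_IMAGES = 10_000  # maximum stated above (size of the CIFAR-100 test set)


def build_parser():
    parser = argparse.ArgumentParser(
        prog="clip_inference.py",
        description="CLIP zero-shot inference on CIFAR-100",
    )
    # Exactly one of --npu / --cpu must be given.
    device = parser.add_mutually_exclusive_group(required=True)
    device.add_argument("--npu", action="store_true",
                        help="run on NPU using VitisAIExecutionProvider")
    device.add_argument("--cpu", action="store_true",
                        help="run on CPU using CPUExecutionProvider")
    parser.add_argument("--num_images", type=int, default=50,
                        help="number of CIFAR-100 test images to process")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed validation: at least 1 image, at most MAX_IMAGES.
    if not 1 <= args.num_images <= MAX_IMAGES:
        parser.error(f"--num_images must be between 1 and {MAX_IMAGES}")
    return args
```

Calling `parse_args(["--npu", "--num_images", "100"])` yields a namespace with `npu=True`, `cpu=False`, and `num_images=100`; omitting both device flags makes `argparse` exit with a usage error.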
|
|
| ### Examples |
|
|
| 1. **CPU inference with default settings (50 images):** |
| ```bash |
| python clip_inference.py --cpu |
| ``` |
|
|
| 2. **NPU inference with 100 images:** |
| ```bash |
| python clip_inference.py --npu --num_images 100 |
| ``` |
|
|
| 3. **NPU inference on complete test dataset:** |
| ```bash |
| python clip_inference.py --npu --num_images 10000 |
| ``` |
|
|
| ## How It Works |
|
|
| ### Model Architecture |
| - **Text Encoder**: Processes text descriptions ("a photo of a {class_name}") |
| - **Vision Encoder**: Processes CIFAR-100 images (32x32 RGB) |
| - **Classification**: Computes similarity between image and text embeddings |
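
The text encoder receives one prompt per class, built from the template quoted above. A minimal sketch of that step follows; the helper name `build_prompts` is hypothetical, and replacing underscores with spaces (CIFAR-100 labels such as `aquarium_fish` contain them) is an assumption about preprocessing rather than something this README states.

```python
PROMPT_TEMPLATE = "a photo of a {class_name}"


def build_prompts(class_names):
    """Turn class labels into text prompts for the text encoder."""
    # Assumption: underscores in label names are shown as spaces.
    return [
        PROMPT_TEMPLATE.format(class_name=name.replace("_", " "))
        for name in class_names
    ]


# A few CIFAR-100 labels as an example.
prompts = build_prompts(["apple", "aquarium_fish", "baby"])
print(prompts[1])  # a photo of a aquarium fish
```
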
| |
| ### Inference Pipeline |
| 1. **Text Processing**: Pre-compute text features for all 100 CIFAR-100 class labels |
| 2. **Image Processing**: Process each image through the vision encoder |
| 3. **Classification**: Compute cosine similarity between image and text features |
| 4. **Prediction**: Select the class with highest similarity score |
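
Steps 3 and 4 can be sketched as follows. This is a dependency-free illustration using plain Python lists and toy 3-dimensional vectors in place of real encoder outputs; the actual script most likely operates on NumPy arrays returned by ONNX Runtime, and the function names here are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify(image_feature, text_features):
    """Return (index of best class, list of cosine similarities)."""
    img = l2_normalize(image_feature)
    sims = []
    for text_feature in text_features:
        txt = l2_normalize(text_feature)
        # Dot product of unit vectors equals cosine similarity.
        sims.append(sum(a * b for a, b in zip(img, txt)))
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Toy embeddings standing in for pre-computed text features (one per class)
# and a single image feature.
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_feature = [0.1, 0.9, 0.2]

best, sims = classify(image_feature, text_features)
print(best)  # 1: the image vector points mostly along the second class
```

Because the text features are the same for every image, they only need to be computed once, which is why the pipeline pre-computes them in step 1.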
| |
| ### Performance Optimization |
| - **NPU Acceleration**: Leverages AMD Ryzen AI NPU for faster inference |
| - **Caching**: Uses pre-compiled model caches for reduced startup time |
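
The sketch below shows one way the execution provider and cache settings might be chosen for each model. It only builds the arguments and does not import ONNX Runtime. The option keys `config_file`, `cache_dir`, and `cache_key` are assumptions based on the Vitis AI execution provider's documented options, and mapping `cache_key` to the `<model>_cache` directory names listed earlier is also an assumption; check the Ryzen AI documentation for your release. The helper name `provider_settings` is hypothetical.

```python
def provider_settings(use_npu, model_name):
    """Return (providers, provider_options) for an ONNX Runtime session.

    model_name is the ONNX file name without its extension,
    for example "clip_text_model".
    """
    if use_npu:
        providers = ["VitisAIExecutionProvider"]
        provider_options = [{
            "config_file": "vitisai_config.json",  # assumed option key
            "cache_dir": ".",                      # assumed: caches sit next to the script
            "cache_key": f"{model_name}_cache",    # assumed: matches the cache directory name
        }]
    else:
        providers = ["CPUExecutionProvider"]
        provider_options = [{}]
    return providers, provider_options


# Intended use (requires onnxruntime with the Vitis AI execution provider):
#   import onnxruntime as ort
#   providers, options = provider_settings(True, "clip_text_model")
#   session = ort.InferenceSession(
#       "clip_text_model.onnx",
#       providers=providers,
#       provider_options=options,
#   )
```
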
| |
| ## Output Metrics |
| |
| The script reports the following performance metrics: |
| |
| - **Text Latency**: Average time per text inference (ms) |
| - **Text Throughput**: Text inferences per second (inf/s) |
| - **Vision Latency**: Average time per image inference (ms) |
| - **Vision Throughput**: Image inferences per second (inf/s) |
| - **Classification Accuracy**: Percentage of correctly classified images |
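
As an illustration of how these metrics relate to each other, the sketch below derives them from a list of per-inference timings. It assumes throughput is the reciprocal of the average latency (for example, a 73.46 ms latency corresponds to about 13.61 inf/s); how the script itself measures time is not shown here, and the function names are hypothetical.

```python
def summarize_latencies(latencies_s):
    """Return (average latency in ms, throughput in inferences per second).

    latencies_s holds one wall-clock duration in seconds per inference,
    e.g. measured with time.perf_counter() around each session run.
    """
    avg_ms = 1000.0 * sum(latencies_s) / len(latencies_s)
    throughput = 1000.0 / avg_ms  # assumed: reciprocal of average latency
    return avg_ms, throughput


def accuracy_percent(predictions, labels):
    """Percentage of predictions that match the ground-truth labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)
```
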
| |
| ### Example Output |
| |
| **NPU Execution (50 images):** |
| ``` |
| Compilation Done |
| Session on NPU |
| |
| Processing images... |
| Image inference: 100%|████████████████████████████████████████| 50/50 [00:03<00:00, 13.45it/s] |
| |
| Results: |
| Text latency: 26.65 ms |
| Text throughput: 37.52 inf/s |
| Vision latency: 73.46 ms |
| Vision throughput: 13.61 inf/s |
| Classification accuracy: 77.55% |
| ``` |
| |
| |