If you're serving LLMs on NVIDIA GPUs in production and need maximum performance, TensorRT-LLM is the tool to reach for. It can deliver speedups of up to 100x over plain PyTorch inference on models such as Llama 3, through aggressive optimizations including in-flight batching, FP8 quantization, and multi-GPU scaling. The tradeoff is complexity: you have to deal with model compilation, CUDA dependencies, and TensorRT's learning curve. But if you're running A100s or H100s and need to push throughput toward 24,000 tokens per second while keeping latency low, the performance gains justify the setup cost. For simpler deployments or non-NVIDIA hardware, vLLM or llama.cpp are easier choices.
npx skills add https://github.com/orchestra-research/ai-research-skills --skill tensorrt-llm