Milestone Systems has introduced a new Vision Language Model (VLM) specifically designed to understand traffic scenarios. The solution is based on Nvidia Cosmos Reason and addresses key challenges in modern video security: enormous amounts of data, manual evaluation and time-consuming analysis processes.
The new VLM forms the technological basis for two new products: Video Summarisation for XProtect Video Management Software and VLM as a Service (VLMaaS) for third-party providers and developers.
Video Summarisation for XProtect: Quick insights instead of hours of viewing
Modern video systems capture huge volumes of data every day. Manually reviewing video material ties up resources and delays decisions. With Video Summarisation for XProtect, Milestone uses generative AI to automatically summarise video content, identify relevant events and automate reports.
Users can search visual data in the form of structured summaries, enabling them to gain actionable insights more quickly. The tool can be installed directly in XProtect Smart Client in just a few minutes and is available to download free of charge. Users are charged only per VLM query, which keeps the barrier to entry for AI-supported video analysis low.
VLM as a Service: Production-ready video AI via API
With Hafnia VLM as a Service, Milestone is opening up its AI technology to developers, system integrators and technology partners. An API gives them access to a production-ready vision language model without having to build their own AI infrastructure or train models at great expense.
VLMaaS makes it possible to quickly add generative video intelligence to existing applications, regardless of how much video analysis they already perform. This significantly accelerates the development of new solutions: according to Milestone, the effort required is reduced by a factor of up to 70 compared to fine-tuning your own VLM.
Key features include:
- High-precision, traffic-optimised vision language model based on Nvidia Cosmos Reason
- Prompt-based control for traffic-related analysis tasks
- API-first approach with easy integration via HTTPS
- Fine-tuned models for US and EU markets, with other regions to follow
- Flexible use as a standalone solution or integrated into the Milestone portfolio
- 100% responsibly sourced training data with traceable data origin, compliant with GDPR and EU AI Act
The pricing model is pay-per-use and based on API calls, with no high upfront investment or individual training costs.
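To illustrate what a prompt-based, API-first integration over HTTPS could look like, here is a minimal sketch in Python. The announcement does not publish the actual interface, so the endpoint URL, the JSON field names (`video_url`, `prompt`) and the bearer-token authentication are all assumptions chosen for illustration, not Milestone's documented API. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical placeholder endpoint; the real URL is not given in the announcement.
API_URL = "https://vlm.example.invalid/v1/analyze"


def build_vlm_request(api_key: str, video_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS POST carrying a prompt and a video reference."""
    # Assumed field names; a real service may expect a different schema.
    payload = {"video_url": video_url, "prompt": prompt}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed authentication scheme.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_vlm_request(
    api_key="YOUR_API_KEY",
    video_url="https://example.invalid/clips/intersection.mp4",
    prompt="Describe any vehicles that run the red light.",
)
# Sending the request would look like: urllib.request.urlopen(request, timeout=30)
```

In a pay-per-use model each such call would presumably count as one billable query, so an integration would typically send one request per clip rather than polling.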
Trustworthy AI for safety-critical applications
A key differentiator of the new VLM is its consistent focus on responsible AI. The training data used is fully auditable and specifically optimised for real-world traffic scenarios. This makes the solution particularly suitable for safety-critical applications such as traffic management, smart cities and public infrastructure.
Andrew Burnett, Acting Chief Technology Officer at Milestone Systems, explains:
‘With Video Summarisation for XProtect and the Vision Language Model as a service, we are addressing two of the biggest bottlenecks in video security: information overload and time-consuming manual work. Operators get instant insights directly in XProtect, developers get production-ready intelligence via API – without complex training or infrastructure projects.’
Conclusion
With the introduction of its Vision Language Model, Milestone Systems is taking a significant step in the development of intelligent video analysis. The combination of generative AI, clear specialisation in traffic applications and regulatory compliance opens up new possibilities for converting video data into actionable decisions faster, more efficiently and more securely – for both operators and developers.

