Archives

Grass & Inference.net Launch ClipTagger-12b Video Model

Grass

Grass and Inference.net announced the launch of ClipTagger-12b, a new video annotation model built to identify actions, objects, and logos in video with high accuracy and detail. Applicable across domains from autonomous vehicles to warehouse robotics, it strengthens the perception capabilities that many AI systems rely on.

In benchmark tests, ClipTagger-12b outperforms Claude 4 and GPT-4.1 on annotation metrics such as ROUGE and BLEU, at up to 17x lower cost.
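For readers unfamiliar with the metrics named above: ROUGE and BLEU score a generated annotation by its n-gram overlap with a human-written reference. A minimal sketch of unigram ROUGE-1 F1, in plain Python (the example captions are illustrative, not from the benchmark):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1: word overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared word occurrences
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Scoring a hypothetical model caption against a human reference annotation
reference = "a delivery robot crosses the street carrying a package"
candidate = "a robot crosses the street with a package"
print(round(rouge1_f1(candidate, reference), 3))  # → 0.824
```

Production benchmarks use fuller implementations (e.g. the `rouge-score` package, or BLEU with brevity penalty and higher-order n-grams), but the principle is the same: closer lexical agreement with the reference annotation yields a higher score.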

See full benchmarks on Hugging Face

Developed through a collaboration between the two companies, ClipTagger-12b was trained by Inference.net on a subset of the more than 1 billion public-web videos collected by Grass, and is hosted on Inference's distributed compute network.


“It’s entirely possible to train low-cost, state-of-the-art models with the right data and good engineering,” said Sam Hogan, CEO at Inference.net.

“We believe the future of AI depends on keeping the web open and building the infrastructure needed to turn it into something models can learn from. This was a step in that direction,” said Andrej Radonjic, CEO at Wynd Labs.

The collaboration shows how specialized teams can build and deploy high-performance models of a caliber once limited to large AI labs, making advanced video annotation accessible to more developers and businesses.

ClipTagger-12b is live now on Inference.net, where developers and businesses can access it via API. Model weights and additional resources are also available through the Hugging Face repository.
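As a rough illustration of API access, the sketch below only constructs a JSON request body; the endpoint URL, model identifier, and payload shape are assumptions for illustration, not Inference.net's documented API, so consult the official docs before use:

```python
import json

# Hypothetical endpoint and model name -- assumptions, not the documented API.
API_URL = "https://api.inference.net/v1/chat/completions"
MODEL_ID = "cliptagger-12b"

payload = {
    "model": MODEL_ID,
    "messages": [
        {
            "role": "user",
            "content": "Annotate the actions, objects, and logos in this clip.",
        }
    ],
}

# Serialize the request body; send it with any HTTP client, e.g.:
#   requests.post(API_URL, data=body,
#                 headers={"Authorization": f"Bearer {API_KEY}"})
body = json.dumps(payload)
print(body)
```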

Source: Businesswire