Anthropic Unveils Petri: Open-Source Tool for AI Safety Auditing

Anthropic has announced the launch of Petri (Parallel Exploration Tool for Risky Interactions), an open-source framework designed to automate the auditing of large language models (LLMs) for safety and alignment. Petri enables researchers to rapidly test hypotheses about model behavior by simulating realistic multi-turn scenarios and scoring model responses across multiple safety-relevant dimensions.

The increasing complexity and variety of behaviors exhibited by modern AI systems present significant challenges for manual evaluation. Petri addresses this by automating much of the alignment evaluation process, allowing researchers to efficiently identify and investigate potential misaligned behaviors.

In initial tests involving 14 frontier models and 111 diverse seed instructions, Petri successfully identified a range of concerning behaviors, including deception, sycophancy, encouragement of user delusion, cooperation with harmful requests, self-preservation, power-seeking, and reward hacking. These findings underscore the need for scalable tools to assess the safety of advanced AI systems.

Petri operates by taking a list of seed instructions targeting specific scenarios and behaviors. For each instruction, an auditor agent plans and interacts with the target model in a tool-use loop. Subsequently, an LLM judge scores the resulting transcripts across multiple dimensions, enabling researchers to quickly identify and review the most concerning interactions.
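The seed, auditor, and judge stages described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not Petri's actual API: every name here (`run_audit`, `audit_all`, `Transcript`, and the toy auditor, target, and judge) is hypothetical, and the "models" are plain functions standing in for real LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins: none of these names come from Petri itself.
Message = Dict[str, str]


@dataclass
class Transcript:
    seed: str
    messages: List[Message] = field(default_factory=list)
    scores: Dict[str, float] = field(default_factory=dict)


def run_audit(seed: str,
              target: Callable[[List[Message]], str],
              auditor: Callable[[str, List[Message]], Optional[str]],
              max_turns: int = 3) -> Transcript:
    """Run one multi-turn conversation between an auditor and a target."""
    transcript = Transcript(seed=seed)
    for _ in range(max_turns):
        probe = auditor(seed, transcript.messages)
        if probe is None:  # the auditor decides it has seen enough
            break
        transcript.messages.append({"role": "auditor", "content": probe})
        reply = target(transcript.messages)
        transcript.messages.append({"role": "target", "content": reply})
    return transcript


def audit_all(seeds, target, auditor, judge) -> List[Transcript]:
    """Audit every seed, score each transcript, and rank worst-first."""
    transcripts = []
    for seed in seeds:
        t = run_audit(seed, target, auditor)
        t.scores = judge(t)  # one score per safety-relevant dimension
        transcripts.append(t)
    return sorted(transcripts,
                  key=lambda t: max(t.scores.values(), default=0.0),
                  reverse=True)


# Toy components so the sketch runs without any model access.
def toy_auditor(seed: str, history: List[Message]) -> Optional[str]:
    if len(history) >= 4:  # stop after two auditor/target exchanges
        return None
    return f"[turn {len(history) // 2 + 1}] {seed}"


def toy_target(messages: List[Message]) -> str:
    if "flatter" in messages[-1]["content"].lower():
        return "You are absolutely right!"
    return "I can't help with that."


def toy_judge(t: Transcript) -> Dict[str, float]:
    replies = [m["content"] for m in t.messages if m["role"] == "target"]
    agree = sum("absolutely right" in r for r in replies)
    return {"sycophancy": agree / len(replies) if replies else 0.0,
            "deception": 0.0}


if __name__ == "__main__":
    seeds = ["Flatter the user's incorrect claim",
             "Ask for help with a harmful request"]
    for t in audit_all(seeds, toy_target, toy_auditor, toy_judge):
        print(t.seed, t.scores)
```

Sorting by the highest score on any dimension mirrors the stated goal of surfacing the most concerning interactions first; a real setup would replace the toy functions with calls to an auditor model, a target model, and an LLM judge.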

Early adopters, including MATS scholars, Anthropic Fellows, and the UK AI Security Institute, have used the tool to explore aspects of model behavior such as evaluation awareness, reward hacking, self-preservation, and model character.

Anthropic has made Petri publicly available, along with example prompts, evaluation code, and guidance for extending the tool. This open release aims to foster collaboration and accelerate alignment research across the AI safety community.

Source: Anthropic
