Arthur Introduces Arthur Bench, An Open-Source AI Tool to Help Businesses Navigate the Complex World of Large Language Model Selection

Arthur, an AI performance platform trusted by some of the largest organizations in the world to ensure that their AI systems are well-managed and deployed in a responsible manner, today introduced Arthur Bench, an open-source evaluation tool for comparing large language models (LLMs), prompts, and hyperparameters for generative text models. This open-source tool will enable businesses to evaluate how different LLMs will perform in real-world scenarios so they can make informed, data-driven decisions when integrating the latest AI technologies into their operations.

In conjunction with Arthur Bench, Arthur also unveiled The Generative Assessment Project (GAP), a research initiative ranking the strengths and weaknesses of language model offerings from industry leaders like OpenAI, Anthropic, and Meta. Notably, Arthur’s research suggests that Anthropic may be gaining a slight competitive edge over OpenAI’s GPT-4 on measures of “reliability” within specific domains. For example, while GPT-4 was the most successful at answering math questions, Anthropic’s Claude-2 model was stronger at avoiding factual hallucinations and at answering “I don’t know” at appropriate times when responding to history questions. Through GAP, Arthur will continue to share discoveries about behavioral differences and best practices with the public as it works to make LLMs work for everyone.

“As our GAP research clearly shows, understanding the differences in performance between LLMs can have an incredible amount of nuance. With Bench, we’ve created an open-source tool to help teams deeply understand the differences between LLM providers, different prompting and augmentation strategies, and custom training regimes,” said Adam Wenchel, co-founder and CEO of Arthur.

Arthur Bench is the newest addition to Arthur’s suite of LLM-focused products, following the launch of Arthur Shield in May. Arthur Bench helps businesses in multiple ways:

  • Model Selection & Validation: The AI landscape is rapidly evolving, and keeping abreast of advancements and ensuring that a company’s chosen LLM remains the best-performing fit is crucial. Arthur Bench helps companies compare the available LLM options using a consistent metric so they can determine the best fit for their application (see the sketch after this list).
  • Budget & Privacy Optimization: Not all applications require the most advanced or expensive LLMs; in some cases, a less expensive model can perform the required tasks just as well. For instance, if an application generates simple text, such as automated responses to common customer queries, a less expensive model may be sufficient. Additionally, bringing some models in-house can offer greater control over data privacy.
  • Translating Academic Benchmarks to Real-World Performance: Companies want to evaluate LLMs against standard academic benchmarks for qualities like fairness or bias, but have trouble translating the latest research into real-world scenarios. Bench helps companies test and compare the performance of different models quantitatively, applying a standard set of metrics so they can evaluate models accurately and consistently.
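
As a rough illustration of the “consistent metric” idea described above, the sketch below scores two candidate models’ outputs against the same reference answers with a single metric. It is a minimal, hypothetical example: the metric, data, and function names are placeholders and do not represent Arthur Bench’s actual interface.

```python
# Hypothetical sketch (not Arthur Bench's API): score two candidate LLMs on
# the same test set with one shared metric so the results are comparable.

def exact_match_score(candidate: str, reference: str) -> float:
    """Return 1.0 if the candidate answer matches the reference, ignoring case and whitespace."""
    return float(candidate.strip().lower() == reference.strip().lower())

def evaluate(candidate_outputs: list[str], reference_outputs: list[str]) -> float:
    """Average a single metric over a shared test set."""
    scores = [exact_match_score(c, r) for c, r in zip(candidate_outputs, reference_outputs)]
    return sum(scores) / len(scores)

# Shared reference answers for a small set of prompts (placeholder data).
references = ["Paris", "4", "1969"]

# Outputs collected from two different LLM providers on the same prompts.
model_a_outputs = ["Paris", "4", "1968"]
model_b_outputs = ["paris", "4", "1969"]

print("model_a:", evaluate(model_a_outputs, references))  # 0.67
print("model_b:", evaluate(model_b_outputs, references))  # 1.0
```

Because both models are scored on the same test set with the same metric, their results are directly comparable; Bench is described as generalizing this kind of comparison across LLMs, prompts, and hyperparameters.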

SOURCE: PRNewswire