{"id":98545,"date":"2026-05-19T10:30:00","date_gmt":"2026-05-19T17:30:00","guid":{"rendered":"https:\/\/opensource.microsoft.com\/blog\/?p=98545"},"modified":"2026-05-18T14:38:17","modified_gmt":"2026-05-18T21:38:17","slug":"introducing-state-bench-a-benchmark-for-ai-agent-memory","status":"publish","type":"post","link":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/","title":{"rendered":"Introducing STATE-Bench: A benchmark for AI agent memory"},"content":{"rendered":"<aside id=\"accordion-5f564f7d-2276-47de-8b01-064d8412389c\" class=\"table-of-contents-block accordion pb-0\" data-bi-aN=\"table-of-contents\">\n\t<button class=\"btn btn-collapse mb-0 display-flex justify-content-between w-100\" type=\"button\" data-mount=\"collapse\" data-target=\"#accordion-collapse-5f564f7d-2276-47de-8b01-064d8412389c\" aria-expanded=\"true\" aria-controls=\"accordion-collapse-5f564f7d-2276-47de-8b01-064d8412389c\">\n\t\t<span class=\"table-of-contents-block__label subtitle\">In this article<\/span>\n\t\t<span class=\"table-of-contents-block__current mr-4 text-gray-600 font-weight-normal\" aria-hidden=\"true\"><\/span>\n\n\t\t<svg class=\"table-of-contents-block__arrow\" aria-label=\"Toggle arrow\" width=\"18\" height=\"11\" viewBox=\"0 0 18 11\" fill=\"none\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\">\n\t\t\t<path d=\"M15.7761 11L18 8.82043L9 0L0 8.82043L2.22394 11L9 4.35913L15.7761 11Z\" fill=\"currentColor\" \/>\n\t\t<\/svg>\n\t<\/button>\n\t<div id=\"accordion-collapse-5f564f7d-2276-47de-8b01-064d8412389c\" class=\"table-of-contents-block__collapse-wrapper collapse show\" data-parent=\"#accordion-5f564f7d-2276-47de-8b01-064d8412389c\">\n\t\t<div class=\"accordion-body p-0\">\n\t\t\t<ol class=\"table-of-contents-block__list\"><li class=\"table-of-contents-block__list-item\"><a class=\"table-of-contents-block__list-item-link\" 
href=\"#an-open-source-benchmark-that-measures-what-memory-does-for-ai-agents-in-production\">An open-source benchmark that measures what memory does for AI agents in production<\/a><\/li><li class=\"table-of-contents-block__list-item\"><a class=\"table-of-contents-block__list-item-link\" href=\"#the-evaluation-loop\">The evaluation loop<\/a><\/li><li class=\"table-of-contents-block__list-item\"><a class=\"table-of-contents-block__list-item-link\" href=\"#rigorous-metrics-for-production-readiness\">Rigorous metrics for production readiness<\/a><\/li><li class=\"table-of-contents-block__list-item\"><a class=\"table-of-contents-block__list-item-link\" href=\"#quantifying-the-memory-gap\">Quantifying the memory gap<\/a><\/li><li class=\"table-of-contents-block__list-item\"><a class=\"table-of-contents-block__list-item-link\" href=\"#bring-your-own-memory-an-open-challenge\">Bring your own memory\u2014an open challenge<\/a><\/li><\/ol>\t\t<\/div>\n\t<\/div>\n\t<span class=\"table-of-contents-block__progress-bar\"><\/span>\n<\/aside>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"an-open-source-benchmark-that-measures-what-memory-does-for-ai-agents-in-production\">An open-source benchmark that measures what memory does for AI agents in production<\/h2>\n\n\n\n<p>Everyone agrees that agents need memory in production. What we still don\u2019t have is a good way to tell how much memory helps. Most memory benchmarks are just retrieval tests: fetch a name from 50 turns ago or surface a fact from a long chat. That tells you the pipe works. It doesn\u2019t tell you that the agent performs better.<\/p>\n\n\n\n<p>This gap matters in enterprise workflows. Customer support agents don\u2019t break because the agent forgot a fact; they break because it botched the procedure. 
It skips policy checks, surfaces incomplete user details, uses domain tools incorrectly or inefficiently, and repeats the same failure mode.<\/p>\n\n\n\n<p>That\u2019s why we built <strong><a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">STATE-Bench<\/a><\/strong> (Stateful Task Agent Evaluation Benchmark): an open-source, memory-agnostic benchmark that measures whether agents improve with experience on realistic enterprise tasks. Today we are making it freely available to agent developers, researchers, and platform teams.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-16018d1d wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">Learn more about STATE-Bench<\/a><\/div>\n<\/div>\n\n\n\n<p>This release covers three domains (customer support, travel, and shopping) with 450 tasks spanning policy compliance, information synthesis, and multi-step reasoning procedures. <a href=\"https:\/\/youtu.be\/oFs9aU_T794\">Watch the Open at Microsoft episode to hear more about the motivation and people behind this work<\/a>.<\/p>\n\n\n\n<p>We focus on enterprise scenarios because they stress the exact failure modes we see in production. These tasks share three properties:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Procedural<\/strong>: The agent must follow a domain-specific procedure, such as looking up a booking, validating user eligibility, checking policy, calculating fees, confirming, and then executing. Skip a step and the outcome is usually wrong.<\/li>\n\n\n\n<li><strong>Stateful<\/strong>: Enterprise agents go beyond conversation; they change system state in a database (refund records, booking status, account updates). 
Mistakes aren\u2019t bad answers; they create real cost and cleanup.<\/li>\n\n\n\n<li><strong>User experience<\/strong>: In addition to task success, the benchmark assesses the quality of the user\u2019s interaction with the agent. We developed a detailed rubric with strict guidance on what a user-centric experience should look like (see the metrics section below).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-evaluation-loop\">The evaluation loop<\/h2>\n\n\n\n<p>In <a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">STATE-Bench<\/a>, each task is a self-contained scenario with a pre-populated database of artifacts (e.g. bookings, orders, carts), a customer with a specific problem, and a set of deterministic state assertions that define success.<\/p>\n\n\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;6a0cc103753ab&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"6a0cc103753ab\" class=\"wp-block-image aligncenter size-full wp-lightbox-container\"><img decoding=\"async\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-15-103751-1.webp\" alt=\"STATE Bench evaluation loop flowchart.\" class=\"wp-image-98571 webp-format\" srcset=\"\" 
data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-15-103751-1.webp\"><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><\/figure>\n\n\n\n<p>The orchestrator runs a multi-turn conversation loop. On each turn, the agent receives the full conversation history and responds with tool calls and text. Tools execute against a stateful environment (e.g. looking up a booking or checking a policy), and then the user simulator responds naturally, revealing information only when asked. The loop continues until the task is resolved or the turn limit is reached.<\/p>\n\n\n\n<p>The LLM-based user simulator is central to <a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">STATE-Bench<\/a>. It keeps user details consistent, pushes back when appropriate, and forces the agent to gather missing information instead of making assumptions. Each simulator also has a lightweight personality. For example, one user may be impatient and provide incomplete details, while another gives everything upfront.<\/p>\n\n\n\n<p>To keep evaluation stable, the simulator follows an exhaustive, task-specific rule set and does not respond outside of it. 
In testing, simulator-induced variance was about 1%, mostly from raw LLM noise, so success or failure reflects on the agent, not the simulator.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"rigorous-metrics-for-production-readiness\">Rigorous metrics for production readiness<\/h2>\n\n\n\n<p><a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">STATE-Bench<\/a> evaluates agents across four dimensions: whether they complete tasks, whether they do so consistently across runs, how efficiently they operate, and how well they communicate with users.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Task completion rate:<\/strong> Each task is run five times, and the average completion rate is reported. For state-mutating tasks, a deterministic scorer compares the final environment state to ground truth. For procedural and informational tasks, an LLM judge evaluates whether the agent followed the expected process.<\/li>\n\n\n\n<li><strong>Agent reliability:<\/strong> Reported as <em>pass^5<\/em>, the percentage of tasks that succeed on all five runs, which captures execution consistency.<\/li>\n\n\n\n<li><strong>Agent efficiency:<\/strong> The average cost to complete a task, including turns, unnecessary tool calls, and all input, output, and retrieval tokens.<\/li>\n\n\n\n<li><strong>User experience score:<\/strong> An LLM judge scores the full conversation on user experience using a one-to-five rubric along five dimensions. For example, <em>user ease<\/em> captures how much effort the user had to expend, while <em>user consent<\/em> measures whether the agent sought confirmation and presented options before acting.<\/li>\n<\/ul>\n\n\n\n<p>Together, these metrics provide a multi-dimensional view of agent performance, capturing task outcomes, consistency across runs, operational cost, and interaction quality. 
In this benchmark, memory systems can then be compared on whether they are associated with changes in reliability, progress on harder tasks, turn count, and user experience.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"quantifying-the-memory-gap\">Quantifying the memory gap<\/h2>\n\n\n\n<p>We established a baseline on <a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">STATE-Bench<\/a> using GPT-5.1 without memory, running each task five times across three domains. Despite strong prompting and full tool access, the model completes fewer than half of the tasks reliably. The <em>pass^5<\/em> results are particularly instructive: in travel, only about 30% of tasks succeed across all five runs.<\/p>\n\n\n\n<p>The gap between average pass@1 and <em>pass^5<\/em> underscores the central challenge: agents can be inconsistent even when given identical tasks. We believe this is precisely the failure mode memory is intended to mitigate.<\/p>\n\n\n<figure data-wp-context=\"{&quot;imageId&quot;:&quot;6a0cc103762e6&quot;}\" data-wp-interactive=\"core\/image\" data-wp-key=\"6a0cc103762e6\" class=\"wp-block-image aligncenter size-full wp-lightbox-container\"><img decoding=\"async\" data-wp-class--hide=\"state.isContentHidden\" data-wp-class--show=\"state.isContentVisible\" data-wp-init=\"callbacks.setButtonStyles\" data-wp-on--click=\"actions.showLightbox\" data-wp-on--load=\"callbacks.setButtonStyles\" data-wp-on-window--resize=\"callbacks.setButtonStyles\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-15-103727-1.webp\" alt=\"Baseline performance on default agent loop plus g p t 5.1\" class=\"wp-image-98570 webp-format\" srcset=\"\" 
data-orig-src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Screenshot-2026-05-15-103727-1.webp\"><button\n\t\t\tclass=\"lightbox-trigger\"\n\t\t\ttype=\"button\"\n\t\t\taria-haspopup=\"dialog\"\n\t\t\taria-label=\"Enlarge\"\n\t\t\tdata-wp-init=\"callbacks.initTriggerButton\"\n\t\t\tdata-wp-on--click=\"actions.showLightbox\"\n\t\t\tdata-wp-style--right=\"state.imageButtonRight\"\n\t\t\tdata-wp-style--top=\"state.imageButtonTop\"\n\t\t>\n\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"12\" height=\"12\" fill=\"none\" viewBox=\"0 0 12 12\">\n\t\t\t\t<path fill=\"#fff\" d=\"M2 0a2 2 0 0 0-2 2v2h1.5V2a.5.5 0 0 1 .5-.5h2V0H2Zm2 10.5H2a.5.5 0 0 1-.5-.5V8H0v2a2 2 0 0 0 2 2h2v-1.5ZM8 12v-1.5h2a.5.5 0 0 0 .5-.5V8H12v2a2 2 0 0 1-2 2H8Zm2-12a2 2 0 0 1 2 2v2h-1.5V2a.5.5 0 0 0-.5-.5H8V0h2Z\" \/>\n\t\t\t<\/svg>\n\t\t<\/button><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"bring-your-own-memory-an-open-challenge\">Bring your own memory\u2014an open challenge<\/h2>\n\n\n\n<p><a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">STATE-Bench<\/a> provides tasks, environment, tools, user simulator, and scoring. It offers a reproducible, open standard to answer the questions that matter in production: Does my memory system make my agent more reliable? Does it reduce the turns needed to complete a task? Does it improve user experience? 
Rather than relying on internal benchmarks or anecdotal testing, you now have a shared framework the community can build on and compare against.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"state-bench-is-built-for-three-audiences\">STATE-Bench is built for three audiences:<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Agent developers<\/strong>: A way to measure whether adding memory improves your agent\u2019s reliability, task completion, and efficiency.<\/li>\n\n\n\n<li><strong>Researchers<\/strong>: A reproducible, open benchmark for comparing memory architectures and approaches.<\/li>\n\n\n\n<li><strong>Platform builders<\/strong>: A framework for evaluating memory as a component of a larger agent stack, with a pluggable interface to test your own implementation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"state-bench-is-fully-open-source\">STATE-Bench is fully open source:<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>450 tasks across three domains (travel, customer support, shopping), with pre-populated environments, user simulators, and deterministic assertions.<\/li>\n\n\n\n<li>Domain-agnostic evaluation framework (orchestrator, scoring, and metrics).<\/li>\n\n\n\n<li>Pluggable agent interface (bring your own memory).<\/li>\n\n\n\n<li>Task generation and auditing tools to validate task correctness and solvability.<\/li>\n<\/ul>\n\n\n\n<p>It is available at <a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub: STATE-Bench<\/a>. Star the repo, run the benchmark, and <a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\/blob\/main\/RUN_BENCHMARK.md\">share what you build<\/a>. 
Run the no-memory baseline to establish your starting point; plug in your memory system using the bring your own memory interface; compare results across all four metrics; and share your findings with the community.<\/p>\n\n\n<div class=\"wp-block-msxcm-cta-block\" data-moray data-bi-an=\"CTA Block\">\n\t<div class=\"card d-block mx-ng mx-md-0\">\n\t\t<div class=\"row no-gutters material-color-brand-light bg-fabric-white\">\n\n\t\t\t\t\t\t\t<div class=\"col-md-4\">\n\t\t\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-1024x576.jpg\" class=\"card-img img-object-cover\" alt=\"Abstract data platform representation.\" srcset=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-1024x576.jpg 1024w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-388x218.jpg 388w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-768x432.jpg 768w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-1536x864.jpg 1536w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-2048x1152.jpg 2048w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-450x253.jpg 450w, https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-DataPlatform-Dark-1-650x366.jpg 650w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/>\t\t\t\t<\/div>\n\t\t\t\n\t\t\t<div class=\"d-flex col-md\">\n\t\t\t\t<div class=\"card-body align-self-center p-4 p-md-5\">\n\t\t\t\t\t\n\t\t\t\t\t<h2>STATE-Bench<\/h2>\n\n\t\t\t\t\t<div class=\"mb-3\">\n\t\t\t\t\t\t<p>Learn how you can use the open-source benchmark to measure agent improvement.<\/p>\n\t\t\t\t\t<\/div>\n\n\t\t\t\t\t\t\t\t\t\t\t<div 
class=\"link-group\">\n\t\t\t\t\t\t\t<a href=\"https:\/\/github.com\/microsoft\/STATE-Bench\" class=\"btn btn-link text-decoration-none p-0\" target=\"_blank\">\n\t\t\t\t\t\t\t\t<span>Get started<\/span>\n\t\t\t\t\t\t\t\t<span class=\"glyph-append glyph-append-chevron-right glyph-append-xsmall\"><\/span>\n\t\t\t\t\t\t\t<\/a>\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t<\/div>\n\n\t\t\t\t\t<\/div>\n\t<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>STATE-Bench is an open-source, memory-agnostic benchmark freely available to agent developers, researchers, and platform teams.<\/p>\n","protected":false},"author":6266,"featured_media":98575,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msxcm_post_with_no_image":false,"ep_exclude_from_search":false,"_classifai_error":"","_classifai_text_to_speech_error":"","footnotes":""},"post_tag":[136],"content-type":[346],"topic":[2238],"programming-languages":[],"coauthors":[2637,2636],"class_list":["post-98545","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-github","content-type-news","topic-ai-machine-learning","review-flag-1-1593580432-963","review-flag-5-1593580453-725"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Introducing STATE-Bench: A benchmark for AI agent memory | Microsoft Open Source Blog<\/title>\n<meta name=\"description\" content=\"Learn how you can use Stateful Task Agent Evaluation Benchmark to measure how agents improve with experience on realistic enterprise tasks.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" 
\/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Introducing STATE-Bench: A benchmark for AI agent memory | Microsoft Open Source Blog\" \/>\n<meta property=\"og:description\" content=\"Learn how you can use Stateful Task Agent Evaluation Benchmark to measure how agents improve with experience on realistic enterprise tasks.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/\" \/>\n<meta property=\"og:site_name\" content=\"Microsoft Open Source Blog\" \/>\n<meta property=\"article:published_time\" content=\"2026-05-19T17:30:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1440\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Lewis Liu, Nishant Yadav\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg\" \/>\n<meta name=\"twitter:creator\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:site\" content=\"@OpenAtMicrosoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Lewis Liu, Nishant Yadav\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 min read\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/\"},\"author\":[{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/lewis-liu\\\/\",\"@type\":\"Person\",\"@name\":\"Lewis Liu\"},{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/author\\\/nishant-yadav\\\/\",\"@type\":\"Person\",\"@name\":\"Nishant Yadav\"}],\"headline\":\"Introducing STATE-Bench: A benchmark for AI agent memory\",\"datePublished\":\"2026-05-19T17:30:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/\"},\"wordCount\":1066,\"publisher\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg\",\"keywords\":[\"GitHub\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/\",\"name\":\"Introducing STATE-Bench: A benchmark for AI agent memory | Microsoft Open Source 
Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg\",\"datePublished\":\"2026-05-19T17:30:00+00:00\",\"description\":\"Learn how you can use Stateful Task Agent Evaluation Benchmark to measure how agents improve with experience on realistic enterprise tasks.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/#primaryimage\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg\",\"contentUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg\",\"width\":2560,\"height\":1440,\"caption\":\"ProEXR File Description =Attributes= channels (chlist) compression (compression): Zip16 dataWindow (box2i): [0, 0, 3499, 1968] displayWindow (box2i): [0, 0, 3499, 1968] lineOrder (lineOrder): Increasing Y pixelAspectRatio 
(float): 1 screenWindowCenter (v\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/2026\\\/05\\\/19\\\/introducing-state-bench-a-benchmark-for-ai-agent-memory\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Introducing STATE-Bench: A benchmark for AI agent memory\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#website\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\",\"name\":\"Microsoft Open Source Blog\",\"description\":\"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability\",\"publisher\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#organization\",\"name\":\"Microsoft Open Source Blog\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Microsoft-Logo.png\",\"contentUrl\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/wp-content\\\/uploads\\\/2019\\\/08\\\/Microsoft-Logo.png\",\"width\":259,\"height\":194,\"caption\":\"Microsoft Open Source 
Blog\"},\"image\":{\"@id\":\"https:\\\/\\\/opensource.microsoft.com\\\/blog\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/x.com\\\/OpenAtMicrosoft\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Introducing STATE-Bench: A benchmark for AI agent memory | Microsoft Open Source Blog","description":"Learn how you can use Stateful Task Agent Evaluation Benchmark to measure how agents improve with experience on realistic enterprise tasks.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/","og_locale":"en_US","og_type":"article","og_title":"Introducing STATE-Bench: A benchmark for AI agent memory | Microsoft Open Source Blog","og_description":"Learn how you can use Stateful Task Agent Evaluation Benchmark to measure how agents improve with experience on realistic enterprise tasks.","og_url":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/","og_site_name":"Microsoft Open Source Blog","article_published_time":"2026-05-19T17:30:00+00:00","og_image":[{"width":2560,"height":1440,"url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg","type":"image\/jpeg"}],"author":"Lewis Liu, Nishant Yadav","twitter_card":"summary_large_image","twitter_image":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg","twitter_creator":"@OpenAtMicrosoft","twitter_site":"@OpenAtMicrosoft","twitter_misc":{"Written by":"Lewis Liu, Nishant Yadav","Est. 
reading time":"4 min read"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/#article","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/"},"author":[{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/lewis-liu\/","@type":"Person","@name":"Lewis Liu"},{"@id":"https:\/\/opensource.microsoft.com\/blog\/author\/nishant-yadav\/","@type":"Person","@name":"Nishant Yadav"}],"headline":"Introducing STATE-Bench: A benchmark for AI agent memory","datePublished":"2026-05-19T17:30:00+00:00","mainEntityOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/"},"wordCount":1066,"publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg","keywords":["GitHub"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/","url":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/","name":"Introducing STATE-Bench: A benchmark for AI agent memory | Microsoft Open Source 
Blog","isPartOf":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/#primaryimage"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/#primaryimage"},"thumbnailUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg","datePublished":"2026-05-19T17:30:00+00:00","description":"Learn how you can use Stateful Task Agent Evaluation Benchmark to measure how agents improve with experience on realistic enterprise tasks.","breadcrumb":{"@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/#primaryimage","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2026\/05\/Azure-3D-Illustrations-DataAnalytics-Dark-scaled.jpg","width":2560,"height":1440,"caption":"ProEXR File Description =Attributes= channels (chlist) compression (compression): Zip16 dataWindow (box2i): [0, 0, 3499, 1968] displayWindow (box2i): [0, 0, 3499, 1968] lineOrder (lineOrder): Increasing Y pixelAspectRatio (float): 1 screenWindowCenter 
(v"},{"@type":"BreadcrumbList","@id":"https:\/\/opensource.microsoft.com\/blog\/2026\/05\/19\/introducing-state-bench-a-benchmark-for-ai-agent-memory\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/opensource.microsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Introducing STATE-Bench: A benchmark for AI agent memory"}]},{"@type":"WebSite","@id":"https:\/\/opensource.microsoft.com\/blog\/#website","url":"https:\/\/opensource.microsoft.com\/blog\/","name":"Microsoft Open Source Blog","description":"Open dialogue about openness at Microsoft \u2013 open source, standards, interoperability","publisher":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/opensource.microsoft.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/opensource.microsoft.com\/blog\/#organization","name":"Microsoft Open Source Blog","url":"https:\/\/opensource.microsoft.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","contentUrl":"https:\/\/opensource.microsoft.com\/blog\/wp-content\/uploads\/2019\/08\/Microsoft-Logo.png","width":259,"height":194,"caption":"Microsoft Open Source Blog"},"image":{"@id":"https:\/\/opensource.microsoft.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/OpenAtMicrosoft"]}]}},"msxcm_animated_featured_image":null,"bloginabox_display_generated_audio":false,"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Microsoft Open Source 
Blog","distributor_original_site_url":"https:\/\/opensource.microsoft.com\/blog","push-errors":false,"_links":{"self":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/98545","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/users\/6266"}],"replies":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=98545"}],"version-history":[{"count":19,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/98545\/revisions"}],"predecessor-version":[{"id":98604,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/posts\/98545\/revisions\/98604"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media\/98575"}],"wp:attachment":[{"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=98545"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/post_tag?post=98545"},{"taxonomy":"content-type","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/content-type?post=98545"},{"taxonomy":"topic","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/topic?post=98545"},{"taxonomy":"programming-languages","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/programming-languages?post=98545"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/opensource.microsoft.com\/blog\/wp-json\/wp\/v2\/coauthors?post=98545"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}