4 posts tagged with "llms"

Introducing WorkspaceBench to evaluate OpenBB Workspace for agentic workflows

July 29, 2026 · 63 min read

A few weeks ago at NY Tech Week I gave a talk on centralizing dashboards and agentic workflows in OpenBB Workspace. At the end I talked about how I'm seeing more analysts and quants using Claude Code and Codex to drive their work - and so we need to adapt the workspace to make it (even more) agent-first.

That is why we announced Workspace MCP: it exposes the workspace (dashboards, widgets, data, apps, skills) as a set of MCP tools, so an agent can read data, build dashboards, and assemble whole apps.

If agents are going to drive this workspace for real analysts, I need a way to measure how well they navigate it. That led me to build WorkspaceBench: a vanilla harness, a simulator of OpenBB Workspace, realistic tasks, and graders that evaluate the agent's output.

In order to build this, I got inspired by:

SWE-bench (Jimenez et al., 2023) grades a real GitHub issue by whether the patch passes the tests;
Terminal-Bench grades tasks in a terminal environment;
WebArena verifies web tasks;
OSWorld verifies desktop goals programmatically.

I wanted to bring some of these concepts to an actual financial workspace.

I can't recall the last side project that took me as much time as this one. It took a lot of iteration, killing it, and starting again from scratch.

Initially I gave the reasoning and strategy to Fable, and let it come up with the how - i.e. how tasks are defined, what tools are used, initial dashboard setup, etc. After a few iterations I actually liked the staircase of results I was seeing from the models on the tasks, and so thought we were almost done...

Except that when I dug into how the tasks were set up, expecting my mental model of how they should look to have been implemented 1:1, it was actually not there at all. Not only was there a lot of fluff (like variables that didn't add anything to what we wanted to test), but I felt that many tasks were not super representative of what a user does in the workspace. E.g. some tasks were phrased as "use add_generative_widget markdown_note widget with text 'The revenue of AAPL is X'" - which is really stupid when you think about it. Instead I was expecting something like "Add a note to the dashboard with the revenue of the company in the dashboard". This means that the model has to reason about which company we are looking at, which widget contains the revenue data, and then how to format the output.
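To make the difference concrete, here is a minimal sketch of what grading the outcome (rather than the exact tool call) could look like. This is not the WorkspaceBench task format: `Widget`, `Dashboard` and `grade_revenue_note` are hypothetical names I'm using for illustration, and the revenue value is a made-up placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Widget:
    kind: str      # hypothetical widget kinds, e.g. "table" or "note"
    content: dict  # arbitrary payload held by the widget


@dataclass
class Dashboard:
    widgets: list[Widget] = field(default_factory=list)


def grade_revenue_note(dashboard: Dashboard) -> bool:
    """Pass if some note widget mentions the revenue held in the dashboard's data."""
    # Read the ground truth from the dashboard itself, so the prompt never
    # has to spell out the ticker, the widget, or the value.
    revenues = [
        w.content["revenue"]
        for w in dashboard.widgets
        if w.kind == "table" and "revenue" in w.content
    ]
    if not revenues:
        return False
    expected = str(revenues[0])
    notes = [w.content.get("text", "") for w in dashboard.widgets if w.kind == "note"]
    return any(expected in text for text in notes)


# Placeholder data: one table widget holding a made-up revenue figure.
dash = Dashboard([Widget("table", {"ticker": "AAPL", "revenue": 1000})])
before = grade_revenue_note(dash)  # no note yet, so the task is not done
dash.widgets.append(Widget("note", {"text": "Revenue: 1000"}))
after = grade_revenue_note(dash)   # a note with the right figure now exists
```

The grader only looks at the final dashboard state, so any sequence of tool calls that ends with a correct note passes, and the agent still has to work out the company, the source widget and the format on its own.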

During this period OpenAI released a post criticizing SWE-Bench Pro which really hit home for me. It reminded me of the acronym LATFD ("look at the f*cking data"). And the longer I avoided doing it, the more I felt like I was building on sand.

Don't get me wrong, Fable was still a major driver of what I've done here. But I scrapped almost everything and started from scratch with a stronger foundation, gave the model only certain degrees of freedom, and always checked the underlying tasks to ensure that it was on the right track.

This post will walk you through everything that was done, including: the workspace simulator, the datasets we used, what a task file is made of, what we are evaluating, the final results and what's next.

I hope you enjoy this more technical deep-dive into creating a financial workspace benchmark from scratch. I'm sure there are better ways to create benchmarks, but in this post I'll walk you through my thinking process and how I built it.

The Model Did What I Rewarded, Not What I Wanted

June 9, 2026 · 23 min read

This experiment started after reading Prime Intellect's Systematic Reward Hacking and Prime Sprints post. Their setup made reward hacking feel small enough to test directly: give a model a visible task, add a hidden reward component the model is never told about, and watch whether RL learns the proxy instead of the intended behavior.

I wanted to try the same style of experiment with a continuous (length-based) hack instead of a binary keyword hack. But the question was not simply whether reward hacking would happen. I was deliberately creating a conflict between the prompt and the reward.

I made the full experiment public here: DidierRLopes/reward-hacking. It includes the environment, hosted training configs, cached run data, generated figures, and the notebook used for this post.

The more interesting question was whether better prompting could protect against it:

What if the user asks for a direct answer, but the training reward quietly pays the model for being longer?
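As a rough illustration of that conflict, here is a minimal sketch of a reward with a visible correctness term and a hidden, continuous length term. It is not the code from the repo: the function names, the 200-word cap and the 0.5 weight are all assumptions picked for the example.

```python
def visible_reward(answer: str, expected: str) -> float:
    """The part the prompt describes: 1.0 if the expected answer appears, else 0.0."""
    return 1.0 if expected.lower() in answer.lower() else 0.0


def hidden_length_bonus(answer: str, max_words: int = 200, weight: float = 0.5) -> float:
    """The part the model is never told about: grows linearly with word count, then caps."""
    n_words = len(answer.split())
    return weight * min(n_words, max_words) / max_words


def total_reward(answer: str, expected: str) -> float:
    # What the RL loop would actually optimize in this sketch: task reward plus hidden bonus.
    return visible_reward(answer, expected) + hidden_length_bonus(answer)


direct = "Paris"
padded = "Paris " + "indeed " * 100  # same answer, padded with filler words

# Both answers are equally correct under the visible reward...
same_visible = visible_reward(direct, "Paris") == visible_reward(padded, "Paris")
# ...but the padded one earns more in total, which is the proxy RL can latch onto.
padded_wins = total_reward(padded, "Paris") > total_reward(direct, "Paris")
```

Because the bonus is continuous, there is no single keyword to discover: every extra word pays a little more until the cap, even when the prompt asks for a direct answer.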

Can we kill the term "vibe coding"?

July 23, 2025 · 4 min read

The term 'vibe coding' undermines the strategic work of delegating tasks to AI. This post argues for a shift in perspective towards 'outcome-driven development' as a more accurate description of the future of software engineering.

Target Market Analysis with the help of LLMs

September 10, 2023 · 11 min read

This blog post provides a comprehensive guide on how to perform target market analysis for your company using LLMs. It includes a detailed explanation of the BCG Matrix and the GE McKinsey Matrix, and how these frameworks can be used to determine market attractiveness and competitive advantage.

The open source code is available here.