Case Study: A Simple LLM Harness

Document Exploration Through Code Execution

An open-source agent loop that lets LLMs explore documents by writing Python code — no vector database, no chunking, no retrieval pipeline. The model reads documents directly, cites its sources, and the system verifies every citation.

Try It

This isn't a demo in a screenshot. The assistant on this site's homepage runs on the harness — ask it anything about AppSimple and watch it explore, cite, and answer in real time.

Want to try it with your own documents? Upload them on the Document Explorer and see how the agent handles your content.

The Bet

Most document Q&A systems use RAG: chunk documents, embed them in a vector database, retrieve the closest chunks, and pass them to the model. It works, but the model never actually reads the documents. It sees fragments chosen by a similarity algorithm.

The harness takes a different approach. Give the model one tool — run_python — and let it read the documents itself. It writes Python to open files, search for patterns, cross-reference data, and compute answers. The model reasons about the documents rather than pattern-matching against embeddings.
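For a concrete sense of what this looks like, here is a sketch of the kind of exploration code the model might emit. The workspace and filename are hypothetical, created inline so the snippet is self-contained; in the real harness the documents are already mounted in the sandbox.

```python
import re
from pathlib import Path

# Hypothetical setup so the sketch runs standalone; in the real harness
# the workspace files are already present in the sandbox.
ws = Path("workspace")
ws.mkdir(exist_ok=True)
(ws / "report.txt").write_text(
    "Total revenue grew to $4 million in 2023. Costs fell slightly."
)

# Step 1: list the available documents
for path in sorted(ws.glob("*.txt")):
    print(path.name, path.stat().st_size, "bytes")

# Step 2: search the full text for a pattern, rather than relying on
# whichever fragments a similarity search happened to retrieve
text = (ws / "report.txt").read_text()
matches = [m.group(0) for m in re.finditer(r"revenue[^.]*\.", text, re.I)]
print(matches)
```

Because the model sees whole files, it can follow up with further code: widening the search, cross-referencing a second document, or computing an aggregate before answering.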

The trade-off is latency (multiple tool calls vs. one vector search). The payoff is dramatically better reasoning — the model can filter, compare, calculate, and verify in ways that chunked retrieval cannot.

How It Works

The harness is a minimal agent loop: call the LLM, execute its code in a sandboxed container, feed the results back, repeat until the model has an answer.
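The loop above can be sketched in a few lines. This is a simplification under stated assumptions: `call_llm` stands in for the LLM client and `run_in_sandbox` for the sandboxed execution; both names, and the reply shape, are illustrative rather than the harness's actual API.

```python
# Minimal agent-loop sketch. call_llm and run_in_sandbox are hypothetical
# stand-ins for the real LLM client and the sandboxed code executor.
def agent_loop(question, call_llm, run_in_sandbox, max_turns=10):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = call_llm(messages)            # may contain text and/or a tool call
        messages.append({"role": "assistant", "content": reply["content"]})
        code = reply.get("tool_call")         # Python the model wants to run
        if code is None:
            return reply["content"]           # no tool call: final answer
        result = run_in_sandbox(code)         # execute in the sandboxed container
        messages.append({"role": "tool", "content": result})
    return "Turn limit reached"
```

The turn cap bounds cost and latency; everything else is just message passing between the model and the sandbox.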

Agent Loop

  • Model calls run_python to read files, search content, and compute results in an E2B cloud sandbox
  • Streaming delivers text deltas in real time via Server-Sent Events — answers type out live
  • Nudge logic ensures the model consults the workspace rather than answering from memory

Citations

  • The model cites evidence inline as [filename: "quoted passage"]
  • Server-side regex extracts citations and verifies each quote against the source file using three-tier fuzzy matching (exact → ellipsis segments → 5-word sliding window)
  • Citations render as clean superscript footnotes with a collapsible source list
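A minimal sketch of that three-tier matcher, assuming whitespace-normalized, case-insensitive comparison; the function names are illustrative, not the harness's actual API.

```python
import re

def _norm(s):
    # Collapse whitespace and lowercase so formatting differences don't
    # fail a legitimate quote (an assumption about the real normalization)
    return re.sub(r"\s+", " ", s).strip().lower()

def verify_quote(quote, source_text):
    src = _norm(source_text)
    q = _norm(quote)
    # Tier 1: exact (whitespace-insensitive) substring match
    if q in src:
        return "exact"
    # Tier 2: quote contains "..." -- every segment must appear, in order
    segments = [_norm(p) for p in re.split(r"\.\.\.|…", quote) if p.strip()]
    if len(segments) > 1:
        pos = 0
        for seg in segments:
            idx = src.find(seg, pos)
            if idx == -1:
                break
            pos = idx + len(seg)
        else:
            return "ellipsis"
    # Tier 3: any 5-word window of the quote found verbatim in the source
    words = q.split()
    for i in range(max(1, len(words) - 4)):
        if " ".join(words[i:i + 5]) in src:
            return "window"
    return None  # unverified citation
```

The cascade degrades gracefully: a lightly paraphrased quote can still be grounded by tier 3, while a fabricated one returns None and is flagged.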

Evaluation

  • Code-based assertions across 10 question categories (e.g. fact lookup, multi-doc synthesis, comparison, trend analysis, cited analysis)
  • Five diverse document workspaces (SEC filings, Federalist Papers, Sherlock Holmes, world data, Darwin)
  • Traces stored for every question — full tool call chain, token counts, citation match rates
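A code-based assertion against a stored trace might look like the following sketch. The trace fields shown are assumptions for illustration, not the harness's actual schema: the idea is to assert on substance and on grounding, not on exact wording.

```python
# Hypothetical eval check for a fact-lookup question; the trace dict
# fields ("final_answer", "citations", "matched") are illustrative.
def check_fact_lookup(trace):
    answer = trace["final_answer"]
    # Assert on the substance of the answer, not its exact phrasing
    assert "4.2" in answer, "wrong or missing figure"
    # Require at least one verified citation backing the claim
    verified = [c for c in trace["citations"] if c["matched"]]
    assert verified, "answer was not grounded in a source file"

trace = {
    "final_answer": "Revenue was $4.2M in 2023.",
    "citations": [{"file": "report.txt", "matched": True}],
}
check_fact_lookup(trace)
print("fact-lookup check passed")
```

Because each trace records the full tool call chain and citation match results, the same pattern extends to harder categories: a synthesis check can assert that more than one file was read, a cited-analysis check that every numeric claim has a verified quote.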

Architecture

The system spans three repos with a shared frontend:

  • Harness library (~1,300 lines Python) — agent loop, citation parsing, telemetry, trace viewer. Imported by both apps.
  • Assistant app — HuggingFace Space (Docker SDK) with persona prompt, curated workspace, daily rate limiting. Powers the chat on this site.
  • Explorer app — HuggingFace Space (Docker SDK) with file upload, server-side sessions, access token auth. Powers the Document Explorer.
  • Shared frontend — vanilla JS/CSS consumed by both apps. Markdown rendering, citation display, SSE streaming.

The harness is model-agnostic (via litellm), sandbox-agnostic (Docker or E2B), and has no opinions about how you display results. Citation processing is the only presentation-adjacent feature, and it returns structured data — the app decides how to render it.
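Model-agnosticism via litellm mostly means the model is just a string. A sketch of what that looks like, with a hypothetical `make_llm` wrapper (litellm's `completion(model=..., messages=...)` is the real entry point; the wrapper is an assumption):

```python
# Sketch: bind a provider/model string once, get back a plain callable.
# Swapping providers is then a config change, not a code change.
def make_llm(model, completion_fn):
    """Return a call_llm function bound to one litellm model string."""
    def call_llm(messages):
        resp = completion_fn(model=model, messages=messages)
        return resp["choices"][0]["message"]["content"]
    return call_llm

# In the real harness, roughly:
#   from litellm import completion
#   call_llm = make_llm("gpt-4o", completion)
# Any litellm-supported model string works in place of "gpt-4o".
```

The sandbox backend is swapped the same way: the loop only needs a "run this code, give me the output" callable, which Docker or E2B can each provide.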

Results

  • 99% citation match rate across 20 eval questions
  • 19 workspace files (docs + website pages)
  • ~7s average response time with streaming
  • Live on appsimple.io — the assistant on the homepage uses the harness to answer questions about Charles's work, citing curated documents and raw website pages
  • Open source — the harness library is publicly available for anyone to use with any LLM provider
  • Viable RAG alternative — demonstrates that code-as-tool preserves full reasoning capability while providing verifiable, cited answers

Explore the Project

Try the assistant, upload your own documents, or dig into the source code.