Working with LangGraph made me rethink evaluation.

- Execution tracing is a must
- Outcome modeling beats output matching
- Ground truth is elusive

We need new mental models for multi-step agents.
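
To make the "outcome vs. output" point concrete, here is a minimal sketch in plain Python. It does not use LangGraph's API: `Step`, `Trace`, `output_match`, and `outcome_check` are hypothetical names invented for illustration, and the flight-booking trace is made-up data. The idea is that a recorded trace lets you check what the agent did and what state it ended in, rather than comparing its final text to a reference string.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    node: str    # name of the graph node that ran (hypothetical)
    output: str  # what that node produced


@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)
    final_state: dict = field(default_factory=dict)


def output_match(trace: Trace, expected_text: str) -> bool:
    """Brittle check: pass only if the last output equals a reference string."""
    return bool(trace.steps) and trace.steps[-1].output == expected_text


def outcome_check(trace: Trace, required_nodes: set, state_predicate) -> bool:
    """Outcome check: pass if the required nodes ran and the final state is acceptable."""
    visited = {step.node for step in trace.steps}
    return required_nodes <= visited and state_predicate(trace.final_state)


# Made-up trace of a two-step agent run.
trace = Trace(
    steps=[
        Step("search", "found 3 flights"),
        Step("book", "Booked flight UA 12 for you!"),
    ],
    final_state={"booking_confirmed": True},
)

# The wording differs from the reference, so exact matching fails...
exact = output_match(trace, "Your flight UA 12 is booked.")

# ...while the outcome check passes: both nodes ran and the booking is confirmed.
outcome = outcome_check(
    trace,
    {"search", "book"},
    lambda state: state.get("booking_confirmed") is True,
)
```

Here `exact` is `False` and `outcome` is `True`. That is the gap the post is pointing at: with no single ground-truth string to compare against, the trace and the end state are what remain checkable.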