LangChain integration for CascadeFlow - Add intelligent cost optimization to your existing LangChain models without reconfiguration.
- 🔄 Zero Code Changes - Wrap your existing LangChain models, no refactoring needed
- 💰 Automatic Cost Optimization - Save 40-60% on LLM costs through intelligent cascading
- 🎯 Quality-Based Routing - Only escalate to expensive models when quality is insufficient
- 📊 Full Visibility - Track costs, quality scores, and cascade decisions
- 🔗 Chainable - All LangChain methods (`bind()`, `bindTools()`, etc.) work seamlessly
- 📈 LangSmith Ready - Automatic cost metadata injection for observability
- 🧭 Domain Policies - Per-domain threshold/routing overrides (`qualityThreshold`, `forceVerifier`, `directToVerifier`)
- 🔁 CascadeAgent - Built-in closed-loop tool agent for multi-turn execution with max-step protection
npm install @cascadeflow/langchain @langchain/core
# or
pnpm add @cascadeflow/langchain @langchain/core
# or
yarn add @cascadeflow/langchain @langchain/core

import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { withCascade } from '@cascadeflow/langchain';
// Step 1: Configure your existing models (no changes needed!)
const drafter = new ChatOpenAI({
model: 'gpt-5-mini', // Fast, cheap model ($0.25/$2 per 1M tokens)
temperature: 0.7
});
const verifier = new ChatAnthropic({
model: 'claude-opus-4-6', // Accurate, expensive model ($15/$75 per 1M tokens)
temperature: 0.7
});
// Step 2: Wrap with cascade (just 2 lines!)
const cascadeModel = withCascade({
drafter,
verifier,
qualityThreshold: 0.7, // Quality bar for accepting drafter responses
});
// Step 3: Use like any LangChain model!
const result = await cascadeModel.invoke("What is TypeScript?");
console.log(result.content);
// Step 4: Check cascade statistics
const stats = cascadeModel.getLastCascadeResult();
console.log(`Model used: ${stats.modelUsed}`);
console.log(`Cost: $${stats.totalCost.toFixed(6)}`);
console.log(`Savings: ${stats.savingsPercentage.toFixed(1)}%`);
// Optional: Enable LangSmith tracing (see traces at https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true
// Your ChatOpenAI/ChatAnthropic models will appear in LangSmith with cascade metadata

CascadeFlow uses speculative execution to optimize costs:
- Try Drafter First - Executes the cheap, fast model
- Quality Check - Validates the response quality using heuristics or custom validators
- Cascade if Needed - Only calls the expensive model if quality is below threshold
- Track Everything - Records costs, latency, and cascade decisions
This approach provides:
- ✅ No Latency Penalty - Drafter responses are instant when quality is high
- ✅ Quality Guarantee - Verifier ensures high-quality responses for complex queries
- ✅ Cost Savings - 40-60% reduction in API costs on average
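The four-step flow above can be sketched in plain TypeScript. Everything below (`ModelCall`, `cascadeOnce`, the fixed per-call costs) is an illustrative stub, not CascadeFlow's actual internals:

```typescript
// Minimal sketch of the cascade decision flow (illustrative stubs only).
type ModelCall = (prompt: string) => { text: string; cost: number };

function cascadeOnce(
  prompt: string,
  drafter: ModelCall,
  verifier: ModelCall,
  scoreQuality: (text: string) => number,
  qualityThreshold = 0.7,
) {
  // 1. Try the cheap drafter first.
  const draft = drafter(prompt);
  const quality = scoreQuality(draft.text);

  // 2. Accept the draft if it clears the quality bar.
  if (quality >= qualityThreshold) {
    return { text: draft.text, modelUsed: 'drafter', totalCost: draft.cost };
  }

  // 3. Otherwise pay for the expensive verifier as well.
  const verified = verifier(prompt);
  return {
    text: verified.text,
    modelUsed: 'verifier',
    totalCost: draft.cost + verified.cost,
  };
}

// Stub models with fixed per-call costs for demonstration.
const cheap: ModelCall = (p) => ({ text: `draft: ${p}`, cost: 0.0001 });
const expensive: ModelCall = (p) => ({ text: `verified: ${p}`, cost: 0.01 });

const accepted = cascadeOnce('What is TypeScript?', cheap, expensive, () => 0.9);
console.log(accepted.modelUsed); // 'drafter'

const escalated = cascadeOnce('Hard question', cheap, expensive, () => 0.3);
console.log(escalated.modelUsed); // 'verifier'
```

Note that an escalated query costs slightly more than a verifier-only call (drafter plus verifier), which is why savings depend on the drafter's acceptance rate.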
const cascadeModel = withCascade({
drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
verifier: new ChatAnthropic({ model: 'claude-opus-4-6' }),
qualityThreshold: 0.7, // Default: 0.7 (70%)
});

const cascadeModel = withCascade({
drafter,
verifier,
qualityValidator: async (response) => {
// Custom logic - return quality score 0-1
const text = response.generations[0].text;
// Example: Use length and keywords
const hasKeywords = ['typescript', 'javascript'].some(kw =>
text.toLowerCase().includes(kw)
);
return text.length > 50 && hasKeywords ? 0.9 : 0.4;
},
});

const cascadeModel = withCascade({
drafter,
verifier,
enableCostTracking: false, // Disable metadata injection
});

Use domain-specific routing rules without changing your chain code:
const cascadeModel = withCascade({
drafter,
verifier,
qualityThreshold: 0.7,
domainPolicies: {
finance: { qualityThreshold: 0.5 }, // Easier acceptance for finance queries
medical: { forceVerifier: true }, // Always verify after drafting
legal: { directToVerifier: true }, // Skip drafter entirely
},
});
const legalCascade = cascadeModel.bind({
metadata: { cascadeflow_domain: "legal" },
});
const result = await legalCascade.invoke("Review this contract clause");

CascadeFlow supports real-time streaming with optimistic drafter execution:
const cascade = withCascade({
drafter: new ChatOpenAI({ model: 'gpt-4o-mini' }),
verifier: new ChatOpenAI({ model: 'gpt-4o' }),
});
// Stream responses in real-time
const stream = await cascade.stream('Explain TypeScript');
for await (const chunk of stream) {
process.stdout.write(chunk.content);
}

How Streaming Works:
- Optimistic Streaming (text-only) - Drafter response streams immediately (user sees output in real-time)
- Quality Check - After drafter completes, quality is validated
- Optional Cascade - If quality is insufficient, verifier output is streamed; switch notices are off by default and can be enabled via metadata (`cascadeflow_emit_switch_message`)
- Tool-safe Streaming - When tools are bound with `bindTools(...)`, output is buffered until final routing so tool-call deltas stay consistent
This provides the best user experience with no perceived latency for queries the drafter can handle.
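The optimistic-streaming behavior described above can be sketched with a toy async generator. The chunk arrays stand in for real model streams; this is an illustration, not the library's implementation:

```typescript
// Sketch of optimistic streaming: drafter chunks are yielded immediately,
// and verifier output is appended only if the draft falls short.
// (Illustrative only - CascadeFlow's real implementation differs.)
async function* optimisticStream(
  drafterChunks: string[],
  verifierChunks: string[],
  scoreQuality: (fullText: string) => number,
  qualityThreshold = 0.7,
): AsyncGenerator<string> {
  let draft = '';
  for (const chunk of drafterChunks) {
    draft += chunk;
    yield chunk; // user sees drafter output in real time
  }
  // After the drafter finishes, validate the full draft.
  if (scoreQuality(draft) < qualityThreshold) {
    for (const chunk of verifierChunks) {
      yield chunk; // cascade: stream the verifier's answer
    }
  }
}

async function collect(gen: AsyncGenerator<string>): Promise<string[]> {
  const out: string[] = [];
  for await (const c of gen) out.push(c);
  return out;
}

collect(optimisticStream(['a', 'b'], ['X', 'Y'], () => 0.9)).then((r) =>
  console.log(r.join('')), // prints "ab" (drafter only)
);
```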
All LangChain chainable methods work seamlessly:
const cascadeModel = withCascade({ drafter, verifier });
// bind() works
const boundModel = cascadeModel.bind({ temperature: 0.1 });
const result = await boundModel.invoke("Be precise");
// Chain multiple times
const doubleChained = cascadeModel
.bind({ temperature: 0.5 })
  .bind({ maxTokens: 100 });

const tools = [
{
name: 'calculator',
description: 'Useful for math calculations',
func: async (input: string) => {
// Use a strict parser helper (see examples/nodejs/safe-math.ts).
return safeCalculateExpression(input).toString();
},
},
];
const modelWithTools = cascadeModel.bindTools(tools);
const result = await modelWithTools.invoke("What is 25 * 4?");

const schema = {
name: 'person',
schema: {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'number' },
},
},
};
const structuredModel = cascadeModel.withStructuredOutput(schema);
const result = await structuredModel.invoke("Extract: John is 30 years old");
// Result is typed according to schema

CascadeAgent adds a closed agent/tool loop with explicit max-step safety:
import { CascadeAgent, withCascade } from '@cascadeflow/langchain';
const cascadeModel = withCascade({
drafter,
verifier,
domainPolicies: {
legal: { directToVerifier: true },
medical: { forceVerifier: true },
},
});
const agent = new CascadeAgent({
model: cascadeModel.bindTools(tools),
maxSteps: 6,
toolHandlers: {
// Use the same strict parser helper (see examples/nodejs/safe-math.ts).
calculator: async ({ expression }) => safeCalculateExpression(expression).toString(),
},
});
const run = await agent.run(
[{ role: 'user', content: 'What is (25 * 4) + 10?' }],
{ systemPrompt: 'You are a precise calculator assistant.' }
);
console.log(run.status, run.steps, run.message.content);

Input can be a string, LangChain BaseMessage[], or role/content message list for multi-turn conversations.
CascadeAgent also supports multi-tool calls in a single step and keeps system prompts at the front of looped executions.
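The closed model/tool loop with a max-step cap can be sketched as follows. The `ModelStep` type, toy model, and handler are hypothetical stand-ins for real LangChain calls, not CascadeAgent's implementation:

```typescript
// Toy sketch of a closed model/tool loop with a max-step safety cap.
type AgentMessage = { role: 'user' | 'assistant' | 'tool'; content: string };
type ModelStep = (
  history: AgentMessage[],
) => { content: string; toolCall?: { name: string; args: string } };

function runLoop(
  model: ModelStep,
  toolHandlers: Record<string, (args: string) => string>,
  messages: AgentMessage[],
  maxSteps = 8,
) {
  let steps = 0;
  while (steps < maxSteps) {
    steps++;
    const turn = model(messages);
    messages.push({ role: 'assistant', content: turn.content });
    if (!turn.toolCall) {
      // Model answered directly: loop is complete.
      return { status: 'completed' as const, steps, messages };
    }
    // Execute the requested tool and feed the result back to the model.
    const result = toolHandlers[turn.toolCall.name](turn.toolCall.args);
    messages.push({ role: 'tool', content: result });
  }
  return { status: 'max_steps_reached' as const, steps, messages };
}

// Example: the toy model calls a tool once, then answers.
let called = false;
const toyModel: ModelStep = () => {
  if (!called) {
    called = true;
    return { content: '', toolCall: { name: 'calculator', args: '25*4' } };
  }
  return { content: 'The answer is 100.' };
};

const run = runLoop(
  toyModel,
  { calculator: () => '100' }, // toy handler; use a safe parser in practice
  [{ role: 'user', content: 'What is 25 * 4?' }],
);
console.log(run.status, run.steps); // 'completed' 2
```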
const result = await cascadeModel.invoke("Complex question");
const stats = cascadeModel.getLastCascadeResult();
console.log({
content: stats.content,
modelUsed: stats.modelUsed, // 'drafter' or 'verifier'
accepted: stats.accepted, // Was drafter response accepted?
drafterQuality: stats.drafterQuality, // 0-1 quality score
drafterCost: stats.drafterCost, // $ spent on drafter
verifierCost: stats.verifierCost, // $ spent on verifier
totalCost: stats.totalCost, // Total $ spent
savingsPercentage: stats.savingsPercentage, // % saved vs verifier-only
latencyMs: stats.latencyMs, // Total latency in ms
});

CascadeFlow works seamlessly with LangSmith for observability and cost tracking.
When you enable LangSmith tracing, you'll see:
- Your Actual Chat Models - ChatOpenAI, ChatAnthropic, etc. appear as separate traces
- Cascade Metadata - Decision info attached to each response
- Token Usage & Costs - Server-side calculation by LangSmith
- Nested Traces - Parent CascadeFlow trace with child model traces
// Set environment variables
process.env.LANGSMITH_API_KEY = 'lsv2_pt_...';
process.env.LANGSMITH_PROJECT = 'your-project';
process.env.LANGSMITH_TRACING = 'true';
// Use CascadeFlow normally - tracing happens automatically
const cascade = withCascade({
drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
verifier: new ChatAnthropic({ model: 'claude-opus-4-6' }),
costTrackingProvider: 'cascadeflow', // Default (local pricing)
});
const result = await cascade.invoke("Your query");

In your LangSmith dashboard (https://smith.langchain.com):
- For cascaded queries - You'll see only the drafter model trace (e.g., ChatOpenAI with gpt-5-mini)
- For escalated queries - You'll see BOTH drafter AND verifier traces (e.g., ChatOpenAI gpt-5-mini + ChatAnthropic claude-opus-4-6)
- Metadata location - Click any trace → Outputs → response_metadata → cascade
{
"cascade": {
"cascade_decision": "cascaded",
"model_used": "drafter",
"drafter_quality": 0.85,
"savings_percentage": 66.7,
"drafter_cost": 0, // Calculated by LangSmith
"verifier_cost": 0, // Calculated by LangSmith
"total_cost": 0 // Calculated by LangSmith
}
}

Note: `costTrackingProvider: 'cascadeflow'` (default) computes costs locally using CascadeFlow's pricebook. If you use `costTrackingProvider: 'langsmith'`, costs are calculated server-side and shown in the LangSmith UI (local cost values will be $0).
See docs/COST_TRACKING.md for more details on cost tracking options.
Works with any LangChain-compatible chat model:
import { ChatOpenAI } from '@langchain/openai';
const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatOpenAI({ model: 'gpt-5' });

import { ChatAnthropic } from '@langchain/anthropic';
const drafter = new ChatAnthropic({ model: 'claude-haiku-4-5-20251001' });
const verifier = new ChatAnthropic({ model: 'claude-opus-4-6' });

// Use different providers for optimal cost/quality balance!
const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatAnthropic({ model: 'claude-opus-4-6' });

- Choose Your Drafter Wisely - Use the cheapest model that can handle most queries
- GPT-5-mini: $0.25/$2.00 per 1M tokens (input/output)
- GPT-4o-mini: $0.15/$0.60 per 1M tokens (input/output)
- Claude Haiku 4.5: $0.80/$4.00 per 1M tokens
- Tune Quality Threshold - Higher threshold = more cascades = higher cost but better quality
  - `0.6` - Aggressive cost savings, may sacrifice some quality
  - `0.7` - Balanced (recommended default)
  - `0.8` - Conservative, ensures high quality
- Use Custom Validators - Domain-specific validation can improve accuracy

qualityValidator: (response) => {
  const text = response.generations[0].text;
  // Check for specific requirements
  return hasRelevantKeywords(text) && meetsLengthRequirement(text) ? 0.9 : 0.5;
}
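The threshold's effect on cascade volume is easy to reason about: the escalation rate is just the share of drafter quality scores that fall below the bar. A toy sweep over an assumed sample of scores:

```typescript
// Illustrative threshold sweep: the fraction of queries that escalate is
// the share of drafter quality scores below the threshold.
function escalationRate(qualityScores: number[], threshold: number): number {
  const escalated = qualityScores.filter((q) => q < threshold).length;
  return escalated / qualityScores.length;
}

// Assumed sample of drafter quality scores (not real measurements):
const scores = [0.55, 0.62, 0.68, 0.72, 0.78, 0.81, 0.85, 0.9, 0.93, 0.95];

console.log(escalationRate(scores, 0.6)); // 0.1
console.log(escalationRate(scores, 0.7)); // 0.3
console.log(escalationRate(scores, 0.8)); // 0.5
```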
Typical cascade behavior:
| Query Type | Drafter Hit Rate | Avg Latency | Cost Savings |
|---|---|---|---|
| Simple Q&A | 85% | 500ms | 55-65% |
| Complex reasoning | 40% | 1200ms | 20-30% |
| Code generation | 60% | 800ms | 35-45% |
| Overall | 70% | 700ms | 40-60% |
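The savings column follows from the hit rate and the relative model prices: a fraction of queries pay only the drafter, and the rest pay drafter plus verifier. A back-of-the-envelope model, with assumed per-query costs:

```typescript
// Back-of-the-envelope blended savings: with hit rate h, a fraction h of
// queries pay only the drafter, the rest pay drafter + verifier.
// Per-query costs below are assumed for illustration.
function blendedSavings(
  hitRate: number,
  drafterCost: number,
  verifierCost: number,
): number {
  const expectedCost =
    hitRate * drafterCost + (1 - hitRate) * (drafterCost + verifierCost);
  return ((verifierCost - expectedCost) / verifierCost) * 100;
}

// 70% hit rate with a drafter ~5x cheaper than the verifier:
console.log(blendedSavings(0.7, 0.002, 0.01).toFixed(0)); // "50"
```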
Full TypeScript support with type inference:
import type { CascadeConfig, CascadeResult } from '@cascadeflow/langchain';
const config: CascadeConfig = {
drafter,
verifier,
qualityThreshold: 0.7,
};
const stats: CascadeResult | undefined = cascadeModel.getLastCascadeResult();

See the examples directory for complete working examples:
- basic-usage.ts - Getting started guide
- streaming-cascade.ts - Real-time streaming with optimistic drafter execution
- lcel-pipeline.ts - LCEL runnable composition with CascadeFlow
- tool-risk-gating.ts - Tool-call acceptance + high-risk verifier gating
- langgraph-multi-agent.ts - Optional LangGraph multi-agent pattern
Creates a cascade-wrapped LangChain model.
Parameters:
- `config.drafter` - The cheap, fast model
- `config.verifier` - The accurate, expensive model
- `config.qualityThreshold?` - Minimum quality to accept drafter (default: 0.7)
- `config.qualityValidator?` - Custom function to calculate quality
- `config.enableCostTracking?` - Enable LangSmith metadata injection (default: true)
- `config.costTrackingProvider?` - `'cascadeflow'` (default, local pricing) or `'langsmith'` (server-side)
- `config.domainPolicies?` - Per-domain overrides: `qualityThreshold`, `forceVerifier`, `directToVerifier`
Returns: CascadeFlow - A LangChain-compatible model with cascade logic
Creates a closed-loop agent around a LangChain model (or directly from cascade config).
Parameters:
- `config.model?` - Any LangChain chat model (often `withCascade(...).bindTools(...)`)
- `config.cascade?` - Optional `CascadeConfig` used to create an internal `CascadeFlow`
- `config.maxSteps?` - Loop safety cap (default: `8`)
- `config.toolHandlers?` - Tool name to handler map
Runs model/tool/model loops until completion or maxSteps is reached.
Returns: CascadeAgentRunResult with:
- `message` - Final `AIMessage`
- `messages` - Full message history (including tool messages)
- `steps` - Executed model turns
- `status` - `'completed' | 'max_steps_reached'`
- `toolCalls` - Collected tool calls across steps
Returns statistics from the last cascade execution.
Returns: CascadeResult with:
- `content` - The final response text
- `modelUsed` - Which model provided the response (`'drafter' | 'verifier'`)
- `accepted` - Whether drafter response was accepted
- `drafterQuality` - Quality score of drafter response (0-1)
- `drafterCost` - Cost of drafter call
- `verifierCost` - Cost of verifier call (0 if not used)
- `totalCost` - Total cost
- `savingsPercentage` - Percentage saved vs verifier-only
- `latencyMs` - Total latency in milliseconds
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
MIT © Lemony Inc.
- @cascadeflow/core - Core CascadeFlow Python library
- LangChain - Framework for LLM applications
- LangSmith - LLM observability platform