@cascadeflow/langchain

LangChain integration for CascadeFlow - Add intelligent cost optimization to your existing LangChain models without reconfiguration.

Features

  • 🔄 Zero Code Changes - Wrap your existing LangChain models, no refactoring needed
  • 💰 Automatic Cost Optimization - Save 40-60% on LLM costs through intelligent cascading
  • 🎯 Quality-Based Routing - Only escalate to expensive models when quality is insufficient
  • 📊 Full Visibility - Track costs, quality scores, and cascade decisions
  • 🔗 Chainable - All LangChain methods (bind(), bindTools(), etc.) work seamlessly
  • 📈 LangSmith Ready - Automatic cost metadata injection for observability
  • 🧭 Domain Policies - Per-domain threshold/routing overrides (qualityThreshold, forceVerifier, directToVerifier)
  • 🔁 CascadeAgent - Built-in closed-loop tool agent for multi-turn execution with max-step protection

Installation

npm install @cascadeflow/langchain @langchain/core
# or
pnpm add @cascadeflow/langchain @langchain/core
# or
yarn add @cascadeflow/langchain @langchain/core

Quick Start

import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { withCascade } from '@cascadeflow/langchain';

// Step 1: Configure your existing models (no changes needed!)
const drafter = new ChatOpenAI({
  model: 'gpt-5-mini',  // Fast, cheap model ($0.25/$2 per 1M tokens)
  temperature: 0.7
});

const verifier = new ChatAnthropic({
  model: 'claude-opus-4-6',  // Accurate, expensive model ($15/$75 per 1M tokens)
  temperature: 0.7
});

// Step 2: Wrap with cascade (just 2 lines!)
const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityThreshold: 0.7,  // Quality bar for accepting drafter responses
});

// Step 3: Use like any LangChain model!
const result = await cascadeModel.invoke("What is TypeScript?");
console.log(result.content);

// Step 4: Check cascade statistics
const stats = cascadeModel.getLastCascadeResult();
console.log(`Model used: ${stats.modelUsed}`);
console.log(`Cost: $${stats.totalCost.toFixed(6)}`);
console.log(`Savings: ${stats.savingsPercentage.toFixed(1)}%`);

// Optional: Enable LangSmith tracing (see traces at https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true
// Your ChatOpenAI/ChatAnthropic models will appear in LangSmith with cascade metadata

How It Works

CascadeFlow uses speculative execution to optimize costs:

  1. Try Drafter First - Executes the cheap, fast model
  2. Quality Check - Validates the response quality using heuristics or custom validators
  3. Cascade if Needed - Only calls the expensive model if quality is below threshold
  4. Track Everything - Records costs, latency, and cascade decisions

This approach provides:

  • No Latency Penalty - Drafter responses are instant when quality is high
  • Quality Guarantee - Verifier ensures high-quality responses for complex queries
  • Cost Savings - 40-60% reduction in API costs on average
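
In code, the decision loop reduces to a few steps. Here is a minimal sketch (drafter, verifier, and qualityThreshold mirror the config above; scoreQuality is a hypothetical stand-in for the built-in heuristics or a custom validator, not the package's actual internals):

// Conceptual sketch of the cascade decision, not the real implementation.
async function cascadeInvoke(input: string) {
  const draft = await drafter.invoke(input);   // 1. Run the cheap model first
  const quality = await scoreQuality(draft);   // 2. Heuristic or custom validator score (0-1)
  if (quality >= qualityThreshold) {
    return draft;                              // 3a. Accepted: the verifier is never called
  }
  return verifier.invoke(input);               // 3b. Below threshold: escalate to the expensive model
}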

Configuration

Basic Configuration

const cascadeModel = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
  verifier: new ChatAnthropic({ model: 'claude-opus-4-6' }),
  qualityThreshold: 0.7,  // Default: 0.7 (70%)
});

Custom Quality Validator

const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityValidator: async (response) => {
    // Custom logic - return quality score 0-1
    const text = response.generations[0].text;

    // Example: Use length and keywords
    const hasKeywords = ['typescript', 'javascript'].some(kw =>
      text.toLowerCase().includes(kw)
    );

    return text.length > 50 && hasKeywords ? 0.9 : 0.4;
  },
});

Disable Cost Tracking

const cascadeModel = withCascade({
  drafter,
  verifier,
  enableCostTracking: false,  // Disable metadata injection
});

Domain Policies

Use domain-specific routing rules without changing your chain code:

const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityThreshold: 0.7,
  domainPolicies: {
    finance: { qualityThreshold: 0.5 }, // Easier acceptance for finance queries
    medical: { forceVerifier: true }, // Always verify after drafting
    legal: { directToVerifier: true }, // Skip drafter entirely
  },
});

const legalCascade = cascadeModel.bind({
  metadata: { cascadeflow_domain: "legal" },
});

const result = await legalCascade.invoke("Review this contract clause");

Advanced Usage

Streaming Responses

CascadeFlow supports real-time streaming with optimistic drafter execution:

const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-4o-mini' }),
  verifier: new ChatOpenAI({ model: 'gpt-4o' }),
});

// Stream responses in real-time
const stream = await cascade.stream('Explain TypeScript');

for await (const chunk of stream) {
  process.stdout.write(chunk.content);
}

How Streaming Works:

  1. Optimistic Streaming (text-only) - Drafter response streams immediately (user sees output in real-time)
  2. Quality Check - After drafter completes, quality is validated
  3. Optional Cascade - If quality is insufficient, verifier output is streamed; switch notices are off by default and can be enabled via metadata (cascadeflow_emit_switch_message), as shown after this list
  4. Tool-safe Streaming - When tools are bound with bindTools(...), output is buffered until final routing so tool-call deltas stay consistent

This provides the best user experience with no perceived latency for queries the drafter can handle.
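
To surface a notice when the stream switches from drafter to verifier, the flag named above can be passed as metadata. A sketch, assuming bind() metadata is the delivery mechanism (mirroring the domain-policy pattern shown earlier; only the flag name comes from this doc):

// Opt in to the switch notice (off by default).
const verboseCascade = cascade.bind({
  metadata: { cascadeflow_emit_switch_message: true },
});

const stream = await verboseCascade.stream('Explain TypeScript generics');
for await (const chunk of stream) {
  process.stdout.write(chunk.content);
}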

Chaining with bind()

All LangChain chainable methods work seamlessly:

const cascadeModel = withCascade({ drafter, verifier });

// bind() works
const boundModel = cascadeModel.bind({ temperature: 0.1 });
const result = await boundModel.invoke("Be precise");

// Chain multiple times
const doubleChained = cascadeModel
  .bind({ temperature: 0.5 })
  .bind({ maxTokens: 100 });

Tool Calling

const tools = [
  {
    name: 'calculator',
    description: 'Useful for math calculations',
    func: async (input: string) => {
      // Use a strict parser helper (see examples/nodejs/safe-math.ts).
      return safeCalculateExpression(input).toString();
    },
  },
];

const modelWithTools = cascadeModel.bindTools(tools);
const result = await modelWithTools.invoke("What is 25 * 4?");
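
The calculator above deliberately avoids eval(). As a rough illustration of what a strict helper like safeCalculateExpression might look like (hypothetical; the actual helper ships in examples/nodejs/safe-math.ts), here is a minimal recursive-descent evaluator for + - * / and parentheses:

// Hypothetical strict evaluator: digits, + - * / and parentheses only.
// No eval(), no unary minus; anything outside the grammar throws.
function safeCalculateExpression(input: string): number {
  const src = input.replace(/\s+/g, '');
  let pos = 0;

  function parseExpr(): number {           // expr := term (('+' | '-') term)*
    let value = parseTerm();
    while (src[pos] === '+' || src[pos] === '-') {
      const op = src[pos++];
      const rhs = parseTerm();
      value = op === '+' ? value + rhs : value - rhs;
    }
    return value;
  }

  function parseTerm(): number {           // term := factor (('*' | '/') factor)*
    let value = parseFactor();
    while (src[pos] === '*' || src[pos] === '/') {
      const op = src[pos++];
      const rhs = parseFactor();
      value = op === '*' ? value * rhs : value / rhs;
    }
    return value;
  }

  function parseFactor(): number {         // factor := number | '(' expr ')'
    if (src[pos] === '(') {
      pos++;                               // consume '('
      const value = parseExpr();
      if (src[pos++] !== ')') throw new Error('Unbalanced parentheses');
      return value;
    }
    const match = /^\d+(\.\d+)?/.exec(src.slice(pos));
    if (!match) throw new Error(`Unexpected token at position ${pos}`);
    pos += match[0].length;
    return Number(match[0]);
  }

  const result = parseExpr();
  if (pos !== src.length) throw new Error('Trailing characters after expression');
  return result;
}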

Structured Output

const schema = {
  name: 'person',
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      age: { type: 'number' },
    },
  },
};

const structuredModel = cascadeModel.withStructuredOutput(schema);
const result = await structuredModel.invoke("Extract: John is 30 years old");
// Result is typed according to schema
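
LangChain chat models also accept Zod schemas in withStructuredOutput; assuming the cascade wrapper forwards the call unchanged (as the chainable-methods claim above suggests), a typed variant could look like this:

import { z } from 'zod';

const personSchema = z.object({
  name: z.string(),
  age: z.number(),
});

const typedModel = cascadeModel.withStructuredOutput(personSchema);
const person = await typedModel.invoke('Extract: John is 30 years old');
// person is inferred as { name: string; age: number }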

Agentic Tool Loops (CascadeAgent)

CascadeAgent adds a closed agent/tool loop with explicit max-step safety:

import { CascadeAgent, withCascade } from '@cascadeflow/langchain';

const cascadeModel = withCascade({
  drafter,
  verifier,
  domainPolicies: {
    legal: { directToVerifier: true },
    medical: { forceVerifier: true },
  },
});

const agent = new CascadeAgent({
  model: cascadeModel.bindTools(tools),
  maxSteps: 6,
  toolHandlers: {
    // Use the same strict parser helper (see examples/nodejs/safe-math.ts).
    calculator: async ({ expression }) => safeCalculateExpression(expression).toString(),
  },
});

const run = await agent.run(
  [{ role: 'user', content: 'What is (25 * 4) + 10?' }],
  { systemPrompt: 'You are a precise calculator assistant.' }
);

console.log(run.status, run.steps, run.message.content);

Input can be a string, LangChain BaseMessage[], or role/content message list for multi-turn conversations. CascadeAgent also supports multi-tool calls in a single step and keeps system prompts at the front of looped executions.
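
For example, a prior exchange can be replayed as a role/content list ahead of the new question (building on the agent defined above):

// Multi-turn input: earlier exchange plus a follow-up.
const followUp = await agent.run([
  { role: 'user', content: 'What is 25 * 4?' },
  { role: 'assistant', content: '25 * 4 = 100.' },
  { role: 'user', content: 'Now add 10 to that.' },
]);

console.log(followUp.status, followUp.message.content);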

Accessing Cascade Statistics

const result = await cascadeModel.invoke("Complex question");

const stats = cascadeModel.getLastCascadeResult();
console.log({
  content: stats.content,
  modelUsed: stats.modelUsed,  // 'drafter' or 'verifier'
  accepted: stats.accepted,  // Was drafter response accepted?
  drafterQuality: stats.drafterQuality,  // 0-1 quality score
  drafterCost: stats.drafterCost,  // $ spent on drafter
  verifierCost: stats.verifierCost,  // $ spent on verifier
  totalCost: stats.totalCost,  // Total $ spent
  savingsPercentage: stats.savingsPercentage,  // % saved vs verifier-only
  latencyMs: stats.latencyMs,  // Total latency in ms
});

LangSmith Integration

CascadeFlow works seamlessly with LangSmith for observability and cost tracking.

What You'll See in LangSmith

When you enable LangSmith tracing, you'll see:

  1. Your Actual Chat Models - ChatOpenAI, ChatAnthropic, etc. appear as separate traces
  2. Cascade Metadata - Decision info attached to each response
  3. Token Usage & Costs - Server-side calculation by LangSmith
  4. Nested Traces - Parent CascadeFlow trace with child model traces

Enabling LangSmith

// Set environment variables
process.env.LANGSMITH_API_KEY = 'lsv2_pt_...';
process.env.LANGSMITH_PROJECT = 'your-project';
process.env.LANGSMITH_TRACING = 'true';

// Use CascadeFlow normally - tracing happens automatically
const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
  verifier: new ChatAnthropic({ model: 'claude-opus-4-6' }),
  costTrackingProvider: 'cascadeflow', // Default (local pricing)
});

const result = await cascade.invoke("Your query");

Viewing Traces

In your LangSmith dashboard (https://smith.langchain.com):

  • For cascaded queries - You'll see only the drafter model trace (e.g., ChatOpenAI with gpt-5-mini)
  • For escalated queries - You'll see BOTH drafter AND verifier traces (e.g., ChatOpenAI gpt-5-mini + ChatAnthropic claude-opus-4-6)
  • Metadata location - Click any trace → Outputs → response_metadata → cascade

Example Metadata

{
  "cascade": {
    "cascade_decision": "cascaded",
    "model_used": "drafter",
    "drafter_quality": 0.85,
    "savings_percentage": 66.7,
    "drafter_cost": 0,      // Calculated by LangSmith
    "verifier_cost": 0,     // Calculated by LangSmith
    "total_cost": 0         // Calculated by LangSmith
  }
}

Note: costTrackingProvider: 'cascadeflow' (default) computes costs locally using CascadeFlow's pricebook. If you use costTrackingProvider: 'langsmith', costs are calculated server-side and shown in the LangSmith UI (local cost values will be $0).
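
For example, to defer all cost math to LangSmith:

// Costs are computed server-side; local cost fields will read $0.
const cascade = withCascade({
  drafter,
  verifier,
  costTrackingProvider: 'langsmith',
});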

See docs/COST_TRACKING.md for more details on cost tracking options.

Supported Models

Works with any LangChain-compatible chat model:

OpenAI

import { ChatOpenAI } from '@langchain/openai';

const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatOpenAI({ model: 'gpt-5' });

Anthropic

import { ChatAnthropic } from '@langchain/anthropic';

const drafter = new ChatAnthropic({ model: 'claude-haiku-4-5-20251001' });
const verifier = new ChatAnthropic({ model: 'claude-opus-4-6' });

Mix and Match (Recommended)

// Use different providers for optimal cost/quality balance!
const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatAnthropic({ model: 'claude-opus-4-6' });

Cost Optimization Tips

  1. Choose Your Drafter Wisely - Use the cheapest model that can handle most queries

    • GPT-5-mini: $0.25/$2.00 per 1M tokens (input/output)
    • GPT-4o-mini: $0.15/$0.60 per 1M tokens (input/output)
    • Claude Haiku 4.5: $0.80/$4.00 per 1M tokens
  2. Tune Quality Threshold - Higher threshold = more cascades = higher cost but better quality

    • 0.6 - Aggressive cost savings, may sacrifice some quality
    • 0.7 - Balanced (recommended default)
    • 0.8 - Conservative, ensures high quality
  3. Use Custom Validators - Domain-specific validation can improve accuracy

    qualityValidator: async (response) => {
      const text = response.generations[0].text;
      // Inline checks standing in for domain-specific helpers
      const hasRelevantKeywords = /typescript|javascript/i.test(text);
      const meetsLengthRequirement = text.length > 50;
      return hasRelevantKeywords && meetsLengthRequirement ? 0.9 : 0.5;
    }

Performance

Typical cascade behavior:

Query Type         Drafter Hit Rate   Avg Latency   Cost Savings
Simple Q&A         85%                500ms         55-65%
Complex reasoning  40%                1200ms        20-30%
Code generation    60%                800ms         35-45%
Overall            70%                700ms         40-60%

TypeScript Support

Full TypeScript support with type inference:

import type { CascadeConfig, CascadeResult } from '@cascadeflow/langchain';

const config: CascadeConfig = {
  drafter,
  verifier,
  qualityThreshold: 0.7,
};

const stats: CascadeResult | undefined = cascadeModel.getLastCascadeResult();

Examples

See the examples directory for complete working examples.

API Reference

withCascade(config: CascadeConfig): CascadeFlow

Creates a cascade-wrapped LangChain model.

Parameters:

  • config.drafter - The cheap, fast model
  • config.verifier - The accurate, expensive model
  • config.qualityThreshold? - Minimum quality to accept drafter (default: 0.7)
  • config.qualityValidator? - Custom function to calculate quality
  • config.enableCostTracking? - Enable LangSmith metadata injection (default: true)
  • config.costTrackingProvider? - 'cascadeflow' (default, local pricing) or 'langsmith' (server-side)
  • config.domainPolicies? - Per-domain overrides: qualityThreshold, forceVerifier, directToVerifier

Returns: CascadeFlow - A LangChain-compatible model with cascade logic
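
Putting these options together (values are illustrative):

const model = withCascade({
  drafter,                                // cheap, fast model
  verifier,                               // accurate, expensive model
  qualityThreshold: 0.75,                 // accept drafter at quality >= 0.75
  qualityValidator: async (response) =>   // optional custom scorer (0-1)
    response.generations[0].text.length > 40 ? 0.9 : 0.3,
  enableCostTracking: true,               // inject LangSmith metadata (default)
  costTrackingProvider: 'cascadeflow',    // local pricebook (default)
  domainPolicies: { legal: { directToVerifier: true } },
});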

new CascadeAgent(config: CascadeAgentConfig)

Creates a closed-loop agent around a LangChain model (or directly from cascade config).

Parameters:

  • config.model? - Any LangChain chat model (often withCascade(...).bindTools(...))
  • config.cascade? - Optional CascadeConfig used to create an internal CascadeFlow
  • config.maxSteps? - Loop safety cap (default: 8)
  • config.toolHandlers? - Tool name to handler map
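
Since config.cascade accepts a CascadeConfig directly, the explicit withCascade() step can be skipped. A sketch:

// The agent builds its internal CascadeFlow from the cascade config.
const agent = new CascadeAgent({
  cascade: { drafter, verifier, qualityThreshold: 0.7 },
  maxSteps: 4,   // tighter cap than the default of 8
});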

CascadeAgent.run(input, options?): Promise<CascadeAgentRunResult>

Runs model/tool/model loops until completion or maxSteps is reached.

Returns: CascadeAgentRunResult with:

  • message - Final AIMessage
  • messages - Full message history (including tool messages)
  • steps - Executed model turns
  • status - 'completed' | 'max_steps_reached'
  • toolCalls - Collected tool calls across steps

CascadeFlow.getLastCascadeResult(): CascadeResult | undefined

Returns statistics from the last cascade execution.

Returns: CascadeResult with:

  • content - The final response text
  • modelUsed - Which model provided the response ('drafter' | 'verifier')
  • accepted - Whether drafter response was accepted
  • drafterQuality - Quality score of drafter response (0-1)
  • drafterCost - Cost of drafter call
  • verifierCost - Cost of verifier call (0 if not used)
  • totalCost - Total cost
  • savingsPercentage - Percentage saved vs verifier-only
  • latencyMs - Total latency in milliseconds

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.

License

MIT © Lemony Inc.

Related