LangChain integration for CascadeFlow - add intelligent cost optimization to your existing LangChain models without reconfiguration.

```bash
npm install @cascadeflow/langchain
```
- Zero Code Changes - Wrap your existing LangChain models, no refactoring needed
- Automatic Cost Optimization - Save 40-60% on LLM costs through intelligent cascading
- Quality-Based Routing - Only escalate to expensive models when quality is insufficient
- Full Visibility - Track costs, quality scores, and cascade decisions
- Chainable - All LangChain methods (bind(), bindTools(), etc.) work seamlessly
- LangSmith Ready - Automatic cost metadata injection for observability
```bash
npm install @cascadeflow/langchain @langchain/core
# or
pnpm add @cascadeflow/langchain @langchain/core
# or
yarn add @cascadeflow/langchain @langchain/core
```
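The wrapped models authenticate exactly as they would without CascadeFlow. A minimal sketch for the quick start below, assuming the default provider environment variables used by @langchain/openai and @langchain/anthropic:

```typescript
// Standard provider credentials; CascadeFlow itself needs no extra keys.
// Shown inline for illustration - in practice, export these in your shell
// or load them from a .env file.
process.env.OPENAI_API_KEY = 'sk-...';
process.env.ANTHROPIC_API_KEY = 'sk-ant-...';
```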
```typescript
import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { withCascade } from '@cascadeflow/langchain';

// Step 1: Configure your existing models (no changes needed!)
const drafter = new ChatOpenAI({
  model: 'gpt-5-mini', // Fast, cheap model ($0.25/$2 per 1M tokens)
  temperature: 0.7
});

const verifier = new ChatAnthropic({
  model: 'claude-sonnet-4-5', // Accurate, expensive model ($3/$15 per 1M tokens)
  temperature: 0.7
});

// Step 2: Wrap with cascade (just 2 lines!)
const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityThreshold: 0.7, // Quality bar for accepting drafter responses
});

// Step 3: Use like any LangChain model!
const result = await cascadeModel.invoke("What is TypeScript?");
console.log(result.content);

// Step 4: Check cascade statistics
const stats = cascadeModel.getLastCascadeResult();
console.log(`Model used: ${stats.modelUsed}`);
console.log(`Cost: $${stats.totalCost.toFixed(6)}`);
console.log(`Savings: ${stats.savingsPercentage.toFixed(1)}%`);

// Optional: Enable LangSmith tracing (see traces at https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true
// Your ChatOpenAI/ChatAnthropic models will appear in LangSmith with cascade metadata
```
CascadeFlow uses speculative execution to optimize costs:
1. Try Drafter First - Executes the cheap, fast model
2. Quality Check - Validates the response quality using heuristics or custom validators
3. Cascade if Needed - Only calls the expensive model if quality is below threshold
4. Track Everything - Records costs, latency, and cascade decisions
This approach provides:
- ✅ No Latency Penalty - Drafter responses are instant when quality is high
- ✅ Quality Guarantee - Verifier ensures high-quality responses for complex queries
- ✅ Cost Savings - 40-60% reduction in API costs on average
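Conceptually, the wrapper runs the four steps above roughly like the sketch below. This is an illustrative reimplementation rather than the library's code, and scoreQuality is a hypothetical stand-in for CascadeFlow's built-in quality heuristics:

```typescript
import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';

const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatAnthropic({ model: 'claude-sonnet-4-5' });
const qualityThreshold = 0.7;

// Hypothetical stand-in for the built-in quality heuristics.
function scoreQuality(text: string): number {
  return text.length > 50 ? 0.9 : 0.4;
}

async function cascadeInvoke(prompt: string) {
  // 1. Try the cheap drafter first.
  const draft = await drafter.invoke(prompt);
  const text = typeof draft.content === 'string' ? draft.content : JSON.stringify(draft.content);

  // 2. Quality check against the configured threshold.
  if (scoreQuality(text) >= qualityThreshold) {
    return draft; // Accepted - the verifier is never called.
  }

  // 3. Cascade: only now pay for the expensive model.
  // (Step 4, cost/latency tracking, is omitted in this sketch.)
  return verifier.invoke(prompt);
}
```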
```typescript
const cascadeModel = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
  verifier: new ChatAnthropic({ model: 'claude-sonnet-4-5' }),
  qualityThreshold: 0.7, // Default: 0.7 (70%)
});
```

You can also supply your own quality validator:

```typescript
const cascadeModel = withCascade({
  drafter,
  verifier,
  qualityValidator: async (response) => {
    // Custom logic - return quality score 0-1
    const text = response.generations[0].text;
    // Example: Use length and keywords
    const hasKeywords = ['typescript', 'javascript'].some(kw =>
      text.toLowerCase().includes(kw)
    );
    return text.length > 50 && hasKeywords ? 0.9 : 0.4;
  },
});
```

To disable cascade metadata injection:

```typescript
const cascadeModel = withCascade({
  drafter,
  verifier,
  enableCostTracking: false, // Disable metadata injection
});
```
CascadeFlow supports real-time streaming with optimistic drafter execution:
```typescript
const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-4o-mini' }),
  verifier: new ChatOpenAI({ model: 'gpt-4o' }),
});

// Stream responses in real-time
const stream = await cascade.stream('Explain TypeScript');

for await (const chunk of stream) {
  process.stdout.write(chunk.content);
}
```
How Streaming Works:
1. Optimistic Streaming - Drafter response streams immediately (user sees output in real-time)
2. Quality Check - After drafter completes, quality is validated
3. Optional Cascade - If quality is insufficient, a "⤴ Cascading to [model]" message is shown and the verifier's response is streamed
This provides the best user experience with no perceived latency for queries the drafter can handle.
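To act on the cascade decision after the stream finishes, you can read the same statistics that invoke() exposes. A minimal sketch, assuming getLastCascadeResult() is also populated once streaming completes:

```typescript
// Stream the answer, then check whether the drafter's output was accepted.
const stream = await cascade.stream('Explain TypeScript generics');

let fullText = '';
for await (const chunk of stream) {
  fullText += chunk.content;            // accumulate while showing output live
  process.stdout.write(chunk.content);
}

const stats = cascade.getLastCascadeResult();
if (stats && !stats.accepted) {
  // The drafter fell below the quality threshold, so the streamed text above
  // includes the verifier's response.
  console.log(`\n(Escalated to verifier - drafter quality ${stats.drafterQuality})`);
}
```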
All LangChain chainable methods work seamlessly:
```typescript
const cascadeModel = withCascade({ drafter, verifier });

// bind() works
const boundModel = cascadeModel.bind({ temperature: 0.1 });
const result = await boundModel.invoke("Be precise");

// Chain multiple times
const doubleChained = cascadeModel
  .bind({ temperature: 0.5 })
  .bind({ maxTokens: 100 });
```
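Because the wrapper implements the standard LangChain chat model interface, it should also compose in LCEL chains. A sketch under that assumption, using stock LangChain prompt and parser classes:

```typescript
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

// Compose the cascade model like any other Runnable chat model.
const prompt = ChatPromptTemplate.fromMessages([
  ['system', 'You are a concise technical assistant.'],
  ['human', '{question}'],
]);

const chain = prompt.pipe(cascadeModel).pipe(new StringOutputParser());

const answer = await chain.invoke({ question: 'What is structural typing?' });
console.log(answer);
```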
```typescript
import { DynamicTool } from '@langchain/core/tools';

const tools = [
  new DynamicTool({
    name: 'calculator',
    description: 'Useful for math calculations',
    func: async (input: string) => {
      // Demo only - avoid eval() on untrusted input.
      return eval(input).toString();
    },
  }),
];

const modelWithTools = cascadeModel.bindTools(tools);
const result = await modelWithTools.invoke("What is 25 * 4?");
```
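Note that bindTools() only advertises the tool to the model; the reply is a tool call that you execute yourself. A minimal sketch, assuming the wrapper returns a standard LangChain AIMessage (the exact args shape depends on the tool):

```typescript
// `result` is the message returned by modelWithTools.invoke(...) above.
for (const toolCall of result.tool_calls ?? []) {
  // e.g. { name: 'calculator', args: { input: '25 * 4' } } - shape may vary
  console.log(toolCall.name, toolCall.args);
}
```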
```typescript
const schema = {
  name: 'person',
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      age: { type: 'number' },
    },
  },
};

const structuredModel = cascadeModel.withStructuredOutput(schema);
const result = await structuredModel.invoke("Extract: John is 30 years old");
// Result is typed according to schema
```
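If the wrapper delegates withStructuredOutput() to the wrapped LangChain models (as the chainable-method support suggests), a zod schema should work as well and gives stronger typing. A sketch assuming zod is installed:

```typescript
import { z } from 'zod';

// Zod schema instead of raw JSON Schema (assumes delegation to the
// underlying models' withStructuredOutput implementation).
const personSchema = z.object({
  name: z.string(),
  age: z.number(),
});

const typedModel = cascadeModel.withStructuredOutput(personSchema);
const person = await typedModel.invoke('Extract: John is 30 years old');
// `person` is inferred as { name: string; age: number }
console.log(person.name, person.age);
```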
```typescript
const result = await cascadeModel.invoke("Complex question");
const stats = cascadeModel.getLastCascadeResult();

console.log({
  content: stats.content,
  modelUsed: stats.modelUsed,                 // 'drafter' or 'verifier'
  accepted: stats.accepted,                   // Was drafter response accepted?
  drafterQuality: stats.drafterQuality,       // 0-1 quality score
  drafterCost: stats.drafterCost,             // $ spent on drafter
  verifierCost: stats.verifierCost,           // $ spent on verifier
  totalCost: stats.totalCost,                 // Total $ spent
  savingsPercentage: stats.savingsPercentage, // % saved vs verifier-only
  latencyMs: stats.latencyMs,                 // Total latency in ms
});
```
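The same fields can be aggregated across a batch of queries to estimate overall behavior. A small sketch using only the documented CascadeResult fields (note that local cost values are $0 when the default LangSmith cost tracking is used, as described below):

```typescript
// Aggregate cascade stats over several queries.
const queries = ['What is TypeScript?', 'Explain closures', 'What is a monad?'];

let totalCost = 0;
let drafterHits = 0;

for (const q of queries) {
  await cascadeModel.invoke(q);
  const s = cascadeModel.getLastCascadeResult();
  if (!s) continue;
  totalCost += s.totalCost;
  if (s.accepted) drafterHits++;
}

console.log(`Drafter hit rate: ${((drafterHits / queries.length) * 100).toFixed(0)}%`);
console.log(`Total spend: $${totalCost.toFixed(6)}`);
```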
CascadeFlow works seamlessly with LangSmith for observability and cost tracking.
When you enable LangSmith tracing, you'll see:
1. Your Actual Chat Models - ChatOpenAI, ChatAnthropic, etc. appear as separate traces
2. Cascade Metadata - Decision info attached to each response
3. Token Usage & Costs - Server-side calculation by LangSmith
4. Nested Traces - Parent CascadeFlow trace with child model traces
```typescript
// Set environment variables
process.env.LANGSMITH_API_KEY = 'lsv2_pt_...';
process.env.LANGSMITH_PROJECT = 'your-project';
process.env.LANGSMITH_TRACING = 'true';

// Use CascadeFlow normally - tracing happens automatically
const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'gpt-5-mini' }),
  verifier: new ChatAnthropic({ model: 'claude-sonnet-4-5' }),
  costTrackingProvider: 'langsmith', // Default
});

const result = await cascade.invoke("Your query");
```
In your LangSmith dashboard (https://smith.langchain.com):
- For cascaded queries - You'll see only the drafter model trace (e.g., ChatOpenAI with gpt-5-mini)
- For escalated queries - You'll see BOTH drafter AND verifier traces (e.g., ChatOpenAI gpt-5-mini + ChatAnthropic claude-sonnet-4-5)
- Metadata location - Click any trace → Outputs → response_metadata → cascade
```json
{
  "cascade": {
    "cascade_decision": "cascaded",
    "model_used": "drafter",
    "drafter_quality": 0.85,
    "savings_percentage": 66.7,
    "drafter_cost": 0,    // Calculated by LangSmith
    "verifier_cost": 0,   // Calculated by LangSmith
    "total_cost": 0       // Calculated by LangSmith
  }
}
```
Note: When using costTrackingProvider: 'langsmith' (default), costs are calculated server-side and shown in the LangSmith UI. Local cost values are $0.
See docs/COST_TRACKING.md for more details on cost tracking options.
Works with any LangChain-compatible chat model:
OpenAI:

```typescript
import { ChatOpenAI } from '@langchain/openai';

const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatOpenAI({ model: 'gpt-5' });
```

Anthropic:
```typescript
import { ChatAnthropic } from '@langchain/anthropic';

const drafter = new ChatAnthropic({ model: 'claude-3-5-haiku-20241022' });
const verifier = new ChatAnthropic({ model: 'claude-sonnet-4-5' });
```

Mix providers:

```typescript
// Use different providers for optimal cost/quality balance!
const drafter = new ChatOpenAI({ model: 'gpt-5-mini' });
const verifier = new ChatAnthropic({ model: 'claude-sonnet-4-5' });
```

Cost Optimization Tips
1. Choose Your Drafter Wisely - Use the cheapest model that can handle most queries
- GPT-5-mini: $0.25/$2.00 per 1M tokens (input/output)
- GPT-4o-mini: $0.15/$0.60 per 1M tokens (input/output)
- Claude 3.5 Haiku: $0.80/$4.00 per 1M tokens
2. Tune Quality Threshold - Higher threshold = more cascades = higher cost but better quality
- 0.6 - Aggressive cost savings, may sacrifice some quality
- 0.7 - Balanced (recommended default)
- 0.8 - Conservative, ensures high quality
3. Use Custom Validators - Domain-specific validation can improve accuracy
```typescript
qualityValidator: (response) => {
  const text = response.generations[0].text;
  // Check for domain-specific requirements; hasRelevantKeywords and
  // meetsLengthRequirement stand in for your own checks.
  return hasRelevantKeywords(text) && meetsLengthRequirement(text) ? 0.9 : 0.5;
}
```

Performance
Typical cascade behavior:
| Query Type | Drafter Hit Rate | Avg Latency | Cost Savings |
|-----------|------------------|-------------|--------------|
| Simple Q&A | 85% | 500ms | 55-65% |
| Complex reasoning | 40% | 1200ms | 20-30% |
| Code generation | 60% | 800ms | 35-45% |
| Overall | 70% | 700ms | 40-60% |
TypeScript Support
Full TypeScript support with type inference:
```typescript
import type { CascadeConfig, CascadeResult } from '@cascadeflow/langchain';

const config: CascadeConfig = {
  drafter,
  verifier,
  qualityThreshold: 0.7,
};

const stats: CascadeResult | undefined = cascadeModel.getLastCascadeResult();
```

Examples
See the examples directory for complete working examples:
- basic-usage.ts - Getting started guide
- streaming-cascade.ts - Real-time streaming with optimistic drafter execution
API Reference
withCascade(config)
Creates a cascade-wrapped LangChain model.
Parameters:
- config.drafter - The cheap, fast model
- config.verifier - The accurate, expensive model
- config.qualityThreshold? - Minimum quality to accept drafter (default: 0.7)
- config.qualityValidator? - Custom function to calculate quality
- config.enableCostTracking? - Enable LangSmith metadata injection (default: true)

Returns:
CascadeFlow - A LangChain-compatible model with cascade logic

getLastCascadeResult()
Returns statistics from the last cascade execution.
Returns:
CascadeResult with:
- content - The final response text
- modelUsed - Which model provided the response ('drafter' | 'verifier')
- accepted - Whether drafter response was accepted
- drafterQuality - Quality score of drafter response (0-1)
- drafterCost - Cost of drafter call
- verifierCost - Cost of verifier call (0 if not used)
- totalCost - Total cost
- savingsPercentage - Percentage saved vs verifier-only
- latencyMs - Total latency in milliseconds

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
MIT © Lemony Inc.
- @cascadeflow/core - Core CascadeFlow Python library
- LangChain - Framework for LLM applications
- LangSmith - LLM observability platform