A high-performance inference engine for AI models
```bash
npm install @trymirai/uzu
```



Node package for uzu, a high-performance inference engine for AI models on Apple Silicon. It allows you to deploy AI directly in your app with zero latency, full data privacy, and no inference costs. You don’t need an ML team or weeks of setup - one developer can handle everything in minutes. Key features:
- Simple, high-level API
- Specialized configurations with significant performance boosts for common use cases like classification and summarization
- Broad model support
Add the uzu dependency to your project's package.json:
```json
"dependencies": {
  "@trymirai/uzu": "0.2.10"
}
```
Set up your project through Platform and obtain an API_KEY. Then, choose the model you want from the library and run it with the following snippet using the corresponding identifier:
```ts
import Engine from '@trymirai/uzu';

const output = await Engine
  .create('API_KEY')
  .chatModel('Qwen/Qwen3-0.6B')
  .reply('Tell me a short, funny story about a robot');
```
Everything from model downloading to inference configuration is handled automatically. Refer to the documentation for details on how to customize each step of the process.
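For instance, here is a minimal sketch that combines the builder calls used in the examples below (a download progress callback plus a session customized with a token limit and greedy sampling); treat it as an illustration rather than a complete reference:

```ts
import Engine, { SamplingMethod } from '@trymirai/uzu';

// Minimal sketch: customize the download step (progress callback) and the
// generation step (token limit, greedy sampling) using the builder calls
// demonstrated in the examples below.
const output = await Engine.create('API_KEY')
  .chatModel('Qwen/Qwen3-0.6B')
  .download((update) => {
    console.log('Progress:', update.progress);
  })
  .session()
  .tokensLimit(256)
  .samplingMethod(SamplingMethod.greedy())
  .reply('Tell me a short, funny story about a robot');

console.log(output.text.original);
```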
Place the API_KEY you obtained earlier in the corresponding example file, and then run it using one of the following commands:
```bash
pnpm run tsn examples/chat.ts
pnpm run tsn examples/chatDynamicContext.ts
pnpm run tsn examples/chatStaticContext.ts
pnpm run tsn examples/summarization.ts
pnpm run tsn examples/classification.ts
pnpm run tsn examples/cloud.ts
pnpm run tsn examples/structuredOutput.ts
```
In this example, we will download a model and get a reply to a specific list of messages:
```ts
import Engine, { Message } from '@trymirai/uzu';

async function main() {
  const output = await Engine.create('API_KEY')
    .chatModel('Qwen/Qwen3-0.6B')
    .download((update) => {
      console.log('Progress:', update.progress);
    })
    .replyToMessages(
      [
        Message.system('You are a helpful assistant'),
        Message.user('Tell me a short, funny story about a robot'),
      ],
      (partialOutput) => {
        return true;
      },
    );
  console.log(output.text.original);
}

main().catch((error) => {
  console.error(error);
});
```
In this example, we will use the dynamic ContextMode, which automatically maintains a continuous conversation history instead of resetting the context with each new input. Every new message is added to the ongoing chat, allowing the model to remember what has already been said and respond with full context.
```ts
import Engine, { Config, ContextMode, Input, RunConfig } from '@trymirai/uzu';

async function main() {
  const engine = await Engine.load('API_KEY');
  const model = await engine.chatModel('Qwen/Qwen3-0.6B');
  await engine.downloadChatModel(model, (update) => {
    console.log('Progress:', update.progress);
  });

  const config = Config
    .default()
    .withContextMode(ContextMode.dynamic());
  const session = engine.chatSession(model, config);

  const requests = [
    'Tell me about London',
    'Compare with New York',
    'Compare the population of the two',
  ];

  const runConfig = RunConfig
    .default()
    .withTokensLimit(1024)
    .withEnableThinking(false);

  for (const request of requests) {
    const output = session.run(Input.text(request), runConfig, (partialOutput) => {
      return true;
    });
    console.log('Request:', request);
    console.log('Response:', output.text.original.trim());
    console.log('-------------------------');
  }
}

main().catch((error) => {
  console.error(error);
});
```
In this example, we will use the static ContextMode, which begins with an initial list of messages defining the base context of the conversation, such as predefined instructions. Unlike dynamic mode, this context is fixed and does not evolve with new messages. Each inference request is processed independently, using only the initial context and the latest input, without retaining any previous conversation history.
```ts
import Engine, { Config, ContextMode, Input, Message, RunConfig } from '@trymirai/uzu';

function listToString(list: string[]): string {
  return '[' + list.map((item) => `"${item}"`).join(', ') + ']';
}

async function main() {
  const engine = await Engine.load('API_KEY');
  const model = await engine.chatModel('Qwen/Qwen3-0.6B');
  await engine.downloadChatModel(model, (update) => {
    console.log('Progress:', update.progress);
  });

  const instructions =
    'Your task is to name countries for each city in the given list. ' +
    `For example for ${listToString(['Helsinki', 'Stockholm', 'Barcelona'])} the answer should be ${listToString(['Finland', 'Sweden', 'Spain'])}.`;

  const config = Config
    .default()
    .withContextMode(ContextMode.static(Input.messages([Message.system(instructions)])));
  const session = engine.chatSession(model, config);

  const requests = [
    listToString(['New York', 'London', 'Lisbon', 'Paris', 'Berlin']),
    listToString(['Bangkok', 'Tokyo', 'Seoul', 'Beijing', 'Delhi']),
  ];

  const runConfig = RunConfig
    .default()
    .withEnableThinking(false);

  for (const request of requests) {
    const output = session.run(Input.text(request), runConfig, (partialOutput) => {
      return true;
    });
    console.log('Request:', request);
    console.log('Response:', output.text.original.trim());
    console.log('-------------------------');
  }
}

main().catch((error) => {
  console.error(error);
});
```
In this example, we will use the summarization preset to generate a summary of the input text:
```ts
import Engine, { Preset, SamplingMethod } from '@trymirai/uzu';

async function main() {
  const textToSummarize =
    "A Large Language Model (LLM) is a type of artificial intelligence that processes and generates human-like text. It is trained on vast datasets containing books, articles, and web content, allowing it to understand and predict language patterns. LLMs use deep learning, particularly transformer-based architectures, to analyze text, recognize context, and generate coherent responses. These models have a wide range of applications, including chatbots, content creation, translation, and code generation. One of the key strengths of LLMs is their ability to generate contextually relevant text based on prompts. They utilize self-attention mechanisms to weigh the importance of words within a sentence, improving accuracy and fluency. Examples of popular LLMs include OpenAI's GPT series, Google's BERT, and Meta's LLaMA. As these models grow in size and sophistication, they continue to enhance human-computer interactions, making AI-powered communication more natural and effective.";
  const prompt = `Text is: "${textToSummarize}". Write only the summary itself.`;

  const output = await Engine.create('API_KEY')
    .chatModel('Qwen/Qwen3-0.6B')
    .download((update) => {
      console.log('Progress:', update.progress);
    })
    .preset(Preset.summarization())
    .session()
    .tokensLimit(256)
    .enableThinking(false)
    .samplingMethod(SamplingMethod.greedy())
    .reply(prompt);

  console.log('Summary:', output.text.original);
  console.log(
    'Model runs:',
    output.stats.prefillStats.modelRun.count + (output.stats.generateStats?.modelRun.count ?? 0),
  );
  console.log('Tokens count:', output.stats.totalStats.tokensCountOutput);
}

main().catch((error) => {
  console.error(error);
});
```
You will notice that the model’s run count is lower than the actual number of generated tokens due to speculative decoding, which significantly improves generation speed.
In this example, we will use the classification preset to determine the sentiment of the user's input:
```ts
import Engine, { ClassificationFeature, Preset, SamplingMethod } from '@trymirai/uzu';

async function main() {
  const feature = new ClassificationFeature('sentiment', [
    'Happy',
    'Sad',
    'Angry',
    'Fearful',
    'Surprised',
    'Disgusted',
  ]);
  const textToDetectFeature =
    "Today's been awesome! Everything just feels right, and I can't stop smiling.";
  const prompt =
    `Text is: "${textToDetectFeature}". Choose ${feature.name} from the list: ${feature.values.join(', ')}. ` +
    "Answer with one word. Don't add a dot at the end.";

  const output = await Engine.create('API_KEY')
    .chatModel('Qwen/Qwen3-0.6B')
    .download((update) => {
      console.log('Progress:', update.progress);
    })
    .preset(Preset.classification(feature))
    .session()
    .tokensLimit(32)
    .enableThinking(false)
    .samplingMethod(SamplingMethod.greedy())
    .reply(prompt);

  console.log('Prediction:', output.text.original);
  console.log('Stats:', output.stats);
}

main().catch((error) => {
  console.error(error);
});
```
You can view the stats to see that the answer will be ready immediately after the prefill step, and actual generation won’t even start due to speculative decoding, which significantly improves generation speed.
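To see this in the output yourself, here is a small sketch that uses only the stats fields printed in the examples above (it assumes the same `output` object):

```ts
// Sketch: with the classification preset, the answer is typically resolved during
// prefill, so the generate phase contributes few or no model runs.
const prefillRuns = output.stats.prefillStats.modelRun.count;
const generateRuns = output.stats.generateStats?.modelRun.count ?? 0;
console.log(`Prefill model runs: ${prefillRuns}, generate model runs: ${generateRuns}`);
```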
Sometimes you want to create a complex pipeline where some requests are processed on-device and the more complex ones are handled in the cloud using a larger model. With uzu, you can do this easily: just choose the cloud model you want to use and perform all requests through the same API:
```ts
import Engine from '@trymirai/uzu';

async function main() {
  const output = await Engine
    .create('API_KEY')
    .chatModel('openai/gpt-oss-120b')
    .reply('How LLMs work');
  console.log(output.text.original);
}

main().catch((error) => {
  console.error(error);
});
```
Sometimes you want the generated output to be valid JSON with predefined fields. You can use GrammarConfig to manually specify a JSON schema, or define the schema using Zod.
```ts
import Engine, { GrammarConfig } from '@trymirai/uzu';
import * as z from 'zod';

const CountryType = z.object({
  name: z.string(),
  capital: z.string(),
});
const CountryListType = z.array(CountryType);

async function main() {
  const output = await Engine.create('API_KEY')
    .chatModel('Qwen/Qwen3-0.6B')
    .download((update) => {
      console.log('Progress:', update.progress);
    })
    .session()
    .enableThinking(false)
    .grammarConfig(GrammarConfig.fromType(CountryListType))
    .reply(
      'Give me a JSON object containing a list of 3 countries, where each country has name and capital fields',
      (partialOutput) => {
        return true;
      },
    );

  const countries = output.text.parsed.structuredResponse(CountryListType);
  console.log(countries);
}

main().catch((error) => {
  console.error(error);
});
```
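For the manual JSON schema path mentioned above, here is a purely hypothetical sketch; the `GrammarConfig.fromJsonSchema` name and the schema shape it accepts are assumptions for illustration only (this README only demonstrates `GrammarConfig.fromType` with a Zod type), so check the documentation for the actual constructor:

```ts
import Engine, { GrammarConfig } from '@trymirai/uzu';

// Hypothetical sketch only: `GrammarConfig.fromJsonSchema` is an assumed name,
// not confirmed by this README; see the documentation for the real API.
const countryListSchema = {
  type: 'array',
  items: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      capital: { type: 'string' },
    },
    required: ['name', 'capital'],
  },
};

async function main() {
  const output = await Engine.create('API_KEY')
    .chatModel('Qwen/Qwen3-0.6B')
    .session()
    .enableThinking(false)
    .grammarConfig(GrammarConfig.fromJsonSchema(countryListSchema))
    .reply('Give me a JSON array of 3 countries, each with name and capital fields');
  console.log(output.text.original);
}

main().catch((error) => {
  console.error(error);
});
```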
If you experience any problems, please contact us via Discord or email.
This project is licensed under the MIT License. See the LICENSE file for details.