Skip to content

Version 1.0.0.#1

Open
hopeseekr wants to merge 1 commit into
trunkfrom
v1.x
Open

Version 1.0.0.#1
hopeseekr wants to merge 1 commit into
trunkfrom
v1.x

Conversation

@hopeseekr

@hopeseekr hopeseekr commented Dec 22, 2025

Copy link
Copy Markdown
Contributor

CodeAnt-AI Description

Add streaming Ollama runner with interruptible callbacks and token/timing metrics

What Changed

  • Introduces an Ollama client that runs models via the local Ollama CLI, streams model stdout to a user-provided callback in real time, and stops when the stream ends.
  • Callers can supply a periodic interrupt check to abort a running generation; callback errors also stop processing and surface as errors.
  • Returns a SpeakResult that contains total tokens (counted with the tokenizer) and elapsed duration, plus a tokens-per-second metric for observable performance.
  • Adds an integration test that fails early if the Ollama CLI or the specified model is not available; adds many tests covering streaming behavior, interrupts, error propagation, special/unicode prompts, and token/timing assertions.

Impact

✅ Streaming real-time model output
✅ Interruptible model runs
✅ Observable token and timing metrics

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

@codeant-ai

codeant-ai Bot commented Dec 22, 2025

Copy link
Copy Markdown

CodeAnt AI is reviewing your PR.

@hopeseekr

Copy link
Copy Markdown
Contributor Author

/gemini review

@codeant-ai codeant-ai Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files label Dec 22, 2025
@codeant-ai

codeant-ai Bot commented Dec 22, 2025

Copy link
Copy Markdown

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Orphaned child process on interrupt
    If check_interrupt() returns an error the function will return early but does not ensure the spawned ollama child is killed. That can leave the child running. The child should be explicitly terminated (or waited on) when interrupting.

  • Unbounded memory growth
    The code accumulates the entire child stdout into full_output for token counting. For long-running models or large outputs this can use a large amount of memory and OOM the process. Consider counting tokens incrementally or limiting retained output.

  • UTF-8 parsing fragility
    The final token count uses str::from_utf8(&full_output) and bails on any invalid UTF-8. Streaming outputs or binary data may contain partial/invalid UTF-8 sequences. This will cause the call to fail even though useful text was produced.

  • Incomplete test
    The file ends with a truncated test line (let zero_tokens) and no completion/assertion; this looks unfinished and may hide intended assertions or edge-case checks.

  • Closure mutability mismatch
    Several closures mutate captured state (e.g., incrementing counters) but are bound without mut. Calling an FnMut closure through an immutable binding can lead to borrow/mutability issues or require changes elsewhere. Ensure closures that mutate captured variables are either bound mut or use interior mutability (Cell/RefCell) intentionally.

Comment thread src/lib.rs
Comment on lines +85 to +104
loop {
// Check for external interrupt signal before each read.
check_interrupt()?;

let bytes_read = reader
.read(&mut buffer)
.context("Failed to read from ollama stdout")?;

if bytes_read == 0 {
// End of stream
break;
}

let data_slice = &buffer[..bytes_read];

// Invoke the callback with the new data.
on_data(data_slice)?;

// Keep a copy of the output for token counting at the end.
full_output.extend_from_slice(data_slice);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: When either the interrupt callback or the streaming callback returns an error, the function returns early via ? and never terminates or waits on the spawned ollama child process, leaving it running and potentially creating zombie processes and resource leaks; wrapping the loop in an inner closure and, on error, explicitly killing and waiting on the child before returning fixes this. [resource leak]

Severity Level: Minor ⚠️

Suggested change
loop {
// Check for external interrupt signal before each read.
check_interrupt()?;
let bytes_read = reader
.read(&mut buffer)
.context("Failed to read from ollama stdout")?;
if bytes_read == 0 {
// End of stream
break;
}
let data_slice = &buffer[..bytes_read];
// Invoke the callback with the new data.
on_data(data_slice)?;
// Keep a copy of the output for token counting at the end.
full_output.extend_from_slice(data_slice);
let loop_result: Result<()> = (|| {
loop {
// Check for external interrupt signal before each read.
check_interrupt()?;
let bytes_read = reader
.read(&mut buffer)
.context("Failed to read from ollama stdout")?;
if bytes_read == 0 {
// End of stream
break;
}
let data_slice = &buffer[..bytes_read];
// Invoke the callback with the new data.
on_data(data_slice)?;
// Keep a copy of the output for token counting at the end.
full_output.extend_from_slice(data_slice);
}
Ok(())
})();
if let Err(err) = loop_result {
// Ensure the child process is terminated and reaped on early error.
let _ = command.kill();
let _ = command.wait();
return Err(err);
Why it matters? ⭐

The suggestion correctly identifies a real resource-leak risk: if check_interrupt() or on_data(...) returns an Err the function will return early via ? and the Child (command) will not be waited on or killed. Dropping Child does not guarantee the child process is terminated or reaped, so the proposed fix to capture loop errors, kill the child and wait on it before returning is a valid and pragmatic improvement that prevents zombies and leaked processes.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** src/lib.rs
**Line:** 85:104
**Comment:**
	*Resource Leak: When either the interrupt callback or the streaming callback returns an error, the function returns early via `?` and never terminates or waits on the spawned `ollama` child process, leaving it running and potentially creating zombie processes and resource leaks; wrapping the loop in an inner closure and, on error, explicitly killing and waiting on the child before returning fixes this.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

let mut output = Vec::new();

let result = ollama.speak_stream(
"devstral:24b",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This test uses the hardcoded model name devstral:24b while the rest of the suite consistently uses gemma3:270m, so in environments where only gemma3:270m is installed (as implied by other tests) this test will fail at runtime when spawning the ollama process due to a missing model. [logic error]

Severity Level: Minor ⚠️

Suggested change
"devstral:24b",
"gemma3:270m",
Why it matters? ⭐

The project already contains many tests referencing "gemma3:270m" (search shows multiple test files and README mentioning gemma3:270m), while this new hunk uses "devstral:24b". If CI or a developer environment only has gemma3 installed, tests that reference devstral:24b will fail at runtime when the external ollama model is missing. This is a real, environment-dependent logic/runtime failure risk, not merely stylistic. Replacing the literal with the common model string used elsewhere removes that practical failure mode.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/basic_functionality.rs
**Line:** 23:23
**Comment:**
	*Logic Error: This test uses the hardcoded model name `devstral:24b` while the rest of the suite consistently uses `gemma3:270m`, so in environments where only `gemma3:270m` is installed (as implied by other tests) this test will fail at runtime when spawning the `ollama` process due to a missing model.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

let mut output = Vec::new();

let result = ollama.speak_stream(
"devstral:24b",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This test also hardcodes the devstral:24b model instead of the gemma3:270m model used everywhere else, which will cause this test to fail at runtime on setups that only have gemma3:270m available. [logic error]

Severity Level: Minor ⚠️

Suggested change
"devstral:24b",
"gemma3:270m",
Why it matters? ⭐

Same as above: the empty-prompt test hardcodes "devstral:24b" while the rest of the suite commonly uses "gemma3:270m". This inconsistency can cause false failures in environments where devstral:24b isn't installed. The suggestion targets a real runtime/configuration mismatch.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/basic_functionality.rs
**Line:** 48:48
**Comment:**
	*Logic Error: This test also hardcodes the `devstral:24b` model instead of the `gemma3:270m` model used everywhere else, which will cause this test to fail at runtime on setups that only have `gemma3:270m` available.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

let mut output = Vec::new();

let result = ollama.speak_stream(
"devstral:24b",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This multi-word prompt test again uses the devstral:24b model, diverging from the standardized gemma3:270m used in other tests, which risks failing when the larger/different model is not installed even though the rest of the suite passes. [logic error]

Severity Level: Minor ⚠️

Suggested change
"devstral:24b",
"gemma3:270m",
Why it matters? ⭐

The multi-word prompt test also references "devstral:24b". Given the rest of the repository and tests favor "gemma3:270m", this is a likely source of intermittent CI failures. The fix is a one-line change that addresses a real integration/test-environment issue rather than a cosmetic preference.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/basic_functionality.rs
**Line:** 70:70
**Comment:**
	*Logic Error: This multi-word prompt test again uses the `devstral:24b` model, diverging from the standardized `gemma3:270m` used in other tests, which risks failing when the larger/different model is not installed even though the rest of the suite passes.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Comment thread tests/edge_cases.rs
&|| Ok(()), // No-op interrupt check
)?;

println!("Duration: {:?}", result.duration);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The single-character prompt test never asserts anything about the generated tokens or output, so it will still pass even if the model returns no tokens or an empty response, weakening coverage and potentially letting regressions slip through unnoticed; you should assert on result.total_tokens (and optionally output) as in the other tests. [logic error]

Severity Level: Minor ⚠️

Suggested change
println!("Duration: {:?}", result.duration);
assert!(
result.total_tokens > 0,
"Should generate tokens for a single-character prompt"
);
Why it matters? ⭐

The test currently only prints the result but does not assert any behaviour. That makes it possible for regressions (model returning zero tokens or empty output) to slip by unnoticed. Adding an assertion on result.total_tokens (like other tests in the file) turns this into a meaningful check that the library actually generated tokens for a single-character prompt. This is a real logic/test-coverage improvement, not mere style nitpicking.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/edge_cases.rs
**Line:** 41:41
**Comment:**
	*Logic Error: The single-character prompt test never asserts anything about the generated tokens or output, so it will still pass even if the model returns no tokens or an empty response, weakening coverage and potentially letting regressions slip through unnoticed; you should assert on `result.total_tokens` (and optionally output) as in the other tests.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Comment thread tests/unit/metrics.rs
assert_eq!(zero_duration.tokens_per_second(), 0.0);

// Zero tokens
let zero_tokens

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The test_zero_edge_cases test is incomplete for the "zero tokens" scenario: it declares zero_tokens but never constructs a SpeakResult or asserts anything about its metrics, so the intended behavior for zero tokens is not actually verified; you should create a SpeakResult with zero tokens and a non-zero duration and assert that tokens_per_second() returns 0.0, and close the test function accordingly. [logic error]

Severity Level: Minor ⚠️

Suggested change
let zero_tokens
let zero_tokens = SpeakResult {
duration: Duration::from_secs(1),
total_tokens: 0,
};
assert_eq!(zero_tokens.tokens_per_second(), 0.0);
Why it matters? ⭐

The PR's final file truncates the test at "let zero_tokens" leaving the zero-tokens branch unimplemented — this is a real logic/test bug (the test will not verify the intended behavior and may even fail to compile). The proposed improved code constructs a SpeakResult with total_tokens = 0 and a non-zero duration, then asserts tokens_per_second() == 0.0, which directly fixes the missing assertion and verifies correct behavior for the edge case.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/unit/metrics.rs
**Line:** 136:136
**Comment:**
	*Logic Error: The `test_zero_edge_cases` test is incomplete for the "zero tokens" scenario: it declares `zero_tokens` but never constructs a `SpeakResult` or asserts anything about its metrics, so the intended behavior for zero tokens is not actually verified; you should create a `SpeakResult` with zero tokens and a non-zero duration and assert that `tokens_per_second()` returns `0.0`, and close the test function accordingly.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Comment on lines +21 to +25
// Both should have the same size and be usable
assert_eq!(
std::mem::size_of_val(&ollama1),
std::mem::size_of_val(&ollama2)
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The test that claims to verify new equals default only compares the sizes of two Ollama values, which are guaranteed to be identical for a single type and thus will always pass even if Ollama::new() and Ollama::default() behave differently, giving a false sense of coverage. [logic error]

Severity Level: Minor ⚠️

Suggested change
// Both should have the same size and be usable
assert_eq!(
std::mem::size_of_val(&ollama1),
std::mem::size_of_val(&ollama2)
);
// Both instances should be constructible and storable together
let ollamas: Vec<Ollama> = vec![ollama1, ollama2];
assert_eq!(ollamas.len(), 2);
Why it matters? ⭐

The current test only compares sizes of two values of the same type — that will always be equal and doesn't validate that new() and default() produce equivalent usable instances. The suggested change (creating a Vec with both instances and asserting length) actually exercises construction and storage, providing a meaningful check rather than a tautology.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/unit/ollama_instance.rs
**Line:** 21:25
**Comment:**
	*Logic Error: The test that claims to verify `new` equals `default` only compares the sizes of two `Ollama` values, which are guaranteed to be identical for a single type and thus will always pass even if `Ollama::new()` and `Ollama::default()` behave differently, giving a false sense of coverage.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

Comment on lines +85 to +88
let _ollama2 = Ollama::new(); // Create another instance

// Both should be independently usable
assert!(true);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The test named to validate cloning via new does not assert anything about the created instances and simply asserts true, so it will always pass and fails to detect any regression in creating multiple Ollama instances. [logic error]

Severity Level: Minor ⚠️

Suggested change
let _ollama2 = Ollama::new(); // Create another instance
// Both should be independently usable
assert!(true);
let ollama1 = Ollama::new();
let ollama2 = Ollama::new(); // Create another instance
// Both should be independently usable and storable together
let ollamas: Vec<Ollama> = vec![ollama1, ollama2];
assert_eq!(ollamas.len(), 2);
Why it matters? ⭐

As written the test always passes and doesn't check that two independently created instances are usable together. Replacing the no-op assert!(true) with a tangible assertion (e.g. storing both in a Vec and asserting len==2) makes the test meaningful and will catch regressions in construction/storage semantics.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/unit/ollama_instance.rs
**Line:** 85:88
**Comment:**
	*Logic Error: The test named to validate cloning via `new` does not assert anything about the created instances and simply asserts `true`, so it will always pass and fails to detect any regression in creating multiple `Ollama` instances.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

fn test_ollama_size() {
let size = std::mem::size_of::<Ollama>();
// Ollama is a zero-sized type
assert_eq!(size, 0);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: The test hard-codes the assumption that Ollama is a zero-sized type, which is not part of the public contract and will cause tests to fail spuriously if Ollama legitimately gains fields in the future, coupling tests to an internal implementation detail. [logic error]

Severity Level: Minor ⚠️

Suggested change
assert_eq!(size, 0);
// Do not rely on a specific size; just ensure the type is Sized.
assert!(size >= 0);
Why it matters? ⭐

Asserting an exact size is coupling tests to an internal implementation detail; if Ollama later gains fields, this test will fail despite correct behavior. Relaxing the assertion to not rely on a specific size (or removing the test entirely) avoids brittle tests. The suggested "ensure Sized" style check is trivial but communicates the intent to avoid locking the API to a ZST assumption.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/unit/ollama_instance.rs
**Line:** 105:105
**Comment:**
	*Logic Error: The test hard-codes the assumption that `Ollama` is a zero-sized type, which is not part of the public contract and will cause tests to fail spuriously if `Ollama` legitimately gains fields in the future, coupling tests to an internal implementation detail.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

];

for result in &results {
assert_eq!(result.tokens_per_second(), 50.0);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Using exact equality on a floating-point result in this test can cause spurious failures if tokens_per_second ever changes implementation or encounters minor rounding differences, even when the logic is correct; using an approximate comparison like in the other tests makes the test robust. [possible bug]

Severity Level: Critical 🚨

Suggested change
assert_eq!(result.tokens_per_second(), 50.0);
let tps = result.tokens_per_second();
assert!((tps - 50.0).abs() < 0.01);
Why it matters? ⭐

The test uses assert_eq! on a floating-point return which is fragile: tiny implementation or rounding changes can produce spurious failures. The proposed change mirrors other tests in the file that already use an approximate comparison and makes the test robust without changing behavior. This is a real, practical improvement to test reliability rather than a cosmetic nit.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** tests/unit/speak_result.rs
**Line:** 179:179
**Comment:**
	*Possible Bug: Using exact equality on a floating-point result in this test can cause spurious failures if `tokens_per_second` ever changes implementation or encounters minor rounding differences, even when the logic is correct; using an approximate comparison like in the other tests makes the test robust.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

@codeant-ai

codeant-ai Bot commented Dec 22, 2025

Copy link
Copy Markdown

CodeAnt AI finished reviewing your PR.

@hopeseekr

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new crate ollama-speaker for interacting with the Ollama CLI, complete with streaming support, interrupt handling, and performance metrics. The implementation is straightforward and accompanied by an impressive and comprehensive set of both unit and integration tests, which is excellent for a new library.

My review focuses on a few key areas for improvement:

  • Performance: The tokenizer is re-initialized on every API call, which should be cached for better performance.
  • Library Design: The handling of the subprocess's stderr could be improved to be less intrusive for library consumers, and there's a design trade-off regarding memory usage for token counting that's worth noting.
  • Test Suite: I found a critical issue with an incomplete unit test that prevents compilation, and an inconsistency in the models used for integration testing.

Comment thread tests/unit/metrics.rs
Comment on lines +135 to +136
// Zero tokens
let zero_tokens

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This test case for zero_tokens is incomplete and will cause a compilation error. Please complete the test case to ensure the test suite is valid.

    // Zero tokens
    let zero_tokens = SpeakResult {
        duration: Duration::from_secs(10),
        total_tokens: 0,
    };
    assert_eq!(zero_tokens.tokens_per_second(), 0.0);

Comment thread src/lib.rs
check_interrupt: &dyn Fn() -> Result<()>,
) -> Result<SpeakResult> {
let start_time = Instant::now();
let bpe = p50k_base().context("Failed to load tiktoken p50k_base tokenizer")?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The p50k_base() tokenizer is loaded and initialized on every call to speak_stream. This is inefficient as it likely involves disk I/O and parsing. For a library function that could be called frequently, this can become a performance bottleneck.

Consider initializing the tokenizer once and storing it within the Ollama struct. This would involve:

  1. Changing Ollama to hold the CoreBPE instance.
  2. Modifying Ollama::new() to be fallible (return Result<Self>) and initialize the tokenizer there.
  3. Updating Ollama::default() to panic! on failure, which is a common pattern for Default implementations.

let mut output = Vec::new();

let result = ollama.speak_stream(
"devstral:24b",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This test uses the devstral:24b model, but the dependency check in 00_dependencies.rs and the test README only mention gemma3:270m. This will cause this test and others in the file (lines 48, 70) to fail for anyone who followed the setup instructions. Please use the gemma3:270m model for consistency.

Suggested change
"devstral:24b",
"gemma3:270m",

let mut output = Vec::new();

let result = ollama.speak_stream(
"devstral:24b",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This test uses the devstral:24b model, which is inconsistent with the documented and checked dependency gemma3:270m. Please use gemma3:270m to ensure test reliability.

Suggested change
"devstral:24b",
"gemma3:270m",

let mut output = Vec::new();

let result = ollama.speak_stream(
"devstral:24b",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This test uses the devstral:24b model, which is inconsistent with the documented and checked dependency gemma3:270m. Please use gemma3:270m to ensure test reliability.

Suggested change
"devstral:24b",
"gemma3:270m",

Comment thread src/lib.rs
Comment on lines +72 to +117
.stderr(Stdio::inherit()) // Show errors from ollama directly
.spawn()
.context(format!("Failed to spawn ollama for model '{}'", model_name))?;

let stdout = command
.stdout
.take()
.context("Child process did not have a stdout handle")?;

let mut reader = std::io::BufReader::new(stdout);
let mut buffer = [0; 1024];
let mut full_output = Vec::new();

loop {
// Check for external interrupt signal before each read.
check_interrupt()?;

let bytes_read = reader
.read(&mut buffer)
.context("Failed to read from ollama stdout")?;

if bytes_read == 0 {
// End of stream
break;
}

let data_slice = &buffer[..bytes_read];

// Invoke the callback with the new data.
on_data(data_slice)?;

// Keep a copy of the output for token counting at the end.
full_output.extend_from_slice(data_slice);
}

let status = command
.wait()
.context(format!("Failed to wait for ollama process for model '{}'", model_name))?;

if !status.success() {
anyhow::bail!(
"Ollama command failed for model '{}'. Exit status: {:?}",
model_name,
status.code()
);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using .stderr(Stdio::inherit()) prints Ollama's stderr directly to the user's console. For a library, it's better to give the user control over this or to capture it for better error reporting. Inheriting stderr can be noisy and disruptive in applications using this library.

A better approach would be to pipe stderr, and if the command fails, include the stderr content in the returned anyhow::Error. This provides more context for debugging without polluting the user's stderr stream during normal operation. This requires using wait_with_output() instead of wait().

            .stderr(Stdio::piped()) // Show errors from ollama directly
            .spawn()
            .context(format!("Failed to spawn ollama for model '{}'", model_name))?;

        let stdout = command
            .stdout
            .take()
            .context("Child process did not have a stdout handle")?;

        let mut reader = std::io::BufReader::new(stdout);
        let mut buffer = [0; 1024];
        let mut full_output = Vec::new();

        loop {
            // Check for external interrupt signal before each read.
            check_interrupt()?;

            let bytes_read = reader
                .read(&mut buffer)
                .context("Failed to read from ollama stdout")?;

            if bytes_read == 0 {
                // End of stream
                break;
            }

            let data_slice = &buffer[..bytes_read];

            // Invoke the callback with the new data.
            on_data(data_slice)?;

            // Keep a copy of the output for token counting at the end.
            full_output.extend_from_slice(data_slice);
        }

        let output = command
            .wait_with_output()
            .context(format!("Failed to wait for ollama process for model '{}'", model_name))?;

        if !output.status.success() {
            let stderr = String::from_utf8_lossy(&output.stderr);
            anyhow::bail!(
                "Ollama command failed for model '{}'. Exit status: {:?}. Stderr: {}",
                model_name,
                output.status.code(),
                stderr
            );
        }

Comment thread src/lib.rs
Comment on lines +83 to +104
let mut full_output = Vec::new();

loop {
// Check for external interrupt signal before each read.
check_interrupt()?;

let bytes_read = reader
.read(&mut buffer)
.context("Failed to read from ollama stdout")?;

if bytes_read == 0 {
// End of stream
break;
}

let data_slice = &buffer[..bytes_read];

// Invoke the callback with the new data.
on_data(data_slice)?;

// Keep a copy of the output for token counting at the end.
full_output.extend_from_slice(data_slice);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The implementation accumulates the entire model output in full_output to calculate the total token count at the end. While this provides an accurate token count, it defeats one of the main purposes of streaming: processing large outputs with low memory usage. If the model generates a very large response, this could lead to high memory consumption.

Consider one of the following:

  • Make token counting optional, controlled by a parameter. If disabled, full_output would not be needed.
  • Investigate if tiktoken-rs can perform streaming tokenization to avoid buffering the entire output.
  • Document this behavior and its memory implications clearly in the speak_stream function's documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant