Running LLMs Locally: Overconditioning, Right-Sizing, and Multi-Pass Pipelines

January 18, 2025 • 7 min read • LLM Systems, Prompt Engineering, Machine Learning Engineering

Large language models feel deceptively easy to use. A few lines of prompt text, a capable model, and suddenly you have something that looks intelligent. My experience running LLMs locally, outside of managed products like ChatGPT, taught me that this apparent simplicity hides a lot of sharp edges, especially when correctness and fidelity matter more than creativity.

This post walks through how I approached a concrete problem (cleaning and structuring thousands of ASR transcripts) and what I learned about prompt design, model sizing, and pipeline architecture along the way. The intended audience is technical but not expert: engineers who are curious about running LLMs locally and want to reason about them as system components rather than magic boxes.

The Problem: Structuring ASR Transcripts Faithfully

I started with a few thousand long-form transcripts generated by OpenAI’s Whisper ASR model. These were mostly single-speaker recordings with no timestamps and no formatting-just long, flattened blocks of text.

The goal was not summarization, rewriting, or stylistic editing. I wanted to:

  • Preserve all words and meaning
  • Insert paragraph breaks to improve readability
  • Avoid deletions, paraphrasing, or “editorial help”

This distinction matters. Many LLM examples assume that “better” output means more polished prose. In this case, any content loss or summarization was a failure.

To reason about feasibility, I wrote a small token-length estimator using the common 3/4 rule (≈ 0.75 words per token, or roughly 1.33 tokens per word). That quickly made two constraints obvious:

  • Many transcripts were too large for naive single-pass processing
  • Context window management would matter later, even for simple tasks

At the time, I didn’t yet appreciate how much prompt intent would matter.
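
The estimator itself was simple. Here is a minimal sketch of that feasibility check, using the word-count heuristic above rather than a real tokenizer; the file path and the 8,192-token budget are placeholders for illustration:

# Rough token-length estimate using the 3/4 heuristic
# (~0.75 words per token, i.e. ~1.33 tokens per word).
# This is an approximation, not a real tokenizer count.

def estimate_tokens(text: str) -> int:
    return round(len(text.split()) / 0.75)

def fits_single_pass(text: str, context_tokens: int = 8192) -> bool:
    # Leave roughly half the window for the system prompt and the
    # generated output, since a formatting task echoes the full input back.
    return estimate_tokens(text) < context_tokens // 2

with open("transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

print(estimate_tokens(transcript), fits_single_pass(transcript))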

Early Exploration: Models, Sizes, and Execution Realities

Before I could make progress on prompting, I had to understand the tooling landscape:

  • Model sizes and what parameter counts actually imply
  • Quantization, especially Q4 variants and their impact on quality
  • Model types (general-purpose, instruction-tuned, creative, code-oriented)
  • Local execution using Ollama
  • CPU vs GPU execution, including VRAM limits and spillover behavior

My local setup uses an NVIDIA RTX 3070 Ti with 8 GB of VRAM. That constraint alone ruled out many “just use a bigger model” suggestions.
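
A rough rule of thumb makes that constraint concrete: weight memory is approximately parameter count × bits per weight ÷ 8. The ~4.8 bits per weight used below is my assumption for Q4_K_M-style quantization, and real usage is higher once the KV cache and runtime overhead are added, so treat these as floor estimates:

# Back-of-envelope VRAM estimate for model weights only.
# KV cache, activations, and runtime overhead all come on top of this.

def weight_gb(params_billions: float, bits_per_weight: float = 4.8) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"8B  at ~Q4: {weight_gb(8):.1f} GB")   # ~4.8 GB -> plausible on 8 GB VRAM
print(f"70B at ~Q4: {weight_gb(70):.1f} GB")  # ~42 GB  -> nowhere near fitting locally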

At this stage, I was still thinking in terms of capability: if the model is powerful enough, surely it can follow instructions precisely.

That assumption turned out to be wrong.

Overconditioning: When Prompts Backfire

My first serious prompts were verbose and careful. They looked like something you’d expect an expert editor to follow:

FROM llama3.1:70b

SYSTEM """
You are an expert text editor for raw voice-to-text transcripts.
Correct grammar, spelling, capitalization, and punctuation.
Preserve all information.
Do not summarize.
Do not paraphrase.
Keep the length approximately the same.
...
"""

Despite the explicit constraints, the output frequently:

  • Deleted content
  • Smoothed over repetition
  • Summarized sections implicitly

The more rules I added, the worse the behavior became.

This led me to a concept I now think of as overconditioning.

At first, this was confusing. I had told the model exactly what not to do, yet it kept drifting toward editorial cleanup and summarization. What eventually clicked for me was that these models are not trained to follow tasks in the way humans understand them. They are trained to predict the next token.

From that simple objective, more complex behavior emerges. The model is not choosing to summarize or rewrite, but when a prompt starts to resemble patterns it has seen during training, such as editorial instructions or cleanup guidelines, it follows the most statistically likely continuation. In this case, that continuation often looks like rewriting or summarizing text, even when the rules explicitly forbid it.

In other words, the model is not executing my instructions directly. It is classifying the prompt into a latent category based on its training data, and then continuing along that path. Once the prompt crosses into something that resembles an “editorial” task, the output follows that distribution. This is emergent behavior. It is not programmed explicitly, but it shows up reliably once certain patterns are triggered.

This also explains a confusing discrepancy I saw early on. The same prompt worked reasonably well in ChatGPT, but failed when I ran it locally. ChatGPT is not a single model. It is a layered system that performs intent detection, routing, retries, and output filtering before returning a response. When running a single model locally, none of that scaffolding exists.

Locally, intent must be communicated cleanly, not exhaustively. Adding more rules does not necessarily constrain behavior. In many cases, it does the opposite.

Scaling Up: The 70B Experiment

At one point, I tried running a 70B-parameter model locally. On a handful of files, the results were excellent: nearly perfect structuring with minimal loss. That was encouraging enough that I moved execution to Lambda Cloud to process the dataset at scale.

The infrastructure setup was mostly straightforward. I experimented with instance types and found GH200 instances significantly cheaper than PCIe H100s on paper. Networking was unreliable, and file transfers often timed out unless they succeeded on the first attempt, but the setup was usable.

Performance-wise:

  • The model saturated one CPU core and the GPU
  • VRAM usage stayed under limits
  • Throughput averaged ~37 tokens/second

The math, however, was unforgiving. Processing ~6,300 transcripts would have taken hundreds of hours and a nontrivial amount of money.
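
To make that concrete, here is the kind of back-of-envelope calculation involved. The average output length per transcript is an assumed figure for illustration; the throughput is the ~37 tokens/second measured above:

# Rough runtime estimate at ~37 tokens/second.
transcripts = 6300
avg_output_tokens = 8000        # assumption; actual lengths varied widely
tokens_per_second = 37          # measured throughput from above

hours = transcripts * avg_output_tokens / tokens_per_second / 3600
print(f"{hours:.0f} hours of GPU time")   # ≈ 378 hours under these assumptions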

More importantly, when I broadened my evaluation, the picture changed. The issue was not that the model’s behavior degraded at scale. It was that my initial testing had been too narrow. I had only validated a small number of files, and those happened to perform exceptionally well, with near-perfect accuracy. Based on that limited sample, I assumed the rest of the dataset would behave similarly.

Once I evaluated a larger and more representative set of transcripts, I saw that the same failure modes were present. Some files still produced excellent results, just as they had locally. Others showed summarization, content deletion, or subtle rewriting. The behavior itself had not changed. My understanding of its consistency had.

At that point, it became clear that model size was not the limiting factor. The larger model was capable of producing correct output, but it was not reliably constrained across the full distribution of inputs. The problem was not scale. It was predictability.

Right-Sizing: Why Smaller Models Worked Better

After more reading and experimentation, I encountered a counterintuitive idea that turned out to be correct:

For narrow, mechanical tasks like structuring text, smaller models often perform better.

I switched to LLaMA 3.1 8B with Q4_K_M quantization and rewrote the prompt to do one thing only:

FROM llama3.1:8b

SYSTEM """
You are a transcript formatter.

Input is a fully flattened transcript (no line breaks).

Task:
- Insert paragraph breaks (blank lines) only.

Hard rules:
- Preserve ALL words exactly. Do NOT delete, add, summarize, paraphrase, reorder, or rewrite.
- Do NOT change capitalization or punctuation.
- Do NOT remove filler words.

Paragraph breaks:
- Insert a blank line only at an obvious transition (new topic/example/question/aside). If unsure, do not insert a break.

Output:
- Return ONLY the transcript text with paragraph breaks.
"""
# More articles to come based on my learnings from these parameters!
PARAMETER num_ctx 8192
PARAMETER temperature 0.0
PARAMETER top_p 1
PARAMETER top_k 0
PARAMETER repeat_penalty 1.02
PARAMETER num_thread 12
PARAMETER num_predict -1

This prompt deliberately avoided (for the most part):

  • Editorial language
  • Cleanup instructions
  • Any notion of “improving” the text

The results were consistently better. Smaller models had less latent pressure to rewrite or summarize and were easier to constrain.

This reinforced an important lesson: capability can be a liability if it activates the wrong behavior.

Multi-Pass Pipelines Beat Perfect Prompts

One reason overconditioning happens is that it’s tempting to ask the model to do everything at once. That’s rarely a good idea.

Instead, I structured the pipeline into discrete stages:

  1. Paragraph structuring
  2. Fixing punctuation
  3. (Future) Terminology correction
  4. (Future) Name correction

Each stage has:

  • A single responsibility
  • A narrowly scoped prompt
  • Independent validation

This design accepts that LLMs are imperfect but makes failures detectable and recoverable. It also makes prompts easier to reason about.
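
As a sketch of what this looks like in code, here is a minimal stage runner against Ollama’s local HTTP API. The stage model names are hypothetical (each would be built from its own narrowly scoped Modelfile, like the formatter above), and the word-count guard shown here is a simplified stand-in for the fidelity check described in the next section:

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

# Hypothetical model names: one registered model per stage,
# each carrying a single, narrow system prompt.
STAGES = ["transcript-paragrapher", "transcript-punctuator"]

def run_stage(model: str, text: str) -> str:
    # Non-streaming call; the full completion comes back in "response".
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": text, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def words_preserved(src: str, out: str, tolerance: float = 0.02) -> bool:
    # Cheap per-stage guard: formatting stages should barely change word count.
    src_words, out_words = len(src.split()), len(out.split())
    return abs(out_words - src_words) <= tolerance * src_words

def process(transcript: str) -> str:
    text = transcript
    for model in STAGES:
        out = run_stage(model, text)
        if not words_preserved(text, out):
            # Fail loudly so the file can be re-run or reviewed by hand.
            raise ValueError(f"{model} changed the content; flagging for review")
        text = out
    return text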

Measuring Fidelity: Making Quality Visible

To avoid subjective judgment, I added a quantitative check using Python’s difflib.SequenceMatcher. For the structuring stage, this produces a “likeness” score that captures how much the output diverges from the input.
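
A minimal version of that check looks like the sketch below. Whether to normalize whitespace first, and where to set the alert threshold, depends on what a given stage is allowed to change; for pure paragraph structuring, word-level likeness should stay near 1.0:

from difflib import SequenceMatcher

def likeness(original: str, structured: str) -> float:
    # Normalize whitespace so the inserted paragraph breaks themselves
    # don't register as divergence; only word-level changes should.
    a = " ".join(original.split())
    b = " ".join(structured.split())
    return SequenceMatcher(None, a, b).ratio()

raw = "so today I want to cover three things the first thing is the setup"
out = "so today I want to cover three things\n\nthe first thing is the setup"
print(likeness(raw, out))  # 1.0 here, since only a paragraph break was added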

This check immediately surfaced:

  • Summarization
  • Deletions
  • Over-editing

It also made prompt iteration measurable instead of intuitive. When I compared runs across model sizes and prompt versions, the data was unambiguous: refined prompts on 8B models consistently outperformed 70B models for this task.

Model size was not the constraint. Intent clarity was.

What This Changed for Me

Running LLMs locally forced me to stop thinking of them as assistants and start treating them as system components with failure modes.

The key lessons:

  • Overconditioning is real and subtle
  • Bigger models are not inherently safer
  • Task isolation matters more than clever prompts
  • Multi-pass pipelines outperform “do everything” requests
  • Measurement is essential for trust

This project reshaped how I approach LLM-backed systems. Instead of chasing the most powerful model, I now focus on right-sizing, explicit intent, and validation.

Future work will build on this foundation (chunking strategies, fine-tuning, and additional processing stages), but those only make sense once the basics are sound.

Closing Note

If there’s one takeaway, it’s this: LLMs are most effective when you ask them to do less, not more, and when you design your system to notice when they don’t.

You can follow along with the project on GitHub: https://github.com/Scott123180/llm-speaker.