Category: Artificial Intelligence, AI

Building a Local AI Voice System (When the Hard Part Isn’t AI)

Photo by DS stories on Pexels.com

Yesterday was one of those days where progress didn’t look like progress.

On paper, the goal was simple: refine a local voice-cloning pipeline so I can turn finished text into audio using my own voice, entirely on my laptop, without cloud services, subscriptions, or dashboards.

In reality, it turned into a long, winding encounter with the friction that lives between tools, not inside them.

This is the part of building software people rarely talk about. Not the big idea. Not the demo. The part where everything technically works, but nothing quite behaves.

The idea was never “AI for AI’s sake.”

I didn’t wake up wanting to clone my voice because it’s novel.

I wanted a way to think out loud, edit deliberately, and then re-embody that writing as audio without rerecording everything.

Same philosophy as my transcription system: keep cognition flowing, remove mechanical friction, and keep judgment in human hands.

So the constraints were clear:
- Everything runs locally
- No data leaves my machine unless I choose it
- No subscriptions
- No dashboards
- Drop files in folders, get results out
That part is solid. The system exists.

But yesterday wasn’t about architecture. It was about tuning.

When “close” is the most frustrating distance

I had a working voice clone. It spoke. It sounded human.

It even sounded vaguely like me. And that was the problem.

It was almost right.

The pitch was too high. Or too constant. The cadence too fast. The pauses unnatural. It sounded like me if I were permanently mid-sentence, slightly caffeinated, and voicing a cartoon I didn’t agree to.

This is where the work stopped being about AI and started being about listening.

You can’t debug voice by reading logs. You have to hear it.

Over and over. Tiny changes. One parameter at a time. Change speed. Listen. Change pitch. Listen. Add breathing pauses. Listen again.

This is typography, not programming.

You don’t “optimize” a voice. You kern it.

Toolchains don’t fail—seams do

Nothing actually broke yesterday.

Whisper worked.

XTTS worked.

FFmpeg worked.

Homebrew worked.

Python worked.

What failed was expectation alignment between tools.
- Python versions that were technically compatible but practically hostile
- Libraries emitting warnings that weren’t errors, just noise pollution
- Audio files that were “valid” but not ideal
- Defaults that made sense for demos, not humans
At one point I spent an unreasonable amount of time just trying to measure pitch properly because every shortcut tool failed quietly.

Only when I dropped down to Praat did the truth become visible: my average pitch sat right where I expected.
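For anyone who wants a quick sanity check before opening Praat, here is a rough sketch of pitch estimation by normalized autocorrelation. It is not Praat's algorithm and not the code from this project. The function name, the 75 to 300 Hz search range, and the synthetic test tone are all assumptions made for illustration.

```python
import math

def estimate_f0(samples, sample_rate, fmin=75.0, fmax=300.0):
    """Return a rough fundamental frequency in Hz for one voiced frame."""
    min_lag = int(sample_rate / fmax)  # shortest period we search for
    max_lag = int(sample_rate / fmin)  # longest period we search for
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        head = samples[:-lag]
        tail = samples[lag:]
        num = sum(a * b for a, b in zip(head, tail))
        den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
        score = num / den if den else 0.0  # 1.0 means a perfect repeat
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

# Synthetic 100 Hz tone (0.1 s at 8 kHz) standing in for a voiced frame.
rate = 8000
tone = [math.sin(2 * math.pi * 100 * n / rate) for n in range(800)]
f0 = estimate_f0(tone, rate)
```

Real speech is noisier than a sine wave, so a number from a sketch like this is only a cross-check. The listening still has to happen.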

The voice model wasn’t “wrong”—it was overconstrained.

That realization matters. It means the fix isn’t “more data” or “better models.” It’s restraint.

The emotional tax of almost-automation

This is the part that doesn’t show up in README files.

Every detour costs attention.
Every rabbit hole costs energy.

When you’re doing this work alone, late, after already solving the hard problems, it’s easy to lose your way and start questioning decisions that were correct.

At one point I caught myself thinking: Why didn’t I just use a service?

And then immediately remembered why.

Because services hide tradeoffs. They optimize for scale, not thoughtfulness. They turn authors into operators. They’re not private.

What yesterday actually produced

Even if it didn’t feel like it, yesterday locked in something important:
- A clean voice anchor with the right pitch range
- A stable speed setting that feels human
- Real breathing and pause handling
- A repeatable way to evaluate changes instead of guessing
- A clear line between “voice quality” work and “automation” work
Most importantly, it clarified what not to do next.
- No more parameter thrashing.
- No more chasing novelty.
- No more piling automation on an unstable core.
That’s the same lesson I learned building the transcription pipeline.
- Stability first.
- Then convenience.
- Then publishing.
Why this matters beyond this project

This isn’t really about voice cloning.

It’s about building tools that respect the person using them. Tools that don’t rush. Tools that don’t decide for you. Tools that let you stay in the work instead of managing the work.

Yesterday was a reminder that the hardest part of building these systems isn’t AI. It’s patience. It’s taste. It’s knowing when to stop turning knobs and start listening.

I went to bed knowing the voice is close enough to keep going—and that’s the right place to pause.
February 2, 2026
Sunday Reflection: Good Work on Building Barely, But Here

Sunday night.

I’m feeling tired physically and mentally. Physically mostly because today was shoveling day. I spent about two and a half hours this afternoon shoveling, and earlier this morning I spent at least an hour doing the first round. Overall we got more than 20 inches and it’s still snowing outside. So yeah. Quite a day of snow.

The week felt productive.

What?

I shared two articles online. Recorded, edited, published them. Created images for them. Did all the SEO.

I also worked a lot on my Chrome extension project, which involves managing YouTube. There isn’t a very good tool in the Chrome Web Store, at least not working the way I think it should, so the objective is to create a tool that really performs. What’s interesting is that conceptually and functionally, it’s not a very difficult problem. What’s difficult is figuring out the architecture and what’s going on inside. It’s sort of mysterious black box stuff.

And I’m now at the point where I’m taking the project from ChatGPT over to Google Antigravity. With ChatGPT, everything is slowly getting better, but it’s a lot of back and forth.

I ask for something, get code, set it up, try it, give feedback, ask for tweaks… on and on. Just to get basic functionality sometimes takes multiple hours across multiple days.

What’s interesting about Antigravity is they have a working agent system that, if given the right instructions, can perform tasks autonomously without human intervention. It can write code, test it, figure out if it works, and spend two hours chasing the solution instead of me sitting here doing the back and forth.

Big advancement in that regard. I’m damn close to solving the last stubborn piece of the extension, and I’ve been prepping the documentation so I can hand the whole thing to an agent and let it work as long as it needs to.

Meanwhile, I can watch YouTube, take a nap, or go shovel more snow.

Even more interesting is how far I got on my voice cloning project.

Right now I’m very close to having a near-perfect mirror of my voice so it actually sounds like me. The objective is: when I get done writing, instead of me having to read it, I can feed it to a folder, have it processed, and out the other side comes an audio file that’s “me” reading the entire text. I think I’m on version 12. It’s been a fascinating process. I learned new terms and what they do to a synthetic voice, including prosody, which covers the rhythm, stress, and intonation of speech. We don’t talk flat all the time.

On that side, things have felt clearer. I’ve finally been able to focus on something and make progress. And I also made huge ground on the Barely, But Here front, sharing more about the journey I’ve been on trying to get my feet on the ground and rebuild my life. It’s not an easy thing, particularly after what I’ve been through and what I’m still going through. Although I will say it does feel like I’m on the tail end of things.

I also had a good week of counseling.

If I think back: did I mask at all this week? Yes, I did. I masked with relatives. I didn’t mask at counseling. That would be kind of stupid. And the masking doesn’t really cost me anything with relatives. It gives me space and quiet and peace, so it’s a cost worth paying.

If I strip away all the bits, what feels most essential and core is that I had a very positive week in the 10 hours of therapy I did last week. It’s been intensive, which is why it’s called intensive therapy.

So what is everything asking of me going into next week? More sharing for Barely, But Here. I’m going to create an orientation page to help people understand what it is: start here, read this, that kind of thing. Because it’s heavy stuff. People could probably read one. I don’t know if you want to read two. It’s a lot, and it takes time. But you want to point people in the right direction if they’re looking for something specific.

I’m also interested in seeing where Google Antigravity takes me next week. And things have felt better depression-wise.

I don’t feel completely healed.

I still feel a bit in survival mode.

I’m still very frustrated about my situation.

This is where I’m at tonight.

January 25, 2026
I’ve nearly completed a transcription publishing pipeline I’ve wanted since 2005
Audio in. Text out. Publishing only when I say so — all on my own machine. Image made with AI.

I’ve always wanted a transcription machine because for years, typing has been a bottleneck.

Not thinking.
Not clarity.
Not ideas.

Typing.

Back when the first iPhones came out, I had a simple wish:

let me talk, and let my words appear in my blog.

At the time, that was fantasy. Speech recognition existed, but only in research labs, big companies, or cloud services that didn’t really work well and definitely weren’t private. I moved on, kept typing, and learned to live with the speed limit.

Fast-forward to now.

Modern hardware.

Local machine learning.

Open models.

Enough computing power sitting on my desk to do what used to require a lab.

So I finally did it.

I built a fully local voice cloning and publishing pipeline on my own laptop. No cloud inference. No subscriptions. No dashboards. No usage caps. No data leaving my machine unless I explicitly choose it.

My intellectual property never leaves my machine unless I explicitly choose it.

That constraint mattered more than the tech itself.

What I wanted (and what I refused)

I didn’t want:
- another AI subscription
- another web interface
- another service asking me to “upgrade” my own brain
- another place my raw thoughts were stored on someone else’s servers
I wanted:
- text → audio
- audio → text
- both directions
- locally
- for free
- automated, but only when I asked for it
The tool I built

At a high level, the system now does two things:
1. Transcription
  - I record audio
  - Drop it in a folder
  - Whisper runs locally on Apple Silicon using Metal
  - Clean, readable text appears
  - Optional publishing happens only if I explicitly speak intent
2. Voice synthesis
  - I provide my own voice reference
  - Text files dropped into a folder become .m4a files
  - The voice is mine
  - The processing is local
  - The output is mine to keep or discard
No GPU calls inside Python ML stacks.

No fragile cloud dependencies.

No long-running services pretending to be “magic.”

Just files, folders, and clear contracts.

Why this is finally possible

In 2008, this idea simply wasn’t realistic.

Speech models weren’t good enough. Hardware wasn’t accessible. Tooling didn’t exist outside academic circles.

Today, it is.

Not because of one model or one framework, but because the ecosystem finally matured:
- open speech models
- commodity GPUs
- local inference
- better system-level tooling
This is the kind of problem that’s only solvable now.

What this unlocks for me

I can think out loud without restraint.

I can write at the speed of thought.

I can turn raw thinking into drafts without ceremony.

And I can do it knowing:
- my data stays local
- my voice is mine
- my process is under my control
This isn’t a product (yet).

It’s a personal tool.

But it’s also a case study in how I approach problems:

constraints first, workflow second, technology last.

If you’re curious how it works in detail, I’ve written more about the architecture and tradeoffs here:

👉 My Local Transcription Pipeline

More soon.
January 18, 2026
Audio Transcribed into WordPress Draft; Completely Private
Privacy wasn’t a feature — it was a constraint

The original idea didn’t start as “an AI project.” It started as a very specific itch I’ve had since 2007/08, right when smartphones began making it effortless to record audio.

Back then, I was already deep inside the WordPress ecosystem, even custom-coding templates. And I had a simple wish: let me think out loud, then have the text show up in my blog.

Not because I love transcription.
Because I love unrestrained thinking.

Typing is a speed limit.
Speaking is closer to the velocity of thought.

When an idea is moving fast, typing becomes friction, and friction becomes loss. So the dream was: record the thought while it’s alive… then let it become editable text later, when I’m calm and focused.

That idea wasn’t really solvable for regular people at the time.

Speech-to-text existed, but not at this level, not locally, and not with reliability that you’d trust for a real workflow. If you had access to a lab-grade setup in the late 2000s, you might have been able to stitch something together. Most of us didn’t have that. I definitely didn’t.

Fast-forward to now: Apple Silicon is absurdly capable, Whisper-class transcription is accessible, and “local-first” tooling has finally caught up with what I was after 15+ years ago.

And that’s where this project actually begins.

Free mattered more than anything

I wanted this to be free to run forever.

Every cloud service I tried eventually turned into the same contract:
- free minutes (daily/weekly/monthly)
- hit the ceiling
- pay to keep going
I’m not morally opposed to paying for good tools.

But I knew I’d burn through limits fast because I don’t want to ration thinking. If I’m on a roll, I’m on a roll.

I don’t want a dashboard telling me my brain is now in “premium mode.”

So the goal became clear:

Record audio → drop into a folder → get a transcript.
Optionally: a WordPress draft waiting for me.

No subscriptions.
No login loops.
No cloud inference by default.

And this line ended up becoming the north star:

“My intellectual property never leaves my machine unless I explicitly choose it.”

That’s not paranoia. That’s design.

Privacy stayed inside my orbit

I’m in the Apple ecosystem, which made the privacy model unusually clean.

The audio starts on my iPhone.

The processing happens on my MacBook Pro.

The transfer happens via AirDrop, which keeps the file movement inside my immediate environment.

The audio doesn’t need to touch a third-party server just to become text.

That matters for obvious reasons (privacy), but also for less obvious ones (creative freedom). When you’re speaking raw ideas, you’re not just recording words.

You’re recording unreleased drafts of your thinking.
That’s intellectual property, even if it’s messy.

So the system architecture became a kind of promise:
- Local transcription
- Local automation
- Local storage
- And publishing only happens when I explicitly authorize it
The real breakthrough: a spoken publishing contract

The most important part of this system isn’t Whisper. It’s the rule that prevents automation from turning into a runaway machine.

This is the difference between:
- Automation that empowers
- Automation that erodes judgment
So I designed a “spoken contract” that the system must hear before it does anything beyond transcription.

A transcript only becomes a WordPress draft if I say both:
- “Meta note” (or “System note”)
- “Create blog post” (or “Create a blog post”)
That’s it. If I don’t say the words, the system stays quiet.

That means I can record:
- personal notes
- sketch ideas
- work drafts
- private reflections
…and the system will transcribe them, but it won’t publish them. No accidental posts. No surprises. No “AI guessed what you meant.”
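Here is roughly what that contract could look like in Python. This is a minimal sketch, not the project's actual code. The function names and the normalization step are assumptions. Only the four phrases come from the rule described above.

```python
import re

TRIGGERS = ("meta note", "system note")
COMMANDS = ("create blog post", "create a blog post")

def normalize(text):
    """Lowercase and drop punctuation so 'Meta note.' matches 'meta note'."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())

def contains_phrase(spoken, phrases):
    """Match whole phrases only, so 'meta notes' does not count as 'meta note'."""
    padded = f" {spoken} "
    return any(f" {phrase} " in padded for phrase in phrases)

def should_create_draft(transcript):
    """True only when both halves of the spoken contract were said."""
    spoken = normalize(transcript)
    return contains_phrase(spoken, TRIGGERS) and contains_phrase(spoken, COMMANDS)

example = "Meta note. Create a blog post. Today was shoveling day."
draft_requested = should_create_draft(example)
```

The default answer is no. A transcript that mentions only one of the two phrases, or neither, stays a transcript.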

This is production-grade behavior, not a demo.

The final stack (and why it’s the right one)

We started in the Python ecosystem because that’s where most “AI workflow” advice leads. But on macOS, the most durable lesson I learned was this:

If you want long-running, stable, GPU-accelerated transcription on Apple Silicon, prefer native Metal tooling over Python ML stacks.

Python is great for:
- glue
- orchestration
- parsing
- publishing logic
But it’s not where you want to host GPU inference if your goal is “drop audio and walk away.”

So the final system has three responsibility layers:
1. Shell + whisper.cpp: audio → text (Metal GPU, local, stable)
2. Python (glue only): parse intent + publish to WordPress
3. Launch Agents: daemonized lifecycle so it runs automatically
No ML runtime lives in Python.
No GPU calls happen outside native code.
No process depends on another being “just right.”

That’s how systems survive.
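As an illustration of the third layer, here is a sketch that generates a Launch Agent property list with Python's standard `plistlib`. The label, script path, and folder paths are hypothetical, and whether the real system uses `WatchPaths` or a long-running watcher is an assumption on my part. The keys themselves are standard launchd keys.

```python
import plistlib

def build_launch_agent(label, program_args, watch_dir, log_path):
    """Return launchd property-list XML (bytes) for a folder-triggered job."""
    agent = {
        "Label": label,                       # unique job name
        "ProgramArguments": list(program_args),
        "WatchPaths": [watch_dir],            # run the job when this path changes
        "RunAtLoad": True,                    # also run once when the agent loads
        "StandardOutPath": log_path,
        "StandardErrorPath": log_path,
    }
    return plistlib.dumps(agent)

# Hypothetical values, for illustration only.
agent_xml = build_launch_agent(
    label="com.example.transcribe",
    program_args=["/bin/zsh", "/Users/me/bin/transcribe.sh"],
    watch_dir="/Users/me/Transcribe/inbox",
    log_path="/Users/me/Transcribe/transcribe.log",
)
```

A file like this would live in `~/Library/LaunchAgents` and be loaded with `launchctl`, after which macOS owns the lifecycle instead of a terminal window someone has to remember to keep open.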

What’s next: formatting, tags, and polish

Now that the pipeline is stable, the remaining work is refinement:
- timestamps in the transcript (useful for editing)
- paragraph breaks based on pauses (conservative first guess: a pause of 1.5 seconds or longer)
- a word-count footer in the transcript and the WordPress draft; this helps me when I start editing
- simple auto-tags based on frequency (top ~5–7, biased toward broad concepts but specific when warranted; content and context based)
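Two of those items, pause-based paragraph breaks and the word-count footer, can be sketched in a few lines. This is an assumption-heavy illustration, not the planned implementation. It assumes the transcriber can hand back segments as `(start_seconds, end_seconds, text)` tuples, and the function names are made up.

```python
PAUSE_SECONDS = 1.5  # the conservative threshold from the list above

def paragraphs_from_segments(segments, pause=PAUSE_SECONDS):
    """Group (start, end, text) segments, breaking where the silence is long."""
    paragraphs, current, prev_end = [], [], None
    for start, end, text in segments:
        gap = 0.0 if prev_end is None else start - prev_end
        if current and gap >= pause:
            paragraphs.append(" ".join(current))
            current = []
        current.append(text.strip())
        prev_end = end
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

def with_word_count(paragraphs):
    """Join paragraphs with blank lines and append a word-count footer."""
    body = "\n\n".join(paragraphs)
    return f"{body}\n\nWord count: {len(body.split())}"

segments = [
    (0.0, 2.0, "I record ideas when I walk."),
    (2.3, 4.0, "Typing feels like work."),
    (6.0, 8.0, "So I built a pipeline."),  # follows a 2.0 s pause
]
paragraphs = paragraphs_from_segments(segments)
transcript = with_word_count(paragraphs)
```

The threshold is a single constant on purpose, since it will almost certainly need tuning by ear.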
None of those features change the heart of the project.

The heart is still the same thing I wanted in 2007:

A way to think out loud at full speed… and turn it into text without handing my raw ideas to someone else’s servers.

And now it finally exists.
December 26, 2025
My local transcription pipeline.
“Beta Beta. I’m not a tomato.”
Transcription time.

Let me explain what I’ve been up to. So I got it in my head that if I’m going to share ideas, thoughts, and projects, the best way for me to do it is with audio recordings. So I’m not typing.

This isn’t a tutorial. It’s a description of a system I trust enough to run unattended.

I wasn’t trying to build an AI product.

I was trying to remove friction from my thinking.

I record ideas when I walk, when I’m tired, when typing feels like work. What I wanted was a system that respected that reality — not one that asked me to adapt to it.

I wanted a way to record my thoughts out loud and have them turn into usable text automatically.
- No subscriptions.
- No dashboards.
- No “AI workspace.”
- No babysitting.
- Full privacy and content ownership.
Just: record audio → drop it in a folder → get a transcript. And sometimes, a WordPress draft waiting for me.

Most tools that do this well cost money, require logins, or quietly train on your data. So I built my own local transcription and publishing pipeline on macOS instead — using my GPU, native tools, and a small amount of glue code.

Here’s what the system does:
- Watches a folder for audio files
- Converts them if needed
- Transcribes locally using my Mac’s GPU
- Writes a clean text file
- Optionally creates a WordPress draft — only if I explicitly ask for it
That’s it.
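The first two steps can be sketched in Python. This is not the code from the project, only a minimal illustration. The folder layout, the list of audio extensions, and the idea that the transcriber wants WAV input are all assumptions.

```python
import tempfile
from pathlib import Path

AUDIO_SUFFIXES = {".m4a", ".wav", ".mp3"}  # assumed set of recorder formats

def pending_audio(inbox, transcripts):
    """Return audio files in `inbox` with no matching .txt in `transcripts`."""
    todo = []
    for audio in sorted(Path(inbox).iterdir()):
        if audio.suffix.lower() not in AUDIO_SUFFIXES:
            continue  # ignore stray non-audio files
        if (Path(transcripts) / f"{audio.stem}.txt").exists():
            continue  # already transcribed
        todo.append(audio)
    return todo

def needs_conversion(audio):
    """Assume the transcriber wants WAV; anything else gets converted first."""
    return Path(audio).suffix.lower() != ".wav"

# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as root:
    inbox, transcripts = Path(root, "inbox"), Path(root, "transcripts")
    inbox.mkdir()
    transcripts.mkdir()
    for name in ("idea.m4a", "walk.wav", "notes.txt"):
        (inbox / name).touch()
    (transcripts / "idea.txt").touch()  # pretend idea.m4a is already done
    waiting = [p.name for p in pending_audio(inbox, transcripts)]
```

Deriving the to-do list from what is on disk, rather than from a separate state file, means a crash or reboot loses nothing: the next run sees the same folders and picks up where things stopped.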

Under the hood, this uses whisper.cpp, macOS LaunchAgents, and a small amount of Python glue — but the details matter less than the contract.

This is talking that’s been transcribed to text. I built software that runs on macOS.

You simply sit there and watch the progress go by, and you’ll end up seeing a transcript with TXT extension inside another folder. What’s the big deal, right?

The audio file and the transcript never leave my laptop. My audio file doesn’t go to the cloud.

My transcription file does not come from the cloud.

I have built a miniature AI workflow.

What do I mean by that?

My technical background is in solving problems inside software and computer systems. So I have the background to make this work. If all goes well, this audio file will go through what I’ve built.

There’s a difference between:
- automation that empowers judgment
- automation that erodes it
Most tools optimize for speed or scale. I wanted something that optimized for trust — especially when the system is running while I’m not paying attention.

So if you are a developer on macOS or you like a little adventure on your laptop and are brave enough to jump into the terminal, you can safely give it a go. However, I cannot be held liable if you somehow issue a command that erases your hard drive.

My next step for this software is to create an installer file you can drop on your laptop. If you give it permission, it will install exactly the system I have working.

Let me see how this experiment goes. Hopefully this will go through transcription with flying colors and at that point, let’s take a pause, regroup, check out the progress and then let’s move forward again. It doesn’t matter if it feels like an inchworm or hopping like a frog.

Let’s see if we can get this to work.

If interested, here is another post about this project.

Here is a simple diagram showing what’s going on in this process.

OK, here is a video recording of the beta test with the results.

It works!

The transcription does take a long time. I need to deal with that. However, the big picture is that all the text above happened in that video. That’s a win for now.

I can now record audio while half asleep, drop it in a folder, and walk away.

Later, I’ll find a transcript waiting.
Sometimes, a draft post.
Always, something I chose.

That’s the difference between a tool and a system.

—
Transcribed locally using whisper.cpp (Metal)
https://github.com/berchman/macos-whisper-metal
December 21, 2025
AI helped me develop a free solution to a real problem: using AI in 2025
Hours Working With AI, Dozens of Dead Ends, and Discovering the Right Way to Do Whisper on macOS.

Sometimes progress doesn’t feel like progress. It feels like friction, wrong turns, and the quiet realization that the thing you’re trying to force is never going to cooperate.

This was one of those hours.

In roughly six hours of real human time, I managed to:
- Diagnose Python version incompatibilities
- Run headlong into PEP 668 and Homebrew’s “externally managed” rules
- Set up and tear down multiple virtual environments
- Confirm GPU availability on Apple Silicon
- Discover numerical instability with MPS-backed PyTorch inference
- Identify backend limitations in popular Python ML stacks
- Switch architectures entirely
- And finally land on the correct long-term solution
Not the “it works on my machine” solution. The durable one.

This post is about how I got there, and more importantly, what changed in my thinking along the way.

The Trap: Python Everywhere, All the Time

My first instinct was predictable. Whisper transcription? Python.

Faster-Whisper. Torch. MPS. Virtual environments. Requirements files.

And to be fair, that path mostly works. Until it doesn’t.

On macOS with Apple Silicon, Python ML stacks sit at an awkward intersection:
- PyTorch supports MPS, but not all models behave well
- Some backends silently fall back to CPU
- Others appear to run on GPU while producing NaNs
- Version pinning becomes a minefield
- One Homebrew update can break everything
None of this is obvious at the start.

You only find out after you’ve already invested time and energy trying to stabilize a system that fundamentally does not want to be stable.
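Two cheap guards would have surfaced these failures sooner: making the device choice explicit, and checking outputs for NaNs instead of trusting that a run "on GPU" produced real numbers. The sketch below is an illustration under assumptions, not code from this project. The function names are made up, and it falls back to CPU when PyTorch is not installed.

```python
import math

def pick_device():
    """Return 'mps' only if PyTorch is importable and reports MPS as usable."""
    try:
        import torch
        return "mps" if torch.backends.mps.is_available() else "cpu"
    except Exception:
        # No PyTorch, an old build without the MPS backend, or a broken install.
        return "cpu"

def has_bad_values(values):
    """True if any number in the iterable is NaN or infinite."""
    return any(math.isnan(v) or math.isinf(v) for v in values)

device = pick_device()
clean = has_bad_values([0.12, -0.40, 0.05])
poisoned = has_bad_values([0.12, float("nan"), 0.05])
```

Logging `device` at startup turns a silent CPU fallback into a visible one, and a NaN check on a small slice of model output turns quiet garbage into a loud failure.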

That’s when the signal finally cut through the noise.

The Bigger Takeaway (This Is the Real Value)

I learned a durable rule that I’ll carry forward:

On macOS + Apple Silicon, prefer native Metal tools over Python ML stacks for production workflows.

This isn’t an anti-Python stance. It’s about choosing the right tool for the job.

Python remains excellent for:
- Glue code
- Orchestration
- Text processing
- Automation
- Pipelines that coordinate other tools
But it is not ideal for:
- Long-running GPU inference
- Fire-and-forget background jobs
- Stability-critical systems
- Workflows that should survive OS upgrades untouched
Trying to force Python into that role on macOS is like building a house on sand and then blaming the hammer.

The Pivot: Native Whisper, Native Metal

Once I stopped asking “How do I make Python behave?” and instead asked “What does macOS want me to do?”, the solution became obvious.

whisper.cpp.

A native implementation of Whisper, compiled directly for Apple Silicon, using Metal properly. No Python ML runtime. No torch. No MPS heuristics. No dependency roulette.

Just:
- A native binary
- A Metal backend
- Predictable performance
- Deterministic behavior
I rebuilt the system around that assumption instead of fighting it.

What I Ended Up With (And Why It Matters)

The final system is intentionally boring. That’s the highest compliment I can give it.

I now have:
- A watch-folder transcription system
- Using a native Metal GPU backend
- With zero Python ML dependencies
- Fully automated
- Crash-resistant
- macOS-appropriate
- Future-proof
Audio files dropped into a folder get picked up, moved, transcribed, logged, and written out without intervention. Python still exists in the system, but only as glue and orchestration. The heavy lifting happens where it belongs: in native code.
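The "Python as glue" idea can be sketched as a function that only builds the command line for the native binary and leaves the heavy lifting to it. The flags shown (`-m`, `-f`, `-otxt`, `-of`) are whisper.cpp command-line options. The binary name, model file, and folder names are assumptions, so check them against your own build.

```python
from pathlib import Path

def whisper_command(binary, model, wav_path, out_dir):
    """Build the argument list for a whisper.cpp run that writes a .txt file."""
    wav_path = Path(wav_path)
    out_base = Path(out_dir) / wav_path.stem  # -of takes a path without extension
    return [
        str(binary),
        "-m", str(model),     # model file
        "-f", str(wav_path),  # input audio
        "-otxt",              # write a plain-text transcript
        "-of", str(out_base),
    ]

# Hypothetical paths, for illustration only.
cmd = whisper_command(
    binary="whisper-cli",
    model="models/ggml-base.en.bin",
    wav_path="inbox/idea.wav",
    out_dir="transcripts",
)
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would run the transcription. Keeping the command as a list, not a shell string, avoids quoting problems with file names that contain spaces.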

This is the setup people usually arrive at after months of trial and error. I got there in an afternoon because I stopped trying to be clever and started listening to the platform.

The Project (If You Want the Source Code)

The full pipeline is open-sourced here:

https://github.com/berchman/macos-whisper-metal

It includes:
- A Metal-accelerated Whisper backend
- A folder-watching automation script
- Clear documentation
- A frozen, reproducible system state
- No hidden magic
If you’re on Apple Silicon and want transcription that just works, this is a sane place to start.

The Real Lesson

This wasn’t about Whisper.

It was about recognizing when a stack is fighting you instead of supporting you.

About knowing when to stop patching and switch up entirely.

About respecting the grain of the operating system instead of sanding against it.

The tools we choose shape not just our code, but our cognitive load. The right architecture doesn’t just run faster. It lets you stop thinking about it. And sometimes, that’s the whole point.

Be well.
December 15, 2025
