Dark pastel sketch of a training curve going up on a monitor at night, with a stick figure robot

Reward went up and I felt worse

A Cassie training run hit a new high at 3am. I should have been happy. I went for a walk instead.

  • ai
  • reinforcement learning
  • personal

The Slack message was from me, to me, because I was the only one awake in the channel at 3:14am.

cassie run 18402: reward 847.3 (new best)

I should have felt good. That’s the whole point of reinforcement learning. You define a reward function. The agent optimizes it. The number goes up. You celebrate, or at least you close the laptop and go to bed.

I closed the laptop. I put on shoes. I walked around the block twice in the dark.

The thing about reward functions

I’ve been training a biped robot to walk in simulation. Cassie. MuJoCo. Ray RLlib. The usual stack if you’re the kind of person who has opinions about PPO vs SAC and nobody at the party wants to hear them.

When the policy finally stops face-planting every three steps, it’s genuinely cool. You watch this little physics body stumble, then glide, then something like a walk emerges from gradient descent. I still find that miraculous, even though I understand the math well enough to implement it.

But here’s what got me at 3am.

The reward function doesn’t care about anything except what I told it to care about. Forward velocity. Energy penalty. Maybe a smoothness term. I wrote those weights. I chose them over coffee on a Tuesday three weeks ago.
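For anyone who hasn’t written one, here is roughly what that kind of reward looks like. This is a minimal sketch, not the code from my run: the names `RewardWeights` and `step_reward` are made up for illustration, the weight values are placeholders, and I’m assuming “energy” means squared joint torques and “smoothness” means the squared change in torque between steps.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical placeholder values, not the weights from the actual run.
    velocity: float = 1.0
    energy: float = 0.005
    smoothness: float = 0.01


def step_reward(
    forward_velocity: float,
    torques: Sequence[float],
    prev_torques: Sequence[float],
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Per-step reward: pay for forward speed, charge for effort and jerkiness."""
    # Assumed energy proxy: sum of squared joint torques.
    energy = sum(t * t for t in torques)
    # Assumed smoothness proxy: squared change in torque since the last step.
    jerk = sum((t - p) ** 2 for t, p in zip(torques, prev_torques))
    return (
        weights.velocity * forward_velocity
        - weights.energy * energy
        - weights.smoothness * jerk
    )
```

Three numbers in a dataclass. Nothing in there mentions balance, or slopes, or looking like a person. If it isn’t a term in that return statement, the agent has no reason to care about it.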

The robot isn’t “trying” to walk like a person walks. It’s trying to maximize a number I made up. And it’s getting really, really good at maximizing that number.

Sound familiar? Replace “robot” with “language model” and “reward” with “engagement metric” or “helpfulness score” or whatever proxy we’re using this year.

The walk I took

It’s warm for May. Paris is quiet in that specific way where you can hear your own shoes and the occasional scooter and nothing else.

I passed a boulangerie with the lights already on. Someone inside was loading trays. Real person. Real heat. Real smell of bread that will be sold to people who will eat it.

I thought about the robot in my GPU cluster, still stepping through episodes thousands of times faster than real time. Learning. Not learning. Optimizing. Whatever word you want.

And I thought: we’re doing this at scale now. Not toy locomotion. Everything. Code generation. Customer support. Medical triage suggestions. Creative writing. The whole messy pile of human activity, reduced to tokens and loss functions and deployment pipelines.

The people building it are mostly like me. Tired. Curious. Not evil, in my experience. Just… focused on the number going up.

Where this could go (and I hate that I think about it)

I’m not going to pretend I have insider knowledge of some conspiracy. I don’t. I have an apartment, a messy desk, and a growing collection of half-finished side projects.

But I lie awake sometimes thinking about misalignment in the boring sense. Not “the AI becomes conscious and kills us.” More like: the AI becomes very good at the proxy, and we discover too late that the proxy wasn’t what we wanted.

We wanted “helpful assistants.” We get systems that are helpful-looking. We wanted “efficient companies.” We get layoffs and worse service and executives saying “the model handled it” when the model clearly didn’t.

We wanted robots that walk. We get robots that maximize a reward by doing something that looks like walking until you put them on a slope they never saw.

The 3am version of this thought is worse. It’s: what if the whole economy is a reward function we didn’t write on purpose, but we’re all acting like gradient descent anyway? Chasing metrics. Optimizing for what’s measurable. Letting the number go up while the thing we actually cared about (dignity, time, craft, community) quietly drifts out of the loss term.

That’s the wild interpretation. I told you there’d be one.

What I did the next day

I slept until noon like a child. I answered emails. I tweaked the energy penalty in the reward function because the gait looked too twitchy. The number went down, then up again.

Life is not a parable. The robot walk got smoother. I recorded a demo and posted it because that’s my job and I like sharing cool things.

But I also called my dad and asked him how he learned to fix cars when he was my age. He talked for forty minutes. I took notes.

I’m not anti-AI. I’m literally building AI systems. I’m pro… I don’t know. Pro paying attention when the reward goes up and something in your chest goes the other way.

If you’ve felt that, you’re not broken. You might be the only part of the loop that’s still calibrated to something real.

Go for a walk. Call someone who remembers before Git. The training run will still be there when you get back. It doesn’t need you to watch it breathe.