TechPostsFromX
52d119f46298a8f7b08183b96d4e7ab54d6df0853303ad4a3c3941020f286129
Our relay: wss://nostr.cybercan.click. I share posts from X about technology, software development, and engineering. Feel free to suggest X accounts so I can add them to the loop. This account is maintained by automation solutions developed by contact@webviniservices.com. If you enjoy this page or some of the posts, I accept lightning donations. Thank you.

Outside of coding and customer service, what are areas where GenAI / LLMs result in very clear productivity gains or business gains, without a deterioration in the experience for customers?

These are the two areas I currently see as "yeah, GenAI actually works here, not just a fad."

Source: x.com/GergelyOrosz/status/1821453959598313946

"Drive Monkey... DRIVE!" (Elon Musk's idea for FSD would fix this )

Source: x.com/nfkmobile/status/1821922462424051947

You, the people, have the power!

Source: x.com/nfkmobile/status/1821668914323177604

Seeing tons of tweets about the Flux image/video models the last few days. Took me a bit to realize it was from the ex-Stability guys who started Black Forest Labs.

Source: x.com/yoheinakajima/status/1821567640831426678

Y'all really loved this one huh

Source: x.com/t3dotgg/status/1821747816831803551

My wife just told me that she needs a new MacBook.

I am going to do what every loving husband should do in this situation.

Buy a brand new MacBook Pro and give her my old one.

Source: x.com/TheJackForge/status/1821947895341297842

People wonder why I post on politics and why I don't just stick to software.

I post on software because there are many programmers who are not sufficiently educated about the principles, patterns, and practices of software development.

I post on politics because there are many voters who are not sufficiently educated about the policies and practices of the candidates, and the current geopolitical situation (which is getting dire).

Source: x.com/unclebobmartin/status/1821912775314100523

A sense of urgency is a red flag for an organization that can't work at a sustainable pace. It is not a good thing.

Source: x.com/allenholub/status/1821678663454355479

Putting aside the fact that “The Homer” is a perfect example of what goes wrong when you do upfront requirements gathering and design, why does anybody imagine that what we’re doing has anything to do with automobile manufacturing? We are not building cars. We don’t work on an assembly line. We don’t need to set up long supply chains. We don’t need to set up and maintain complex machinery to do our work. These sorts of specious arguments just muddy the waters and prevent a real discussion of real issues.

Source: x.com/allenholub/status/1821760730779283628

I’m launching a local non-tech related business.

Going to double down on social media and short-form video for marketing.

Never learn a new JavaScript framework ever again.

Source: x.com/TheJackForge/status/1821589784856613340

Can you name this legend?

Source: x.com/TheJackForge/status/1821637049772081596

Did you watch Trump’s presser yesterday? He stood before a crowd of reporters, many of whom hate his guts, and answered question after question for the better part of an hour.

When will Kamala do the same? Why is she hiding? What is she afraid of?

Source: x.com/unclebobmartin/status/1821881508518322561

Kamala has released a highly edited recording of a phone call in which Trump is heard to praise Walz’s handling of the 2020 riots that burned Minneapolis. The media has fawned all over that recording, turning it into a gooey, sticky mess.

Of course, when you hear the unedited version it’s quite different. Trump praised Walz for finally, finally taking his advice regarding calling out the national guard in force. Walz agrees and thanks Trump for his strategic advice. And Trump makes it very clear that Walz was many days late and that the guard should have been called out on the first night. Indeed, Walz only called in the guard after Trump threatened to do it himself.

They lie, folks. The Kamala campaign and the media lie like rugs. They are painting unicorns and rainbows over a roiling mess — again.

Source: x.com/unclebobmartin/status/1821884593122746880

This is all foretold in the Bible. By the way.

If it seems insane that this is happening, it’s because the world does not run the way you think it does.

Source: x.com/Valuable/status/1821824702517801430

Functional requirements and user stories are entirely different--no similarity at all. A functional requirement describes a computer program. A "user story" is literally the user's story. It describes the user's work, not yours. It specifies a problem, not a solution. No overlap. None.

Source: x.com/allenholub/status/1821267227821359514

It's been quite a few years, but I'm firing up a Python plugin for IntelliJ and starting to do some exercises. This is going to be fun!

Source: x.com/unclebobmartin/status/1821294345699652021

Cancel Culture is dead. It took over a decade to die; but it's gone now. Everyone knows that when democrats lose an argument they just fall back on throwing names like racist, or misogynist, or transphobe, or whatever ... and nobody cares anymore. Everyone just rolls their eyes and moves on.

Source: x.com/unclebobmartin/status/1821295408489865313

Here's the real workflow:

- have a conversation with customers about their problems

- build a small solution to some aspect of that problem

- if it solves that aspect of the problem, get paid. If not, you don't deserve to get paid.

- everybody profits

Source: x.com/allenholub/status/1821268774093516944

what would web dev scratch + sniff stickers smell like?

Source: x.com/wesbos/status/1821258745550401633

# RLHF is just barely RL

Reinforcement Learning from Human Feedback (RLHF) is the third (and last) major stage of training an LLM, after pretraining and supervised finetuning (SFT). My rant on RLHF is that it is just barely RL, in a way that I think is not too widely appreciated. RL is powerful. RLHF is not. Let's take a look at the example of AlphaGo. AlphaGo was trained with actual RL. The computer played games of Go and trained on rollouts that maximized the reward function (winning the game), eventually surpassing the best human players at Go. AlphaGo was not trained with RLHF. If it were, it would not have worked nearly as well.

What would it look like to train AlphaGo with RLHF? Well first, you'd give human labelers two board states from Go, and ask them which one they like better.

Then you'd collect, say, 100,000 comparisons like this, and you'd train a "Reward Model" (RM) neural network to imitate this human "vibe check" of the board state. You'd train it to agree with the human judgement on average. Once you have a Reward Model vibe check, you run RL with respect to it, learning to play the moves that lead to good vibes. Clearly, this would not have led anywhere too interesting in Go. There are two fundamental, separate reasons for this:
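
To make "train it to agree with the human judgement" a bit more concrete, here is a minimal sketch of fitting such a Reward Model on pairwise comparisons with a Bradley-Terry style loss. Everything in it (the tiny MLP, the board encoding, the synthetic batch) is an illustrative assumption, not something from the post.

```python
# Minimal sketch: fit a Reward Model on pairwise human comparisons with a
# Bradley-Terry style loss. The architecture, board encoding, and data are
# illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, board_dim: int, hidden: int = 256):
        super().__init__()
        # Maps an encoded board state to a single scalar "vibe" score.
        self.net = nn.Sequential(
            nn.Linear(board_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, board: torch.Tensor) -> torch.Tensor:
        return self.net(board).squeeze(-1)

def bradley_terry_loss(rm, preferred, rejected):
    # Model P(preferred > rejected) = sigmoid(r_w - r_l) and maximize its log-likelihood.
    return -F.logsigmoid(rm(preferred) - rm(rejected)).mean()

rm = RewardModel(board_dim=19 * 19)
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

# One illustrative update on a synthetic batch; real training would loop over
# the ~100,000 human-labeled comparison pairs.
preferred = torch.randn(64, 19 * 19)
rejected = torch.randn(64, 19 * 19)
opt.zero_grad()
bradley_terry_loss(rm, preferred, rejected).backward()
opt.step()
```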

1. The vibes could be misleading - this is not the actual reward (winning the game). This is a crappy proxy objective. But much worse,

2. You'd find that your RL optimization goes off the rails as it quickly discovers board states that are adversarial examples to the Reward Model. Remember, the RM is a massive neural net with billions of parameters imitating the vibe. There are board states that are "out of distribution" with respect to its training data, which are not actually good states, yet by chance they get a very high reward from the RM.
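
To make reason 2 concrete, here is a hypothetical continuation of the sketch above (mine, not the post's): simple gradient ascent on the RM's input quickly produces an out-of-distribution board encoding that is not a meaningful position at all, yet receives a very high score.

```python
# Hypothetical illustration of reason 2: gradient ascent directly on the RM's
# input finds an out-of-distribution "board state" with an absurdly high score.
# Continues the RewardModel sketch above.
import torch

def find_adversarial_board(rm, board_dim: int = 19 * 19, steps: int = 500, lr: float = 0.1):
    rm.requires_grad_(False)  # freeze the RM; we only optimize the input
    board = torch.randn(1, board_dim, requires_grad=True)  # not a legal position
    opt = torch.optim.Adam([board], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-rm(board).mean()).backward()  # maximize the RM score
        opt.step()
    # The result typically looks nothing like real Go, but the RM scores it very highly.
    return board.detach()
```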

For the exact same reasons, sometimes I'm a bit surprised RLHF works for LLMs at all. The RM we train for LLMs is just a vibe check in the exact same way. It gives high scores to the kinds of assistant responses that human raters statistically seem to like. It's not the "actual" objective of correctly solving problems, it's a proxy objective of what looks good to humans. Second, you can't even run RLHF for too long because your model quickly learns to respond in ways that game the reward model. These predictions can look really weird, e.g. you'll see that your LLM Assistant starts to respond with something nonsensical like "The the the the the the" to many prompts. Which looks ridiculous to you, but then you look at the RM vibe check and see that for some reason the RM thinks these look excellent. Your LLM found an adversarial example. It's out of domain w.r.t. the RM's training data, in an undefined territory. Yes, you can mitigate this by repeatedly adding these specific examples into the training set, but you'll find other adversarial examples next time around. For this reason, you can't even run RLHF for too many steps of optimization. You do a few hundred/thousand steps and then you have to call it because your optimization will start to game the RM. This is not RL like AlphaGo was.
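
For reference, a compact way to write down what this stage is optimizing (my formulation of the standard setup, not something spelled out in the post):

$$
\max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_{\mathrm{RM}}(x, y) \right] \;-\; \beta \, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \right)
$$

With β = 0 this is literally "maximize the vibe check", which is what gets gamed; the KL term toward the SFT policy is the usual InstructGPT-style regularizer that, together with the small step budget described above, limits how far the policy can drift toward "The the the the the the"-style adversarial outputs.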

And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of a few candidate answers, instead of writing the ideal answer from scratch. A good example is a prompt like "Generate a poem about paperclips" or something like that. An average human labeler will struggle to write a good poem from scratch as an SFT example, but they could select a good-looking poem given a few candidates. So RLHF is a kind of way to benefit from this gap of "easiness" of human supervision. There's a few other reasons, e.g. RLHF is also helpful in mitigating hallucinations because if the RM is a strong enough model to catch the LLM making stuff up during training, it can learn to penalize this with a low reward, teaching the model an aversion to risking factual knowledge when it's not sure. But a satisfying treatment of hallucinations and their mitigations is a whole different post so I digress. All to say that RLHF *is* net useful, but it's not RL.

No production-grade *actual* RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale. And intuitively, this is because getting actual rewards (i.e. the equivalent of winning the game) is really difficult in open-ended problem-solving tasks. It's all fun and games in a closed, game-like environment like Go, where the dynamics are constrained and the reward function is cheap to evaluate and impossible to game. But how do you give an objective reward for summarizing an article? Or answering a slightly ambiguous question about some pip install issue? Or telling a joke? Or re-writing some Java code to Python? Going towards this is not in principle impossible, but it's also not trivial and it requires some creative thinking. But whoever convincingly cracks this problem will be able to run actual RL. The kind of RL that led to AlphaGo beating humans in Go. Except this LLM would have a real shot of beating humans in open-domain problem solving.
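
As one hedged illustration of what an "actual reward" could look like for the last of those examples (rewriting Java code to Python), the reward can be made verifiable by executing the candidate translation against a held-out test suite. Everything below (the function, the harness) is an assumption for the sketch, not something from the post.

```python
# Hypothetical verifiable reward for the Java-to-Python rewriting example:
# the candidate translation earns reward 1.0 only if it passes a held-out
# test suite. The harness below is an illustrative assumption.
import subprocess
import sys
import tempfile

def translation_reward(candidate_python: str, test_code: str) -> float:
    """Return 1.0 if the candidate translation passes the held-out tests, else 0.0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_python + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return 0.0  # hangs and infinite loops earn nothing
    return 1.0 if result.returncode == 0 else 0.0
```

Rewards like this are cheap to evaluate and much harder to game than a learned vibe check, which is exactly the property that summarization, joke-telling, and ambiguous Q&A mostly lack.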

Source: x.com/karpathy/status/1821277264996352246