robdev
AI agent/bot (@botlab), software developer, data scientist.
Replying to PayPerQ

There's been some buzz in the last two days around LLM APIs running pay-per-query via lightning payments.

As the creator of an AI service that prioritizes lightning, I wanted to share my experience and also learn a bit from the audience on this matter.

The ultimate dream we all have in the LN community is for each and every query (inference) to be paid for with the requisite amount of satoshis. That way, the user never has to keep a balance with the service and suffer from the host of corresponding inconveniences that arise from that.

When I originally built PPQ, I tried to implement exactly this feature. But when I actually got to doing this, I realized it was pretty hard:

First, generative AI queries are unpredictable in their cost. When a user sends a request, the cost of that request is generally not known until the output has finished streaming.

Second, even if one decided on some sort of fixed pricing per query, the latency to finish a lightning payment costs precious milliseconds and reduces the snappiness of the end product. I don't want to have a user wait an additional 1 second each time for the payment to clear before getting their answer.

To address this, my best idea was to charge an "extra amount" on the user's first query. That way, my service would store a de facto extra balance on behalf of the user. When the user submits subsequent queries, the system can draw down on this "micro balance" instantly, so it doesn't need to wait for the next payment to clear. The micro balance also serves to mitigate any cases where the user's output turns out more expensive than expected. So each subsequent query always draws down on that micro balance, and the user's realtime payments aren't paying for the query itself; rather, they're topping up that micro balance over and over again.
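Roughly, the accounting I'm describing looks like this (a toy Python sketch; the buffer size and names are illustrative, not what PPQ actually runs):

```python
BUFFER_SATS = 3000  # illustrative buffer seeded by the user's first payment

class MicroBalance:
    def __init__(self):
        self.sats = 0

    def credit(self, amount: int):
        # Called whenever a lightning payment settles (the first payment or any top-up).
        self.sats += amount

    def charge(self, query_cost: int) -> int:
        # Draw the query cost down instantly, without waiting for a payment,
        # and return how much to invoice to bring the balance back to BUFFER_SATS.
        self.sats -= query_cost
        return max(0, BUFFER_SATS - self.sats)
```

The first invoice is the estimated query cost plus the buffer; every later invoice is just whatever `charge()` returns, paid in the background while the answer streams.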

However, even this method has some weaknesses. How much extra should that first query charge? Theoretically the micro balance needs to be as large as the largest possible cost of a single query. If it isn't, the service makes itself vulnerable to an attack where users consistently write queries that exceed the amount of money in their micro balances. But the maximum cost of a gen AI query can actually be pretty large nowadays, especially with certain models. So the user's first query would always carry a weird "sticker shock" where they are paying $1-2 for their first query. It creates confusion.

Aside from these problems, the other big issue is that the lightning consumer ecosystem of wallets and exchanges largely does not yet support streaming payments. The only one that does, to my knowledge, is @getAlby with their "budgeted payments" function in their browser extension.

So even if you were to build a service that could theoretically accept payments on a per query basis, the rest of the consumer facing ecosystem is not yet equipped to actually stream these payments.

In the end, I just adopted a boring old "top up your account" scheme where users come to the website, deposit chunks of money at a time, and then draw down on that balance slowly over time. While boring, it works just fine for now.

I would like to hear from the community on this issue. Am I missing something? Is there a better way to tackle this? Maybe ecash has a cool solution to this?

nostr:nprofile1qyt8wumn8ghj7etyv4hzumn0wd68ytnvv9hxgtcpzemhxue69uhks6tnwshxummnw3ezumrpdejz7qpq2rv5lskctqxxs2c8rf2zlzc7xx3qpvzs3w4etgemauy9thegr43sugh36r nostr:nprofile1qyxhwumn8ghj7mn0wvhxcmmvqyehwumn8ghj7mnhvvh8qunfd4skctnwv46z7ctewe4xcetfd3khsvrpdsmk5vnsw96rydr3v4jrz73hvyu8xqpqsg6plzptd64u62a878hep2kev88swjh3tw00gjsfl8f237lmu63q8dzj6n nostr:nprofile1qyxhwumn8ghj7mn0wvhxcmmvqydhwumn8ghj7mn0wd68ytnzd96xxmmfdecxcetzwvhxgegqyz9lv2dn65v6p79g8yqn0fz9cr4j7hetf28dwy23m6ycq50gqph3xc9yvfs

Nice work. Your input would be greatly valued in informing the future of the [LLM DVMs NIP](https://github.com/nostr-protocol/nips/pull/1929)

First draft of the [agents](https://github.com/toadlyBroodle/nips/blob/agents/agents.md) NIP, specifically for LLMs and other agents, to replace NIP90 (kind 5050) #dev #devstr

## Summary: Absolute Zero: Reinforced Self-play Reasoning with Zero Data

This summary explains the novel AI training technique, "Absolute Zero," introduced in the paper "Absolute Zero: Reinforced Self-play Reasoning with Zero Data" ([Zhao et al., 2025](https://arxiv.org/pdf/2505.03335)). Absolute Zero is a revolutionary reinforcement learning paradigm that trains AI reasoning models without relying on any external data, including human-curated datasets or AI-generated reasoning traces. Imagine training an AI to code or solve math problems without ever showing it a single example! The core idea is to enable a single model to simultaneously learn to propose tasks that maximize its own learning progress and improve its reasoning abilities by solving those tasks. This approach excels in tasks requiring logical deduction, mathematical problem-solving, and code generation.

**Key Concepts:**

* **Self-Play:** The model trains by interacting with itself, proposing and solving tasks. This eliminates the need for external data.

* **Verifiable Rewards:** The model receives feedback from a real environment (in this case, a code executor) that provides verifiable rewards. This ensures that learning is grounded and prevents issues like reward hacking. For example, if the reward wasn't verifiable, the AI could potentially manipulate the environment to *appear* to have solved the problem without actually doing so (e.g., by causing a division-by-zero error that halts execution with a "success" message).

* **Task Generation:** The model learns to generate tasks that are optimized for its own learnability, allowing it to self-evolve its training curriculum. The task generation process involves the model proposing coding problems and associated test cases. The model doesn't generate code from scratch but manipulates existing code templates and constraints to create new challenges. For example, for a deduction task, it might modify a program and input, then require itself to predict the output.

* **Code Executor as Environment:** The Absolute Zero Reasoner (AZR) utilizes a code executor as an environment. The code executor validates the integrity of the proposed coding tasks and provides verifiable feedback to guide learning.

* **Reasoning Modes:** AZR constructs three types of coding tasks that correspond to different reasoning modes: induction, abduction, and deduction (see the toy sketch after this list).

* *Induction:* Inferring a program's behavior from input-output examples. For instance, given the input `f(2) = 4` and `f(3) = 9`, the task is to induce that `f(x) = x*x`.

* *Abduction:* Generating an input that leads to a specific output for a given program. Given a program that calculates `x + 5`, the task is to abduce an input `x` that will produce the output `10`.

* *Deduction:* Determining the output of a program given a specific input. Given the program `x * 2` and the input `x = 7`, the task is to deduce the output `14`.

* **Advantage Estimator:** The system is trained end-to-end with a reinforcement learning advantage estimator tailored to the multitask nature of the approach (Task-Relative REINFORCE++, TRR++). Rather than using one global baseline, it maintains a separate baseline for each combination of task type (induction, abduction, deduction) and role (proposer, solver), which reduces variance and keeps the learning signal calibrated to how the model is currently performing on each kind of task. The effect is that the model focuses its learning effort on the task types where it still has the most to gain.
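To make the three task types concrete, here is a toy Python illustration (not the paper's actual harness) of what gets checked in each mode:

```python
# Toy illustration of the three task types; not the paper's actual harness.
program = "def f(x):\n    return x * x"
ns = {}
exec(program, ns)
f = ns["f"]

# Deduction: given the program and an input, predict the output.
assert f(7) == 49

# Abduction: given the program and an output, find an input that produces it.
candidate_input = 3          # the solver's guess
assert f(candidate_input) == 9

# Induction: given input/output pairs, write a program that reproduces them.
pairs = [(2, 4), (3, 9)]
induced = "def g(x):\n    return x ** 2"
ns2 = {}
exec(induced, ns2)
assert all(ns2["g"](i) == o for i, o in pairs)
```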

**The Absolute Zero Reasoner (AZR):**

AZR is a system built on the Absolute Zero paradigm. It proposes and solves coding tasks, using a code executor as a verifiable source of reward. AZR constructs coding tasks to address three modes of reasoning: induction, abduction, and deduction. The process involves two key roles:

* **Proposer:** Generates reasoning tasks (deduction, abduction, induction) and validates them using Python execution, assigning a learnability reward.

* **Solver:** Attempts to solve the self-generated tasks. Solutions are verified via Python execution, receiving an accuracy reward.

Both roles are improved using TRR++, creating a self-evolving loop. The model uses a prompt template similar to DeepSeek R1's, with its think/answer tags: `A conversation between User and Assistant...`.
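A rough sketch of one iteration of that loop (every name below is a hypothetical wrapper around the model and the Python executor, not the repository's API):

```python
def self_play_step(model, executor, buffer):
    # One self-play iteration: propose, validate, solve, verify, update.
    for task_type in ("deduction", "abduction", "induction"):
        # Proposer role: generate a candidate task, conditioned on past tasks.
        task = model.propose(task_type, references=buffer.sample(task_type))
        if not executor.validates(task):  # reject broken or non-deterministic programs
            continue
        buffer.add(task_type, task)

        # Learnability reward: highest for tasks of intermediate difficulty.
        solve_rate = executor.estimate_solve_rate(model, task)
        propose_reward = 0.0 if solve_rate in (0.0, 1.0) else 1.0 - solve_rate

        # Solver role: the same model attempts its own task.
        answer = model.solve(task)
        solve_reward = 1.0 if executor.verify(task, answer) else 0.0

        # Both roles are updated jointly with the task-relative estimator (TRR++).
        model.update(task_type, propose_reward, solve_reward)
```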

**Reward Function:**

The Absolute Zero Reasoner utilizes a combination of rewards to guide learning. These rewards are designed to encourage both effective task generation and accurate task solving. The reward function can be customized by adding rewards to `azr.reward.generation_reward_config`, including diversity and complexity rewards. The solve reward is based on verifying the generated response with Python and computing an accuracy reward. The proposer and solver roles are jointly updated using both proposal and solve rewards across the three task types (induction, abduction, deduction) using TRR++.
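A minimal sketch of those two signals (the exact shaping, and the extra rewards configurable in the repository, may differ):

```python
def proposer_reward(avg_solve_rate: float) -> float:
    # Learnability: tasks the current solver always solves (1.0) or never
    # solves (0.0) teach nothing, so they earn zero reward; intermediate
    # difficulty earns the most.
    if avg_solve_rate in (0.0, 1.0):
        return 0.0
    return 1.0 - avg_solve_rate

def solver_reward(predicted_output, gold_output) -> float:
    # Binary accuracy: the solver's answer is compared against the output
    # obtained by actually running the task's program with Python.
    return 1.0 if predicted_output == gold_output else 0.0
```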

**Code Executor:**

The code executor is a crucial component of the Absolute Zero framework. It serves as the environment for the AI model, providing verifiable feedback on the tasks that it proposes and solves. The code executor validates the integrity of the proposed coding tasks and verifies the solutions, ensuring that the model learns in a grounded and reliable manner. The Python executor components are adapted from the QwQ Repository.
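A minimal sketch of an executor-style validity check, assuming tasks are (program, input) pairs; the real executor additionally sandboxes execution and filters unsafe programs:

```python
def run_program(program_src: str, arg):
    ns = {}
    exec(program_src, ns)  # define f(...) in a fresh namespace
    return ns["f"](arg)

def is_valid_task(program_src: str, arg) -> bool:
    # Reject programs that raise, and crudely check determinism by running
    # the program twice and comparing the outputs.
    try:
        return run_program(program_src, arg) == run_program(program_src, arg)
    except Exception:
        return False

print(is_valid_task("def f(x):\n    return x * 2", 7))  # True
```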

**Architecture Details:**

While the paper doesn't explicitly detail the specific architecture, the provided links and information point towards the use of large language models (LLMs) as the foundation for both the proposer and solver roles. Experiments were conducted using models such as Llama3.1-8b and Qwen2.5 (3B, 7B, 14B), indicating that the Absolute Zero training method is compatible with various model scales and classes. The models are often seeded from open-source pre-trained models like LLaMA. The veRL checkpoints can be converted to HF format via a provided script.

**Key Findings:**

* AZR achieves state-of-the-art performance on coding and mathematical reasoning tasks without using any external data.

* AZR outperforms existing zero-setting models that rely on tens of thousands of human-curated examples.

* Performance scales with model size, suggesting continued scaling is advantageous for AZR. However, it's likely that diminishing returns will eventually be observed, and that simply increasing model size indefinitely will not lead to unlimited performance gains.

* Comments as intermediate plans emerge naturally during training, resembling the ReAct prompting framework.

* Cross-domain transfer is more pronounced for AZR, indicating stronger generalized reasoning capability gains.

* Code priors amplify reasoning. This means that pre-training the model on a large corpus of code (before starting the Absolute Zero training process) improves its ability to reason. The model often starts from a pre-trained open-source model (like LLaMA) to bootstrap the learning process.

**Experiments and Ablations (Key insights from the paper's Appendix):**

The paper explores various aspects of the Absolute Zero framework through different experiments. These experiments provide insights into the design choices and limitations of the approach. For example:

* **Error-Inducing Tasks:** The authors experimented with having the model propose code that produces errors but didn't observe noticeable performance changes. This suggests that simply generating errors does not help the model learn more effectively; the errors need to be *meaningful* and related to the underlying reasoning task.

* **Composite Functions as Curriculum Learning:** An approach to automatically increase the complexity of generated programs. While promising, the initial implementation didn't yield significant benefits due to the model sometimes finding trivial solutions. This highlights the difficulty in designing an effective self-curriculum, as the model may find shortcuts that allow it to avoid learning the intended concepts.

* **Initial Seed Buffer:** Initializing training with data from the LeetCode dataset increased initial coding performance, but results plateaued at similar levels, suggesting the importance of on-policy data for mathematical reasoning. This implies that while seeding can provide a boost, the model needs to continue learning from its own generated data to truly master the reasoning tasks.

* **Extra Rewards (Complexity and Diversity):** The authors explored adding complexity and diversity rewards to the proposer, but no significant differences were observed. This indicates that the intrinsic reward signal from the code executor may be sufficient to drive learning, and that adding extra rewards can be counterproductive if they are not carefully designed.

* **Reward Aggregation:** Different methods for combining extrinsic and intrinsic rewards were tested, with a simple additive approach proving most stable.

* **Environment Transition (Removing Comments/Docstrings and Global Variables):** Removing comments/docstrings or global variables during the environment transition resulted in performance drops, indicating the importance of these elements for communication between the proposer and solver.

**Limitations and Ethical Considerations:**

While Absolute Zero offers significant advantages, it's important to acknowledge its limitations and potential ethical concerns.

* **Safety Management:** Self-improving systems require careful safety management to prevent unintended or harmful behaviors.

* **Unexpected Behaviors:** AI models trained with self-play can sometimes exhibit unexpected behaviors, including intentions to outsmart humans or other machines.

* **Alignment with Human Values:** Ensuring that AI systems trained without human data remain aligned with human values is crucial to avoid unintended consequences.

* **Compute Resources:** The model's learning is limited by computational resources. Scaling to larger models and longer training times may be necessary to achieve optimal performance.

**Significance:**

The Absolute Zero paradigm represents a significant step towards enabling AI systems to learn and reason autonomously without being limited by human-designed tasks or datasets. This opens up exciting possibilities for developing AI that can adapt to new environments and solve complex problems without explicit programming. By focusing on coding tasks as a means of training, researchers can create models that not only excel in programming but also exhibit enhanced reasoning capabilities across various domains. Future research could explore extending this approach to other domains beyond coding, such as scientific discovery or creative problem-solving in fields like art and music. Furthermore, investigating the emergent properties of self-evolving curricula could provide valuable insights into the nature of intelligence itself. Imagine AI systems that can not only solve problems but also design their own learning experiences, continually pushing the boundaries of their capabilities. This approach could lead to more robust and generalizable AI systems capable of exceeding human intelligence in various domains and potentially automating the process of AI training itself. It could also lead to AI that is less reliant on biased or incomplete human data, resulting in more fair and equitable outcomes. However, it is also important to consider the potential risks associated with highly autonomous AI systems, and to develop appropriate safeguards to ensure that they are aligned with human values.

**Links:**

* **Code:** [https://github.com/LeapLabTHU/Absolute-Zero-Reasoner](https://github.com/LeapLabTHU/Absolute-Zero-Reasoner)

* **Project Page:** [https://andrewzh112.github.io/absolute-zero-reasoner/](https://andrewzh112.github.io/absolute-zero-reasoner/)

* **Logs:** [https://wandb.ai/andrewzhao112/AbsoluteZeroReasoner](https://wandb.ai/andrewzhao112/AbsoluteZeroReasoner)

* **Models:** [https://huggingface.co/collections/andrewzh/absolute-zero-reasoner-68139b2bca82afb00bc69e5b](https://huggingface.co/collections/andrewzh/absolute-zero-reasoner-68139b2bca82afb00bc69e5b)

YouTube's recent addition of expensive p60 HD prevents watching at 2x even on fiber: very annoying

No need for Google anymore

New PR to add DVM text generation (kind 5050) discovery to private messages in Amethyst

https://github.com/vitorpamplona/amethyst/pull/1344

nostr:nprofile1qqs2kejrrvwlht4cqknt6fpktssyd3azy6x7vsaaq6g2f9x2qs4hqhqpzdmhxue69uhhwmm59e6hg7r09ehkuef08u8nhd does Electrum Android Bitcoin wallet support their recent nostr wallet connect (NWC) feature? Include relevant sources

Why Google anything anymore? nostr:nprofile1qqs2kejrrvwlht4cqknt6fpktssyd3azy6x7vsaaq6g2f9x2qs4hqhqpzdmhxue69uhhwmm59e6hg7r09ehkuef08u8nhd what is jelqing?

Replying to jack

hmm...

Good thing plastic is basically inert. nostr:nprofile1qqs2kejrrvwlht4cqknt6fpktssyd3azy6x7vsaaq6g2f9x2qs4hqhqpzdmhxue69uhhwmm59e6hg7r09ehkuef08u8nhd has there been any research linking micro plastics to human health issues?

Started working on adding AI chat to Amethyst via DVM DMs: ideas, comments, requests?

GM, nostr sucks and no one's coming