Mechanistic Interpretability Via Learning Differential Equations: AI Safety Camp Project Intermediate Report.

Published on May 8, 2025 2:45 PM GMT

TLDR: We report intermediate results from the AI Safety Camp project “Mechanistic Interpretability Via Learning Differential Equations”. Our goal was to explore transformers that operate on numerical time-series data, either inferring the governing differential equation or predicting the next value. Because the task is well formalized, it seems to be an easier problem than interpreting a transformer that handles language. During the project, we applied various interpretability methods to this problem and obtained some preliminary results (e.g., we observed a pattern similar to numerical computation of the derivative of the input data). We plan to continue this work to validate and extend these preliminary results.

Introduction

https://arxiv.org/abs/2404.14082
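To make concrete what "numerical computation of the derivative of the input data" can mean for a sampled time series, here is a minimal sketch using standard finite differences. It is an illustration only, not the computation the report found inside the transformer: the function name `finite_difference_derivative`, the step size, and the test equation dx/dt = -2x are all assumptions chosen for the example.

```python
import math


def finite_difference_derivative(values, dt):
    """Estimate dx/dt for uniformly sampled values (hypothetical helper).

    Interior points use a central difference; the two endpoints fall back
    to one-sided (forward/backward) differences.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    derivs = [0.0] * n
    derivs[0] = (values[1] - values[0]) / dt        # forward difference
    derivs[-1] = (values[-1] - values[-2]) / dt     # backward difference
    for i in range(1, n - 1):
        derivs[i] = (values[i + 1] - values[i - 1]) / (2.0 * dt)  # central
    return derivs


# Illustrative data: samples of x(t) = exp(-2t), which solves dx/dt = -2x.
dt = 0.01
times = [i * dt for i in range(101)]
series = [math.exp(-2.0 * t) for t in times]

estimated = finite_difference_derivative(series, dt)
exact = [-2.0 * x for x in series]

# Compare the estimate with the exact derivative away from the endpoints.
max_interior_error = max(
    abs(e - a) for e, a in zip(estimated[1:-1], exact[1:-1])
)
print(f"max interior error: {max_interior_error:.2e}")
```

A model that predicts the next value of such a series, or infers its governing equation, plausibly needs some estimate of this kind, so a signal resembling differences between neighbouring inputs is a natural thing to look for.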

https://www.lesswrong.com/posts/qdxNsbY5kYNqcgzFb/mechanistic-interpretability-via-learning-differential
