~www_lesswrong_com | Bookmarks (715)

Dance Weekend Pay II — LessWrong

lesswrong.com

Published on February 28, 2025 3:10 PM GMT The world would be better with a lot more transparency about pay, but we have a combination of taboos and incentives where it usually stays secret. Several years ago I shared the range of what dance weekends ended up paying me, and it's been long enough to do it again. This is all my dance weekend...
1
Existentialists and Trolleys — LessWrong

lesswrong.com

Published on February 28, 2025 2:01 PM GMT How might an existentialist approach this notorious thought experiment of ethical philosophy? “Not only do we assert that the existentialist doctrine permits the elaboration of an ethics, but it even appears to us as the only philosophy in which an ethics has its place.” ―Simone de Beauvoir, Ethics of Ambiguity “I started to know how it feels when the...
1
On Emergent Misalignment — LessWrong

lesswrong.com

Published on February 28, 2025 1:10 PM GMT One hell of a paper dropped this week. It turns out that if you fine-tune models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct, to write insecure code, this also results in a wide range of other similarly undesirable behaviors. They more or less grow a mustache and become their evil twin. More precisely, they become antinormative. They do what seems...
1
Do safety-relevant LLM steering vectors optimized on a single example generalize? — LessWrong

lesswrong.com

Published on February 28, 2025 12:01 PM GMT This is a linkpost for our recent paper on one-shot LLM steering vectors. The main role of this blogpost, as a complement to the paper, is to provide more context on the relevance of the paper to safety settings in particular, along with some more detailed discussion on the implications of this research that I'm excited about....
1
Cycles (a short story by Claude 3.7 and me) — LessWrong

lesswrong.com

Published on February 28, 2025 7:04 AM GMT Content warning: this story is AI generated slop. The kitchen hummed with automated precision as breakfast prepared itself. Sarah watched the robotic arms crack eggs into a bowl while the coffee brewed to perfect temperature. Through the window, she could see the agricultural drones tending the family's private farm, harvesting strawberries for the morning meal. "Good morning," Michael said,...
1
January-February 2025 Progress in Guaranteed Safe AI — LessWrong

lesswrong.com

Published on February 28, 2025 3:10 AM GMT Ok this one got too big, I’m done grouping two months together after this. BAIF wants to do user interviews to prospect formal verification acceleration projects, reach out if you’re shipping proofs but have pain points! This edition has a lot of my takes, so I should warn you that GSAI is a pretty diverse field and I would...
1
Weirdness Points — LessWrong

lesswrong.com

Published on February 28, 2025 2:23 AM GMT Vegans are often disliked. That's what I read online and I believe there is an element of truth to the claim. However, I eat a largely[1] vegan diet and I have never received any dislike IRL for my dietary preferences whatsoever. To the contrary, people often happily bend over backwards to accommodate my quirky dietary preferences—even...
1
[New Jersey] HPMOR 10 Year Anniversary Party 🎉 — LessWrong

lesswrong.com

Published on February 27, 2025 10:30 PM GMT It's been 10 years since the final chapter of HPMOR and it's time to look back and celebrate the magic. In the spirit of helping me avoid a shlep to NYC or Philadelphia, I invite anyone and everyone to the Princeton HPMOR 10 Year Anniversary Party! The event will be 6PM at the Prince Tea House in Princeton NJ....
1
OpenAI releases GPT-4.5 — LessWrong

lesswrong.com

Published on February 27, 2025 9:40 PM GMT This is not o3; it is what they'd internally called Orion, a larger non-reasoning model. They say this is their last fully non-reasoning model, but that research on both types will continue. They say it's currently limited to Pro users, but the model hasn't yet shown up on the chooser (edit: it is available in the app). They say...
1
The non-tribal tribes — LessWrong

lesswrong.com

Published on February 26, 2025 5:22 PM GMT Author note: This is basically an Intro to the Grey Tribe for normies, and most people here are already very familiar with a lot of the info herein. I wasn't completely sure I should post it here, and I don't expect it to get much traction, but I'll share it in case anyone's curious. Introduction: This post is about...
1
Fuzzing LLMs sometimes makes them reveal their secrets — LessWrong

lesswrong.com

Published on February 26, 2025 4:48 PM GMT Scheming AIs may have secrets that are salient to them, such as: What their misaligned goal is; What their takeover plan is and what coordination signals they use to collude with other AIs (if they have one); What good behavior looks like on a task they sandbag. Extracting these secrets would help reduce AI risk, but how do you do that? One...
1
You can just wear a suit — LessWrong

lesswrong.com

Published on February 26, 2025 2:57 PM GMT I like stories where characters wear suits. Since I like suits so much, I realized that I should just wear one. The result has been overwhelmingly positive. Everyone loves it: friends, strangers, dance partners, bartenders. It makes them feel like they're in a Kingsman film. Even teenage delinquents and homeless beggars love it. The only group that gives me...
1
Minor interpretability exploration #1: Grokking of modular addition, subtraction, multiplication, for different activation functions — LessWrong

lesswrong.com

Published on February 26, 2025 11:35 AM GMT Epistemic status: small exploration without previous predictions, results low-stakes and likely correct. Introduction: As a personal exercise for building research taste and experience in the domain of AI safety and specifically interpretability, I have done four minor projects, all building upon code previously written. They were done without previously formulated hypotheses or expectations, but merely to check for anything...
1
Optimizing Feedback to Learn Faster — LessWrong

lesswrong.com

Published on February 26, 2025 2:24 PM GMT (This post is to a significant extent just a rewrite of this excellent comment from niplav. It is one of the highest-leverage insights I know for learning faster.) Theory: To a large extent we learn by updating on feedback. You might e.g. get positive feedback from having an insight that lets you solve a math problem, which then reinforces...
1
[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations — LessWrong

lesswrong.com

Published on February 26, 2025 12:50 PM GMT We just published a paper aimed at discovering “computational sparsity”, rather than just sparsity in the representations. In it, we propose a new architecture, Jacobian sparse autoencoders (JSAEs), which induces sparsity in both computations and representations. CLICK HERE TO READ THE FULL PAPER. In this post, I’ll give a brief summary of the paper and some of my thoughts...
1
outlining is a historically recent underutilized gift to family — LessWrong

lesswrong.com

Published on February 26, 2025 1:58 PM GMT outlining is specialized work which reduces a text to complete summary statements and collapsed detail. an outline containing a work sprint. note the collapsed points in the 'old sprints' which hide all the old sprint detail. outlining is historically recent, since particular digital interfaces (such as Workflowy, Org Mode, Dynalist or Ravel) make it orders of magnitude easier to...
1
Osaka — LessWrong

lesswrong.com

Published on February 26, 2025 1:50 PM GMT The more I learn about urban planning, the more I realize that the American city I live in is dystopic. I'm referring specifically to urban planning, and I'm not being hyperbolic. Have you ever watched the teen dystopia movie Divergent? The whole city is perfectly walkable (or parkourable, if you're Dauntless). I don't know if it even...
1
Time to Welcome Claude 3.7 — LessWrong

lesswrong.com

Published on February 26, 2025 1:00 PM GMT Anthropic has reemerged from stealth and offers us Claude 3.7. Given this is named Claude 3.7, an excellent choice, from now on this blog will refer to what they officially call Claude Sonnet 3.5 (new) as Sonnet 3.6. Claude 3.7 is a combination of an upgrade to the underlying Claude model, and the move to a hybrid...
1
Name for Standard AI Caveat? — LessWrong

lesswrong.com

Published on February 26, 2025 7:07 AM GMT I have discussions that ignore the future disruptive effects of AI all the time. The national debt is a real problem. Social security will collapse. The environment is deteriorating. You haven't saved enough for pension. What is my two year old going to do when she is twenty? Could Israel make peace with the Palestinians next generation? And...
1
Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? — LessWrong

lesswrong.com

Published on February 24, 2025 6:31 PM GMT A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues that the current push towards building generalist AI agents presents catastrophic risks, creating a need for more caution and an alternative approach. We propose such an approach in the form of Scientist AI, a non-agentic AI system that aims to be...
1
Understanding Agent Preferences — LessWrong

lesswrong.com

Published on February 24, 2025 5:46 PM GMT epistemic status: clearing my own confusion. I'm going to discuss what we mean by preferences of an intelligent agent and try to make things clearer for myself (and hopefully others). I will also argue that the VNM theorem has limited applicability. What are preferences? When reasoning about an agent's behavior, preferences are a useful abstraction. Preferences encode epistemic information about an...
1
What We Can Do to Prevent Extinction by AI — LessWrong

lesswrong.com

Published on February 24, 2025 5:15 PM GMT
1
Dream, Truth, & Good — LessWrong

lesswrong.com

Published on February 24, 2025 4:59 PM GMT One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily merges the following "layers": The "dream machine" layer: LLMs are pre-trained on lots of slop from the internet, which creates an excellent "prior". The "truth machine": LLMs are trained to "reduce hallucinations" in a variety of ways,...
1
Forecasting Frontier Language Model Agent Capabilities — LessWrong

lesswrong.com

Published on February 24, 2025 4:51 PM GMT This work was done as part of the MATS Program - Summer 2024 Cohort. Paper: link. Website (with interactive version of Figure 1): link. Executive summary. Figure 1: Low-Elicitation and High-Elicitation forecasts for LM agent performance on SWE-Bench, Cybench, and RE-Bench. Elicitation level refers to performance improvements from optimizing agent scaffolds, tools, and prompts to achieve better results. Forecasts are generated by predicting...
1
