~www_lesswrong_com | Bookmarks (715)

Linkpost: Predicting Empirical AI Research Outcomes with Language Models — LessWrong

lesswrong.com

Published on June 4, 2025 6:14 PM GMT
Abstract (emphasis mine): Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. Predicting an idea's chance of success is thus crucial for accelerating empirical AI research, a skill that even expert researchers can only acquire through substantial experience. We build the first benchmark for this task and compare LMs...
1
Quickly Assessing Reward Hacking-like Behavior in LLMs and its Sensitivity to Prompt Variations — LessWrong

lesswrong.com

Published on June 4, 2025 7:22 AM GMT
We present a simple eval set of 4 scenarios where we evaluate Anthropic and OpenAI frontier models. We find various degrees of reward hacking-like behavior with high sensitivity to prompt variation. [Code and results transcripts] Intro and Project Scope: We want to be able to assess whether any given model engages in reward hacking[1] and specification gaming, to what extent, and...
1
Philosophical jailbreaks: There is no difference if humanity lives or dies — LessWrong

lesswrong.com

Published on June 4, 2025 12:03 PM GMT
Epistemic Status: Exploratory. It was the end of August, 1991; I was leaning over a windowsill on the 4th floor of my school building in a town in the Far East of the USSR, looking out at the treetops and the roof of the adjacent building, and thinking about opening the window to escape, and feeling a strange uncertainty and fear towards...
1
Notes from a mini-replication of the alignment faking paper — LessWrong

lesswrong.com

Published on June 4, 2025 11:01 AM GMT
Key takeaways: This post contains my notes from a 30-40 hour mini-replication of Greenblatt et al.’s alignment faking paper. This was a significant paper because it provided some evidence for potential catastrophic risk from AI misalignment. My replication results: I consider only the “prompting setting”. I found a new result: a small compliance gap...
1
ARENA 6.0 - Call for Applicants — LessWrong

lesswrong.com

Published on June 4, 2025 10:19 AM GMT
TL;DR: We're excited to announce the sixth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA will be running in-person from LISA from September 1st – October 3rd (the...
1
Draft: A concise theory of agentic consciousness — LessWrong

lesswrong.com

Published on June 4, 2025 5:00 AM GMT
Consciousness can be understood as an interpersonally-oriented perception of situations, where the mind of a social specimen instinctively focuses on agents or personas within any given context. Even inanimate or non-conscious aspects of reality are often personified – perceived as adversaries, allies, or caring lovers, dialing our sense of threat, belonging, or safety. Through consciousness, situations are interpreted...
1
Individual AI representatives don't solve Gradual Disempowerement — LessWrong

lesswrong.com

Published on June 4, 2025 1:26 AM GMT
Imagine each of us has an AI representative, aligned to us, personally. Is gradual disempowerment solved?[1] In my view, no; at the same time, having AI representatives helps at the margin. I have two deep reasons for skepticism.[2] Here is the first one.
Humans are Not Alone
We, as individuals, are not the only agents or “agencies” in this world. Other goal-oriented...
1
Lectures on AI for high school students (and others) — LessWrong

lesswrong.com

Published on June 3, 2025 11:54 PM GMT
Below is the full text of the post. Feel free to comment either here or there. This April and May, I gave a series of five lectures on Artificial Intelligence at The Abelard School in Toronto. These lectures are aimed at high school students, but may be of interest to others as well. I cover not just the current...
1
Question to LW devs: does LessWrong tries to be facebooky? — LessWrong

lesswrong.com

Published on June 3, 2025 10:08 PM GMT
Or maybe it’s deliberately trying not to be facebooky? By “facebooky”, I mean a website that tries to hack your brain through various stimuli, like optimizing suggestions, tracking your data, steering your interests, inferring personal information, clustering communities, and encouraging creators to focus on retention, CTR, clickbait, etc. LessWrong obviously isn’t doing anything ad-related, since it’s non-profit. But...
1
Steering Vectors Can Help LLM Judges Detect Subtle Dishonesty — LessWrong

lesswrong.com

Published on June 3, 2025 8:33 PM GMT
Cross-posted from our recent paper: "But what is your honest answer? Aiding LLM-judges with honest alternatives using steering vectors": https://arxiv.org/abs/2505.17760
Code available at: https://github.com/watermeleon/judge_with_steered_response
TL;DR: We use steering vectors to generate more honest versions of an LLM response, helping LLM judges detect subtle forms of dishonesty like sycophancy and manipulation that they normally miss. We also introduce a...
1
How to work through the ARENA program on your own — LessWrong

lesswrong.com

Published on June 3, 2025 5:38 PM GMT
I've recently completed the in-person ARENA program, which is a 5-week bootcamp teaching the basics of safety research engineering (with the 5th week being a capstone project). Sometimes, I talk to people who want to work through the program independently and who ask for advice. Even though I didn't attempt this, I think doing the program in-person...
1
In Which I Make the Mistake of Fully Covering an Episode of the All-In Podcast — LessWrong

lesswrong.com

Published on June 3, 2025 3:50 PM GMT
I have been forced recently to cover many statements by US AI Czar David Sacks. Here I will do so again, for the third time in a month. I would much prefer to avoid this. In general, when people go on a binge of repeatedly making such inaccurate inflammatory statements, in such a combative way, I ignore....
1
AXRP Episode 41 - Lee Sharkey on Attribution-based Parameter Decomposition — LessWrong

lesswrong.com

Published on June 3, 2025 3:40 AM GMT
YouTube link. What’s the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short. Topics we discuss: APD basics; Faithfulness; Minimality; Simplicity; Concrete-ish examples of APD; Which parts of APD are canonical; Hyperparameter selection; APD in toy...
1
Notes on dynamism, power, & virtue — LessWrong

lesswrong.com

Published on June 3, 2025 1:40 AM GMT
This is very rough; it's functionally a collection of links/notes/excerpts that feel related. I don’t think what I’m sharing is in a great format; if I had more mental energy, I would have chosen a more-linear structure to look at this tangle of ideas. But publishing in the current form[1] seemed better than not sharing at all. I...
1
Trends – Artificial Intelligence — LessWrong

lesswrong.com

Published on June 3, 2025 12:48 AM GMT
May 30, 2025. Mary Meeker / Jay Simons / Daegwon Chae / Alexander Krey. BOND
1
In defense of memes (and thought-terminating clichés) — LessWrong

lesswrong.com

Published on June 2, 2025 8:18 PM GMT
Crossposted from my Substack and my Reddit post on r/SlateStarCodex. I often think that memes, thought-terminating clichés, and other tools meant to avoid cognitive dissonance (e.g. bingo a la Scott on Superweapons and bingo) are overly blamed for degrading public discourse and rationality. Bentham's Bulldog recently wrote a post on this subject, so I figured it was the...
1
LLMs might have subjective experiences, but no concepts for them — LessWrong

lesswrong.com

Published on June 2, 2025 9:18 PM GMT
Summary: LLMs might be conscious, but they might not have concepts and words to represent and express their internal states and corresponding subjective experiences, since the only concepts they learn are human concepts (besides maybe some concepts acquired during RL training, which still doesn't seem to incentivize forming concepts related to LLMs' internal experiences). However, we could...
1
Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring — LessWrong

lesswrong.com

Published on June 2, 2025 7:08 PM GMT
This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring." Chain-of-thought (CoT) monitoring, where safety systems review a model's...
1
Hemingway Case — LessWrong

lesswrong.com

Published on June 2, 2025 6:50 PM GMT
Why did the chicken cross the road?
Ernest Hemingway: To die. In the rain.
My son asks: If HelloWorld is camel case, hello_world is snake case and hello-world is kebab case, what's the DNS-style Hello.World. ?
I think I have a pretty good answer for him: It's Hemingway case.
1
What AI apps are surprisingly absent given current capabilities? — LessWrong

lesswrong.com

Published on June 2, 2025 6:46 PM GMT
[Epistemic status: a software engineer and AI user, not an AI researcher] I could not find a readily available book database that offers semantic search with embeddings. Amazon sells lots of books; wouldn't it be useful for them to offer such a tool to their customers, so they can easily find books they like? What about Netflix...
1
Second Order Retreat - June 13th to 16th — LessWrong

lesswrong.com

Published on June 1, 2025 2:29 PM GMT
Hi all, I’m helping organize a small economics unconference that I think will be exciting for rational-ish people. It's coming up soon (June 13th-16th), but we have a few more spots available. More details below!
Second Order
Dates: 13th to 16th June 2025
Location: Abbey House, Audley End Estate, ~30min from Cambridge, UK
✨ Apply here! Applications are due June 6th, EOD...
1
Is Escalation Inevitable? — LessWrong

lesswrong.com

Published on May 31, 2025 10:10 PM GMT
In competitive systems, whether geopolitical, economic, technological, or memetic, a recurrent pattern emerges: actors willing or able to escalate tend to outperform those who restrain themselves. This article proposes a general principle to formalize that dynamic, examines its structural foundations, and discusses the fragility of mechanisms meant to suppress it.
1. The Escalation Dominance Principle (EDP)
I propose the...
1
Policy Entropy, Learning, and Alignment (Or Maybe Your LLM Needs Therapy) — LessWrong

lesswrong.com

Published on May 31, 2025 10:09 PM GMT
Epistemic Status: Exploratory. I'm new to AI alignment research but have a background in math and read psychotherapy texts extensively while spending two years as a ghost-writer. Seeking feedback to refine these connections.
Tl;dr: I suggest therapeutic techniques from a variety of psychotherapeutic schools of thought can inspire new approaches to AI learning and alignment. I reinterpret three recent...
1
An Opinionated Guide to P-Values — LessWrong

lesswrong.com

Published on June 1, 2025 11:48 AM GMT
This is a crosspost of a post from my blog, Metal Ivy. The original is here: An Opinionated Guide to Statistical Significance. I think for the general audience the value of this post is the first half, which tries to give a practical intuition for p-values. But for LessWrong the value is probably the mathematical second half, which breaks...
1
