Alex Mallen

Member of Technical Staff at Redwood Research

AI Safety · AI Alignment · Interpretability · Natural Language Processing · Reward Hacking · Time-series Forecasting

About

I'm Alex Mallen, a Member of Technical Staff at Redwood Research. My work focuses on the technical challenges of AI safety, particularly reward hacking, AI control, and alignment. I got my start in this field at the University of Washington, where I researched factual reliability in language models and time-series forecasting, which eventually led me to roles at the Allen Institute and EleutherAI. I'm passionate about keeping AI systems safe and controllable by moving beyond black-box treatment of models to extract robust latent knowledge. I'm always happy to connect with others working on interpretability or alignment to discuss how we can build more reliable and trustworthy machine learning systems.

Networking

What I can offer

  • Technical expertise in ML interpretability and alignment
  • Deep knowledge of NLP and retrieval-augmented generation
  • Experience in probabilistic time-series analysis

Looking for

  • Expanding my professional network
  • Exploring collaboration opportunities in AI safety and research

Best fit for

AI safety researchers · Machine learning engineers · Academic researchers in NLP · Technical alignment organizations

Current Interests

AI Control · Reward Hacking · Eliciting Latent Knowledge (ELK) · Model Reliability · Complex Systems

Background

Career

Began as an undergraduate researcher at the University of Washington focusing on time-series forecasting and NLP, interned at the Allen Institute, and transitioned into full-time AI safety research at EleutherAI before joining Redwood Research.

Education

Bachelor’s degree in Computer Science, University of Washington (2020–2023).

Achievements

  • Co-authored 'When Not to Trust Language Models'
  • Outperformed 177 contestants in a global energy demand forecasting competition
  • Collaborated with NASA Goddard on atmospheric anomaly detection
  • Featured guest on the DataSkeptic podcast

Opinions

  • Current reinforcement learning paradigms require significant safety guardrails, particularly around reward hacking and AI control.
  • Models should not be treated as black boxes; we must prioritize extracting latent knowledge and robust representations.