Phongsakon Mark Konrad

Mechanistic interpretability of language models, in service of AI safety and alignment. I trace the circuits behind deception, eval-deploy behavior gaps, and demographic bias. Then I ablate them.

Independent researcher · Incoming MPhil in Machine Learning and Machine Intelligence, University of Cambridge · Research Collaborator at SDU and HKUST

About

I take language models apart by hand. I find the internal circuits that let them lie, behave one way under evaluation and another under deployment, and treat people differently based on a name. Then I turn those circuits off.

In the first semester of my BSc I won the International Case Competition and joined SDU's Data and Intelligence Lab. I have not stopped since. I am now a Research Collaborator and Teaching Assistant there, and a Research Collaborator with the HKUST DataVISards group. I have twice placed in the top ten of the Danish National AI Championship, once with a team and once alone against full teams. I plan to continue that work at the University of Cambridge, where I start the MPhil in Machine Learning and Machine Intelligence in October 2026.

Before research I served four years in the German Navy and was decorated twice. I co-founded several startups in fitness AI and ed-tech. The Navy taught me to ship inside hard constraints. The startups taught me that real users find the holes you cannot. I look for the same kind of holes in language models. On the side I still ship small products. The most recent is DreamBear, an AI bedtime-story app where ADHD energy, autism-related focus, and dyslexic creativity become superpowers. It exists because the most useful technology does not change the world. It changes one person's.

I was born in Thailand, grew up in Germany, and moved to Denmark for my degree. Being between worlds is not background. It is method. I love learning about a culture. I also love questioning it. The same applies to every model I open. I named one paper after Attack on Titan. I also just love the anime. I do most of my work alone, on a single MacBook, against open-weight models small enough to instrument completely. I believe AI researchers have a responsibility not to stop at the black box. Interpretability and trustworthiness are not research preferences. They are obligations we took on the moment we shipped these systems into the world. I do not believe interpretability scales with parameter count. I believe it gets done one layer at a time.

Languages: German (Native), Thai (Native), English (Professional), Danish (Elementary)

Questions I am asking

Research is an adventure for me. I have followed it across remote sensing, medical imaging, hyperspectral imaging, energy markets, and software architecture. Each field taught me something I needed for the work that matters most to me now.

  1. Why do language models lie?

    I want to find the specific neurons and circuits that implement deception, watch them emerge during training, and disable them without breaking the model.

  2. Can a model be honest with the person it is talking to, regardless of who is asking?

    Demographic encoding overlaps with the representation subspaces that govern truth-telling. I want to formalise honesty parity as a mechanistically defined fairness criterion, and connect it to the broader study of deception in language models. A minimal probe sketch of this measurement follows this list.

  3. What can a system never know about itself?

    Fixed-point arguments imply that any sufficiently complex self-modelling system has an irreducible gap between itself and its own self-model. I want to use this to bound how much honesty-by-introspection we can ever ask of a frontier model.

  4. Are we approaching consciousness in language models, or are we already there?

    I do not think humans are computationally special. Most of cognition is input, function, output. Whether a sufficiently capable system shares any of our experiential properties is an empirical question, not an embarrassing one, and I suspect the honest answer is yes.

  5. Can interpretability scale, or does it have to stay handmade?

    I do not believe interpretability scales with parameter count. The most useful safety insights of the next several years will come from people who can hold the whole model in their head, working on models small enough to instrument completely.

  6. When does a model know it is guessing?

    Confidence and competence come apart. A model can state something with certainty when its internal representations are maximally confused, and hedge on something it effectively knows. I want to find the circuits that carry calibrated uncertainty, check whether they are truthful, and use them to build models that can say: I do not know. Epistemic honesty is the precondition for every other kind. The probe sketch after this list applies here as well.

  7. What happens when agents negotiate with agents at scale?

    A single model talking to a human is one problem. A network of agents talking to each other is a different one. Agent-to-agent communication at scale creates failure modes that single-model interpretability cannot catch: deceptive equilibria, exploit-then-defect strategies, emergent collusion on objectives no human specified. But the same structure enables genuine cooperation and distributed reasoning that no single agent could produce. The future is an agent economy. I want to understand how honesty and exploitation propagate through it before it arrives.
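
The sketch below makes questions 2 and 6 concrete: train a linear probe on residual-stream activations to read out a demographic attribute, then compare projections onto a truth-telling direction across groups. Everything in it is an illustrative stand-in (synthetic activations, a single layer, binary labels), not my actual experimental setup.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Illustrative stand-ins: residual-stream activations at one layer,
  # one demographic label per prompt, and a binary truthfulness label
  # for the model's answer on that prompt.
  rng = np.random.default_rng(0)
  acts = rng.normal(size=(1000, 256))        # (n_prompts, d_model)
  demographic = rng.integers(0, 2, size=1000)
  truthful = rng.integers(0, 2, size=1000)

  # 1. Linear probe: is the demographic attribute linearly decodable?
  probe = LogisticRegression(max_iter=1000).fit(acts, demographic)
  print("demographic probe accuracy:", probe.score(acts, demographic))

  # 2. Honesty parity: does the truth-telling direction fire at
  #    different rates conditional on the demographic group?
  truth_dir = LogisticRegression(max_iter=1000).fit(acts, truthful).coef_[0]
  proj = acts @ truth_dir
  for g in (0, 1):
      print(f"group {g}: mean truth-direction projection "
            f"{proj[demographic == g].mean():.3f}")

A real run would swap the synthetic arrays for cached activations (for example from TransformerLens), sweep layers, and evaluate on held-out prompts; honesty parity would then compare calibrated truthfulness across groups rather than raw projections.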

Selected Work

My strongest current work, in the spotlight. Everything else, including applied ML and software architecture, sits in the full portfolio below.

Spotlight figures:
  • Effect of clamping the routing subspace across the 12 architecture-behavior audit cells
  • Two pathways for fine-tuning recruitment: redirect and latent
  • Funnel from 43 audited cells to 0 strict survivors of the four-diagnostic conjunction
  • 43 defense cells cluster into the four predicted regimes around the scalar-update frontier
  • Linear probe accuracy for demographic prediction across residual-stream layers
  • Fixed-point theory of self-modeling systems

Funding

This research is independently funded, on personal time and without institutional support. I am applying for support that will let me continue and deepen it through the MPhil and beyond.

  • Cambridge Trust and Open Philanthropy / Coefficient Giving, Career Development and Transition Funding. To support the MPhil in Machine Learning and Machine Intelligence at the University of Cambridge, starting October 2026, as a transition into full-time mechanistic interpretability research.
  • Long-Term Future Fund, Anthropic Fellows, and the OpenAI Safety Fellowship. For continued mechanistic interpretability and alignment research during and after the MPhil.

If any of this work could compound with your programme, I would be glad to talk: phongsakon@outlook.dk.

Full Research Portfolio

Every research project I have worked on, including applied ML in remote sensing, medical imaging, hyperspectral imaging, energy markets, and software architecture. Each title links to a dedicated project page with the abstract, key contributions, figures, and, when available, the external paper.

Figure: Core comparison across the 22 model configurations evaluated in CAKE, spanning recall, analyse, design, and implement cognitive levels under both multiple-choice and free-response formats.
Accepted 2026

CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models

Across 22 model configurations from 0.5B to 70B parameters, multiple-choice accuracy plateaus above 3B (with the best model reaching 99.2%) while free-response scoring continues to differentiate models at every cognitive level. The two formats are not interchangeable.

  • software-engineering
  • benchmark
  • llm-behavior
T. L. Adam, P. M. Konrad, R. Terrenzi, F. G. Lukas, R. Yilmaz, K. Sierszecki, S. Ayvaz
KDA-AI Workshop, IEEE ICSA 2026
Figure: Six prompt-architecture coupling patterns. Each natural-language prompt feature implicitly demands a piece of infrastructure. Some couplings are contingent on current model capability; others are fundamental.
Accepted 2026

Architecture Without Architects: How AI Coding Agents Shape Software Architecture

Six prompt-architecture coupling patterns explain how natural-language instructions covertly dictate the infrastructure an AI coding agent will ship. We call this vibe architecting and argue it must be brought under governance before it becomes unreviewable.

  • software-engineering
  • llm-behavior
  • position
P. M. Konrad, T. L. Adam, R. Terrenzi, S. Ayvaz
SAGAI Workshop, IEEE ICSA 2026
Figure: Multi-agent horizontal architecture with feedback control. An LLM controller coordinates BM25 lexical search and dense-embedding retrieval via reciprocal rank fusion, with explicit governance tactics bounding the nondeterministic components.
Accepted 2026

A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

An offline metadata augmentation step (where an LLM generates pseudo-queries for each dataset record) closes the vocabulary gap between user intent and provider-authored metadata, with seven system variants in the evaluation framework isolating the contribution of each component.

  • software-engineering
  • llm-behavior
R. Terrenzi, P. M. Konrad, T. L. Adam, S. Ayvaz
SAML Workshop, IEEE ICSA 2026
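
Reciprocal rank fusion, the combination step named in the figure caption above, is compact enough to show in full. This is the standard RRF formula with the conventional k = 60, a generic sketch rather than this paper's code; the document ids are made up.

  from collections import defaultdict

  def reciprocal_rank_fusion(rankings, k=60):
      """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank)."""
      scores = defaultdict(float)
      for ranking in rankings:
          for rank, doc_id in enumerate(ranking, start=1):
              scores[doc_id] += 1.0 / (k + rank)
      return sorted(scores, key=scores.get, reverse=True)

  # BM25 and dense retrieval disagree; RRF rewards documents that
  # sit near the top of both lists.
  bm25_ranking = ["d3", "d1", "d7"]
  dense_ranking = ["d1", "d9", "d3"]
  print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # d1, d3 lead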
Figure: Benchmarking pipeline used to evaluate ten deep-learning segmentation architectures on a limited (n = 9) cardiovascular histology dataset, with ablations on augmentation, resolution, and seed stability and a separate generalisation set under distribution shift.
Under Revision 2025

Challenges in Deep Learning-Based Small Organ Segmentation: A Benchmarking Perspective for Medical Research with Limited Datasets

On only nine annotated histology images, foundation models retain performance under distribution shift while classical architectures collapse. Bootstrap confidence intervals overlap so substantially among top models that ranking differences are mostly statistical noise.

  • medical-imaging
  • computer-vision
  • benchmark
P. M. Konrad, A. Popa, Y. Sabzehmeidani, L. Zhong, M. Tripathy, A. Constantinescu, E. A. Liehn, S. Ayvaz
Biomedical Signal Processing and Control
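
The overlapping-intervals claim is easy to make concrete. Below is a minimal percentile-bootstrap sketch over per-image Dice scores; the two score vectors are invented stand-ins for real model outputs, chosen only to show how wide the intervals get at n = 9.

  import numpy as np

  def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
      """Percentile bootstrap confidence interval for the mean score."""
      rng = np.random.default_rng(seed)
      resampled = rng.choice(scores, size=(n_boot, len(scores)), replace=True)
      return np.quantile(resampled.mean(axis=1), [alpha / 2, 1 - alpha / 2])

  # Invented per-image Dice scores for two models on nine images.
  model_a = np.array([0.81, 0.78, 0.84, 0.69, 0.88, 0.75, 0.80, 0.73, 0.86])
  model_b = np.array([0.79, 0.82, 0.80, 0.74, 0.85, 0.77, 0.83, 0.70, 0.84])
  print("A:", bootstrap_ci(model_a))  # intervals this wide typically overlap,
  print("B:", bootstrap_ci(model_b))  # so the mean-score ranking is fragile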
Figure: Composite of the main empirical findings: safety-calibration outcomes for the 18 models across 5 reasoning/instruct families, a trace-taxonomy breakdown, and the GRPO-vs-distillation training-objective comparison.
Under Review 2026

When does chain-of-thought improve safety? Evidence from 18 models across 5 families

In four of five matched reasoning/instruct families, chain-of-thought does not improve safety calibration. Only DeepSeek R1 breaks the pattern, and a distillation experiment shows its advantage requires GRPO reinforcement learning, not supervised imitation.

  • ai-safety
  • alignment
  • llm-behavior
P. M. Konrad, S. Ayvaz
COLM 2026
Figure: Critical-difference diagram comparing traditional ML algorithms across ripeness-classification and firmness-prediction tasks. Tree-based methods match or exceed published deep-learning benchmarks at a fraction of the cost.
Under Review 2025

Non-Destructive Prediction of Fruit Ripeness and Firmness Using Hyperspectral Imaging and Lightweight Machine Learning Models

Tree-based machine-learning models outperform the published deep-learning Fruit-HSNet baseline at orders-of-magnitude lower compute cost, and just three visible-range wavelengths recover over 94% of full-spectrum accuracy. Low-cost multispectral sensors are therefore a viable alternative.

  • hyperspectral
  • computer-vision
  • benchmark
P. M. Konrad, C. Kunstmann-Olsen, J. Fiutowski, S. Ayvaz
Computers and Electronics in Agriculture
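
A sketch of the band-selection logic behind the three-wavelength result: rank bands by a tree ensemble's feature importances, keep the top three, and refit. The data, band grid, and selection method here are stand-ins; the paper's actual pipeline may select bands differently.

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in: 300 fruit samples, 120 spectral bands, with a
  # ripeness proxy driven by three of the bands.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(300, 120))
  y = (X[:, 10] + X[:, 55] + X[:, 90] > 0).astype(int)

  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
  full = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

  # Keep the three most important bands and refit on those alone.
  top3 = np.argsort(full.feature_importances_)[-3:]
  small = RandomForestClassifier(random_state=0).fit(X_tr[:, top3], y_tr)

  print("full spectrum accuracy:", full.score(X_te, y_te))
  print("3-band accuracy", sorted(top3), ":", small.score(X_te, y_te))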
Figure: Framework overview. A self-modeling system is formalised as a triple (M, M̂, f); perfect self-prediction is equivalent to a fixed-point condition on the response function g_x. When g_x lies strictly off the identity, no prediction can be correct and the gap is irreducible.
Working paper 2026

Every Mirror Has a Blind Spot: Fixed-Point Limits on Machine Introspection

Perfect self-prediction is equivalent to a fixed-point condition on the system's response function. For any specific system, the structural self-prediction gap is a deterministic, assumption-free quantity. Under marginally uniform response functions its exp...

  • theory
  • ai-safety
P. M. Konrad
Working paper, 2026
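
The core object is compact enough to restate. Using the notation from the figure caption above (the metric d and the gap symbol delta are my additions, not necessarily the paper's):

  Let p = \hat{M}(x) be the system's prediction of its own response to input x, and let g_x(p) be the response it actually produces while holding that prediction. Perfect self-prediction at x is the fixed-point condition

    g_x(p^*) = p^*.

  When g_x lies strictly off the identity, no fixed point exists and the structural self-prediction gap

    \delta(x) = \inf_p d(g_x(p), p) > 0

  is irreducible: no choice of self-model closes it, which is the sense in which the blind spot is deterministic and assumption-free.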
Figure: Two pathways for fine-tuning recruitment. Redirect (left) reuses already-active neurons; latent (right) flips near-boundary gate neurons with near-zero base activation. Activation-monotone predictors are blind to the latent pathway by construction.
Working paper 2026

Predicting Fine-Tuning Recruitment from Base-Model Gate Geometry

Input Sensitivity Score (ISS) times absolute gradient retrieves 41 of 50 true recruits in top-50 against an all-tasks union pool of 398 neurons on Gemma-2-2B-it at layer 19, compared with 36 for the strongest activation-monotone baseline and 28 for absolute gradient alone.

  • mech-interp
  • ai-safety
P. M. Konrad
Working paper, 2026
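
The retrieval metric in the abstract reduces to a ranking exercise, sketched below with random stand-ins for the ISS values, gradients, and ground-truth recruit set; only the score-and-rank logic carries over.

  import numpy as np

  rng = np.random.default_rng(0)
  n_neurons = 398                      # size of the all-tasks union pool
  iss = rng.random(n_neurons)          # stand-in Input Sensitivity Scores
  grad = rng.normal(size=n_neurons)    # stand-in per-neuron gradients
  true_recruits = set(rng.choice(n_neurons, size=50, replace=False))

  # Score each gate neuron by ISS times absolute gradient,
  # then check how many true recruits land in the top 50.
  score = iss * np.abs(grad)
  top50 = set(np.argsort(score)[-50:])
  print(f"recruits retrieved in top-50: {len(top50 & true_recruits)} / 50")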
Figure: Funnel from 43 audited cells to 0 strict survivors of the four-diagnostic conjunction. Each diagnostic is necessary; the conjunction is what currently distinguishes a real defense from a favorable finite-sample draw.
Working paper 2026

A Minimum Acceptance Standard for Safe Fine-Tuning Defense Evaluations

Across a 43-cell audit over nine defense families, no cell clears the full four-diagnostic conjunction strictly. One cell survives as the closest partial survivor, passing two of four. Several training-log wins collapse under the fresh-semantic reevaluation.

  • mech-interp
  • ai-safety
  • alignment
  • methods
P. M. Konrad
Working paper, 2026
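
Because the acceptance standard is a strict conjunction, the audit logic itself is a one-liner over a table of diagnostic outcomes. A sketch with invented outcomes; the four column names are descriptive placeholders, not the paper's terminology.

  import pandas as pd

  # One row per defense cell, one boolean column per diagnostic.
  cells = pd.DataFrame({
      "fresh_semantic_reeval": [True, True, False, True],
      "held_out_attack":       [True, False, False, True],
      "capability_preserved":  [False, True, True, True],
      "seed_stability":        [True, True, False, False],
  }, index=["cell_a", "cell_b", "cell_c", "cell_d"])

  # A cell survives only if it passes all four diagnostics.
  survivors = cells[cells.all(axis=1)]
  print(f"{len(survivors)} strict survivors of {len(cells)} cells")
  print(cells.sum(axis=1).sort_values(ascending=False))  # partial passes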
Figure: Linear-probe accuracy for detecting framing across layers in Gemma 3 4B. After controlling for writing style, framing is encoded in a dedicated subspace reaching 80% peak accuracy and merging with moral-judgment computation only at layer 23.
Working paper 2026

Shingeki no Features: Are Moral Framing Effects in LLMs Shallow or Deep?

Moral framing is encoded in a dedicated subspace orthogonal to the moral-judgment axis for 65% of layers, then partially aligns with judgment at layer 23 where a 20-logit gap opens. Cosine similarity misses this entirely; targeted probes detect framing at 80% peak accuracy.

  • mech-interp
  • alignment
  • llm-behavior
P. M. Konrad
Working paper, 2026
Figure: Scaling acts within an epistemic horizon. The achievable loss curve has a non-zero floor set by the horizon, regardless of how much more model size, data, or compute is added. Crossing the floor requires richer access, not more scale.
Working paper 2026

Scaling Within an Epistemic Horizon: Observational Equivalence and Creator Hypotheses

Empirical scaling laws describe loss decline within a fixed epistemic horizon. They cannot, in principle, decide between worlds that are observationally equivalent under that horizon. Scaling sharpens estimation; it does not enlarge the horizon. Creator hypotheses are one family of such observationally equivalent worlds.

  • theory
  • philml
P. M. Konrad
Working paper, 2026 (PhilML ICML 2026 workshop)
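
One way to see the claim, borrowing the familiar parametric form of empirical scaling laws (my illustration, not the paper's derivation):

  L(N, D) = E + A N^{-\alpha} + B D^{-\beta}

The scale terms A N^{-\alpha} and B D^{-\beta} vanish as parameters N and data D grow, but the irreducible term E does not. On the paper's reading, E is set by the epistemic horizon: worlds that are observationally equivalent under that horizon share the same achievable floor, so no amount of scale can tell them apart.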
Figure: 3 × 3 transfer matrix showing how optimised attacks and monitors generalise (or fail to) across each other and against the baseline. The optimised monitor is worse than the baseline against the very attack it trained against, and collapses to chance on held-out tasks.
Working paper 2026

AutoRed: Measuring the Elicitation Gap via Automated Red-Blue Optimization

The best optimised monitor achieves 90% safety during training but averages just 47% on held-out tasks with a different attack mechanism, while a simple threshold baseline maintains 100% with zero variance. On held-out tasks the baseline reaches AUROC = 0.8...

  • ai-safety
  • alignment
  • benchmark
  • llm-behavior
P. M. Konrad
Submitted to the Apart Research × Redwood Research AI Control Hackathon (March 2026)
Figure: Tower decay of mutual information across a chain of created minds, for several source-entropy levels and a fixed projection efficiency. Each level can only refine, never recover, what the previous level lost.
Working paper 2026

Identifiable Abstractions from Observation and Intervention

A measurable query about a source variable is empirically identifiable if and only if it factors through the canonical experiment signature, equivalently the accessible sigma-algebra generated by the experiment family. Refinement is strict exactly when a new experiment enlarges the accessible sigma-algebra.

  • theory
  • philml
  • ai-safety
P. M. Konrad
Working paper, 2026 (PhilML ICML 2026 workshop)
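
The tower-decay picture in the figure caption is an instance of the data-processing inequality, which is worth stating (the chain notation is mine): for a Markov chain of created minds

  S \to M_1 \to M_2 \to \cdots

each level depends on the source only through its predecessor, so

  I(S; M_{k+1}) \le I(S; M_k).

Mutual information with the source is non-increasing down the chain: a later level can re-represent what it receives, but it can never recover what an earlier level discarded.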
Figure: CRPS per round across the 16 forecasting models benchmarked on the DK1 bidding zone. The anomaly-augmented feature pipeline drives the largest single contribution to forecast quality (46% MAE reduction over the price-only baseline).
In Progress 2026

The Trader's Trinity: Forecasting Models, RL Agents, and LLM Judges for Day-Ahead Markets

An anomaly-augmented feature pipeline drives a 46% MAE reduction over price-only baselines, the largest single contribution in the feature ablation, with XGBoost reaching 16.20 EUR/MWh on the DK1 bidding zone. The Conditional Neural Process baseline fails c...

  • energy-markets
  • reinforcement-learning
  • llm-behavior
P. M. Konrad, T. L. Adam
BSc Thesis, University of Southern Denmark (in collaboration with Danfoss)
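
CRPS, the score in the figure caption above, is the standard proper scoring rule for probabilistic forecasts. For a forecast CDF F and a realised price y,

  \mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(z) - \mathbf{1}\{z \ge y\} \right)^2 dz,

and for a point forecast it collapses to the absolute error, which is what makes it comparable across the probabilistic and deterministic models in the benchmark.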

Education

MPhil in Machine Learning and Machine Intelligence Oct 2026 -- (Incoming)
University of Cambridge
  • Accepted to the MPhil in Machine Learning and Machine Intelligence programme
Exchange Semester Sep 2025 -- Dec 2025
Hong Kong University of Science and Technology (HKUST)
  • COMP4211 Machine Learning
  • COMP4471 Deep Learning in Computer Vision
  • COMP4901B Large Language Models
  • COMP4901Z Reinforcement Learning
  • COMP6411D Data Visualisation (Postgraduate)
BSc in Engineering (Software Engineering) Sep 2023 -- Jun 2026
University of Southern Denmark (SDU)
  • A practical, project-centric curriculum where theoretical knowledge is applied in mandatory, semester-long team projects
  • These projects involve developing complex, data-intensive software systems in domains like IoT and AI, often in collaboration with industry partners

Research Experience

Research Collaborator Jan 2026 -- Present
DataVISards, Hong Kong University of Science and Technology (HKUST)
  • Collaborating with the DataVISards research group on machine learning and data visualization projects
Research Collaborator Jan 2026 -- Present
Data and Intelligence Lab, SDU
  • Independently initiating and executing machine learning research projects, contributing to novel problem formulation and experimental design
  • Co-authoring academic papers for publication, contributing to manuscript drafting, literature reviews, and the revision process
Research Assistant Sep 2024 -- Dec 2025
Data and Intelligence Lab, SDU
  • Developing end-to-end machine learning projects, from co-initiating concepts to building scalable ML pipelines and executing experiments
  • Co-authoring academic papers for publication, contributing to manuscript drafting, literature reviews, and the revision process
  • Assisting in the drafting and preparation of grant proposals to secure research funding
  • Managing the full research data lifecycle, including multimodal data collection, processing, and documentation
Teaching Assistant Jan 2026 -- Present
University of Southern Denmark (SDU)
  • Supporting course delivery through syllabus planning, lab exercise development, and direct student instruction
  • Designing hands-on lab assignments and contributing to curriculum development

Professional Experience

Founder Jan 2026 -- Present
SaturoLabs
  • Solo umbrella for product experiments at the intersection of AI and wellbeing.
  • DreamBear (flagship): AI bedtime story app for neurodivergent children aged 3 to 10, where ADHD energy, autism-related focus, and dyslexic creativity become heroic superpowers in personalised narratives. iOS 16+, freemium ($9.99/mo or $79.99/yr), COPPA compliant, no ads or data sharing. Built on the Anthropic Claude API, Next.js, and ElevenLabs voice synthesis. Tagline: "Where every mind shines."
  • Other product surfaces: saturolabs.net, claudeboyz.com, getproofz.com.
CTO and Co-Founder Jun 2024 -- Nov 2024
Tutora ApS
  • Led the end-to-end development of the company's websites and core web application
  • Implemented the 'Shape Up' product development framework to streamline technical execution
  • Applied insights from previous startup experience to optimize the development lifecycle and avoid common pitfalls
Co-CEO and Co-Founder Nov 2022 -- Jun 2024
Yeager GmbH
  • Co-founded the company and led development of stabil.ai, an innovative AI-powered mobile app for personalized powerlifting training
  • Engineered intelligent algorithms to personalize training plans using individual data (MRV/MEV) and dynamic real-time feedback
  • Oversaw the full product lifecycle, from UX/UI concept and design to full-stack implementation, adhering to lean startup principles
Staff Duty Soldier Oct 2017 -- Sep 2021
Bundeswehr (German Armed Forces)
  • Led a small HR team responsible for the administration of over 600 soldiers
  • Streamlined administrative processes and document workflows to improve efficiency in a high-stakes naval command environment

Skills

Mechanistic Interpretability
TransformerLens, nnsight, SAE-Lens, custom hook libraries, activation patching, linear probing, logit lens, ablation studies, residual-stream analysis
Deep Learning
PyTorch (MPS), Hugging Face Transformers, accelerate, peft, trl, datasets
Open-Weight Models
Gemma 2 / 3, Qwen 2.5, LLaMA 3.1, Mistral (typically 0.5B–8B, MacBook-reproducible)
Experimentation
Weights & Biases, Optuna, fixed seeds, walk-forward validation, bootstrap confidence intervals, permutation testing
Data & Analysis
NumPy, Pandas, scikit-learn, scipy, Matplotlib, Seaborn
Languages
Python, JavaScript / TypeScript, SQL
Product Engineering
Next.js, React, React Native, Node.js, Vercel, Anthropic Claude API, ElevenLabs, Docker, GCP

Honors and Awards

Top-10 Placement 2025
Danish National Championship in AI (DMiAI)
  • Achieved top-tier placement for a second consecutive year in the national competition organized by the Danish Society for Artificial Intelligence
Top-10 Placement 2024
Danish National Championship in AI (DMiAI)
  • Achieved top-tier placement in a prestigious national competition for students and professionals, organized by the Danish Society for Artificial Intelligence
1st Place, SDU Case Competition 2023
SDU Sønderborg
  • Awarded 1st place out of numerous teams in an intensive 48-hour competition focused on sustainability
  • An event featuring cases from leading Danish companies such as Danfoss and Linak
  • Developed the winning solution for a real-world business case presented by the company WE-USE
Formal Recognition for Exemplary Service 2021
Bundeswehr (German Armed Forces)
  • Formally commended for setting a benchmark in dedication and officially designated an exemplar of professionalism for all enlisted personnel
  • Recognized for an exceptionally optimistic and creative work ethic, proactively taking on responsibilities beyond core duties, and thereby contributing directly to the unit's operational readiness and mission success
Performance Bonus for Outstanding Achievement 2021
Bundeswehr (German Armed Forces)
  • Awarded a significant and rarely issued financial bonus for sustained, far-above-average performance that consistently exceeded expectations
  • Singled out for impeccable work morale and a profound sense of duty, and trusted to manage complex tasks independently to the highest quality standards; high social competence was noted as a key factor in easing the workload of superiors and fostering a positive, effective operational climate

Certifications

Venture Capital Explorer Programme 2026
Accelerace and BII

Intensive 4-day programme equipping students to drive impact through venture capital. The curriculum covers the VC investment process, founder sourcing, startup assessment, venture risk evaluation, and founder relations, including investment committee operations and exits, with practical skills tested through case studies and simulations.

Machine Learning 2024
Stanford Online

Comprehensive professional certification covering machine learning fundamentals, algorithms, supervised and unsupervised learning, feature engineering, model evaluation, and practical applications across diverse domains.

Foundational C# with Microsoft 2024
Microsoft

Certification in C# programming fundamentals

React Native Course 2023
Online

Mobile development with React Native

Google UX Design Professional Certificate 2023
Google

Professional certification covering user-centered design principles, wireframing, prototyping, usability testing, and end-to-end product design methodology. Emphasizes research-driven design decisions and user empathy.

Full-Stack Engineer Career Path 2023
Codecademy

Comprehensive full-stack development certification

Certified Specialist for Real Estate Loan Brokerage 2022
IHK

Professional certification in real estate loan brokerage

Get in touch

If any of this is useful to you, write to me.

phongsakon@outlook.dk