AI Red-Teaming · Evaluation · Failure-Mode Analysis

Krystal Jazmin Martinez

Cognitive neuroscientist by academic training and measurement-focused methodologist. I red-team and evaluate AI systems in sensitive, high-stakes domains, surfacing failure modes, edge cases, and policy gaps in conversations marked by distress, ambiguity, and escalation.
Syracuse, NY · Open to remote · kjm2145@columbia.edu · LinkedIn · GitHub

Red-teaming a conversational AI system is, underneath, a measurement problem. The question is not only whether the model failed, but what kind of failure it was, why the evaluation missed it, and where the leverage to prevent it lies. These are the questions I’ve spent my career learning to answer.

Selected Work & Capabilities

Failure-Mode Taxonomy

Built and maintain a structured taxonomy of human–AI collaboration failure modes that scores each failure across four independent agent axes (the model’s stochasticity, the user, the deployer, and the model developer), so failures are traced to causes with actionable solutions, not just flagged.

taxonomy design · root-cause attribution

Cultural Bias in LLM Evaluation

Identified and documented a recurring failure mode: models present Western-architecture LLMs as “more trustworthy” or of “higher quality” while erasing non-Western-architecture LLMs.

Example: a model refused to follow an explicit user prompt to use several Chinese LLMs, citing “IP-leakage concerns,” while applying no such caution to the Western-architecture LLMs it was also prompted to use.

bias detection · culturally-attuned eval

ASAE — Adversarial Audit Methodology

Designed a multi-iteration audit methodology for AI-generated work built on independent-rater attestation rather than model self-rating, with a claim-level standard: every claim must trace back to an independent source, enforced at commit time.

inter-rater reliability · provenance tracing · verification design

Measurement & Construct Validity

Trained in experimental psychology and psychometrics (Columbia, Neuroscience & Behavior). I distinguish what an instrument actually captures from what it claims to capture, and I identify where a benchmark’s “correct” answer is culturally encoded. Graduate thesis: Next Generation Science Scholars: Real Science + Real Relationships = Real Achievement.

psychometrics · rubric design · benchmark critique

High-Stakes & Mental-Health Domains

Designed the data-ecosystem diagnostic and report-generation pipeline for an international mental-health NGO operating across seven regulatory jurisdictions. Four years managing distress, conflict, and de-escalation daily in high-need classrooms; fluent in mental-health boundaries and where a model’s notion of “empathy” fails real people.

distress & escalation · clinical-adjacent eval

Performance & Roleplay Range

Trained at Relay GSE in a practice-intensive method (profiled by NPR as “part theater, part sport”) and performed it live, daily, for four years in front of twelve- to fourteen-year-olds. I also write poetry and flash fiction, so scripting believable, emotionally textured red-team scenarios across registers and personas is native to me.

scenario authoring · persona range · bilingual (EN/ES)

How I work: AIGHVA (AI-Generated, Human-Verified, Accurate). Every claim on this page is source-traceable or it isn’t here. I underclaim when unsure, preregister what I’m testing, and report the unflattering result.