Reference-Anchored Imputation — Fixing Consumer DNA VCF Conversion
TL;DR: Our old consumer-chip-to-VCF conversion could encode some variants with REF and ALT swapped. That matters because Beagle, PRS scoring, protein scoring, and ClinVar interpretation all depend on allele dosage. The fix was to make the converter reference-anchored: every chip marker must be aligned to the 1000 Genomes Phase 3 GRCh37 reference-panel allele map before it is allowed into the imputation target VCF.
VCF REF is not “the first allele in the chip file”
Consumer DNA files usually give an rsID and the two observed alleles, for example rs123 A G. That is enough to say what the user carries, but it is not enough to create a valid VCF record. In VCF, REF must be the reference-genome allele at that chromosome position, and ALT must be the non-reference allele.
Our previous converter sometimes treated the first observed chip allele as REF and the second as ALT. That can silently invert dosage for markers where the chip allele order does not match the genome reference. The visible symptom was confusing: a directly typed chip call and an overlapping imputed call could appear to “disagree,” when the deeper issue was that the imputation target VCF had been encoded from an unsafe allele assumption.
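To make the inversion concrete, here is a minimal sketch. It is a simplified illustration, not the actual converter code: the function names are hypothetical, and the marker (REF=A, ALT=G in the reference panel) is invented for the example.

```python
def naive_alt_dosage(observed):
    """Unsafe assumption: the first observed chip allele is REF.

    Every allele that differs from it is counted as ALT.
    """
    assumed_ref = observed[0]
    return sum(1 for allele in observed if allele != assumed_ref)


def anchored_alt_dosage(ref, alt, observed):
    """Count ALT copies using REF/ALT taken from the reference panel."""
    if any(allele not in (ref, alt) for allele in observed):
        raise ValueError("observed allele is not REF or ALT at this site")
    return sum(1 for allele in observed if allele == alt)


# Hypothetical marker: the panel says REF=A, ALT=G.
# The user is homozygous for the non-reference allele G.
observed = ("G", "G")

print(naive_alt_dosage(observed))               # 0: looks like homozygous reference
print(anchored_alt_dosage("A", "G", observed))  # 2: two copies of ALT
```

The two functions agree on heterozygous calls, so the error only shows up at homozygous non-reference genotypes, which is part of why it can stay hidden.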
One allele-order mistake can spread through the report
The downstream report is only as good as the genotype matrix that feeds it. If allele orientation is wrong at the conversion step, the same bad dosage can affect multiple layers:
- ClinVar interpretation can match the rsID but attach the wrong clinical allele.
- PRS scoring can count the effect allele backwards.
- Protein PRS can misread genetically predicted protein direction.
- Agent narratives can then explain a false signal that should never have reached them.
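As a rough illustration of the PRS case, the sketch below scores two invented markers with invented weights. The rsIDs, weights, and the `prs` helper are all assumptions made for this example. It shows how inverting the dosage at a single marker can change the sign of a small score.

```python
def prs(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1, or 2 per marker)."""
    return sum(weights[rsid] * dosages[rsid] for rsid in weights)


# Hypothetical effect-allele weights and true dosages.
weights = {"rsA": 0.5, "rsB": -0.25}
true_dosages = {"rsA": 0, "rsB": 2}

# Simulate one allele-order mistake: rsA is encoded backwards,
# so its dosage becomes 2 - dosage.
inverted = dict(true_dosages)
inverted["rsA"] = 2 - true_dosages["rsA"]

print(prs(true_dosages, weights))  # -0.5
print(prs(inverted, weights))      # 0.5
```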
Build the source of truth before the agents see anything
We rebuilt the conversion layer around a local reference-panel rsID map. For each chip marker, the converter now looks up the chromosome, position, REF, and ALT from the same 1000 Genomes Phase 3 GRCh37 panel used by Beagle. It then writes VCF genotypes against that reference orientation.
- If both observed chip alleles map cleanly to REF/ALT, write the anchored VCF row.
- If the marker is strand-flipped, complement it only when the reference map proves the flip.
- If the marker cannot be anchored safely, skip it instead of guessing.
- If too much of the file is unanchored, fail conversion rather than generate a misleading report.
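The four rules above can be sketched roughly as follows. This is a sketch under stated assumptions, not the production converter: `PanelSite`, `anchor_genotype`, `convert`, and the 5% `max_unanchored` threshold are hypothetical names and values, and only biallelic SNPs are handled.

```python
from dataclasses import dataclass
from typing import Optional

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


@dataclass(frozen=True)
class PanelSite:
    """One biallelic site from the reference-panel rsID map (hypothetical shape)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def anchor_genotype(site: PanelSite, alleles) -> Optional[str]:
    """Return a VCF GT string ("0/0", "0/1", "1/1"), or None if unsafe."""
    if len(alleles) != 2 or any(a not in COMPLEMENT for a in alleles):
        return None  # no-calls, indels, or anything that is not a plain base
    panel = (site.ref, site.alt)
    if all(a in panel for a in alleles):
        fixed = tuple(alleles)  # rule 1: alleles map cleanly to REF/ALT
    else:
        flipped = tuple(COMPLEMENT[a] for a in alleles)
        if not all(a in panel for a in flipped):
            return None  # rule 3: cannot be anchored, so skip
        fixed = flipped  # rule 2: only the complement matches the panel
    idx = sorted(0 if a == site.ref else 1 for a in fixed)
    return f"{idx[0]}/{idx[1]}"


def convert(markers, panel, max_unanchored=0.05):
    """Build anchored VCF rows; fail if too many markers were skipped."""
    rows, skipped = [], 0
    for rsid, alleles in markers:
        site = panel.get(rsid)
        gt = anchor_genotype(site, alleles) if site else None
        if gt is None:
            skipped += 1
            continue
        rows.append((site.chrom, site.pos, rsid, site.ref, site.alt, gt))
    if markers and skipped / len(markers) > max_unanchored:
        # rule 4: fail loudly instead of producing a misleading VCF
        raise ValueError(f"{skipped}/{len(markers)} markers could not be anchored")
    return rows
```

One limitation of this sketch: at palindromic sites (A/T or C/G), the observed alleles match the panel on either strand, so the allele check alone cannot prove or disprove a flip. A real pipeline may want to treat those sites separately.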
This is intentionally stricter than the old fallback. Missing a small number of ambiguous markers is safer than letting uncertain REF/ALT assumptions flow into imputation and analysis.
Direct chip calls stay direct; imputation fills only the gaps
The final analysis database is built after Beagle finishes. It keeps directly observed chip genotypes as the interpretation source for markers present on the chip, then adds imputed genotypes for markers that were not directly observed. This is the clean contract the agents receive: one genotype table, one allele orientation, one annotated source of truth.
- Observed marker: use the reference-anchored direct chip genotype.
- Unobserved marker: use Beagle imputation with genotype probability and dosage.
- Clinical annotation: match allele-specific ClinVar records, not rsID-only summaries.
- Agent input: pass the enriched data, not a pile of post-generation corrections.
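A minimal sketch of the first two rules of that contract, assuming both inputs are dictionaries keyed by rsID. The function name and the field names (`gt`, `gp`, `ds`, `source`) are hypothetical, chosen only for this example.

```python
def build_genotype_table(direct, imputed):
    """Merge direct chip calls with imputed calls into one table.

    direct:  rsid -> anchored GT string, e.g. "0/1"
    imputed: rsid -> dict with "gt", "gp" (genotype probabilities), "ds" (dosage)
    A direct call always wins over an imputed call for the same rsID.
    """
    table = {}
    for rsid, record in imputed.items():
        table[rsid] = {**record, "source": "imputed"}
    for rsid, gt in direct.items():
        # Overwrite any imputed entry: observed markers stay direct.
        table[rsid] = {"gt": gt, "source": "direct"}
    return table


# Invented example data.
direct = {"rs1": "0/1"}
imputed = {
    "rs1": {"gt": "0/0", "gp": (0.90, 0.09, 0.01), "ds": 0.11},
    "rs2": {"gt": "1/1", "gp": (0.01, 0.04, 0.95), "ds": 1.94},
}

table = build_genotype_table(direct, imputed)
print(table["rs1"])  # {'gt': '0/1', 'source': 'direct'}
print(table["rs2"]["source"])  # imputed
```

Writing imputed rows first and direct rows second keeps the precedence rule in one place, so no later step has to decide which call to trust.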
What we checked
We tested the converter against both supported consumer-chip formats in our local pipeline: AncestryDNA-style text files and MyHeritage ZIP uploads. The important checks were not whether the run completed, but whether the resulting genotype database had the right allele semantics.
- LPA rs10455872: common AA stays common/normal when AA is the reference-aligned genotype.
- F2 rs1799963: AG and GA are treated as the same heterozygous genotype after allele normalization.
- ClinVar matches: pathogenic status comes from allele-specific records rather than collapsed rsID labels.
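The last two checks can be sketched as below. The record layout, the `rsEXAMPLE` identifier, and the labels are invented placeholders rather than real ClinVar content, and the helper names are assumptions for illustration.

```python
def normalize_genotype(genotype: str) -> str:
    """Order-independent genotype string, so "GA" and "AG" compare equal."""
    return "".join(sorted(genotype.upper()))


def carried_records(records, rsid, genotype):
    """Keep only annotation records whose allele the user actually carries."""
    carried = set(normalize_genotype(genotype))
    return [r for r in records if r["rsid"] == rsid and r["allele"] in carried]


# Placeholder annotation records for a made-up site with reference allele G.
records = [
    {"rsid": "rsEXAMPLE", "allele": "A", "label": "placeholder label for A"},
    {"rsid": "rsEXAMPLE", "allele": "T", "label": "placeholder label for T"},
]

print(normalize_genotype("GA") == normalize_genotype("AG"))  # True
print(carried_records(records, "rsEXAMPLE", "GG"))           # []
print(len(carried_records(records, "rsEXAMPLE", "GA")))      # 1
```

An rsID-only lookup would attach both placeholder labels to every user with a call at this site, including the homozygous-reference GG carrier.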
The result is simpler than the debugging process that led us here: imputation is still used, but only after the chip has been converted into a valid, reference-anchored VCF. The agents no longer need to reason around allele-orientation uncertainty that should have been solved upstream.
The Production Lesson
Do not ask a frontier AI agent to fix bad source data. If the genotype layer is wrong, the narrative can only become more confidently wrong.
Alleles must be resolved before interpretation. rsIDs are identifiers, not diagnoses. The exact allele and genome build are the evidence.
Strict failure is better than silent guessing. A report should fail loudly when the chip cannot be safely anchored.