---
vault_clearance: EUCLID
halo:
  classification: INTERNAL
  custodian: The Architect
  created: 2026-04-26
  confidence: HIGH
  front: "Operational map of the cellular encoding architecture"
  updated: 2026-04-26
  wing: UNASSESSED
---

# FORM — 35_Project_TheHats — operational map of the cellular encoding architecture

## Purpose

This file is the **architectural blueprint** of what we're measuring. It lays out the encoding layers, what each layer does, what observable property each lays down, what we measure to read it, and what tools / data we need.

Every "hat" the cell wears corresponds to a layer here. Reading the cell's full source code = profiling all layers simultaneously.

## The architecture (10 measurable layers, Layers 1-10, plus the DNA archive and two outward-facing surfaces)

```
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 0: DNA — archival, cell-identity-stable                       │
│   Read by: cell itself                                              │
│   Adversary cannot read: physically inside nucleus                  │
│   Measurable: WGS / WES (we don't measure this — focus is RNA→prot) │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 1: Pre-mRNA — full transcript with introns + alt-TSS         │
│   Information: which gene is being transcribed, alt-TSS choice      │
│   Crypto role: Stage 1 routing — what enters the splicing pipeline │
│   Privacy: pre-mRNA is intracellular and transient                 │
│   Adversary access: NONE (transient nuclear)                       │
│   Measurement: nascent-RNA-seq (NET-seq, GRO-seq)                  │
│   Status: NOT MEASURED in cohort                                   │
└────────────────────────────────────────────────────────────────────┘
                               │
                  splicing decisions
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 2: Mature mRNA — alt splicing, intron retention               │
│   Information: which isoform makes it through                       │
│   Crypto role: protocol-pivot capacity (HALO_ENCODING §7)           │
│   Privacy: short-lived, intracellular, but somewhat readable        │
│   Adversary access: only via single-cell lysis                     │
│   Measurement: scRNA-seq, BAM CIGAR-N junction analysis            │
│   Status: ★ MEASURED VIA atlas_full6.db + STAFF C-kernel pipeline │
│   BT83-BT88 results live here                                      │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 3: RNA modifications (m6A, Ψ, m5C)                            │
│   Information: self-marker watermark per transcript                 │
│   Crypto role: authentication watermark (HALO_ENCODING §6)          │
│   Privacy: visible to RIG-I/MDA5; foreign RNA lacks it             │
│   Adversary access: viruses must evolve to add modifications       │
│   Measurement: MeRIP-seq, m6A-MAP, Ψ-seq                           │
│   Status: NOT MEASURED in cohort, public data exists in GEO        │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 4: Codon usage / translation kinetics                         │
│   Information: per-tissue private channel via tRNA-pool match       │
│   Crypto role: per-tissue cipher key (HALO_ENCODING §2)             │
│   Privacy: visible only to systems running through the host         │
│            translational machinery                                  │
│   Adversary access: pathogens MUST match host codon bias to spoof  │
│   Measurement: per-gene codon frequency vs. cell-type tRNA pool    │
│                (CAI = Codon Adaptation Index, tAI = tRNA AI)        │
│   Status: NOT MEASURED, COMPUTABLE from existing BAM + reference   │
└────────────────────────────────────────────────────────────────────┘
                               │
                  translation
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 5: Protein primary sequence — wobble-projected from mRNA     │
│   Information: AA chain                                             │
│   Crypto role: lossy hash output — RNA inventory irrecoverable     │
│   Privacy: same protein from different codons indistinguishable    │
│   Adversary access: visible to ribosome / proteasome / MHC-I        │
│   Measurement: implicit in protein presence; no special tool       │
│   Status: standard                                                 │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 6: IDR (intrinsically disordered regions) content             │
│   Information: structural-unverifiability mass per protein          │
│   Crypto role: unforgeable polymorphism (HALO_ENCODING §3)          │
│   Privacy: pathogens cannot mimic by structural matching            │
│   Adversary access: limited — IDRs DON'T present clean structure   │
│   Measurement: IUPred3, MobiDB, ESM2 disorder predictor            │
│   Status: NOT MEASURED, COMPUTABLE from sequence (no BAM needed)   │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 7: Low-complexity regions / repeat regions                    │
│   Information: high-mutation-rate per-individual identity tags      │
│   Crypto role: per-individual fingerprinting (HALO_ENCODING §4)     │
│   Privacy: per-individual variation prevents population mimicry    │
│   Adversary access: pathogens trained on population see consensus  │
│                     not the individual variant                     │
│   Measurement: SEG, BLAST low-complexity filter                    │
│   Status: NOT MEASURED, COMPUTABLE from sequence                   │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 8: Folded protein — chaperone-verified 3D structure           │
│   Information: fold family, structural identity                     │
│   Crypto role: structural authentication (3D verification)          │
│   Privacy: misfolded → ERAD / UPR / proteasome destruction         │
│   Adversary access: pathogens must reproduce fold to spoof         │
│   Measurement: implicit; AlphaFold predictions per protein         │
│   Status: predictable; aggregating per cell type adds power        │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 9: Post-translational modifications (phospho, acetyl, etc.)   │
│   Information: state-dependent runtime tags                         │
│   Crypto role: programmable switches (HALO_ENCODING §11)            │
│   Privacy: PTMs require host-specific enzymes                      │
│   Adversary access: pathogens must hijack host PTM machinery       │
│   Measurement: phospho-proteomics, acetyl-proteomics                │
│   Status: NOT MEASURED, public data exists                         │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 10: Glycosylation — second crypto layer                       │
│   Information: 200+ sugars / linkages = high-diversity code         │
│   Crypto role: SECOND crypto layer (HALO_ENCODING §5)               │
│   Privacy: per-individual (ABO/Lewis blood types), per-tissue      │
│   Adversary access: requires full host glycosyltransferase repertoire │
│   Measurement: NetNGlyc + NetOGlyc (sites); glycomics (composition) │
│   Status: NOT MEASURED, COMPUTABLE for sites; composition needs proteomics │
└────────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌────────────────────────────────────────────────────────────────────┐
│ LAYER 11 (broadcast surface): MHC-I peptide presentation            │
│   Information: subset of the proteome surfaced for T-cell scan     │
│   Crypto role: BROADCAST — what the cell deliberately reveals       │
│   Privacy: cell selects which proteasome products reach surface    │
│   Adversary access: visible to T cells (the intended audience)     │
│   Measurement: immunopeptidomics (MS-based)                        │
│   Status: NOT MEASURED in our cohort                               │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│ LAYER 12 (paracrine): exosomes / EVs — encrypted RNA + protein      │
│   Information: cargo (mRNA + miRNA + lncRNA + protein), addressing  │
│   Crypto role: encrypted packets with surface addressing markers    │
│   Privacy: lipid envelope blocks RNase; receptor-matched uptake    │
│   Adversary access: limited to receivers with matching receptors   │
│   Measurement: exosome RNA-seq + tetraspanin profiling             │
│   Status: NOT MEASURED in our cohort                               │
└────────────────────────────────────────────────────────────────────┘
```
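To make the map queryable, the diagram can be transcribed into a small registry. The sketch below is one possible encoding, not an existing project file: the `Status` categories, the `Layer` fields, and the `layers_with` helper are names introduced here for illustration, and only a subset of layers is transcribed.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Hypothetical status buckets condensed from the "Status:" lines above.
    MEASURED = "measured"
    COMPUTABLE = "computable from existing data"
    PUBLIC_DATA = "needs public-data integration"
    NOT_MEASURED = "not measured"


@dataclass(frozen=True)
class Layer:
    index: int
    name: str
    measurement: str
    status: Status


# A hand-transcribed subset of the diagram; extend with the remaining layers.
LAYERS = [
    Layer(1, "Pre-mRNA", "nascent-RNA-seq (NET-seq, GRO-seq)", Status.NOT_MEASURED),
    Layer(2, "Mature mRNA", "scRNA-seq, BAM CIGAR-N junction analysis", Status.MEASURED),
    Layer(3, "RNA modifications", "MeRIP-seq, m6A-MAP, Ψ-seq", Status.PUBLIC_DATA),
    Layer(4, "Codon usage", "CAI / tAI", Status.COMPUTABLE),
    Layer(6, "IDR content", "IUPred3, MobiDB, ESM2", Status.COMPUTABLE),
    Layer(7, "Low-complexity regions", "SEG", Status.COMPUTABLE),
]


def layers_with(status):
    """Return the indices of all registered layers in the given status."""
    return [layer.index for layer in LAYERS if layer.status is status]
```

Calling `layers_with(Status.COMPUTABLE)` on this subset lists the layers that need no new data, which is the same cut the measurement priorities make.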

## Measurement priorities

The layers we can measure CHEAPLY (sequence-based or atlas-based, no new wet-lab data):
- ★ **Layer 2 (alt splicing / IR)** — already done, atlas + BT86
- **Layer 4 (codon usage)** — sequence-based, computable from BAM + reference (1-2 days)
- **Layer 6 (IDR content)** — sequence-based via IUPred3 (1 day)
- **Layer 7 (low-complexity content)** — sequence-based via SEG (hours)
- **Layer 10a (glyco SITES)** — sequence-based via NetNGlyc / NetOGlyc (1-2 days)
- **Layer 8 (fold family)** — predictable via AlphaFold structures (existing public DB)
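
As an illustration of how cheap the sequence-based layers are, here is a minimal sketch of the Layer 4 measurement using the classic Codon Adaptation Index. It assumes a reference set of coding sequences stands in for the cell-type tRNA pool (so it covers CAI only, not tAI), skips codons absent from the reference, and excludes Met, Trp, and stop codons as is conventional. The function names are introduced here and are not part of any existing pipeline.

```python
import math
from collections import Counter

# Standard genetic code, built from the canonical TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(cds):
    """Split a coding sequence into in-frame codons (RNA or DNA alphabet)."""
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def relative_adaptiveness(reference_cds_list):
    """Weight of each codon = its count / count of the most-used synonym."""
    counts = Counter(
        c for cds in reference_cds_list for c in codons(cds) if c in CODON_TABLE
    )
    best = Counter()
    for codon, n in counts.items():
        aa = CODON_TABLE[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[CODON_TABLE[codon]] for codon, n in counts.items()}


def cai(cds, weights):
    """Geometric mean of codon weights over the informative codons of a CDS."""
    logs = [
        math.log(weights[c])
        for c in codons(cds)
        if c in weights and CODON_TABLE[c] not in "*MW"
    ]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")
```

A per-gene `cai` value near 1 means the gene uses the reference set's preferred codons; how the reference set is chosen per cell type is the open design question.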

The layers requiring **new public-data acquisition**:
- **Layer 3 (m6A)** — public MeRIP-seq from GEO, must integrate
- **Layer 9 (PTMs)** — public phospho-proteomics, must integrate
- **Layer 10b (glyco COMPOSITION)** — glycomics, sparser data
- **Layer 11 (immunopeptidome)** — public MHC-I data exists for some cell types
- **Layer 12 (exosome cargo)** — public datasets exist

The layers we **cannot easily measure**:
- **Layer 1 (nascent RNA)** — would require NET-seq / GRO-seq generation
- **Layer 5/8 (folded protein presence per cell)** — proteomics needed

## The privacy-stack-depth metric

Per cell type, a single integrated score:

```
PRIVACY_STACK_DEPTH(cell_type) =
    f( splicing_variance,         # Layer 2 — from BT86 framework
       codon_bias_strength,       # Layer 4 — Tier-0 task
       idr_load,                   # Layer 6 — Tier-0 task
       low_complexity_load,        # Layer 7 — Tier-0 task
       glyco_site_density          # Layer 10a — Tier-0 task
     )
```

Here `f` is an integration function that has not been chosen yet; candidates are a geometric mean, a weighted sum, or an entropy-based score. Choosing `f` is itself a Tier-1 task: each candidate should be compared on its predictive power against TCGA.
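
One candidate for `f` can be sketched as a weighted geometric mean over min-max-normalized features. This is only one of the options listed above, not a decided definition: the normalization step, the `eps` floor, and the equal default weights are all assumptions made for the sketch.

```python
import math

FEATURES = (
    "splicing_variance",     # Layer 2
    "codon_bias_strength",   # Layer 4
    "idr_load",              # Layer 6
    "low_complexity_load",   # Layer 7
    "glyco_site_density",    # Layer 10a
)


def min_max_normalize(profiles, eps=1e-6):
    """Rescale each feature across cell types into [eps, 1].

    `profiles` maps cell type -> {feature name -> raw value}.
    The eps floor (an assumption) keeps the geometric mean away from log(0).
    """
    normalized = {cell: {} for cell in profiles}
    for feat in FEATURES:
        values = [p[feat] for p in profiles.values()]
        lo, hi = min(values), max(values)
        span = hi - lo
        for cell, p in profiles.items():
            scaled = (p[feat] - lo) / span if span > 0 else 1.0
            normalized[cell][feat] = max(scaled, eps)
    return normalized


def privacy_stack_depth(features, weights=None):
    """Weighted geometric mean of already-normalized feature values."""
    weights = weights or {f: 1.0 for f in FEATURES}
    total = sum(weights[f] for f in FEATURES)
    log_sum = sum(weights[f] * math.log(features[f]) for f in FEATURES)
    return math.exp(log_sum / total)
```

A geometric mean penalizes a cell type that is shallow on any single layer, whereas a weighted sum would let one deep layer compensate for the others. That difference is exactly what the Tier-1 comparison would need to test.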

## Predicted patterns

Based on `HALO_ENCODING_AS_CS_PROBLEM`:

| cell type | predicted privacy-stack depth |
|---|---|
| **Healthy primary differentiated cells** | DEEP (full stack engaged) |
| **Stem cells / progenitors** | DEEP (need to remain hidden during expansion) |
| **Senescent cells (BT86 / our cohort)** | DEEP at splicing layer (BT86 confirmed); other layers unknown |
| **Cancer cells (proliferating tumor)** | SHALLOW at multiple layers (the testable prediction) |
| **Immune-privileged-site cells (eye, brain, testis)** | SHALLOWER (less need for privacy in privileged niches) |
| **Activated immune cells (signaling broadly)** | INTERMEDIATE — broadcasting more, hiding less, but still verified |
| **Embryonic / fetal cells** | DEEP (high privacy investment for development protection) |
| **Senescent + cancer-passaged (e.g. SCC)** | UNCERTAIN — competing pressures |

These are testable predictions. Each provides a ground-truth case for validating the metric.

## First-pass empirical results — splicing layer (Day 1, 10-sample gradient)

The first measurement run on Layer 2 (splicing) showed two distinct "broken tool" patterns across three readouts: splicing-factor (SF) expression, junction variety, and junction concentration.

### Per-million-reads junction diversity
| group | sample | unique junctions per M reads | % of junctions carrying the top 50% of reads |
|---|---|---:|---:|
| fetal | H9_fetal (ESC) | **271,000** | 3.0% (FLATTEST) |
| cancer (bulk) | HepG2_HCC | 94,000 | 1.06% |
| cancer (bulk) | K562_CML | 126,000 | 1.33% |
| cancer (10x sc) | Zilionis_NSCLC | 4,634 | **0.22% (MOST CONCENTRATED)** |
| prol (10x sc) | cohort P1-P3 | 3,200-3,300 | ~0.18% |
| sen (10x sc) | cohort S1-S3 | 3,100-3,500 | ~0.20% |

**Bulk-only comparison (apples-to-apples): fetal is 2-3× more diverse per read than cancer.**
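
The two columns in the table can be reproduced from per-junction read counts roughly as follows. This sketch assumes the concentration column means "the smallest percentage of junctions that together carry half of all junction-spanning reads", and that diversity is normalized by total sequenced reads. Both readings are assumptions, and the function names are introduced here.

```python
def junctions_per_million(junction_counts, total_reads):
    """Unique junctions (count > 0) per million sequenced reads."""
    unique = sum(1 for c in junction_counts if c > 0)
    return unique / (total_reads / 1e6)


def top_half_junction_share(junction_counts):
    """Percent of junctions needed to cover 50% of junction-spanning reads.

    Lower values mean reads are concentrated in few junctions;
    higher values mean a flatter distribution.
    """
    counts = sorted((c for c in junction_counts if c > 0), reverse=True)
    if not counts:
        return float("nan")
    half = sum(counts) / 2
    running = 0
    for rank, c in enumerate(counts, start=1):
        running += c
        if running >= half:
            return 100.0 * rank / len(counts)
```

Because both metrics depend on sequencing depth and chemistry, they should only be compared within a platform, which is why the bulk and 10x rows are kept separate.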

### SF expression two-pattern signature
- **Senescent (cohort):** roughly uniform shift of about -17% across all 7 SFs, expression ratios 0.74-0.95 → **spread ≈0.2 (LOW variance, COORDINATED)**
- **Cancer (bulk vs fetal):** variable shift of -15% to -70% across SFs → **spread 0.49-0.52 (HIGH variance, DYSREGULATED)**

**Senescent looks coordinated. Cancer looks chaotic.** Both reduce SF expression overall, but via different mechanisms.
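
The coordinated-versus-chaotic distinction comes down to two numbers per comparison: the mean shift and the max-minus-min spread of the per-SF expression ratios. Here is a minimal sketch, assuming expression is supplied as a mapping from SF name to a normalized expression value. The function name and the input shape are assumptions.

```python
from statistics import mean


def sf_shift_summary(test_expr, ref_expr):
    """Summarize per-SF expression ratios (test / reference).

    A small spread with a non-zero mean shift suggests a coordinated change;
    a large spread suggests factor-by-factor dysregulation.
    """
    ratios = {sf: test_expr[sf] / ref_expr[sf] for sf in ref_expr}
    values = list(ratios.values())
    return {
        "mean_change_pct": 100.0 * (mean(values) - 1.0),
        "min_ratio": min(values),
        "max_ratio": max(values),
        "spread": max(values) - min(values),
    }
```

With only seven factors per comparison, the spread is sensitive to a single outlier, so it is worth reporting the min and max ratios alongside it.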

### Two distinct "broken tool" regimes

| readout | fetal | proliferative | senescent | cancer |
|---|---|---|---|---|
| SF expression | HIGHEST | HIGH | -17% (uniform) | variable (-15% to -70%) |
| Junction variety | HIGHEST | HIGH | preserved (~99%) | LOWER (35-46% of fetal) |
| Concentration | flattest (3.0%) | similar to sen | similar to prol | most concentrated (0.22-1.3%) |
| Pattern | full precision | full precision | **engineered uniform regression** | **emergent dysregulation** |

This empirically grounds `HALO_DELIBERATELY_BROKEN.md`'s framework distinction:
- Senescent is doing **deliberate privacy regression** (uniform downsizing, preserved variety, targeted gene-level redirection per BT86 hot zones).
- Cancer is doing **dysregulated parser collapse** (chaotic SF reduction, variety collapse, concentrated parser usage). This may reflect compromised regulation, or a different privacy strategy operating on other layers (MHC-I downregulation, CD47, PD-L1).

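The **operator's "fetal → proliferative → senescent/cancer = decreasing splicing variety" intuition is supported by this first pass**, with a refinement: senescent and cancer reach their "broken tool" states by different routes. Senescent cells keep junction variety (~99%) while lowering SF expression uniformly, whereas cancer loses variety outright (35-46% of fetal in the bulk comparison).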

## Connection to the cohort empirics

The HAEC+PBMC cohort (BT85/86) operationalizes this for the splicing layer:

- 6 samples, 374-382M reads each
- 1.2M+ junctions identified per sample via STAFF C-kernel
- 15,658 junctions show ≥2× P-vs-S splice-rate change
- Coordinated hot zones at chr14:55M, chr17:40M (proliferation), chr20:31.7M (TPX2), chr4:103M (CENPE), chr7:36-119M (RALA-containing), chr2:110-124M (RALB-containing)
- Senescent cells hide BOTH proliferation machinery AND trafficking-rewiring genes (BT88)
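
The ≥2× P-vs-S filter can be sketched as below. The exact definition of splice rate used by the STAFF pipeline is not given here, so this sketch assumes a rate of junction reads per million total reads with a pseudocount of 1. The junction-ID keys and the function names are likewise hypothetical.

```python
def splice_rate(junction_reads, total_reads, pseudocount=1.0):
    """Junction reads per million total reads (assumed definition)."""
    return (junction_reads + pseudocount) / (total_reads / 1e6)


def changed_junctions(p_counts, s_counts, p_total, s_total, min_fold=2.0):
    """Junctions whose P-vs-S rate differs by >= min_fold in either direction.

    p_counts / s_counts map a junction ID to its read count in the
    proliferative (P) and senescent (S) samples. Returns {junction ID: P/S fold}.
    """
    hits = {}
    for jid in p_counts.keys() | s_counts.keys():
        p = splice_rate(p_counts.get(jid, 0), p_total)
        s = splice_rate(s_counts.get(jid, 0), s_total)
        fold = p / s
        if fold >= min_fold or fold <= 1.0 / min_fold:
            hits[jid] = fold
    return hits
```

The pseudocount keeps junctions seen in only one state from producing a division by zero, at the cost of inflating fold changes for very low-count junctions. A minimum-count filter would be a reasonable addition.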

Adding Layer 4 (codon usage), Layer 6 (IDR), Layer 7 (low-complexity), Layer 10a (glyco sites) will give a 5-layer profile per sample. Then we can compute the privacy-stack-depth per sample-state and validate cancer/senescence predictions.
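
Two of those added layers (Layer 7 and Layer 10a) can be approximated from protein sequence alone. The sketch below is a stand-in, not the named tools: it counts canonical N-X-S/T sequons (X ≠ P) where the plan calls for NetNGlyc, and flags low-entropy windows where it calls for SEG. The window of 12 and the 2.2-bit threshold are loosely modelled on SEG's commonly cited defaults and are assumptions here.

```python
import math
import re
from collections import Counter

# N-X-S/T sequon with X != P; the lookahead lets overlapping sequons all count.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_density(protein):
    """Candidate N-glycosylation sequons per 100 residues."""
    if not protein:
        return 0.0
    return 100.0 * len(SEQUON.findall(protein.upper())) / len(protein)


def low_complexity_fraction(protein, window=12, max_entropy_bits=2.2):
    """Fraction of residues covered by at least one low-entropy window."""
    protein = protein.upper()
    n = len(protein)
    if n < window:
        return 0.0
    covered = [False] * n
    for start in range(n - window + 1):
        counts = Counter(protein[start:start + window])
        entropy = -sum(
            (c / window) * math.log2(c / window) for c in counts.values()
        )
        if entropy <= max_entropy_bits:
            for i in range(start, start + window):
                covered[i] = True
    return sum(covered) / n
```

A sequon scan over-calls relative to a trained predictor, since not every sequon is glycosylated, so these values are upper bounds on site density rather than predictions.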

## Next steps

1. **Tier 0 measurement infrastructure** (4 tasks in BOUNTY_BOARD)
2. **Tier 1 integration** (privacy-stack-depth metric definition + cohort validation)
3. **Tier 2 cancer prediction** (TCGA tumor-vs-normal validation)

The architecture is the map. The bounty board is the route. WORLDLINE is the log of how we got here. README is the why.
