---
vault_clearance: KETER
halo:
  classification: RESTRICTED
  confidence: HIGH
  front: "38_Project_FIREWALL — public datasets + reference catalog for clean-room replication"
  custodian: "Jixiang Leng"
  created: 2026-05-03
  wing: READY
  containment: "Operator IP-firewall data catalog. Public sources only — but the curation list itself is sensitive (it tells an extractive party which datasets to interfere with)."
---

# FIREWALL — Public Data + Reference Catalog

Per [BOOK_Protocol.md](../BOOK_Protocol.md). All entries here are public-domain or free-access data sources used for IP-clean replication of vault content.

## §A — Senescence + WI-38 single-cell (highest priority for FW-S tier)

| ID | Source | Description | Why we use it |
|---|---|---|---|
| FW-D1 | [GSE226225](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE226225) | WI-38 etoposide-induced senescence, day 0→10 timecourse, ~29k cells, 6-7 conditions | **Primary FW-S replication target.** Same WI-38 cell line as the NIH cohort; etoposide-induced senescence (DNA damage) maps to the cluster 7→5→6 archetype progression. |
| FW-D2 | [GSE250041](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE250041) | Proliferation-vs-senescence multi-cell-type | Cross-validation for cluster 5/6 distinction across cell types |
| FW-D3 | [GSE175533](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE175533) | Replicative senescence WI-38, ~10k cells | Replicative (not damage) senescence — different trigger, same biology if framework holds |
| FW-D4 | [GSE150247](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE150247) | Normal lung scRNA-seq | EC reference baseline for FW-B1 UHRF1 measurements |
| FW-D5 | [Tabula Sapiens](https://tabula-sapiens-portal.ds.czbiohub.org/) | Healthy multi-tissue scRNA-seq, ~500k cells | EC reference across tissues; senescence reference subset |
| FW-D6 | [HCA endothelial atlas](https://data.humancellatlas.org/) | Endothelial cell-specific atlas | Cluster 7 (early EC) reference |

## §B — Cancer single-cell + bulk (for FW-R tier — Recycler / Fungal cell)

| ID | Source | Description | Why we use it |
|---|---|---|---|
| FW-D10 | [GSE131907](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131907) | Korean NSCLC scRNA-seq, 208,506 cells, 44 patients | EC subset for splicing-thread cross-validation; cancer-subtype stratification for FW-R5 chitin-high panel |
| FW-D11 | Darmanis GBM scRNA-seq | GBM scRNA-seq, ~338k cells, 110 patients | Tests peak-and-release generalization to cancer; FW-R5 panel test |
| FW-D12 | [GSE131928](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE131928) | Suva GBM | Same as FW-D11 |
| FW-D13 | [CellxGene Census](https://cellxgene.cziscience.com/) | Curated public single-cell atlas, multi-disease | FW-R3 sterol-biosynthesis up-regulation test in cancer vs normal |
| FW-D14 | [GTEx](https://gtexportal.org/) | Public bulk transcriptomics across tissues | FW-R3 reference; sterol pathway baselines |

## §C — Yeast / fungal / cross-kingdom (for FW-R tier)

| ID | Source | Description | Why we use it |
|---|---|---|---|
| FW-D20 | *S. cerevisiae* expression cohorts (multiple GEO) | Yeast under stress conditions | FW-R1 cross-kingdom voltage / trafficking conservation |
| FW-D21 | *C. neoformans* expression cohorts | *Cryptococcus neoformans* | Same as FW-D20 |
| FW-D22 | *C. albicans* expression cohorts | *Candida albicans* | Same as FW-D20 |
| FW-D23 | [SGD (Saccharomyces Genome Database)](https://www.yeastgenome.org/) | Yeast gene annotation + ortholog mapping | FW-R2 gene-level conservation tests |
| FW-D24 | [HOMER orthologs](http://homer.ucsd.edu/) | Cross-species ortholog tables | Same as FW-D23 |

## §D — UHRF1 / TE-silencing / repeats (for FW-B tier)

| ID | Source | Description | Why we use it |
|---|---|---|---|
| FW-D30 | [Replogle GWPS CRISPRi](https://gwps.wi.mit.edu/) | 11,258-perturbation CRISPRi screen | FW-MM1 Idenbraid causal-chain replication; UHRF1 + DBHS verification |
| FW-D31 | UCSC RepeatMasker tracks (GRCh38) | Genome-wide repeat annotation | FW-S6 GNRA scan; FW-B5 Alu insertion analysis |
| FW-D32 | Dfam | Repeat family classifications | Same as FW-D31 |
| FW-D33 | UCSC chimp (panTro6) RepeatMasker | Chimp repeat track for human-vs-chimp comparison | FW-B5 Alu/L1PA15 cross-species comparison (already clean by §2 doctrine) |
| FW-D34 | [1000 Genomes Project](https://www.internationalgenome.org/) | Population-scale TE polymorphism | FW-B5 supplementary |
| FW-D35 | [ENCODE](https://www.encodeproject.org/) | Functional genomics tracks (DBHS protein binding, etc.) | Paraspeckle-client identification |

## §E — Annotation-free atlas / BAM corpus (for FW-M tier)

| ID | Source | Description | Why we use it |
|---|---|---|---|
| FW-D40 | [GTEx BAM corpus](https://gtexportal.org/) | Bulk RNA-seq BAMs across tissues (GTEx BAMs are controlled-access via dbGaP, phs000424; check against §I before use) | FW-M1 atlas_public.db build (one of multiple sources) |
| FW-D41 | [SRA RNA-seq archives](https://www.ncbi.nlm.nih.gov/sra) | Per-accession FASTQ + (where deposited) BAM | FW-M1 multi-source atlas build |
| FW-D42 | [recount3](https://rna.recount.bio/) | Reprocessed RNA-seq from SRA | Light-touch alternative for atlas validation |

## §F — Genome reference + gene annotation

| ID | Source | Description |
|---|---|---|
| FW-R1 | GRCh38 primary assembly (Ensembl) | Reference genome FASTA |
| FW-R2 | GENCODE v44+ GTF | Gene annotation |
| FW-R3 | [cellranger refdata-gex-GRCh38-2024-A](https://www.10xgenomics.com/support/software/cell-ranger/) | Cellranger-compatible GRCh38 reference (~11 GB) |
| FW-R4 | [Ensembl REST API](https://rest.ensembl.org/) | Programmatic gene info access |
| FW-R5 | [UCSC Table Browser](https://genome.ucsc.edu/cgi-bin/hgTables) | Genome track tools |
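
As an illustration of the programmatic access listed for the Ensembl REST API, the sketch below builds a gene lookup request by symbol. It assumes the public `lookup/symbol/:species/:symbol` endpoint; the helper names `lookup_url` and `lookup_gene` are hypothetical and not part of any vault tool.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol: str, species: str = "homo_sapiens") -> str:
    """Build the lookup-by-symbol URL, asking for a JSON response."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return f"{ENSEMBL_REST}{path}?content-type=application/json"


def lookup_gene(symbol: str, species: str = "homo_sapiens",
                timeout: float = 30.0) -> dict:
    """Fetch the gene record (network call; raises on HTTP errors)."""
    with urllib.request.urlopen(lookup_url(symbol, species),
                                timeout=timeout) as resp:
        return json.load(resp)


# Only the URL is built here; call lookup_gene("UHRF1") to hit the network.
print(lookup_url("UHRF1"))
```

Keeping URL construction separate from the network call makes it possible to log the exact request in a receipt before anything is fetched.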

## §G — Tools (operator-built; clean tier)

These are NOT data sources; they are the tools we re-run against the data. They are listed here for convenience because BOOK is the catalog. Tool source: operator's vault `35_Project_TheHats/tools/` and `20_Project_MarathonLament/tools/`.

| Tool | Source | Function |
|---|---|---|
| `staff_aligner` | 20_MarathonLament | Annotation-free BAM alignment + splice-junction discovery |
| `staff_velocyto.py` | 35_TheHats | Per-cell U/S quantification |
| `staff_tomography.py` | 35_TheHats | Position-resolved BAM coverage at gene loci |
| `dark_tomography.py` | 35_TheHats | Multi-locus bulk tomography |
| `per_archetype_tomography.py` | 35_TheHats | Per-archetype NEAT1/MALAT1 stratification |
| `per_archetype_dark.py` | 35_TheHats | Per-archetype dark-transcript stratification |
| `gemthread_neat1_malat1.py` | 35_TheHats | Per-cell coexpression Spearman |
| `gnra_antenna_scan.py` | 35_TheHats | Sequence-only GNRA tetraloop hairpin scan |
| `monotone_glm.py` | 35_TheHats | Logistic GLM (binomial likelihood, no binning) |
| `multidim_structure_test.py` | 35_TheHats | PCA + module finder on EC IR matrix |
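
To make the "per-cell coexpression Spearman" entry concrete, here is a minimal standard-library sketch of a Spearman correlation between two genes across cells. It is not the code in `gemthread_neat1_malat1.py`; the function names and the toy counts are assumptions for illustration only.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with >= 2 cells")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # a constant vector has no defined rank correlation
    return cov / sqrt(vx * vy)


# Toy per-cell counts for two genes (made-up numbers, not real data).
neat1 = [0, 3, 0, 7, 12, 1]
malat1 = [5, 9, 4, 20, 31, 6]
rho = spearman(neat1, malat1)
```

Average ranks matter for single-cell counts because zeros produce many ties; the shortcut formula based on squared rank differences is only exact when there are none.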

## §H — Public-data download mechanics

Each replication run follows the same steps:

1. **Identify the accession** from §A-§F above; GEO series (GSE) must be resolved to their SRA run accessions before pulling
2. **Pull via `prefetch` + `fasterq-dump` (sra-toolkit)** on the FIREWALL VM
3. **Hash each file and log the retrieval timestamp**, one receipt entry per file
4. **Process** via cellranger / staff_aligner / direct BAM analysis
5. **Write outputs to the operator-personal GCS bucket** with an OpenTimestamps anchor
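
The pull, hash, and receipt steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the receipt format (one JSON object per line), the field names, and all function names are hypothetical, and the GCS upload and OpenTimestamps anchoring are left out. The only external commands used are `prefetch` and `fasterq-dump` from sra-toolkit.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def pull_commands(accession: str, outdir: str) -> list:
    """Command lines for one SRA run accession (e.g. an SRR id)."""
    return [
        ["prefetch", accession, "--output-directory", outdir],
        # Point fasterq-dump at the directory prefetch just created.
        ["fasterq-dump", str(Path(outdir) / accession),
         "--outdir", outdir, "--split-files"],
    ]


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large FASTQs never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def receipt_entry(path: Path, accession: str) -> dict:
    """One receipt per file: what was retrieved, its hash, and when."""
    return {
        "accession": accession,
        "file": path.name,
        "bytes": path.stat().st_size,
        "sha256": sha256_file(path),
        "retrieved_utc": datetime.now(timezone.utc).isoformat(),
    }


def pull_and_log(accession: str, outdir: str, log_path: str) -> None:
    """Run both commands, then append a receipt line per FASTQ produced."""
    for cmd in pull_commands(accession, outdir):
        subprocess.run(cmd, check=True)
    with open(log_path, "a", encoding="utf-8") as log:
        for fastq in sorted(Path(outdir).glob(f"{accession}*.fastq")):
            entry = receipt_entry(fastq, accession)
            log.write(json.dumps(entry, sort_keys=True) + "\n")
```

An append-only JSON-lines log keeps each receipt independent, so the log file itself can be hashed and anchored after every run.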

For datasets where the authors deposited cellranger BAMs, processing is faster because no realignment is needed. Most GEO deposits since 2021 include either a `filtered_feature_bc_matrix` or BAMs. Verify this cohort by cohort at retrieval time.
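
The cohort-by-cohort check could be as simple as classifying the deposited file names before choosing a processing route. The sketch below is an assumption about how such a check might look; `classify_deposit`, the route labels, and the example file names are all hypothetical.

```python
def classify_deposit(filenames):
    """Pick a processing route from the file names in a deposit.

    Returns "bam" when aligned reads are deposited (no realignment),
    "matrix" when only a count matrix is available, and "fastq_only"
    when the run has to be realigned from raw reads.
    """
    names = [name.lower() for name in filenames]
    has_bam = any(name.endswith(".bam") for name in names)
    has_matrix = any(
        "filtered_feature_bc_matrix" in name
        or name.endswith(("matrix.mtx", "matrix.mtx.gz"))
        for name in names
    )
    if has_bam:
        return "bam"
    if has_matrix:
        return "matrix"
    return "fastq_only"


# Made-up file listings for illustration.
print(classify_deposit(["sample1_possorted_genome_bam.bam"]))
print(classify_deposit(["sample1_filtered_feature_bc_matrix.h5"]))
print(classify_deposit(["sample1_R1.fastq.gz", "sample1_R2.fastq.gz"]))
```

BAM takes priority over the matrix because position-resolved tools need read-level data, which a count matrix cannot provide.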

## §I — Public + open-source vs proprietary catalog

This catalog is **public-data and open-tools only**. No paywalled databases. No NIH-restricted dbGaP datasets (those would carry their own taint via NIH access requirements). No mentor-lab-shared private data.

If a future replication target requires a paywalled dataset (as some longitudinal cohorts do), document the access path (operator-personal subscription / preprint corpus / etc.) and verify that the access does not traverse NIH credentials.
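
A first-pass screen for the rule above can be done on accession format alone. The sketch below is an assumption, not an existing vault tool: it rejects dbGaP-style study accessions (`phs` followed by six digits), allows common GEO and SRA formats, and sends everything else to manual review. Format does not prove open access, since some SRA runs are themselves controlled-access, so an "allow" result still needs the retrieval-time check.

```python
import re

# dbGaP study accessions look like phs000000 or phs000000.v1.p1.
DBGAP = re.compile(r"^phs\d{6}(\.v\d+(\.p\d+)?)?$", re.IGNORECASE)

# Accession formats commonly used by open GEO / SRA-family records.
PUBLIC_PATTERNS = {
    "GEO series": re.compile(r"^GSE\d+$"),
    "GEO sample": re.compile(r"^GSM\d+$"),
    "SRA run": re.compile(r"^[SED]RR\d+$"),
    "SRA study": re.compile(r"^[SED]RP\d+$"),
}


def check_accession(accession: str):
    """Return (decision, reason) for a single accession string."""
    acc = accession.strip()
    if DBGAP.match(acc):
        return ("reject", "dbGaP controlled-access study")
    for label, pattern in PUBLIC_PATTERNS.items():
        if pattern.match(acc):
            return ("allow", label)
    return ("review", "unrecognized accession format")


for acc in ["GSE226225", "phs000424.v8.p2", "my-private-cohort"]:
    print(acc, check_accession(acc))
```

Defaulting unknown formats to "review" rather than "allow" keeps the screen conservative: anything the patterns do not recognize needs a human decision.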

## §J — Cross-references

- **Replication targets that use these datasets:** [HALO_REPLICATION_TARGETS.md](HALO_REPLICATION_TARGETS.md)
- **Bounties referencing specific datasets:** [BOUNTY_BOARD.md](BOUNTY_BOARD.md)
- **Personal compute infrastructure for retrieval + processing:** [HALO_PERSONAL_COMPUTE.md](HALO_PERSONAL_COMPUTE.md)
- **Doctrine for what "public" means in IP-firewall context:** [HALO_FIREWALL_DOCTRINE.md §1-2](HALO_FIREWALL_DOCTRINE.md)
