feat(organizations): migrate source storage to polymorphic records

2026-05-19 10:23:53 +02:00
parent 19a7d5a91c
commit 4ca2fa25d5
44 changed files with 7129 additions and 1551 deletions
--- a/docs/superpowers/plans/2026-05-18-direct-parser-source-ingestion.md
+++ b/docs/superpowers/plans/2026-05-18-direct-parser-source-ingestion.md
@@ -0,0 +1,85 @@
+# Direct Parser Source Ingestion Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Move parser runtime reads and writes from legacy parser record tables to organization source storage.
+
+**Architecture:** Add a focused ingestion service in `organizations` that persists normalized source-record inputs directly into polymorphic source extensions. Parser services become adapters from parser dataclasses to ingestion inputs. Runtime reads use `OrganizationSourceRecord` and extension counters.
+
+**Tech Stack:** Django 3.2, PostgreSQL, DRF, django-polymorphic, pytest.
+
+---
+
+### Task 1: Direct Ingestion Core
+
+**Files:**
+- Create: `src/organizations/source_identity.py`
+- Create: `src/organizations/source_ingestion.py`
+- Modify: `src/organizations/source_backfill.py`
+- Test: `tests/apps/organizations/test_source_ingestion.py`
+- Test: `tests/apps/organizations/test_source_backfill.py`
+
+- [ ] Write failing tests for direct generic source ingestion.
+- [ ] Write failing tests for FNS report ingestion with financial lines.
+- [ ] Extract identity normalization from backfill into a shared helper.
+- [ ] Implement `SourceRecordInput` and `SourceFinancialLineInput`.
+- [ ] Implement `OrganizationSourceIngestionService.save_records`.
+- [ ] Keep backfill behavior green by using the same identity normalization helper.
+
+### Task 2: Parser Save Services
+
+**Files:**
+- Modify: `src/apps/parsers/services.py`
+- Test: `tests/apps/parsers/test_services.py`
+
+- [ ] Switch generic source saves to `OrganizationSourceIngestionService`.
+- [ ] Switch industrial certificate/manufacturer/product saves.
+- [ ] Switch inspection and procurement saves.
+- [ ] Switch FNS report saves and duplicate checks.
+- [ ] Replace period/deduplication helpers with source-record queries.
+
+### Task 3: Parser Tasks
+
+**Files:**
+- Modify: `src/apps/parsers/tasks.py`
+- Test: `tests/apps/parsers/test_tasks.py`
+
+- [ ] Remove source backfill queueing from parser completion.
+- [ ] Keep parser load logs and background job progress unchanged.
+- [ ] Return source-record identifiers for FNS processing instead of legacy report ids.
+
+### Task 4: Runtime Reads
+
+**Files:**
+- Modify: `src/apps/parsers/source_cards.py`
+- Modify: `src/apps/parsers/views.py`
+- Modify: `src/apps/parsers/serializers.py`
+- Modify: `src/apps/core/admin_dashboard.py`
+- Modify: `src/apps/backups/services.py`
+- Test: parser source-card and result endpoint tests.
+
+- [ ] Move source card counts and timestamps to source extensions/source records.
+- [ ] Move parser log organization counts to source records.
+- [ ] Adapt v1 parser result endpoints to read source records.
+- [ ] Move dashboard/export runtime reads off legacy parser models.
+
+### Task 5: Frontend Record Detail
+
+**Files:**
+- Modify: `mostovik-frontend/src/pages/main/model/source-record-detail/*`
+- Test: frontend source-detail/source-record-detail unit tests.
+
+- [ ] Replace legacy generated v1 detail clients with organization source-record reads.
+- [ ] Use `payload` plus top-level source-record fields for detail rendering.
+- [ ] Keep source-detail lists on the new source-record list endpoint.
+
+### Task 6: Validation
+
+**Files:**
+- No production files.
+
+- [ ] Run focused backend parser/organization tests.
+- [ ] Run frontend source-detail/source-record-detail checks.
+- [ ] Run live parser smoke against one small generic source.
+- [ ] Confirm legacy parser record counts do not change during the smoke.
+- [ ] Confirm new organization source-record counts do change.
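Note on Tasks 1 and 2 of the plan above: the plan names `SourceRecordInput` and `SourceFinancialLineInput` but does not list their fields. The following is a minimal sketch of what the input types and one parser adapter might look like. Every field name, the `ParsedInspection` dataclass, and the `inspection_to_input` helper are assumptions made for illustration, not the commit's actual code.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class SourceFinancialLineInput:
    # Hypothetical fields, loosely mirroring a financial report line.
    form_code: str
    line_code: str
    line_name: str
    year: int


@dataclass(frozen=True)
class SourceRecordInput:
    # Hypothetical normalized input handed to the ingestion service.
    source: str
    external_id: str
    organization_name: str
    inn: str = ""
    title: str = ""
    record_date: Optional[date] = None
    amount: Optional[Decimal] = None
    payload: dict = field(default_factory=dict)
    financial_lines: tuple = ()


@dataclass
class ParsedInspection:
    # Hypothetical parser-side dataclass; the real parser types may differ.
    registration_number: str
    organization_name: str
    inn: str
    start_date: Optional[date]
    raw: dict


def inspection_to_input(item: ParsedInspection) -> SourceRecordInput:
    """Adapt a parser dataclass to an ingestion input, trimming identity fields."""
    number = item.registration_number.strip()
    return SourceRecordInput(
        source="planned_inspections",
        external_id=number,
        organization_name=item.organization_name.strip(),
        inn=item.inn.strip(),
        title=f"Inspection {number}",
        record_date=item.start_date,
        payload=dict(item.raw),  # copy so later parser mutations do not leak in
    )
```

Keeping the adapter a pure function means parser services stay thin and the ingestion service can be tested without any parser code.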
--- a/docs/superpowers/plans/2026-05-18-polymorphic-organization-sources.md
+++ b/docs/superpowers/plans/2026-05-18-polymorphic-organization-sources.md
@@ -0,0 +1,100 @@
+# Polymorphic Organization Sources Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Replace source-centric parser output access with organization-centric polymorphic source extensions.
+
+**Architecture:** Keep `Organization` as the root entity. Add polymorphic source extensions per product source group and a shared subordinate source-record table. Backfill legacy parser tables idempotently, then switch API v2 to the new extension data.
+
+**Tech Stack:** Django 3.2, Django REST Framework, django-filter, django-polymorphic, PostgreSQL, pytest.
+
+---
+
+### Task 1: Dependency And Schema
+
+**Files:**
+- Modify: `pyproject.toml`
+- Modify: `uv.lock`
+- Modify: `src/settings/base.py`
+- Modify: `src/organizations/models.py`
+- Create: `src/organizations/migrations/0006_polymorphic_source_extensions.py`
+- Test: `tests/apps/organizations/test_source_extensions_models.py`
+
+- [ ] Add `django-polymorphic` to project dependencies.
+- [ ] Add `"polymorphic"` to `INSTALLED_APPS` before local apps.
+- [ ] Add source-group and identity-status choices.
+- [ ] Add `identity_status` and `primary_identity` to `Organization`.
+- [ ] Add `OrganizationSourceExtension` as `PolymorphicModel`.
+- [ ] Add source extension subclasses.
+- [ ] Add `OrganizationSourceRecord`.
+- [ ] Add `OrganizationSourceFinancialLine`.
+- [ ] Write tests proving:
+  - one extension per `(organization, source_group)`;
+  - polymorphic queries return subclass instances;
+  - source records are unique by legacy model/pk;
+  - financial lines attach to a source record.
+
+### Task 2: Backfill Service
+
+**Files:**
+- Create: `src/organizations/source_groups.py`
+- Create: `src/organizations/source_backfill.py`
+- Create: `src/organizations/management/commands/backfill_organization_sources.py`
+- Test: `tests/apps/organizations/test_source_backfill.py`
+
+- [ ] Define source group mapping for all legacy parser sources.
+- [ ] Implement organization resolution by `inn + kpp`, `ogrn`, `ogrip`, unique `inn`, then normalized name.
+- [ ] Implement idempotent extension creation/update.
+- [ ] Implement idempotent source record creation/update.
+- [ ] Preserve legacy row payload and `(legacy_model, legacy_pk)`.
+- [ ] Backfill financial report lines into `OrganizationSourceFinancialLine`.
+- [ ] Report counts of scanned rows, created organizations, created extensions, updated extensions, created records, updated records, and unresolved rows.
+
+### Task 3: API v2 Switch
+
+**Files:**
+- Modify: `src/organizations/serializers.py`
+- Modify: `src/organizations/filters.py`
+- Modify: `src/organizations/views.py`
+- Delete or stop using: `src/organizations/api_enrichment.py`
+- Delete or stop using: `src/organizations/services.py` snapshot refresh paths
+- Test: `tests/apps/organizations/test_api_v2.py`
+
+- [ ] Replace embedded `data` JSON with compact `sources`.
+- [ ] Add source extension list/detail serializers.
+- [ ] Add source records endpoint.
+- [ ] Rework source filters to use `OrganizationSourceExtension`.
+- [ ] Remove snapshot dependency from list/retrieve behavior.
+- [ ] Keep old snapshot management command only as deprecated/no-op until cleanup.
+
+### Task 4: Parser Write Path
+
+**Files:**
+- Modify: `src/apps/parsers/tasks.py`
+- Modify: `src/organizations/tasks.py`
+- Test: `tests/apps/parsers/test_tasks.py`
+- Test: `tests/apps/organizations/test_tasks.py`
+
+- [ ] Replace snapshot refresh queueing with source backfill queueing for affected parser batches.
+- [ ] For each parser completion, backfill only the completed source/batch.
+- [ ] Keep full backfill command for initial migration and repair.
+
+### Task 5: Frontend Contract Repair
+
+**Files:**
+- Modify frontend generated API clients after backend OpenAPI changes.
+- Modify source detail table composables to consume `sources` and source records endpoints.
+
+- [ ] Regenerate API client.
+- [ ] Update source pages to request extension records instead of embedded `organization.data[source]`.
+- [ ] Verify planned inspections page loads from source records.
+
+### Task 6: Cleanup Phase
+
+**Files:**
+- Modify migrations only after successful backfill validation.
+
+- [ ] Remove `OrganizationDataSnapshot`.
+- [ ] Remove snapshot refresh schedules.
+- [ ] Decide which legacy parser tables remain as ingestion staging and which can be dropped.
+- [ ] Run full backend and frontend validation.
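Note on Task 2 of the plan above: idempotent record creation keyed by `(legacy_model, legacy_pk)`, together with the reported counters, can be sketched with an in-memory store. This is a simplified illustration under stated assumptions: the `BackfillReport` fields cover only a subset of the counters, rows are plain dicts, a row with no pre-resolved `organization` is simply counted as unresolved, and a real implementation would write through the Django ORM.

```python
from dataclasses import dataclass


@dataclass
class BackfillReport:
    # Hypothetical subset of the counters listed in the plan.
    scanned: int = 0
    created_records: int = 0
    updated_records: int = 0
    unresolved_rows: int = 0


def backfill(rows, store, report=None):
    """Upsert legacy rows into `store`, keyed by (legacy_model, legacy_pk)."""
    if report is None:
        report = BackfillReport()
    for row in rows:
        report.scanned += 1
        organization = row.get("organization")
        if not organization:
            # Assumption: rows without a resolved organization are only counted.
            report.unresolved_rows += 1
            continue
        key = (row["legacy_model"], row["legacy_pk"])
        if key in store:
            report.updated_records += 1
        else:
            report.created_records += 1
        # Preserve the original row as the payload.
        store[key] = {"organization": organization, "payload": dict(row)}
    return report
```

Running the function twice over the same rows should report only updates on the second pass and leave the store size unchanged, which is the property the idempotency tests in the plan would check.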
--- a/docs/superpowers/specs/2026-05-18-direct-parser-source-ingestion-design.md
+++ b/docs/superpowers/specs/2026-05-18-direct-parser-source-ingestion-design.md
@@ -0,0 +1,76 @@
+# Direct Parser Source Ingestion Design
+
+## Goal
+
+Parser runtime must write parsed source records directly into the organization-centric
+polymorphic storage:
+
+- `organizations_organization`
+- `organizations_source_extension`
+- source extension subclass tables
+- `organizations_source_record`
+- `organizations_source_financial_line`
+
+Legacy parser record tables remain only as migration/audit inputs until a later
+destructive cleanup. They must not be part of the parser runtime write path or the
+runtime read path used by the application.
+
+## Current Runtime Problem
+
+Current parser tasks write source rows into legacy parser tables such as
+`GenericParserRecord`, `InspectionRecord`, `ProcurementRecord`,
+`IndustrialProductRecord`, and `FinancialReport`, then enqueue source backfill into
+the new organization storage. This keeps old tables in the hot path and allows new
+runtime data to diverge before the async backfill runs.
+
+## Target Runtime
+
+Parser tasks keep using `ParserLoadLog`, `ParserBatchSequence`, and `BackgroundJob`
+as operational metadata. Parsed records are converted into normalized source-record
+inputs and persisted through one ingestion service.
+
+The ingestion service is responsible for:
+
+- normalizing identity fields before writing canonical organizations;
+- resolving or creating `Organization`;
+- creating or updating the source-group polymorphic extension;
+- creating or updating `OrganizationSourceRecord` by `(source, external_id)`;
+- writing structured financial lines for FNS reports;
+- refreshing extension counters in the same transaction.
+
+Parser save services return the number of inserted or updated source records. They no
+longer create or query legacy parser record models for runtime decisions.
+
+## Runtime Read Scope
+
+The following runtime reads must use organization source storage:
+
+- parser source cards and source item counters;
+- parser log organization counts;
+- source detail lists;
+- source record detail reads;
+- frontend-facing parser result compatibility endpoints while they remain exposed;
+- admin/dashboard/export paths that are used by the app during normal operation.
+
+Legacy parser tables may still be read by explicit migration/backfill tooling only.
+
+## Compatibility
+
+Existing v1 parser-result URLs can remain during the transition, but their data source must
+be `OrganizationSourceRecord`, not the legacy parser models. The response shape can be
+preserved on a best-effort basis through serializers/adapters that read source-record payloads.
+
+## Non-Goals
+
+- Do not drop legacy parser tables in this phase.
+- Do not rewrite parser clients.
+- Do not remove parser load logs or background jobs.
+- Do not make every payload strongly typed immediately.
+
+## Risks
+
+- Industrial product ingestion is large; the writer must avoid per-record table scans.
+- Existing tests assert legacy model counts and must be updated to assert source-record
+  behavior.
+- Some compatibility endpoints expose legacy primary keys. New records use UUIDs, so
+  compatibility adapters must accept source-record UUIDs where needed.
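Note on the "Target Runtime" section of the design above: the upsert by `(source, external_id)` and the counter refresh can be sketched without a database. The `InMemorySourceStore` class and its dict-based inputs are assumptions made for illustration. In Django, the same steps would most likely run inside `transaction.atomic()` and use `update_or_create`, but the commit's actual service is not shown here.

```python
from collections import Counter


class InMemorySourceStore:
    """Toy stand-in for the source-record table and the extension counters."""

    def __init__(self):
        self.records = {}  # (source, external_id) -> record dict
        self.records_count = Counter()  # (organization, source_group) -> count

    def save_records(self, organization, source_group, inputs):
        """Insert or update records and return how many were saved."""
        saved = 0
        for item in inputs:
            key = (item["source"], item["external_id"])
            # Same key overwrites the previous version: this is the upsert.
            self.records[key] = {
                **item,
                "organization": organization,
                "source_group": source_group,
            }
            saved += 1
        # Refresh the counter from stored rows rather than incrementing it,
        # so repeated saves of the same record cannot inflate it.
        owner = (organization, source_group)
        self.records_count[owner] = sum(
            1
            for record in self.records.values()
            if (record["organization"], record["source_group"]) == owner
        )
        return saved
```

Recomputing the counter from stored rows, instead of adding the batch size, is what keeps `records_count` correct when a parser re-delivers records it has already saved.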
--- a/docs/superpowers/specs/2026-05-18-polymorphic-organization-sources-design.md
+++ b/docs/superpowers/specs/2026-05-18-polymorphic-organization-sources-design.md
@@ -0,0 +1,222 @@
+# Polymorphic Organization Sources Design
+
+## Goal
+
+Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:
+
+- `Organization` is the main business entity and stores legal identity data.
+- Each source group is represented as a polymorphic organization extension.
+- Detailed source records hang under the extension as subordinate records.
+- API compatibility with the current frontend is not required.
+
+## Current Data Facts
+
+The current dev database contains:
+
+- `organizations.Organization`: 29,667 rows.
+- `OrganizationDataSnapshot`: 29,667 rows after refresh.
+- `registers.Organization`: 5,138 rows.
+- `InspectionRecord`: 14,059 rows.
+- `ProcurementRecord`: 1,000 rows.
+- `IndustrialCertificateRecord`: 23,640 rows.
+- `ManufacturerRecord`: 8,762 rows.
+- `IndustrialProductRecord`: 471,824 rows.
+- `GenericParserRecord`: 3,506 rows.
+- `FinancialReport`: 10 rows.
+
+Observed required-field candidates:
+
+- `Organization.name` is present in 100% of canonical organizations and can be required.
+- `inn` is present in 95.34% of canonical organizations.
+- `ogrn` is present in 74.84%.
+- `kpp` is present in 20.56%.
+- `ogrip` is present in 8.08%.
+- 673 canonical organizations have no `inn`, `ogrn`, or `ogrip`.
+
+Therefore only `name` can be required on `Organization`. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.
+
+## Source Groups
+
+The new source groups match the product navigation:
+
+- Financial indicators.
+- Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
+- Russian manufacturers and products, including certificates, manufacturers, and industrial products.
+- Planned inspections.
+- Bankruptcy procedures.
+- Defense supplier risk, including unfair suppliers and FAS GOZ.
+- Arbitration cases.
+- Information security registries.
+- Vacancies from Trudvsem, HH, and SuperJob.
+
+## Target Schema
+
+### Organization
+
+`organizations.Organization` remains the root table.
+
+Required:
+
+- `uid`
+- `name`
+
+Optional but indexed:
+
+- `inn`
+- `kpp`
+- `ogrn`
+- `ogrip`
+
+New fields:
+
+- `identity_status`: one of `complete`, `partial`, `missing`.
+- `primary_identity`: short normalized search key used for deterministic deduplication and diagnostics.
+
+Existing uniqueness constraints on non-empty identifiers should remain conservative. They protect the API and backfill from accidental duplicate canonical organizations.
+
+### OrganizationSourceExtension
+
+Add a new polymorphic base model:
+
+- `uid`
+- `organization`
+- `source_group`
+- `title`
+- `status`
+- `records_count`
+- `first_seen_at`
+- `last_seen_at`
+- `last_load_batch`
+- `metadata`
+- timestamps
+
+Constraints:
+
+- One extension per `(organization, source_group)`.
+
+Subclasses:
+
+- `FinancialIndicatorsExtension`
+- `GovernmentProcurementExtension`
+- `IndustrialProductionExtension`
+- `PlannedInspectionExtension`
+- `BankruptcyExtension`
+- `DefenseSupplierExtension`
+- `ArbitrationExtension`
+- `SecurityRegistryExtension`
+- `VacancyExtension`
+
+Rationale: `Organization` itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.
+
+### OrganizationSourceRecord
+
+Use one subordinate detail table for most source rows:
+
+- `uid`
+- `extension`
+- `record_type`
+- `source`
+- `external_id`
+- `title`
+- `record_date`
+- `amount`
+- `status`
+- `url`
+- `payload`
+- `legacy_model`
+- `legacy_pk`
+- `load_batch`
+- timestamps
+
+Constraints:
+
+- Unique `(source, external_id)` when `external_id` is non-empty.
+- Unique `(legacy_model, legacy_pk)` for migrated legacy rows.
+
+This keeps the number of tables low while still preserving every source-specific payload.
+
+### OrganizationSourceFinancialLine
+
+Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:
+
+- `source_record`
+- `form_code`
+- `line_code`
+- `line_name`
+- `year`
+- `period_start`
+- `period_end`
+
+The existing legacy `FinancialReportLine` table is used only as a staging source after the migration.
+
+## Backfill Rules
+
+Backfill must be idempotent.
+
+For each legacy row:
+
+1. Resolve canonical `Organization` by identifiers in this order:
+   - exact `inn + kpp` where available,
+   - exact `ogrn`,
+   - exact `ogrip`,
+   - exact `inn` only when it maps to one canonical organization,
+   - normalized name fallback only when there is one unambiguous match.
+2. If no organization can be resolved, create or reuse an organization with `identity_status=missing` or `partial`.
+3. Create or update the matching `OrganizationSourceExtension`.
+4. Create or update `OrganizationSourceRecord`.
+5. Preserve the original source row in `payload`.
+6. Store `legacy_model` and `legacy_pk` for audit and repeatable updates.
+
+## API Shape
+
+The new API should be organization-centric:
+
+- `GET /api/v2/organizations/`
+- `GET /api/v2/organizations/{uid}/`
+- `GET /api/v2/organizations/{uid}/sources/`
+- `GET /api/v2/organization-sources/{uid}/records/`
+
+The list endpoint can expose compact source summaries:
+
+```json
+{
+  "uid": "...",
+  "name": "...",
+  "inn": "...",
+  "ogrn": "...",
+  "identity_status": "complete",
+  "sources": [
+    {
+      "uid": "...",
+      "source_group": "planned_inspections",
+      "title": "Плановые проверки Генпрокуратуры России",
+      "records_count": 12,
+      "last_seen_at": "2026-05-18T00:00:00Z"
+    }
+  ]
+}
+```
+
+Source records are fetched on demand from the extension, not embedded into every organization list row.
+
+## Migration Phases
+
+1. Add dependency and schema.
+2. Add idempotent backfill service and management command.
+3. Backfill all existing legacy parser data into source extensions.
+4. Switch API v2 to source extensions.
+5. Update frontend generated clients and source pages.
+6. After verification, remove `OrganizationDataSnapshot` and legacy parser tables in a separate cleanup phase.
+
+## Non-Goals For First Pass
+
+- No destructive deletion of legacy parser tables before backfill verification.
+- No attempt to make all source payloads strongly typed immediately.
+- No frontend visual redesign; only data contract changes needed for the new schema.
+
+## Risks
+
+- Polymorphic subclass queries add joins. List endpoints must query the base extension compactly and load records only on demand.
+- Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
+- Current v2 filters use source-specific existence checks. They must move to extension existence filters.
+- Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.
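Note on the "Backfill Rules" section of the design above: the resolution order can be sketched as a plain function over a list of organization dicts. The `normalize_name` rule (case-folding and whitespace collapsing) and the dict-based data shapes are assumptions; the design does not specify how names are normalized or how the lookup is implemented.

```python
def normalize_name(name):
    # Assumed normalization: case-fold and collapse runs of whitespace.
    return " ".join((name or "").casefold().split())


def resolve_organization(row, organizations):
    """Return the matching organization dict, or None when unresolved."""

    def matches(predicate):
        return [org for org in organizations if predicate(org)]

    inn, kpp, ogrn, ogrip = (
        (row.get(key) or "").strip() for key in ("inn", "kpp", "ogrn", "ogrip")
    )

    # 1. Exact inn + kpp where both are available.
    if inn and kpp:
        found = matches(lambda o: o.get("inn") == inn and o.get("kpp") == kpp)
        if found:
            return found[0]
    # 2. Exact ogrn.
    if ogrn:
        found = matches(lambda o: o.get("ogrn") == ogrn)
        if found:
            return found[0]
    # 3. Exact ogrip.
    if ogrip:
        found = matches(lambda o: o.get("ogrip") == ogrip)
        if found:
            return found[0]
    # 4. inn alone, only when it maps to exactly one organization.
    if inn:
        found = matches(lambda o: o.get("inn") == inn)
        if len(found) == 1:
            return found[0]
    # 5. Normalized name, only when the match is unambiguous.
    name = normalize_name(row.get("name"))
    if name:
        found = matches(lambda o: normalize_name(o.get("name")) == name)
        if len(found) == 1:
            return found[0]
    return None
```

A `None` result corresponds to step 2 of the rules, where the caller creates an organization with `identity_status` set to `missing` or `partial`.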