feat(organizations): migrate source storage to polymorphic records

2026-05-19 10:23:53 +02:00
parent 19a7d5a91c
commit 4ca2fa25d5
44 changed files with 7129 additions and 1551 deletions


@@ -0,0 +1,85 @@
# Direct Parser Source Ingestion Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Move parser runtime reads and writes from legacy parser record tables to organization source storage.

**Architecture:** Add a focused ingestion service in `organizations` that persists normalized source-record inputs directly into polymorphic source extensions. Parser services become adapters from parser dataclasses to ingestion inputs. Runtime reads use `OrganizationSourceRecord` and extension counters.

**Tech Stack:** Django 3.2, PostgreSQL, DRF, django-polymorphic, pytest.

---
### Task 1: Direct Ingestion Core
**Files:**
- Create: `src/organizations/source_identity.py`
- Create: `src/organizations/source_ingestion.py`
- Modify: `src/organizations/source_backfill.py`
- Test: `tests/apps/organizations/test_source_ingestion.py`
- Test: `tests/apps/organizations/test_source_backfill.py`
- [ ] Write failing tests for direct generic source ingestion.
- [ ] Write failing tests for FNS report ingestion with financial lines.
- [ ] Extract identity normalization from backfill into a shared helper.
- [ ] Implement `SourceRecordInput` and `SourceFinancialLineInput`.
- [ ] Implement `OrganizationSourceIngestionService.save_records`.
- [ ] Keep backfill behavior green by using the same identity normalization helper.
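
The input types and the shared identity helper from this task could look roughly like the sketch below. It is a minimal illustration, not the planned implementation: only the class names `SourceRecordInput` and `SourceFinancialLineInput` come from the plan, while every field name, `normalize_identifier`, and `normalized()` are assumptions chosen to mirror the columns described in the design documents.

```python
from dataclasses import dataclass, field, replace
from datetime import date
from decimal import Decimal
from typing import Any, Dict, Optional, Tuple


def normalize_identifier(value: Optional[str]) -> str:
    """Hypothetical shared helper: keep ASCII digits only."""
    return "".join(ch for ch in (value or "") if ch in "0123456789")


@dataclass(frozen=True)
class SourceFinancialLineInput:
    # Field names are assumptions based on the financial line columns.
    form_code: str
    line_code: str
    line_name: str
    year: int
    period_start: Optional[date] = None
    period_end: Optional[date] = None


@dataclass(frozen=True)
class SourceRecordInput:
    # Field names are assumptions based on the source record columns.
    source: str
    record_type: str
    organization_name: str
    external_id: str = ""
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""
    title: str = ""
    record_date: Optional[date] = None
    amount: Optional[Decimal] = None
    payload: Dict[str, Any] = field(default_factory=dict)
    financial_lines: Tuple[SourceFinancialLineInput, ...] = ()

    def normalized(self) -> "SourceRecordInput":
        """Return a copy with identifiers and name whitespace cleaned up."""
        return replace(
            self,
            organization_name=" ".join(self.organization_name.split()),
            inn=normalize_identifier(self.inn),
            kpp=normalize_identifier(self.kpp),
            ogrn=normalize_identifier(self.ogrn),
            ogrip=normalize_identifier(self.ogrip),
        )
```

Keeping the inputs frozen means parser adapters cannot mutate a record after handing it to the ingestion service, and both the backfill and the direct writer can call the same `normalize_identifier`.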
### Task 2: Parser Save Services
**Files:**
- Modify: `src/apps/parsers/services.py`
- Test: `tests/apps/parsers/test_services.py`
- [ ] Switch generic source saves to `OrganizationSourceIngestionService`.
- [ ] Switch industrial certificate/manufacturer/product saves.
- [ ] Switch inspection and procurement saves.
- [ ] Switch FNS report saves and duplicate checks.
- [ ] Replace period/deduplication helpers with source-record queries.
### Task 3: Parser Tasks
**Files:**
- Modify: `src/apps/parsers/tasks.py`
- Test: `tests/apps/parsers/test_tasks.py`
- [ ] Remove source backfill queueing from parser completion.
- [ ] Keep parser load logs and background job progress unchanged.
- [ ] Return source-record identifiers for FNS processing instead of legacy report ids.
### Task 4: Runtime Reads
**Files:**
- Modify: `src/apps/parsers/source_cards.py`
- Modify: `src/apps/parsers/views.py`
- Modify: `src/apps/parsers/serializers.py`
- Modify: `src/apps/core/admin_dashboard.py`
- Modify: `src/apps/backups/services.py`
- Test: parser source-card and result endpoint tests.
- [ ] Move source card counts and timestamps to source extensions/source records.
- [ ] Move parser log organization counts to source records.
- [ ] Adapt v1 parser result endpoints to read source records.
- [ ] Move dashboard/export runtime reads off legacy parser models.
### Task 5: Frontend Record Detail
**Files:**
- Modify: `mostovik-frontend/src/pages/main/model/source-record-detail/*`
- Test: frontend source-detail/source-record-detail unit tests.
- [ ] Replace legacy generated v1 detail clients with organization source-record reads.
- [ ] Use `payload` plus top-level source-record fields for detail rendering.
- [ ] Keep source-detail lists on the new source-record list endpoint.
### Task 6: Validation
**Files:**
- No production files.
- [ ] Run focused backend parser/organization tests.
- [ ] Run frontend source-detail/source-record-detail checks.
- [ ] Run live parser smoke against one small generic source.
- [ ] Confirm legacy parser record counts do not change during the smoke.
- [ ] Confirm new organization source-record counts do change.
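
The last two smoke checks can be expressed as a comparison of row counts taken before and after the run. The helper below is a hypothetical sketch: `check_smoke_counts` and its arguments are illustrative names, and collecting the counts (for example with `Model.objects.count()`) is left to the caller.

```python
from typing import Iterable, List, Mapping


def check_smoke_counts(
    before: Mapping[str, int],
    after: Mapping[str, int],
    legacy_tables: Iterable[str],
    source_tables: Iterable[str],
) -> List[str]:
    """Return a list of problems; an empty list means the smoke run passed."""
    problems = []
    for name in legacy_tables:
        # Legacy parser tables must stay untouched by the runtime write path.
        if after[name] != before[name]:
            problems.append(
                f"legacy table {name} changed: {before[name]} -> {after[name]}"
            )
    for name in source_tables:
        # New organization source storage is expected to receive the rows.
        if after[name] == before[name]:
            problems.append(f"source table {name} did not change")
    return problems
```

A source that only updates existing records would leave the count unchanged, so this check assumes the smoke source produces at least one new record.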


@@ -0,0 +1,100 @@
# Polymorphic Organization Sources Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace source-centric parser output access with organization-centric polymorphic source extensions.

**Architecture:** Keep `Organization` as the root entity. Add polymorphic source extensions per product source group and a shared subordinate source-record table. Backfill legacy parser tables idempotently, then switch API v2 to the new extension data.

**Tech Stack:** Django 3.2, Django REST Framework, django-filter, django-polymorphic, PostgreSQL, pytest.

---
### Task 1: Dependency And Schema
**Files:**
- Modify: `pyproject.toml`
- Modify: `uv.lock`
- Modify: `src/settings/base.py`
- Modify: `src/organizations/models.py`
- Create: `src/organizations/migrations/0006_polymorphic_source_extensions.py`
- Test: `tests/apps/organizations/test_source_extensions_models.py`
- [ ] Add `django-polymorphic` to project dependencies.
- [ ] Add `"polymorphic"` to `INSTALLED_APPS` before local apps.
- [ ] Add source-group and identity-status choices.
- [ ] Add `identity_status` and `primary_identity` to `Organization`.
- [ ] Add `OrganizationSourceExtension` as `PolymorphicModel`.
- [ ] Add source extension subclasses.
- [ ] Add `OrganizationSourceRecord`.
- [ ] Add `OrganizationSourceFinancialLine`.
- [ ] Write tests proving:
  - one extension per `(organization, source_group)`;
  - polymorphic queries return subclass instances;
  - source records are unique by legacy model/pk;
  - financial lines attach to a source record.
### Task 2: Backfill Service
**Files:**
- Create: `src/organizations/source_groups.py`
- Create: `src/organizations/source_backfill.py`
- Create: `src/organizations/management/commands/backfill_organization_sources.py`
- Test: `tests/apps/organizations/test_source_backfill.py`
- [ ] Define source group mapping for all legacy parser sources.
- [ ] Implement organization resolution by `inn + kpp`, `ogrn`, `ogrip`, unique `inn`, then normalized name.
- [ ] Implement idempotent extension creation/update.
- [ ] Implement idempotent source record creation/update.
- [ ] Preserve legacy row payload and `(legacy_model, legacy_pk)`.
- [ ] Backfill financial report lines into `OrganizationSourceFinancialLine`.
- [ ] Report counts of scanned rows, created organizations, created extensions, updated extensions, created records, updated records, and unresolved rows.
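
Idempotency and reporting can be illustrated with an in-memory stand-in for the record table. This is a sketch under stated assumptions: `BackfillReport`, `backfill_rows`, and the row dictionary shape are hypothetical, only a subset of the planned counters is shown, and a dictionary keyed by `(legacy_model, legacy_pk)` replaces the real database upsert.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

LegacyKey = Tuple[str, int]


@dataclass
class BackfillReport:
    # Subset of the counters listed in the plan.
    scanned: int = 0
    created_records: int = 0
    updated_records: int = 0


def backfill_rows(rows: Iterable[dict], store: Dict[LegacyKey, dict]) -> BackfillReport:
    """Upsert legacy rows by (legacy_model, legacy_pk) and count the outcomes."""
    report = BackfillReport()
    for row in rows:
        report.scanned += 1
        key = (row["legacy_model"], row["legacy_pk"])
        if key in store:
            report.updated_records += 1
        else:
            report.created_records += 1
        # Preserve the legacy row payload alongside its audit key.
        store[key] = {
            "legacy_model": key[0],
            "legacy_pk": key[1],
            "payload": dict(row["payload"]),
        }
    return report
```

Running the same batch twice should report only updates on the second pass and leave the number of stored records unchanged, which is the property the backfill tests need to assert.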
### Task 3: API v2 Switch
**Files:**
- Modify: `src/organizations/serializers.py`
- Modify: `src/organizations/filters.py`
- Modify: `src/organizations/views.py`
- Delete or stop using: `src/organizations/api_enrichment.py`
- Delete or stop using: `src/organizations/services.py` snapshot refresh paths
- Test: `tests/apps/organizations/test_api_v2.py`
- [ ] Replace embedded `data` JSON with compact `sources`.
- [ ] Add source extension list/detail serializers.
- [ ] Add source records endpoint.
- [ ] Rework source filters to use `OrganizationSourceExtension`.
- [ ] Remove snapshot dependency from list/retrieve behavior.
- [ ] Keep the old snapshot management command only as a deprecated no-op until cleanup.
### Task 4: Parser Write Path
**Files:**
- Modify: `src/apps/parsers/tasks.py`
- Modify: `src/organizations/tasks.py`
- Test: `tests/apps/parsers/test_tasks.py`
- Test: `tests/apps/organizations/test_tasks.py`
- [ ] Replace snapshot refresh queueing with source backfill queueing for affected parser batches.
- [ ] For each parser completion, backfill only the completed source/batch.
- [ ] Keep full backfill command for initial migration and repair.
### Task 5: Frontend Contract Repair
**Files:**
- Modify frontend generated API clients after backend OpenAPI changes.
- Modify source detail table composables to consume `sources` and source records endpoints.
- [ ] Regenerate API client.
- [ ] Update source pages to request extension records instead of embedded `organization.data[source]`.
- [ ] Verify planned inspections page loads from source records.
### Task 6: Cleanup Phase
**Files:**
- Modify migrations only after successful backfill validation.
- [ ] Remove `OrganizationDataSnapshot`.
- [ ] Remove snapshot refresh schedules.
- [ ] Decide which legacy parser tables remain as ingestion staging and which can be dropped.
- [ ] Run full backend and frontend validation.


@@ -0,0 +1,76 @@
# Direct Parser Source Ingestion Design
## Goal
Parser runtime must write parsed source records directly into the organization-centric
polymorphic storage:
- `organizations_organization`
- `organizations_source_extension`
- source extension subclass tables
- `organizations_source_record`
- `organizations_source_financial_line`
Legacy parser record tables remain only as migration/audit inputs until a later
destructive cleanup. They must not be part of the parser runtime write path or the
runtime read path used by the application.
## Current Runtime Problem
Current parser tasks write source rows into legacy parser tables such as
`GenericParserRecord`, `InspectionRecord`, `ProcurementRecord`,
`IndustrialProductRecord`, and `FinancialReport`, then enqueue source backfill into
the new organization storage. This keeps the legacy tables in the hot path and leaves the
new storage out of sync with freshly parsed data until the async backfill runs.
## Target Runtime
Parser tasks keep using `ParserLoadLog`, `ParserBatchSequence`, and `BackgroundJob`
as operational metadata. Parsed records are converted into normalized source-record
inputs and persisted through one ingestion service.
The ingestion service is responsible for:
- normalizing identity fields before writing canonical organizations;
- resolving or creating `Organization`;
- creating or updating the source-group polymorphic extension;
- creating or updating `OrganizationSourceRecord` by `(source, external_id)`;
- writing structured financial lines for FNS reports;
- refreshing extension counters in the same transaction.
Parser save services return the number of inserted or updated source records. They no
longer create or query legacy parser record models for runtime decisions.
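
The upsert and counter-refresh responsibilities can be sketched with an in-memory store. Everything here is illustrative: `InMemorySourceStore`, its attributes, and the row dictionaries are hypothetical stand-ins, and the real service would do the same work inside a single database transaction.

```python
from typing import Dict, Iterable, List, Tuple

ExtensionKey = Tuple[str, str]  # (organization id, source_group)


class InMemorySourceStore:
    """Toy stand-in for the source record table and extension counters."""

    def __init__(self) -> None:
        self.keyed: Dict[Tuple[str, str], dict] = {}
        self.unkeyed: List[dict] = []
        self.records_count: Dict[ExtensionKey, int] = {}

    def save_records(
        self, organization_id: str, source_group: str, rows: Iterable[dict]
    ) -> int:
        """Insert or update rows and return how many were saved."""
        extension = (organization_id, source_group)
        saved = 0
        for row in rows:
            record = {**row, "extension": extension}
            if row.get("external_id"):
                # Same (source, external_id) overwrites the existing record.
                self.keyed[(row["source"], row["external_id"])] = record
            else:
                # Without an external id there is nothing to match on.
                self.unkeyed.append(record)
            saved += 1
        self._refresh_counter(extension)
        return saved

    def _refresh_counter(self, extension: ExtensionKey) -> None:
        all_records = list(self.keyed.values()) + self.unkeyed
        self.records_count[extension] = sum(
            1 for record in all_records if record["extension"] == extension
        )
```

Saving the same `(source, external_id)` twice returns `1` each time but leaves `records_count` at `1`, which matches the "inserted or updated" return value described above.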
## Runtime Read Scope
The following runtime reads must use organization source storage:
- parser source cards and source item counters;
- parser log organization counts;
- source detail lists;
- source record detail reads;
- frontend-facing parser result compatibility endpoints while they remain exposed;
- admin/dashboard/export paths that are used by the app during normal operation.
Legacy parser tables may still be read by explicit migration/backfill tooling only.
## Compatibility
Existing v1 parser-result URLs can remain during transition, but their data source must
be `OrganizationSourceRecord`, not the legacy parser models. The response shape can be
preserved on a best-effort basis by serializers/adapters that read source-record payloads.
## Non-Goals
- Do not drop legacy parser tables in this phase.
- Do not rewrite parser clients.
- Do not remove parser load logs or background jobs.
- Do not make every payload strongly typed immediately.
## Risks
- Industrial product ingestion is large; the writer must avoid per-record table scans.
- Existing tests assert legacy model counts and must be updated to assert source-record
behavior.
- Some compatibility endpoints expose legacy primary keys. New records use UUIDs, so
compatibility adapters must accept source-record UUIDs where needed.
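
For the primary-key risk, a compatibility adapter could first classify the identifier it receives. The function below is a sketch, not part of the design: `parse_record_identifier` is a hypothetical name, and falling back to a lookup by `legacy_pk` for numeric identifiers is an assumption about how migrated rows might stay reachable.

```python
import uuid
from typing import Tuple, Union


def parse_record_identifier(raw: str) -> Tuple[str, Union[uuid.UUID, int]]:
    """Classify a URL identifier as a source-record UUID or a legacy integer pk."""
    try:
        return ("uid", uuid.UUID(raw))
    except ValueError:
        pass
    if raw.isascii() and raw.isdigit():
        # Assumed fallback: migrated rows keep their legacy primary key.
        return ("legacy_pk", int(raw))
    raise ValueError(f"unsupported record identifier: {raw!r}")
```

The returned kind tells the view which field to filter on, so one URL pattern can serve both new UUIDs and old numeric ids during the transition.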


@@ -0,0 +1,222 @@
# Polymorphic Organization Sources Design
## Goal
Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:
- `Organization` is the main business entity and stores legal identity data.
- Each source group is represented as a polymorphic organization extension.
- Detailed source records hang under the extension as subordinate records.
- API compatibility with the current frontend is not required.
## Current Data Facts
The current dev database contains:
- `organizations.Organization`: 29,667 rows.
- `OrganizationDataSnapshot`: 29,667 rows after refresh.
- `registers.Organization`: 5,138 rows.
- `InspectionRecord`: 14,059 rows.
- `ProcurementRecord`: 1,000 rows.
- `IndustrialCertificateRecord`: 23,640 rows.
- `ManufacturerRecord`: 8,762 rows.
- `IndustrialProductRecord`: 471,824 rows.
- `GenericParserRecord`: 3,506 rows.
- `FinancialReport`: 10 rows.
Observed required-field candidates:
- `Organization.name` is present in 100% of canonical organizations and can be required.
- `inn` is present in 95.34% of canonical organizations.
- `ogrn` is present in 74.84%.
- `kpp` is present in 20.56%.
- `ogrip` is present in 8.08%.
- 673 canonical organizations have no `inn`, `ogrn`, or `ogrip`.
Therefore only `name` can be required on `Organization`. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.
## Source Groups
The new source groups match the product navigation:
- Financial indicators.
- Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
- Russian manufacturers and products, including certificates, manufacturers, and industrial products.
- Planned inspections.
- Bankruptcy procedures.
- Defense supplier risk, including unfair suppliers and FAS GOZ.
- Arbitration cases.
- Information security registries.
- Vacancies from Trudvsem, HH, and SuperJob.
## Target Schema
### Organization
`organizations.Organization` remains the root table.
Required:
- `uid`
- `name`
Optional but indexed:
- `inn`
- `kpp`
- `ogrn`
- `ogrip`
New fields:
- `identity_status`: one of `complete`, `partial`, `missing`.
- `primary_identity`: short normalized search key used for deterministic deduplication and diagnostics.
Existing uniqueness constraints on non-empty identifiers should remain conservative. They protect the API and backfill from accidental duplicate canonical organizations.
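
One possible derivation of the two new fields is sketched below. The design does not define when an identity counts as `complete`, so the rule used here (an `inn` plus either `ogrn` or `ogrip`) is an assumption, as are the function names and the `kind:value` format of the key. Only the `missing` case follows directly from the data facts: no `inn`, `ogrn`, or `ogrip`.

```python
def identity_status(inn: str = "", ogrn: str = "", ogrip: str = "") -> str:
    """Classify identity completeness (the 'complete' rule is an assumption)."""
    has_inn = bool(inn)
    has_registration = bool(ogrn or ogrip)
    if has_inn and has_registration:
        return "complete"
    if has_inn or has_registration:
        return "partial"
    return "missing"


def primary_identity(
    name: str, inn: str = "", kpp: str = "", ogrn: str = "", ogrip: str = ""
) -> str:
    """Build a short deterministic key from the strongest available identifier."""
    if inn and kpp:
        return f"inn_kpp:{inn}:{kpp}"
    if ogrn:
        return f"ogrn:{ogrn}"
    if ogrip:
        return f"ogrip:{ogrip}"
    if inn:
        return f"inn:{inn}"
    # Last resort: whitespace-collapsed, case-folded name.
    return "name:" + " ".join(name.split()).casefold()
```

Because the key is computed from already-normalized identifiers, two rows describing the same organization produce the same `primary_identity`, which makes duplicates easy to spot in diagnostics.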
### OrganizationSourceExtension
Add a new polymorphic base model:
- `uid`
- `organization`
- `source_group`
- `title`
- `status`
- `records_count`
- `first_seen_at`
- `last_seen_at`
- `last_load_batch`
- `metadata`
- timestamps
Constraints:
- One extension per `(organization, source_group)`.
Subclasses:
- `FinancialIndicatorsExtension`
- `GovernmentProcurementExtension`
- `IndustrialProductionExtension`
- `PlannedInspectionExtension`
- `BankruptcyExtension`
- `DefenseSupplierExtension`
- `ArbitrationExtension`
- `SecurityRegistryExtension`
- `VacancyExtension`
Rationale: `Organization` itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.
### OrganizationSourceRecord
Use one subordinate detail table for most source rows:
- `uid`
- `extension`
- `record_type`
- `source`
- `external_id`
- `title`
- `record_date`
- `amount`
- `status`
- `url`
- `payload`
- `legacy_model`
- `legacy_pk`
- `load_batch`
- timestamps
Constraints:
- Unique `(source, external_id)` when `external_id` is non-empty.
- Unique `(legacy_model, legacy_pk)` for migrated legacy rows.
This keeps the number of tables low while still preserving every source-specific payload.
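
Both constraints are conditional, which maps onto partial unique indexes. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL so that it runs without a server; the table, index, and function names are hypothetical. In Django this would presumably be expressed with `UniqueConstraint(..., condition=...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_record (
        id INTEGER PRIMARY KEY,
        source TEXT NOT NULL,
        external_id TEXT NOT NULL DEFAULT '',
        legacy_model TEXT NOT NULL DEFAULT '',
        legacy_pk INTEGER
    );
    -- Uniqueness applies only to rows that actually carry an external id.
    CREATE UNIQUE INDEX uniq_source_external_id
        ON source_record (source, external_id)
        WHERE external_id <> '';
    -- Uniqueness applies only to rows migrated from a legacy table.
    CREATE UNIQUE INDEX uniq_legacy_row
        ON source_record (legacy_model, legacy_pk)
        WHERE legacy_pk IS NOT NULL;
    """
)


def insert_record(source, external_id="", legacy_model="", legacy_pk=None) -> bool:
    """Insert a row; return False when a uniqueness rule rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO source_record "
                "(source, external_id, legacy_model, legacy_pk) VALUES (?, ?, ?, ?)",
                (source, external_id, legacy_model, legacy_pk),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Rows with an empty `external_id` and no legacy key can be inserted repeatedly, while a repeated `(source, external_id)` or `(legacy_model, legacy_pk)` pair is rejected.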
### OrganizationSourceFinancialLine
Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:
- `source_record`
- `form_code`
- `line_code`
- `line_name`
- `year`
- `period_start`
- `period_end`
The existing legacy `FinancialReportLine` table is used only as a staging source after the migration.
## Backfill Rules
Backfill must be idempotent.
For each legacy row:
1. Resolve canonical `Organization` by identifiers in this order:
   - exact `inn + kpp` where available,
   - exact `ogrn`,
   - exact `ogrip`,
   - exact `inn` only when it maps to one canonical organization,
   - normalized name fallback only when there is one unambiguous match.
2. If no organization can be resolved, create or reuse an organization with `identity_status=missing` or `partial`.
3. Create or update the matching `OrganizationSourceExtension`.
4. Create or update `OrganizationSourceRecord`.
5. Preserve the original source row in `payload`.
6. Store `legacy_model` and `legacy_pk` for audit and repeatable updates.
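
The resolution order in step 1 can be sketched against an in-memory list of organizations. `Org`, `resolve_organization`, and the name normalization are hypothetical; identifiers are assumed to be normalized already, and a real implementation would use indexed queries rather than list scans.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Org:
    name: str
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""


def _normalize_name(name: str) -> str:
    return " ".join(name.split()).casefold()


def resolve_organization(row: Org, orgs: List[Org]) -> Optional[Org]:
    """Apply the identifier order: inn+kpp, ogrn, ogrip, unique inn, unique name."""
    if row.inn and row.kpp:
        for org in orgs:
            if org.inn == row.inn and org.kpp == row.kpp:
                return org
    if row.ogrn:
        for org in orgs:
            if org.ogrn == row.ogrn:
                return org
    if row.ogrip:
        for org in orgs:
            if org.ogrip == row.ogrip:
                return org
    if row.inn:
        matches = [org for org in orgs if org.inn == row.inn]
        if len(matches) == 1:
            return matches[0]
    wanted = _normalize_name(row.name)
    if wanted:
        matches = [org for org in orgs if _normalize_name(org.name) == wanted]
        if len(matches) == 1:
            return matches[0]
    # Caller falls through to step 2 and creates or reuses an organization.
    return None
```

An `inn` shared by a head office and a branch does not resolve on its own, so the row drops to the name fallback instead of being attached to an arbitrary match.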
## API Shape
The new API should be organization-centric:
- `GET /api/v2/organizations/`
- `GET /api/v2/organizations/{uid}/`
- `GET /api/v2/organizations/{uid}/sources/`
- `GET /api/v2/organization-sources/{uid}/records/`
The list endpoint can expose compact source summaries:
```json
{
  "uid": "...",
  "name": "...",
  "inn": "...",
  "ogrn": "...",
  "identity_status": "complete",
  "sources": [
    {
      "uid": "...",
      "source_group": "planned_inspections",
      "title": "Planned inspections of the Prosecutor General's Office of Russia",
      "records_count": 12,
      "last_seen_at": "2026-05-18T00:00:00Z"
    }
  ]
}
```
Source records are fetched on demand from the extension, not embedded into every organization list row.
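
A list row in this shape could be assembled from extension summaries alone, without touching the record table. The sketch below is illustrative: `SourceSummary`, `to_iso_z`, and `serialize_organization` are hypothetical names, and the real serializers would be DRF classes rather than plain functions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class SourceSummary:
    uid: str
    source_group: str
    title: str
    records_count: int
    last_seen_at: datetime  # expected to be timezone-aware


def to_iso_z(moment: datetime) -> str:
    """Format an aware datetime as UTC with a trailing Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def serialize_organization(org: Dict[str, str], sources: List[SourceSummary]) -> dict:
    """Build a compact list row: identity fields plus per-source summaries."""
    return {
        "uid": org["uid"],
        "name": org["name"],
        "inn": org.get("inn", ""),
        "ogrn": org.get("ogrn", ""),
        "identity_status": org["identity_status"],
        # Only counters and timestamps; records stay behind their own endpoint.
        "sources": [
            {
                "uid": source.uid,
                "source_group": source.source_group,
                "title": source.title,
                "records_count": source.records_count,
                "last_seen_at": to_iso_z(source.last_seen_at),
            }
            for source in sources
        ],
    }
```

Since the summaries come from counters stored on the extension, the list endpoint never needs to load or count individual source records.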
## Migration Phases
1. Add dependency and schema.
2. Add idempotent backfill service and management command.
3. Backfill all existing legacy parser data into source extensions.
4. Switch API v2 to source extensions.
5. Update frontend generated clients and source pages.
6. After verification, remove `OrganizationDataSnapshot` and legacy parser tables in a separate cleanup phase.
## Non-Goals For First Pass
- No destructive deletion of legacy parser tables before backfill verification.
- No attempt to make all source payloads strongly typed immediately.
- No frontend visual redesign; only data contract changes needed for the new schema.
## Risks
- Polymorphic subclass queries add joins. List endpoints must query the base extension compactly and load records only on demand.
- Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
- Current v2 filters use source-specific existence checks. They must move to extension existence filters.
- Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.