feat(organizations): migrate source storage to polymorphic records

2026-05-19 10:23:53 +02:00
parent 19a7d5a91c
commit 4ca2fa25d5
44 changed files with 7129 additions and 1551 deletions

@@ -0,0 +1,76 @@
# Direct Parser Source Ingestion Design
## Goal
Parser runtime must write parsed source records directly into the organization-centric
polymorphic storage:
- `organizations_organization`
- `organizations_source_extension`
- source extension subclass tables
- `organizations_source_record`
- `organizations_source_financial_line`
Legacy parser record tables remain only as migration/audit inputs until a later
destructive cleanup. They must not be part of the parser runtime write path or the
runtime read path used by the application.
## Current Runtime Problem
Current parser tasks write source rows into legacy parser tables such as
`GenericParserRecord`, `InspectionRecord`, `ProcurementRecord`,
`IndustrialProductRecord`, and `FinancialReport`, then enqueue a source backfill into
the new organization storage. This keeps the old tables in the hot path, and the
organization storage can lag behind or diverge from freshly parsed data until the async
backfill runs.
## Target Runtime
Parser tasks keep using `ParserLoadLog`, `ParserBatchSequence`, and `BackgroundJob`
as operational metadata. Parsed records are converted into normalized source-record
inputs and persisted through one ingestion service.
The ingestion service is responsible for:
- normalizing identity fields before writing canonical organizations;
- resolving or creating `Organization`;
- creating or updating the source-group polymorphic extension;
- creating or updating `OrganizationSourceRecord` by `(source, external_id)`;
- writing structured financial lines for FNS reports;
- refreshing extension counters in the same transaction.
Parser save services return the number of inserted or updated source records. They no
longer create or query legacy parser record models for runtime decisions.
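The write path described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it uses an in-memory SQLite database instead of the real ORM, and the table names, column names, and the `ingest` function are hypothetical simplifications. It shows an upsert keyed by `(source, external_id)` followed by a counter refresh inside the same transaction.

```python
import json
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a simplified, hypothetical schema (not the real tables)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE extension (
            id INTEGER PRIMARY KEY,
            organization TEXT NOT NULL,
            source_group TEXT NOT NULL,
            records_count INTEGER NOT NULL DEFAULT 0,
            UNIQUE (organization, source_group)
        );
        CREATE TABLE source_record (
            id INTEGER PRIMARY KEY,
            extension_id INTEGER NOT NULL REFERENCES extension(id),
            source TEXT NOT NULL,
            external_id TEXT NOT NULL,
            payload TEXT NOT NULL,
            UNIQUE (source, external_id)
        );
    """)
    return conn


def ingest(conn, organization, source_group, source, records) -> int:
    """Upsert records and refresh the counter in one transaction.

    Returns the number of rows inserted or updated, assuming the batch
    contains no duplicate external_id values.
    """
    with conn:  # commits on success, rolls back on any exception
        # Create the extension if it does not exist yet.
        conn.execute(
            "INSERT OR IGNORE INTO extension (organization, source_group) "
            "VALUES (?, ?)",
            (organization, source_group),
        )
        (ext_id,) = conn.execute(
            "SELECT id FROM extension "
            "WHERE organization = ? AND source_group = ?",
            (organization, source_group),
        ).fetchone()
        # Insert new records; overwrite the payload of existing ones.
        for rec in records:
            conn.execute(
                "INSERT INTO source_record "
                "(extension_id, source, external_id, payload) "
                "VALUES (?, ?, ?, ?) "
                "ON CONFLICT (source, external_id) "
                "DO UPDATE SET payload = excluded.payload",
                (ext_id, source, rec["external_id"], json.dumps(rec)),
            )
        # Refresh the counter before the transaction commits.
        conn.execute(
            "UPDATE extension SET records_count = "
            "(SELECT COUNT(*) FROM source_record WHERE extension_id = ?) "
            "WHERE id = ?",
            (ext_id, ext_id),
        )
    return len(records)
```

Running `ingest` twice with the same `external_id` leaves one row per key and an unchanged `records_count`, which is the behavior the save services rely on when they report inserted-or-updated counts.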
## Runtime Read Scope
The following runtime reads must use organization source storage:
- parser source cards and source item counters;
- parser log organization counts;
- source detail lists;
- source record detail reads;
- frontend-facing parser result compatibility endpoints while they remain exposed;
- admin/dashboard/export paths that are used by the app during normal operation.
Legacy parser tables may still be read by explicit migration/backfill tooling only.
## Compatibility
Existing v1 parser-result URLs can remain during the transition, but their data source
must be `OrganizationSourceRecord`, not the legacy parser models. The response shape can
be preserved on a best-effort basis through serializers/adapters that read source-record
payloads.
## Non-Goals
- Do not drop legacy parser tables in this phase.
- Do not rewrite parser clients.
- Do not remove parser load logs or background jobs.
- Do not make every payload strongly typed immediately.
## Risks
- Industrial product ingestion is large; the writer must avoid per-record table scans.
- Existing tests assert legacy model counts and must be updated to assert source-record
behavior.
- Some compatibility endpoints expose legacy primary keys. New records use UUIDs, so
compatibility adapters must accept source-record UUIDs where needed.
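One way a compatibility adapter could accept both kinds of identifier is to classify the incoming key before the lookup. The sketch below is an assumption about how that might look; `parse_record_key` and the `"legacy_pk"` / `"uid"` labels are hypothetical names, not part of the existing code.

```python
import uuid
from typing import Tuple, Union


def parse_record_key(raw: str) -> Tuple[str, Union[int, uuid.UUID]]:
    """Classify a URL path segment as a legacy integer pk or a record UUID."""
    text = raw.strip()
    # Purely numeric keys are treated as legacy primary keys.
    if text.isascii() and text.isdigit():
        return ("legacy_pk", int(text))
    try:
        # Anything else must parse as a UUID to be accepted.
        return ("uid", uuid.UUID(text))
    except ValueError:
        raise ValueError(f"not a legacy pk or UUID: {raw!r}") from None
```

A caller could then route `"legacy_pk"` keys to a lookup on the stored legacy reference and `"uid"` keys to a direct source-record lookup.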

@@ -0,0 +1,222 @@
# Polymorphic Organization Sources Design
## Goal
Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:
- `Organization` is the main business entity and stores legal identity data.
- Each source group is represented as a polymorphic organization extension.
- Detailed source records hang under the extension as subordinate records.
- API compatibility with the current frontend is not required.
## Current Data Facts
The current dev database contains:
- `organizations.Organization`: 29,667 rows.
- `OrganizationDataSnapshot`: 29,667 rows after refresh.
- `registers.Organization`: 5,138 rows.
- `InspectionRecord`: 14,059 rows.
- `ProcurementRecord`: 1,000 rows.
- `IndustrialCertificateRecord`: 23,640 rows.
- `ManufacturerRecord`: 8,762 rows.
- `IndustrialProductRecord`: 471,824 rows.
- `GenericParserRecord`: 3,506 rows.
- `FinancialReport`: 10 rows.
Observed required-field candidates:
- `Organization.name` is present in 100% of canonical organizations and can be required.
- `inn` is present in 95.34% of canonical organizations.
- `ogrn` is present in 74.84%.
- `kpp` is present in 20.56%.
- `ogrip` is present in 8.08%.
- 673 canonical organizations have no `inn`, `ogrn`, or `ogrip`.
Therefore only `name` can be required on `Organization`. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.
## Source Groups
The new source groups match the product navigation:
- Financial indicators.
- Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
- Russian manufacturers and products, including certificates, manufacturers, and industrial products.
- Planned inspections.
- Bankruptcy procedures.
- Defense supplier risk, including unfair suppliers and FAS GOZ.
- Arbitration cases.
- Information security registries.
- Vacancies from Trudvsem, HH, and SuperJob.
## Target Schema
### Organization
`organizations.Organization` remains the root table.
Required:
- `uid`
- `name`
Optional but indexed:
- `inn`
- `kpp`
- `ogrn`
- `ogrip`
New fields:
- `identity_status`: one of `complete`, `partial`, `missing`.
- `primary_identity`: short normalized search key used for deterministic deduplication and diagnostics.
Existing uniqueness constraints on non-empty identifiers should remain conservative. They protect the API and backfill from accidental duplicate canonical organizations.
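The design names the three `identity_status` values but does not define them. The sketch below shows one possible derivation under an explicit assumption: `complete` means an `inn` plus a registration number (`ogrn` or `ogrip`), `partial` means at least one of the three, and `missing` means none. The thresholds and the function name are hypothetical and would need to be confirmed against the real rules.

```python
from typing import Optional


def _clean(value: Optional[str]) -> str:
    """Treat None and whitespace-only strings as empty."""
    return (value or "").strip()


def identity_status(
    inn: Optional[str] = None,
    ogrn: Optional[str] = None,
    ogrip: Optional[str] = None,
) -> str:
    """Derive identity_status from the identifiers (assumed rules)."""
    has_inn = bool(_clean(inn))
    has_registration = bool(_clean(ogrn)) or bool(_clean(ogrip))
    if has_inn and has_registration:
        return "complete"  # assumption: tax id plus a registration number
    if has_inn or has_registration:
        return "partial"   # assumption: at least one identifier
    return "missing"       # none of inn, ogrn, ogrip is present
```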
### OrganizationSourceExtension
Add a new polymorphic base model:
- `uid`
- `organization`
- `source_group`
- `title`
- `status`
- `records_count`
- `first_seen_at`
- `last_seen_at`
- `last_load_batch`
- `metadata`
- timestamps
Constraints:
- One extension per `(organization, source_group)`.
Subclasses:
- `FinancialIndicatorsExtension`
- `GovernmentProcurementExtension`
- `IndustrialProductionExtension`
- `PlannedInspectionExtension`
- `BankruptcyExtension`
- `DefenseSupplierExtension`
- `ArbitrationExtension`
- `SecurityRegistryExtension`
- `VacancyExtension`
Rationale: `Organization` itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.
### OrganizationSourceRecord
Use one subordinate detail table for most source rows:
- `uid`
- `extension`
- `record_type`
- `source`
- `external_id`
- `title`
- `record_date`
- `amount`
- `status`
- `url`
- `payload`
- `legacy_model`
- `legacy_pk`
- `load_batch`
- timestamps
Constraints:
- Unique `(source, external_id)` when `external_id` is non-empty.
- Unique `(legacy_model, legacy_pk)` for migrated legacy rows.
This keeps the number of tables low while still preserving every source-specific payload.
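Both constraints are conditional, which maps naturally onto partial unique indexes. The following sketch demonstrates the idea with SQLite; the table layout, index names, and the `try_insert` helper are illustrative assumptions. In Django the same effect is usually expressed with `UniqueConstraint(fields=..., condition=...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_record (
        id INTEGER PRIMARY KEY,
        source TEXT NOT NULL,
        external_id TEXT NOT NULL DEFAULT '',
        legacy_model TEXT NOT NULL DEFAULT '',
        legacy_pk TEXT NOT NULL DEFAULT ''
    );
    -- Enforced only for rows that actually carry an external id.
    CREATE UNIQUE INDEX uniq_source_external
        ON source_record (source, external_id)
        WHERE external_id <> '';
    -- Enforced only for rows migrated from a legacy table.
    CREATE UNIQUE INDEX uniq_legacy
        ON source_record (legacy_model, legacy_pk)
        WHERE legacy_model <> '';
""")


def try_insert(source, external_id="", legacy_model="", legacy_pk="") -> bool:
    """Insert one row; return False if a unique index rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO source_record "
                "(source, external_id, legacy_model, legacy_pk) "
                "VALUES (?, ?, ?, ?)",
                (source, external_id, legacy_model, legacy_pk),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Rows with an empty `external_id` and no legacy reference are never compared against each other, so sources without stable identifiers can still be stored.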
### FinancialReportLine
Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:
- `source_record`
- `form_code`
- `line_code`
- `line_name`
- `year`
- `period_start`
- `period_end`
The existing legacy `FinancialReportLine` table is used only as a staging source after the migration.
## Backfill Rules
Backfill must be idempotent.
For each legacy row:
1. Resolve canonical `Organization` by identifiers in this order:
   - exact `inn + kpp` where available,
   - exact `ogrn`,
   - exact `ogrip`,
   - exact `inn` only when it maps to one canonical organization,
   - normalized name fallback only when there is one unambiguous match.
2. If no organization can be resolved, create one (or reuse the one created by an earlier run) with `identity_status` set to `missing` or `partial`.
3. Create or update the matching `OrganizationSourceExtension`.
4. Create or update `OrganizationSourceRecord`.
5. Preserve the original source row in `payload`.
6. Store `legacy_model` and `legacy_pk` for audit and repeatable updates.
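The resolution order in step 1 can be sketched as a chain of lookups that stops at the first unambiguous hit. This is a simplified in-memory illustration: the `Org` dataclass, `resolve`, and the name normalization (case folding and whitespace collapsing) are assumptions, and the sketch also assumes that an ambiguous match at any step falls through to the next step rather than raising.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Org:
    uid: str
    name: str
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""


def _norm_name(name: str) -> str:
    """Assumed normalization: case-fold and collapse whitespace."""
    return " ".join(name.casefold().split())


def resolve(
    orgs: Sequence[Org],
    *,
    name: str = "",
    inn: str = "",
    kpp: str = "",
    ogrn: str = "",
    ogrip: str = "",
) -> Optional[Org]:
    """Return the single matching organization, or None if unresolved."""
    steps: List[Callable[[Org], bool]] = []
    if inn and kpp:
        steps.append(lambda o: o.inn == inn and o.kpp == kpp)
    if ogrn:
        steps.append(lambda o: o.ogrn == ogrn)
    if ogrip:
        steps.append(lambda o: o.ogrip == ogrip)
    if inn:
        steps.append(lambda o: o.inn == inn)
    if name:
        wanted = _norm_name(name)
        steps.append(lambda o: _norm_name(o.name) == wanted)

    for matches_step in steps:
        hits = [o for o in orgs if matches_step(o)]
        if len(hits) == 1:  # accept only an unambiguous match
            return hits[0]
    return None
```

A `None` result corresponds to step 2, where the backfill creates or reuses a placeholder organization and reports it.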
## API Shape
The new API should be organization-centric:
- `GET /api/v2/organizations/`
- `GET /api/v2/organizations/{uid}/`
- `GET /api/v2/organizations/{uid}/sources/`
- `GET /api/v2/organization-sources/{uid}/records/`
The list endpoint can expose compact source summaries:
```json
{
  "uid": "...",
  "name": "...",
  "inn": "...",
  "ogrn": "...",
  "identity_status": "complete",
  "sources": [
    {
      "uid": "...",
      "source_group": "planned_inspections",
      "title": "Плановые проверки Генпрокуратуры России",
      "records_count": 12,
      "last_seen_at": "2026-05-18T00:00:00Z"
    }
  ]
}
```
Source records are fetched on demand from the extension, not embedded into every organization list row.
## Migration Phases
1. Add dependency and schema.
2. Add idempotent backfill service and management command.
3. Backfill all existing legacy parser data into source extensions.
4. Switch API v2 to source extensions.
5. Update frontend generated clients and source pages.
6. After verification, remove `OrganizationDataSnapshot` and legacy parser tables in a separate cleanup phase.
## Non-Goals For First Pass
- No destructive deletion of legacy parser tables before backfill verification.
- No attempt to make all source payloads strongly typed immediately.
- No frontend visual redesign; only data contract changes needed for the new schema.
## Risks
- Polymorphic subclass queries add joins. List endpoints must query the base extension compactly and load records only on demand.
- Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
- Current v2 filters use source-specific existence checks. They must move to extension existence filters.
- Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.