# Polymorphic Organization Sources Design

## Goal

Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:

- `Organization` is the main business entity and stores legal identity data.
- Each source group is represented as a polymorphic organization extension.
- Detailed source records hang under the extension as subordinate records.
- API compatibility with the current frontend is not required.

## Current Data Facts

The current dev database contains:

- `organizations.Organization`: 29,667 rows.
- `OrganizationDataSnapshot`: 29,667 rows after refresh.
- `registers.Organization`: 5,138 rows.
- `InspectionRecord`: 14,059 rows.
- `ProcurementRecord`: 1,000 rows.
- `IndustrialCertificateRecord`: 23,640 rows.
- `ManufacturerRecord`: 8,762 rows.
- `IndustrialProductRecord`: 471,824 rows.
- `GenericParserRecord`: 3,506 rows.
- `FinancialReport`: 10 rows.

Observed required-field candidates:

- `Organization.name` is present in 100% of canonical organizations and can be required.
- `inn` is present in 95.34% of canonical organizations.
- `ogrn` is present in 74.84%.
- `kpp` is present in 20.56%.
- `ogrip` is present in 8.08%.
- 673 canonical organizations have no `inn`, `ogrn`, or `ogrip`.

Therefore only `name` can be required on `Organization`. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.

## Source Groups

The new source groups match the product navigation:

- Financial indicators.
- Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
- Russian manufacturers and products, including certificates, manufacturers, and industrial products.
- Planned inspections.
- Bankruptcy procedures.
- Defense supplier risk, including unfair suppliers and FAS GOZ.
- Arbitration cases.
- Information security registries.
- Vacancies from Trudvsem, HH, and SuperJob.
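As an illustration of the explicit status field called for under Current Data Facts, the sketch below derives one of `complete`, `partial`, or `missing` (the `identity_status` values listed under Target Schema) from the optional identifiers. The classification rule itself is an assumption; this design only states that organizations with no `inn`, `ogrn`, or `ogrip` exist and that completeness should not be inferred from nullability.

```python
from typing import Optional


def _present(value: Optional[str]) -> bool:
    # Treat None, empty strings, and whitespace-only strings as absent.
    return bool(value and value.strip())


def identity_status(inn: Optional[str], ogrn: Optional[str], ogrip: Optional[str]) -> str:
    """Classify identifier completeness for one organization.

    Assumed rule (not stated in this design):
    - "missing": none of inn, ogrn, ogrip is present;
    - "complete": inn plus a registration number (ogrn or ogrip);
    - "partial": anything in between.
    """
    has_inn = _present(inn)
    has_registration = _present(ogrn) or _present(ogrip)
    if not has_inn and not has_registration:
        return "missing"
    if has_inn and has_registration:
        return "complete"
    return "partial"
```

Under this rule the 673 canonical organizations without any of the three identifiers would be classified as `missing`; where exactly the line between `partial` and `complete` falls is a product decision this sketch does not settle.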
## Target Schema

### Organization

`organizations.Organization` remains the root table.

Required:

- `uid`
- `name`

Optional but indexed:

- `inn`
- `kpp`
- `ogrn`
- `ogrip`

New fields:

- `identity_status`: one of `complete`, `partial`, `missing`.
- `primary_identity`: short normalized search key used for deterministic deduplication and diagnostics.

Existing uniqueness constraints on non-empty identifiers should remain conservative. They protect the API and backfill from accidental duplicate canonical organizations.

### OrganizationSourceExtension

Add a new polymorphic base model:

- `uid`
- `organization`
- `source_group`
- `title`
- `status`
- `records_count`
- `first_seen_at`
- `last_seen_at`
- `last_load_batch`
- `metadata`
- timestamps

Constraints:

- One extension per `(organization, source_group)`.

Subclasses:

- `FinancialIndicatorsExtension`
- `GovernmentProcurementExtension`
- `IndustrialProductionExtension`
- `PlannedInspectionExtension`
- `BankruptcyExtension`
- `DefenseSupplierExtension`
- `ArbitrationExtension`
- `SecurityRegistryExtension`
- `VacancyExtension`

Rationale: `Organization` itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.

### OrganizationSourceRecord

Use one subordinate detail table for most source rows:

- `uid`
- `extension`
- `record_type`
- `source`
- `external_id`
- `title`
- `record_date`
- `amount`
- `status`
- `url`
- `payload`
- `legacy_model`
- `legacy_pk`
- `load_batch`
- timestamps

Constraints:

- Unique `(source, external_id)` when `external_id` is non-empty.
- Unique `(legacy_model, legacy_pk)` for migrated legacy rows.

This keeps the number of tables low while still preserving every source-specific payload.
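The two conditional uniqueness constraints on `OrganizationSourceRecord` can be expressed as partial unique indexes. The sketch below uses Python's standard `sqlite3` module purely for illustration; the table name `organization_source_record`, the reduced column set, the use of empty strings for "no value", and the `insert_record` helper are all assumptions, and the real schema may be declared through an ORM instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organization_source_record (
        uid TEXT PRIMARY KEY,
        source TEXT NOT NULL,
        external_id TEXT NOT NULL DEFAULT '',
        legacy_model TEXT NOT NULL DEFAULT '',
        legacy_pk TEXT NOT NULL DEFAULT ''
    );
    -- Unique (source, external_id), enforced only for non-empty external_id.
    CREATE UNIQUE INDEX uniq_source_external_id
        ON organization_source_record (source, external_id)
        WHERE external_id <> '';
    -- Unique (legacy_model, legacy_pk), enforced only for migrated legacy rows.
    CREATE UNIQUE INDEX uniq_legacy_row
        ON organization_source_record (legacy_model, legacy_pk)
        WHERE legacy_model <> '';
    """
)


def insert_record(uid, source, external_id="", legacy_model="", legacy_pk=""):
    """Insert one record; return False when a uniqueness constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO organization_source_record VALUES (?, ?, ?, ?, ?)",
                (uid, source, external_id, legacy_model, legacy_pk),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With these indexes, any number of rows may share an empty `external_id`, while a second row with the same non-empty `(source, external_id)` or the same `(legacy_model, legacy_pk)` is rejected.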
### FinancialReportLine

Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:

- `source_record`
- `form_code`
- `line_code`
- `line_name`
- `year`
- `period_start`
- `period_end`

The existing legacy `FinancialReportLine` table is used only as a staging source after the migration.

## Backfill Rules

Backfill must be idempotent.

For each legacy row:

1. Resolve canonical `Organization` by identifiers in this order:
   - exact `inn + kpp` where available,
   - exact `ogrn`,
   - exact `ogrip`,
   - exact `inn` only when it maps to one canonical organization,
   - normalized name fallback only when there is one unambiguous match.
2. If no organization can be resolved, create or reuse an organization with `identity_status=missing` or `partial`.
3. Create or update the matching `OrganizationSourceExtension`.
4. Create or update `OrganizationSourceRecord`.
5. Preserve the original source row in `payload`.
6. Store `legacy_model` and `legacy_pk` for audit and repeatable updates.

## API Shape

The new API should be organization-centric:

- `GET /api/v2/organizations/`
- `GET /api/v2/organizations/{uid}/`
- `GET /api/v2/organizations/{uid}/sources/`
- `GET /api/v2/organization-sources/{uid}/records/`

The list endpoint can expose compact source summaries:

```json
{
  "uid": "...",
  "name": "...",
  "inn": "...",
  "ogrn": "...",
  "identity_status": "complete",
  "sources": [
    {
      "uid": "...",
      "source_group": "planned_inspections",
      "title": "Плановые проверки Генпрокуратуры России",
      "records_count": 12,
      "last_seen_at": "2026-05-18T00:00:00Z"
    }
  ]
}
```

Source records are fetched on demand from the extension, not embedded into every organization list row.

## Migration Phases

1. Add dependency and schema.
2. Add idempotent backfill service and management command.
3. Backfill all existing legacy parser data into source extensions.
4. Switch API v2 to source extensions.
5. Update frontend generated clients and source pages.
6. After verification, remove `OrganizationDataSnapshot` and legacy parser tables in a separate cleanup phase.

## Non-Goals For First Pass

- No destructive deletion of legacy parser tables before backfill verification.
- No attempt to make all source payloads strongly typed immediately.
- No frontend visual redesign; only data contract changes needed for the new schema.

## Risks

- Polymorphic subclass queries add joins. List endpoints must query the base extension compactly and load records only on demand.
- Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
- Current v2 filters use source-specific existence checks. They must move to extension existence filters.
- Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.
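The duplicate risk from name fallback depends on how strictly step 1 of Backfill Rules is applied, so a minimal in-memory sketch of that resolution order may help. The `Org` dataclass, `normalize_name`, and `resolve_organization` are hypothetical names, the normalization (case-folding plus whitespace collapsing) is an assumption, and the sketch accepts a match at every step only when exactly one candidate exists, which is stricter than the text requires for `inn + kpp`, `ogrn`, and `ogrip`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Org:
    # Stand-in for both a canonical organization and a legacy row (assumption).
    uid: str
    name: str
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""


def normalize_name(name: str) -> str:
    # Hypothetical normalization: case-fold and collapse whitespace.
    return " ".join(name.casefold().split())


def resolve_organization(row: Org, orgs: List[Org]) -> Optional[Org]:
    """Return the single canonical match for a legacy row, or None."""
    steps: List[Callable[[Org], bool]] = []
    if row.inn and row.kpp:
        steps.append(lambda o: o.inn == row.inn and o.kpp == row.kpp)
    if row.ogrn:
        steps.append(lambda o: o.ogrn == row.ogrn)
    if row.ogrip:
        steps.append(lambda o: o.ogrip == row.ogrip)
    if row.inn:
        steps.append(lambda o: o.inn == row.inn)
    if row.name:
        wanted = normalize_name(row.name)
        steps.append(lambda o: normalize_name(o.name) == wanted)
    # Try each rule in priority order; accept only an unambiguous match.
    for matches_row in steps:
        candidates = [o for o in orgs if matches_row(o)]
        if len(candidates) == 1:
            return candidates[0]
    return None
```

A `None` result corresponds to step 2 of Backfill Rules: the caller would create or reuse an organization with `identity_status=missing` or `partial` and include it in the report of unresolved or created organizations.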