Polymorphic Organization Sources Design
Goal
Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:
- Organization is the main business entity and stores legal identity data.
- Each source group is represented as a polymorphic organization extension.
- Detailed source records hang under the extension as subordinate records.
- API compatibility with the current frontend is not required.
Current Data Facts
The current dev database contains:
- organizations.Organization: 29,667 rows.
- OrganizationDataSnapshot: 29,667 rows after refresh.
- registers.Organization: 5,138 rows.
- InspectionRecord: 14,059 rows.
- ProcurementRecord: 1,000 rows.
- IndustrialCertificateRecord: 23,640 rows.
- ManufacturerRecord: 8,762 rows.
- IndustrialProductRecord: 471,824 rows.
- GenericParserRecord: 3,506 rows.
- FinancialReport: 10 rows.
Observed required-field candidates:
- Organization.name is present in 100% of canonical organizations and can be required.
- inn is present in 95.34% of canonical organizations.
- ogrn is present in 74.84%.
- kpp is present in 20.56%.
- ogrip is present in 8.08%.
- 673 canonical organizations have no inn, ogrn, or ogrip.
Therefore only name can be required on Organization. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.
Source Groups
The new source groups match the product navigation:
- Financial indicators.
- Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
- Russian manufacturers and products, including certificates, manufacturers, and industrial products.
- Planned inspections.
- Bankruptcy procedures.
- Defense supplier risk, including unfair suppliers and FAS GOZ.
- Arbitration cases.
- Information security registries.
- Vacancies from Trudvsem, HH, and SuperJob.
Target Schema
Organization
organizations.Organization remains the root table.
Required:
- uid
- name
Optional but indexed:
- inn
- kpp
- ogrn
- ogrip
New fields:
- identity_status: one of complete, partial, missing.
- primary_identity: short normalized search key used for deterministic deduplication and diagnostics.
Existing uniqueness constraints on non-empty identifiers should remain conservative. They protect the API and backfill from accidental duplicate canonical organizations.
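This design does not define how identity_status and primary_identity are computed. The following is a minimal sketch under assumed rules: an organization counts as complete when it has an inn plus either ogrn or ogrip, partial when it has any one of them, and missing otherwise. The function names, the thresholds, and the `kind:value` key format are illustrative assumptions, not part of the schema.

```python
from typing import Optional


def _clean(value: Optional[str]) -> str:
    """Treat None and whitespace-only strings as absent."""
    return (value or "").strip()


def identity_status(inn: Optional[str] = None,
                    ogrn: Optional[str] = None,
                    ogrip: Optional[str] = None) -> str:
    # Assumed rule (not specified in this design):
    # complete = inn plus a registration number, partial = any identifier.
    has_inn = bool(_clean(inn))
    has_registration = bool(_clean(ogrn) or _clean(ogrip))
    if has_inn and has_registration:
        return "complete"
    if has_inn or has_registration:
        return "partial"
    return "missing"


def primary_identity(name: str,
                     inn: Optional[str] = None,
                     ogrn: Optional[str] = None,
                     ogrip: Optional[str] = None) -> str:
    # Assumed key format: the first available identifier, prefixed by its
    # kind, falling back to the lowercased, whitespace-collapsed name.
    for kind, value in (("inn", inn), ("ogrn", ogrn), ("ogrip", ogrip)):
        cleaned = _clean(value)
        if cleaned:
            return f"{kind}:{cleaned}"
    return "name:" + " ".join(name.lower().split())
```

Keeping the status in an explicit field means a query such as "all organizations with missing identity" does not have to combine null checks across four columns.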
OrganizationSourceExtension
Add a new polymorphic base model:
- uid
- organization
- source_group
- title
- status
- records_count
- first_seen_at
- last_seen_at
- last_load_batch
- metadata
- timestamps
Constraints:
- One extension per (organization, source_group).
Subclasses:
- FinancialIndicatorsExtension
- GovernmentProcurementExtension
- IndustrialProductionExtension
- PlannedInspectionExtension
- BankruptcyExtension
- DefenseSupplierExtension
- ArbitrationExtension
- SecurityRegistryExtension
- VacancyExtension
Rationale: Organization itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.
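The relationship between the base extension, its subclasses, and the one-per-(organization, source_group) constraint can be sketched in plain Python. This is an in-memory illustration only, not the ORM model: `ExtensionStore`, `EXTENSION_CLASSES`, and the `financial_indicators` key are hypothetical names, and only two of the nine subclasses are shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Type


@dataclass
class OrganizationSourceExtension:
    """Base extension: fields shared by every source group (subset shown)."""
    organization_uid: str
    source_group: str
    title: str = ""
    records_count: int = 0


class PlannedInspectionExtension(OrganizationSourceExtension):
    pass


class FinancialIndicatorsExtension(OrganizationSourceExtension):
    pass


# Hypothetical source_group keys mapped to their subclass.
EXTENSION_CLASSES: Dict[str, Type[OrganizationSourceExtension]] = {
    "planned_inspections": PlannedInspectionExtension,
    "financial_indicators": FinancialIndicatorsExtension,
}


class ExtensionStore:
    """Keeps at most one extension per (organization, source_group)."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], OrganizationSourceExtension] = {}

    def get_or_create(self, organization_uid: str,
                      source_group: str) -> OrganizationSourceExtension:
        key = (organization_uid, source_group)
        if key not in self._by_key:
            # The source group decides which subclass is instantiated.
            cls = EXTENSION_CLASSES[source_group]
            self._by_key[key] = cls(organization_uid, source_group)
        return self._by_key[key]

    def for_organization(self, organization_uid: str) -> List[OrganizationSourceExtension]:
        return [ext for (org, _), ext in self._by_key.items()
                if org == organization_uid]
```

One organization ends up with several extensions of different subclasses, which is why the polymorphism sits on the extension rather than on Organization.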
OrganizationSourceRecord
Use one subordinate detail table for most source rows:
- uid
- extension
- record_type
- source
- external_id
- title
- record_date
- amount
- status
- url
- payload
- legacy_model
- legacy_pk
- load_batch
- timestamps
Constraints:
- Unique (source, external_id) when external_id is non-empty.
- Unique (legacy_model, legacy_pk) for migrated legacy rows.
This keeps the number of tables low while still preserving every source-specific payload.
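Both constraints are conditional, so they map naturally onto partial unique indexes. The sketch below demonstrates the behaviour with the standard-library sqlite3 module; the table name, index names, the reduced column set, and the `hh` / `superjob` source values are illustrative assumptions, and the production database may express the same idea differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_record (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    external_id TEXT NOT NULL DEFAULT '',
    legacy_model TEXT NOT NULL DEFAULT '',
    legacy_pk TEXT NOT NULL DEFAULT ''
);
-- Uniqueness applies only to rows that actually carry an external_id.
CREATE UNIQUE INDEX uniq_source_external_id
    ON source_record (source, external_id)
    WHERE external_id <> '';
-- Uniqueness applies only to rows migrated from a legacy table.
CREATE UNIQUE INDEX uniq_legacy_row
    ON source_record (legacy_model, legacy_pk)
    WHERE legacy_model <> '';
""")


def try_insert(source: str, external_id: str = "",
               legacy_model: str = "", legacy_pk: str = "") -> bool:
    """Insert a row; return False if a unique index rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO source_record "
                "(source, external_id, legacy_model, legacy_pk) "
                "VALUES (?, ?, ?, ?)",
                (source, external_id, legacy_model, legacy_pk),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Rows with an empty external_id can repeat freely, while a second row with the same non-empty (source, external_id) pair is rejected.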
FinancialReportLine
Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:
- source_record
- form_code
- line_code
- line_name
- year
- period_start
- period_end
The existing legacy FinancialReportLine table is used only as a staging source after the migration.
Backfill Rules
Backfill must be idempotent.
For each legacy row:
- Resolve canonical Organization by identifiers in this order:
  - exact inn + kpp where available,
  - exact ogrn,
  - exact ogrip,
  - exact inn only when it maps to one canonical organization,
  - normalized name fallback only when there is one unambiguous match.
- If no organization can be resolved, create or reuse an organization with identity_status=missing or partial.
- Create or update the matching OrganizationSourceExtension.
- Create or update OrganizationSourceRecord.
- Preserve the original source row in payload.
- Store legacy_model and legacy_pk for audit and repeatable updates.
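The resolution order above can be sketched as an ordered list of match steps over an in-memory list of organizations. The `Org` dataclass, `normalize_name`, and `resolve_organization` are hypothetical names. The sketch also makes one assumption beyond the rules: at every step, not only the inn-only and name steps, an ambiguous result falls through to the next step instead of picking a row.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Org:
    uid: str
    name: str
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""


def normalize_name(name: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(name.lower().split())


def resolve_organization(orgs: List[Org], name: str = "", inn: str = "",
                         kpp: str = "", ogrn: str = "",
                         ogrip: str = "") -> Optional[Org]:
    key = normalize_name(name)
    # Each step: (is the step applicable?, predicate selecting candidates).
    steps: List[Tuple[bool, Callable[[Org], bool]]] = [
        (bool(inn and kpp), lambda o: o.inn == inn and o.kpp == kpp),
        (bool(ogrn), lambda o: o.ogrn == ogrn),
        (bool(ogrip), lambda o: o.ogrip == ogrip),
        (bool(inn), lambda o: o.inn == inn),
        (bool(key), lambda o: normalize_name(o.name) == key),
    ]
    for applicable, predicate in steps:
        if not applicable:
            continue
        matches = [o for o in orgs if predicate(o)]
        if len(matches) == 1:
            return matches[0]
    # Caller creates or reuses an organization with missing/partial identity.
    return None
```

Returning None rather than creating the organization inside the resolver keeps the lookup side-effect free, so the command can count and report unresolved rows separately.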
API Shape
The new API should be organization-centric:
- GET /api/v2/organizations/
- GET /api/v2/organizations/{uid}/
- GET /api/v2/organizations/{uid}/sources/
- GET /api/v2/organization-sources/{uid}/records/
The list endpoint can expose compact source summaries:
{
  "uid": "...",
  "name": "...",
  "inn": "...",
  "ogrn": "...",
  "identity_status": "complete",
  "sources": [
    {
      "uid": "...",
      "source_group": "planned_inspections",
      "title": "Плановые проверки Генпрокуратуры России",
      "records_count": 12,
      "last_seen_at": "2026-05-18T00:00:00Z"
    }
  ]
}
Source records are fetched on demand from the extension, not embedded into every organization list row.
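As a rough illustration of that contract, a list row could be assembled from the organization and its base extension rows only, copying a fixed set of summary fields so that records and payloads never leak into the list response. The function name and the dict-based inputs are assumptions made for this sketch.

```python
from typing import Any, Dict, List

# Fields exposed per source in the list endpoint, taken from the example above.
SUMMARY_FIELDS = ("uid", "source_group", "title", "records_count", "last_seen_at")


def organization_list_row(organization: Dict[str, Any],
                          extensions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build one list-endpoint row with compact source summaries."""
    return {
        "uid": organization["uid"],
        "name": organization["name"],
        "inn": organization.get("inn", ""),
        "ogrn": organization.get("ogrn", ""),
        "identity_status": organization["identity_status"],
        # Copy only the summary fields; anything else on the extension
        # (records, payloads, metadata) is left out of the list response.
        "sources": [{field: ext[field] for field in SUMMARY_FIELDS}
                    for ext in extensions],
    }
```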
Migration Phases
- Add dependency and schema.
- Add idempotent backfill service and management command.
- Backfill all existing legacy parser data into source extensions.
- Switch API v2 to source extensions.
- Update frontend generated clients and source pages.
- After verification, remove OrganizationDataSnapshot and legacy parser tables in a separate cleanup phase.
Non-Goals For First Pass
- No destructive deletion of legacy parser tables before backfill verification.
- No attempt to make all source payloads strongly typed immediately.
- No frontend visual redesign; only data contract changes needed for the new schema.
Risks
- Polymorphic subclass queries add joins. List endpoints must query the base extension compactly and load records only on demand.
- Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
- Current v2 filters use source-specific existence checks. They must move to extension existence filters.
- Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.