# Polymorphic Organization Sources Design

## Goal

Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:

- `Organization` is the main business entity and stores legal identity data.
- Each source group is represented as a polymorphic organization extension.
- Detailed source records hang under the extension as subordinate records.
- API compatibility with the current frontend is not required.

## Current Data Facts

The current dev database contains:

- `organizations.Organization`: 29,667 rows.
- `OrganizationDataSnapshot`: 29,667 rows after refresh.
- `registers.Organization`: 5,138 rows.
- `InspectionRecord`: 14,059 rows.
- `ProcurementRecord`: 1,000 rows.
- `IndustrialCertificateRecord`: 23,640 rows.
- `ManufacturerRecord`: 8,762 rows.
- `IndustrialProductRecord`: 471,824 rows.
- `GenericParserRecord`: 3,506 rows.
- `FinancialReport`: 10 rows.

Observed required-field candidates:

- `Organization.name` is present in 100% of canonical organizations and can be required.
- `inn` is present in 95.34% of canonical organizations.
- `ogrn` is present in 74.84%.
- `kpp` is present in 20.56%.
- `ogrip` is present in 8.08%.
- 673 canonical organizations have no `inn`, `ogrn`, or `ogrip`.

Therefore only `name` can be required on `Organization`. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.

## Source Groups

The new source groups match the product navigation:

- Financial indicators.
- Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
- Russian manufacturers and products, including certificates, manufacturers, and industrial products.
- Planned inspections.
- Bankruptcy procedures.
- Defense supplier risk, including unfair suppliers and FAS GOZ.
- Arbitration cases.
- Information security registries.
- Vacancies from Trudvsem, HH, and SuperJob.

## Target Schema

### Organization

`organizations.Organization` remains the root table.

Required:

- `uid`
- `name`

Optional but indexed:

- `inn`
- `kpp`
- `ogrn`
- `ogrip`

New fields:

- `identity_status`: one of `complete`, `partial`, `missing`.
- `primary_identity`: short normalized search key used for deterministic deduplication and diagnostics.

Existing uniqueness constraints on non-empty identifiers should remain conservative. They protect the API and backfill from accidental duplicate canonical organizations.

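The design names the three `identity_status` values but does not define the rule behind them. The following is a minimal sketch of one possible rule, not the project's implementation. Its assumptions: `complete` means an `inn` plus a registration number (`ogrn` or `ogrip`), `partial` means at least one of them, and `missing` means none. The `OrgIdentity` class, the helper names, and the `primary_identity` key format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrgIdentity:
    """Hypothetical stand-in for the identity fields of Organization."""
    name: str
    inn: Optional[str] = None
    kpp: Optional[str] = None
    ogrn: Optional[str] = None
    ogrip: Optional[str] = None


def _clean(value: Optional[str]) -> Optional[str]:
    """Treat None, empty, and whitespace-only identifiers as absent."""
    value = (value or "").strip()
    return value or None


def identity_status(org: OrgIdentity) -> str:
    # Assumed rule: the design does not spell out these thresholds.
    inn = _clean(org.inn)
    registration = _clean(org.ogrn) or _clean(org.ogrip)
    if inn and registration:
        return "complete"
    if inn or registration:
        return "partial"
    return "missing"


def primary_identity(org: OrgIdentity) -> str:
    # Assumed key format: "<field>:<value>", falling back to the normalized name.
    for field in ("inn", "ogrn", "ogrip"):
        value = _clean(getattr(org, field))
        if value:
            kpp = _clean(org.kpp)
            if field == "inn" and kpp:
                return f"inn:{value}:kpp:{kpp}"
            return f"{field}:{value}"
    return "name:" + " ".join(org.name.lower().split())
```

Deriving both fields from the identifiers in one place keeps the status reproducible on every backfill run, which matters because the identifier columns themselves stay nullable.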
### OrganizationSourceExtension

Add a new polymorphic base model:

- `uid`
- `organization`
- `source_group`
- `title`
- `status`
- `records_count`
- `first_seen_at`
- `last_seen_at`
- `last_load_batch`
- `metadata`
- timestamps

Constraints:

- One extension per `(organization, source_group)`.

Subclasses:

- `FinancialIndicatorsExtension`
- `GovernmentProcurementExtension`
- `IndustrialProductionExtension`
- `PlannedInspectionExtension`
- `BankruptcyExtension`
- `DefenseSupplierExtension`
- `ArbitrationExtension`
- `SecurityRegistryExtension`
- `VacancyExtension`

Rationale: `Organization` itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.

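The "one extension per `(organization, source_group)`" constraint is what makes repeated loads safe. As a rough illustration only, the sketch below uses an in-memory SQLite table rather than the project's ORM. The table name, the reduced column set, the `touch_extension` helper, and the `bankruptcy` slug are assumptions; `planned_inspections` is the slug used in the API example in this document.

```python
import sqlite3

# Hypothetical, reduced version of the extension base table.
SCHEMA = """
CREATE TABLE source_extension (
    uid INTEGER PRIMARY KEY,
    organization_uid TEXT NOT NULL,
    source_group TEXT NOT NULL,
    first_seen_at TEXT NOT NULL,
    last_seen_at TEXT NOT NULL,
    UNIQUE (organization_uid, source_group)
);
"""


def touch_extension(conn, organization_uid, source_group, seen_at):
    """Create the extension on first sight, otherwise only advance last_seen_at."""
    conn.execute(
        """
        INSERT INTO source_extension
            (organization_uid, source_group, first_seen_at, last_seen_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (organization_uid, source_group)
        DO UPDATE SET last_seen_at = excluded.last_seen_at
        """,
        (organization_uid, source_group, seen_at, seen_at),
    )
    row = conn.execute(
        "SELECT uid FROM source_extension"
        " WHERE organization_uid = ? AND source_group = ?",
        (organization_uid, source_group),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

first = touch_extension(conn, "org-1", "planned_inspections", "2026-05-17T00:00:00Z")
again = touch_extension(conn, "org-1", "planned_inspections", "2026-05-18T00:00:00Z")
other = touch_extension(conn, "org-1", "bankruptcy", "2026-05-18T00:00:00Z")
```

Here `first` and `again` refer to the same row, while `other` is a second extension for the same organization, which is the situation the rationale above describes.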
### OrganizationSourceRecord

Use one subordinate detail table for most source rows:

- `uid`
- `extension`
- `record_type`
- `source`
- `external_id`
- `title`
- `record_date`
- `amount`
- `status`
- `url`
- `payload`
- `legacy_model`
- `legacy_pk`
- `load_batch`
- timestamps

Constraints:

- Unique `(source, external_id)` when `external_id` is non-empty.
- Unique `(legacy_model, legacy_pk)` for migrated legacy rows.

This keeps the number of tables low while still preserving every source-specific payload.

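Both constraints above are conditional: they apply only to rows that carry the relevant key. One way to express that is a partial unique index, sketched here with SQLite for illustration. The table name, the index names, the reduced column set, the `try_insert` helper, and the `hh` source slug are assumptions, and treating an empty `legacy_model` as "not a migrated row" is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_record (
        uid INTEGER PRIMARY KEY,
        source TEXT NOT NULL,
        external_id TEXT NOT NULL DEFAULT '',
        legacy_model TEXT NOT NULL DEFAULT '',
        legacy_pk TEXT NOT NULL DEFAULT ''
    );
    -- Uniqueness applies only to rows that have an external_id.
    CREATE UNIQUE INDEX uniq_source_external_id
        ON source_record (source, external_id)
        WHERE external_id <> '';
    -- Uniqueness applies only to rows migrated from a legacy table.
    CREATE UNIQUE INDEX uniq_legacy_row
        ON source_record (legacy_model, legacy_pk)
        WHERE legacy_model <> '';
    """
)


def try_insert(source, external_id="", legacy_model="", legacy_pk=""):
    """Return True if the row was stored, False if a unique index rejected it."""
    try:
        conn.execute(
            "INSERT INTO source_record (source, external_id, legacy_model, legacy_pk)"
            " VALUES (?, ?, ?, ?)",
            (source, external_id, legacy_model, legacy_pk),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("hh", "v-1"))  # first row with this (source, external_id): True
print(try_insert("hh", "v-1"))  # same pair again: False
```

Rows without an `external_id` are never compared against each other, so sources that do not publish stable identifiers can still be stored.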
### FinancialReportLine

Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:

- `source_record`
- `form_code`
- `line_code`
- `line_name`
- `year`
- `period_start`
- `period_end`

The existing legacy `FinancialReportLine` table is used only as a staging source after the migration.

## Backfill Rules

Backfill must be idempotent.

For each legacy row:

1. Resolve canonical `Organization` by identifiers in this order:
   - exact `inn + kpp` where available,
   - exact `ogrn`,
   - exact `ogrip`,
   - exact `inn` only when it maps to one canonical organization,
   - normalized name fallback only when there is one unambiguous match.
2. If no organization can be resolved, create or reuse an organization with `identity_status=missing` or `partial`.
3. Create or update the matching `OrganizationSourceExtension`.
4. Create or update `OrganizationSourceRecord`.
5. Preserve the original source row in `payload`.
6. Store `legacy_model` and `legacy_pk` for audit and repeatable updates.

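The resolution order in step 1 can be sketched as a small pure function over in-memory data. This is an illustration under stated assumptions, not the backfill service itself: the `Org` class, `normalize_name`, `resolve_organization`, and the sample `ORGS` are hypothetical, and the sketch conservatively requires exactly one match at every step, which is stricter than the text demands for the first three.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Org:
    """Hypothetical stand-in for a canonical Organization row."""
    uid: str
    name: str
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""


def normalize_name(name: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(name.lower().split())


def resolve_organization(row: dict, orgs: Sequence[Org]) -> Optional[Org]:
    inn, kpp = row.get("inn", ""), row.get("kpp", "")
    ogrn, ogrip = row.get("ogrn", ""), row.get("ogrip", "")
    name_key = normalize_name(row.get("name", ""))
    # Each step: (is the step applicable to this row?, match predicate).
    steps = [
        (bool(inn and kpp), lambda o: o.inn == inn and o.kpp == kpp),
        (bool(ogrn), lambda o: o.ogrn == ogrn),
        (bool(ogrip), lambda o: o.ogrip == ogrip),
        (bool(inn), lambda o: o.inn == inn),
        (bool(name_key), lambda o: normalize_name(o.name) == name_key),
    ]
    for applicable, matches in steps:
        if not applicable:
            continue
        candidates = [o for o in orgs if matches(o)]
        if len(candidates) == 1:  # ambiguous or empty results fall through
            return candidates[0]
    return None  # caller creates or reuses a missing/partial organization


ORGS = [
    Org("a", "ООО Ромашка", inn="1", kpp="10", ogrn="100"),
    Org("b", "ООО Ромашка филиал", inn="1", kpp="20"),
    Org("c", "ИП Иванов", ogrip="300"),
]
```

With this sample data, a row carrying only `inn="1"` resolves to nothing because two organizations share that `inn`, so it falls through to step 2 of the rules above.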
## API Shape

The new API should be organization-centric:

- `GET /api/v2/organizations/`
- `GET /api/v2/organizations/{uid}/`
- `GET /api/v2/organizations/{uid}/sources/`
- `GET /api/v2/organization-sources/{uid}/records/`

The list endpoint can expose compact source summaries:

```json
{
  "uid": "...",
  "name": "...",
  "inn": "...",
  "ogrn": "...",
  "identity_status": "complete",
  "sources": [
    {
      "uid": "...",
      "source_group": "planned_inspections",
      "title": "Плановые проверки Генпрокуратуры России",
      "records_count": 12,
      "last_seen_at": "2026-05-18T00:00:00Z"
    }
  ]
}
```

Source records are fetched on demand from the extension, not embedded into every organization list row.

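As a rough sketch of how a list row could be assembled so that records never leak into it, the function below copies only the summary fields shown in the JSON example. The function name, the dict-based inputs, and the sample values are assumptions made for illustration.

```python
import json

# Fields of an extension that appear in the compact summary above.
SUMMARY_FIELDS = ("uid", "source_group", "title", "records_count", "last_seen_at")


def organization_list_row(org: dict, extensions: list) -> dict:
    """Build one list-endpoint row; anything outside SUMMARY_FIELDS is dropped."""
    return {
        "uid": org["uid"],
        "name": org["name"],
        "inn": org.get("inn"),
        "ogrn": org.get("ogrn"),
        "identity_status": org["identity_status"],
        "sources": [
            {field: ext[field] for field in SUMMARY_FIELDS} for ext in extensions
        ],
    }


# Hypothetical sample data; "records" stands for detail rows that must stay out.
sample_org = {"uid": "org-1", "name": "Alpha", "inn": "1", "identity_status": "partial"}
sample_ext = {
    "uid": "ext-1",
    "source_group": "planned_inspections",
    "title": "Planned inspections",
    "records_count": 12,
    "last_seen_at": "2026-05-18T00:00:00Z",
    "records": [{"external_id": "x"}],
}
row = organization_list_row(sample_org, [sample_ext])
print(json.dumps(row, ensure_ascii=False, indent=2))
```

Using an explicit allow-list of fields means a new column on the extension does not silently enlarge every list response.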
## Migration Phases

1. Add dependency and schema.
2. Add idempotent backfill service and management command.
3. Backfill all existing legacy parser data into source extensions.
4. Switch API v2 to source extensions.
5. Update frontend generated clients and source pages.
6. After verification, remove `OrganizationDataSnapshot` and legacy parser tables in a separate cleanup phase.

## Non-Goals For First Pass

- No destructive deletion of legacy parser tables before backfill verification.
- No attempt to make all source payloads strongly typed immediately.
- No frontend visual redesign; only data contract changes needed for the new schema.

## Risks

- Polymorphic subclass queries add joins. List endpoints must query the base extension compactly and load records only on demand.
- Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
- Current v2 filters use source-specific existence checks. They must move to extension existence filters.
- Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.
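
An extension existence filter, as mentioned in the third risk, could look roughly like the query below. It touches only the base extension table and never the detail records. The schema is a hypothetical reduction built in SQLite for illustration, and the `vacancies` and `bankruptcy` slugs are assumed names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE organization (uid TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE source_extension (
        organization_uid TEXT NOT NULL REFERENCES organization (uid),
        source_group TEXT NOT NULL,
        UNIQUE (organization_uid, source_group)
    );
    INSERT INTO organization VALUES ('org-1', 'Alpha'), ('org-2', 'Beta');
    INSERT INTO source_extension VALUES
        ('org-1', 'planned_inspections'),
        ('org-1', 'vacancies'),
        ('org-2', 'vacancies');
    """
)


def organizations_with_source(source_group):
    """Return uids of organizations that have an extension in the given group."""
    rows = conn.execute(
        """
        SELECT o.uid
        FROM organization AS o
        WHERE EXISTS (
            SELECT 1
            FROM source_extension AS e
            WHERE e.organization_uid = o.uid AND e.source_group = ?
        )
        ORDER BY o.uid
        """,
        (source_group,),
    )
    return [uid for (uid,) in rows]
```

Because the unique constraint already covers `(organization_uid, source_group)`, the subquery can be answered from that index alone.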