avm/mostovik-backend

Aleksandr Meshchriakov 4ca2fa25d5 feat(organizations): migrate source storage to polymorphic records

2026-05-19 10:23:53 +02:00

Polymorphic Organization Sources Design

Goal

Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:

Organization is the main business entity and stores legal identity data.
Each source group is represented as a polymorphic organization extension.
Detailed source records hang under the extension as subordinate records.
API compatibility with the current frontend is not required.

Current Data Facts

The current dev database contains:

organizations.Organization: 29,667 rows.
OrganizationDataSnapshot: 29,667 rows after refresh.
registers.Organization: 5,138 rows.
InspectionRecord: 14,059 rows.
ProcurementRecord: 1,000 rows.
IndustrialCertificateRecord: 23,640 rows.
ManufacturerRecord: 8,762 rows.
IndustrialProductRecord: 471,824 rows.
GenericParserRecord: 3,506 rows.
FinancialReport: 10 rows.

Observed required-field candidates:

Organization.name is present in 100% of canonical organizations and can be required.
inn is present in 95.34% of canonical organizations.
ogrn is present in 74.84%.
kpp is present in 20.56%.
ogrip is present in 8.08%.
673 canonical organizations have no inn, ogrn, or ogrip.

Therefore only name can be required on Organization. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.

Source Groups

The new source groups match the product navigation:

Financial indicators.
Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
Russian manufacturers and products, including certificates, manufacturers, and industrial products.
Planned inspections.
Bankruptcy procedures.
Defense supplier risk, including unfair suppliers and FAS GOZ.
Arbitration cases.
Information security registries.
Vacancies from Trudvsem, HH, and SuperJob.

Target Schema

Organization

organizations.Organization remains the root table.

Required:

uid
name

Optional but indexed:

inn
kpp
ogrn
ogrip

New fields:

identity_status: one of complete, partial, missing.
primary_identity: short normalized search key used for deterministic deduplication and diagnostics.
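A minimal sketch of how these two fields could be derived. The design does not define the boundaries between complete, partial, and missing, nor the format of primary_identity, so both rules below are assumptions for illustration: complete means inn plus a registration number (ogrn or ogrip), and the key is the first available identifier with a prefix, falling back to the normalized name.

```python
from typing import Optional


def _clean(value: Optional[str]) -> str:
    """Treat None and whitespace-only values as empty."""
    return (value or "").strip()


def identity_status(inn: Optional[str], ogrn: Optional[str], ogrip: Optional[str]) -> str:
    """Assumed rule (not from the design): inn plus ogrn/ogrip is 'complete',
    any single identifier is 'partial', no identifiers is 'missing'."""
    inn, ogrn, ogrip = _clean(inn), _clean(ogrn), _clean(ogrip)
    if inn and (ogrn or ogrip):
        return "complete"
    if inn or ogrn or ogrip:
        return "partial"
    return "missing"


def primary_identity(
    name: str,
    inn: Optional[str] = None,
    ogrn: Optional[str] = None,
    ogrip: Optional[str] = None,
) -> str:
    """Assumed key format: '<kind>:<value>' using the first non-empty identifier,
    or the case-folded, whitespace-collapsed name when no identifier exists."""
    for kind, value in (("inn", inn), ("ogrn", ogrn), ("ogrip", ogrip)):
        value = _clean(value)
        if value:
            return f"{kind}:{value}"
    normalized_name = " ".join(_clean(name).casefold().split())
    return f"name:{normalized_name}"
```

Keeping the status as a stored value rather than inferring it from NULL columns lets the backfill report and the API filter on it directly.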

Existing uniqueness constraints on non-empty identifiers should stay in place and not be relaxed. They protect the API and the backfill from accidentally creating duplicate canonical organizations.

OrganizationSourceExtension

Add a new polymorphic base model:

uid
organization
source_group
title
status
records_count
first_seen_at
last_seen_at
last_load_batch
metadata
timestamps

Constraints:

One extension per (organization, source_group).

Subclasses:

FinancialIndicatorsExtension
GovernmentProcurementExtension
IndustrialProductionExtension
PlannedInspectionExtension
BankruptcyExtension
DefenseSupplierExtension
ArbitrationExtension
SecurityRegistryExtension
VacancyExtension

Rationale: Organization itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.
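The rationale can be illustrated with a framework-free sketch: the organization stays a single plain entity, while each source group is a subclass of a common extension base, keyed by (organization, source_group). The dataclasses, the in-memory registry, and the "government_procurements" slug are hypothetical; only "planned_inspections" appears in this design, and the real models would live in the ORM.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Type


@dataclass
class OrganizationSourceExtension:
    """Base extension: the fields shared by every source group (subset shown)."""
    organization_uid: str
    source_group: str
    title: str = ""
    records_count: int = 0


@dataclass
class PlannedInspectionExtension(OrganizationSourceExtension):
    source_group: str = "planned_inspections"


@dataclass
class GovernmentProcurementExtension(OrganizationSourceExtension):
    source_group: str = "government_procurements"  # assumed slug


class ExtensionRegistry:
    """In-memory stand-in for the table; enforces one extension per
    (organization, source_group) pair."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], OrganizationSourceExtension] = {}

    def get_or_create(
        self, cls: Type[OrganizationSourceExtension], organization_uid: str
    ) -> OrganizationSourceExtension:
        candidate = cls(organization_uid=organization_uid)
        key = (organization_uid, candidate.source_group)
        # setdefault returns the existing extension if the pair is already present.
        return self._by_key.setdefault(key, candidate)

    def for_organization(self, organization_uid: str) -> List[OrganizationSourceExtension]:
        return [ext for (org, _), ext in self._by_key.items() if org == organization_uid]
```

One organization can hold a PlannedInspectionExtension and a GovernmentProcurementExtension at the same time, which would not be possible if Organization itself were the polymorphic type.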

OrganizationSourceRecord

Use one subordinate detail table for most source rows:

uid
extension
record_type
source
external_id
title
record_date
amount
status
url
payload
legacy_model
legacy_pk
load_batch
timestamps

Constraints:

Unique (source, external_id) when external_id is non-empty.
Unique (legacy_model, legacy_pk) for migrated legacy rows.

This keeps the number of tables low while still preserving every source-specific payload.
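The two constraints above can be sketched with partial unique indexes. The example uses SQLite through the standard library purely for illustration; the table name, the reduced column set, and the choice to apply the legacy constraint only when legacy_model is non-empty are assumptions, and the production schema would be declared through the ORM and its migrations.

```python
import sqlite3


def make_connection() -> sqlite3.Connection:
    """Create an in-memory table with the two conditional unique constraints."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organization_source_record (
            id INTEGER PRIMARY KEY,
            extension_id INTEGER NOT NULL,
            source TEXT NOT NULL,
            external_id TEXT NOT NULL DEFAULT '',
            legacy_model TEXT NOT NULL DEFAULT '',
            legacy_pk TEXT NOT NULL DEFAULT ''
        );
        -- Unique (source, external_id), but only when external_id is non-empty.
        CREATE UNIQUE INDEX uniq_source_external_id
            ON organization_source_record (source, external_id)
            WHERE external_id <> '';
        -- Unique (legacy_model, legacy_pk) for migrated legacy rows only.
        CREATE UNIQUE INDEX uniq_legacy_row
            ON organization_source_record (legacy_model, legacy_pk)
            WHERE legacy_model <> '';
        """
    )
    return conn


def insert_record(
    conn: sqlite3.Connection,
    source: str,
    external_id: str = "",
    legacy_model: str = "",
    legacy_pk: str = "",
) -> bool:
    """Insert one row; return False when a unique constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO organization_source_record "
                "(extension_id, source, external_id, legacy_model, legacy_pk) "
                "VALUES (1, ?, ?, ?, ?)",
                (source, external_id, legacy_model, legacy_pk),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Rows with an empty external_id never collide with each other, while a repeated (source, external_id) or (legacy_model, legacy_pk) pair is rejected, which is what a repeatable backfill relies on.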

FinancialReportLine

Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:

source_record
form_code
line_code
line_name
year
period_start
period_end

After the migration, the existing legacy FinancialReportLine table is used only as a staging source.

Backfill Rules

Backfill must be idempotent.

For each legacy row:

1. Resolve canonical Organization by identifiers in this order:
   - exact inn + kpp where available,
   - exact ogrn,
   - exact ogrip,
   - exact inn only when it maps to one canonical organization,
   - normalized name fallback only when there is one unambiguous match.
2. If no organization can be resolved, create or reuse an organization with identity_status=missing or partial.
3. Create or update the matching OrganizationSourceExtension.
4. Create or update OrganizationSourceRecord.
5. Preserve the original source row in payload.
6. Store legacy_model and legacy_pk for audit and repeatable updates.
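The resolution order above can be sketched as a pure function over an in-memory list of organizations. The Org dataclass, the row dictionary shape, and the name normalization (case folding plus whitespace collapsing) are assumptions for illustration. The sketch also treats every step as a match only when exactly one organization qualifies, which is stricter than the design requires for the first three steps.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Org:
    uid: str
    name: str
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""


def normalize_name(name: str) -> str:
    """Assumed normalization: case-fold and collapse runs of whitespace."""
    return " ".join(name.casefold().split())


def _single(orgs: List[Org], predicate: Callable[[Org], bool]) -> Optional[Org]:
    """Return the match only when exactly one organization satisfies the predicate."""
    matches = [org for org in orgs if predicate(org)]
    return matches[0] if len(matches) == 1 else None


def resolve_organization(row: dict, organizations: Iterable[Org]) -> Optional[Org]:
    """Try each identifier in the documented order; None means 'create or reuse'."""
    orgs = list(organizations)
    inn = (row.get("inn") or "").strip()
    kpp = (row.get("kpp") or "").strip()
    ogrn = (row.get("ogrn") or "").strip()
    ogrip = (row.get("ogrip") or "").strip()
    name = normalize_name(row.get("name") or "")

    steps = [
        (bool(inn and kpp), lambda o: o.inn == inn and o.kpp == kpp),
        (bool(ogrn), lambda o: o.ogrn == ogrn),
        (bool(ogrip), lambda o: o.ogrip == ogrip),
        (bool(inn), lambda o: o.inn == inn),
        (bool(name), lambda o: normalize_name(o.name) == name),
    ]
    for applicable, predicate in steps:
        if not applicable:
            continue
        found = _single(orgs, predicate)
        if found is not None:
            return found
    return None
```

A None result corresponds to the second rule: the caller creates or reuses an organization with identity_status set to missing or partial and reports it.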

API Shape

The new API should be organization-centric:

GET /api/v2/organizations/
GET /api/v2/organizations/{uid}/
GET /api/v2/organizations/{uid}/sources/
GET /api/v2/organization-sources/{uid}/records/

The list endpoint can expose compact source summaries:

{
  "uid": "...",
  "name": "...",
  "inn": "...",
  "ogrn": "...",
  "identity_status": "complete",
  "sources": [
    {
      "uid": "...",
      "source_group": "planned_inspections",
      "title": "Плановые проверки Генпрокуратуры России",
      "records_count": 12,
      "last_seen_at": "2026-05-18T00:00:00Z"
    }
  ]
}

Source records are fetched on demand from the extension, not embedded into every organization list row.
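A minimal sketch of building one list row in that shape from plain dictionaries. The function name and the input dictionaries are hypothetical; the point is that only base extension fields are copied, so detail records never leak into the list payload.

```python
from typing import Any, Dict, List

# Fields copied from the organization and from each extension into the list row.
ORGANIZATION_FIELDS = ("uid", "name", "inn", "ogrn", "identity_status")
SOURCE_SUMMARY_FIELDS = ("uid", "source_group", "title", "records_count", "last_seen_at")


def organization_list_row(
    organization: Dict[str, Any], extensions: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Build a compact list row: identity fields plus per-source summaries.

    Anything not whitelisted above (for example an embedded 'records' list)
    is dropped, so records stay behind the on-demand records endpoint.
    """
    row: Dict[str, Any] = {key: organization.get(key) for key in ORGANIZATION_FIELDS}
    row["sources"] = [
        {key: extension.get(key) for key in SOURCE_SUMMARY_FIELDS}
        for extension in extensions
    ]
    return row
```

A client that needs the records themselves would follow up with GET /api/v2/organization-sources/{uid}/records/ using the uid from the relevant summary.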

Migration Phases

Add dependency and schema.
Add idempotent backfill service and management command.
Backfill all existing legacy parser data into source extensions.
Switch API v2 to source extensions.
Update frontend generated clients and source pages.
After verification, remove OrganizationDataSnapshot and legacy parser tables in a separate cleanup phase.

Non-Goals For First Pass

No destructive deletion of legacy parser tables before backfill verification.
No attempt to make all source payloads strongly typed immediately.
No frontend visual redesign; only data contract changes needed for the new schema.

Risks

Polymorphic subclass queries add joins. List endpoints must query only the base extension table, keep the result compact, and load records only on demand.
Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
Current v2 filters use source-specific existence checks. They must move to extension existence filters.
Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.
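The move from source-specific existence checks to extension existence filters can be sketched as a single EXISTS subquery against the base extension table. SQLite via the standard library is used only for illustration, and the table and column names are assumptions; the same query touches no subclass tables and no record rows, which also addresses the join cost noted in the first risk.

```python
import sqlite3
from typing import List


def make_db() -> sqlite3.Connection:
    """In-memory schema with the base extension table only (names assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organization (
            uid TEXT PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE organization_source_extension (
            uid TEXT PRIMARY KEY,
            organization_uid TEXT NOT NULL REFERENCES organization (uid),
            source_group TEXT NOT NULL,
            records_count INTEGER NOT NULL DEFAULT 0,
            UNIQUE (organization_uid, source_group)
        );
        """
    )
    return conn


def organizations_with_source(conn: sqlite3.Connection, source_group: str) -> List[str]:
    """Return uids of organizations that have an extension for the given group."""
    rows = conn.execute(
        """
        SELECT o.uid
        FROM organization AS o
        WHERE EXISTS (
            SELECT 1
            FROM organization_source_extension AS e
            WHERE e.organization_uid = o.uid
              AND e.source_group = ?
        )
        ORDER BY o.uid
        """,
        (source_group,),
    )
    return [row[0] for row in rows]
```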
