feat(organizations): migrate source storage to polymorphic records
64
docs/parser-external-access-note-ru.md
Normal file
@@ -0,0 +1,64 @@
# Analytical note on external requests made by the parsers

Prepared: 2026-05-18

## Summary

In the Mostovik backend, the parser subsystem is implemented as a set of Celery tasks and HTTP clients. The parsers query open government sources, public APIs and, when the corresponding keys are present in environment variables, paid or service APIs. The retrieved data is normalized and stored in PostgreSQL through the service layer.

The main entry points are in `src/apps/parsers/tasks.py`. External HTTP requests go through the clients in `src/apps/parsers/clients/`. The shared HTTP client issues GET and POST requests with timeouts and a standard User-Agent and, if `PARSER_USE_RUNTIME_PROXIES` is enabled, can route them through active RU proxies stored in the database.
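The shared client described above could look roughly like the following standard-library sketch. It is not the project's client: the function names, the timeout value, and the User-Agent string are placeholders chosen for illustration, and the real client in `src/apps/parsers/clients/` may be built on a different HTTP library.

```python
import urllib.request

DEFAULT_TIMEOUT_SECONDS = 30               # assumed value, not taken from the note
DEFAULT_USER_AGENT = "parser-client/0.1"   # placeholder, not the real header value


def build_request(url, data=None, headers=None):
    """Build a GET request, or a POST request when a body is given."""
    merged = {"User-Agent": DEFAULT_USER_AGENT, **(headers or {})}
    method = "POST" if data is not None else "GET"
    return urllib.request.Request(url, data=data, headers=merged, method=method)


def build_opener(proxy_url=None):
    """Route traffic through proxy_url when one is supplied."""
    if proxy_url is None:
        return urllib.request.build_opener()
    proxies = {"http": proxy_url, "https": proxy_url}
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxies))


def fetch(url, data=None, proxy_url=None, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Send the request and return the raw response body."""
    opener = build_opener(proxy_url)
    with opener.open(build_request(url, data), timeout=timeout) as response:
        return response.read()
```

In this sketch the caller decides whether to pass `proxy_url`, for example only when `PARSER_USE_RUNTIME_PROXIES` is enabled and an active proxy was read from the database.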

## What is contacted and what is downloaded

| Source | Endpoints | What is downloaded / retrieved |
|---|---|---|
| Minpromtorg: industrial production certificates | `https://minpromtorg.gov.ru/api/kss-document-preview` | JSON list of documents, then the latest Excel file from the `files[].url` field. The Excel file yields the conclusion number, dates, document link, organization name, INN, and OGRN. |
| Minpromtorg: manufacturer registry | `https://minpromtorg.gov.ru/api/kss-document-preview` | JSON list of documents, then the latest Excel file `data_orgs_YYYYMMDD...`. The Excel file yields the manufacturer name, INN, OGRN, and address. |
| Minpromtorg: industrial products | `https://minpromtorg.gov.ru/api/kss-document-preview` | JSON list of documents, then the Excel file of the industrial products registry. The Excel file yields the organization, INN, OGRN, registration number, product name, model, OKPD2, TN VED, and normative document. |
| GISP: industrial products | `https://gisp.gov.ru/pp719v2/pub/prod/`, technical API `https://gisp.gov.ru/pp719v2/pub/prod/b/` | The source catalog lists this address as the upstream for products. The generic file client can fetch the first JSON page with a POST to `/pp719v2/pub/prod/b/`; the main Celery task `parse_industrial_products` currently uses Minpromtorg Excel discovery. |
| Prosecutor General's Office: unified register of inspections | `https://proverki.gov.ru/portal/public-open-data/check/{year}/{month}?isFederalLaw248=true\|false`, `https://proverki.gov.ru/portal/public-open-data/check/{year}/plans?isFederalLaw248=true\|false` | ZIP/XML exports of inspections or inspection plans. When needed, Playwright is used: it opens the portal page, selects the download tab, and downloads the ZIP/XML. |
| EIS procurement: SOAP integration | `https://int44.zakupki.gov.ru/eis-integration/services/getDocsIP` | The SOAP request returns an `archiveUrl`; the ZIP/XML archive of procurements/contracts is then downloaded. Access uses `ZAKUPKI_TOKEN` from the environment. |
| EIS procurement: HTTP fallback | `https://zakupki.gov.ru/opendata/download/notifications/{region}/{year}/...` | ZIP archives with procurement XML files, used when the SOAP token is not used or a direct link is passed. |
| EIS/FAS generic sources | `https://zakupki.gov.ru/epz/order/extendedsearch/results.html`, `https://zakupki.gov.ru/epz/orderclause/search/results.html`, `https://zakupki.gov.ru/epz/contract/search/results.html`, `https://zakupki.gov.ru/epz/dishonestsupplier/search/results.html`, `https://fas.gov.ru/pages/activity/reestr-uridicheskih-lic` | HTML pages of official registries. The parser extracts cards/tables: 44-FZ procurements, 223-FZ procurements, contracts, unfair suppliers, and FAS data on the state defense order (GOZ). |
| FNS: accounting statements | No automatic HTTP download from the FNS was found in the current code. The source catalog lists the reference URL `https://bo.nalog.gov.ru/advanced-search/organizations/search?...` | Processes `fin_{id}_{ogrn}.xlsx` files that were uploaded locally or placed in the `input/fns` folder, as well as ZIP archives containing such files. Rows of accounting statement forms No. 1, 2, 3, 4, and 6 are read from the Excel files. |
| KAD Arbitr via Checko | Official source in the catalog: `https://kad.arbitr.ru/`; actual lookup in the code: `https://api.checko.ru/v2/legal-cases` | JSON responses on arbitration cases for active organizations from the internal registries. The request carries INN/OGRN. The payload stores the case number, court, type, status, dates, amounts, parties, and a link to the case card. |
| Fedresurs/EFRSB | `https://bankrot.fedresurs.ru/`; fallback: `https://api.checko.ru/v2/company` | The official source is processed as HTML or a structured export. When the portal is unavailable, Checko is used: bankruptcy notices are taken from the JSON by the organization's INN/OGRN. |
| FSTEC | `https://reestr.fstec.ru/reg3` and links found on the page of the form `module=rfiles` or `/uploads/reg...` | HTML page of the registry, then a CSV/file export if a link is found. SSL verification is disabled in the code for this source. |
| Vacancies: Rabota Rossii | The client uses `http://opendata.trudvsem.ru/api/v1/vacancies`, `http://opendata.trudvsem.ru/api/v1/vacancies/company/inn/{inn}`; the source catalog lists `https://opendata.trudvsem.ru/api/v1/vacancies` | JSON list of vacancies, including employer, INN/OGRN when available, vacancy title, date, salary, status, and link. |
| Vacancies: HeadHunter | `https://api.hh.ru/vacancies` | JSON list of vacancies. Search is by region and/or text; where an organization cannot be searched by INN, its normalized name is used. |
| Vacancies: SuperJob | `https://api.superjob.ru/2.0/vacancies/` | JSON list of vacancies. Used only if `SUPERJOB_APP_ID` is set; the key is sent in the `X-Api-App-Id` header. |
| Checko: contracts and inspections by organization | `https://api.checko.ru/v2/contracts`, `https://api.checko.ru/v2/inspections` | JSON data on contracts and inspections for active organizations from the internal registries. The request carries INN/OGRN; the API key is passed as the `key` parameter. |
| Proxy-Tools | `https://proxy-tools.com/api/v1/proxies` | Service download of the RU proxy list for the parsers. Used only when `PROXY_TOOLS_API_KEY` is set; the request is sent with a Bearer token. |
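The "latest Excel file" step for the Minpromtorg rows can be illustrated with a small selection function. This is a sketch under assumptions: the response is taken to be a list of documents that each carry a `files` list with `url` entries (only `files[].url` and the `data_orgs_YYYYMMDD...` name come from the table), and the function name is hypothetical.

```python
import re
from datetime import date

# Filename pattern assumed from the table: data_orgs_YYYYMMDD...
_DATE_IN_NAME = re.compile(r"data_orgs_(\d{4})(\d{2})(\d{2})")


def latest_excel_url(documents):
    """Return the URL of the newest data_orgs_* file, or None if none match."""
    best = None
    for document in documents:
        for file_info in document.get("files", []):
            url = file_info.get("url", "")
            match = _DATE_IN_NAME.search(url)
            if not match:
                continue
            try:
                stamp = date(*map(int, match.groups()))
            except ValueError:
                # Skip names whose digits do not form a real calendar date.
                continue
            if best is None or stamp > best[0]:
                best = (stamp, url)
    return best[1] if best else None
```

Comparing parsed dates rather than raw strings keeps the choice stable even if the files arrive in arbitrary order.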

## Formats of downloaded data

The parsers work with the following formats:

- JSON responses from public and paid APIs.
- Excel files (`.xlsx`/`.xlsm`) with the Minpromtorg registries and FNS accounting statements.
- ZIP archives containing XML/CSV/JSON/HTML/XLSX files.
- XML files with inspection and procurement exports.
- HTML pages of official registries with tables or cards.
- CSV files, in particular for FSTEC.

Downloaded files are never executed as code: they are read as data, parsed, and stored in the database.

## Parameters sent to external services

Outgoing requests may include:

- load periods: year, month, date;
- region codes;
- INN/OGRN of organizations from the internal active registries;
- a search string with the organization name, for vacancies;
- service API keys from the environment: `ZAKUPKI_TOKEN`, `CHECKO_API_KEY`, `SUPERJOB_APP_ID`, `PROXY_TOOLS_API_KEY`.

The keys are not hardcoded; they are read from environment variables.
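As an illustration of key-gated access, the sketch below reports which optional integrations would be active for a given environment. The four variable names come from the note; the labels, the function name, and the rule "an empty or whitespace value counts as unset" are assumptions, and the note does not state that Checko lookups are skipped when `CHECKO_API_KEY` is missing.

```python
import os

# Variable names are from the note; the labels are illustrative only.
OPTIONAL_KEYS = {
    "ZAKUPKI_TOKEN": "EIS SOAP integration",
    "CHECKO_API_KEY": "Checko lookups",
    "SUPERJOB_APP_ID": "SuperJob vacancies",
    "PROXY_TOOLS_API_KEY": "Proxy-Tools sync",
}


def enabled_integrations(environ=None):
    """Return labels of integrations whose key is present and non-empty."""
    environ = os.environ if environ is None else environ
    return sorted(
        label
        for name, label in OPTIONAL_KEYS.items()
        if environ.get(name, "").strip()
    )
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests without touching real secrets.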

## Important caveats

- Several tasks accept a `file_url` parameter. If an operator passes it manually, the parser downloads the file from that link, not only from the source's default address.
- For the FNS, the current backend implementation does not download files automatically from the FNS website; it processes already obtained Excel/ZIP files through the watch folder or an API upload.
- For `proverki.gov.ru`, headless Chromium may be launched through Playwright, because some downloads are available through the portal's JS interface.
- For FSTEC, SSL verification is disabled in the source client settings.
- Runtime proxies from the database are used only when `PARSER_USE_RUNTIME_PROXIES=true`; the separate proxy synchronization task contacts Proxy-Tools only when `PROXY_TOOLS_API_KEY` is set.
@@ -0,0 +1,85 @@
# Direct Parser Source Ingestion Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Move parser runtime reads and writes from legacy parser record tables to organization source storage.

**Architecture:** Add a focused ingestion service in `organizations` that persists normalized source-record inputs directly into polymorphic source extensions. Parser services become adapters from parser dataclasses to ingestion inputs. Runtime reads use `OrganizationSourceRecord` and extension counters.

**Tech Stack:** Django 3.2, PostgreSQL, DRF, django-polymorphic, pytest.

---

### Task 1: Direct Ingestion Core

**Files:**
- Create: `src/organizations/source_identity.py`
- Create: `src/organizations/source_ingestion.py`
- Modify: `src/organizations/source_backfill.py`
- Test: `tests/apps/organizations/test_source_ingestion.py`
- Test: `tests/apps/organizations/test_source_backfill.py`

- [ ] Write failing tests for direct generic source ingestion.
- [ ] Write failing tests for FNS report ingestion with financial lines.
- [ ] Extract identity normalization from backfill into a shared helper.
- [ ] Implement `SourceRecordInput` and `SourceFinancialLineInput`.
- [ ] Implement `OrganizationSourceIngestionService.save_records`.
- [ ] Keep backfill behavior green by using the same identity normalization helper.
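One possible shape for the two input types named in Task 1 is sketched below. Only the class names come from the plan; every field is an assumption that mirrors the record and financial-line columns listed in the design documents of this commit, and the actual implementation may differ.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class SourceFinancialLineInput:
    # Field names are assumed; they follow the financial-line columns in the design.
    form_code: str
    line_code: str
    line_name: str
    year: int
    period_start: Optional[date] = None
    period_end: Optional[date] = None


@dataclass(frozen=True)
class SourceRecordInput:
    # Identity of the record within its source (assumed to be required).
    source: str
    external_id: str
    record_type: str
    # Organization identity fields, normalized later by the ingestion service.
    name: str = ""
    inn: str = ""
    kpp: str = ""
    ogrn: str = ""
    ogrip: str = ""
    # Optional descriptive fields.
    title: str = ""
    record_date: Optional[date] = None
    amount: Optional[Decimal] = None
    status: str = ""
    url: str = ""
    payload: Dict[str, Any] = field(default_factory=dict)
    financial_lines: List[SourceFinancialLineInput] = field(default_factory=list)
```

Frozen dataclasses keep parser adapters from mutating an input after it has been handed to the ingestion service; `default_factory` gives every instance its own `payload` and `financial_lines` containers.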

### Task 2: Parser Save Services

**Files:**
- Modify: `src/apps/parsers/services.py`
- Test: `tests/apps/parsers/test_services.py`

- [ ] Switch generic source saves to `OrganizationSourceIngestionService`.
- [ ] Switch industrial certificate/manufacturer/product saves.
- [ ] Switch inspection and procurement saves.
- [ ] Switch FNS report saves and duplicate checks.
- [ ] Replace period/deduplication helpers with source-record queries.

### Task 3: Parser Tasks

**Files:**
- Modify: `src/apps/parsers/tasks.py`
- Test: `tests/apps/parsers/test_tasks.py`

- [ ] Remove source backfill queueing from parser completion.
- [ ] Keep parser load logs and background job progress unchanged.
- [ ] Return source-record identifiers for FNS processing instead of legacy report ids.

### Task 4: Runtime Reads

**Files:**
- Modify: `src/apps/parsers/source_cards.py`
- Modify: `src/apps/parsers/views.py`
- Modify: `src/apps/parsers/serializers.py`
- Modify: `src/apps/core/admin_dashboard.py`
- Modify: `src/apps/backups/services.py`
- Test: parser source-card and result endpoint tests.

- [ ] Move source card counts and timestamps to source extensions/source records.
- [ ] Move parser log organization counts to source records.
- [ ] Adapt v1 parser result endpoints to read source records.
- [ ] Move dashboard/export runtime reads off legacy parser models.

### Task 5: Frontend Record Detail

**Files:**
- Modify: `mostovik-frontend/src/pages/main/model/source-record-detail/*`
- Test: frontend source-detail/source-record-detail unit tests.

- [ ] Replace legacy generated v1 detail clients with organization source-record reads.
- [ ] Use `payload` plus top-level source-record fields for detail rendering.
- [ ] Keep source-detail lists on the new source-record list endpoint.

### Task 6: Validation

**Files:**
- No production files.

- [ ] Run focused backend parser/organization tests.
- [ ] Run frontend source-detail/source-record-detail checks.
- [ ] Run live parser smoke against one small generic source.
- [ ] Confirm legacy parser record counts do not change during the smoke.
- [ ] Confirm new organization source-record counts do change.
@@ -0,0 +1,100 @@
# Polymorphic Organization Sources Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace source-centric parser output access with organization-centric polymorphic source extensions.

**Architecture:** Keep `Organization` as the root entity. Add polymorphic source extensions per product source group and a shared subordinate source-record table. Backfill legacy parser tables idempotently, then switch API v2 to the new extension data.

**Tech Stack:** Django 3.2, Django REST Framework, django-filter, django-polymorphic, PostgreSQL, pytest.

---

### Task 1: Dependency And Schema

**Files:**
- Modify: `pyproject.toml`
- Modify: `uv.lock`
- Modify: `src/settings/base.py`
- Modify: `src/organizations/models.py`
- Create: `src/organizations/migrations/0006_polymorphic_source_extensions.py`
- Test: `tests/apps/organizations/test_source_extensions_models.py`

- [ ] Add `django-polymorphic` to project dependencies.
- [ ] Add `"polymorphic"` to `INSTALLED_APPS` before local apps.
- [ ] Add source-group and identity-status choices.
- [ ] Add `identity_status` and `primary_identity` to `Organization`.
- [ ] Add `OrganizationSourceExtension` as `PolymorphicModel`.
- [ ] Add source extension subclasses.
- [ ] Add `OrganizationSourceRecord`.
- [ ] Add `OrganizationSourceFinancialLine`.
- [ ] Write tests proving:
  - one extension per `(organization, source_group)`;
  - polymorphic queries return subclass instances;
  - source records are unique by legacy model/pk;
  - financial lines attach to a source record.

### Task 2: Backfill Service

**Files:**
- Create: `src/organizations/source_groups.py`
- Create: `src/organizations/source_backfill.py`
- Create: `src/organizations/management/commands/backfill_organization_sources.py`
- Test: `tests/apps/organizations/test_source_backfill.py`

- [ ] Define source group mapping for all legacy parser sources.
- [ ] Implement organization resolution by `inn + kpp`, `ogrn`, `ogrip`, unique `inn`, then normalized name.
- [ ] Implement idempotent extension creation/update.
- [ ] Implement idempotent source record creation/update.
- [ ] Preserve legacy row payload and `(legacy_model, legacy_pk)`.
- [ ] Backfill financial report lines into `OrganizationSourceFinancialLine`.
- [ ] Report scanned, created organizations, created extensions, updated extensions, created records, updated records, unresolved rows.
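The counters in the last step could be collected in a small report object, as in the sketch below. The counter names follow the list in the plan; the class name and the `merge`/`summary` helpers are hypothetical and only show one way to aggregate per-source results into a command summary.

```python
from dataclasses import asdict, dataclass


@dataclass
class BackfillReport:
    # One counter per item the plan asks the backfill to report.
    scanned: int = 0
    created_organizations: int = 0
    created_extensions: int = 0
    updated_extensions: int = 0
    created_records: int = 0
    updated_records: int = 0
    unresolved_rows: int = 0

    def merge(self, other):
        """Add another report's counters to this one and return self."""
        for name, value in asdict(other).items():
            setattr(self, name, getattr(self, name) + value)
        return self

    def summary(self):
        """Render the counters as a single log-friendly line."""
        return ", ".join(f"{name}={value}" for name, value in asdict(self).items())
```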

### Task 3: API v2 Switch

**Files:**
- Modify: `src/organizations/serializers.py`
- Modify: `src/organizations/filters.py`
- Modify: `src/organizations/views.py`
- Delete or stop using: `src/organizations/api_enrichment.py`
- Delete or stop using: `src/organizations/services.py` snapshot refresh paths
- Test: `tests/apps/organizations/test_api_v2.py`

- [ ] Replace embedded `data` JSON with compact `sources`.
- [ ] Add source extension list/detail serializers.
- [ ] Add source records endpoint.
- [ ] Rework source filters to use `OrganizationSourceExtension`.
- [ ] Remove snapshot dependency from list/retrieve behavior.
- [ ] Keep old snapshot management command only as deprecated/no-op until cleanup.

### Task 4: Parser Write Path

**Files:**
- Modify: `src/apps/parsers/tasks.py`
- Modify: `src/organizations/tasks.py`
- Test: `tests/apps/parsers/test_tasks.py`
- Test: `tests/apps/organizations/test_tasks.py`

- [ ] Replace snapshot refresh queueing with source backfill queueing for affected parser batches.
- [ ] For each parser completion, backfill only the completed source/batch.
- [ ] Keep full backfill command for initial migration and repair.

### Task 5: Frontend Contract Repair

**Files:**
- Modify frontend generated API clients after backend OpenAPI changes.
- Modify source detail table composables to consume `sources` and source records endpoints.

- [ ] Regenerate API client.
- [ ] Update source pages to request extension records instead of embedded `organization.data[source]`.
- [ ] Verify planned inspections page loads from source records.

### Task 6: Cleanup Phase

**Files:**
- Modify migrations only after successful backfill validation.

- [ ] Remove `OrganizationDataSnapshot`.
- [ ] Remove snapshot refresh schedules.
- [ ] Decide which legacy parser tables remain as ingestion staging and which can be dropped.
- [ ] Run full backend and frontend validation.
@@ -0,0 +1,76 @@
# Direct Parser Source Ingestion Design

## Goal

Parser runtime must write parsed source records directly into the organization-centric
polymorphic storage:

- `organizations_organization`
- `organizations_source_extension`
- source extension subclass tables
- `organizations_source_record`
- `organizations_source_financial_line`

Legacy parser record tables remain only as migration/audit inputs until a later
destructive cleanup. They must not be part of the parser runtime write path or the
runtime read path used by the application.

## Current Runtime Problem

Current parser tasks write source rows into legacy parser tables such as
`GenericParserRecord`, `InspectionRecord`, `ProcurementRecord`,
`IndustrialProductRecord`, and `FinancialReport`, then enqueue source backfill into
the new organization storage. This keeps old tables in the hot path and allows new
runtime data to diverge before the async backfill runs.

## Target Runtime

Parser tasks keep using `ParserLoadLog`, `ParserBatchSequence`, and `BackgroundJob`
as operational metadata. Parsed records are converted into normalized source-record
inputs and persisted through one ingestion service.

The ingestion service is responsible for:

- normalizing identity fields before writing canonical organizations;
- resolving or creating `Organization`;
- creating or updating the source-group polymorphic extension;
- creating or updating `OrganizationSourceRecord` by `(source, external_id)`;
- writing structured financial lines for FNS reports;
- refreshing extension counters in the same transaction.
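The upsert-and-count responsibilities listed above can be modeled with an in-memory stand-in. This is a toy sketch, not the Django service: the class and method names are hypothetical, records are plain dictionaries, every input is assumed to carry a non-empty `external_id`, and a real implementation would run the same steps inside one database transaction.

```python
from collections import Counter


class InMemorySourceStore:
    """Toy stand-in for the record table, keyed by (source, external_id)."""

    def __init__(self):
        self.records = {}               # (source, external_id) -> record dict
        self.records_count = Counter()  # (organization_key, source_group) -> count

    def save_records(self, organization_key, source_group, inputs):
        """Upsert inputs and return how many records were inserted or updated."""
        touched = 0
        for item in inputs:
            if not item.get("external_id"):
                # Assumption of this sketch: records without an id are rejected.
                raise ValueError("external_id is required in this sketch")
            key = (item["source"], item["external_id"])
            candidate = {
                **item,
                "organization_key": organization_key,
                "source_group": source_group,
            }
            if self.records.get(key) == candidate:
                continue  # identical record already stored, nothing to write
            self.records[key] = candidate
            touched += 1
        # Refresh all counters once per batch instead of once per record.
        self.records_count = Counter(
            (record["organization_key"], record["source_group"])
            for record in self.records.values()
        )
        return touched
```

Saving the same batch twice leaves the store unchanged and returns zero the second time, which is the idempotent behavior the `(source, external_id)` key is meant to provide.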

Parser save services return the number of inserted or updated source records. They no
longer create or query legacy parser record models for runtime decisions.

## Runtime Read Scope

The following runtime reads must use organization source storage:

- parser source cards and source item counters;
- parser log organization counts;
- source detail lists;
- source record detail reads;
- frontend-facing parser result compatibility endpoints while they remain exposed;
- admin/dashboard/export paths that are used by the app during normal operation.

Legacy parser tables may still be read by explicit migration/backfill tooling only.

## Compatibility

Existing v1 parser-result URLs can remain during transition, but their data source must
be `OrganizationSourceRecord`, not the legacy parser models. Response shape can be
kept best-effort through serializers/adapters that read source-record payloads.

## Non-Goals

- Do not drop legacy parser tables in this phase.
- Do not rewrite parser clients.
- Do not remove parser load logs or background jobs.
- Do not make every payload strongly typed immediately.

## Risks

- Industrial product ingestion is large; the writer must avoid per-record table scans.
- Existing tests assert legacy model counts and must be updated to assert source-record
  behavior.
- Some compatibility endpoints expose legacy primary keys. New records use UUIDs, so
  compatibility adapters must accept source-record UUIDs where needed.
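For the last risk, a compatibility adapter could first decide which kind of identifier it received. The helper below is a hypothetical sketch: it treats anything that parses as a UUID as a source-record identifier and any plain decimal string as a legacy primary key. How each kind is then looked up is left to the endpoint.

```python
import uuid
from typing import Union


def parse_record_identifier(raw: str) -> Union[uuid.UUID, int]:
    """Return a UUID for source-record ids or an int for legacy primary keys."""
    text = raw.strip()
    try:
        return uuid.UUID(text)
    except ValueError:
        pass  # not a UUID, fall through to the legacy integer form
    if text.isascii() and text.isdigit():
        return int(text)
    raise ValueError(f"unsupported record identifier: {raw!r}")
```

A view can branch on the returned type: query by `uid` for a `uuid.UUID`, or by `(legacy_model, legacy_pk)` for an `int`.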
@@ -0,0 +1,222 @@
# Polymorphic Organization Sources Design

## Goal

Replace the current source-centric parser tables and API v2 JSON snapshot model with an organization-centric schema:

- `Organization` is the main business entity and stores legal identity data.
- Each source group is represented as a polymorphic organization extension.
- Detailed source records hang under the extension as subordinate records.
- API compatibility with the current frontend is not required.

## Current Data Facts

The current dev database contains:

- `organizations.Organization`: 29,667 rows.
- `OrganizationDataSnapshot`: 29,667 rows after refresh.
- `registers.Organization`: 5,138 rows.
- `InspectionRecord`: 14,059 rows.
- `ProcurementRecord`: 1,000 rows.
- `IndustrialCertificateRecord`: 23,640 rows.
- `ManufacturerRecord`: 8,762 rows.
- `IndustrialProductRecord`: 471,824 rows.
- `GenericParserRecord`: 3,506 rows.
- `FinancialReport`: 10 rows.

Observed required-field candidates:

- `Organization.name` is present in 100% of canonical organizations and can be required.
- `inn` is present in 95.34% of canonical organizations.
- `ogrn` is present in 74.84%.
- `kpp` is present in 20.56%.
- `ogrip` is present in 8.08%.
- 673 canonical organizations have no `inn`, `ogrn`, or `ogrip`.

Therefore only `name` can be required on `Organization`. Identifiers must stay optional and indexed. Identity completeness should be explicit through a status field, not hidden in nullability.

## Source Groups

The new source groups match the product navigation:

- Financial indicators.
- Government procurements, including 44-FZ, 223-FZ, contracts, and legacy procurement rows.
- Russian manufacturers and products, including certificates, manufacturers, and industrial products.
- Planned inspections.
- Bankruptcy procedures.
- Defense supplier risk, including unfair suppliers and FAS GOZ.
- Arbitration cases.
- Information security registries.
- Vacancies from Trudvsem, HH, and SuperJob.

## Target Schema

### Organization

`organizations.Organization` remains the root table.

Required:

- `uid`
- `name`

Optional but indexed:

- `inn`
- `kpp`
- `ogrn`
- `ogrip`

New fields:

- `identity_status`: one of `complete`, `partial`, `missing`.
- `primary_identity`: short normalized search key used for deterministic deduplication and diagnostics.

Existing uniqueness constraints on non-empty identifiers should remain conservative. They protect the API and backfill from accidental duplicate canonical organizations.
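The design names the two new fields but does not spell out how they are derived, so the sketch below is only one plausible rule set. The assumptions are: `missing` means no `inn`, `ogrn`, or `ogrip` at all (matching the 673 organizations counted above), `complete` means an `inn` plus either `ogrn` or `ogrip`, everything else is `partial`, and the key format of `primary_identity` is invented for illustration.

```python
import re


def digits_only(value):
    """Strip everything except digits; None becomes an empty string."""
    return re.sub(r"\D", "", value or "")


def identity_status(inn="", ogrn="", ogrip=""):
    """Classify identity completeness (assumed rules, see lead-in)."""
    has_inn = bool(digits_only(inn))
    has_registration = bool(digits_only(ogrn) or digits_only(ogrip))
    if has_inn and has_registration:
        return "complete"
    if has_inn or has_registration:
        return "partial"
    return "missing"


def primary_identity(name, inn="", kpp="", ogrn="", ogrip=""):
    """Build a short deterministic key from the strongest identifier present."""
    inn, kpp, ogrn, ogrip = map(digits_only, (inn, kpp, ogrn, ogrip))
    if inn and kpp:
        return f"inn:{inn}:kpp:{kpp}"
    if ogrn:
        return f"ogrn:{ogrn}"
    if ogrip:
        return f"ogrip:{ogrip}"
    if inn:
        return f"inn:{inn}"
    # Last resort: lower-cased name with collapsed whitespace.
    return "name:" + " ".join((name or "").lower().split())
```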

### OrganizationSourceExtension

Add a new polymorphic base model:

- `uid`
- `organization`
- `source_group`
- `title`
- `status`
- `records_count`
- `first_seen_at`
- `last_seen_at`
- `last_load_batch`
- `metadata`
- timestamps

Constraints:

- One extension per `(organization, source_group)`.

Subclasses:

- `FinancialIndicatorsExtension`
- `GovernmentProcurementExtension`
- `IndustrialProductionExtension`
- `PlannedInspectionExtension`
- `BankruptcyExtension`
- `DefenseSupplierExtension`
- `ArbitrationExtension`
- `SecurityRegistryExtension`
- `VacancyExtension`

Rationale: `Organization` itself must not be polymorphic because one organization can have many source groups simultaneously. The polymorphic boundary belongs to source extensions.

### OrganizationSourceRecord

Use one subordinate detail table for most source rows:

- `uid`
- `extension`
- `record_type`
- `source`
- `external_id`
- `title`
- `record_date`
- `amount`
- `status`
- `url`
- `payload`
- `legacy_model`
- `legacy_pk`
- `load_batch`
- timestamps

Constraints:

- Unique `(source, external_id)` when `external_id` is non-empty.
- Unique `(legacy_model, legacy_pk)` for migrated legacy rows.

This keeps the number of tables low while still preserving every source-specific payload.

### FinancialReportLine

Financial report lines remain a subordinate table because they are structured, repeated, and likely to be queried by year/form/line:

- `source_record`
- `form_code`
- `line_code`
- `line_name`
- `year`
- `period_start`
- `period_end`

The existing legacy `FinancialReportLine` table is used only as a staging source after the migration.

## Backfill Rules

Backfill must be idempotent.

For each legacy row:

1. Resolve canonical `Organization` by identifiers in this order:
   - exact `inn + kpp` where available,
   - exact `ogrn`,
   - exact `ogrip`,
   - exact `inn` only when it maps to one canonical organization,
   - normalized name fallback only when there is one unambiguous match.
2. If no organization can be resolved, create or reuse an organization with `identity_status=missing` or `partial`.
3. Create or update the matching `OrganizationSourceExtension`.
4. Create or update `OrganizationSourceRecord`.
5. Preserve the original source row in `payload`.
6. Store `legacy_model` and `legacy_pk` for audit and repeatable updates.
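The resolution order in step 1 can be sketched over plain dictionaries as follows. This is an illustration rather than the backfill service: the function names are hypothetical, organizations are held in a list instead of being queried from the database, and every step (not only the `inn` and name steps) is treated as a match only when exactly one organization qualifies.

```python
def normalize_name(name):
    """Lower-case the name and collapse whitespace."""
    return " ".join((name or "").lower().split())


def _single(matches):
    """Return the only match, or None when there are zero or several."""
    return matches[0] if len(matches) == 1 else None


def resolve_organization(row, organizations):
    """Return the matching organization dict, or None when unresolved."""

    def find(**wanted):
        return [
            org for org in organizations
            if all(org.get(key) == value for key, value in wanted.items())
        ]

    inn, kpp = row.get("inn"), row.get("kpp")
    if inn and kpp:
        match = _single(find(inn=inn, kpp=kpp))
        if match:
            return match
    for field_name in ("ogrn", "ogrip"):
        value = row.get(field_name)
        if value:
            match = _single(find(**{field_name: value}))
            if match:
                return match
    if inn:
        # An INN shared by several organizations is deliberately not used.
        match = _single(find(inn=inn))
        if match:
            return match
    name = normalize_name(row.get("name"))
    if name:
        return _single(
            [org for org in organizations if normalize_name(org.get("name")) == name]
        )
    return None
```

A `None` result corresponds to step 2: the caller creates or reuses an organization with `identity_status` set to `missing` or `partial`.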

## API Shape

The new API should be organization-centric:

- `GET /api/v2/organizations/`
- `GET /api/v2/organizations/{uid}/`
- `GET /api/v2/organizations/{uid}/sources/`
- `GET /api/v2/organization-sources/{uid}/records/`

The list endpoint can expose compact source summaries:

```json
{
  "uid": "...",
  "name": "...",
  "inn": "...",
  "ogrn": "...",
  "identity_status": "complete",
  "sources": [
    {
      "uid": "...",
      "source_group": "planned_inspections",
      "title": "Плановые проверки Генпрокуратуры России",
      "records_count": 12,
      "last_seen_at": "2026-05-18T00:00:00Z"
    }
  ]
}
```

Source records are fetched on demand from the extension, not embedded into every organization list row.
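The compact list row can be assembled from an organization and its extensions without touching the record table. The function below is a hypothetical sketch over plain dictionaries; the key names are taken from the JSON example, while the sorting by `source_group` is an assumption added to keep the output deterministic.

```python
SUMMARY_FIELDS = ("uid", "source_group", "title", "records_count", "last_seen_at")


def organization_summary(organization, extensions):
    """Build a list row that carries source summaries but no source records."""
    return {
        "uid": organization["uid"],
        "name": organization["name"],
        "inn": organization.get("inn", ""),
        "ogrn": organization.get("ogrn", ""),
        "identity_status": organization["identity_status"],
        "sources": [
            # Copy only the summary fields; metadata and records stay out.
            {field: extension[field] for field in SUMMARY_FIELDS}
            for extension in sorted(extensions, key=lambda ext: ext["source_group"])
        ],
    }
```

Because only `records_count` is exposed, the list endpoint depends on the extension counters staying accurate, which is why the ingestion design refreshes them together with the record writes.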

## Migration Phases

1. Add dependency and schema.
2. Add idempotent backfill service and management command.
3. Backfill all existing legacy parser data into source extensions.
4. Switch API v2 to source extensions.
5. Update frontend generated clients and source pages.
6. After verification, remove `OrganizationDataSnapshot` and legacy parser tables in a separate cleanup phase.

## Non-Goals For First Pass

- No destructive deletion of legacy parser tables before backfill verification.
- No attempt to make all source payloads strongly typed immediately.
- No frontend visual redesign; only data contract changes needed for the new schema.

## Risks

- Polymorphic subclass queries add joins. List endpoints must query the base extension compactly and load records only on demand.
- Legacy rows without identifiers require name fallback and can create duplicates. The command must report unresolved/created organizations.
- Current v2 filters use source-specific existence checks. They must move to extension existence filters.
- Frontend generated API clients will break and must be regenerated after backend OpenAPI changes.