Build and query indexes of clinical documents with easy-to-reuse pipelines

Electronic Health Records are a central source of healthcare data, containing structured data alongside unstructured clinical texts. The latter capture detailed reasoning, observations, treatment plans and clinical evolutions, which are crucial for phenotyping and real-world evidence generation. Natural language processing enables the extraction, and thus the subsequent use, of these elements; however, these extractions remain one-off, study-specific efforts. This is detrimental, as the extracted elements could be valuable for future research. We present medkit Seshat, an open-source Python pipeline that (1) ingests free text, (2) recognizes relevant entities, (3) normalizes them with OMOP vocabularies, and (4) builds an index that can be searched either by concept or by document. In addition, we share a flexible web UI to illustrate the value of the built indexes in terms of search, text analysis and export. Seshat aims to facilitate the reuse and adaptation of this prototypical pipeline for various purposes, with the main objective of enabling the secondary use of the results of phenotyping campaigns.
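
The four steps listed in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the Seshat implementation and does not use the medkit API: the lexicon, the `Mention` class, the function names and the concept ids are all hypothetical, and the ids are placeholders rather than real OMOP concept ids. Entity recognition is reduced to case-insensitive dictionary matching, standing in for whatever recognizer a real pipeline would use.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical term -> concept-id lexicon. The ids are placeholders,
# NOT real OMOP concept ids; synonyms share the same id.
LEXICON = {
    "type 2 diabetes": 1001,
    "t2d": 1001,
    "hypertension": 1002,
    "high blood pressure": 1002,
}


@dataclass(frozen=True)
class Mention:
    """One recognized and normalized entity mention."""
    doc_id: str
    start: int
    end: int
    text: str
    concept_id: int


def recognize_and_normalize(doc_id, text):
    """Steps 2 and 3: find lexicon terms and attach their concept id."""
    mentions = []
    for term, concept_id in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append(
                Mention(doc_id, m.start(), m.end(), m.group(0), concept_id)
            )
    # Return mentions in reading order.
    return sorted(mentions, key=lambda mention: mention.start)


def build_index(documents):
    """Steps 1 and 4: ingest free text and build both lookup directions."""
    by_concept = defaultdict(set)    # concept id -> ids of documents mentioning it
    by_document = defaultdict(list)  # document id -> mentions found in it
    for doc_id, text in documents.items():
        for mention in recognize_and_normalize(doc_id, text):
            by_concept[mention.concept_id].add(doc_id)
            by_document[doc_id].append(mention)
    return by_concept, by_document


documents = {
    "doc1": "Patient with T2D and hypertension.",
    "doc2": "History of high blood pressure, no diabetes.",
}
by_concept, by_document = build_index(documents)
```

Searching by concept is then a lookup such as `by_concept[1002]`, and searching by document is `by_document["doc1"]`. Because synonyms are normalized to one id, both documents are retrieved for the same concept even though they use different wording. The sketch ignores negation and context, which a real clinical pipeline would need to handle.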

Origin: Files produced by the author(s)
Licence: CC BY-NC 4.0 (Attribution, NonCommercial)

https://hal.science/hal-05569688

Submitted on: Friday, 27 March 2026, 10:50:01

Last modified on: Thursday, 9 April 2026, 16:02:01

Dates and versions

hal-05569688, version 1 (27-03-2026)

Identifiers

HAL Id: hal-05569688, version 1

Cite

Félix Berthou, Ghilsain Vaillant, Bastien Rance, Adrien Coulet. Build and query indexes of clinical documents with easy-to-reuse pipelines. MIE 2026 - Medical Informatics Europe, EFMI, May 2026, Genova, Italy. ⟨hal-05569688⟩
