SEM Healthcare Diagnoses Dataset (2017-2018)
melior_sem_diagnoses.Rd
Diagnoses recorded during healthcare contacts for patients in the SEM cohort, extracted from the Melior journal system. The data cover diagnoses linked to healthcare contacts during 2017-2018.
Format
A data frame with 1,466,052 observations and 9 variables:
- contact_id
Character. Unique identifier for each healthcare contact (encounter); serves as a foreign key for linking to other datasets. Original field name: KontaktId
- patient_id
Integer. Pseudonymized patient identifier; serves as a foreign key for linking to other patient-level data. Original field name: Alias
- activity_type
Character. Type of healthcare activity or note. 157 unique values. Most common: "Epikris, tvärprofessionell" (36.7%), "Akutkliniken Läk" (30.8%), "Inskrivning Läk" (8.8%). Original field name: AktivitetTyp
- diagnosis_type
Character. Type of diagnosis. 7 unique values with main categories: "Huvuddiagnos"/"huvuddiagnos" (66.4%, primary diagnosis), "Bidiagnos" (31.6%, secondary diagnosis), "bidiagnos tillägg ICD10" (2.1%, secondary diagnosis ICD10 supplement), "Diagnos" (<0.1%), "tillägg ICD-10" (<0.1%), "Preliminär diagnos" (<0.1%). Original field name: Diagnostyp
- care_episode_start
POSIXct. Start date/time of the care episode for the diagnosis. Date range: 2013-02-11 to 2020-04-22. Distribution by year: 2017 (50.2%), 2018 (49.5%), 2016 (0.2%). Original field name: VårdtillfälleFörDiagnos_StartDatum
- care_episode_end
POSIXct. End date/time of the care episode for the diagnosis. Date range: 2013-06-02 to 2020-10-22. Contains 281 NA values (<0.1%). Distribution by year: 2018 (49.9%), 2017 (49.0%), 2019 (1.1%). Original field name: VårdtillfälleFörDiagnos_SlutDatum
- diagnosis_code
Character. Patient diagnosis code (ICD-10 code). 10,786 unique values across the dataset. Original field name: PatientDiagnos_Kod
- diagnosis_description
Character. Description of the diagnosis. 10,484 unique values across the dataset. Contains 1,914 NA values (0.1%). Original field name: PatientDiagnos_Beskrivning
- diagnosis_modified_date
POSIXct. Date/time when the diagnosis was recorded/modified. Date range: 2013-05-31 to 2020-12-03. Distribution by year: 2018 (49.9%), 2017 (48.5%), 2019 (1.4%). Original field name: PatientDiagnos_ModifieradDatum
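As a minimal sketch of how contact_id can be used as a join key, the base R example below selects primary diagnoses and links them to a contact-level table. The toy rows, the `contacts` table, and its `ward` column are hypothetical and exist only for illustration; the diagnosis codes are illustrative and may not match the formatting used in the real data.

```r
# Toy stand-in for melior_sem_diagnoses (illustrative values, not real records)
diagnoses <- data.frame(
  contact_id     = c("K1", "K1", "K2", "K3"),
  patient_id     = c(101L, 101L, 102L, 103L),
  diagnosis_type = c("Huvuddiagnos", "Bidiagnos", "huvuddiagnos", "Bidiagnos"),
  diagnosis_code = c("I21.9", "E11.9", "J18.9", "I10"),
  stringsAsFactors = FALSE
)

# Hypothetical contact-level table; its name and columns are assumptions
contacts <- data.frame(
  contact_id = c("K1", "K2", "K3"),
  ward       = c("A", "B", "A"),
  stringsAsFactors = FALSE
)

# Keep primary diagnoses regardless of capitalization
primary <- diagnoses[tolower(diagnoses$diagnosis_type) == "huvuddiagnos", ]

# Left join: each primary diagnosis keeps its row and gains the contact columns
linked <- merge(primary, contacts, by = "contact_id", all.x = TRUE)
linked
```

Because one contact can carry several diagnoses, joining in the other direction (contacts to diagnoses) will generally produce more than one row per contact.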
Details
This file was extracted from the Melior electronic health record system. The original filename indicates it contains information about diagnoses (Diagnoser) recorded during healthcare contacts (VidVårdkontakt) during 2017-2018. The diagnostic codes follow the ICD-10 coding system.
Although the dataset targets the 2017-2018 period, a small number of records fall outside it: 99.7% of diagnoses are from 2017-2018, about 0.3% are from 2013-2016, and fewer than 0.1% are from 2019-2020. These shares match the year distribution of care_episode_start (50.2% + 49.5% = 99.7%); the other date fields have somewhat more records in 2019.
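Analyses that need only the 2017-2018 period may want to filter explicitly. The sketch below tabulates records by start year and keeps those in the target period. It uses toy rows, and it assumes that care_episode_start is the relevant date field and that UTC is an acceptable time zone; neither is confirmed by the source.

```r
# Toy stand-in for the dataset (illustrative values, not real records)
diagnoses <- data.frame(
  contact_id = c("K1", "K2", "K3", "K4"),
  care_episode_start = as.POSIXct(
    c("2013-02-11 08:00:00", "2017-03-01 10:30:00",
      "2018-12-31 23:59:00", "2020-04-22 09:00:00"),
    tz = "UTC"  # assumption: the real time zone is not documented
  ),
  stringsAsFactors = FALSE
)

# Extract the calendar year of each care episode start
start_year <- as.integer(format(diagnoses$care_episode_start, "%Y"))

# Share of records per year, in percent
round(100 * prop.table(table(start_year)), 1)

# Keep only records whose care episode started in 2017 or 2018
in_period <- diagnoses[start_year %in% 2017:2018, ]
```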
Note
- The following fields from the original dataset have been omitted for efficiency:
  - AktivitetTermId: Numeric identifier that provided no clinical information beyond what is captured in activity_type.
  - VårdtillfälleFörDiagnos_VardformText (care_form): Contained a single value ("Slutenvård" = inpatient care) across all records (100%), so it provided no discriminative information. All records in this dataset are from inpatient care.
- A few care_episode_end values had anomalous years (2028-2029). The processing script corrects these by subtracting 10 years, on the assumption that they are data entry errors.
- The diagnosis_type field is inconsistently capitalized (e.g., "Huvuddiagnos" vs. "huvuddiagnos"); values are standardized to lowercase in the processed data.
- The diagnosis_description field has 1,914 missing values (0.1% of records).
- The three POSIXct fields (care_episode_start, care_episode_end, diagnosis_modified_date) store date-time values, not dates alone.
- Original field names are preserved in the documentation for reference.
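The two cleaning steps described in this note (shifting 2028-2029 end dates back by 10 years and lowercasing diagnosis_type) could look roughly like the base R sketch below. This is not the actual processing script: the helper name `shift_years_back`, the toy rows, and the UTC time zone are assumptions made for illustration.

```r
# Toy stand-in for the raw data (illustrative values, not real records)
raw <- data.frame(
  diagnosis_type   = c("Huvuddiagnos", "huvuddiagnos", "Bidiagnos", "Bidiagnos"),
  care_episode_end = as.POSIXct(
    c("2028-05-14 12:00:00", "2018-11-02 09:30:00", NA, "2029-01-20 16:45:00"),
    tz = "UTC"  # assumption: the real time zone is not documented
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: move date-times in the given years back by `shift` years
shift_years_back <- function(x, bad_years = 2028:2029, shift = 10L) {
  lt <- as.POSIXlt(x)                      # broken-down time; year is stored as year - 1900
  is_bad <- !is.na(x) & (lt$year + 1900L) %in% bad_years
  lt$year[is_bad] <- lt$year[is_bad] - shift
  as.POSIXct(lt)                           # back to POSIXct, keeping the time zone
}

processed <- raw
processed$care_episode_end <- shift_years_back(raw$care_episode_end)

# Standardize capitalization so "Huvuddiagnos" and "huvuddiagnos" collapse to one value
processed$diagnosis_type <- tolower(raw$diagnosis_type)

processed
```

Missing end dates are left as NA, and only rows in the listed years are changed, so records with plausible dates pass through untouched.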