SEM Historical Diagnoses Dataset (5 Years Before 2017-2018 Contacts)
melior_pre_sem_diagnoses.Rd
This dataset contains diagnoses, extracted from the Melior journal system, that were recorded within the 5 years preceding healthcare contacts for patients in the SEM cohort. Each diagnosis is linked to a healthcare contact that took place during 2017-2018.
Format
A data frame with 23,631,961 observations and 10 variables:
- contact_id
Character. Unique identifier for each healthcare contact/encounter, serves as a foreign key to link with other datasets. Original field name: KontaktId
- patient_id
Integer. Patient pseudonym identifier, serves as a foreign key to link with other patient-level data. Original field name: Alias
- activity_type
Character. Type of healthcare activity or note. 355 unique values. Most common: "Akutkliniken Läk" (11.0%), "Epikris" (10.2%), "Epikris, tvärprofessionell" (9.1%). Original field name: AktivitetTyp
- diagnosis_type
Character. Type of diagnosis. 8 unique values. Most common: "Huvuddiagnos" (60.5%), "Bidiagnos" (31.9%), "huvuddiagnos" (5.9%), "bidiagnos tillägg ICD10" (1.4%). Original field name: Diagnostyp
- care_episode_start
POSIXct. Start date/time of the care episode for the diagnosis. Date range: 1971-01-01 to 2020-06-05. Distribution by year: 2017 (26.3%), 2016 (20.0%), 2015 (16.7%), 2018 (15.6%), 2014 (14.5%), 2013 (6.9%). 235 NA values (<0.01%). Original field name: VårdtillfälleFörDiagnos_StartDatum
- care_episode_end
POSIXct. End date/time of the care episode for the diagnosis. Date range: 2007-01-27 to 2020-09-21. Distribution by year: 2017 (27.4%), 2016 (19.5%), 2018 (16.8%), 2015 (16.0%), 2014 (13.6%), 2013 (6.5%). 12,393,648 NA values (52.4%), primarily for outpatient episodes where end dates are not typically recorded. Original field name: VårdtillfälleFörDiagnos_SlutDatum
- care_form
Character. Form of care. 2 unique values: "Öppenvård" (Outpatient, 54.6%), "Slutenvård" (Inpatient, 45.4%). Original field name: VårdtillfälleFörDiagnos_VardformText
- diagnosis_code
Character. Patient diagnosis code (ICD-10 code). 15,453 unique values across the dataset. Original field name: PatientDiagnos_Kod
- diagnosis_description
Character. Description of the diagnosis. 14,286 unique values across the dataset. 47,263 NA values (0.2%). Original field name: PatientDiagnos_Beskrivning
- diagnosis_modified_date
POSIXct. Date/time when the diagnosis was recorded/modified. Date range: 2013-04-22 to 2018-12-31. Distribution by year: 2017 (26.6%), 2016 (20.0%), 2015 (16.6%), 2018 (16.1%), 2014 (14.4%), 2013 (6.2%). Original field name: PatientDiagnos_ModifieradDatum
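Because contact_id and patient_id are described as foreign keys, a typical first step is to link diagnosis rows to another table. The following is a minimal sketch in base R, not the package's own linkage code: the `contacts` table, its `contact_date` column, and all sample values are hypothetical and only stand in for whatever contact-level dataset is available.

```r
# Toy stand-in for melior_pre_sem_diagnoses (a few documented columns only;
# the values are invented for illustration).
diagnoses <- data.frame(
  contact_id = c("K1", "K1", "K2", "K3"),
  patient_id = c(101L, 101L, 102L, 103L),
  diagnosis_code = c("I109", "E119", "J189", "I109"),
  stringsAsFactors = FALSE
)

# Hypothetical contact-level table. Only contact_id is documented;
# contact_date is an assumed column name.
contacts <- data.frame(
  contact_id = c("K1", "K2", "K4"),
  contact_date = as.POSIXct(c("2017-03-01", "2018-06-15", "2017-11-30"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Left join: keep every diagnosis row and attach contact_date where a match exists.
linked <- merge(diagnoses, contacts, by = "contact_id", all.x = TRUE)

# Count diagnosis rows whose contact_id has no match in the contact table.
unmatched <- sum(is.na(linked$contact_date))
```

Using `all.x = TRUE` keeps unmatched diagnosis rows visible, so gaps in the linkage can be counted rather than silently dropped.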
Details
This file was extracted from the Melior electronic health record system. The original filename indicates it contains information about diagnoses (Diagnoser) recorded within 5 years prior to healthcare contacts (5ÅrFöreVårdkontakt) during 2017-2018. The diagnostic codes follow the ICD-10 coding system.
This dataset provides a comprehensive view of patients' diagnostic history within 5 years before their inclusion in the SEM cohort, capturing both inpatient and outpatient diagnoses. The large number of records (23.6 million) reflects the substantial healthcare utilization of this patient population in the years preceding their SEM cohort contact.
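As an illustration of summarising this diagnostic history, the sketch below groups ICD-10 codes into their three-character categories (a letter followed by two digits) and counts distinct categories per patient. It assumes that the first three characters of diagnosis_code hold the category, which should be checked against the actual code format; the sample codes and the `icd10_category` column name are made up for the example.

```r
# Toy rows using the documented column names; the values are invented.
diagnoses <- data.frame(
  patient_id = c(101L, 101L, 101L, 102L),
  diagnosis_code = c("I109", "I10", "E119", "J189"),
  stringsAsFactors = FALSE
)

# Three-character ICD-10 category, e.g. "I109" -> "I10".
# Assumption: the category occupies the first three characters of the code.
diagnoses$icd10_category <- substr(diagnoses$diagnosis_code, 1, 3)

# Number of distinct categories recorded per patient.
categories_per_patient <- tapply(
  diagnoses$icd10_category,
  diagnoses$patient_id,
  function(x) length(unique(x))
)
```

On the full dataset a grouped summary with a package such as data.table or dplyr would likely be faster than `tapply`, but the logic is the same.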
Note
Several fields from the original dataset have been omitted for efficiency:
AktivitetTermId: Numeric identifier that almost perfectly corresponded to activity_type
TermId: Numeric identifier that almost perfectly corresponded to diagnosis_type
These fields add file size without contributing significant analytical value.
The care_form field shows that 54.6% of diagnoses were from outpatient care and 45.4% from inpatient care
The care_episode_end field has a high proportion of missing values (52.4%), which is expected for outpatient episodes that typically don't have formal end dates
The diagnosis_type field contains inconsistent capitalization (e.g., "Huvuddiagnos" vs. "huvuddiagnos"); these values are standardized to lowercase in the processed data
There are some anomalous date values outside the expected range in care_episode_start (e.g., one record from 1971), which appear to be data entry errors
Most observations fall within 2013-2018, as expected for a 5-year lookback from contacts in 2017-2018, but a small number of records have dates outside this range
The POSIXct fields (care_episode_start, care_episode_end, diagnosis_modified_date) are stored as date-time values
Original field names are preserved in the documentation for reference
Care episode durations in days can be calculated during analysis from care_episode_start and care_episode_end
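The notes in this section can be turned into a few analysis-time steps. The sketch below uses base R on toy rows and is only one possible approach, not the processing code behind the dataset. The derived column names (`episode_days`, `start_out_of_range`, `care_form_en`) and the window bounds are assumptions chosen for illustration, and the sample values are invented.

```r
# Toy rows using the documented column names; the values are invented.
diagnoses <- data.frame(
  diagnosis_type = c("Huvuddiagnos", "huvuddiagnos", "Bidiagnos"),
  care_form = c("Slutenvård", "Öppenvård", "Slutenvård"),
  care_episode_start = as.POSIXct(
    c("2017-01-10 08:00:00", "2016-05-02 10:30:00", "1971-01-01 00:00:00"),
    tz = "UTC"
  ),
  care_episode_end = as.POSIXct(
    c("2017-01-13 08:00:00", NA, "2017-02-01 00:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Harmonise capitalization so "Huvuddiagnos" and "huvuddiagnos" collapse to one level.
diagnoses$diagnosis_type <- tolower(diagnoses$diagnosis_type)

# Care episode duration in days; NA wherever care_episode_end is missing.
diagnoses$episode_days <- as.numeric(
  difftime(diagnoses$care_episode_end, diagnoses$care_episode_start, units = "days")
)

# Flag start dates outside an assumed expected window (2013 through 2018).
window_start <- as.POSIXct("2013-01-01", tz = "UTC")
window_end <- as.POSIXct("2019-01-01", tz = "UTC")
diagnoses$start_out_of_range <-
  diagnoses$care_episode_start < window_start |
  diagnoses$care_episode_start >= window_end

# English labels for care_form, using the translations given in this Note.
care_form_en <- c("Inpatient", "Outpatient")
names(care_form_en) <- c("Slutenvård", "Öppenvård")
diagnoses$care_form_en <- unname(care_form_en[diagnoses$care_form])
```

Flagging out-of-range rows rather than deleting them keeps the decision about anomalous dates, such as the 1971 record, with the analyst.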
Standard translations of care_form values are:
"Slutenvård" = "Inpatient"
"Öppenvård" = "Outpatient"