Adding names to longitudinal and service sector codes
tutorial_naming_service_sector_data.RmdIntro
This tutorial shows how to append additional information to a
longitudinal or service-sector table using dplyr package
and the fgbq object from the FinnGenUtilsR
package.
Information to add includes: name of the medical codes in English, name for the type of visit, name for the type of provider.
To understand what the fgbq object is, see the dedicated
vignette tutorial_fgbq.
This tutorial also includes how to append additional information to the Kanta lab data. In particular, the name of the OMOP concept for the lab test.
Set up
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(FinnGenUtilsR)Connect to the latest versions of BigQuery tables using
get_fg_bq_tables(). Remember to change the environment to
your sandbox, or set other data freeze versions if needed (see
tutorial_fgbq vignette for more details).
fgbq <- get_fg_bq_tables(
environment = "sandbox-XX"
)
#> Connecting to BigQuery...
#> Using data freeze: dev
#> Finding latest table versions...
#> - minimum_extended: sandbox_tools_dev.minimum_extended_dev_dev
#> - service_sector_detailed_longitudinal: sandbox_tools_dev.finngen_dev_service_sector_detailed_longitudinal_dev
#> - kanta: sandbox_tools_dev.kanta_dev_dev
#> - kidney: sandbox_tools_dev.kidney_dev_dev
#> - vision: sandbox_tools_dev.vision_dev_dev
#> - birth_mother: sandbox_tools_dev.birth_mother_dev_dev
#> - hla_imputed: sandbox_tools_dev.hla_imputed_dev_dev
#> - drug_events: sandbox_tools_dev.drug_events_dev_dev
#> - kanta_medication_delivery: sandbox_tools_dev.kanta_medication_delivery_dev_dev
#> - kanta_prescription: sandbox_tools_dev.kanta_prescription_dev_dev
#> - code_counts: sandbox_tools_dev.code_counts_dev_dev
#> - code_prevalence_stratified: sandbox_tools_dev.code_prevalence_stratified_dev_dev
#> - endpoint_cohorts: sandbox_tools_dev.endpoint_cohorts_dev_dev
#> - fg_codes_info: medical_codes.fg_codes_info_dev
#> - finngen_vnrs: medical_codes.finngen_vnr_dev
#> - omop_concept: finngen_omop_dev_dev.concept
#> Creating table connections (this may take a moment)...
#> Successfully connected to 16 tables in 8.21 secondsWe can access the service sector data using the fgbq
object as follows:
fgbq$tbl$service_sector_detailed_longitudinal |> head()
#> # Source: SQL [?? x 16]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE EVENT_AGE APPROX_EVENT_DAY CODE1 CODE2 CODE3 CODE4 CODE5
#> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 41.9 2018-04-30 C10AA05 NA 119478 1 NA
#> 2 FG00000001 PURCH 42.1 2018-07-01 R03BA07 NA 007549 1 NA
#> 3 FG00000001 PURCH 42.2 2018-08-15 D06BX01 NA 007623 1 NA
#> 4 FG00000001 PURCH 42.6 2018-12-16 R01AD08 NA 018285 1 NA
#> 5 FG00000001 PURCH 41.9 2018-04-14 A06AC01 NA 063219 1 NA
#> 6 FG00000001 PURCH 41.8 2018-03-07 C10AA02 NA 014253 1 NA
#> # ℹ 7 more variables: CODE6 <chr>, CODE7 <chr>, CODE8 <chr>, CODE9 <chr>,
#> # ICDVER <chr>, CATEGORY <chr>, INDEX <chr>Generate a table in service sector format
For example we create a subset of the main service sector table by filtering events for one patient.
# Get events for subject FG00000001
ss_subject_1_tbl <- fgbq$tbl$service_sector_detailed_longitudinal |>
dplyr::filter(FINNGENID == "FG00000001")We can see that this table does not currently have information about the medical codes:
ss_subject_1_tbl |> head()
#> # Source: SQL [?? x 16]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE EVENT_AGE APPROX_EVENT_DAY CODE1 CODE2 CODE3 CODE4 CODE5
#> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 41.9 2018-04-30 C10AA05 NA 119478 1 NA
#> 2 FG00000001 PURCH 42.1 2018-07-01 R03BA07 NA 007549 1 NA
#> 3 FG00000001 PURCH 42.2 2018-08-15 D06BX01 NA 007623 1 NA
#> 4 FG00000001 PURCH 42.6 2018-12-16 R01AD08 NA 018285 1 NA
#> 5 FG00000001 PURCH 41.9 2018-04-14 A06AC01 NA 063219 1 NA
#> 6 FG00000001 PURCH 41.8 2018-03-07 C10AA02 NA 014253 1 NA
#> # ℹ 7 more variables: CODE6 <chr>, CODE7 <chr>, CODE8 <chr>, CODE9 <chr>,
#> # ICDVER <chr>, CATEGORY <chr>, INDEX <chr>Adding info to medical codes
Function fg_dbplyr_append_code_info_to_longitudinal_data
adds new columns with information about the medical code used in each
event/row.
We just need to indicate the table with the translations. This is
also available in the fgbq object.
ss_subject_1_with_translations_tbl <- fg_dbplyr_append_code_info_to_longitudinal_data(
ss_subject_1_tbl,
fg_bq_tables = fgbq
)
ss_subject_1_with_translations_tbl |>
dplyr::select(FINNGENID, SOURCE, APPROX_EVENT_DAY, CODE1, CODE2, CODE3, code, name_en, name_fi, omop_concept_id) |>
head()
#> # Source: SQL [?? x 10]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE APPROX_EVENT_DAY CODE1 CODE2 CODE3 code name_en name_fi
#> <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 2018-04-30 C10AA05 NA 119478 C10AA… atorva… NA
#> 2 FG00000001 PURCH 2018-07-01 R03BA07 NA 007549 R03BA… mometa… NA
#> 3 FG00000001 PURCH 2018-08-15 D06BX01 NA 007623 D06BX… metron… NA
#> 4 FG00000001 PURCH 2018-12-16 R01AD08 NA 018285 R01AD… flutic… NA
#> 5 FG00000001 PURCH 2018-04-14 A06AC01 NA 063219 A06AC… ispagh… NA
#> 6 FG00000001 PURCH 2018-03-07 C10AA02 NA 014253 C10AA… lovast… NA
#> # ℹ 1 more variable: omop_concept_id <chr>If an event/row uses more than one medical code
Some events include more than one medical code, or more than one way
to understand a medical code. Parameters in
fg_dbplyr_append_code_info_to_longitudinal_data allows to
chose how to name the medical code in this situations.
For example:
- PURCH_map_to = “VNR”, tells to use the VNR code in PURCH register, instead of the default ATC
- ICD10fi_map_to = “CODE1”, tells to use only the first code in ICD10fi, instead of the default CODE1*CODE2 combination code
fg_dbplyr_append_code_info_to_longitudinal_data(
ss_subject_1_tbl,
fg_bq_tables = fgbq,
PURCH_map_to = "VNR",
ICD10fi_map_to = "CODE1"
) |>
filter(vocabulary_id == "ICD10fi" | vocabulary_id == "VNRfi") |>
dplyr::select(FINNGENID, SOURCE, APPROX_EVENT_DAY, CODE1, CODE2, vocabulary_id, code, name_en, name_fi, omop_concept_id) |>
head()
#> # Source: SQL [?? x 10]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE APPROX_EVENT_DAY CODE1 CODE2 vocabulary_id code name_en
#> <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 2018-04-30 C10AA05 NA VNRfi 119478 ATORVAS…
#> 2 FG00000001 PURCH 2018-07-01 R03BA07 NA VNRfi 007549 ASMANEX…
#> 3 FG00000001 PURCH 2018-08-15 D06BX01 NA VNRfi 007623 ROSAZOL…
#> 4 FG00000001 PURCH 2018-12-16 R01AD08 NA VNRfi 018285 FLIXONA…
#> 5 FG00000001 PURCH 2018-04-14 A06AC01 NA VNRfi 063219 VI-SIBL…
#> 6 FG00000001 PURCH 2018-03-07 C10AA02 NA VNRfi 014253 LOVASTA…
#> # ℹ 2 more variables: name_fi <chr>, omop_concept_id <chr>Se the help, for more options
?fg_append_code_info_to_longitudinal_data_sqlReduce code precision
ICD10fi, ICD9fi, ICD8fi, ATC, and NCSPfi, code systems reflect the hierarchy of the code by the first letters on them.
It is possible to truncate these medical codes before the information is added. This in practice adds the code and names of the parent codes.
fg_dbplyr_append_code_info_to_longitudinal_data(
ss_subject_1_tbl,
fg_bq_tables = fgbq,
ICD10fi_precision = 3,
ICD9fi_precision = 3,
ICD8fi_precision = 3,
ATC_precision = 3,
NCSPfi_precision = 2
) |>
dplyr::select(FINNGENID, SOURCE, APPROX_EVENT_DAY, CODE1, CODE2, vocabulary_id, code, name_en, name_fi, omop_concept_id) |>
head()
#> # Source: SQL [?? x 10]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE APPROX_EVENT_DAY CODE1 CODE2 vocabulary_id code name_en
#> <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 2018-04-30 C10AA05 NA ATC C10 LIPID MO…
#> 2 FG00000001 PURCH 2018-07-01 R03BA07 NA ATC R03 DRUGS FO…
#> 3 FG00000001 PURCH 2018-08-15 D06BX01 NA ATC D06 ANTIBIOT…
#> 4 FG00000001 PURCH 2018-12-16 R01AD08 NA ATC R01 NASAL PR…
#> 5 FG00000001 PURCH 2018-04-14 A06AC01 NA ATC A06 DRUGS FO…
#> 6 FG00000001 PURCH 2018-03-07 C10AA02 NA ATC C10 LIPID MO…
#> # ℹ 2 more variables: name_fi <chr>, omop_concept_id <chr>We can see in column code the truncated code and in
name_en the name of the truncated code.
Adding info to visit type
Function
fg_dbplyr_append_visit_type_info_to_service_sector_data
adds new columns with information about the visit type.
In the service-sector data, information about the visit type is
defined in CODE5 to CODE9 depending on the SOURCE and time period.
fg_dbplyr_append_visit_type_info_to_service_sector_data
abstracts this nuances, and assign one visit type code per row. Notice
that a visit may contain more than one event/row. Events/rows belonging
to the same visit share the same combination of SOURCE+INDEX.
By default,
fg_bq_append_visit_type_info_to_service_sector_data will
also include two columns is_clinic_visit and
is_follow_up_visit that will be TRUE if the visit is a
clinic visit or a follow-up visit, respectively. However, this works
only if fg_codes_info_table version is v7 or higher. If you
need to use lower versions set the parameters
add_is_clinic_visist and
add_is_follow_up_visit to FALSE. See the help for more
details.
fg_dbplyr_append_visit_type_info_to_service_sector_data(
ss_subject_1_tbl,
fg_bq_tables = fgbq
) |>
dplyr::select(FINNGENID, SOURCE, INDEX, APPROX_EVENT_DAY, CODE5, CODE6, CODE8, CODE9, visit_type_code, visit_type_name_en, is_clinic_visit, is_follow_up_visit) |>
head()
#> # Source: SQL [?? x 12]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE INDEX APPROX_EVENT_DAY CODE5 CODE6 CODE8 CODE9
#> <chr> <chr> <chr> <date> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 1 2018-04-30 NA NA NA NA
#> 2 FG00000001 PURCH 2 2018-07-01 NA NA NA NA
#> 3 FG00000001 PURCH 3 2018-08-15 NA NA NA NA
#> 4 FG00000001 PURCH 4 2018-12-16 NA NA NA NA
#> 5 FG00000001 PURCH 5 2018-04-14 NA NA NA NA
#> 6 FG00000001 PURCH 6 2018-03-07 NA NA NA NA
#> # ℹ 4 more variables: visit_type_code <chr>, visit_type_name_en <chr>,
#> # is_clinic_visit <lgl>, is_follow_up_visit <lgl>Adding info to provider type
Function
fg_dbplyr_append_provider_info_to_service_sector_data adds
new columns with information about the personal or unit that provided
the diagnose during the visit.
In the service-sector data, information about the provider is defined
in CODE6 to CODE7 depending on the SOURCE.
fg_dbplyr_append_provider_info_to_service_sector_data
abstracts this nuances, and assign one provider type code per row.
fg_dbplyr_append_provider_info_to_service_sector_data(
ss_subject_1_tbl,
fg_bq_tables = fgbq
) |>
dplyr::select(FINNGENID, SOURCE, INDEX, APPROX_EVENT_DAY, CODE5, CODE6, CODE8, CODE9, provider_code, provider_name_en, provider_concept_class_id) |>
head()
#> # Source: SQL [?? x 11]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE INDEX APPROX_EVENT_DAY CODE5 CODE6 CODE8 CODE9 provider_code
#> <chr> <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 39 2020-03-13 NA NA NA NA NA
#> 2 FG00000001 PURCH 91 2022-05-09 NA NA NA NA NA
#> 3 FG00000001 PURCH 92 2022-10-07 NA NA NA NA NA
#> 4 FG00000001 PURCH 144 2008-03-06 NA NA NA NA NA
#> 5 FG00000001 PURCH 145 2008-03-12 NA NA NA NA NA
#> 6 FG00000001 PURCH 169 2010-08-09 NA NA NA NA NA
#> # ℹ 2 more variables: provider_name_en <chr>, provider_concept_class_id <chr>Alternative: Running SQL Queries
The previous examples used dplyr syntax with
fgbq$tbl objects. Alternatively, you can work directly with
BigQuery tables using the fg_bq_* family of functions,
which execute SQL queries directly on BigQuery.
This approach is useful when you need more control over the query execution or when working with existing BigQuery table references.
Setup for BigQuery approach
First, we need to set up the BigQuery connection and table references:
library(bigrquery)
# Your BigQuery project ID
project_id <- "atlas-development-270609"
# Path to the fg_codes_info table
fg_codes_info_table <- "atlas-development-270609.medical_codes_dev.fg_codes_info_dev"
# Create a bq_table reference to your service sector data
# This could be a table you've already created in BigQuery
service_sector_bq_table <- bq_table(
project = project_id,
dataset = "sandbox_tools_dev",
table = "finngen_dev_service_sector_detailed_longitudinal_dev"
)Generate a table in service sector format
# Get events for subject FG00000001 directly from BigQuery
ss_subject_1_bq <- bq_project_query(
project_id,
"SELECT * FROM `atlas-development-270609.sandbox_tools_dev.finngen_dev_service_sector_detailed_longitudinal_dev` WHERE FINNGENID = 'FG00000001'"
) Adding info to medical codes with BigQuery
Function fg_bq_append_code_info_to_longitudinal_data
works similarly to the dplyr version, but operates directly
on BigQuery tables:
# Add code information to the service sector table
result_bq_table <- fg_bq_append_code_info_to_longitudinal_data(
bq_project_id = project_id,
bq_table = ss_subject_1_bq,
fg_codes_info_table = fg_codes_info_table
)
# The result is a new temporary table in BigQuery
result_bq_table |> bq_table_download() |>
dplyr::select(FINNGENID, SOURCE, APPROX_EVENT_DAY, CODE1, CODE2, vocabulary_id, code, name_en, name_fi, omop_concept_id) |>
head()
#> # A tibble: 6 × 10
#> FINNGENID SOURCE APPROX_EVENT_DAY CODE1 CODE2 vocabulary_id code name_en
#> <chr> <chr> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 2018-04-30 C10AA05 NA ATC C10AA05 atorva…
#> 2 FG00000001 PURCH 2018-07-01 R03BA07 NA ATC R03BA07 mometa…
#> 3 FG00000001 PURCH 2018-08-15 D06BX01 NA ATC D06BX01 metron…
#> 4 FG00000001 PURCH 2018-12-16 R01AD08 NA ATC R01AD08 flutic…
#> 5 FG00000001 PURCH 2018-04-14 A06AC01 NA ATC A06AC01 ispagh…
#> 6 FG00000001 PURCH 2018-03-07 C10AA02 NA ATC C10AA02 lovast…
#> # ℹ 2 more variables: name_fi <chr>, omop_concept_id <chr>