USING FGconnectionHandler: Adding names to longitudinal and service sector codes • FinnGenUtilsR

Intro

This tutorial show how to append additional information to a longitudinal or service-sector table using dplyr package and the FGconnectionHandler object in the FinnGenUtilsR package.

Information to add includes, name of the medical codes in English, name for the type of visit, name for the type of provider.

To understand what is the FGconnectionHandler is recommended to see the dedicate vignette tutorial_connection_handler.

This tutorial also included how to append additional information to the Kanta lab data. In particular, the name of the OMOP concept for the lab test.

Set up

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(FinnGenUtilsR)

This tutorial has been developed using dummy data under the Atlas development project.

 config <- fg_get_bq_config(
   environment = "atlasDevelopment",
   dataFreezeNumber = 12
   )
#> Set option sqlRenderTempEmulationSchema = 'atlas-development-270609.sandbox'

To execute it in Sandbox, be sure to change the configuration to fit your needs.

 config <- fg_get_bq_config(
   environment = "sandbox-6", # set your sandbox number
   dataFreezeNumber = 12 # set the data freeze you want to use
   )

Create a FGconnectionHandler object with the connection configuration from fg_get_bq_config.

FGconnectionHandler <- create_fg_connection_handler_FromList(config)
#> Connecting using BigQuery driver
#> Created a temporary table named #cars
#> Inserting data took 0.0101 secs
#> Dropping tables 'sscs0frucars' from schema 'atlas-development-270609.sandbox'.
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Executing SQL took 2.06 secs

We can check if the connection has been successful.

FGconnectionHandler$connectionStatusLog
#> # A tibble: 4 × 4
#>   databaseName     type    step                          message                
#>   <chr>            <fct>   <chr>                         <chr>                  
#> 1 atlasDevelopment SUCCESS Check database connection     Valid connection       
#> 2 atlasDevelopment SUCCESS Check temp table creation     can create temp tables 
#> 3 atlasDevelopment SUCCESS sandboxToolsSchema connection Connected to tables: b…
#> 4 atlasDevelopment SUCCESS medicalCodesSchema connection Connected to tables: f…

We can access the service sector data using the connection object as follows:

service_sector_tbl <- FGconnectionHandler$getTblsandboxToolsSchema$finngen_r12_service_sector_detailed_longitudinal_v1()

Where service_sector_tbl is a tbl object that can be used with dplyr functions.

Generate a table in service sector format

For example we create a subset of the main service sector table by filtering the patients with an asthma diagnose (J45) in the INPAT source.

IMPORTANT: this is important to notice that the tables coming from the FGconnectionHandler have all the columns in lower case.

# get events with code "J45" in INPAT
ss_subject_1_tbl  <- FGconnectionHandler$getTblsandboxToolsSchema$finngen_r12_service_sector_detailed_longitudinal_v1() |> 
  dplyr::filter(finngenid == "FG00000001")

We can see that this table does not currently have information about the medical codes:

ss_subject_1_tbl
#> Note: method with signature 'DBIConnection#SQL' chosen for function 'dbQuoteIdentifier',
#>  target signature 'DatabaseConnectorJdbcConnection#SQL'.
#>  "DatabaseConnectorConnection#character" would also be valid
#> # Source:   SQL [?? x 16]
#> # Database: DatabaseConnectorJdbcConnection
#>    finngenid  source  event_age approx_event_day code1 code2 code3 code4 code5
#>    <chr>      <chr>       <dbl> <date>           <chr> <chr> <chr> <chr> <chr>
#>  1 FG00000001 OUTPAT       34.5 2010-11-08       X59   NA    NA    NA    93   
#>  2 FG00000001 OUTPAT       41.7 2018-01-16       W19   NA    NA    NA    NA   
#>  3 FG00000001 OUTPAT       42.5 2018-11-17       X58   NA    NA    NA    NA   
#>  4 FG00000001 OUTPAT       47.0 2023-05-22       M069  NA    NA    NA    NA   
#>  5 FG00000001 OUTPAT       47.0 2023-05-22       Z992  NA    NA    NA    NA   
#>  6 FG00000001 OUTPAT       47.0 2023-05-22       F4322 NA    NA    NA    NA   
#>  7 FG00000001 OUTPAT       47.0 2023-05-22       C5691 NA    NA    NA    NA   
#>  8 FG00000001 OUTPAT       47.0 2023-05-22       G564  G564  NA    NA    NA   
#>  9 FG00000001 OUTPAT       47.0 2023-05-22       N3282 NA    NA    NA    NA   
#> 10 FG00000001 OPER_IN      16.8 1993-02-18       8511  NA    NA    28    NA   
#> # ℹ more rows
#> # ℹ 7 more variables: code6 <chr>, code7 <chr>, code8 <chr>, code9 <chr>,
#> #   icdver <chr>, category <chr>, index <chr>

(same table in html format for exporation)

SERVICE SECTOR

Adding info to medical codes

Function fg_dbplyr_append_code_info_to_longitudinal_data adds new columns with information about the medical code used in each event/row.

We just need to indicate the table with the translations. This is also available in the FGconnectionHandler object.

We recommend to take the one with the highest version number.

ss_subject_1_with_translations_tbl <- fg_dbplyr_append_code_info_to_longitudinal_data(
  ss_subject_1_tbl, 
  FGconnectionHandler$getTblmedicalCodesSchema$fg_codes_info_v7()
  )

ss_subject_1_with_translations_tbl |>  
  dplyr::select(finngenid, source, approx_event_day, code1, code2, code3, code, name_en, name_fi, omop_concept_id) 
#> # Source:   SQL [?? x 10]
#> # Database: DatabaseConnectorJdbcConnection
#>    finngenid  source  approx_event_day code1 code2 code3 code   name_en  name_fi
#>    <chr>      <chr>   <date>           <chr> <chr> <chr> <chr>  <chr>    <chr>  
#>  1 FG00000001 OUTPAT  2010-11-08       X59   NA    NA    X59    Exposur… NA     
#>  2 FG00000001 OUTPAT  2018-01-16       W19   NA    NA    W19    Unspeci… NA     
#>  3 FG00000001 OUTPAT  2018-11-17       X58   NA    NA    X58    Exposur… NA     
#>  4 FG00000001 OUTPAT  2023-05-22       M069  NA    NA    M06.9  Rheumat… NA     
#>  5 FG00000001 OUTPAT  2023-05-22       Z992  NA    NA    Z99.2  Depende… NA     
#>  6 FG00000001 OUTPAT  2023-05-22       F4322 NA    NA    F43.22 Adaptat… Sopeut…
#>  7 FG00000001 OUTPAT  2023-05-22       C5691 NA    NA    C56.91 Maligna… Munasa…
#>  8 FG00000001 OUTPAT  2023-05-22       G564  G564  NA    G56.4  Causalg… NA     
#>  9 FG00000001 OUTPAT  2023-05-22       N3282 NA    NA    N32.82 Instabi… Virtsa…
#> 10 FG00000001 OPER_IN 1993-02-18       8511  NA    NA    8511   Cervica… Kaulak…
#> # ℹ more rows
#> # ℹ 1 more variable: omop_concept_id <chr>

(same table in html format for exporation)

If an event/row uses more than one medical code

Some events include more than one medical code, or more than one way to understand a medical code. Parameters in fg_dbplyr_append_code_info_to_longitudinal_data allows to chose how to name the medical code in this situations.

For example:

PURCH_map_to = “VNR”, tells to use the VNR code in PURCH register, instead of the default ATC
ICD10fi_map_to = “CODE1”, tells to use only the first code in ICD10fi, instead of the default CODE1*CODE2 combination code

fg_dbplyr_append_code_info_to_longitudinal_data(
  ss_subject_1_tbl, 
  FGconnectionHandler$getTblmedicalCodesSchema$fg_codes_info_v7(), 
    PURCH_map_to = "VNR", 
    ICD10fi_map_to = "CODE1" 
  ) |>
  filter(vocabulary_id == "ICD10fi" | vocabulary_id == "VNRfi") |>  
  dplyr::select(finngenid, source, approx_event_day, code1, code2, vocabulary_id, code, name_en, name_fi, omop_concept_id) 
#> # Source:   SQL [?? x 10]
#> # Database: DatabaseConnectorJdbcConnection
#>    finngenid  source approx_event_day code1 code2 vocabulary_id code   name_en  
#>    <chr>      <chr>  <date>           <chr> <chr> <chr>         <chr>  <chr>    
#>  1 FG00000001 OUTPAT 2010-11-08       X59   NA    ICD10fi       X59    Exposure…
#>  2 FG00000001 OUTPAT 2018-01-16       W19   NA    ICD10fi       W19    Unspecif…
#>  3 FG00000001 OUTPAT 2018-11-17       X58   NA    ICD10fi       X58    Exposure…
#>  4 FG00000001 OUTPAT 2023-05-22       M069  NA    ICD10fi       M06.9  Rheumato…
#>  5 FG00000001 OUTPAT 2023-05-22       Z992  NA    ICD10fi       Z99.2  Dependen…
#>  6 FG00000001 OUTPAT 2023-05-22       F4322 NA    ICD10fi       F43.22 Adaptati…
#>  7 FG00000001 OUTPAT 2023-05-22       C5691 NA    ICD10fi       C56.91 Malignan…
#>  8 FG00000001 OUTPAT 2023-05-22       G564  G564  ICD10fi       G56.4  Causalgia
#>  9 FG00000001 OUTPAT 2023-05-22       N3282 NA    ICD10fi       N32.82 Instabil…
#> 10 FG00000001 OUTPAT 2015-05-25       J450  NA    ICD10fi       J45.0  Predomin…
#> # ℹ more rows
#> # ℹ 2 more variables: name_fi <chr>, omop_concept_id <chr>

(same table in html format for exporation)

Se the help, for more options

?fg_append_code_info_to_longitudinal_data_sql

Reduce code precision

ICD10fi, ICD9fi, ICD8fi, ATC, and NCSPfi, code systems reflect the hierarchy of the code by the first letters on them.

It is possible to truncate these medical codes before the information is added. This in practice adds the code and names of the parent codes.

fg_dbplyr_append_code_info_to_longitudinal_data(
  ss_subject_1_tbl, 
  FGconnectionHandler$getTblmedicalCodesSchema$fg_codes_info_v7(), 
  ICD10fi_precision = 3,
  ICD9fi_precision = 3,
  ICD8fi_precision = 3,
  ATC_precision = 3,
  NCSPfi_precision = 2
)|>
  dplyr::select(finngenid, source, approx_event_day, code1, code2, vocabulary_id, code, name_en, name_fi, omop_concept_id) 
#> # Source:   SQL [?? x 10]
#> # Database: DatabaseConnectorJdbcConnection
#>    finngenid  source  approx_event_day code1 code2 vocabulary_id code  name_en  
#>    <chr>      <chr>   <date>           <chr> <chr> <chr>         <chr> <chr>    
#>  1 FG00000001 OUTPAT  2010-11-08       X59   NA    ICD10fi       X59   Exposure…
#>  2 FG00000001 OUTPAT  2018-01-16       W19   NA    ICD10fi       W19   Unspecif…
#>  3 FG00000001 OUTPAT  2018-11-17       X58   NA    ICD10fi       X58   Exposure…
#>  4 FG00000001 OUTPAT  2023-05-22       M069  NA    ICD10fi       M06   Other rh…
#>  5 FG00000001 OUTPAT  2023-05-22       Z992  NA    ICD10fi       Z99   Dependen…
#>  6 FG00000001 OUTPAT  2023-05-22       F4322 NA    ICD10fi       F43   Reaction…
#>  7 FG00000001 OUTPAT  2023-05-22       C5691 NA    ICD10fi       C56   Malignan…
#>  8 FG00000001 OUTPAT  2023-05-22       G564  G564  ICD10fi       G56   Mononeur…
#>  9 FG00000001 OUTPAT  2023-05-22       N3282 NA    ICD10fi       N32   Other di…
#> 10 FG00000001 OPER_IN 1993-02-18       8511  NA    FHL           8511  Cervical…
#> # ℹ more rows
#> # ℹ 2 more variables: name_fi <chr>, omop_concept_id <chr>

(same table in html format for exporation)

We can see in column code the truncated code and in name_en the name of the truncated code.

Adding info to visit type

Function fg_dbplyr_append_visit_type_info_to_service_sector_data adds new columns with information about the visit type.

In the service-sector data, information about the visit type is defined in CODE5 to CODE9 depending on the SOURCE and time period. fg_dbplyr_append_visit_type_info_to_service_sector_data abstracts this nuances, and assign one visit type code per row. Notice that a visit may contain more than one event/row. Events/rows belonging to the same visit share the same combination of SOURCE+INDEX.

By default, fg_bq_append_visit_type_info_to_service_sector_data will also include two columns is_clinic_visit and is_follow_up_visit that will be TRUE if the visit is a clinic visit or a follow-up visit, respectively. However, this works only if fg_codes_info_table version is v7 or higher. If you need to use lower versions set the parameters add_is_clinic_visist and add_is_follow_up_visit to FALSE. See the help for more details.

fg_dbplyr_append_visit_type_info_to_service_sector_data(
  ss_subject_1_tbl, 
  FGconnectionHandler$getTblmedicalCodesSchema$fg_codes_info_v7()
  )|> 
  dplyr::select(finngenid, source, index, approx_event_day, code5, code6, code8, code9, visit_type_code, visit_type_name_en, is_clinic_visit, is_follow_up_visit)
#> # Source:   SQL [?? x 12]
#> # Database: DatabaseConnectorJdbcConnection
#>    finngenid  source  index approx_event_day code5 code6 code8 code9
#>    <chr>      <chr>   <chr> <date>           <chr> <chr> <chr> <chr>
#>  1 FG00000001 OUTPAT  305   2010-11-08       93    10E   NA    NA   
#>  2 FG00000001 OUTPAT  258   2018-01-16       NA    30Q   R52   E    
#>  3 FG00000001 OUTPAT  110   2018-11-17       NA    15Y   R10   6    
#>  4 FG00000001 OUTPAT  97    2023-05-22       NA    80A   R52   E    
#>  5 FG00000001 OUTPAT  97    2023-05-22       NA    80A   R52   E    
#>  6 FG00000001 OUTPAT  97    2023-05-22       NA    80A   R52   E    
#>  7 FG00000001 OUTPAT  97    2023-05-22       NA    80A   R52   E    
#>  8 FG00000001 OUTPAT  97    2023-05-22       NA    80A   R52   E    
#>  9 FG00000001 OUTPAT  97    2023-05-22       NA    80A   R52   E    
#> 10 FG00000001 OPER_IN 105   1993-02-18       NA    90    NA    NA   
#> # ℹ more rows
#> # ℹ 4 more variables: visit_type_code <chr>, visit_type_name_en <chr>,
#> #   is_clinic_visit <chr>, is_follow_up_visit <chr>

(same table in html format for exporation)

Adding info to provider type

Function fg_dbplyr_append_provider_info_to_service_sector_data adds new columns with information about the personal or unit that provided the diagnose during the visit.

In the service-sector data, information about the provider is defined in CODE6 to CODE7 depending on the SOURCE. fg_dbplyr_append_provider_info_to_service_sector_data abstracts this nuances, and assign one provider type code per row.

fg_dbplyr_append_provider_info_to_service_sector_data(
  ss_subject_1_tbl, 
  FGconnectionHandler$getTblmedicalCodesSchema$fg_codes_info_v7()
  )|> 
  dplyr::select(finngenid, source, index, approx_event_day, code5, code6, code8, code9, provider_code, provider_name_en, provider_concept_class_id)
#> # Source:   SQL [?? x 11]
#> # Database: DatabaseConnectorJdbcConnection
#>    finngenid source index approx_event_day code5 code6 code8 code9 provider_code
#>    <chr>     <chr>  <chr> <date>           <chr> <chr> <chr> <chr> <chr>        
#>  1 FG000000… PURCH  308   1997-10-16       NA    NA    NA    NA    NA           
#>  2 FG000000… PURCH  269   2016-12-24       NA    NA    NA    NA    NA           
#>  3 FG000000… PURCH  304   1997-12-12       NA    NA    NA    NA    NA           
#>  4 FG000000… PURCH  65    2021-12-18       NA    NA    NA    NA    NA           
#>  5 FG000000… PRIM_… 265   2013-04-10       R20   T41   NA    NA    NA           
#>  6 FG000000… PURCH  140   2007-07-20       NA    NA    NA    NA    NA           
#>  7 FG000000… PRIM_… 250   2022-08-28       R20   T40   NA    NA    NA           
#>  8 FG000000… PURCH  166   2010-01-20       NA    NA    NA    NA    NA           
#>  9 FG000000… OUTPAT 177   2017-11-03       2     55    NA    NA    55           
#> 10 FG000000… PRIM_… 130   2022-09-29       R20   T40   NA    NA    51321        
#> # ℹ more rows
#> # ℹ 2 more variables: provider_name_en <chr>, provider_concept_class_id <chr>

(same table in html format for exporation)

KANTA LAB DATA

Generate a table in service sector format

For example we create a subset of the main service sector table by querying latest 20 events from subject “FG00000001”:

kanta_subject_1_tbl  <- FGconnectionHandler$getTblsandboxToolsSchema$kanta_r12_v1()|> 
  dplyr::filter(finngenid == "FG00000001") |> 
  dplyr::arrange(desc(approx_event_datetime))

We can see that this table does not currently have information about the medical codes:

kanta_subject_1_tbl
#> # Source:     SQL [?? x 19]
#> # Database:   DatabaseConnectorJdbcConnection
#> # Ordered by: desc(approx_event_datetime)
#>    finngenid  event_age approx_event_datetime test_name test_id omop_concept_id
#>    <chr>          <dbl> <dttm>                <chr>     <chr>             <dbl>
#>  1 FG00000001      71.7 2022-12-07 12:07:00   pt-gfre   656400               -1
#>  2 FG00000001      71.6 2022-11-04 16:18:00   p-crp     1605013         3020460
#>  3 FG00000001      71.6 2022-10-17 12:00:11   b-trom    3002791         3007461
#>  4 FG00000001      71.6 2022-10-15 07:01:00   e-mchc    1557            3003338
#>  5 FG00000001      71.5 2022-09-22 11:53:00   e-mchc    1557            3003338
#>  6 FG00000001      71.5 2022-09-12 12:03:00   s-ca-ion  9010            3016431
#>  7 FG00000001      71.5 2022-09-07 06:00:00   b-trom    2791            3007461
#>  8 FG00000001      71.4 2022-08-31 17:39:14   u-eryt    11344           3002582
#>  9 FG00000001      71.4 2022-08-26 08:57:00   e-mchc    1557            3003338
#> 10 FG00000001      71.4 2022-08-22 08:32:00   b-eryt    1101341         3026361
#> # ℹ more rows
#> # ℹ 13 more variables: measurement_value <dbl>, measurement_unit <chr>,
#> #   measurement_value_harmonized <dbl>, measurement_unit_harmonized <chr>,
#> #   test_outcome <chr>, measurement_status <chr>,
#> #   reference_range_low_value <dbl>, reference_range_high_value <dbl>,
#> #   coding_system_oid <chr>, test_id_source <chr>, test_name_source <chr>,
#> #   measurement_value_source <dbl>, measurement_unit_source <chr>

(same table in html format for exporation)

Adding info to OMOP_CONCEPT_ID codes

Function fg_dbplyr_append_concept_info_data adds a new column with information about the OMOP_CONCEPT_ID name.

omop_schema <- "atlas-development-270609.etl_sam_r12_5k_omop"
kanta_subject_1_tbl_with_omop_name <- kanta_subject_1_tbl |> fg_dbplyr_append_concept_info_data(omop_schema)

kanta_subject_1_tbl_with_omop_name |>  
  dplyr::select(finngenid, event_age, approx_event_datetime, test_name, test_id, omop_concept_id, omop_concept_name) 
#> # Source:   SQL [?? x 7]
#> # Database: DatabaseConnectorJdbcConnection
#>    finngenid  event_age approx_event_datetime test_name test_id omop_concept_id
#>    <chr>          <dbl> <dttm>                <chr>     <chr>             <dbl>
#>  1 FG00000001      67.7 2018-12-17 09:35:00   b-baso    3157            3006315
#>  2 FG00000001      68.9 2020-02-24 11:54:00   b-baso    3157            3006315
#>  3 FG00000001      69.2 2020-06-05 09:37:00   b-baso    9227            3006315
#>  4 FG00000001      66.2 2017-05-22 11:16:00   b-baso    3157            3006315
#>  5 FG00000001      70.9 2022-02-14 11:46:00   u-alb-o   1030            3029937
#>  6 FG00000001      67.0 2018-03-22 08:36:00   e-rdw     1349            3002385
#>  7 FG00000001      68.1 2019-04-16 07:40:00   e-rdw     20408           3002385
#>  8 FG00000001      65.9 2017-02-01 08:05:00   e-rdw     20408           3002385
#>  9 FG00000001      70.6 2021-10-11 08:08:00   b-hb-co   1566            3023081
#> 10 FG00000001      68.9 2020-02-29 07:01:00   li-baktvi 1154                 -1
#> # ℹ more rows
#> # ℹ 1 more variable: omop_concept_name <chr>

(same table in html format for exporation)