Adding Names to Kanta Lab Data
tutorial_naming_kanta_lab_data.RmdIntro
This tutorial demonstrates how to work with Kanta laboratory data
using the FinnGenUtilsR package. The Kanta lab data
contains laboratory measurements with OMOP concept mappings, and this
tutorial shows how to append human-readable names to these OMOP concept
IDs.
To understand what the fgbq object is, see the dedicated
vignette tutorial_fgbq.
Set up
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(FinnGenUtilsR)Connect to the latest versions of BigQuery tables using
get_fg_bq_tables(). Remember to change the environment to
your sandbox, or set other data freeze versions if needed (see
tutorial_fgbq vignette for more details).
fgbq <- get_fg_bq_tables(
environment = "sandbox-XX"
)
#> Connecting to BigQuery...
#> Using data freeze: dev
#> Finding latest table versions...
#> - minimum_extended: sandbox_tools_dev.minimum_extended_dev_dev
#> - service_sector_detailed_longitudinal: sandbox_tools_dev.finngen_dev_service_sector_detailed_longitudinal_dev
#> - kanta: sandbox_tools_dev.kanta_dev_dev
#> - kidney: sandbox_tools_dev.kidney_dev_dev
#> - vision: sandbox_tools_dev.vision_dev_dev
#> - birth_mother: sandbox_tools_dev.birth_mother_dev_dev
#> - hla_imputed: sandbox_tools_dev.hla_imputed_dev_dev
#> - drug_events: sandbox_tools_dev.drug_events_dev_dev
#> - kanta_medication_delivery: sandbox_tools_dev.kanta_medication_delivery_dev_dev
#> - kanta_prescription: sandbox_tools_dev.kanta_prescription_dev_dev
#> - code_counts: sandbox_tools_dev.code_counts_dev_dev
#> - code_prevalence_stratified: sandbox_tools_dev.code_prevalence_stratified_dev_dev
#> - endpoint_cohorts: sandbox_tools_dev.endpoint_cohorts_dev_dev
#> - fg_codes_info: medical_codes.fg_codes_info_dev
#> - finngen_vnrs: medical_codes.finngen_vnr_dev
#> - omop_concept: finngen_omop_dev_dev.concept
#> Creating table connections (this may take a moment)...
#> Successfully connected to 16 tables in 8.06 secondsWe can access the Kanta lab data using the fgbq object
as follows:
fgbq$tbl$kanta |> head()
#> # Source: SQL [?? x 21]
#> # Database: BigQueryConnection
#> FINNGENID EVENT_AGE APPROX_EVENT_DATETIME TEST_NAME TEST_ID OMOP_CONCEPT_ID
#> <chr> <dbl> <dttm> <chr> <chr> <int64>
#> 1 FG00000034 65.7 2022-03-17 07:55:00 na NA -1
#> 2 FG00000035 82.7 2022-08-30 09:58:00 na NA -1
#> 3 FG00000069 66.4 2020-10-16 07:01:00 na NA -1
#> 4 FG00000107 64.7 2019-09-22 13:55:00 na NA -1
#> 5 FG00000172 91.0 2022-09-01 11:59:00 na NA -1
#> 6 FG00000186 82.8 2021-08-04 07:37:00 na NA -1
#> # ℹ 15 more variables: MEASUREMENT_VALUE <dbl>, MEASUREMENT_UNIT <chr>,
#> # MEASUREMENT_VALUE_HARMONIZED <dbl>, MEASUREMENT_UNIT_HARMONIZED <chr>,
#> # MEASUREMENT_VALUE_EXTRACTED <dbl>, MEASUREMENT_VALUE_MERGED <dbl>,
#> # TEST_OUTCOME <chr>, MEASUREMENT_STATUS <chr>,
#> # REFERENCE_RANGE_LOW_VALUE <dbl>, REFERENCE_RANGE_HIGH_VALUE <dbl>,
#> # CODING_SYSTEM_OID <chr>, TEST_ID_SOURCE <chr>, TEST_NAME_SOURCE <chr>,
#> # MEASUREMENT_VALUE_SOURCE <dbl>, MEASUREMENT_UNIT_SOURCE <chr>Generate a table for analysis
For example we create a subset of the main Kanta table by querying latest events from subject “FG00000001”:
kanta_subject_1_tbl <- fgbq$tbl$kanta |>
filter(FINNGENID == "FG00000001") |>
arrange(desc(APPROX_EVENT_DATETIME)) We can see that this table contains OMOP_CONCEPT_ID but not the names of the concepts:
kanta_subject_1_tbl |>
select(FINNGENID, EVENT_AGE, APPROX_EVENT_DATETIME, TEST_NAME, TEST_ID, OMOP_CONCEPT_ID) |>
head()
#> # Source: SQL [?? x 6]
#> # Database: BigQueryConnection
#> # Ordered by: desc(APPROX_EVENT_DATETIME)
#> FINNGENID EVENT_AGE APPROX_EVENT_DATETIME TEST_NAME TEST_ID OMOP_CONCEPT_ID
#> <chr> <dbl> <dttm> <chr> <chr> <int64>
#> 1 FG00000001 71.7 2022-12-07 12:07:00 pt-gfre 656400 -1
#> 2 FG00000001 71.6 2022-11-04 16:18:00 p-crp 1605013 3020460
#> 3 FG00000001 71.6 2022-10-17 12:00:11 b-trom 3002791 3007461
#> 4 FG00000001 71.6 2022-10-15 07:01:00 e-mchc 1557 3003338
#> 5 FG00000001 71.5 2022-09-22 11:53:00 e-mchc 1557 3003338
#> 6 FG00000001 71.5 2022-09-12 12:03:00 s-ca-ion 9010 3016431Adding info to OMOP_CONCEPT_ID codes
Function fg_dbplyr_append_concept_info_data adds a new
column with information about the OMOP_CONCEPT_ID name.
kanta_subject_1_tbl_with_omop_name <- kanta_subject_1_tbl |>
fg_dbplyr_append_concept_info_data(fg_bq_tables = fgbq)
kanta_subject_1_tbl_with_omop_name |>
dplyr::select(FINNGENID, EVENT_AGE, APPROX_EVENT_DATETIME, TEST_NAME, TEST_ID, OMOP_CONCEPT_ID, concept_name) |>
head()
#> Warning: ORDER BY is ignored in subqueries without LIMIT
#> ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?
#> # Source: SQL [?? x 7]
#> # Database: BigQueryConnection
#> # Ordered by: desc(APPROX_EVENT_DATETIME)
#> FINNGENID EVENT_AGE APPROX_EVENT_DATETIME TEST_NAME TEST_ID OMOP_CONCEPT_ID
#> <chr> <dbl> <dttm> <chr> <chr> <int64>
#> 1 FG00000001 70.4 2021-08-24 09:06:00 p-krea 4600 3020564
#> 2 FG00000001 68.0 2019-03-30 12:24:00 p-krea 4600 3020564
#> 3 FG00000001 64.2 2015-06-12 07:00:00 p-krea 4600 3020564
#> 4 FG00000001 68.3 2019-07-10 09:46:00 fp-trigly 1605041 3048773
#> 5 FG00000001 70.6 2021-10-27 07:01:00 fp-kol-hdl 4516 3023602
#> 6 FG00000001 67.6 2018-10-16 08:59:00 fp-kol-hdl 4516 3023602
#> # ℹ 1 more variable: concept_name <chr>(same table in html format for exploration)
#> Warning: ORDER BY is ignored in subqueries without LIMIT
#> ℹ Do you need to move arrange() later in the pipeline or use window_order() instead?