Tutorial connection to FinnGen BQ tables
tutorial_fgbq.Rmd
library(FinnGenUtilsR)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, unionGetting Started
The get_fg_bq_tables() function creates a connection to
FinnGen BigQuery tables and provides easy access to all available
tables.
The minimum required argument is the environment which
specifies where you running your queries. Typically, the environment is
sandbox-XX, where XX is the number of the sandbox you are using. You can
find the number of your sandbox in the URL when you are logged in to the
FinnGen Data Access Portal.

Moreover, for internal use it is possible to specify ‘build’ or ‘review’ as environment.
Create Connection to BigQuery Tables
fgbq <- get_fg_bq_tables(
environment = "sandbox-XX"
)
#> Connecting to BigQuery...
#> Using data freeze: dev
#> Finding latest table versions...
#> - minimum_extended: sandbox_tools_dev.minimum_extended_dev_dev
#> - service_sector_detailed_longitudinal: sandbox_tools_dev.finngen_dev_service_sector_detailed_longitudinal_dev
#> - kanta: sandbox_tools_dev.kanta_dev_dev
#> - kidney: sandbox_tools_dev.kidney_dev_dev
#> - vision: sandbox_tools_dev.vision_dev_dev
#> - birth_mother: sandbox_tools_dev.birth_mother_dev_dev
#> - hla_imputed: sandbox_tools_dev.hla_imputed_dev_dev
#> - drug_events: sandbox_tools_dev.drug_events_dev_dev
#> - kanta_medication_delivery: sandbox_tools_dev.kanta_medication_delivery_dev_dev
#> - kanta_prescription: sandbox_tools_dev.kanta_prescription_dev_dev
#> - code_counts: sandbox_tools_dev.code_counts_dev_dev
#> - code_prevalence_stratified: sandbox_tools_dev.code_prevalence_stratified_dev_dev
#> - endpoint_cohorts: sandbox_tools_dev.endpoint_cohorts_dev_dev
#> - fg_codes_info: medical_codes.fg_codes_info_dev
#> - finngen_vnrs: medical_codes.finngen_vnr_dev
#> - omop_concept: finngen_omop_dev_dev.concept
#> Creating table connections (this may take a moment)...
#> Successfully connected to 16 tables in 8.08 secondsBy default get_fg_bq_tables() will find the latest
available data freeze and table versions but you can specify a specific
version if needed.
fgbq <- get_fg_bq_tables(
environment = "sandbox-XX",
dataFreeze = "r13"
)View Connection Information
The bq_tables object displays all relevant connection
information when printed:
fgbq
#> FinnGen BigQuery Tables Handler
#> ================================
#>
#> Environment: build
#> Data Freeze: dev
#> Project: atlas-development-270609
#> Billing Project: atlas-development-270609
#>
#> Available Tables:
#> -----------------
#> minimum_extended atlas-development-270609.sandbox_tools_dev.minimum_extended_dev
#> service_sector_detailed_longitudinal atlas-development-270609.sandbox_tools_dev.finngen_dev_service_sector_detailed_longitudinal_dev
#> kanta atlas-development-270609.sandbox_tools_dev.kanta_dev
#> kidney atlas-development-270609.sandbox_tools_dev.kidney_dev
#> vision atlas-development-270609.sandbox_tools_dev.vision_dev
#> birth_mother atlas-development-270609.sandbox_tools_dev.birth_mother_dev
#> hla_imputed atlas-development-270609.sandbox_tools_dev.hla_imputed_dev
#> drug_events atlas-development-270609.sandbox_tools_dev.drug_events_dev
#> kanta_medication_delivery atlas-development-270609.sandbox_tools_dev.kanta_medication_delivery_dev
#> kanta_prescription atlas-development-270609.sandbox_tools_dev.kanta_prescription_dev
#> code_counts atlas-development-270609.sandbox_tools_dev.code_counts_dev
#> code_prevalence_stratified atlas-development-270609.sandbox_tools_dev.code_prevalence_stratified_dev
#> endpoint_cohorts atlas-development-270609.sandbox_tools_dev.endpoint_cohorts_dev
#> fg_codes_info atlas-development-270609.medical_codes_dev.fg_codes_info_dev
#> finngen_vnrs atlas-development-270609.medical_codes_dev.finngen_vnr_dev
#> omop_concept atlas-development-270609.finngen_omop_dev.conceptThis shows: - Environment (e.g., build, sandbox, review) - Data Freeze version (e.g., dev, r13_v3) - Project ID - Billing Project ID - All available tables with their full paths in case you want to run raw SQL queries
Using BigQuery Tables
The bq_tables object provides access to all tables
through the tbl field. These tables can be used as if they
were tibbles thanks to dbplyr.
Available Tables
You can view all available table names:
names(fgbq$tbl)
#> [1] "minimum_extended"
#> [2] "service_sector_detailed_longitudinal"
#> [3] "kanta"
#> [4] "kidney"
#> [5] "vision"
#> [6] "birth_mother"
#> [7] "hla_imputed"
#> [8] "drug_events"
#> [9] "kanta_medication_delivery"
#> [10] "kanta_prescription"
#> [11] "code_counts"
#> [12] "code_prevalence_stratified"
#> [13] "endpoint_cohorts"
#> [14] "fg_codes_info"
#> [15] "finngen_vnrs"
#> [16] "omop_concept"Accessing Tables
Access individual tables using the $tbl field:
fgbq$tbl$service_sector_detailed_longitudinal |> head()
#> # Source: SQL [?? x 16]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE EVENT_AGE APPROX_EVENT_DAY CODE1 CODE2 CODE3 CODE4 CODE5
#> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00000001 PURCH 41.9 2018-04-30 C10AA05 NA 119478 1 NA
#> 2 FG00000001 PURCH 42.1 2018-07-01 R03BA07 NA 007549 1 NA
#> 3 FG00000001 PURCH 42.2 2018-08-15 D06BX01 NA 007623 1 NA
#> 4 FG00000001 PURCH 42.6 2018-12-16 R01AD08 NA 018285 1 NA
#> 5 FG00000001 PURCH 41.9 2018-04-14 A06AC01 NA 063219 1 NA
#> 6 FG00000001 PURCH 41.8 2018-03-07 C10AA02 NA 014253 1 NA
#> # ℹ 7 more variables: CODE6 <chr>, CODE7 <chr>, CODE8 <chr>, CODE9 <chr>,
#> # ICDVER <chr>, CATEGORY <chr>, INDEX <chr>You can treat these tables as if they were regular tibbles. For example, to filter for events with code “D45” in the OUTPAT source:
# Get events with code "J45" in OUTPAT
fgbq$tbl$service_sector_detailed_longitudinal |>
filter(CODE1 == "J45" & SOURCE == "OUTPAT")
#> # Source: SQL [?? x 16]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE EVENT_AGE APPROX_EVENT_DAY CODE1 CODE2 CODE3 CODE4 CODE5
#> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00255635 OUTPAT 9.39 2007-11-20 J45 NA NA NA 93
#> 2 FG00257218 OUTPAT 23.7 2018-08-26 J45 NA NA NA 93
#> 3 FG00257493 OUTPAT 51.3 2022-03-28 J45 J45 NA NA 83
#> 4 FG00146210 OUTPAT 70.7 2010-04-24 J45 NA NA NA 93
#> 5 FG00147113 OUTPAT 71.6 2018-12-13 J45 J45 NA NA NA
#> 6 FG00147218 OUTPAT 68.1 2019-08-12 J45 NA NA NA NA
#> 7 FG00147218 OUTPAT 64.6 2016-03-07 J45 NA NA NA 91
#> 8 FG00366895 OUTPAT 38.6 2018-11-28 J45 NA NA NA NA
#> 9 FG00367217 OUTPAT 50.7 2015-08-08 J45 NA NA NA 91
#> 10 FG00367695 OUTPAT 42.5 2021-04-16 J45 NA NA NA 83
#> # ℹ more rows
#> # ℹ 7 more variables: CODE6 <chr>, CODE7 <chr>, CODE8 <chr>, CODE9 <chr>,
#> # ICDVER <chr>, CATEGORY <chr>, INDEX <chr>Or other tidyverse functions like
stringr::str_detect()
# Get events with code "J45" in OUTPAT
fgbq$tbl$service_sector_detailed_longitudinal |>
filter(CODE1 == "J45" & SOURCE == "OUTPAT")
#> # Source: SQL [?? x 16]
#> # Database: BigQueryConnection
#> FINNGENID SOURCE EVENT_AGE APPROX_EVENT_DAY CODE1 CODE2 CODE3 CODE4 CODE5
#> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00255635 OUTPAT 9.39 2007-11-20 J45 NA NA NA 93
#> 2 FG00257218 OUTPAT 23.7 2018-08-26 J45 NA NA NA 93
#> 3 FG00257493 OUTPAT 51.3 2022-03-28 J45 J45 NA NA 83
#> 4 FG00146210 OUTPAT 70.7 2010-04-24 J45 NA NA NA 93
#> 5 FG00147113 OUTPAT 71.6 2018-12-13 J45 J45 NA NA NA
#> 6 FG00147218 OUTPAT 68.1 2019-08-12 J45 NA NA NA NA
#> 7 FG00147218 OUTPAT 64.6 2016-03-07 J45 NA NA NA 91
#> 8 FG00366895 OUTPAT 38.6 2018-11-28 J45 NA NA NA NA
#> 9 FG00367217 OUTPAT 50.7 2015-08-08 J45 NA NA NA 91
#> 10 FG00367695 OUTPAT 42.5 2021-04-16 J45 NA NA NA 83
#> # ℹ more rows
#> # ℹ 7 more variables: CODE6 <chr>, CODE7 <chr>, CODE8 <chr>, CODE9 <chr>,
#> # ICDVER <chr>, CATEGORY <chr>, INDEX <chr>Or other tidyverse functions like
stringr::str_detect()
# This will error
fgbq$tbl$service_sector_detailed_longitudinal |>
filter(stringr::str_detect(CODE1, "^J45")) |>
count(CODE1) |>
head()
#> # Source: SQL [?? x 2]
#> # Database: BigQueryConnection
#> CODE1 n
#> <chr> <int64>
#> 1 J458 13542
#> 2 J450 88559
#> 3 J459 183889
#> 4 J45 100414
#> 5 J451 58219Joining Tables
You can join multiple tables together. For example, to get smoking status for asthma patients:
fgbq$tbl$service_sector_detailed_longitudinal |>
filter(stringr::str_detect(CODE1, "^J45")) |>
distinct(FINNGENID) |>
left_join(
fgbq$tbl$minimum_extended,
by = "FINNGENID"
) |>
count(SMOKE2, SMOKE3, sort = TRUE) |>
head()
#> # Source: SQL [?? x 3]
#> # Database: BigQueryConnection
#> # Ordered by: desc(n)
#> SMOKE2 SMOKE3 n
#> <chr> <chr> <int64>
#> 1 no NA 43697
#> 2 NA NA 35806
#> 3 no never 32092
#> 4 NA never 26376
#> 5 no current 16633
#> 6 yes NA 15056Collect results into R
Once you have reduce the size of the total data you can download it
into memory using collect(). For example:
fgbq$tbl$service_sector_detailed_longitudinal |>
filter(stringr::str_detect(CODE1, "^J45")) |>
filter(EVENT_AGE > 50) |>
head() |>
collect()
#> # A tibble: 6 × 16
#> FINNGENID SOURCE EVENT_AGE APPROX_EVENT_DAY CODE1 CODE2 CODE3 CODE4 CODE5
#> <chr> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <chr>
#> 1 FG00388172 OUTPAT 78.7 2010-08-16 J450 NA NA NA 93
#> 2 FG00388172 PRIM_OUT 81.3 2013-04-22 J45 NA NA NA R50
#> 3 FG00388173 PRIM_OUT 87.6 2021-08-13 J45 NA NA NA R20
#> 4 FG00388174 OUTPAT 63.2 2016-05-28 J459 NA NA NA 93
#> 5 FG00388180 PRIM_OUT 62.3 2020-01-10 J459 NA NA NA R20
#> 6 FG00388182 OUTPAT 50.9 2017-01-09 J450 NA NA NA 93
#> # ℹ 7 more variables: CODE6 <chr>, CODE7 <chr>, CODE8 <chr>, CODE9 <chr>,
#> # ICDVER <chr>, CATEGORY <chr>, INDEX <chr>Running SQL Queries
However, if dbplyr is not sufficient, you can still use the
query() method to directly run SQL queries against the
BigQuery tables. For example:
# Write a custom SQL query
sql <- paste0("SELECT FINNGENID, CODE1, SOURCE, APPROX_EVENT_DAY
FROM `", fgbq$tablePaths$service_sector_detailed_longitudinal, "`
WHERE CODE1 = 'J45' AND SOURCE = 'OUTPAT'
LIMIT 5")
# Execute the query
result <- fgbq$query(sql)
# Download the results
bigrquery::bq_table_download(result)
#> # A tibble: 5 × 4
#> FINNGENID CODE1 SOURCE APPROX_EVENT_DAY
#> <chr> <chr> <chr> <date>
#> 1 FG00042388 J45 OUTPAT 2018-08-26
#> 2 FG00042765 J45 OUTPAT 2016-07-12
#> 3 FG00042901 J45 OUTPAT 2021-01-11
#> 4 FG00042913 J45 OUTPAT 2019-12-25
#> 5 FG00030575 J45 OUTPAT 2007-11-20