In this video, we will demonstrate how to install a package and then use the package to stage our data that lives in the S3 bucket.
Summary of this video:
1. Check out the main/master branch first and then click on create a new branch (feature/stage_tpch_data). If you get the button Pull from Remote, feel free to do so. This is an indication that something has changed on the remote branch and you will want to pull those changes into your branch.
2. Create a packages.yml file on the same level as your dbt_project.yml file.
3. Add the following code to the file:
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0
  - package: dbt-labs/dbt_utils
    version: 0.8.1

The video mentions that installing the dbt_external_tables package will also install the dbt_utils package because it is used as a dependency. While this was true in previous versions, the updated version being used now does not perform the same auto-install. It is important to include the dbt_utils package as shown above when creating your packages.yml file to avoid errors when creating your external tables in step 10.
4. Run dbt deps to install the packages. A way to remember this command is deps = dependencies.
5. Create a staging folder in the models folder.
6. Create a folder named tpch inside of the staging folder.
7. Create the sources.yml file inside of the tpch folder.
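After steps 5-7, your models folder should look like this:

models/
└── staging/
    └── tpch/
        └── sources.yml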
8. Paste this code into the sources.yml file:
version: 2

sources:
  - name: tpch
    database: dbtworkshop
    schema: tpch_s3_data
    loader: s3
    loaded_at_field: collector_tstamp
    tables:
      - name: lineitem
        external:
          location: "s3://dbt-data-lake-[Insert AWS Account ID]/dbtworkshopdata/lineitem/"
          row_format: serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
          table_properties: "('skip.header.line.count'='1')"
        columns:
          - name: l_orderkey
            data_type: varchar(255)
            description: "primary key"
          - name: l_partkey
            data_type: varchar(255)
          - name: l_suppkey
            data_type: varchar(255)
          - name: l_linenumber
            data_type: varchar(255)
          - name: l_quantity
            data_type: varchar(255)
          - name: l_extendedprice
            data_type: varchar(255)
          - name: l_discount
            data_type: varchar(255)
          - name: l_tax
            data_type: varchar(255)
          - name: l_returnflag
            data_type: varchar(255)
          - name: l_linestatus
            data_type: varchar(255)
          - name: l_shipmode
            data_type: varchar(255)
          - name: l_shipdate
            data_type: timestamp
          - name: l_commitdate
            data_type: timestamp
          - name: l_receiptdate
            data_type: timestamp
      - name: orders
        external:
          location: "s3://dbt-data-lake-[Insert AWS Account ID]/dbtworkshopdata/orders/"
          row_format: serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
          table_properties: "('skip.header.line.count'='1')"
        columns:
          - name: o_orderkey
            data_type: varchar(255)
            description: "primary key"
          - name: o_custkey
            data_type: varchar(255)
          - name: o_orderstatus
            data_type: varchar(255)
          - name: o_totalprice
            data_type: varchar(255)
          - name: o_orderdate
            data_type: varchar(255)
          - name: o_orderpriority
            data_type: varchar(255)
          - name: o_clerk
            data_type: varchar(255)
          - name: o_shippriority
            data_type: varchar(255)
          - name: o_comment
            data_type: varchar(255)
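Once the external tables exist, you can reference them from staging models with the source() function. As a minimal sketch (the model name stg_tpch_lineitem and the column renames are illustrative and not part of this video's steps), a model in models/staging/tpch/ might look like this:

with source as (

    select * from {{ source('tpch', 'lineitem') }}

),

renamed as (

    select
        l_orderkey as order_key,
        l_partkey as part_key,
        l_quantity as quantity,
        l_extendedprice as extended_price,
        l_shipdate as ship_date
    from source

)

select * from renamed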
9. Replace dbt-data-lake-[Insert AWS Account ID] with your S3 bucket name. The location will look like s3://dbt-data-lake-486758181003/dbtworkshopdata/lineitem/ with different numbers for your S3 bucket name. Do not include the .csv file name.
10. Run dbt run-operation stage_external_sources to create the external tables.
11. Run dbt docs generate to generate the documentation site.
12. Click view docs to view the documentation.
13. After you run dbt run-operation stage_external_sources, go into the Redshift Query editor, connect, and check out the tables in the schema tpch_s3_data.
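For the check in step 13, a couple of quick queries in the Redshift Query editor (the schema and table names come from the sources.yml above) will confirm the external tables were created and are readable:

select count(*) from tpch_s3_data.lineitem;

select * from tpch_s3_data.orders limit 10;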