In this video, we will demonstrate how to install a package and then use the package to stage our data that lives in the S3 bucket.
Summary of this video:
1. Check out the main/master branch first and then click on create a new branch (feature/stage_tpch_data). If you get the button Pull from Remote, feel free to do so. This is an indication that something has changed on the remote branch and you will want to pull those changes into your local branch.
2. Create a packages.yml file on the same level as your dbt_project.yml file.
3. Add the following code to the file.
packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.0
  - package: dbt-labs/dbt_utils
    version: 0.8.1
The video mentions that installing the dbt_external_tables package will also install the dbt_utils package because it is used as a dependency. While this was true in previous versions, the updated version being used now does not perform the same auto-install. It's important to include the dbt_utils package as shown above when creating your packages.yml file to avoid errors when creating your external tables in step 11.
4. Run dbt deps to install the package. A way to remember this command is deps = dependencies.
5. Create a staging folder in the models folder.
6. Create a folder named tpch inside of the staging folder.
7. Create the sources.yml file inside of the tpch folder.
8. Paste this code into the sources.yml file:
version: 2

sources:
  - name: tpch
    database: dbtworkshop
    schema: tpch_s3_data
    loader: s3
    loaded_at_field: collector_tstamp
    tables:
      - name: lineitem
        external:
          location: "s3://dbt-data-lake-[Insert AWS Account ID]/dbtworkshopdata/lineitem/"
          row_format: serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
          table_properties: "('skip.header.line.count'='1')"
        columns:
          - name: l_orderkey
            data_type: varchar(255)
            description: "primary key"
          - name: l_partkey
            data_type: varchar(255)
          - name: l_suppkey
            data_type: varchar(255)
          - name: l_linenumber
            data_type: varchar(255)
          - name: l_quantity
            data_type: varchar(255)
          - name: l_extendedprice
            data_type: varchar(255)
          - name: l_discount
            data_type: varchar(255)
          - name: l_tax
            data_type: varchar(255)
          - name: l_returnflag
            data_type: varchar(255)
          - name: l_linestatus
            data_type: varchar(255)
          - name: l_shipmode
            data_type: varchar(255)
          - name: l_shipdate
            data_type: timestamp
          - name: l_commitdate
            data_type: timestamp
          - name: l_receiptdate
            data_type: timestamp
      - name: orders
        external:
          location: "s3://dbt-data-lake-[Insert AWS Account ID]/dbtworkshopdata/orders/"
          row_format: serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
          table_properties: "('skip.header.line.count'='1')"
        columns:
          - name: o_orderkey
            data_type: varchar(255)
            description: "primary key"
          - name: o_custkey
            data_type: varchar(255)
          - name: o_orderstatus
            data_type: varchar(255)
          - name: o_totalprice
            data_type: varchar(255)
          - name: o_orderdate
            data_type: varchar(255)
          - name: o_orderpriority
            data_type: varchar(255)
          - name: o_clerk
            data_type: varchar(255)
          - name: o_shippriority
            data_type: varchar(255)
          - name: o_comment
            data_type: varchar(255)
9. Replace [Insert AWS Account ID] in each location with your own AWS account ID so that it matches your S3 bucket, whose name begins with dbt-data-lake-. For example, your lineitem location may look like s3://dbt-data-lake-486758181003/dbtworkshopdata/lineitem/ with different numbers for your S3 bucket name.
10. Do not include the .csv file name in the location; it should point to the folder that holds the file.
11. Run dbt run-operation stage_external_sources to create the external tables.
12. Run dbt docs generate to generate the documentation site.
13. Click view docs to view the documentation.
14. After running dbt run-operation stage_external_sources, go into the Redshift Query editor, connect, and check out the tables in the schema tpch_s3_data.
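To make steps 9 through 11 concrete, here is a rough Python sketch of what happens with the source definition above: the account ID placeholder is filled into the S3 location, and stage_external_sources then issues a create external table statement against the external schema. The helper names are hypothetical and the real macro output differs in detail; treat this purely as an illustration.

```python
# Hypothetical sketch (not the actual dbt_external_tables macro):
# fill the AWS account ID into the S3 location template, then
# render roughly the kind of DDL that stage_external_sources issues.

def build_location(template: str, account_id: str) -> str:
    """Substitute the bucket placeholder; the result must point at
    the folder, never at a .csv file (see steps 9-10)."""
    location = template.replace("[Insert AWS Account ID]", account_id)
    assert location.endswith("/") and ".csv" not in location
    return location

def render_external_table_ddl(schema, table, columns, location):
    """Render an illustrative CREATE EXTERNAL TABLE statement."""
    cols = ",\n    ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"create external table {schema}.{table} (\n"
        f"    {cols}\n"
        f")\n"
        f"row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        f"location '{location}'\n"
        f"table properties ('skip.header.line.count'='1');"
    )

location = build_location(
    "s3://dbt-data-lake-[Insert AWS Account ID]/dbtworkshopdata/lineitem/",
    "486758181003",  # example account ID from the video; use your own
)
ddl = render_external_table_ddl(
    schema="tpch_s3_data",
    table="lineitem",
    # abbreviated column list for the sketch
    columns=[("l_orderkey", "varchar(255)"), ("l_shipdate", "timestamp")],
    location=location,
)
print(ddl)
```

If step 11 errors out, comparing the rendered location against the actual folder layout of your bucket in the S3 console is a quick sanity check.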