Stage Raw Data

In this video, we will demonstrate how to install a package and then use it to stage our data that lives in the S3 bucket.

Summary of this video:

  1. Check out a new branch. To do this, you will need to check out the main/master branch first and then click Create a new branch (feature/stage_tpch_data). If you see the button Pull from Remote, feel free to click it. This is an indication that something has changed on the remote branch and you will want to pull those changes into your local branch.
  2. Create a packages.yml file at the same level as your dbt_project.yml file.
  3. Add the following code to the file.

    packages:
    - package: dbt-labs/dbt_external_tables
      version: 0.8.0
    - package: dbt-labs/dbt_utils
      version: 0.8.1  

The video mentions that installing the dbt_external_tables package will also install the dbt_utils package, because dbt_external_tables uses it as a dependency. While this was true in previous versions, the updated version used here does not perform the same auto-install. It's important to include the dbt_utils package as shown above when creating your packages.yml file to avoid errors when creating your external tables in step 11.

  4. Run dbt deps to install the packages. A way to remember this command is deps = dependencies.
  5. Create a staging folder in the models folder.
  6. Create a folder named tpch inside of the staging folder.
  7. Create the sources.yml file inside of the tpch folder.
  8. Paste this code into the sources.yml file:

    version: 2

    sources:
      - name: tpch
        database: dbtworkshop
        schema: tpch_s3_data
        loader: s3
        loaded_at_field: collector_tstamp
      
        tables:
                
          - name: lineitem
            external:
                location: "s3://dbt-data-lake-[Insert AWS Account ID]]/dbtworkshopdata/lineitem/" 
                row_format: serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
                table_properties: "('skip.header.line.count'='1')"
            columns:
              - name: l_orderkey
                data_type: varchar(255)
                description: "primary key"
              - name: l_partkey
                data_type: varchar(255)
              - name: l_suppkey
                data_type: varchar(255)
              - name: l_linenumber
                data_type: varchar(255)
              - name: l_quantity
                data_type: varchar(255)            
              - name: l_extendedprice
                data_type: varchar(255)
              - name: l_discount
                data_type: varchar(255)
              - name: l_tax
                data_type: varchar(255)                       
              - name: l_returnflag
                data_type: varchar(255)  
              - name: l_linestatus
                data_type: varchar(255)   
              - name: l_shipmode
                data_type: varchar(255)             
              - name: l_shipdate
                data_type: timestamp  
              - name: l_commitdate
                data_type: timestamp
              - name: l_receiptdate
                data_type: timestamp           
         
          - name: orders
            external:
                location: "s3://dbt-data-lake-[Insert AWS Account ID]/dbtworkshopdata/orders/"      
                row_format: serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
                table_properties: "('skip.header.line.count'='1')"
            columns:
              - name: o_orderkey
                data_type: varchar(255)
                description: "primary key"
              - name: o_custkey
                data_type: varchar(255)
              - name: o_orderstatus
                data_type: varchar(255)
              - name: o_totalprice
                data_type: varchar(255)
              - name: o_orderdate
                data_type: varchar(255)
              - name: o_orderpriority
                data_type: varchar(255)
              - name: o_clerk
                data_type: varchar(255)
              - name: o_shippriority
                data_type: varchar(255)
              - name: o_comment
                data_type: varchar(255)
     
  9. Now go to the S3 Management Console and find the S3 bucket whose name starts with the prefix dbt-data-lake-
  10. Update the location fields in the sources.yml file to point to your lineitem and orders files. They will be almost identical to s3://dbt-data-lake-486758181003/dbtworkshopdata/lineitem/, with different numbers for your S3 bucket name. Do not include the .csv file name.
  11. Run the command dbt run-operation stage_external_sources to create the external tables (a sketch of a staging model that selects from these tables follows these steps).
  12. Run dbt docs generate to generate the documentation site.
  13. Click on view docs to view the documentation.
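
Once the external tables exist, models in the staging folder you created in step 5 can reference them with dbt's source() macro, using the source and table names defined in sources.yml. A minimal sketch of what such a model could look like; the file name stg_tpch_orders.sql and the column selection are illustrative, not from the video:

    -- models/staging/tpch/stg_tpch_orders.sql (illustrative name, not from the video)
    -- Selects from the external table defined in sources.yml and casts a few
    -- of the raw varchar columns to more useful types.
    with source as (

        select * from {{ source('tpch', 'orders') }}

    )

    select
        o_orderkey as order_key,
        o_custkey as customer_key,
        o_orderstatus as order_status,
        cast(o_totalprice as decimal(18, 2)) as total_price,
        cast(o_orderdate as date) as order_date

    from source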
  • If you wish to see the external tables you created via dbt run-operation stage_external_sources, go into the Redshift Query editor, connect, and check out the tables in the tpch_s3_data schema (a query sketch appears at the end of this section).
  • Operations Macros
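
For that last check, a quick query in the Redshift Query editor is enough to confirm the external tables were created. A minimal sketch, using the tpch_s3_data schema and table names from the sources.yml above:

    -- Run in the Redshift Query editor after connecting
    select * from tpch_s3_data.lineitem limit 10;
    select * from tpch_s3_data.orders limit 10;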