Table of contents
  1. 5.1 Explain how to work with standard data
    1. Estimation functions
    2. Sampling
    3. Supported function types
      1. System functions
      2. Table functions
      3. External functions
      4. User-Defined Functions (UDFs)
    4. Stored procedures
    5. Streams
    6. Tasks
  2. 5.1: Practice Questions
  3. 5.2 Explain how to work with semi-structured data
    1. Supported data formats, data types, and sizes
    2. VARIANT column
    3. Flattening the nested structure
      1. FLATTEN command
      2. LATERAL FLATTEN command
    4. Semi-structured data functions
      1. ARRAY/OBJECT creation and manipulation
      2. Extracting values
      3. Type predicates
  4. 5.2: Practice Questions
  5. 5.3 Explain how to work with unstructured data
    1. Define and use directory tables
    2. SQL file functions. Types of URLs available to access files.
    3. Outline the purpose of User-Defined Functions (UDFs) for data analysis
    4. 5.3: Practice Questions

5.1 Explain how to work with standard data

Estimation functions

There are four groups of aggregate functions that deal with estimation.

Function Group Description Algorithm Function Name
Cardinality estimation Estimate the number of distinct values HyperLogLog HLL() or APPROX_COUNT_DISTINCT() (human-friendly alias) + 5 variations
Similarity estimation Estimate similarity of two or more sets MinHash APPROXIMATE_SIMILARITY which takes in two MinHash states as an argument; 0 indicates the sets have no overlap, 1 indicates the sets are identical
Frequency estimation Estimate frequency of values in a set Space-Saving algorithm APPROX_TOP_K() + 3 variations
Percentile estimation Estimate percentile of values in a set t-Digest algorithm APPROX_PERCENTILE() + 3 variations

Sampling

Table sampling is a convenient way to read a random subset of rows from a table. It returns more representative group of rows than LIMIT.

SAMPLE / TABLESAMPLE → Returns a subset of rows sampled randomly from the specified table.

SELECT * FROM LINEITEM TABLESAMPLE | SAMPLE [samplingMethod] (<probability>);

Two methods (ways) of determining how sample records are generated:

  • Fraction-based: Sample a fraction of a table, with a specified probability for including a given row. The number of rows returned depends on the size of the table and the requested probability.

    SELECT * FROM LINEITEM TABLESAMPLE | SAMPLE BERNOULLI | ROW (50) → includes probability percentage that an individual row will be present in the result.

    SELECT * FROM LINEITEM TABLESAMPLE | SAMPLE SYSTEM | BLOCK (50) → instead of applying the probability to single rows, it is applied to groups of rows.

    A seed can be specified to make the sampling deterministic.

    SELECT * FROM LINEITEM TABLESAMPLE | SAMPLE (50) REPEATABLE | SEED (12345)

  • Fixed-sized: Sample a fixed, specified number of rows. The exact number of specified rows is returned unless the table contains fewer rows.

    SELECT * FROM LINEITEM TABLESAMPLE | SAMPLE BERNOULLI | ROW (50 rows)

    The block size method and seed setting are not supported.

Supported function types

  • Scalar functions → return one value per invocation (call). Typically used for returning one value per row.
  • Aggregate functions → operate on values across rows, typically to perform mathematical calculations such as sum, average and counting. Also cover many statistical calculations.
  • Window functions → subset of aggregate functions, allowing to aggregate on a subset of rows (a window) used as input to a function.
  • Table functions → for each input row, a table function can output zero, one or multiple rows.
  • System functions → provide a way to execute actions in the system, provide information about the system or provide information about queries.
  • User-defined functions
  • External functions

System functions

Snowflake provides the following types of system functions:

  • Control functions that allow you to execute actions in the system (e.g. aborting a query).
  • Information functions that return information about the system (e.g. calculating the clustering depth of a table).
  • Information functions that return information about queries (e.g. information about EXPLAIN plans).

Table functions

External functions

An external function has two key properties: an API integration (account level Snowflake object) and a URL for the proxy service.

External functions can be written in any language as long as they return a type which Snowflake can understand.

We execute external functions just like UDFs, inside a select statement.

User-Defined Functions (UDFs)

You can write user-defined functions (UDFs) to extend the system to perform operations that are not available through the built-in system-defined functions provided by Snowflake.

Supported languages:

  • Java
  • JavaScript
  • Python
  • Scala
  • SQL

Overloading → you can have two functions of the same name as long as they accept a different number of arguments.

Stored procedures

Both UDFs and stored procedures overlap a lot in functionality. Typically, stored procedures are rather used for more administrative tasks.

The creator can specify whether the procedure executes with the privileges of the owner or the caller \(\Rightarrow\) the owner can delegate certain tasks to other roles without giving them explicit privileges.

It is not required to return anything from stored procedures.

Streams

A stream is an object created to view & track DML changes to a source table (Change Data Capture).

When queried, stream has the same structure as the underlying table + 3 additional columns: METADATA$ACTION (insert or delete), METADATA$ISUPDATE (true or false - indicates whether the operation in the action column was a part of an update statement) and METADATA$ROW_ID (unique row identifier - we can track changes to a specific row over time).

Note that a stream itself does not contain any table data. A stream only stores an offset for the source object and returns CDC records by leveraging the versioning history for the source object.

When the first stream for a table is created, several hidden columns are added to the source table and begin storing change tracking metadata. These columns consume a small amount of storage. The CDC records returned when querying a stream rely on a combination of the offset stored in the stream and the change tracking metadata stored in the table.

Types of streams:

  • Standard (delta): tables, directory tables, views. Tracks all DML changes (inserts, updates, deletes including truncates).
  • Append-only: tables, directory tables, views. Tracks inserts only.
  • Insert-only: external tables. Tracks inserts only (overwritten or appended files are essentially handled as new files).

Tasks

A task is an object used to schedule the execution of a SQL statement, a stored procedure or procedural logic using Snowflake Scripting.

To create a task, the ACCOUNTADMIN role or a role with CREATE TASK privilege is required.

To start a task, the ACCOUNTADMIN role or a role with EXECUTE TASK privilege is required + the ownership or OPERATE privilege on the task object.

A DAG can be composed of a maximum of 1000 tasks and one task can be linked to a maximum of 100 other tasks.

All tasks in a DAG must have the same owner and they must exist in the same database and schema.

5.1: Practice Questions

Which stream type can be used for tracking the records in external tables?

A. Standard

B. Append-only

C. External

D. Insert only


5.2 Explain how to work with semi-structured data

Supported data formats, data types, and sizes

VARIANT column

Flattening the nested structure

FLATTEN command

LATERAL FLATTEN command

Semi-structured data functions

ARRAY/OBJECT creation and manipulation

Extracting values

Type predicates

5.2: Practice Questions


5.3 Explain how to work with unstructured data

Define and use directory tables

SQL file functions. Types of URLs available to access files.

Both external (external cloud storage) and internal (i.e. Snowflake) stages support unstructured data.

Scoped URL → Encoded URL that permits temporary access to a staged file without granting privileges to the stage. The URL expires when the persisted query result period ends (i.e. the results cache expires), which is currently 24 hours. Generate: query the BUILD_SCOPED_FILE_URL function

File URL → URL that identifies the database, schema, stage, and file path to a set of files. A role that has sufficient privileges on the stage can access the files. Generate: query the BUILD_STAGE_FILE_URL function or query the directory table for the stage that references the staged files Pre-signed URL → Simple HTTPS URL used to access a file via a web browser. A file is temporarily accessible to users via this URL using a pre-signed access token. The expiration time for the access token is configurable. Generate: Query the GET_PRESIGNED_URL function.

Outline the purpose of User-Defined Functions (UDFs) for data analysis

5.3: Practice Questions

A table needs to be loaded. The input data is in a JSON format and is a concatenation of multiple JSON documents. The file size is 3 GB. A virtual warehouse size Small is being used. The following COPY INTO command was executed:

COPY INTO SAMPLE FROM @~SAMPLE.JSON (TYPE=JSON)

The load failed with this error:

Max LOB size (16777216) exceeded, actual size of parsed column is 17894470.

How can this issue be resolved?

A. Use a larger-sized warehouse.

B. Compress the file and load the compressed file.

C. Set STRIP_OUTER_ARRAY=TRUE in the COPY INTO command.

D. Split the file into multiple files in the recommended size range (100 MB - 250 MB).


Which file functions are non-deterministic? (Select TWO)

A. BUILD_SCOPED_FILE_URL

B. BUILD_STAGE_FILE_URL

C. GET_ABSOLUTE_PATH

D. GET_PRESIGNED_URL

E. GET_RELATIVE_PATH