1.1 Outline key features of the Snowflake Data Cloud
Snowflake is a self-managed service. There is no hardware or software to install, configure, or manage.
All components of Snowflake’s service (other than optional command line clients, drivers, and connectors), run in public cloud infrastructures.
Snowflake’s architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.
Elastic storage
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format.
Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible through SQL query operations run using Snowflake.
Elastic compute
Query execution is performed in the processing layer. Snowflake processes queries using “virtual warehouses”. Each virtual warehouse is an MPP compute cluster composed of multiple compute nodes allocated by Snowflake from a cloud provider.
Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses.
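A minimal sketch of provisioning and resizing an independent warehouse (the warehouse name and settings below are hypothetical):

```sql
-- Hypothetical warehouse; size and suspend settings are illustrative.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60          -- suspend after 60 seconds of inactivity
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Resizing this warehouse has no effect on any other warehouse.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```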
Snowflake’s three distinct layers
- Database Storage (Elastic storage)
- Query Processing (Compute) (Elastic compute)
- Cloud Services
The cloud services layer is a collection of services that coordinate activities across Snowflake:
- Authentication
- Infrastructure management
- Metadata management
- Query parsing and optimization
- Access control
All three layers of Snowflake’s architecture (storage, compute, and cloud services) are deployed and managed entirely on a selected cloud platform.
Cloud partner categories
- Data Integration → e.g. Fivetran, Qlik, Matillion, dbt
- Business Intelligence → e.g. Tableau, Power BI, Looker
- ML & Data Science → e.g. DataRobot, Alteryx, Dataiku, Databricks
- Security, Governance & Observability → e.g. Alation, Collibra
- SQL Development & Management → e.g. DBeaver
- Native Programming Interfaces → Go, Java, .NET, Node.js, C, PHP, Python
Partner Connect lets you easily create trial accounts with selected Snowflake business partners and integrate these accounts with Snowflake. This feature provides a convenient option for trying various 3rd-party tools and services, and then adopting the ones that best meet your business needs.
Partner Connect is limited to account administrators (i.e. users with the ACCOUNTADMIN role) who have a verified email address in Snowflake.
Overview of Snowflake editions
Comparison of key features in different editions:
Feature | Standard | Enterprise | Business Critical |
---|---|---|---|
Time Travel | Maximum 1 day | Maximum 90 days | Maximum 90 days |
Multi-Cluster Warehouses | ❌ No | ✅ Yes | ✅ Yes |
Materialized Views | ❌ No | ✅ Yes | ✅ Yes |
Column Level Security | ❌ No | ✅ Yes | ✅ Yes |
Access History | ❌ No | ✅ Yes | ✅ Yes |
Query Acceleration Service | ❌ No | ✅ Yes | ✅ Yes |
Search Optimization Service | ❌ No | ✅ Yes | ✅ Yes |
Query Statement Encryption | ❌ No | ❌ No | ✅ Yes |
Failover/Failback & Replication of objects other than database and share | ❌ No | ❌ No | ✅ Yes |
Private Link | ❌ No | ❌ No | ✅ Yes |
Tri-Secret Secure Encryption | ❌ No | ❌ No | ✅ Yes |
1.1: Practice Questions
Which of the following terms best describes Snowflake’s database architecture?
A. Columnar shared nothing
B. Cloud-native shared memory
C. Multi-cluster, shared data
D. Shared disk
In which layer of Snowflake's architecture is all security-related information stored?
A. Cloud Services
B. All of the above
C. Compute
D. Storage
In which layer of its architecture does Snowflake store its metadata statistics?
A. Database Layer
B. Cloud Services Layer
C. Compute Layer
D. Storage Layer
What are the main differences between the Virtual Private Snowflake (VPS) and Business Critical Editions? (Select TWO)
A. Snowflake VPS provides a completely separate Snowflake environment, isolated from all other Snowflake accounts, whereas it's not included in the Business Critical Edition.
B. Snowflake VPS provides a dedicated metadata store and pool of computing resources, whereas it's not included in the Business Critical Edition.
C. Snowflake VPS provides customer-managed encryption keys through Tri-Secret secure, whereas it’s not included in the Business Critical Edition.
D. Snowflake VPS provides a direct proxy to virtual networks or on-premises data centers using AWS PrivateLink, whereas it's not included in the Business Critical Edition.
You have two virtual warehouses in your Snowflake account. If one of them updates the data in the storage layer, when will the other one see it?
A. Once all the compute resources are provisioned for the second warehouse.
B. After an average time of 5 seconds.
C. Immediately.
D. After the sync process.
Which role is required to initiate a trial in Snowflake Partner Connect?
A. ACCOUNTADMIN
B. SECURITYADMIN
C. SYSADMIN
D. USERADMIN
1.2 Outline key Snowflake tools and user interfaces
Snowsight
↳ the web interface (UI)
SnowSQL
↳ CLI client (platform-specific versions)
Snowflake connectors
↳ native integrations of third-party applications and database systems
- Kafka connector
- Spark connector
The Snowflake Connector for Spark brings Snowflake into the Apache Spark ecosystem, enabling Spark to read data from, and write data to, Snowflake. From Spark’s perspective, Snowflake looks similar to other Spark data sources (PostgreSQL, HDFS, S3, etc.).
Snowflake drivers
↳ write applications that perform operations on Snowflake
- Go
- JDBC (Java Database Connectivity)
- .NET
- Node.js
- ODBC (Open Database Connectivity)
- PHP PDO (PHP Data Objects)
- Python
JDBC & ODBC → APIs that enable software applications to interact with databases (connect to various types of databases, query data, retrieve results) → a bridge between applications and databases
Snowpark API
↳ Java, Python or Scala
The Snowpark library provides an intuitive API for querying and processing data at scale in Snowflake. Using the library in any of the three languages, you can build applications that process data in Snowflake without moving it to the system where your application code runs, processing at scale as part of the elastic and serverless Snowflake engine.
SnowCD (Connectivity Diagnostic Tool)
↳ helps users to diagnose and troubleshoot their network connection to Snowflake
Streamlit
(only AWS commercial regions)
An open-source Python library that makes it easy to create and share custom web apps for machine learning and data science. Build applications that process and use data in Snowflake without moving data or application code to an external system.
1.2: Practice Questions
What authentication method does the Kafka connector use within Snowflake?
A. Multi-Factor Authentication (MFA)
B. OAuth
C. Key pair authentication
D. Username and password
1.3 Outline Snowflake’s catalog and objects
Databases
Each database consists of one or more schemas, which are logical groupings of database objects, such as tables and views. Snowflake does not place any hard limits on the number of databases, schemas (within a database), or objects (within a schema) you can create.
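A minimal sketch of the database → schema → object hierarchy (all names below are hypothetical):

```sql
-- Each database contains schemas; each schema contains objects such as tables and views.
CREATE DATABASE IF NOT EXISTS sales_db;
CREATE SCHEMA IF NOT EXISTS sales_db.raw;

CREATE TABLE IF NOT EXISTS sales_db.raw.orders (
  order_id   NUMBER,
  order_date DATE,
  amount     NUMBER(12,2)
);
```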
Stages
- External → references data files stored in an external location (outside Snowflake)
  - Permanent
  - Temporary → will be dropped at the end of the session in which it was created
- Internal → stores data files internally within Snowflake
  - User Stage
  - Table Stage
  - Named Stage
    - Permanent
    - Temporary → will be dropped at the end of the session in which it was created; for internal stages, all staged files are dropped as well
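A sketch of creating the different stage kinds (stage names, the S3 URL, and the storage integration below are hypothetical):

```sql
-- Named internal stage (permanent).
CREATE STAGE IF NOT EXISTS my_int_stage;

-- Temporary internal stage: dropped, together with its files, at session end.
CREATE TEMPORARY STAGE my_temp_stage;

-- Named external stage referencing files stored outside Snowflake.
CREATE STAGE IF NOT EXISTS my_ext_stage
  URL = 's3://my-bucket/data/'
  STORAGE_INTEGRATION = my_s3_int;   -- hypothetical storage integration

-- User and table stages exist implicitly and are referenced as @~ and @%<table_name>.
LIST @~;          -- current user's stage
LIST @%orders;    -- table stage of the ORDERS table
```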
An internal or external stage can include a directory table. Directory tables store a catalog of staged files in cloud storage.
Recreating a stage (using CREATE OR REPLACE STAGE) has the following additional, potentially undesirable, outcomes:
- The existing directory table for the stage, if any, is dropped. If the stage is recreated with a directory table, the directory is empty by default.
- The association breaks between the stage and any external table that references it. External tables must be recreated.
- Any pipes that reference the stage stop loading data. The execution status of the pipes changes to STOPPED_STAGE_DROPPED. To resume loading data, these pipe objects must be recreated.
GET_STAGE_LOCATION → retrieve the URL for an external or internal named stage using the stage name as input
GET_RELATIVE_PATH → extract the path of a staged file relative to its location in the stage using the stage name and absolute file path in cloud storage as inputs
SELECT GET_RELATIVE_PATH(@images_stage, 's3://photos/national_parks/us/yosemite/half_dome.jpg');
-- result: us/yosemite/half_dome.jpg
GET_ABSOLUTE_PATH → extract the absolute path of a staged file using the stage name and path of the file relative to its location in the stage as inputs
SELECT GET_ABSOLUTE_PATH(@images_stage, 'us/yosemite/half_dome.jpg');
-- result: s3://photos/national_parks/us/yosemite/half_dome.jpg
A directory table is an implicit object layered on a stage (not a separate database object) and is conceptually similar to an external table because it stores file-level metadata about the data files in the stage. A directory table has no grantable privileges of its own.
Syntax for querying a directory table: SELECT * FROM DIRECTORY( @<stage_name> )
The output from a directory table can include the following columns:
- RELATIVE_PATH → path to the files
- SIZE → size of the file (in bytes)
- LAST_MODIFIED → timestamp when the file was last updated in the stage
- MD5 → MD5 checksum for the file
- ETAG → ETag header for the file
- FILE_URL → Snowflake-hosted file URL to the file
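A sketch of enabling and querying a directory table, reusing the images_stage example from above as a hypothetical stage:

```sql
-- Enable the directory table on an existing stage, then synchronize it with the staged files.
ALTER STAGE images_stage SET DIRECTORY = (ENABLE = TRUE);
ALTER STAGE images_stage REFRESH;

-- Query the catalog of staged files.
SELECT relative_path, size, last_modified, file_url
FROM DIRECTORY(@images_stage)
WHERE relative_path LIKE 'us/yosemite/%'
ORDER BY last_modified DESC;
```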
Schema types
Schemas can be transient. By definition, all tables created in a transient schema are transient. Transient schemas do not have a fail-safe period.
Managed access schemas centralize privilege management with the schema owner. In regular schemas, the owner of an object (i.e. the role that has the OWNERSHIP privilege on the object) can grant further privileges on their objects to other roles. In managed schemas, the schema owner manages all privilege grants, including future grants, on objects in the schema. Object owners retain the OWNERSHIP privilege on their objects; however, only the schema owner can manage privilege grants on the objects.
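A sketch of the distinction (schema, table, and role names are hypothetical):

```sql
-- Managed access schema: only the schema owner (or a role with MANAGE GRANTS)
-- can grant privileges on the objects it contains.
CREATE SCHEMA sales_db.curated WITH MANAGED ACCESS;

-- In a regular schema the object owner could issue this grant itself;
-- in a managed access schema it must come from the schema owner.
GRANT SELECT ON TABLE sales_db.curated.orders TO ROLE analyst_role;
```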
DATA_RETENTION_TIME_IN_DAYS parameter:
- Standard edition: 0 or 1 (default: 1)
- Enterprise edition:
  - permanent schemas: 0 - 90 (default: 1, configurable at account level)
  - transient schemas: 0 or 1 (default: 1)
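A sketch of setting the parameter at different levels (names are hypothetical; the values assume Enterprise Edition):

```sql
-- Account-level default (requires ACCOUNTADMIN).
ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 30;

-- Overrides for a single schema or table.
ALTER SCHEMA sales_db.raw SET DATA_RETENTION_TIME_IN_DAYS = 90;
ALTER TABLE sales_db.raw.orders SET DATA_RETENTION_TIME_IN_DAYS = 0;   -- disables Time Travel for this table
```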
Table types
- Permanent (default)
- Transient
Persist until explicitly dropped and are available to all users with the appropriate privileges. Specifically designed for transitory data that needs to be maintained beyond each session (in contrast to temporary tables). After creation, transient tables cannot be converted to any other table type.
- Temporary
↳ Non-permanent, transitory, session-specific data.
↳ Exists only within a single user session and only within the duration of the session.
↳ Not visible to other users or sessions and do not support some standard features (e.g. cloning).
↳ Once the session ends, the data is not recoverable.
↳ Have no Fail-safe and have a Time Travel retention period of only 0 or 1 day; however, the Time Travel period ends when the table is dropped ⇒ the actual retention period is for 24 hours or the remainder of the session, whichever is shorter.
↳ You can create temporary and non-temporary tables with the same name within the same schema. However, note that the temporary table takes precedence in the session over any other table with the same name in the same schema.
- External Tables
A feature that allows you to query data stored in an external stage as if the data were inside a table in Snowflake. All external tables include the following columns: VALUE, METADATA$FILENAME, METADATA$FILE_ROW_NUMBER. SELECT * always returns the VALUE column, in which all regular or semi-structured data is cast to variant rows.
Recommended file sizing for external tables:
- Parquet files: 256 - 512 MB
- Parquet row groups: 16 - 256 MB (Snowflake can operate on different row groups using different servers)
- All other supported file formats: 16 - 256 MB
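A sketch of creating and querying an external table over the hypothetical my_ext_stage from the stage examples above (the column extracted from VALUE is illustrative):

```sql
-- With no column list, the external table exposes the VALUE variant column
-- plus the METADATA$FILENAME and METADATA$FILE_ROW_NUMBER columns.
CREATE OR REPLACE EXTERNAL TABLE sales_ext
  WITH LOCATION = @my_ext_stage/sales/
  FILE_FORMAT = (TYPE = PARQUET)
  AUTO_REFRESH = FALSE;

SELECT metadata$filename,
       metadata$file_row_number,
       value:"order_id"::NUMBER AS order_id
FROM sales_ext
LIMIT 10;
```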
View types
Non-materialized (standard, regular) views
- A view is basically a named definition of a query.
- A non-materialized view’s results are created by executing the query at the time that the view is referenced in a query.
- Most commonly used.
- Underlying DDL available to any role with access to the view.
Secure views
- The view definition and details are visible only to authorized users (with the role that owns the view) ⇒ improved data privacy and data sharing.
- Both non-materialized and materialized views can be defined as secure.
- Snowflake query optimizer behaves differently when running the SQL contained in a secure view so that it doesn’t accidentally reveal things about the underlying data.
- Secure views can be slower than non-secure views, because the optimizer isn’t used the same way.
Materialized views
- A materialized view’s results are stored which allows for faster access but requires storage space and active maintenance, both of which incur additional costs.
- Behave more like a table.
- In Snowflake, these are auto-refreshed.
- Can provide cost savings, but are inflexible and somewhat limited in use.
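A sketch of the three variants side by side (view and table names are hypothetical):

```sql
-- Non-materialized view: a named query, executed each time the view is referenced.
CREATE OR REPLACE VIEW order_totals AS
  SELECT order_date, SUM(amount) AS total
  FROM sales_db.raw.orders
  GROUP BY order_date;

-- Secure view: the definition is hidden from non-owners and the optimizer
-- avoids shortcuts that could expose underlying data.
CREATE OR REPLACE SECURE VIEW order_totals_secure AS
  SELECT order_date, SUM(amount) AS total
  FROM sales_db.raw.orders
  GROUP BY order_date;

-- Materialized view: results are stored and automatically maintained by Snowflake
-- (Enterprise Edition and above; queries a single table).
CREATE OR REPLACE MATERIALIZED VIEW daily_totals AS
  SELECT order_date, SUM(amount) AS total
  FROM sales_db.raw.orders
  GROUP BY order_date;
```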
Data types
User-Defined Functions (UDFs)
User-Defined Table Functions (UDTFs)
Stored procedures
Streams
Tasks
Pipes
Snowpipe enables loading data from files as soon as they’re available in a stage. The data is loaded according to the COPY statement defined in a referenced pipe.
The COPY statement identifies the source location of the data files (i.e., a stage) and a target table. All data types are supported.
Two mechanisms for detecting the staged files:
- Automating Snowpipe using cloud messaging
- Calling Snowpipe REST endpoints
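A sketch of an auto-ingest pipe, reusing the hypothetical stage and table names from earlier:

```sql
-- AUTO_INGEST = TRUE relies on cloud messaging (e.g. S3 event notifications);
-- without it, loads are triggered by calling the Snowpipe REST endpoints.
CREATE OR REPLACE PIPE sales_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO sales_db.raw.orders
  FROM @my_ext_stage/orders_csv/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```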
Snowpipe load history is stored in the metadata of the pipe for 14 days. Must be requested from Snowflake via a REST endpoint, SQL table function, or ACCOUNT_USAGE view.
When a pipe is recreated, the load history is dropped. In general, this condition only affects users if they subsequently execute an ALTER PIPE … REFRESH statement on the pipe. Doing so could load duplicate data from staged files in the storage location for the pipe if the data was already loaded successfully and the files were not deleted subsequently.
Bulk data load history is stored in the metadata of the target table for 64 days. It is available upon completion of the COPY statement as the statement output.
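For example, recent load history for a (hypothetical) table, covering both Snowpipe and bulk COPY loads, can be read with the COPY_HISTORY table function:

```sql
SELECT file_name, last_load_time, row_count, status
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'SALES_DB.RAW.ORDERS',
  START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));
```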
To optimize the number of parallel operations for a load, it is recommended to produce data files roughly 100-250 MB (or larger) in size when compressed, and to stage files about once per minute.
If a data loading operation continues beyond the maximum allowed duration of 24 hours, it could be aborted without any portion of the file being committed.
Not all COPY INTO options are available during pipe creation (e.g. ON_ERROR or VALIDATION_MODE are not available). See: Usage Notes.
Snowflake automatically manages the compute required to execute the Pipe's COPY INTO commands.
Shares
Sequences
1.3: Practice Questions
Which Snowflake table type is only visible to the user who creates it, can have the same name as permanent tables in the same schema, and is dropped at the end of the session?
A. Transient
B. Local
C. Temporary
D. User
A role is created and owns 2 tables. This role is then dropped. Who will now own the two tables?
A. The assumed role that dropped the role
B. SYSADMIN
C. The user that deleted the role
D. The tables are now orphaned
When a Pipe is recreated using the CREATE OR REPLACE PIPE command:
A. Previously loaded files will be ignored.
B. The Pipe load history is reset to empty.
C. Previously loaded files will be purged.
D. The REFRESH parameter is set to TRUE.
Which of the following is true of Snowpipe via REST API? (Choose two.)
A. Snowpipe removes files after they have been loaded.
B. Snowpipe keeps track of which files it has loaded.
C. All COPY INTO options are available during pipe creation.
D. Snowflake automatically manages the compute required to execute the Pipe's COPY INTO commands.
E. You can only use it on Internal Stages.