Snowflake SnowPro Advanced: Data Engineer (DEA-C02) Sample Questions:
1. You are designing a data governance strategy for a Snowflake data warehouse. You need to track data lineage for compliance purposes. Specifically, you need to identify all downstream tables that depend on a specific column in a source table. Which combination of Snowflake features and techniques would you use to achieve this goal effectively?
A) Utilize Snowflake's data lineage feature in conjunction with object tagging. Tag relevant columns and tables, then query the lineage views to trace dependencies.
B) Use Snowflake's INFORMATION_SCHEMA views (TABLES, COLUMNS) and regularly audit user query history to manually reconstruct the data lineage.
C) Implement a custom data lineage tracking system by parsing all SQL queries executed in the Snowflake environment and storing the dependencies in a separate metadata database.
D) Rely solely on user documentation and training to ensure data lineage is properly documented and maintained. Implement strict naming conventions for tables and columns.
E) Use Snowflake's ACCOUNT_USAGE views related to query history and object dependencies, combined with a custom script to recursively trace data lineage based on SQL operations (e.g., INSERT INTO ... SELECT).
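The recursive tracing mentioned in option E can be sketched in Python. This is only an illustration, assuming the column-level dependency edges have already been extracted elsewhere (for example, from parsed INSERT INTO ... SELECT statements). The function name `downstream_tables`, the edge format, and the example table names are hypothetical and are not part of any Snowflake API.

```python
from collections import defaultdict, deque

def downstream_tables(edges, start):
    """Return the set of tables reachable downstream from a (table, column) pair.

    edges: iterable of ((src_table, src_col), (dst_table, dst_col)) pairs,
           assumed to have been derived elsewhere from query history.
    """
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)

    seen = {start}
    queue = deque([start])
    tables = set()
    # Breadth-first traversal; `seen` guards against cycles in the lineage graph.
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                tables.add(nxt[0])
    return tables

# Hypothetical edges, for illustration only.
EXAMPLE_EDGES = [
    (("RAW.ORDERS", "AMOUNT"), ("STG.ORDERS", "AMOUNT")),
    (("STG.ORDERS", "AMOUNT"), ("MART.REVENUE", "TOTAL")),
    (("RAW.ORDERS", "ID"), ("STG.ORDERS", "ID")),
]
```

Calling `downstream_tables(EXAMPLE_EDGES, ("RAW.ORDERS", "AMOUNT"))` walks every hop from that column and collects the tables it feeds, directly or indirectly.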
2. You're tasked with optimizing a Snowflake data pipeline that transforms and loads data into a target table. The pipeline uses a series of complex SQL queries with multiple joins and aggregations. After analyzing the query execution plans, you identify a few key bottlenecks. Which of the following optimization techniques would MOST directly address common performance bottlenecks in such a data pipeline within Snowflake? (Select TWO)
A) Applying appropriate clustering keys to the target table, ensuring that commonly used filter columns are included in the clustering key definition.
B) Increasing the virtual warehouse size to the largest available option (e.g., X-Large). The bigger the warehouse, the faster all queries will run.
C) Utilizing materialized views to pre-compute and store the results of expensive aggregations, and ensuring the query optimizer rewrites queries to use the materialized views where applicable.
D) Converting all SQL queries to use stored procedures for improved performance and security. Stored procedures execute faster than ad-hoc SQL.
E) Disabling the query result cache to ensure that the most up-to-date data is always used.
3. You are monitoring a Snowpipe pipeline that loads data from an external stage into a Snowflake table. You observe the following error messages in the PIPE ERRORS view: 'Invalid UTF-8 detected in string'. The data files on the stage are encoded in UTF-8. Which of the following actions, taken individually or in combination, are MOST likely to resolve this issue? (Select TWO)
A) Verify the data files on the stage are actually valid UTF-8 and contain no corrupted characters.
B) Modify the COPY INTO statement to include the ON_ERROR = 'SKIP_FILE' option.
C) Convert the problematic files to UTF-16 encoding before loading them into the stage.
D) Drop and recreate the external stage with 'TYPE = INTERNAL'.
E) Ensure the file format definition explicitly specifies ENCODING = 'UTF8'.
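The check described in option A can be approximated locally before files reach the stage. The sketch below is a minimal example that relies only on Python's built-in UTF-8 decoder. The helper name `find_invalid_utf8` is hypothetical, and it says nothing about how Snowflake itself validates input.

```python
def find_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None
```

A file would typically be read in binary mode (`open(path, "rb").read()`) and passed to this helper. A non-None result points at the first byte worth inspecting.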
4. A financial institution needs to mask sensitive customer data (PII) in a 'CUSTOMER' table. The table contains columns such as 'CUSTOMER_ID', 'FIRST_NAME', 'LAST_NAME', 'CREDIT_CARD', and 'ADDRESS'. The data should be masked differently for different roles: the 'ANALYST' role should see obfuscated values for names and addresses, while the 'SUPPORT' role should see the last four digits of the credit card and a hashed version of the address. 'CUSTOMER_ID' should never be masked. Assume a central masking policy called 'PII_MASKING_POLICY' already exists. Which of the following approaches is the MOST efficient and secure way to achieve this?
A) Create multiple masking policies, one for each role and sensitive column combination, each with the appropriate masking expression. Then, apply each masking policy individually to its respective column. Use the function to implement role-based masking within each policy.
B) Create a single masking policy with a complex stored procedure that checks the current role and applies different masking functions accordingly, then apply this policy to all sensitive columns.
C) Create a view for each role that applies masking functions to the columns. Grant SELECT access on those views to the relevant roles.
D) Create external functions to handle the complex masking logic and call them from the masking policy.
E) Create multiple masking policies with different masking expressions and apply them directly to the columns based on the role using conditional expressions within the policies. Use CASE statements within the masking policy to differentiate between roles.
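The role-based branching that a CASE expression performs inside a masking policy can be mirrored in plain Python to make the logic concrete. This is a rough sketch, not Snowflake policy syntax. The function name `mask_value`, the underscore spelling of the column names, the placeholder string, and the behaviour for roles or columns the question does not describe are all assumptions.

```python
import hashlib

def mask_value(column, value, role):
    """Mimic a CASE expression that branches on the caller's role."""
    if column == "CREDIT_CARD":
        if role == "SUPPORT":
            # SUPPORT sees only the last four digits.
            return "*" * (len(value) - 4) + value[-4:]
        # Assumption: every other role sees a fully masked number.
        return "*" * len(value)
    if column == "ADDRESS":
        if role == "SUPPORT":
            # SUPPORT sees a hashed version of the address.
            return hashlib.sha256(value.encode("utf-8")).hexdigest()
        # Assumption: other roles (e.g. ANALYST) see a fixed placeholder.
        return "***MASKED***"
    # Columns without a rule here, such as CUSTOMER_ID, pass through unmasked.
    return value
```

Keeping one branch per role in a single function parallels keeping the role logic inside the policy body rather than spreading it across views.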
5. You are tasked with building a robust data quality monitoring system for a Snowflake data pipeline. The pipeline processes customer order data and loads it into a 'CUSTOMER_ORDERS' table. You need to implement checks to ensure that certain critical columns (e.g., 'ORDER_ID', 'CUSTOMER_ID', 'ORDER_DATE') meet specific data quality requirements (e.g., not null, valid format, within an acceptable range). You want to design a flexible and scalable solution that allows you to easily add, modify, and monitor data quality rules. Which of the following options would implement this and scale efficiently? Assume there is a central Data Quality table for each metric. (Select TWO)
A) Build a set of custom Snowflake Native Apps to monitor and report on data quality. Each app will focus on one or more critical tables or data quality checks.
B) Utilize Snowflake's native Data Governance features, such as data masking and row-level security, to enforce data quality rules.
C) Implement a Snowpark Python UDF that leverages a data quality library (e.g., Great Expectations) to define and execute data quality rules. The UDF takes a DataFrame representing the data to be checked and returns a DataFrame containing the data quality check results.
D) Develop a parameterized stored procedure that accepts the table name, column name, data quality rule definition, and threshold values as input parameters. This procedure then dynamically constructs and executes the SQL query to check the data quality rule.
E) Create a series of individual SQL scripts, each checking a specific data quality rule for a specific column, and schedule these scripts to run using Snowflake tasks.
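The dynamic query construction described in option D can be illustrated outside Snowflake with a small Python sketch. It assumes a fixed catalogue of rule templates and validates identifiers before interpolating them into SQL. The names `build_check_sql` and `RULES`, and the two rule definitions, are hypothetical and chosen only for illustration.

```python
import re

# Conservative pattern for unquoted identifiers (letters, digits, _ and $).
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

# Hypothetical rule templates; {col} is replaced with a validated column name.
# Each predicate selects rows that VIOLATE the rule.
RULES = {
    "not_null": "{col} IS NULL",
    "non_negative": "{col} < 0",
}

def build_check_sql(table, column, rule):
    """Build a query that counts rows violating `rule` on table.column."""
    for name in (table, column):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if rule not in RULES:
        raise ValueError(f"unknown rule: {rule!r}")
    predicate = RULES[rule].format(col=column)
    return f"SELECT COUNT(*) AS failed_rows FROM {table} WHERE {predicate}"
```

Adding a new check then means adding one entry to the rule catalogue rather than writing another script, which is the scalability argument behind a parameterized procedure.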
Solutions:
Question # 1 Answer: A
Question # 2 Answer: A,C
Question # 3 Answer: A,E
Question # 4 Answer: E
Question # 5 Answer: C,D



By Christine


