Snowflake SnowPro Advanced: Data Scientist Certification Sample Questions:
1. You are a data scientist working for a retail company using Snowflake. You're building a linear regression model to predict sales based on advertising spend across various channels (TV, Radio, Newspaper). After initial EDA, you suspect multicollinearity among the independent variables. Which of the following Snowflake SQL statements or techniques are MOST appropriate for identifying and addressing multicollinearity BEFORE fitting the model? Choose two.
A) Implement Principal Component Analysis (PCA) using Snowpark Python to transform the independent variables into uncorrelated principal components and then select only the components explaining a certain percentage of the variance.
B) Use '…' on each independent variable to estimate its uniqueness. If uniqueness is low, multicollinearity is likely.
C) Calculate the Variance Inflation Factor (VIF) for each independent variable using a user-defined function (UDF) in Snowflake that implements the VIF calculation based on R-squared values from auxiliary regressions. This requires fitting a linear regression for each independent variable against all others.
D) Drop one of the independent variables at random if they appear highly correlated.
E) Generate a correlation matrix of the independent variables using the 'CORR' aggregate function in Snowflake SQL and examine the correlation coefficients. Values close to +1 or -1 suggest high multicollinearity.
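Option C describes computing VIFs from auxiliary regressions. As a minimal illustration of that calculation (outside Snowflake, using numpy on synthetic advertising data rather than any real table), where VIF_j = 1 / (1 - R²_j) and R²_j comes from regressing variable j on the remaining variables:

```python
import numpy as np

def vif(X):
    """VIF per column of X via auxiliary regressions:
    regress column j on the other columns, then VIF_j = 1 / (1 - R^2_j)."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# synthetic channels: radio is deliberately collinear with tv
rng = np.random.default_rng(0)
tv = rng.normal(100, 20, 500)
radio = 0.8 * tv + rng.normal(0, 5, 500)
news = rng.normal(30, 10, 500)
X = np.column_stack([tv, radio, news])
print(vif(X))  # tv and radio get large VIFs; news stays near 1
```

A common rule of thumb flags VIF values above 5 (or 10) as evidence of problematic multicollinearity; the same formula can be ported into a Snowflake UDF as option C suggests.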
2. You are developing a model to predict house prices based on structured data including size, number of bedrooms, location, and age. You have built a linear regression model within Snowflake. During the evaluation, you observe that the residuals exhibit heteroscedasticity. Which of the following actions is the LEAST appropriate to address heteroscedasticity in this scenario, considering you want to implement the solution primarily using Snowflake's built-in features and capabilities?
A) Use robust standard errors in the linear regression analysis, even though Snowflake doesn't directly support calculating them. You decide to export model coefficients to an external statistics package (e.g., Python with Statsmodels) to compute robust standard errors and then bring insights back to Snowflake.
B) Transform the independent variables using a Box-Cox transformation and include the transformed variables when training the linear regression model in Snowflake.
C) Apply a logarithmic transformation to the target variable ('SALES_PRICE') using the 'LOG' function within Snowflake before training the linear regression model.
D) Include interaction terms between the independent variables in your linear regression model.
E) Implement Weighted Least Squares (WLS) regression by calculating weights inversely proportional to the variance of the residuals for each data point. This involves creating a UDF to calculate weights and modifying the linear regression model fitting process. (Assume direct modification of the fitting process is possible within Snowflake).
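Option E's weighted least squares can be sketched outside Snowflake first. This is a minimal numpy illustration on synthetic house data (all names and values are assumptions): the simulated noise standard deviation grows with house size, so weights are chosen as 1/size², and the WLS fit is obtained by scaling each row by the square root of its weight:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
size = rng.uniform(500, 3500, n)
# heteroscedastic noise: standard deviation proportional to size
price = 50_000 + 120.0 * size + rng.normal(0.0, 0.1 * size)

A = np.column_stack([np.ones(n), size])

# ordinary least squares (ignores heteroscedasticity)
beta_ols, *_ = np.linalg.lstsq(A, price, rcond=None)

# weighted least squares: w_i = 1 / size_i^2 since noise sd is proportional to size;
# multiplying rows by sqrt(w_i) turns the WLS problem into an ordinary lstsq
w = 1.0 / size**2
Aw = A * np.sqrt(w)[:, None]
yw = price * np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(Aw, yw, rcond=None)

print(beta_ols, beta_wls)  # both near [50000, 120]; WLS gives more reliable inference
```

Both estimators remain unbiased here; the point of WLS is that its standard errors are valid under heteroscedasticity, which is why exporting coefficients merely to recompute standard errors elsewhere (option A) is the least appropriate workflow.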
3. You are tasked with creating a new feature in a machine learning model for predicting customer lifetime value. You have access to a table called 'CUSTOMER_ORDERS', which contains order history for each customer. This table contains the following columns: 'CUSTOMER_ID', 'ORDER_DATE', and 'ORDER_AMOUNT'. To improve model performance and reduce the impact of outliers, you plan to bin the 'ORDER_AMOUNT' column using quantiles. You decide to create 5 bins, effectively creating quintiles. You also want to create a derived feature indicating if the customer's latest order amount falls in the top quintile. Which of the following approaches, or combination of approaches, is most appropriate and efficient for achieving this in Snowflake? (Choose all that apply)
A) Use the 'WIDTH_BUCKET' function after finding the quintile boundaries using 'APPROX_PERCENTILE' or 'PERCENTILE_CONT', then use MAX('ORDER_DATE') to identify each customer's most recent order and check whether its amount falls in the top quintile.
B) Create a temporary table storing the quintile boundaries, then join it to the original table to identify top-quintile order amounts.
C) Use a Snowflake UDF (User-Defined Function) written in Python or Java to calculate the quantiles and assign each 'ORDER_AMOUNT' to a bin. A subsequent statement can then check which amounts fall in the top quintile.
D) Use the 'NTILE' window function to create quintiles for 'ORDER_AMOUNT' and then, in a separate query, check whether the latest 'ORDER_AMOUNT' for each customer falls within the NTILE that represents the top quintile.
E) Calculate the 20th, 40th, 60th, and 80th percentiles of 'ORDER_AMOUNT' using 'APPROX_PERCENTILE' or 'PERCENTILE_CONT', then use a 'CASE' statement to assign each order to a quintile bin. Finally, check whether each customer's latest order amount falls in the top quintile.
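The quintile logic in options A and E can be prototyped before writing the SQL. In this sketch (synthetic data, not the 'CUSTOMER_ORDERS' table), numpy's percentile plays the role of 'PERCENTILE_CONT' and digitize plays the role of 'WIDTH_BUCKET':

```python
import numpy as np

rng = np.random.default_rng(2)
amounts = rng.gamma(2.0, 50.0, 10_000)  # hypothetical ORDER_AMOUNT values

# boundaries at the 20th/40th/60th/80th percentiles,
# i.e. what PERCENTILE_CONT(0.2), ..., PERCENTILE_CONT(0.8) would return
bounds = np.percentile(amounts, [20, 40, 60, 80])

# digitize assigns each amount a bin index 0..4, like WIDTH_BUCKET over the boundaries
bins = np.digitize(amounts, bounds)

def in_top_quintile(amount):
    """Derived feature: does this order amount exceed the 80th-percentile boundary?"""
    return amount > bounds[-1]

print(in_top_quintile(np.percentile(amounts, 95)))  # True
```

In Snowflake the same flag becomes a comparison of each customer's latest 'ORDER_AMOUNT' (found via MAX('ORDER_DATE') per customer) against the 0.8 quantile boundary.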
4. You have deployed a sentiment analysis model on AWS SageMaker and want to integrate it with Snowflake using an external function. You've created an API integration object. Which of the following SQL statements is the most secure and efficient way to create an external function that utilizes this API integration, assuming the model expects a JSON payload with a 'text' field, the API integration is named 'sagemaker_integration', the SageMaker endpoint URL is 'https://your-sagemaker-endpoint.com/invoke', and you want the Snowflake function to be named 'predict_sentiment'?
A) Option D
B) Option B
C) Option E
D) Option C
E) Option A
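The SQL statements behind options A–E are not reproduced in this sample set. For reference, a definition consistent with the parameters given in the question (function name, API integration, endpoint URL) would look roughly like the following sketch; the argument type and return handling depend on the model's actual contract:

```sql
CREATE OR REPLACE EXTERNAL FUNCTION predict_sentiment(text VARCHAR)
    RETURNS VARIANT
    API_INTEGRATION = sagemaker_integration
    AS 'https://your-sagemaker-endpoint.com/invoke';
```

Routing the call through an API integration object, rather than embedding credentials in the function, is what makes this pattern the secure choice.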
5. You have trained a linear regression model in Snowpark ML to predict house prices. After training, you want to assess the overall feature importance using the model's coefficients. Consider the following Snowflake table containing the coefficients:
Which of the following statements are correct interpretations of these coefficients regarding feature impact?
A) An increase of one square foot (sqft) in house size is associated with an increase of $120.5 in the predicted house price.
B) The 'bedrooms' feature has a positive impact on the house price since the coefficient is negative.
C) The 'age' feature has an insignificant impact because its coefficient is small.
D) The 'location_score' feature is the most influential predictor in determining house price.
E) Increasing the number of bedrooms is associated with a decrease in the predicted house price.
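The coefficient table referenced above is not reproduced here. To see why option C's reasoning fails (a small raw coefficient is not the same as an insignificant feature) and how option D can hold, coefficients can be compared on a standardized scale: the predicted price change for a one-standard-deviation move in each feature. The numbers below are hypothetical, chosen only to be consistent with the answer options (e.g. 120.5 for 'sqft' from option A, a negative 'bedrooms' coefficient from option E):

```python
import numpy as np

# hypothetical coefficients and feature standard deviations (not the exam's actual table)
coefs = {"sqft": 120.5, "bedrooms": -5_000.0, "age": -300.0, "location_score": 45_000.0}
stds  = {"sqft": 800.0, "bedrooms": 0.9,      "age": 15.0,   "location_score": 2.5}

# standardized impact: coefficient times the feature's standard deviation
impact = {f: coefs[f] * stds[f] for f in coefs}

# rank features by absolute standardized impact
for f, v in sorted(impact.items(), key=lambda kv: -abs(kv[1])):
    print(f"{f:15s} {v:12.1f}")
```

Under these assumed spreads, 'location_score' dominates even though its raw coefficient is not the only large one, while 'age', despite a small coefficient, still shifts price by thousands of dollars per standard deviation.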
Solutions:
Question #1 Answer: C, E
Question #2 Answer: A
Question #3 Answer: A, D, E
Question #4 Answer: D
Question #5 Answer: A, D, E