
Databricks-Machine-Learning-Associate Sample Questions and Answers

Question 4

Which statement describes a Spark ML transformer?

Options:

A.

A transformer is an algorithm which can transform one DataFrame into another DataFrame

B.

A transformer is a hyperparameter grid that can be used to train a model

C.

A transformer chains multiple algorithms together to transform an ML workflow

D.

A transformer is a learning algorithm that can use a DataFrame to train a model

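For reference, a minimal sketch of a transformer in practice, assuming a hypothetical DataFrame df with numeric columns x1 and x2: a transformer such as VectorAssembler implements a transform method that maps one DataFrame to another DataFrame.

from pyspark.ml.feature import VectorAssembler

# A transformer exposes transform(), mapping one DataFrame to another DataFrame
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembled_df = assembler.transform(df)  # new DataFrame with an added "features" column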
Question 5

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.

Which of the following describes why?

Options:

A.

Gradient boosting is not a linear algebra-based algorithm which is required for parallelization

B.

Gradient boosting requires access to all data at once which cannot happen during parallelization.

C.

Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

D.

Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.

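Background on why the order of iterations matters here: in the standard formulation of gradient boosting, each stage depends on the output of the previous stage,

F_m(x) = F_{m-1}(x) + η · h_m(x),  where h_m is fit to the negative gradient (pseudo-residuals) of the loss evaluated at F_{m-1}.

Because h_m cannot be computed until F_{m-1} is known, the trees cannot be trained independently in parallel, unlike the trees of a random forest.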
Question 6

A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model.

The Spark DataFrame train_df has the following schema:

The machine learning engineer shares the following code block:

Which of the following changes does the machine learning engineer need to make to complete the task?

Options:

A.

They need to call the transform method on train_df

B.

They need to convert the features column to be a vector

C.

They do not need to make any changes

D.

They need to utilize a Pipeline to fit the model

E.

They need to split the features column out into one column for each feature

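As background, Spark ML's LinearRegression expects a single vector-valued features column, typically produced with a VectorAssembler. A hedged sketch follows; the column names rooms, rating, and price are assumptions, since the original schema and code block are not shown above.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Assemble the individual numeric columns into a single vector column
assembler = VectorAssembler(inputCols=["rooms", "rating"], outputCol="features")
assembled_df = assembler.transform(train_df)

# Fit the linear regression on the vector-valued features column
lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(assembled_df)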
Question 7

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

A.

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.

The process will leak data from the training set to the test set during the evaluation phase

C.

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.

The process will leak data prep information from the validation sets to the training sets for each model

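A minimal sketch of the setup being described, since the original code block is not shown above; the hyperparameter grid, evaluator, train_df, and the price label column are assumptions. When pipeline is the estimator, every pipeline stage is re-executed for each fold and each parameter combination, which lengthens the runtime.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

param_grid = ParamGridBuilder().addGrid(rfr.numTrees, [20, 50]).build()
evaluator = RegressionEvaluator(labelCol="price", metricName="rmse")

# With pipeline as the estimator, all pipeline stages are refit/retransformed
# for every fold and every parameter combination
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=3)
cv_model = cv.fit(train_df)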
Question 8

A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).

Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

A)

B)

C)

D)

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

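The code snippets for options A through D are not shown above. As an illustrative sketch only, not a reproduction of any of the options, a query like this is commonly written with mlflow.search_runs; the metric name rmse is an assumption.

import mlflow

# Return the single run in the experiment with the lowest RMSE
best_run = mlflow.search_runs(
    experiment_ids=[experiment_id],
    order_by=["metrics.rmse ASC"],
    max_results=1,
)
best_run_id = best_run.loc[0, "run_id"]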
Question 9

The implementation of linear regression in Spark ML first attempts to solve the linear regression problem using matrix decomposition, but this method does not scale well to large datasets with a large number of variables.

Which of the following approaches does Spark ML use to distribute the training of a linear regression model for large data?

Options:

A.

Logistic regression

B.

Spark ML cannot distribute linear regression training

C.

Iterative optimization

D.

Least-squares method

E.

Singular value decomposition

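For reference, Spark ML's LinearRegression exposes a solver parameter. A hedged sketch of the iterative (L-BFGS) path used when the normal-equation (matrix decomposition) approach does not scale; train_df and the column names are assumptions.

from pyspark.ml.regression import LinearRegression

# solver="l-bfgs" selects the distributed iterative optimizer instead of
# the normal-equation (matrix decomposition) solver
lr = LinearRegression(featuresCol="features", labelCol="price", solver="l-bfgs")
lr_model = lr.fit(train_df)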
Question 10

Which of the following machine learning algorithms typically uses bagging?

Options:

A.

Gradient boosted trees

B.

K-means

C.

Random forest

D.

Decision tree

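As a reminder, a random forest is a bagging method: each tree is trained on a bootstrap sample of the rows and a random subset of the features. A hedged Spark ML sketch; train_df and the column names are assumptions.

from pyspark.ml.classification import RandomForestClassifier

# Each of the 100 trees is fit on a bootstrapped sample of the training rows
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=100, subsamplingRate=1.0,
                            featureSubsetStrategy="auto")
rf_model = rf.fit(train_df)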
Question 11

Which of the following Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?

Options:

A.

TrainValidationSplit

B.

DataFrame.where

C.

CrossValidator

D.

TrainValidationSplitModel

E.

DataFrame.randomSplit

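For reference, a minimal sketch of DataFrame.randomSplit; df, the 80/20 ratio, and the seed are illustrative assumptions.

# Randomly split the DataFrame into training and test sets
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)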
Question 12

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.

Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

Options:

A.

A holdout set is not necessary when using a train-validation split

B.

Reproducibility is achievable when using a train-validation split

C.

Fewer hyperparameter values need to be tested when using a train-validation split

D.

Bias is avoidable when using a train-validation split

E.

Fewer models need to be trained when using a train-validation split

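As background, a train-validation split fits each hyperparameter combination once, whereas k-fold cross-validation fits it k times. A hedged Spark ML sketch; the estimator lr, param_grid, evaluator, and train_df are assumptions.

from pyspark.ml.tuning import TrainValidationSplit

# One model per parameter combination (vs. k models with k-fold cross-validation)
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=param_grid,
                           evaluator=evaluator,
                           trainRatio=0.8)
tvs_model = tvs.fit(train_df)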
Question 13

A machine learning engineer is using the following code block to scale the inference of a single-node model on a Spark DataFrame with one million records:

Assuming the default Spark configuration is in place, which of the following is a benefit of using an Iterator?

Options:

A.

The data will be limited to a single executor preventing the model from being loaded multiple times

B.

The model will be limited to a single executor preventing the data from being distributed

C.

The model only needs to be loaded once per executor rather than once per batch during the inference process

D.

The data will be distributed across multiple executors during the inference process

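The original code block is not shown above, so here is a hedged sketch of the Iterator-of-Series pandas UDF pattern being referenced; spark_df, the feature column name, and the load_model helper are assumptions. The model is loaded once and then reused for every batch the task processes, instead of being reloaded for each batch.

from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict_udf(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # The model is loaded once here, then reused for every batch in the partition
    model = load_model()  # hypothetical helper that loads the single-node model
    for batch in batches:
        yield pd.Series(model.predict(batch.to_frame()))

predictions_df = spark_df.withColumn("prediction", predict_udf("feature"))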
Question 14

Which of the following statements describes a Spark ML estimator?

Options:

A.

An estimator is a hyperparameter grid that can be used to train a model

B.

An estimator chains multiple algorithms together to specify an ML workflow

C.

An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions

D.

An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer

E.

An estimator is an evaluation tool to assess the quality of a model

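For reference, a minimal sketch of the estimator/transformer relationship; train_df and the column names are assumptions. Calling fit on an estimator returns a fitted model, which is itself a Transformer.

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="price")  # estimator
lr_model = lr.fit(train_df)            # fit() produces a Transformer (LinearRegressionModel)
predictions_df = lr_model.transform(train_df)  # the fitted model transforms DataFrames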
Question 15

A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model in parallel. They elect to use the Hyperopt library to facilitate this process.

Which of the following Hyperopt tools provides the ability to optimize hyperparameters in parallel?

Options:

A.

fmin

B.

SparkTrials

C.

quniform

D.

search_space

E.

objective_function

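A hedged sketch of the pattern in question; the objective function body, the search space, the parallelism value, and the train_and_score helper are placeholders. Passing SparkTrials to fmin distributes the trials across the Spark cluster.

from hyperopt import fmin, tpe, hp, SparkTrials

search_space = {"max_depth": hp.quniform("max_depth", 2, 10, 1)}

def objective(params):
    # Train the scikit-learn model with params and return the validation loss
    return train_and_score(params)  # hypothetical helper

# SparkTrials runs the trials in parallel on the Spark cluster
best_params = fmin(fn=objective,
                   space=search_space,
                   algo=tpe.suggest,
                   max_evals=50,
                   trials=SparkTrials(parallelism=4))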
Question 16

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

A.

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

B.

pandas API on Spark DataFrames are more performant than Spark DataFrames

C.

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

D.

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

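As a brief illustration, with spark_df assumed to be an existing Spark DataFrame: a pandas API on Spark DataFrame wraps an underlying Spark DataFrame together with index metadata, and the Spark DataFrame can be recovered from it.

import pyspark.pandas as ps

psdf = ps.DataFrame(spark_df)   # wraps the Spark DataFrame plus index metadata
sdf = psdf.to_spark()           # recovers the underlying Spark DataFrame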
Question 17

Which of the following approaches can be used to view the notebook that was run to create an MLflow run?

Options:

A.

Open the MLmodel artifact in the MLflow run page

B.

Click the "Models" link in the row corresponding to the run in the MLflow experiment paqe

C.

Click the "Source" link in the row corresponding to the run in the MLflow experiment page

D.

Click the "Start Time" link in the row corresponding to the run in the MLflow experiment page

Question 18

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

Options:

A.

When the new solution requires if-else logic determining which model to use to compute each prediction

B.

When the new solution's models have an average latency that is larger than that of the original model

C.

When the new solution requires the use of fewer feature variables than the original model

D.

When the new solution requires that each model computes a prediction for every record

E.

When the new solution's models have an average size that is larger than the size of the original model

Question 19

A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value.

They have developed the following code block to accomplish this task:

The code block is not accomplishing the task.

Which of the following reasons describes why the code block is not accomplishing the imputation task?

Options:

A.

It does not impute both the training and test data sets.

B.

The inputCols and outputCols need to be exactly the same.

C.

The fit method needs to be called instead of transform.

D.

It does not fit the imputer on the data to create an ImputerModel.

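For reference, a hedged sketch of median imputation that does fit the Imputer to produce an ImputerModel; the column names are assumptions, since the original code block is not shown above.

from pyspark.ml.feature import Imputer

imputer = Imputer(strategy="median",
                  inputCols=["units", "spend"],
                  outputCols=["units_imputed", "spend_imputed"])

# fit() produces an ImputerModel; transform() then fills the missing values
imputer_model = imputer.fit(features_df)
imputed_df = imputer_model.transform(features_df)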
Question 20

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

Options:

A.

Impute the missing values using each respective feature variable's mean value instead of the median value

B.

Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

C.

Remove all feature variables that originally contained missing values from the feature set

D.

Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

E.

Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing

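A hedged sketch of the indicator-column idea; features_df and the column name spend are assumptions. The flag is added before imputation so it records which rows were originally missing.

from pyspark.sql import functions as F

# 1 if the original value was missing (and will therefore be imputed), 0 otherwise
features_df = features_df.withColumn(
    "spend_was_missing", F.col("spend").isNull().cast("integer")
)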
Question 21

A data scientist is working with a feature set with the following schema:

The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.

Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

Options:

A.

customer_id, loyalty_tier

B.

loyalty_tier

C.

units

D.

spend

E.

customer_id

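As background, the most-common-value (mode) strategy applies to a categorical column such as loyalty_tier. One hedged way to do this in PySpark, assuming features_df is the feature set and loyalty_tier is a string column:

from pyspark.sql import functions as F

# Most common non-null loyalty_tier value
mode_tier = (features_df
             .where(F.col("loyalty_tier").isNotNull())
             .groupBy("loyalty_tier")
             .count()
             .orderBy(F.desc("count"))
             .first()["loyalty_tier"])

features_df = features_df.fillna({"loyalty_tier": mode_tier})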
Question 22

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_sql()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

E.

spark_df.to_pandas()

Exam Code: Databricks-Machine-Learning-Associate
Exam Name: Databricks Certified Machine Learning Associate Exam
Last Update: Nov 17, 2024
Questions: 74