
# Coding assignment in Google Colab or Jupyter


The assignment can be done in either Google Colab or a Jupyter notebook, and it is distributed as a .ipynb file. Some code is already provided in the file; you are asked to fill out the "TODO" portions that are left for the student to complete.
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sklearn.preprocessing import normalize
from sklearn.metrics import pairwise_distances

wiki = pd.read_csv('people_wiki.csv')

vectorizer = TfidfVectorizer(max_df=0.95)  # ignore words with very high doc frequency
tf_idf = vectorizer.fit_transform(wiki['text'])
words = vectorizer.get_feature_names_out()
tf_idf = csr_matrix(tf_idf)
tf_idf = normalize(tf_idf)

def get_initial_centroids(data, k, seed=None):
    """Randomly choose k data points as initial centroids."""
    if seed is not None:  # useful for obtaining consistent results
        np.random.seed(seed)
    n = data.shape[0]  # number of data points

    # Pick k indices from range [0, n).
    rand_indices = np.random.randint(0, n, k)

    # Keep centroids in dense format, as many entries will be nonzero due to averaging.
    # As long as at least one document in a cluster contains a word,
    # it will carry a nonzero weight in the TF-IDF vector of the centroid.
    centroids = data[rand_indices, :].toarray()
    return centroids

# Question 1
# TODO
distances = ...
dist = ...

# Question 2
# TODO
closest_cluster = ...

# Question 3
def assign_clusters(data, centroids):
    """
    Parameters:
      - data      - an np.array of float values of length n.
      - centroids - an np.array of float values of length k.

    Returns
      - An np.array of length n where the ith index represents which centroid
        data[i] was assigned to. The assignments range between the values
        0, ..., k-1.
    """
    # TODO
    return ...

# Question 4
def revise_centroids(data, k, cluster_assignment):
    """
    Parameters:
      - data               - an np.array of float values of length n.
      - k                  - number of centroids.
      - cluster_assignment - an np.array of length n where the ith index
        represents which centroid data[i] was assigned to. The assignments
        range between the values 0, ..., k-1.

    Returns
      - An np.array of length k for the new centroids.
    """
    # TODO
    new_centroids = []
    for i in range(k):
        # Select all data points that belong to cluster i. Fill in the blank (RHS only)
        member_data_points = ...
        # Compute the mean of the data points. Fill in the blank (RHS only)
        centroid = ...
        # Convert numpy.matrix type to numpy.ndarray type
        centroid = centroid.A1
        new_centroids.append(centroid)
    new_centroids = np.array(new_centroids)
    return new_centroids

# Question 5
def kmeans(data, k, initial_centroids, max_iter, record_heterogeneity=None, verbose=False):
    """
    Runs k-means on given data and an initial set of centroids.

    Parameters:
      - data                 - an np.array of float values of length n.
      - k                    - number of centroids.
      - initial_centroids    - an np.array of float values of length k.
      - max_iter             - maximum number of iterations to run the algorithm.
      - record_heterogeneity - if provided an empty list, the heterogeneity at
        each iteration is computed and appended to the list. Defaults to None
        (heterogeneity is not recorded).
      - verbose              - set to True to display progress. Defaults to False.

    Returns
      - centroids          - an np.array of length k for the centroids upon
        termination of the algorithm.
      - cluster_assignment - an np.array of length n where the ith index
        represents which centroid data[i] was assigned to. The assignments
        range between the values 0, ..., k-1 upon termination of the algorithm.
    """
    centroids = initial_centroids[:]
    prev_cluster_assignment = None

    for itr in range(max_iter):
        # Print iteration number
        if verbose:
            print(itr)

        # 1. Make cluster assignments using nearest centroids
        cluster_assignment = ...

        # 2. Compute a new centroid for each of the k clusters, averaging
        #    all data points assigned to that cluster.
        centroids = ...

        # Check for convergence: if none of the assignments changed, stop
        if prev_cluster_assignment is not None and \
           (prev_cluster_assignment == cluster_assignment).all():
            break

        # Print number of new assignments
        if prev_cluster_assignment is not None:
            num_changed = sum(abs(prev_cluster_assignment - cluster_assignment))
            if verbose:
                print(f'    {num_changed:5d} elements changed their cluster assignment.')

        # Record heterogeneity convergence metric
        if record_heterogeneity is not None:
            score = ...
            record_heterogeneity.append(score)

        prev_cluster_assignment = cluster_assignment[:]

    return centroids, cluster_assignment

k = 3
q5_initial_centroids = get_initial_centroids(tf_idf, k, seed=0)
q5_centroids, q5_cluster_assignment = kmeans(tf_idf, k, q5_initial_centroids, max_iter=400)

# Question 6
# TODO
largest_cluster = ...

# Setup
def compute_heterogeneity(data, k, centroids, cluster_assignment):
    """
    Computes the heterogeneity metric of the data using the given centroids
    and cluster assignments.
    """
    heterogeneity = 0.0
    for i in range(k):
        # Select all data points that belong to cluster i.
        member_data_points = data[cluster_assignment == i, :]
        if member_data_points.shape[0] > 0:  # check if i-th cluster is non-empty
            # Compute distances from the centroid to the data points
            distances = pairwise_distances(member_data_points, [centroids[i]], metric='euclidean')
            squared_distances = distances ** 2
            heterogeneity += np.sum(squared_distances)
    return heterogeneity

def smart_initialize(data, k, seed=None):
    """Use k-means++ to initialize a good set of centroids."""
    if seed is not None:  # useful for obtaining consistent results
        np.random.seed(seed)
    centroids = np.zeros((k, data.shape[1]))

    # Randomly choose the first centroid.
    # Since we have no prior knowledge, choose uniformly at random.
    idx = np.random.randint(data.shape[0])
    centroids[0] = data[idx, :].toarray()
    # Compute distances from the first centroid chosen to all the other data points
    distances = pairwise_distances(data, centroids[0:1], metric='euclidean').flatten()

    for i in range(1, k):
        # Choose the next centroid randomly, so that the probability for each
        # data point to be chosen is directly proportional to its squared
        # distance from the nearest centroid. Roughly speaking, a new centroid
        # should be as far from the other centroids as possible.
        idx = np.random.choice(data.shape[0], 1, p=distances / sum(distances))
        centroids[i] = data[idx, :].toarray()
        # Now compute distances from the centroids to all data points
        distances = np.min(pairwise_distances(data, centroids[0:i + 1], metric='euclidean'),
                           axis=1)

    return centroids

# Question 7
def kmeans_multiple_runs(data, k, max_iter, seeds, verbose=False):
    """
    Runs k-means multiple times.

    Parameters:
      - data     - an np.array of float values of length n.
      - k        - number of centroids.
      - max_iter - maximum number of iterations to run the algorithm.
      - seeds    - either the number of seeds to try (generated randomly)
        or a list of seed values.
      - verbose  - set to True to display progress. Defaults to False.

    Returns
      - final_centroids          - an np.array of length k for the centroids
        upon termination of the algorithm.
      - final_cluster_assignment - an np.array of length n where the ith index
        represents which centroid data[i] was assigned to. The assignments
        range between the values 0, ..., k-1 upon termination of the algorithm.
    """
    min_heterogeneity_achieved = float('inf')
    final_centroids = None
    final_cluster_assignment = None

    if type(seeds) == int:
        seeds = np.random.randint(low=0, high=10000, size=seeds)
    num_runs = len(seeds)

    for seed in seeds:
        # Use k-means++ initialization: Fill in the blank
        # TODO
        initial_centroids = ...

        # Run k-means: Fill in the blank
        # Set record_heterogeneity=None because we will compute that once at the end.
        centroids, cluster_assignment = ...

        # To save time, compute heterogeneity only once at the end
        seed_heterogeneity = ...

        if verbose:
            print(f'seed={seed:06d}, heterogeneity={seed_heterogeneity:.5f}')

        # If the current measurement of heterogeneity is lower than previously
        # seen, update the minimum record of heterogeneity.
        if seed_heterogeneity < min_heterogeneity_achieved:
            min_heterogeneity_achieved = seed_heterogeneity
            final_centroids = centroids
            final_cluster_assignment = cluster_assignment

    # Return the centroids and cluster assignments that minimize heterogeneity.
    return final_centroids, final_cluster_assignment

q7_centroids, q7_cluster_assignment = kmeans_multiple_runs(
    tf_idf, 5, max_iter=100,
    seeds=[20000, 40000, 80000])  # saved seeds with best result to save time

# Question 8
# Takes too long to run, so we will hard-code the answer
# TODO
q8_centroids, q8_cluster_assignment = kmeans_multiple_runs(
    tf_idf, 100, max_iter=400,
    seeds=[80000])  # saved seed with best result to save time
num_small_clusters = ...
```
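For Questions 1 and 2 the notebook leaves `distances`, `dist`, and `closest_cluster` blank. A minimal sketch of the likely pattern, run on a tiny synthetic matrix instead of the real `tf_idf` data (the toy arrays here are my own illustrative assumption, not part of the assignment):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Tiny stand-ins for tf_idf and the centroids: 3 points, 2 centroids, in 2-D.
data = np.array([[0.0, 0.0],
                 [1.0, 0.0],
                 [0.9, 0.1]])
centroids = np.array([[0.0, 0.0],
                      [1.0, 0.0]])

# Question 1 pattern: Euclidean distance from every data point to every centroid.
distances = pairwise_distances(data, centroids, metric='euclidean')
dist = distances[0]  # e.g. the distances from the first data point

# Question 2 pattern: index of the nearest centroid for each data point.
closest_cluster = np.argmin(distances, axis=1)
print(closest_cluster)  # → [0 1 1]
```

`distances` has shape `(n, k)`, so `np.argmin(..., axis=1)` collapses the centroid axis and leaves one cluster index per data point.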
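One plausible completion of `assign_clusters` and `revise_centroids` (Questions 3 and 4), exercised on toy sparse data so the `.A1` conversion behaves as it does in the notebook (`csr_matrix.mean` returns an `np.matrix`, which `.A1` flattens to a 1-D `ndarray`). Treat this as a sketch of the standard approach, not the official solution:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances

def assign_clusters(data, centroids):
    # Distance from every data point to every centroid ...
    distances_from_centroids = pairwise_distances(data, centroids, metric='euclidean')
    # ... then the index of the nearest centroid per point.
    return np.argmin(distances_from_centroids, axis=1)

def revise_centroids(data, k, cluster_assignment):
    new_centroids = []
    for i in range(k):
        # Select all data points that belong to cluster i.
        member_data_points = data[cluster_assignment == i, :]
        # Mean of the members; on sparse input this comes back as np.matrix ...
        centroid = member_data_points.mean(axis=0)
        # ... so flatten it to a 1-D ndarray.
        centroid = centroid.A1
        new_centroids.append(centroid)
    return np.array(new_centroids)

# Toy sparse data: two obvious clusters around (0, 0.5) and (10, 10.5).
data = csr_matrix(np.array([[0.0, 0.0], [0.0, 1.0],
                            [10.0, 10.0], [10.0, 11.0]]))
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])

assignment = assign_clusters(data, centroids)
print(assignment)                             # → [0 0 1 1]
print(revise_centroids(data, 2, assignment))  # → [[ 0.   0.5] [10.  10.5]]
```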
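Inside the Question 5 loop, the two blanks are typically calls to the two helpers defined above (`assign_clusters(data, centroids)` and `revise_centroids(data, k, cluster_assignment)`), and the heterogeneity blank is a call to `compute_heterogeneity`. To show how those pieces interact, here is a compressed, dense-only version of the whole loop on toy data; it is my own simplification under those assumptions, not the notebook's code:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def kmeans_sketch(data, k, initial_centroids, max_iter):
    centroids = initial_centroids.copy()
    prev_assignment = None
    for _ in range(max_iter):
        # 1. Assign each point to its nearest centroid.
        assignment = np.argmin(pairwise_distances(data, centroids), axis=1)
        # 2. Move each centroid to the mean of its assigned points
        #    (assumes no cluster goes empty, which holds for this toy data).
        centroids = np.array([data[assignment == i].mean(axis=0) for i in range(k)])
        # Stop once the assignments no longer change.
        if prev_assignment is not None and (prev_assignment == assignment).all():
            break
        prev_assignment = assignment.copy()
    return centroids, assignment

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init = data[[0, 2]]  # deliberately seed one centroid in each blob
centroids, assignment = kmeans_sketch(data, 2, init, max_iter=10)
print(assignment)  # → [0 0 1 1]
```

The convergence check mirrors the notebook's: once `prev_cluster_assignment == cluster_assignment` everywhere, further iterations cannot change anything, so the loop exits early.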
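Questions 6 and 8 both reduce to counting cluster sizes from a `cluster_assignment` array, for which `np.bincount` is one convenient tool. The assignment array and the "small cluster" threshold below are purely illustrative assumptions (the excerpt does not state what threshold Question 8 uses):

```python
import numpy as np

# A hypothetical assignment of 10 points to k=3 clusters.
cluster_assignment = np.array([0, 0, 0, 0, 0, 1, 1, 2, 2, 2])
k = 3

sizes = np.bincount(cluster_assignment, minlength=k)
print(sizes)              # → [5 2 3]

# Question 6 pattern: the index of the largest cluster.
largest_cluster = np.argmax(sizes)
print(largest_cluster)    # → 0

# Question 8 pattern: how many clusters fall below some size threshold
# (the threshold of 3 here is invented for the example).
num_small_clusters = int(np.sum(sizes < 3))
print(num_small_clusters)  # → 1
```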