Math RStudio Psychology Statistics


Attachments
Math Methods in Psychological Science: Exam #1
[Your name here]
Spring 2023

Due date

The due date for this exam is Friday, February 24, by 2:00 PM. Late submissions will not be accepted apart from exceptional circumstances. Consequently, you should plan on submitting before the due date.

Instructions

This exam consists of 10 problems. The first 9 problems build on each other. Problem 10 consists of 5 parts, but can be completed without first solving problems 1-9. Each problem will be graded out of a maximum of 5 points.

For this exam you must provide all of your answers in a single file: either straight source code (.R), an R Markdown file (.Rmd), or a 'knit' markdown document. The R Markdown document used to write this exam will be provided as a template. Regardless of which option you choose, you should delineate your answer for each question clearly. For example:

# ******************************************************************
# Problem 1)

# [ Solution code goes here ]

# ******************************************************************
# Problem 2)

# [ Solution code goes here ]

etc.

If I run your source code (or knit your markdown file), it should run from beginning to end without producing any errors. Your code should conform to the tidyverse R programming style guide, available here for reference: https://style.tidyverse.org/index.html

You can refer to your notes, class lecture slides, Google, StackExchange, or other online material. However, you cannot post content from the exam or questions related to it on the internet (Discord, etc.), or consult with other students. Anyone caught violating this policy will be given an immediate zero for the exam.

Partial credit will be given, so if you are unsure of a solution or can't get your code to work, you should include concise comments in your code that explain your thought process/approach.

Introduction & Background

The following background comes from Gureckis, T. M., & Love, B. C. (2015). "Computational reinforcement learning". The Oxford handbook of computational and mathematical psychology, 99-117:

There are few general laws of behavior, but one may be that humans and other animals tend to repeat behaviors that have led to positive outcomes in the past and avoid those associated with punishment or pain. Such tendencies are on display in the behavior of young children who learn to avoid touching hot stoves following a painful burn, but behave in school when rewarded with toys. This basic principle exerts such a powerful influence on behavior, it manifests throughout our culture and laws. Behaviors that society wants to discourage are tied to punishment (e.g., prison time, fines, consumption taxes), whereas behaviors society condones are tied to positive outcomes (e.g., tax credits for fuel-efficient cars). The scientific study of how animals use experience to adapt their behavior in order to maximize rewards is known as reinforcement learning (RL). Reinforcement learning differs from other types of learning behavior of interest to psychologists (e.g., unsupervised learning, supervised learning) since it deals with learning from feedback that is largely evaluative rather than corrective. A restaurant diner doesn't necessarily learn that eating at a particular business is "wrong," simply that the experience was less than exquisite.
This particular aspect of RL, learning from evaluative rather than corrective feedback, makes it a particularly rich domain for studying how people adapt their behavior based on experience.

The history of RL can be traced to early work in behavioral psychology (Thorndike, 1911; Skinner, 1938). However, the modern field of RL is a highly interdisciplinary area at the crossroads of computer science, machine learning, psychology, and neuroscience. In particular, contemporary research on RL is characterized by detailed behavioral models that make predictions across a wide range of circumstances, as well as neuroscience findings that have linked aspects of these models to particular neural substrates. In many ways, RL today stands as one of the major triumphs of cognitive science in that it offers an integrated theory of behavior at the computational, algorithmic, and implementational (i.e., neural) levels (Marr, 1982).

Multi-armed bandits

For this exam, we will be exploring very simple models of human reinforcement learning. In particular, we will focus on learning in "multi-armed bandit" tasks.

What is a multi-armed bandit? You have probably heard of a slot machine. It's a gambling device where you put in some money, pull a lever, and if you are lucky you win money. In Las Vegas (so the story goes), "one-armed bandit" is a slang term for a slot machine. One-armed, because the machine has a single lever that you pull. Bandit, because generally speaking it steals your money.

You can think of a multi-armed bandit as a row of slot machines. However, in the general case, each slot machine has a different payout rate: some machines are 'luckier' than others. Given a finite number of choices, the goal in this setting is to maximize your expected payout.

While abstract, multi-armed bandits are a useful analogy for a very large number of real-world scenarios. For example, medical doctors might have a choice of n different treatments available for a particular disease, but the effectiveness of each treatment varies and is not entirely known. Do you select a treatment that you are confident works moderately well, or do you try a different treatment that you don't know as much about, but has the potential to be far more effective? In machine learning, this tradeoff is known as the 'exploration-exploitation' dilemma. You need to explore new (and potentially suboptimal) options in order to learn about them, but you also need to exploit what you already know in order to maximize reward.

You also navigate this tradeoff constantly in your daily life without realizing it. For example, do you go out to your favorite restaurant, or do you risk trying a new place that just opened up? Do you stay at your current job, where you might be unhappy but stable, or do you risk the unknown for the possibility of higher pay or more job satisfaction?

For more information about multi-armed bandits, see: https://en.wikipedia.org/wiki/Multi-armed_bandit

Multi-armed bandit setup

The code below provides a skeleton for a reinforcement learning experiment with a multi-armed bandit task. In this experiment, the learning agent faces a choice between 10 bandits on each trial. Each bandit provides a binary reward (either 0 or 1), but the probability of reward differs between the bandits. The goal for the learning agent is to maximize the total reward received. On each trial, the agent selects one of the alternatives, and receives a randomly generated reward, with probability determined by the particular bandit they selected.
Mathematically, let k ∈ {1, ..., 10} indicate the choice made on a given trial, and let r indicate the reward received on that trial, where r ∼ Bernoulli(θ_k) and θ is a vector of length 10 that defines the reward probability for each bandit. The following code provides a basic implementation of a 10-armed bandit task.

Listing 1: A basic 10-armed bandit task

# Include this just once at the top of your code
set.seed(42)

simulate_baseline_agent <- function(n_arms = 10, n_trials = 1000) {
  # n_arms   = Number of bandits to choose from on each trial
  # n_trials = Number of trials to simulate

  # Generate the true reward probability for each arm
  theta_true <- runif(n_arms)

  for (i in 1:n_trials) {
    # Choose an action randomly
    k <- sample(1:n_arms, 1)

    # Generate a binary reward (0 or 1) according to the choice
    r <- as.numeric(runif(1) < theta_true[k])
  }

  # This function doesn't return anything (yet)
}

Note that in the code above, there are two obvious limitations as a theory of human or animal learning:

• The agent chooses actions completely at random between the 10 alternatives.
• The agent doesn't actually learn anything from the feedback that it receives.

As part of this exam, you will address each of these limitations.

Problem 1)

In this problem you will modify the code so that the agent learns from its feedback. In particular, we will implement a classic learning model called temporal difference learning, or TD-learning.

Suppose the agent has an estimate for the value of each of the 10 bandits. Let's call this estimate θ̂. Note that this is actually a vector, so that θ̂_k represents the estimated value for the k-th alternative. On a particular trial, the agent selects alternative k and receives a reward r that is either 1 or 0. How should the agent update its beliefs about the value of alternative k? According to the TD-learning rule, learning is driven by the difference between what the agent predicted and what it observed. Mathematically, we have:

θ̂_k ← θ̂_k + α (r − θ̂_k)

In plain English, this says that the new estimate for the value is equal to the old estimate, plus a term proportional to the difference between what was observed and what was predicted, (r − θ̂_k). The parameter α is called the learning rate. When α = 0, the term on the right cancels out and no learning occurs. When α = 1, the updated value is exactly equal to the most recent reward signal r.

Using the code above as a starting point, create a new function called simulate_td_random(). The agent should update its beliefs about the value of each bandit using the TD-learning rule. Your function should take an additional argument, alpha, which determines the learning rate for the agent. The default value for this argument should be specified as alpha = 0.05.

Your function should return a data frame that contains three columns:

• A column labeled bandit with the values 1 through 10
• A column labeled theta_true with the true reward probability
• A column labeled theta_est with the estimated value (based on TD-learning) for each bandit, at the end of the simulation

Note: Your agent will still select actions randomly, but will learn on the basis of the reward signal. Some specific requirements:

• You should initialize the estimated value for each bandit to 0.5.

Solution:

# Your solution here
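For reference, here is a minimal sketch of one way to approach Problem 1, following the structure of simulate_baseline_agent() from Listing 1. It is one possible implementation under the stated requirements, not the only acceptable one.

simulate_td_random <- function(n_arms = 10, n_trials = 1000, alpha = 0.05) {
  theta_true <- runif(n_arms)     # true reward probability for each arm
  theta_est  <- rep(0.5, n_arms)  # initialize every estimate to 0.5

  for (i in 1:n_trials) {
    k <- sample(1:n_arms, 1)                                    # random action
    r <- as.numeric(runif(1) < theta_true[k])                   # binary reward
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])   # TD update
  }

  data.frame(
    bandit     = 1:n_arms,
    theta_true = theta_true,
    theta_est  = theta_est
  )
}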
Problem 2)

Run your function from problem 1. Generate a bar graph that shows the estimated value (estimated reward probability) for each bandit at the end of learning. Overlaid on the bars, also show plot markers that indicate the true reward probability for each bandit.

Specific requirements:

• Use ggplot() to construct your graph
• Set the x-axis label to "Bandit", and the y-axis label to "Estimated reward probability"
• The Pantone "Color of the Year" for 2023 is something called "Viva Magenta". Set the bar colors for your graph to a close approximation of "Viva Magenta".
• Set the limits of the y-axis to the range 0 to 1.
• The x-axis should have labels at the integer values 1 through 10.

Solution:

# Your solution here

Problem 3)

Modify your function simulate_td_random() so that it keeps track of the total accumulated reward received by the agent at each trial. For example, if the agent receives a reward on trials 1, 3, and 5, then its total accumulated reward over the first five trials should be 1, 1, 2, 2, 3.

The updated function should return a data frame with three columns:

• trial (1 ... n_trials)
• reward (the reward obtained on each trial, 0 or 1)
• accumulated_reward (the total accumulated reward on each trial)

Solution:

# Your solution here

Problem 4)

Run your function simulate_td_random() 100 times. Stack together the results into one big data frame with three columns and 100,000 rows (1000 trials × 100 simulations). Once you've done that, assuming your results are stored in a variable called results, you can use the following tidyverse magic to get the average accumulated reward:

avg_results <- results %>%
  group_by(trial) %>%
  summarise(mean_accumulated_reward = mean(accumulated_reward))

Generate a line graph that shows how average accumulated reward increases over time.

Specific requirements:

• Use ggplot() to construct your graph
• Set the x-axis label to "Trial", and the y-axis label to "Mean accumulated reward"

Solution:

# Your solution here

Problem 5)

Notice that so far, your agent is choosing its actions at random: it is exploring, but not exploiting what it has learned. In the reinforcement learning literature, extensive research has gone into how to optimally balance exploration and exploitation, as well as how best to model this tradeoff in human learning.

We will consider a simple heuristic approach, called ϵ-greedy action selection (ϵ is the Greek letter epsilon). The idea is simple: with probability ϵ, choose an action at random, and with probability (1 − ϵ) choose the action that currently has the highest estimated value.

Create a function called simulate_td_eps() that uses TD-learning and ϵ-greedy action selection. Note that in the case of a tie (several alternatives have the highest value), you should choose randomly between the tied options. Try to find a value for ϵ that maximizes the agent's performance (you can just do this through trial and error; a complex search for the exact optimal value is not needed).

Update your graph from problem 4 to show data for both the random action selection and ϵ-greedy action selection mechanisms (using average performance over 100 simulations for each algorithm).

Additional requirements:

• The data for the two action selection methods should be plotted using different colors
• Your figure should include a legend, with labels "Random" and "TD-Epsilon"

Solution:

# Your solution here
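For Problem 2 above, a minimal ggplot() sketch along the following lines would satisfy the requirements. The variable name results_p1 is illustrative, and the hex code "#BB2649" is only a commonly cited approximation of Pantone's "Viva Magenta", so treat both as assumptions.

library(ggplot2)

results_p1 <- simulate_td_random()

ggplot(results_p1, aes(x = bandit)) +
  geom_col(aes(y = theta_est), fill = "#BB2649") +  # approximate "Viva Magenta" bars
  geom_point(aes(y = theta_true)) +                 # markers at the true probabilities
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Bandit", y = "Estimated reward probability")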
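For Problems 3 and 4 above, one possible sketch is shown below. It reuses the Problem 1 loop and simply records the reward on every trial; cumsum() then gives the accumulated reward. Stacking the 100 runs with purrr::map_dfr() is one of several reasonable choices, not a required approach.

library(dplyr)
library(purrr)
library(ggplot2)

# Problem 3: same loop as before, but record the reward on each trial
simulate_td_random <- function(n_arms = 10, n_trials = 1000, alpha = 0.05) {
  theta_true <- runif(n_arms)
  theta_est  <- rep(0.5, n_arms)
  reward     <- numeric(n_trials)

  for (i in 1:n_trials) {
    k <- sample(1:n_arms, 1)
    r <- as.numeric(runif(1) < theta_true[k])
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])
    reward[i] <- r
  }

  data.frame(
    trial              = 1:n_trials,
    reward             = reward,
    accumulated_reward = cumsum(reward)
  )
}

# Problem 4: stack 100 simulations, average by trial, and plot
results <- map_dfr(1:100, ~ simulate_td_random())

avg_results <- results %>%
  group_by(trial) %>%
  summarise(mean_accumulated_reward = mean(accumulated_reward))

ggplot(avg_results, aes(x = trial, y = mean_accumulated_reward)) +
  geom_line() +
  labs(x = "Trial", y = "Mean accumulated reward")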
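For Problem 5, only the choice rule changes relative to the Problem 3 version of the loop. A sketch of ϵ-greedy selection with random tie-breaking is shown below; the argument name epsilon and its default value 0.1 are illustrative, not required. The extra if guards against R's sample() quirk when only one arm is tied for the maximum.

simulate_td_eps <- function(n_arms = 10, n_trials = 1000,
                            alpha = 0.05, epsilon = 0.1) {
  theta_true <- runif(n_arms)
  theta_est  <- rep(0.5, n_arms)
  reward     <- numeric(n_trials)

  for (i in 1:n_trials) {
    if (runif(1) < epsilon) {
      k <- sample(1:n_arms, 1)                    # explore: random action
    } else {
      best <- which(theta_est == max(theta_est))  # may contain ties
      k <- if (length(best) == 1) best else sample(best, 1)
    }
    r <- as.numeric(runif(1) < theta_true[k])
    theta_est[k] <- theta_est[k] + alpha * (r - theta_est[k])
    reward[i] <- r
  }

  data.frame(trial = 1:n_trials, reward = reward,
             accumulated_reward = cumsum(reward))
}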
Problem 6)

ϵ-greedy is just one possible approach to balancing exploration and exploitation. Another common approach uses the so-called "softmax" operator. If θ̂ represents a vector storing the estimated values for each bandit, then the probability of choosing alternative k is given by:

P(choice = k) = exp(β θ̂_k) / Σ_{j=1}^{n} exp(β θ̂_j)

where β is a parameter that controls how random or deterministic the choices are. As β → 0, the probability for each choice approaches 1/n (random action selection). As β → ∞, the probability of choosing the option with the highest value approaches 1 (deterministic action selection). Intermediate values balance exploration and exploitation.

Create a function called simulate_td_softmax() that uses TD-learning and the softmax action selection mechanism. It should have an additional argument beta. Try to find a value for β that maximizes the agent's performance (as before, you can just do this through trial and error; a complex search for the exact optimal value is not needed).

Update your graph from problem 5 to include data for all three approaches (TD-random, TD-epsilon, and TD-softmax).

Solution:

# Your solution here

Problem 7)

So far we have been using the TD-learning rule to model how the agent updates its beliefs. Given that we have been discussing Bayesian parameter estimation in class, it is natural to apply the same ideas to model learning in the bandit setting.

In particular, let's assume the agent seeks to learn the distribution p(θ_k) for each bandit. We will use a Beta distribution as the prior, with parameters α = β = 1. Recall that this is equivalent to a uniform distribution over the interval (0, 1). After each choice, the agent receives a reward of 1 or 0. We can think of this as a coin flip experiment where the coin has an unknown bias, except now there are 10 coins (corresponding to 10 bandits), and so we need to keep track of the posterior distribution for each one. You will do this by keeping track of the count of heads and tails (reward and no-reward) for each bandit.

Create a function called simulate_bayesian_agent() that implements this idea. Note: We are no longer using TD-learning. In addition, for this problem, go back to choosing actions completely at random. You might start with the function simulate_baseline_agent() as your starting point.

Your function should return a data frame with 4 columns:

• A column labeled bandit with the values 1 ... 10
• A column labeled theta_true that stores the true value of θ for each bandit
• A column labeled a and a column labeled b; these should store the shape parameters of the posterior distribution for each bandit at the end of the simulation. (We'll use a and b to avoid confusion with the α and β parameters used earlier; there are only so many Greek letters.)

Run your function. Generate a plot that shows the posterior probability distributions p(θ_k) for each bandit. Also include vertical dashed lines that show the true values for θ.

Requirements:

• Each distribution should be drawn using a different line color
• Each vertical line should be drawn in the same color as its corresponding probability distribution

Solution:

# Your solution here
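For Problem 6 above, only the action selection rule differs from simulate_td_eps(); a small sketch of the softmax choice is given here. The helper name softmax_choice() is illustrative, and the surrounding TD-learning loop is assumed to be the same as before.

# Softmax choice: sample an arm with probability proportional to exp(beta * value)
softmax_choice <- function(theta_est, beta) {
  p <- exp(beta * theta_est)
  p <- p / sum(p)                         # normalize to choice probabilities
  sample(seq_along(theta_est), 1, prob = p)
}

# Inside the trial loop of simulate_td_softmax(), the choice rule would become:
#   k <- softmax_choice(theta_est, beta)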
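For Problem 7, a possible sketch is shown below. It keeps reward/no-reward counts as the Beta shape parameters a and b, then evaluates each posterior with dbeta() on a grid for plotting. The grid size (200 points) and the base-R loop that builds the plotting data frame are arbitrary choices, not requirements.

library(ggplot2)

simulate_bayesian_agent <- function(n_arms = 10, n_trials = 1000) {
  theta_true <- runif(n_arms)
  a <- rep(1, n_arms)   # Beta(1, 1) prior for every bandit
  b <- rep(1, n_arms)

  for (i in 1:n_trials) {
    k <- sample(1:n_arms, 1)                   # random action selection
    r <- as.numeric(runif(1) < theta_true[k])
    a[k] <- a[k] + r                           # count rewards ("heads")
    b[k] <- b[k] + (1 - r)                     # count non-rewards ("tails")
  }

  data.frame(bandit = 1:n_arms, theta_true = theta_true, a = a, b = b)
}

post <- simulate_bayesian_agent()

# Evaluate each posterior density on a grid of theta values
theta_grid <- seq(0, 1, length.out = 200)
densities <- do.call(rbind, lapply(1:nrow(post), function(i) {
  data.frame(bandit  = post$bandit[i],
             theta   = theta_grid,
             density = dbeta(theta_grid, post$a[i], post$b[i]))
}))

ggplot(densities, aes(x = theta, y = density, color = factor(bandit))) +
  geom_line() +
  geom_vline(data = post,
             aes(xintercept = theta_true, color = factor(bandit)),
             linetype = "dashed") +
  labs(x = "theta", y = "Posterior density", color = "Bandit")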
Problem 8)

Using a Bayesian inference algorithm instead of TD-learning does not avoid the problem of balancing exploration and exploitation. So far your algorithm has been selecting actions randomly. One nice feature of Bayesian inference is that it explicitly represents uncertainty about the world. We can use this to guide exploration.

A simple approach is that, on each trial, the agent generates a random sample from the posterior distribution for each bandit. It then selects the alternative that has the highest value according to these random samples. Notice how this idea naturally balances exploration and exploitation: at the beginning of the simulation, each distribution is a uniform distribution, so its choices will be completely random. As the agent learns more about each bandit, its posterior distributions will get narrower, and so the random samples will be closer to the true values and its behavior will become more deterministic. In the machine learning literature, this approach is known as posterior sampling, or Thompson sampling. It is not necessarily the optimal solution to the exploration-exploitation tradeoff, but it often performs very well.

Modify your function simulate_bayesian_agent() to implement this idea. In addition, modify your function so that it returns the reward and accumulated reward, in the same way that you did for problem 3.

Solution:

# Your solution here

Problem 9)

Generate one more plot (updating your results from problem 6) that shows the average accumulated reward for all 4 models considered: TD-random, TD-epsilon, TD-softmax, and Bayesian.

Solution:

# Your solution here

Problem 10)

Define θ_1 to be the probability that a given bandit produces a reward. Assume that θ_1 is unknown, but has a posterior probability distribution defined by a Beta distribution: p(θ_1) = Beta(α = 7, β = 4).

Part a) Using numerical integration, what is the probability that θ_1 > 0.5?

# Your solution here

Part b) Using the built-in cumulative distribution function (c.d.f.), what is the probability that θ_1 > 0.5?

# Your solution here

Part c) Using Monte Carlo simulation (using 1 million samples), what is the probability that θ_1 > 0.5?

# Your solution here

Part d) Define θ_2 to be the probability that a different bandit produces a reward. Assume that the posterior for θ_2 is given by p(θ_2) = Beta(α = 2, β = 2). Using Monte Carlo simulation, what is the probability that θ_1 > θ_2?

# Your solution here

Part e) What is the equal-tailed 95% credible interval for θ_1?

# Your solution here
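For Problem 8, a minimal Thompson-sampling sketch is given below. It assumes the Problem 7 function as the starting point and changes only the choice rule and the returned bookkeeping; rbeta() draws one sample per posterior, and which.max() picks the arm with the largest sample.

simulate_bayesian_agent <- function(n_arms = 10, n_trials = 1000) {
  theta_true <- runif(n_arms)
  a <- rep(1, n_arms)
  b <- rep(1, n_arms)
  reward <- numeric(n_trials)

  for (i in 1:n_trials) {
    samples <- rbeta(n_arms, a, b)   # one draw from each bandit's posterior
    k <- which.max(samples)          # act greedily with respect to the samples
    r <- as.numeric(runif(1) < theta_true[k])
    a[k] <- a[k] + r
    b[k] <- b[k] + (1 - r)
    reward[i] <- r
  }

  data.frame(trial = 1:n_trials, reward = reward,
             accumulated_reward = cumsum(reward))
}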
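For Problem 10, the standard Beta-distribution functions in base R are sufficient; a sketch for each part is shown below, using the stated Beta(7, 4) posterior for θ_1 and Beta(2, 2) posterior for θ_2.

# Part a) numerical integration of the Beta(7, 4) density above 0.5
integrate(function(x) dbeta(x, 7, 4), lower = 0.5, upper = 1)

# Part b) built-in c.d.f.
1 - pbeta(0.5, 7, 4)   # equivalently: pbeta(0.5, 7, 4, lower.tail = FALSE)

# Part c) Monte Carlo with one million samples
theta1 <- rbeta(1e6, 7, 4)
mean(theta1 > 0.5)

# Part d) P(theta_1 > theta_2) by Monte Carlo
theta2 <- rbeta(1e6, 2, 2)
mean(theta1 > theta2)

# Part e) equal-tailed 95% credible interval for theta_1
qbeta(c(0.025, 0.975), 7, 4)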