
# Math Rstudio Psychology Statistics

Problem 1)
In this problem you will modify the code so that the agent learns from its feedback. In particular, we will
implement a classic learning model called temporal difference learning or TD-learning.
Suppose the agent has an estimate for the value of each of the 10 bandits. Let’s call this estimate $\hat{\theta}$. Note that
this is actually a vector, so that $\hat{\theta}_k$ represents the estimated value for the k-th alternative. On a particular
trial, the agent selects alternative k, and receives a reward r that is either 1 or 0. How should the agent
update its beliefs about the value of alternative k?
According to the TD-learning rule, learning is driven by the difference between what the agent predicted, and
what it observed. Mathematically, we have:
$$\hat{\theta}_k \leftarrow \hat{\theta}_k + \alpha \left( r - \hat{\theta}_k \right)$$
In plain English, this says that the new estimate for the value is equal to the old estimate, plus a term
proportional to the difference between what was observed and what was predicted, $(r - \hat{\theta}_k)$. The parameter
α is called the learning rate. When α = 0, the update term is zero and no learning occurs. When
α = 1, the updated value is exactly equal to the most recent reward signal r.
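To make the rule concrete, here is a minimal sketch of a single update step in R. The helper name `td_update` and its arguments are hypothetical and are not part of the exam's starter code; it only illustrates the arithmetic for one trial, not the full simulation.

```r
# Hypothetical helper (not from the exam's code): apply one TD-learning
# update to the estimate for the bandit that was chosen on this trial.
td_update <- function(theta_est, k, r, alpha = 0.05) {
  prediction_error <- r - theta_est[k]          # observed minus predicted
  theta_est[k] <- theta_est[k] + alpha * prediction_error
  theta_est                                     # other bandits are unchanged
}

# Ten estimates, all starting at 0.5
theta_est <- rep(0.5, 10)

# The agent picks bandit 3 and is rewarded (r = 1)
theta_est <- td_update(theta_est, k = 3, r = 1)
theta_est[3]  # 0.5 + 0.05 * (1 - 0.5) = 0.525
```

Only the chosen bandit's estimate moves, and it moves a fraction α of the way toward the observed reward.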
Using the code above as a starting point, create a new function called simulate_td_random(). The agent
should update its beliefs about the value of each bandit, using the TD-learning rule.
Your function should take an additional argument, alpha which determines the learning rate for the agent.
The default value for this argument should be specified as alpha = 0.05.
Your function should return a data frame that contains three columns:
• A column labeled bandit with the values 1 through 10
• A column labeled theta_true with the true reward probability
• A column labeled theta_est with the estimated value (based on TD-learning) for each bandit, at the
end of the simulation
Note: Your agent will still select actions randomly, but will learn on the basis of the reward signal.
Some specific requirements:
• You should initialize the estimated value for each bandit to 0.5.
Solution:
# Your solution here
Problem 2)
Run your function from problem 1.
Generate a bar graph that shows the estimated value (estimated reward probability) for each bandit at the
end of learning. Overlaid over the bars, also show plot markers that indicate the true reward probability for
each bandit.
Specific requirements:
• Use ggplot() to construct your graph
• Set the x-axis label to “Bandit”, and the y-axis label to “Estimated reward probability”
• The Pantone “Color of the Year” for 2023 is something called “Viva Magenta”. Set the bar colors for
your graph to a close approximation to “Viva Magenta”.
• Set the limits of the y-axis to the range 0 to 1.
• The x-axis should have labels at the integer values 1 through 10.
Solution:
# Your solution here
Problem 3)
Modify your function simulate_td_random() so that it keeps track of the total accumulated reward received
by the agent at each trial. For example, if the agent receives a reward on trials 1, 3, and 5, then its total
accumulated reward over the first five trials should be 1, 1, 2, 2, 3.
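The running total in this example is a cumulative sum, which base R provides as `cumsum()`. A quick check of the numbers above, using an illustrative reward vector:

```r
# Rewards on trials 1, 3, and 5 (illustrative values from the example above)
reward <- c(1, 0, 1, 0, 1)

# Running total of reward after each trial
accumulated_reward <- cumsum(reward)
accumulated_reward  # 1 1 2 2 3
```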
The updated function should return a data frame with three columns:
• trial (1 . . . n_trials)
• reward (the reward obtained on each trial, 0 or 1)
• accumulated_reward (the total accumulated reward on each trial)
Solution:
# Your solution here
Problem 4)
Run your function simulate_td_random() 100 times. Stack together the results into one big data frame
with three columns, and 100,000 rows (1000 trials × 100 simulations).
Once you’ve done that, assuming your results are stored in a variable called results you can use the following
tidyverse magic to get the average accumulated reward:
avg_results <- results %>%
group_by(trial) %>%
summarise(mean_accumulated_reward = mean(accumulated_reward))
Generate a line graph that shows how average accumulated reward increases over time.
Specific requirements:
• Use ggplot() to construct your graph
• Set the x-axis label to “Trial”, and the y-axis label to “Mean accumulated reward”
Solution:
# Your solution here
Problem 5)
Notice that so far, your agent is choosing its actions at random; it is exploring, but not exploiting what
it has learned. In the reinforcement learning literature, extensive research has gone into how to optimally
balance exploration and exploitation, as well as how best to model this tradeoff in human learning. We will
consider a simple heuristic approach, called ϵ-greedy action selection (ϵ is the Greek letter epsilon). The idea
is simple:
With probability ϵ, choose an action at random, and with probability (1 − ϵ) choose the action that currently
has the highest estimated value.
Create a function called simulate_td_eps(), that uses TD-learning and ϵ-greedy action selection.
Note that in the case of a tie (several alternatives have the highest value), you should choose randomly
between the tied options.
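The selection rule, including the tie-breaking requirement, can be sketched as a small helper. This is only one possible way to write it; the name `choose_eps_greedy` and its arguments are hypothetical, and it covers a single choice rather than the whole simulation.

```r
# Hypothetical helper: return the index of the bandit to play on one trial.
choose_eps_greedy <- function(theta_est, epsilon) {
  n <- length(theta_est)

  # Explore: with probability epsilon, pick any bandit uniformly at random
  if (runif(1) < epsilon) {
    return(sample.int(n, 1))
  }

  # Exploit: find every bandit tied for the highest estimate
  best <- which(theta_est == max(theta_est))

  # Index into `best` instead of calling sample(best, 1): when `best` has
  # length one, sample() would treat it as the range 1:best.
  best[sample.int(length(best), 1)]
}

# With epsilon = 0 the choice is always one of the tied leaders (2 or 3)
choose_eps_greedy(c(0.2, 0.9, 0.9, 0.1), epsilon = 0)
```

At the start of a simulation all estimates are equal, so the tie-breaking branch makes the first greedy choice uniformly random.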
Try to find a value for ϵ that maximizes the agent’s performance (you can just do this through trial and error,
a complex search for the exact optimal value is not needed).
Update your graph from problem 4 to show data for both the random and ϵ-greedy action
selection mechanisms (using average performance over 100 simulations for each algorithm).
Additional requirements:
• The data for the two action selection methods should be plotted using different colors
• Your figure should include a legend, with labels “Random” and “TD-Epsilon”
Solution:
# Your solution here
Problem 6)
ϵ-greedy is just one possible approach to balancing exploration and exploitation. Another common approach
uses the so-called “softmax” operator. If $\hat{\theta}$ represents a vector storing the estimated values for each bandit,
then the probability of choosing alternative k is given by:
$$P(\text{choice} = k) = \frac{e^{\beta \hat{\theta}_k}}{\sum_{j=1}^{n} e^{\beta \hat{\theta}_j}}$$
where β is a parameter that controls how random or deterministic the choices are. As β → 0, the probability
for each choice approaches 1/n (random action selection). As β → ∞, the probability of choosing the option
with the highest value approaches 1 (deterministic action selection). Intermediate values balance exploration
and exploitation.
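As a rough illustration of the two limits described above, the choice probabilities can be computed as follows. The helper name `softmax_probs` is hypothetical. Subtracting the maximum before exponentiating is a standard numerical safeguard; it does not change the resulting probabilities.

```r
# Hypothetical helper: softmax choice probabilities for a vector of estimates.
softmax_probs <- function(theta_est, beta) {
  z <- beta * theta_est
  z <- z - max(z)        # shift by a constant to avoid overflow in exp()
  exp(z) / sum(exp(z))
}

theta_est <- c(0.2, 0.5, 0.8)

softmax_probs(theta_est, beta = 0)     # all equal to 1/3
softmax_probs(theta_est, beta = 1000)  # essentially all mass on the third option

# One way to draw a choice from these probabilities
p <- softmax_probs(theta_est, beta = 5)
choice <- sample.int(length(p), size = 1, prob = p)
```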
Create a function called simulate_td_softmax that uses TD learning and the softmax action selection
mechanism. It should have an additional argument beta.
Try to find a value for β that maximizes the agent’s performance (as before, you can just do this through
trial and error, a complex search for the exact optimal value is not needed).
Update your graph from problem 4 to include data for all three approaches (TD-random, TD-epsilon, and
TD-softmax).
Solution:
# Your solution here
Problem 7)
So far we have been using the TD-learning rule to model how the agent updates its beliefs. Given that we
have been discussing Bayesian parameter estimation in class, it is natural to apply the same ideas to model
learning in the bandit setting.
In particular, let’s assume the agent seeks to learn the distribution p(θk) for each bandit. We will use a Beta
distribution as the prior, with parameters α = β = 1. Recall that this is equivalent to a uniform distribution
over the interval (0, 1).
After each choice, the agent receives a reward of 1 or 0. We can think of this as a coin flip experiment where
the coin has an unknown bias, except now there are 10 coins (corresponding to 10 bandits) and so we need to
keep track of the posterior distribution for each one. You will do this by keeping track of the count of heads
and tails (reward and no-reward) for each bandit.
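The counting idea relies on the fact that a Beta prior combined with a 0/1 outcome gives a Beta posterior whose shape parameters are the prior values plus the observed counts. A minimal sketch of that bookkeeping for one trial follows; the helper name `update_posterior` is hypothetical and this is not a full agent.

```r
# Beta(1, 1) prior (uniform) for each of the 10 bandits
n_bandits <- 10
a <- rep(1, n_bandits)  # 1 + number of rewards observed
b <- rep(1, n_bandits)  # 1 + number of non-rewards observed

# Hypothetical helper: update the counts for bandit k after observing reward r
update_posterior <- function(a, b, k, r) {
  a[k] <- a[k] + r        # a reward (r = 1) increments a
  b[k] <- b[k] + (1 - r)  # a non-reward (r = 0) increments b
  list(a = a, b = b)
}

# Bandit 2 is played twice: one reward, then one non-reward
post <- update_posterior(a, b, k = 2, r = 1)
post <- update_posterior(post$a, post$b, k = 2, r = 0)
c(post$a[2], post$b[2])  # 2 2, i.e. the posterior for bandit 2 is Beta(2, 2)
```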
Create a function called simulate_bayesian_agent that implements this idea. Note: We are no longer using
TD-learning. In addition, for this problem, go back to choosing actions completely at random. You might
start with the function simulate_baseline_agent as your starting point.
Your function should return a data frame with 4 columns:
• A column labeled bandit with the values 1 . . . 10
• A column labeled theta_true that stores the true value for θ for each bandit
• A column labeled a and a column labeled b; these should store the shape parameters of the posterior
distribution for each bandit at the end of the simulation. (We’ll use a and b to avoid confusion with the
α and β parameters used earlier—there’s only so many Greek letters.)
Run your function. Generate a plot that shows the posterior probability distributions p(θk) for each bandit.
Also include vertical dashed lines that show the true values for θ.
Requirements:
• Each distribution should be drawn using a different line color
• Each vertical line should be drawn in the same color as its corresponding probability distribution
Solution:
# Your solution here
Problem 8)
Using a Bayesian inference algorithm instead of TD-learning does not avoid the problem of balancing
exploration and exploitation. So far your algorithm has been selecting actions randomly.
One nice feature of Bayesian inference is that it explicitly represents uncertainty about the world. We can
use this to guide exploration. A simple approach is that on each trial, the agent generates a random sample
from the posterior distribution for each bandit. It then selects the alternative that has the highest value
according to these random samples.
Notice how this idea naturally balances exploration and exploitation—at the beginning of the simulation,
each distribution is a uniform distribution, so the agent’s choices will be completely random. As the agent learns more
about each bandit, its posterior distributions will get narrower, and so the random samples will be closer
to the true values and its behavior will become more deterministic. In the machine learning literature, this
approach is known as posterior sampling, or Thompson sampling. It is not necessarily the optimal solution to
the exploration-exploitation tradeoff, but it often performs very well.
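A single posterior-sampling choice can be sketched in a few lines, assuming the posterior for each bandit is stored as two vectors of Beta shape parameters. The helper name `choose_thompson` is hypothetical.

```r
# Hypothetical helper: draw one sample from each bandit's Beta posterior
# and return the index of the bandit with the largest sample.
choose_thompson <- function(a, b) {
  samples <- rbeta(length(a), shape1 = a, shape2 = b)
  which.max(samples)
}

# With uniform posteriors every bandit is equally likely to be chosen
choose_thompson(a = rep(1, 10), b = rep(1, 10))

# With sharply peaked posteriors the choice is effectively deterministic:
# bandit 2 has many rewards, bandits 1 and 3 have many non-rewards
choose_thompson(a = c(1, 500, 1), b = c(500, 1, 500))
```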
Modify your function simulate_bayesian_agent() to implement this idea.
In addition, modify your function so that it returns the reward and accumulated reward, in the same way
that you did for problem 3.
Solution:
# Your solution here
Problem 9)
Generate one more plot (updating your results from problem 6) that shows the average accumulated reward
for all 4 models considered: TD-random, TD-epsilon, TD-softmax, and Bayesian.
Solution:
# Your solution here
Problem 10)
Define θ1 to be the probability that a given bandit produces a reward. Assume that θ1 is unknown, but has a
posterior probability distribution defined by a Beta distribution: p(θ1) = Beta(α = 7, β = 4).
Part a)
Using numerical integration, what is the probability that θ1 > 0.5?
# Your solution here
Part b)
Using the built-in cumulative distribution function (c.d.f.), what is the probability that θ1 > 0.5?
# Your solution here
Part c)
Using Monte Carlo simulation (using 1 million samples), what is the probability that θ1 > 0.5?
# Your solution here
Part d)
Define θ2 to be the probability that a different bandit produces a reward. Assume that the posterior for θ2 is
given by p(θ2) = Beta(α = 2, β = 2).
Using Monte Carlo simulation, what is the probability that θ1 > θ2?
# Your solution here
Part e)
What is the equal-tailed 95% credible interval for θ1?
# Your solution here
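For orientation, the sketch below shows the general shape of each calculation in base R. It deliberately uses a different, purely illustrative distribution, Beta(3, 2), rather than the parameters given in this problem. For Beta(3, 2) the c.d.f. is 4x³ − 3x⁴, so the exact probability of exceeding 0.5 is 0.6875.

```r
# Illustrative shape parameters only (not the ones from Problem 10)
a <- 3
b <- 2

# Numerical integration of the density from 0.5 to 1
p_integrate <- integrate(dbeta, lower = 0.5, upper = 1,
                         shape1 = a, shape2 = b)$value

# Built-in c.d.f., upper tail
p_cdf <- pbeta(0.5, a, b, lower.tail = FALSE)

# Monte Carlo estimate: fraction of random draws above 0.5
draws <- rbeta(1e6, a, b)
p_mc <- mean(draws > 0.5)

# Equal-tailed 95% credible interval from the quantile function
ci <- qbeta(c(0.025, 0.975), a, b)

c(p_integrate, p_cdf, p_mc)  # the first two are 0.6875; the third is close to it
```

The Monte Carlo value varies slightly from run to run unless a seed is set.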

##### Additional Instructions:

Math Methods in Psychological Science:
Exam #1
[Your name here]
Spring 2023
Due date
The due date for this exam is Friday, February 24, by 2:00PM. Late submissions will not be accepted apart
from exceptional circumstances. Consequently, you should plan on submitting before the due date.
Instructions
This exam consists of 10 problems. The first 9 problems build on each other. Problem 10 consists of 5 parts,
but can be completed without first solving problems 1–9. Each problem will be graded out of a maximum of
5 points.
For this exam you must provide all of your answers in a single file, either straight source code (.R), or an R
markdown file (.Rmd), or a ‘knit’ markdown document. The R markdown document used to write this exam
will be provided as a template. Regardless of what option you choose, you should delineate your answer for
each question clearly. For example:
# ******************************************************************
# Problem 1)
# [ Solution code goes here ]
# ******************************************************************
# Problem 2)
# [ Solution code goes here ]
etc.
If I run your source code (or knit your markdown file), it should run from beginning to end without producing
any errors.
Your code should conform to the tidyverse R programming style guide, available here for reference:
https://style.tidyverse.org/index.html.
You can refer to your notes, class lecture slides, Google, StackExchange, or other online material. However,
you cannot post content from the exam or questions related to it on the internet (Discord, etc.), or consult
with other students. Anyone caught violating this policy will be given an immediate zero for the exam.
Partial credit will be given, so if you are unsure of a solution or can’t get your code to work, you should
include concise comments in your code that explain your thought process/approach.
Introduction & Background
The following background comes from Gureckis, T. M., & Love, B. C. (2015). “Computational reinforcement
learning”. The Oxford handbook of computational and mathematical psychology, 99-117:
There are few general laws of behavior, but one may be that humans and other animals tend to
repeat behaviors that have led to positive outcomes in the past and avoid those associated with
punishment or pain. Such tendencies are on display in the behavior of young children who learn
to avoid touching hot stoves following a painful burn, but behave in school when rewarded with
toys. This basic principle exerts such a powerful influence on behavior, it manifests throughout
our culture and laws. Behaviors that society wants to discourage are tied to punishment (e.g.,
prison time, fines, consumption taxes), whereas behaviors society condones are tied to positive
outcomes (e.g., tax credits for fuel-efficient cars).
The scientific study of how animals use experience to adapt their behavior in order to maximize
rewards is known as reinforcement learning (RL). Reinforcement learning differs from other types
of learning behavior of interest to psychologists (e.g., unsupervised learning, supervised learning)
since it deals with learning from feedback that is largely evaluative rather than corrective. A
restaurant diner doesn’t necessarily learn that eating at a particular business is “wrong,” simply
that the experience was less than exquisite. This particular aspect of RL – learning from evaluative
rather than corrective feedback – makes it a particularly rich domain for studying how people
adapt their behavior based on experience.
The history of RL can be traced to early work in behavioral psychology (Thorndike, 1911; Skinner,
1938). However, the modern field of RL is a highly interdisciplinary area at the crossroads of
computer science, machine learning, psychology, and neuroscience. In particular, contemporary
research on RL is characterized by detailed behavioral models that make predictions across a
wide range of circumstances, as well as neuroscience findings that have linked aspects of these
models to particular neural substrates. In many ways, RL today stands as one of the major
triumphs of cognitive science in that it offers an integrated theory of behavior at the computational,
algorithmic, and implementational (i.e., neural) levels (Marr, 1982).
Multi-armed bandits
For this exam, we will be exploring very simple models of human reinforcement learning. In particular, we
will focus on learning in “multi-armed bandit” tasks. What is a multi-armed bandit?
You have probably heard of a slot machine. It’s a gambling device where you put in some money, pull a lever,
and if you are lucky you win money. In Las Vegas (so the story goes), “one-armed bandit” is a slang term for
a slot machine. One-armed, because the machine has a single lever that you pull. Bandit, because generally
speaking it steals your money.
You can think of a multi-armed bandit as a row of slot machines. However, in the general case, each slot
machine has a different payout rate: some machines are ‘luckier’ than others. Given a finite number of choices,
the goal in this setting is to maximize your expected payout.
While abstract, multi-armed bandits are a useful analogy to a very large number of real-world scenarios. For
example, medical doctors might have a choice of n different treatments available for a particular disease, but
the effectiveness of each treatment varies and is not entirely known. Do you select a treatment that you are
confident works moderately well, or do you try a different treatment that you don’t know as much about, but
has the potential to be far more effective?
In machine learning, this tradeoff is known as the ‘exploration-exploitation’ dilemma. You need to explore
new (and potentially suboptimal) options in order to learn about them, but you also need to exploit what
you already know in order to maximize reward. You also navigate this tradeoff constantly in your daily life
without realizing it. For example, do you go out to your favorite restaurant, or do you risk trying a new
place that just opened up? Do you stay at your current job, where you might be unhappy but stable, or do
you risk the unknown for the possibility of higher pay or more job satisfaction?
For more information about multi-armed bandits, see: https://en.wikipedia.org/wiki/Multi-armed_bandit
Multi-armed bandit setup
The code below provides a skeleton for a reinforcement learning experiment with a multi-armed bandit task.
In this experiment, the learning agent faces a choice between 10 bandits on each trial. Each bandit provides
a binary reward (either 0 or 1), but the probability of reward differs between the bandits. The goal for the
learning agent is to maximize the total reward received.
On each trial, the agent selects one of the alternatives, and receives a randomly generated reward, with
probability determined by the particular bandit they selected. Mathematically, let k ∈ {1 . . . 10} indicate the
choice made on a given trial. r indicates the reward received on that trial, where
r ∼ Bernoulli(θk)
and θ is a vector of length 10 that defines the reward probability for each bandit. The following code provides
a basic implementation of a 10-armed bandit task.
Listing 1: A basic 10-armed bandit task
# Include this just once at the top of your code
set.seed(42)
simulate_baseline_agent
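The listing breaks off at this point in this copy. Independently of it, the reward model r ∼ Bernoulli(θk) described above can be sketched as follows. The names `theta_true` and `pull_bandit` and the probability values are assumptions made for illustration, and are not taken from the exam's code.

```r
# Illustrative reward probabilities for 10 bandits (assumed values)
theta_true <- seq(0.05, 0.95, length.out = 10)

# Hypothetical helper: play bandit k once and return a reward of 0 or 1
pull_bandit <- function(theta_true, k) {
  rbinom(1, size = 1, prob = theta_true[k])
}

# One trial with a randomly chosen bandit
k <- sample.int(length(theta_true), 1)
r <- pull_bandit(theta_true, k)
```

Over many pulls of the same bandit, the proportion of rewards approaches that bandit's value of θk.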
