The Effects of Reinforcement and Punishment on Equine Training and Learning

NB This isn't an article so much as an essay I wrote for a university course. I think it shows! Being able to write like this won't necessarily make you a better horse-trainer, but being able to understand the concepts - and to recognise them in practice - will be a big step in the right direction.

It can be argued that the science of operant conditioning began with Thorndike's Law of Effect (Thorndike 1911). Extension of this law gives a standard definition of reinforcement: a reinforcer is a stimulus that, when presented following a behaviour, increases the probability that the behaviour will recur. Thus positive reinforcement (PR) is the addition of a pleasant stimulus, or reward, and negative reinforcement (NR) is the cessation of an aversive stimulus. In contrast to reinforcement, punishment (P) can be defined as a consequence that, if presented immediately following a behaviour, makes the behaviour less likely to recur (e.g. Burch & Bailey 1999; see Fig. 1).

Operant Conditioning

Figure 1: Contingencies of reinforcement: arrows indicate the likelihood of recurrence of behaviour
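For readers who find it easier to see the definitions laid out as rules, the three contingencies can be written as a small Python sketch. This is only an illustration of the definitions given above: the names `Contingency`, `classify` and `behaviour_more_likely` are my own, and the two yes/no questions (is the stimulus aversive, and is it added or taken away) are a simplifying assumption rather than anything taken from the cited sources.

```python
from enum import Enum


class Contingency(Enum):
    """The three contingencies discussed in this essay (names are illustrative)."""
    POSITIVE_REINFORCEMENT = "PR"
    NEGATIVE_REINFORCEMENT = "NR"
    PUNISHMENT = "P"


def classify(stimulus_is_aversive: bool, stimulus_is_added: bool) -> Contingency:
    """Map a consequence onto PR, NR or P using the essay's definitions."""
    if not stimulus_is_aversive and stimulus_is_added:
        # A pleasant stimulus (reward) is added after the behaviour.
        return Contingency.POSITIVE_REINFORCEMENT
    if stimulus_is_aversive and not stimulus_is_added:
        # An aversive stimulus ceases after the behaviour.
        return Contingency.NEGATIVE_REINFORCEMENT
    if stimulus_is_aversive and stimulus_is_added:
        # An aversive consequence is presented after the behaviour.
        return Contingency.PUNISHMENT
    # Removing a pleasant stimulus is not one of the three cases covered here.
    raise ValueError("not covered by the three contingencies in this essay")


def behaviour_more_likely(contingency: Contingency) -> bool:
    """Both kinds of reinforcement raise the likelihood of recurrence; punishment lowers it."""
    return contingency is not Contingency.PUNISHMENT
```

For example, releasing the reins as the horse halts is `classify(True, False)`, which gives NR, and `behaviour_more_likely` then reports that halting should become more likely.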

A positive reinforcer is anything that will motivate the subject, e.g. a cash bonus, praise or promotion following hard work, or a food reward for a horse that performs a desired behaviour. The reward must be presented simultaneously with, or within three seconds of, the behaviour; if the reward is delayed then it will not be clear which behaviour is being rewarded and the desired behaviour will be less likely to occur. The effect of PR is to increase motivation and confidence; it encourages creativity and exploration of the environment since there is no fear of a negative consequence of an "incorrect action".

NR is typically the release of an aversive stimulus; a negative reinforcer is a stimulus that the subject will work to avoid. If the desired behaviour causes the aversive to be reduced then it is "escape" NR, e.g. moving away from the fire when it becomes too hot, or a horse learning to stop when pressure is applied to the bit, provided the reins are released as soon as the horse halts. Alternatively, "avoidance" NR involves working to avoid the escalation of the aversive, such as opening a window to allow the heat to escape, or the horse learning to halt because otherwise the rein-pressure will increase.

The timing of the release is critical so that the subject understands which behaviour is being reinforced. The release must coincide with the desired behaviour or there will be a punishing effect. However, even a well-timed release does not necessarily teach the animal the desired behaviour, merely a means of avoiding the pressure. It must be used with care so as not to cause loss of confidence, confusion and resentment.

Studies comparing the effects of positive and negative reinforcement on horses have suggested that some horses respond more strongly to NR and others to PR (e.g. Visser et al. 2003). However, the study neither explains why this might be nor addresses the possible welfare issues implied by stating that some horses respond better to avoidance training.

By definition, a punisher needs to be aversive in order to eliminate an undesirable behaviour. For example, we may suffer a hangover after drinking too much alcohol, or a horse may be whipped for refusing a jump. Because it occurs after the behaviour, P does not alter the fact that the behaviour has taken place, and it provides no information as to the desired behaviour. Instead it just teaches the subject avoidance of the aversive (e.g. taking pain-killers in anticipation of a hangover, or a horse refusing to be caught if it expects P) and resentment of the trainer. In order to maximize its effect, P should be intense (as opposed to escalated from a mild form, which can lead to habituation) and delivered as quickly as possible after the behaviour (see Fig. 2 for effects of delayed P).


Figure 2: The effects on rats of punishment (electric shocks), delayed punishment and response-independent shocks. Clearly punishment needs to be immediate for it to be effective (Schwartz et al. 1978 and references therein).

When teaching a horse to load into a trailer, PR can be used to reward every step towards the trailer. A detailed shaping plan would be necessary and the horse should be happy with each step before moving on to the next. Care should be taken so that the horse's desire for treats does not lead to it loading too quickly and flooding itself. Once the steps forwards and loading are well-established they can be placed on a variable schedule of reinforcement (VSR) so that eventually just the final result is rewarded. NR could be used by applying pressure to the lead rope and releasing when the horse takes a step forwards. In this case it is harder to avoid flooding since it is necessary to keep the pressure on if the horse halts. Care should be taken not to allow the horse to release the pressure through incorrect behaviour, e.g. rearing. Loading a horse using P would require, for example, hitting the horse with a whip every time it halted and/or walked backwards. Regardless of the method, loading should be practised in different locations and at different times of day so as to generalize the response.

When training the horse to be safely shod by the farrier, PR can be used to reward first shifting the weight and then lifting the foot for progressively longer periods in response to a cue. Lifting the foot using NR or P might require squeezing the chestnut (thereby punishing keeping all four feet on the ground) and releasing as soon as the foot is lifted (NR). If the horse struggles then the foot could be held tightly and released when the horse stands still again (NR), or the horse could be smacked/reprimanded (P). In each case, after gradual shaping, the behaviour should be extended to include the range of leg positions required by the farrier, hitting the foot with a hammer, various locations (including next to the farrier's van when another horse is being shod) and various people (including men).

PR, when used incorrectly, is the least likely of the three to cause serious psychological trauma. Typical problems might include incorrect timing and the inadvertent rewarding of undesirable behaviours. For example, a mother might give sweets to a child "just to keep him quiet" following a tantrum, or a rider may stroke/praise a horse with the intention of offering reassurance but in reality rewarding "being scared". A clicker trainer risks rewarding mugging or establishing behaviour chains, e.g. the horse thinking the desired behaviour is mugging followed by looking away. PR might not work long-term if the rewards are insufficiently motivating or if the shaping plan leads to the desired behaviour too quickly, causing flooding. PR should not be used to mask pain.

In practice, training with PR is not as simple as Thorndike's Law of Effect suggests. A behaviour rewarded every time it is presented will tend to decrease over long periods of time. Instead it is necessary to use a VSR, thus maintaining motivation and a high response rate, e.g. gamblers who win occasional jackpots are more susceptible to addiction than those who win a small amount every time they play (see Fig. 3). Similarly, we do not reward an older horse every time it walks a step under saddle, although we may do this initially with a youngster.

Variable Schedules

Figure 3: A variable-ratio schedule of reinforcement (left) will maintain motivation more successfully than a fixed-ratio schedule (right) (Schwartz et al. 1978 and references therein).
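The difference between the two schedules in Fig. 3 can be made concrete with a short Python sketch. It is a toy illustration only: the function names are hypothetical, and drawing each gap between rewards uniformly from 1 to `2 * mean_ratio - 1` is my own assumption for producing an unpredictable gap with the stated average, not the procedure used in the cited experiments. It models when rewards are delivered, not how the animal responds.

```python
import random


def fixed_ratio(n_responses: int, ratio: int) -> list[bool]:
    """Reward every `ratio`-th response, so the subject can predict each reward."""
    return [i % ratio == 0 for i in range(1, n_responses + 1)]


def variable_ratio(n_responses: int, mean_ratio: int, rng: random.Random) -> list[bool]:
    """Reward after an unpredictable number of responses averaging `mean_ratio`.

    Assumption: each gap is drawn uniformly from 1 .. 2 * mean_ratio - 1.
    """
    rewards = []
    responses_since_reward = 0
    next_gap = rng.randint(1, 2 * mean_ratio - 1)
    for _ in range(n_responses):
        responses_since_reward += 1
        if responses_since_reward >= next_gap:
            rewards.append(True)
            responses_since_reward = 0
            next_gap = rng.randint(1, 2 * mean_ratio - 1)  # pick the next gap
        else:
            rewards.append(False)
    return rewards


if __name__ == "__main__":
    rng = random.Random(2004)  # seeded only so that repeated runs match
    for name, schedule in [("fixed", fixed_ratio(30, 5)),
                           ("variable", variable_ratio(30, 5, rng))]:
        # Print one character per response: R = rewarded, . = not rewarded.
        print(f"{name:>8}: " + "".join("R" if r else "." for r in schedule))
```

Run as a script, this prints one row per schedule. The fixed row shows a reward at every fifth response, while in the variable row the subject cannot tell which response will pay off, which is the property the gambling example relies on.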

There is also the possibility that the subject will do the minimum required to earn the treat, e.g. when the trainer inadvertently rewards progressively smaller foot-falls. Similarly, (s)he might only offer previously-rewarded behaviours, rather than offering new ones, unless creativity is encouraged (e.g. only ever using the same bread-making recipe; Schwartz et al. 1978).

PR training can encourage the subject to offer lots of behaviours which may be undesirable for the owner, particularly since withholding the reward leads to escalation of the behaviour. Inadvertent rewarding at the peak of the extinction burst firmly reinforces that unwanted behaviour (e.g. horse kicking stable door). PR may also create a worried, "neurotic" animal desperately trying to offer behaviours, particularly if used in conjunction with P of incorrect behaviours.

Incorrect use of NR and P can be much more serious. At best it can lead to pain, fear, confusion, worry and resentfulness, e.g. a poorly timed release of pressure, giving no indication of the correct behaviour. Alternatively the wrong behaviour could be reinforced, such as rearing to escape a pressure halter. NR and P could mask pain, typically where the aversive nature of the stimulus outweighs the aversive nature of the desired behaviour (e.g. using a gum line to prevent bucking under saddle; Roberts 2002), and lead to flooding (e.g. forcing someone to confront a phobia). Conversely the P may be considered reinforcing by the subject (e.g. the expulsion of a child from school).

If the pressure is not released then NR can become punishing, resulting in a "shut down" attitude and conditioned suppression of behaviours (e.g. electric shock experiments on rats by Estes & Skinner 1941). Continued P can lead to learned helplessness since the subject realises that its behaviour and the outcomes are independent. This learning produces the motivational, cognitive and emotional effects of uncontrollability, causing severe stress and depletion of the neurochemical necessary for mediation of movement (e.g. Maier & Seligman 1976). While the original experiments involved giving electric shocks to dogs, this effect is apparently closely linked with depression in humans, although both can be "alleviated" by forcing the subject to experience success in re-learning to control his environment (Seligman 1968, 1990). The long-term effects on the subject of such "retraining" were not discussed.


Burch M., Bailey J., 1999, "How Dogs Learn", Howell
Estes W.K., Skinner B.F., 1941, Journal of Experimental Psychology, 29, 390
Maier S.F., Seligman M.E.P., 1976, Journal of Experimental Psychology, 105, 3
Roberts M., 2002, "From My Hands To Yours", Monty & Pat Roberts Inc.
Schwartz B., Wasserman E.A., Robbins S.J., 1978, "Psychology of Learning and Behaviour", 5th edition, Norton
Seligman M.E.P., 1968, Journal of Comparative and Physiological Psychology, 66, 402
Seligman M.E.P., 1990, "Learned Optimism", New York: Knopf
Thorndike E.L., 1911, "Animal Intelligence"
Visser E.K. et al., 2003, Applied Animal Behaviour Science, 80, 311

Copyright Catherine Bell 2004