The Normal Distribution: A Derivation through Multi-Variable Calculus
Posted 2023/08/29
The normal distribution or Gaussian distribution (informally known as a bell curve) is the most widely used continuous probability distribution, with many applications in statistics and science. Its importance comes from the central limit theorem, which establishes that the distribution of a sum (or sample mean) of many independent and identically distributed random variables tends toward a normal distribution as the sample size grows. This property makes the distribution a common choice for representing random variables whose true distributions are not known.
As such, the distribution is commonly taught as a tool in many courses without a full explanation of its properties or derivation. In my case, the distribution recently appeared in my SYDE162 Human Factors in Design course for calculating the distribution of anthropometric dimensions using z-scores, but no explanation of what the distribution is or where its properties come from was presented. However, the distribution and many of its properties are easy to understand, and the Herschel-Maxwell derivation can be followed with a basic understanding of multi-variable differentiation and integration.
Basic Properties
Continuous probability distributions like the normal distribution have a cumulative distribution function (CDF) and a probability density function (PDF). In this post I will focus on the probability density function, which is the “bell curve” function most people use; the cumulative distribution function can be found by integrating the probability density function we will be deriving.
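For reference, the two are related in the standard way: the CDF is the accumulated area under the PDF up to a point,

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$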
The probability density function describes how likely a random outcome is, with outcomes near the mean being more likely and outcomes further from the mean being less likely. This creates the characteristic “bell curve” shape of the function.
An example of this distribution appearing is the sum of dice rolls. In a two-dice roll, a sum of 7 is more likely than the extreme values of 2 and 12, and graphing the probabilities of each outcome from 2 to 12 gives a graph that vaguely resembles a normal distribution. If you increase the number of dice and redraw the graph, the distribution will look closer and closer to the normal “bell curve” as the number of dice grows. This tendency to approach the probability density function of the normal distribution as the sample size increases is described by the central limit theorem, and the theorem shows up in many applications, which is why this distribution and its probability density function are so important.
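To see this convergence for yourself, here is a minimal Python sketch (my own illustration, not part of the derivation) that simulates sums of dice rolls:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Roll `n_dice` six-sided dice 100,000 times each and summarize the sums.
# The central limit theorem predicts the sums look increasingly normal.
for n_dice in (2, 5, 30):
    sums = rng.integers(1, 7, size=(100_000, n_dice)).sum(axis=1)
    print(f"{n_dice:>2} dice: mean = {sums.mean():6.2f}, std = {sums.std():5.2f}")

# Crude text histogram for two dice: the triangular peak at 7 already
# hints at the bell shape.
sums = rng.integers(1, 7, size=(100_000, 2)).sum(axis=1)
for value in range(2, 13):
    count = int((sums == value).sum())
    print(f"{value:>2}: {'#' * (count // 500)}")
```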
Probability density functions (for any continuous probability distribution) have a few important properties:
- The probability density function is integrable over its domain (or else our distribution would not have a cumulative distribution function and property number 4 would not work)
- The probability density function is non-negative everywhere in its domain, $f(x) \ge 0$ (the probability of an outcome is never negative)
- The area under the curve of the probability density function (the area between the curve and the $x$-axis) is $1$: $\int_{-\infty}^{\infty} f(x)\,dx = 1$
- The probability that a random value lies in the interval $[a, b]$ is given by the area under the curve over that interval, $P(a \le X \le b) = \int_{a}^{b} f(x)\,dx$
This last property is what probability density functions are most commonly used for: they help calculate the probability that a random variable will fall within a certain range of values. The normal distribution specifically has a couple more properties that are useful and will be used in the derivation (a quick numerical check of these properties follows the list below):
- The probability density function is symmetric around the point $x = \mu$, which is the mean of the distribution (and also the median and mode).
- The PDF is unimodal (has a single peak), which means that the first derivative (instantaneous rate of change) is positive for $x < \mu$, negative for $x > \mu$, and zero at $x = \mu$.
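Here is that numerical check: a short Python sketch (my own addition, using `scipy.stats.norm` as a stand-in for the PDF we are about to derive):

```python
import numpy as np
from scipy import integrate, stats

pdf = stats.norm(loc=0.0, scale=1.0).pdf  # standard normal PDF (mu=0, sigma=1)

# Total area under the curve is 1.
area, _ = integrate.quad(pdf, -np.inf, np.inf)
print(f"total area: {area:.6f}")  # ~1.000000

# P(a <= X <= b) is the area under the curve over [a, b].
prob, _ = integrate.quad(pdf, -1.0, 1.0)
print(f"P(-1 <= X <= 1): {prob:.4f}")  # ~0.6827, the familiar 68% rule

# Symmetry around the mean, and a single peak at x = mu.
print(pdf(1.5) == pdf(-1.5))         # True: symmetric about mu = 0
print(pdf(0) > pdf(0.5) > pdf(1.0))  # True: unimodal, decreasing away from mu
```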
The Herschel-Maxwell Theorem and the Cartesian Dartboard Analogy
Earlier, I explained that we will derive the normal probability density function by following the Herschel-Maxwell derivation. This derivation uses the Herschel-Maxwell theorem, which states that if the probability distribution of a random vector in $\mathbb{R}^n$ is unchanged by (or independent of) rotation and the individual vector components are independently distributed, then the components of the vector are all normally distributed in an identical manner. Following this theorem, we will derive the probability distribution in a basic two-dimensional case before simplifying to one dimension, as this makes some of the later integration easier.
A good analogy for this theorem and a two-dimensional normal distribution is throwing darts at the origin of a two-dimensional Cartesian plane. The darts are aimed at the origin, but random errors in each throw produce different results. Under this analogy, we can assume that small errors are more likely than large errors, that errors of equal distance from the origin are equally likely (errors are independent of orientation/rotation), and that errors along the two axes are independent.
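Before moving on, here is a quick numerical sanity check of the dartboard picture (a sketch of my own, assuming Gaussian throw errors purely for illustration): independent errors along each axis produce a scatter whose distribution is unchanged by rotation.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate 100,000 dart throws: independent errors along the x and y axes.
x = rng.normal(0.0, 1.0, size=100_000)
y = rng.normal(0.0, 1.0, size=100_000)

# Rotate every throw by an arbitrary angle about the origin.
theta = 0.7  # radians, chosen arbitrarily
xr = x * np.cos(theta) - y * np.sin(theta)
yr = x * np.sin(theta) + y * np.cos(theta)

# Rotation invariance: miss distances are unchanged, and the rotated
# x-component is distributed just like the original x-component.
print(np.percentile(np.hypot(x, y), [25, 50, 75]))    # radii before rotation
print(np.percentile(np.hypot(xr, yr), [25, 50, 75]))  # identical radii after
print(f"{x.std():.3f} vs {xr.std():.3f}")             # both ~1.000
```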
Finding the Basic Shape of the Probability Density Function
Under the assumptions of independent axes and independence from rotation, we can start with this equation, which equates probabilities in Cartesian coordinates to probabilities in polar coordinates:

$$h(r) = f(x)\,g(y)$$
Here, $f(x)$ represents the probability density function along the $x$-axis, $g(y)$ represents the function along the $y$-axis, and $h(r)$ represents the probability density with respect to distance from the origin (or radius, $r = \sqrt{x^2 + y^2}$). We can then partially differentiate the equation with respect to $\theta$, or rotation around the origin, to get the following:

$$0 = \frac{\partial}{\partial\theta}\big[f(x)\,g(y)\big] = f'(x)\,g(y)\,\frac{\partial x}{\partial\theta} + f(x)\,g'(y)\,\frac{\partial y}{\partial\theta}$$
Observe that the probability with respect to radius ($h(r)$) is independent of orientation and as such differentiates to $0$, while the product rule is used on the right side. We can then substitute in $x = r\cos\theta$ and $y = r\sin\theta$ to convert Cartesian coordinates into polar coordinates on the right side, noting that $\frac{\partial x}{\partial\theta} = -r\sin\theta = -y$ and $\frac{\partial y}{\partial\theta} = r\cos\theta = x$, and then separate the variables:

$$0 = -y\,f'(x)\,g(y) + x\,f(x)\,g'(y) \quad\Longrightarrow\quad \frac{f'(x)}{x\,f(x)} = \frac{g'(y)}{y\,g(y)}$$
For this differential equation to be true for any independent $x$ and $y$ values, each side of the equation must equal the same constant. This unknown constant can be written as $C$:

$$\frac{f'(x)}{x\,f(x)} = \frac{g'(y)}{y\,g(y)} = C$$
We can then use this equation to solve for the basic form of the normal distribution’s probability density function (by symmetry, the same steps apply to $g(y)$):

$$\frac{f'(x)}{f(x)} = Cx \;\Longrightarrow\; \ln f(x) = \frac{C x^2}{2} + c \;\Longrightarrow\; f(x) = A e^{C x^2 / 2}$$
However, there are two conditions that need to be met for this to be a valid probability density function. First, we are assuming large deviations from the mean (or large errors in the dartboard analogy) are less likely than small errors, and the area under the curve must converge to $1$. This condition makes the unknown constant $C$ negative; if it were positive, large deviations would be more likely and the area would diverge to $\infty$. The second condition is that all probability density functions must be non-negative, which makes the unknown coefficient $A$ positive (the exponential function it scales is always positive). Writing $C = -k$ with $k > 0$, this allows us to rewrite the basic form of the function into the following:

$$f(x) = A e^{-k x^2 / 2}, \qquad A > 0,\; k > 0$$
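If you would like to double-check the separation-of-variables step, here is a short SymPy sketch (my own verification, not part of the original derivation) confirming that $f'(x) = C x\, f(x)$ has exactly this exponential family as its general solution:

```python
import sympy as sp

x, C = sp.symbols("x C")
f = sp.Function("f")

# Solve the separated differential equation f'(x) = C * x * f(x).
solution = sp.dsolve(sp.Eq(f(x).diff(x), C * x * f(x)), f(x))
print(solution)  # Eq(f(x), C1*exp(C*x**2/2)) -- the basic form A*e^(C*x^2/2)
```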
Finding the Coefficient $A$
The value of $A$ in this basic function form can be found by using the area property of probability density functions (area under the curve is equal to $1$):

$$\int_{-\infty}^{\infty} A e^{-k x^2 / 2}\,dx = 1$$
The function is symmetric and in its current form is an even function, so the integral can be rewritten like this:

$$1 = \int_{-\infty}^{\infty} A e^{-k x^2 / 2}\,dx = 2A \int_{0}^{\infty} e^{-k x^2 / 2}\,dx$$
It is difficult to integrate a Gaussian function (one of the form $e^{-x^2}$) directly, so we will bring back the second dimension from earlier (which looks identical due to symmetry) to make integration easier by rewriting the equation into a double integral:

$$1 = \left(\int_{-\infty}^{\infty} A e^{-k x^2 / 2}\,dx\right)\!\left(\int_{-\infty}^{\infty} A e^{-k y^2 / 2}\,dy\right) = A^2 \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-k (x^2 + y^2) / 2}\,dx\,dy$$
This double integral is symmetrical around the origin, so we can convert it from Cartesian coordinates to polar coordinates. Note the appearance of $r$ as the Jacobian determinant from this change of coordinates ($dx\,dy = r\,dr\,d\theta$). Following this change of coordinates, we can use standard integration techniques to start solving for $A$:

$$1 = A^2 \int_{0}^{2\pi}\!\int_{0}^{\infty} e^{-k r^2 / 2}\, r\,dr\,d\theta = 2\pi A^2 \int_{0}^{\infty} r\, e^{-k r^2 / 2}\,dr$$
This single improper integral can be evaluated with a $u$-substitution of $u = \frac{k r^2}{2}$ ($du = k r\,dr$), allowing the rest of the equation to be solved:

$$1 = 2\pi A^2 \int_{0}^{\infty} \frac{1}{k}\, e^{-u}\,du = \frac{2\pi A^2}{k} \quad\Longrightarrow\quad A^2 = \frac{k}{2\pi} \quad\Longrightarrow\quad A = \sqrt{\frac{k}{2\pi}}$$
Note that we take the positive root, $A = \sqrt{k / (2\pi)}$, due to the condition that $A$ must be positive that we deduced above. This gives us a probability density function of the form $f(x) = \sqrt{\frac{k}{2\pi}}\, e^{-k x^2 / 2}$, with $k$ being the last value we need to solve for.
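As a quick symbolic double-check (again my own addition), SymPy confirms that this value of $A$ makes the total area exactly $1$:

```python
import sympy as sp

x, k = sp.symbols("x k", positive=True)

# With A = sqrt(k / (2*pi)), the area under A * e^(-k*x^2/2) should be 1.
A = sp.sqrt(k / (2 * sp.pi))
area = sp.integrate(A * sp.exp(-k * x**2 / 2), (x, -sp.oo, sp.oo))
print(sp.simplify(area))  # 1
```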
Finding the Value of $k$
To solve for the value of $k$, we have to bring in the variance of the probability distribution. The variance is the square of the standard deviation and is defined as $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$. The mean of the function ($\mu$) is defined by the integral $\mu = \int_{-\infty}^{\infty} x f(x)\,dx$. This integrand is odd given the even probability function we found above, so our mean in this case is $\mu = 0$. We can substitute this value and our probability density function into the integral for variance (the expected squared deviation) to solve for $k$:

$$\sigma^2 = \int_{-\infty}^{\infty} x^2 \sqrt{\frac{k}{2\pi}}\, e^{-k x^2 / 2}\,dx$$
This integral can then be solved with integration by parts with $u = x$ and $dv = x\, e^{-k x^2 / 2}\,dx$ (so $v = -\frac{1}{k}\, e^{-k x^2 / 2}$); the boundary term vanishes:

$$\sigma^2 = \sqrt{\frac{k}{2\pi}} \left( \left[ -\frac{x}{k}\, e^{-k x^2 / 2} \right]_{-\infty}^{\infty} + \frac{1}{k} \int_{-\infty}^{\infty} e^{-k x^2 / 2}\,dx \right) = \frac{1}{k} \sqrt{\frac{k}{2\pi}} \int_{-\infty}^{\infty} e^{-k x^2 / 2}\,dx$$
This final integral is one we have already solved above when finding the value of $A$: $\int_{-\infty}^{\infty} e^{-k x^2 / 2}\,dx = \sqrt{2\pi / k}$. Substituting it in:

$$\sigma^2 = \frac{1}{k} \sqrt{\frac{k}{2\pi}} \cdot \sqrt{\frac{2\pi}{k}} = \frac{1}{k} \quad\Longrightarrow\quad k = \frac{1}{\sigma^2}$$
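The same kind of symbolic check (my own addition) confirms the variance result:

```python
import sympy as sp

x, k = sp.symbols("x k", positive=True)

# Variance of f(x) = sqrt(k/(2*pi)) * e^(-k*x^2/2), which has mean 0.
f = sp.sqrt(k / (2 * sp.pi)) * sp.exp(-k * x**2 / 2)
variance = sp.integrate(x**2 * f, (x, -sp.oo, sp.oo))
print(sp.simplify(variance))  # 1/k, so k = 1/sigma^2
```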
The Normal Probability Density Function
We can plug this value of $k$ into our function to find the normal probability density function:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-x^2 / (2\sigma^2)}$$
This form of the normal probability density function has a single constant $\sigma$, the desired standard deviation of the distribution, and has a mean ($\mu$) of $0$. However, most applications of this distribution don’t have a mean of $0$, so we can horizontally translate the function so that the peak of the “bell curve” lies on the desired mean $\mu$:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$$
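As a final sanity check (my own addition), the derived formula agrees with SciPy’s built-in normal PDF:

```python
import numpy as np
from scipy import stats

def derived_pdf(x, mu, sigma):
    """The PDF derived above: the normal distribution with mean mu, std sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

xs = np.linspace(-5.0, 5.0, 11)
mu, sigma = 1.5, 2.0
print(np.allclose(derived_pdf(xs, mu, sigma), stats.norm(mu, sigma).pdf(xs)))  # True
```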
Personal Thoughts
I first worked on a presentation about this derivation of the normal distribution with a group of friends back in my Grade 12 Data Management course in high school as a final project. That presentation and this post were loosely based on a “dartboard”-based derivation written by Dan Teague, which I did not fully understand back in high school because some of the derivation was beyond my limited high-school calculus knowledge. After finishing my Calculus 1 and 2 courses (SYDE111 & SYDE112) and seeing more applications of the distribution during my studies at the University of Waterloo, I was inspired to revisit the derivation and hopefully write it out in a more comprehensive form than the original.
I am not a mathematician or educator, so there may be errors in the post above or potential improvements I can make for clarity. If you spot any errors or improvements that can be made, feel free to send me an email!
3Blue1Brown also recently covered the central limit theorem in a series of videos (available here) that I found very interesting; they demonstrate why the normal probability distribution is so useful.
I haven’t posted anything to my blog in a while, so hopefully this marks the beginning of more technical writing posts related to math, programming, and information security!