Saturday, April 06, 2013
Thursday, April 01, 2010
Gaussian FAQ
I found this on my computer and can't seem to find it on the web, so I'm putting it here in case other's would benefit. Note: I didn't write this.
====================John D'Errico, 24 Mar 1995=======ssm
From: John D'Errico
Subject: Gaussian FAQ
Message-ID: <3kv7m3$61f@thetimes.pixel.kodak.com>
This is a FAQ which attempts to cover questions about the normal
(Gaussian) distribution. I will attempt to add or enhance topics
as I have the time and people make suggestions.
Questions to be answered in this FAQ are:
1. What is the normal (Gaussian) density function?
2. What is the cumulative normal distribution function?
3. How is ERF related to the cumulative normal distribution function?
4. What is a good approximation to the cumulative normal function?
5. What is a good approximation to the inverse of the cumulative
distribution normal function?
6. How do you generate a normally distributed random variable?
7. If X is a normally distributed random variable with mean = 0
standard deviation = 1, how do you convert it into one with mean = mu
and standard deviation = sigma?
8. How do you generate bivariate or multivariate random normal
variates with a specified correlation matrix? Covariance matrix?
9. If X is a normally distributed random variable, what is the
distribution of f(X)?
10. What are some good references on these topics?
======================================================================
Answers to questions 1-10:
1. What is the normal (Gaussian) density function?
The standard (mean=0 and variance=1) normal density function is
1 -(1/2)*x^2
Z(x) = ---------- e
sqrt(2*pi)
For mean=mu and variance=sigma^2
[ correction, inserted (1/2) below; 17-aug-00/rfu ]
1 -(1/2)((x-mu)/sigma)^2
Z(x) = ---------------- e
sqrt(2*pi*sigma)
The normal distribution with mean=mu and variance=sigma^2 is
customarily referred to as N(mu,sigma^2). The standard normal
is N(0,1).
----------------------------------------------------------------------
2. What is the cumulative normal distribution function?
The cumulative normal, referred to here by P(X), is simply the
integral of Z(x) dx from minus infinity to X.
We will use Q(X) to refer to the integral of Z(x) dx from X to
plus infinity.
P(X) + Q(X) = 1
----------------------------------------------------------------------
3. How is ERF related to the cumulative normal distribution function?
Since some systems do not provide a cumulative normal function but do
provide ERF, this can be useful if you know the transformation between
them.
ERF(X) = integral [2/sqrt(pi)]*exp(-x^2) dx from 0 to X.
P(X) = [ERF(X/sqrt(2)) + 1]/2
----------------------------------------------------------------------
4. What is a good approximation to the cumulative normal function?
While it is not possible to write a solution to this in a simple,
closed form expression not involving infinite sums, there are many
good approximations available. It is not necessary to resort to
numerical integration techniques such as trapezoidal or Simpson's
rule. In fact, far more accuracy can be achieved through one of the
approximations which can be found in the references listed below.
One simple approximation to P(X), the cumulative normal, is:
(from Abramowitz and Stegun)
Let t = 1/(1+pX)
where p=.33267
then
P(x) = 1-Z(x)*[a_1*t + a_2*t^2 + a_3*t^3]
where
a_1 = 0.4361836
a_2 = -0.1201676
a_3 = 0.9372980
The error in this approximation is less than 0.00001 for all X.
Other approximations can be found in Abramowitz & Stegun; Press, et al;
and in Hart, et al. In particular, Hart has a very extensive set of
approximations, some with accuracy over 20 significant digits within
a given region.
----------------------------------------------------------------------
5. What is a good approximation to the inverse of the cumulative
normal distribution function?
Mathematically we can write this as: Find X such that Q(X) = p for
any 0 < p < 1. (Note: this computes an upper tail probability.)
Again, it is not possible to write this as a closed form expression,
so we resort to approximations. Because of the symmetry of the normal
distribution, we need only consider 0 < p < 0.5. If you have p > 0.5,
then apply the algorithm below to q = 1-p, and then negate the value
for X obtained. (This approximation is also from Abramowitz and
Stegun.)
t = sqrt[ ln(1/p^2) ]
c_0 + c_1*t + c_2*t^2
X = t - ------------------------------
1 + d_1*t + d_2*t^2 + d_3*t^3
c_0 = 2.515517
c_1 = 0.802853
c_2 = 0.010328
d_1 = 1.432788
d_2 = 0.189269
d_3 = 0.001308
See Abramowitz and Stegun; Press, et al.
----------------------------------------------------------------------
6. How do you generate a normally distributed random variable?
There are several good ways to do this, some better than others.
It is assumed that the user has available a good uniform pseudo-
random deviate generator available. This is the starting point
for all such approaches.
The first algorithm is to make use of the central limit theorem.
This says that the sum of a set of independent, identically
distributed (commonly abbreviated as iid) random numbers will
approach normality as enough of them are added together.
So, simply generate n uniform [0,1] random numbers, and sum them
up. Then shift and scale this sum to have the appropriate mean
and variance to give us a roughly N(0,1) random variable. Thus,
sum(X_i) - n/2
Z = --------------
sqrt(n/12)
Note that if we choose n = 12, we need not even do a divide,
Z = sum(X_i) - 6
While the method described above is quite simple to program, it
does a poor job of generating random deviates in the tails of the
distribution. There are far better ways to do accomplish this
task. A logical way to generate a normal random deviate is to
look back to the answer to question (5).
Simply generate a single uniform [0,1] random deviate X, and then
invert the cumulative normal at that point. If the inverse is an
accurate one, then that number will be a pseudo-random normal deviate.
This method will be as fast and as good as is the approximation
you choose for the inverse of the cumulative normal.
The other common method, which generates pairs of normal deviates
at a time (or if you want, bivariate normal deviates) is the Box-
Muller transformation. See Numerical Recipes (Press, et. al.) for
an explanation of this method.
----------------------------------------------------------------------
7. If X is a N(0,1) distributed random variable, how do you convert
it into N(mu,sigma^2)?
Referring back to (1), we see that a simple transformation will
accomplish this.
Z = mu + sigma*X
----------------------------------------------------------------------
8. How do you generate multivariate random normal variates with
a specified correlation matrix and set of variances?
Lets start with a simple case, just a bivariate normal. We will
assume s1 and s2 are the standard deviations, c is the desired
correlation between them (-1 <= c <= 1) and mu1 and mu2 are the
desired means.
First generate two unit normal random deviates, N1 and N2.
X = mu1 + s1*N1
Y = mu2 + c*s2*N1 + sqrt(1-c^2)*s2*N2
X and Y will be bivariate random normal deviates with the desired
distribution parameters.
Now for the more general case of a multivariate normal where "multi"
means larger than two.
Suppose we have a correlation matrix C. We need to convert this to a
covariance matrix, S, first.
Let D be a diagonal matrix with the desired variable standard
deviations on its diagonal. Then
S = D C D
Form the Cholesky factorization of S (actually any matrix "square
root" of s will do.) This is a factorization of S in terms of a lower
triangular matrix L such that
S = L' L
This factorization always exists whenever S is positive definite
(non-singular in this case, since we started from a supposedly
valid correlation matrix.)
Generate an n by p array, A, of N(0,1) random deviates, where p is
the number of variables and n is the number of multivariate sample
you wish to end up with. Then
B = A L'
is an array of normally distributed random deviates where each row
is an event with roughly the correlation/covariance matrix you wanted.
(It won't be exactly that because of pseudo-random variability.)
Once you have generated a set of random deviates with the desired
correlation matrix and mean vector zero, simply add a vector offset
to obtain any desired mean.
If the correlation/covariance matrix supplied is singular, a similar
approach can still be used, but now we must define L using a different
type of matrix square root. This can be achieved using an eigenvalue
decomposition of the matrix S.
----------------------------------------------------------------------
9. If X is a normally distributed random variable, what is the
distribution of f(X)?
There are several simple cases we can cover. If X and Y are N(0,1)
independent R.V.s, then
aX + bY is a normal random variable for constants a and b.
X/Y is a Cauchy random variable (and its moments are undefined.)
e^X is a lognormal random variable.
X^2 + Y^2 is a Chi-squared random variable with two degrees of
freedom.
sqrt(x^2 + y^2) is a Rayleigh random variable.
These distributions along with many others can all be found in
Johnson, et al.
The general problem of determining the distribution of a function of
one or more random variables is known as statistical tolerancing.
This can be done by several techniques, Monte Carlo methods, first
or second order Taylor series approximations, and Taguchi style
methods (actually related to Gaussian quadrature) to name a few.
A common approach it to estimate the first few (generally either
2 or 4) moments of the transformation in question and then find a
distribution which has those same moments. A family of distributions
which can be used to match the first 4 moments is the Pearson system
(see Johnson, et al.)
See D'Errico & Zaino for a discussion of statistical tolerancing.
----------------------------------------------------------------------
10. What are some good references on these topics?
The references listed here are not necessarily the best possible ones
for this subject.
1. Abramowitz and Stegun, "Handbook of Mathematical Functions",
Dover Publications, 1965 (originally published by the National
Bureau of Standards, 1964)
2. Press, Flannery, Teukolsky, Vetterling, "Numerical Recipes",
Cambridge University Press, 1986
3. Hart, et al, "Computer Approximations", Wiley, 1968
4. Johnson, Kotz, & Balakrishnan, "Continuous Univariate
Distributions", Volume 1, Wiley, 1994
5. Kalos & Whitlock, "Monte Carlo Methods", Wiley, 1986
6. Knuth, "The Art of Computer Programming, Volume 2, Seminumerical
Algorithms", Addison-Wesley, 1969
7. D'Errico & Zaino, "Statistical Tolerancing Using a Modification
of Taguchi's Method", Technometrics, Vol 30, no. 4
======================================================================
John D'Errico, derrico@pixel.kodak.com
Any errors or omissions are my fault and mine alone.
======================================================================
Saturday, January 31, 2009
Where's my $HOME?
Once upon a time, life was simple. My progress could be measure by the state of my home directory. My work lived in folders that represented input or output data, program source code, and latex source code. Beyond that there were some odds and ends like gnuplot script files and the resulting postscript files, either to analyze or include in whatever paper I was writing. I also had a Mail directory that my email client (elm at first and then pine) access, but I didn't really use email that much. I would use emacs (or vi if it was a quick edit) to edit a file or run my programs to generate output data. That was pretty much it. I lived in the shell. I could telnet from wherever and have the same environment from wherever I was. As time went on, I found myself using emacs more and more as my shell, but the idea was basically the same.
Now... I usually have email, a web browser, and some office app or adobe open, usually to read, and sometimes to write. I'll be editing code sometimes (less often then I like) and have used pythonwin, but recently I've rediscovered emacs (now on windows). Instead of constantly having an awareness of my directory structure, because I'm cd'ing around it in the shell, it's mostly a place to store some program's files or downloads from my web browser. It ends up getting messy because I'm not as cognizant of where everything belongs. The file system is just a dumping ground.
To make matters worse, I have stuff on multiple computers (home, work, and laptop) and I have stuff that just lives on the web (delicious, twine, google docs, evernote, whatever I'm intrigued by this week). I'm homeless. Where's $HOME?
Holy cow. Where am I? How's my progress? What's my operating system doing for me? I used to live in the shell. Todo management? "grep -R 'TODO' ." I'd make up scripts to handle common tasks and throw them in ~/bin. A lot of that was made possible by the miracle of plain text. Unfortunately, I deal in binary files (.doc, .ppt, .pdf, .xls, etc.) these days and I have to open some separate window to look at them.
Now... I usually have email, a web browser, and some office app or adobe open, usually to read, and sometimes to write. I'll be editing code sometimes (less often then I like) and have used pythonwin, but recently I've rediscovered emacs (now on windows). Instead of constantly having an awareness of my directory structure, because I'm cd'ing around it in the shell, it's mostly a place to store some program's files or downloads from my web browser. It ends up getting messy because I'm not as cognizant of where everything belongs. The file system is just a dumping ground.
To make matters worse, I have stuff on multiple computers (home, work, and laptop) and I have stuff that just lives on the web (delicious, twine, google docs, evernote, whatever I'm intrigued by this week). I'm homeless. Where's $HOME?
Holy cow. Where am I? How's my progress? What's my operating system doing for me? I used to live in the shell. Todo management? "grep -R 'TODO' ." I'd make up scripts to handle common tasks and throw them in ~/bin. A lot of that was made possible by the miracle of plain text. Unfortunately, I deal in binary files (.doc, .ppt, .pdf, .xls, etc.) these days and I have to open some separate window to look at them.
Saturday, January 03, 2009
50th time is the charm with GTD?
Well, it's a new year and I, like many, and making an effort to organize. I've read David Allen's Getting Things Done and been inspired, but have never quite got a system I'm terrifically happy with. Usually, I try something new, move all my lists, get a temporary boost (more from inspiration than efficiency) and then things end up about the same.
Well, this is another one of those times! This post from March sounds startling like this one, but I will press ahead. The idea of using google notebook didn't last long. I'm not sure why. I think part of it is the interface. It just has all these borders and stuff that get in the way of the lists. I fell in love again with emacs and after using org-mode for outlining, put my lists in there and used svn for versioning. I found myself reordering branches, indenting, and generally messing with my giant .org files (work and home). Worse, I found I was generally doing things that never made it on the list and not looking at the list to see what to do.
Next (yes, there's a next), I tried taskodone, which based on taskpaper. I was inspired by this review of taskpaper
All of the above with tags and outlines actually has little to do with what's in David Allen's book. His lists are just lists. Many of us have this need to keep actions tied (electronically, preferably) to their projects. His project list though, is just a list of projects! You only consult it at the weekly review or when adding a new one. The other lists or lists contain your next actions.
Where does the planning happen? In the project plans, of course. These project plans are culled for actions in the weekly review. They go in folders in the book, I believe. You get out these manilla things, look and alter them, and then update your plain old flat GTD lists. I had never quite figured this out (and am still not sure I have).
Where are the project plans in my org-mode giant list? Well, they are *all* right in there. Yes you can collapse and expand and only look at what you need with some fancy emacs keystrokes, but still, they are *all* right there. Too much for me.
Where are the project plans with taskpaper? Well, umm, umm... There's a flat list of tasks under the project headings, but that's just not enough for a project of any complexity. You can break a project into separate project (which I did) and have something like "Build a house:", "Build a house - hire an architect," " Build a house - hire a contractor," etc. That's just cheating though, and your forced to look at all the sub-projects.
OK, so my projects don't really live in manilla folders too often. What could I use? Well, gee, how about those things on the filesystem we call "folders?" Wait, it's all coming back to me now. Back in grad school, I had a small set of folders that my life depended on. One folder would hold the .tex, .ps, and gnuplot script files that represent the paper (or my thesis) that I was working on. The other folder(s) would hold code and the output of that code. The state of those folder represented my progress in pretty absolute terms. When the paper was done, refereed, and published, I was pretty much done. I would put "TODO" or "FIXME" into code or tex files and have TODO files with lists of task. grep was my friend for finding where work was needed.
Much of that simplicity came from what I was doing and it was just natural. Now I'm more connected with other people and there's a lot of quick turnaround exchanges via email that can be confused with advancing the ball. Often there's not a product I'm actively working on that would naturally go in a folder. I might have one due in the distant future, but haven't actually got to the point of creating a folder and opening a word processor.
So here's the idea -- use project folders. Have a project? Make a folder. Put some stuff in there that represents what's been done and what has to be done. The state of these folders represent progress (or lack thereof). I'm thinking org-mode can help here with a TODO.org with the tasks. Keep TODOs separate from notes, and references. Flag the next action with TODO and use org agenda to pull the master action list together. The project list is just "ls ~/projects."
Really the only revelation (to me) is that having all your project plans in one file can be terribly distracting. OK. I've wasted enough time not actually doing things while writing this, so back to work!
Well, this is another one of those times! This post from March sounds startling like this one, but I will press ahead. The idea of using google notebook didn't last long. I'm not sure why. I think part of it is the interface. It just has all these borders and stuff that get in the way of the lists. I fell in love again with emacs and after using org-mode for outlining, put my lists in there and used svn for versioning. I found myself reordering branches, indenting, and generally messing with my giant .org files (work and home). Worse, I found I was generally doing things that never made it on the list and not looking at the list to see what to do.
Next (yes, there's a next), I tried taskodone, which based on taskpaper. I was inspired by this review of taskpaper
"TaskPaper’s strength is that it lets you focus on crossing out those tasks instead of building a self-referential web of unfinished business which separates you from the cold, harsh reality of all the work you need to do." — Scott McNulty, Tuaw.com
Hey, that's me! OK, so I tried taskodone and I like it, and I really think it might be OK. It has a lot of charm in that it's pretty fiddle proof with only one level of indendation. It didn't quite stick with me though. I still wanted to reorganize my unfinished business. I also didn't like looking at it all all the time. I suppose some discipline, like just looking at the @na lines would work, but still when I opened it to delete a finished line or add a next task, I'm still looking at everything.All of the above with tags and outlines actually has little to do with what's in David Allen's book. His lists are just lists. Many of us have this need to keep actions tied (electronically, preferably) to their projects. His project list though, is just a list of projects! You only consult it at the weekly review or when adding a new one. The other lists or lists contain your next actions.
Where does the planning happen? In the project plans, of course. These project plans are culled for actions in the weekly review. They go in folders in the book, I believe. You get out these manilla things, look and alter them, and then update your plain old flat GTD lists. I had never quite figured this out (and am still not sure I have).
Where are the project plans in my org-mode giant list? Well, they are *all* right in there. Yes you can collapse and expand and only look at what you need with some fancy emacs keystrokes, but still, they are *all* right there. Too much for me.
Where are the project plans with taskpaper? Well, umm, umm... There's a flat list of tasks under the project headings, but that's just not enough for a project of any complexity. You can break a project into separate project (which I did) and have something like "Build a house:", "Build a house - hire an architect," " Build a house - hire a contractor," etc. That's just cheating though, and your forced to look at all the sub-projects.
OK, so my projects don't really live in manilla folders too often. What could I use? Well, gee, how about those things on the filesystem we call "folders?" Wait, it's all coming back to me now. Back in grad school, I had a small set of folders that my life depended on. One folder would hold the .tex, .ps, and gnuplot script files that represent the paper (or my thesis) that I was working on. The other folder(s) would hold code and the output of that code. The state of those folder represented my progress in pretty absolute terms. When the paper was done, refereed, and published, I was pretty much done. I would put "TODO" or "FIXME" into code or tex files and have TODO files with lists of task. grep was my friend for finding where work was needed.
Much of that simplicity came from what I was doing and it was just natural. Now I'm more connected with other people and there's a lot of quick turnaround exchanges via email that can be confused with advancing the ball. Often there's not a product I'm actively working on that would naturally go in a folder. I might have one due in the distant future, but haven't actually got to the point of creating a folder and opening a word processor.
So here's the idea -- use project folders. Have a project? Make a folder. Put some stuff in there that represents what's been done and what has to be done. The state of these folders represent progress (or lack thereof). I'm thinking org-mode can help here with a TODO.org with the tasks. Keep TODOs separate from notes, and references. Flag the next action with TODO and use org agenda to pull the master action list together. The project list is just "ls ~/projects."
Really the only revelation (to me) is that having all your project plans in one file can be terribly distracting. OK. I've wasted enough time not actually doing things while writing this, so back to work!
Thursday, October 30, 2008
proud papa
I installed python on my 10-year-old's computer a while ago and showed him a few things. He's really into football now and for some reason he wanted to calculate NFL passer ratings for some QBs. I helped him write a program in python that prompts the user for completions, attempts, etc. and then prints out the passer rating. This morning I found him on nfl.com looking up stats and punching them into his new calculator. I think it's the first time he's created a tool. It feels a bit like watching him discover fire. I'm a proud papa.
Thursday, September 25, 2008
good old SQL
It's been a little while since I've done this, but somewhere along the way while I was building a dictionary of dictionaries, that what I was doing was creating a little database and that there was this language extracting data from such things. Plus, the data was already in an MS Access database, which I could query using odbc.
I think I can get about 90% of what I want with a well-crafted query. I could do the rest in python, but I think I could simplify things with a user-defined aggregate function. I think I could do that with Access but I don't know how, so I'll probably move the data to sqlite and then use that to define my functions. Knowing that sqlite trick is also probably a more reusable skill (for me at least).
The database came with lots of nearly impenetrable VB code that creates temporary tables and aggregates in VB instead of in SQL (using GROUP BY). To make sure I understand what they're doing (and the data) before I start generating my own reports, I'd like to be able to reproduce their results. It'd be a neat trick to turn thousands of lines of VB into a tenth that in SQL and python. Ulimately, it would be more transparent and maintainable as well.
I think I can get about 90% of what I want with a well-crafted query. I could do the rest in python, but I think I could simplify things with a user-defined aggregate function. I think I could do that with Access but I don't know how, so I'll probably move the data to sqlite and then use that to define my functions. Knowing that sqlite trick is also probably a more reusable skill (for me at least).
The database came with lots of nearly impenetrable VB code that creates temporary tables and aggregates in VB instead of in SQL (using GROUP BY). To make sure I understand what they're doing (and the data) before I start generating my own reports, I'd like to be able to reproduce their results. It'd be a neat trick to turn thousands of lines of VB into a tenth that in SQL and python. Ulimately, it would be more transparent and maintainable as well.
Wednesday, September 24, 2008
defaultdict and R
In answering a question on the python-tutor list, someone used collections.defaultdict. I didn't know about this. I guess it looks a little cleaner to say a = defaultdict(list) rather than use a.setdefault(key,[]).
Well, that isn't much learning for the day, but it's something.
I'm also trying to learn and use R, but fell back to python for my task today when I couldn't quickly figure out how to do what I wanted to in R. R is frustrating right now because it seems so insanely powerful, but at the same time foreign to how my python-oriented brain thinks.
I have "Modern Applied Statistics with S-PLUS" and learned to parrot the examples in order to do some cool stuff, but when I have to deviate much from the examples, I find myself a bit lost. I figure I don't understand the basics of the language well enough and now have "Programming with Data." Hopefully this will get me up to speed, so I can be one of those guys on the R-help list who can answer a question with six different one-liners to accomplish the same thing.
Well, that isn't much learning for the day, but it's something.
I'm also trying to learn and use R, but fell back to python for my task today when I couldn't quickly figure out how to do what I wanted to in R. R is frustrating right now because it seems so insanely powerful, but at the same time foreign to how my python-oriented brain thinks.
I have "Modern Applied Statistics with S-PLUS" and learned to parrot the examples in order to do some cool stuff, but when I have to deviate much from the examples, I find myself a bit lost. I figure I don't understand the basics of the language well enough and now have "Programming with Data." Hopefully this will get me up to speed, so I can be one of those guys on the R-help list who can answer a question with six different one-liners to accomplish the same thing.
emacs org-mode
I tried org-mode a while ago, loved it, and then inexplicably stopped using it. I think I was trying to figure out how to integrate my entire workflow (calendar, task, notes, etc.) in emacs and it was too much at once. This time I was just using it for notes/outlining and it is quite amazing. The keys are very intuitive also. You use the Meta and Shift keys combined with the arrow keys to move sections up and down, indent, etc.
I also used the table mode to actually make a table I needed and it is, quite simply, incredible. It's amazingly intuitive to fill a table, move rows and columns, insert, and delete. The thing can even be used as a spreadsheet, but I haven't tried that yet.
I also used the table mode to actually make a table I needed and it is, quite simply, incredible. It's amazingly intuitive to fill a table, move rows and columns, insert, and delete. The thing can even be used as a spreadsheet, but I haven't tried that yet.
Subscribe to:
Posts (Atom)
