TimLennox.com, since 2007. Politics, Civil Rights, Science, Sociology, Photography, Media + more!
Sep 9, 2009
Math Help Please!
Like many journalists, I am math challenged. So will someone please help confirm or deny a suspicion of mine?
When you have a very small sample size... or a small number of anything... changes in the group are difficult to attribute to a particular cause.
Here's what I'm talking about.
2008 Labor Day Holiday Weekend: 8 people die on Alabama Highways.
2009 Labor Day Holiday Weekend: 6 people die on Alabama Highways.
Isn't it statistically impossible to say what caused the drop, since the original number was so small to begin with? Even if there had been zero fatalities over the weekend... a 100% drop... wouldn't it be relatively and statistically almost meaningless (except of course for those who didn't die)?
Thanks in advance for any assistance.
Right, Tim. Remember what a famous theologian said: "there are three kinds of lies--lies, damn lies and statistics."
Small numbers are statistically insignificant. I, an English major, am no expert on this subject. But I had to proofread my wife's doctoral dissertation twice, and I learned a thing or two about statistics.
Now you see why Vermont has such a low "crime rate." Probably, there's proportionately a similar rate as other largely rural areas. Every time someone up there goes batty and knocks over a 7-11, it's a "crime spree."
To answer your question - in a word, "no."
Let's consider your question, Tim.
Samples are taken because they can represent the greater part, or the whole, accurately. The degree of accuracy is expressed in terms of two things: 1.) the confidence level, and 2.) the sample size.
A confidence interval is a range of numbers within which the value of a certain parameter is likely to lie. In other words, it's a guide to accuracy. The confidence level is the probability that the parameter's value actually falls within that range.
Let's examine some numbers to see how this works.
Using the entire population of Alabama (4,600,000) as an approximate guide, were we to sample 25 people at a 95% confidence level, we'd have a margin of error of roughly 20%. Not good.
Were we to quadruple that sample to 100, our margin of error at 95% confidence would be about 10%.
However, were we to sample 1,000 people, the margin would shrink to about 3.2%.
A sample of 2,000 would yield about a 2.2% margin, while a sample of 5,000 would yield about 1.4%.
But let's tighten that confidence level to 99%, keeping the population at 4.6 million, and see what happens.
A sample of 200 would give us roughly a 9% margin of error, while a sample of 400 would give about 6.4%.
Notice that when the confidence level tightens, the sample size needed for a given accuracy increases: a 5% margin of error requires about 385 respondents at 95% confidence, but about 664 at 99%.
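The arithmetic behind these figures can be checked with a few lines of Python. This is a minimal sketch using the textbook formula z·sqrt(0.25/n) for the worst-case proportion (p = 0.5), so the percentages differ slightly from those produced by any particular online calculator:

```python
import math

# Margin of error for an estimated proportion, worst case p = 0.5,
# assuming simple random sampling from a very large population.
# Standard z-values: 1.96 for 95% confidence, 2.576 for 99%.
def margin_of_error(n, z=1.96):
    """Half-width of the confidence interval for a proportion."""
    return z * math.sqrt(0.25 / n)

for n in (25, 100, 1000, 2000, 5000):
    print(f"n={n:5d}  95% margin = {margin_of_error(n):.2%}")

# Sample size needed for a 5% margin at each confidence level:
for z, label in ((1.96, "95%"), (2.576, "99%")):
    n = math.ceil((z / (2 * 0.05)) ** 2)
    print(f"{label} confidence, 5% margin -> n = {n}")
```

Running it shows the margin falling from roughly 20% at n = 25 to under 1.5% at n = 5,000, and the required sample growing from 385 to 664 as the confidence level tightens from 95% to 99%.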
But let's get smaller.
If our population were 1,000, at a 95% confidence level, a sample of 5 would carry a margin of error of about 50%. Again, not good.
But if we raise that sample to 20, the margin becomes roughly 23%. Still not good.
To get a 5% margin of error on a population of 1,000 at 95% confidence, we'd need a sample of around 280.
But, let's get smaller yet.
On a population of 100, at a 95% confidence level, obtaining a 5% margin of error would require about 80 respondents.
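These small-population figures come from applying a finite-population correction to the sample-size formula. A sketch of the common calculator version (the rounding convention may differ slightly from the figures quoted above):

```python
import math

def required_sample(N, margin=0.05, z=1.96):
    """Sample size for estimating a proportion to within `margin`,
    with the finite-population correction applied."""
    n0 = (z ** 2) * 0.25 / margin ** 2        # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / N))

print(required_sample(4_600_000))  # about 385: population size barely matters
print(required_sample(1_000))      # about 278: the correction starts to bite
print(required_sample(100))        # 80: most of the population must be sampled
```

Notice the pattern: for a population of 4.6 million you need well under 0.01% of the people, but for a population of 100 you need 80% of them.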
Now... that may not have answered your question, but hopefully it advanced some understanding of the relationship between sample size and accuracy.
This site - www.mmucc.us - may prove useful for stats on crashes in the USA.
As for inferences that could be drawn from the data, that's a whole 'nother ball game. May I suggest directing those questions to the following NHTSA statisticians and the Mathematical Analysis Division (MAD):
John Kindelberger, NHTSA (202-366-3365). - mathematical statistician in NHTSA’s National Center for Statistics and Analysis
Dr. Rory A. Austin, PhD NHTSA rory.austin@dot.gov - mathematical statistician within the Office of Rulemaking
Dr. Chou-lin Chen, NHTSA - Division Chief of the Mathematical Analysis Division at National Highway Traffic Safety Administration’s (NHTSA) National Center for Statistics and Analysis (NCSA)
Tom Bragan, NHTSA - senior technical editor and NCSA staff supervisor for Advanced Systems Technology & Management, Inc.
National Highway Traffic Safety Administration
National Center for Statistics & Analysis
1200 New Jersey Avenue, S.E.
West Building
Washington, DC 20590
Phone Numbers:
Washington, DC Line: 202-366-1503
Automated Data Request Line: 1-800-934-8517
Fax Number: 202-366-7078
Email:
NCSAWeb@nhtsa.dot.gov
Oh, and Jay... I have a little something for you, too!
As promised, here's something for you, Jay!
The phrase "lies, damned lies, and statistics" is probably best known from American author Samuel Clemens, better known by his pen name "Mark Twain," in the 1924 "Mark Twain's Autobiography" (p. 246).
The phrase, however, most likely had its origins with British Prime Minister Benjamin Disraeli, which Clemens acknowledged, writing, "Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: 'There are three kinds of lies: lies, damned lies, and statistics.'"
Yet even Clemens' attribution is not conclusive.
A similar example is the phrase most often attributed to Benjamin Franklin which says that "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."
The phrase did not originate with Franklin, and was in wide popular use during our Revolutionary period. The earliest printed appearance is in a November 11, 1755 letter from the Assembly of Pennsylvania to the Governor. It also appeared four years later in the body of Franklin's 1759 "Historical Review," as the motto of the book.
Sources: Frothingham, Rise of the Republic of the United States, p. 413; and The Papers of Benjamin Franklin, edited by Leonard W. Labaree, vol. 6, p. 242 (1963)
Just a bit more on statistics!
In classical statistics, we can use a deck of cards or dice to illustrate the concepts.
A standard deck has 52 cards in four suits. Each suit has 13 cards, ranking from Ace=1 through King=13. A die (singular of dice) is a cube with six equal-sized faces, each marked with pips numbering it uniquely from 1 to 6.
The chance, or probability, of drawing the Ace of Spades is 1/52. In other words, if you drew 52 times (replacing the card each time), you would expect to draw the Ace of Spades once on average, though the probability of seeing it at least once in those 52 draws is actually only about 64%.
Consider the die. If you threw a die six times, you would expect the number 6 to appear on top once on average (the chance of at least one 6 is about 67%). And remember... the opposite faces of a die always total 7. Opposite the 6 is 1. Opposite the 5 is 2. Opposite the 4 is 3.
Now, if you had two dice, the picture changes. Each die still behaves the same on its own, but the total of the two dice follows a new distribution: there are 36 equally likely ordered outcomes, and the eleven possible totals from 2 to 12 are not equally likely.
The total 7 is the most likely, because it can be made six ways (1+6, 2+5, 3+4, 4+3, 5+2, 6+1), giving a probability of 6/36 = 1/6. By contrast, a total of 2 or 12 can each be made only one way, for a probability of 1/36.
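Both facts, the chance of seeing the Ace of Spades at least once in 52 draws and the uneven distribution of two-dice totals, can be verified exactly by enumeration. A quick Python check:

```python
from fractions import Fraction
from itertools import product

# Drawing with replacement 52 times: one Ace of Spades is the *average*
# outcome, but the probability of seeing it at least once is well below 1.
p_at_least_once = 1 - Fraction(51, 52) ** 52
print(float(p_at_least_once))          # ~0.636, not certainty

# Enumerate all 36 equally likely ordered outcomes of two dice.
counts = {}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] = counts.get(a + b, 0) + 1

print(counts[7])   # 6 ways -> probability 6/36 = 1/6
print(counts[2])   # 1 way  -> probability 1/36
```

The exact-fraction arithmetic avoids any rounding question: the "at least once in 52 draws" probability is 1 - (51/52)^52, about 63.6%.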
Let's take that fundamental concept and apply it (albeit with some further analysis) to the examination or study of anything. You mention traffic fatalities on Labor Day weekend. So for this study, we need people, cars and death... and a time frame - the Labor Day weekend.
Ever wonder how many people die from choking on hot dogs on Labor Day? Are there more hot dogs sold than miles driven? What’s the relationship between buns, burgers and miles driven? Are burger eaters more likely to die if they traveled more than 50 miles? What about drowning deaths on public waters on LD? Or children that drown in backyard wading pools while choking on hot dogs? Or, train/car collision deaths on LD? Or, even heart attacks on the Tuesday afterwards?
These are all questions that statistics could answer.
Earlier, we discussed sample size as a method of obtaining accurate results.
Afterwards, when re-reading your question and my response, I didn't think I had answered the heart of your question, or the core issue you hoped to address. As I now understand it, you were asking how findings relate to a selected sample, and I addressed that by discussing the interpretation of results.
However, what I sensed that you may have been asking about is how a sample size could be so small, yet obtain accurate results.
Pointedly, you asked if it were statistically impossible to determine what might have caused a drop in traffic-related deaths over the Labor Day weekend.
I said "no," because we can account for certain events (control for variables), and we can look for correlations that point to cause and effect. (This may be getting a bit deep for the "non-math" type. But I'll continue.)
Let's get kinda' silly in this, and create a scenario where there are no traffic-related deaths over the Labor Day weekend - the "100% drop" to which you referred.
Would it be accurate to infer that no cars were driven? Why, of course not! That possibility is remote enough that we can all but rule it out, though it remains a possibility nevertheless.
Were there fewer cars driven? Perhaps, but not likely. There are more cars, and more people now than last year, and the year before and the year before, etc. So, we could probably rule out that possibility.
What about folks driving fewer miles? Well, that could be a possibility. Why? We're in some stressful economic times nationally, which have affected Alabamians as well. So, folks might be more inclined to stay nearby rather than travel far. Travel, of course, takes money. Combined with the fact that Tuesday is a work & school day, we might also consider that "recovery" time from such a mini-vacation could be important. Thus, staying closer to home would or could facilitate that.
If there were wrecks, what factors could be considered? Type and severity of the wreck? Speed? Seat belt usage? Air bags? Intoxication with ETOH/narcotics/prescription meds? Sleep deprivation? Age? Vehicle condition? Road condition?
With the introduction of these various scenarios, what we're starting to see here are factors for which we can control, can examine, and which can affect the outcome. What is the outcome? Traffic-related deaths over the Labor Day weekend.
Let's get into some numbers.
Alabama's estimated population - 4,600,000
sample size - 400
4,600,000 / 400 = 11,500
A 1% sample of 4,600,000 is 46,000. That's a whole lotta' folks!
In fact, if we looked at samples as a percentage of the whole and cut that 1% in half to 0.5% (0.005), we'd still have 23,000 people! That's far more than we need to obtain accurate results at a 95% or 99% confidence level.
But we don't need such a large sample in order to obtain accurate results. That's part of the entire point of sampling!
You don't buy an entire manufacturing lot of milk or bread when you go to the grocery store, do you? No! You sample. Same thing when you contemplate buying wine. You sample. No need to buy a case (or more) to determine if it's good... right?
That's the beauty of sampling.
Margin of error for that sample of 400 - about 5%.
A sample size of less than 1% can yield accurate results if the population is sufficiently large.
However, notice what happens when the population is small. The sample size - expressed as a percentage of the population - increases.
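This point can be made concrete. With the finite-population correction, a sample of 400 gives nearly the same margin of error whether the population is 4.6 million or 100,000; only when the population shrinks toward the sample size does the margin change noticeably. A sketch:

```python
import math

def margin(n, N, z=1.96):
    """Margin of error for a proportion (worst case p = 0.5), applying
    the finite-population correction sqrt((N - n) / (N - 1))."""
    fpc = math.sqrt((N - n) / (N - 1))
    return z * math.sqrt(0.25 / n) * fpc

for N in (4_600_000, 100_000, 10_000, 1_000):
    print(f"N={N:>9,}  n=400  margin = {margin(400, N):.2%}")
```

For populations much larger than the sample, the margin sits at essentially 4.9% regardless of N, which is why the absolute sample size, not the sampling fraction, drives accuracy.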
Remember the "bell curve"?
The profile, or outline of the bell shape is created when we analyze a statistically normal group.
Let's consider, for example, humans... or, if you prefer, let's consider numbers as representing humans.
Take the number 1,000,000.
If we physically examined, considered or tested 1,000,000 humans, we'd expect to see certain normal findings, such as two arms, two legs, one nose with two nostrils, one mouth, two lungs, and reproductive capacity. We'd also expect a range of abnormalities: health conditions such as diabetes, hypertension, and hypercholesterolemia; emotional and cognitive conditions such as anxiety, depression, and schizophrenia; reproductive incapacity; GI problems; and others.
We'd expect to find most people are "normal," i.e., two arms, legs, etc. while a minority would be born with or develop loss of limb(s), or other health impairments.
We all know what "normal" is in health, so I draw this analogy to illustrate a point.
Most folks would be normal.
That is to say, we would expect with a great level of confidence, say 95%, to find most folks as being normal.
The symmetry of the bell curve means that about 95% of the population falls within two standard deviations of the mean. The remaining 5% splits evenly: we'd expect to find 2.5% in the upper tail and 2.5% in the lower tail. Yet those tails are part of the "normal" findings for a population. Mathematically, of course, 2.5+2.5=5.
The mean is the mathematical average. For example, 1+2+3+4+5=15. We have five numbers. Divide the total (15) by the number of items (five) to obtain the average: 15/5=3. The mathematical average, or mean, of those numbers is 3.
The median, however, is the item found literally in the middle. In this case, the median is also 3. In a symmetric distribution like the bell curve, we expect the mean and the median to be very close.
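Python's standard library computes both, and a skewed example shows why the two can differ. A small sketch:

```python
import statistics

data = [1, 2, 3, 4, 5]
print(statistics.mean(data))     # 15 / 5 = 3
print(statistics.median(data))   # middle item is also 3

# One extreme value drags the mean upward but leaves the median alone:
skewed = [1, 2, 3, 4, 100]
print(statistics.mean(skewed))   # 110 / 5 = 22
print(statistics.median(skewed)) # still 3
```

That resistance to extreme values is why the median is often preferred for skewed quantities like incomes or house prices.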
These points and issues are part and parcel of the analysis that is done by NHTSA and others to determine whether or not regulations or efforts are successful.
The bigger message?
Be smart. Smart is cool. Stay in school. Learn math. Help people.
Good info, Kevin L. I just hope Tim doesn't pay for blog space by the word ;-p