First and foremost, I have been out of school a while and my memory of statistics is hazy at best, but I have a business situation. Please consider this when forming your response.
I work for a calibrating firm and we need to calculate machine uncertainty as described in ASTM procedure E74 (I attached a copy of the page. You may want to read 8.1 - 8.4 before answering). We take a series of known values and compare them against what the customer's equipment reads (i.e. 10.00 lb weight weighed 10.20 lbs, 20.00 lb weight weighed 20.30 lbs, etc.). Based on this data we are supposed to "fit a polynomial equation...using the method of least squares". Then, using this fitted equation, I calculate the standard deviation, which I then use to report my uncertainty.
I would love to (A) comprehend what the heck this means and (B) set up a template in Excel for completing the calculations going forward. ANY HELP you can provide would be greatly appreciated. Most of the info on the web is way too technically oriented.
T-H-A-N-K-S !!!!
Answer barry -
unfortunately, the attachment sent with your question did not seem to get
transmitted to me. you might want to email the tech staff at allexperts and
find out if one can get an attachment sent to the 'expert'.
but what you are talking about - polynomial regression using least squares - is
pretty standard stuff - so i can perhaps supply some background information, even
without seeing what specifics are in the attachment.
the point of view taken in regression is that if one were to obtain repeated
y-values for the same true weight x, the various y-values will vary somewhat -
due to random fluctuations in the customer's equipment and possibly also in the
measurement process. these y-values fluctuate about an unknown true average
(or mean) value - which i'll call M(x). the notation indicates that the true mean
for the y-values is different for the different x's. if one were to graph M(x)
against x, the result would be some sort of curve - called the true regression
curve [or function].
if one were to add to this graph the observed (x,y) values from the data, plotted
as points, the resulting picture would show that some of the observed (x,y) points
lie above the M(x) curve and others below it. i.e. the M(x) curve passes through
the cluster of observed (x,y) points.
if one knew the true regression curve M(x) [over a range of x-values], one would
use it to predict the y-value that would occur for a weight x - especially one
not actually represented in the calibration data set. in this context, x is called
the explanatory [or independent or predictor] variable and y is the response [or
dependent] variable.
of course one doesn't know what the actual M(x) curve is. one can see only the
cluster of observed (x,y) points on the graph.
a regression analysis assumes some sort of mathematical form for M(x) [a straight
line or a polynomial, for example] and tries to fit a curve of that form through
the cluster of observed (x,y) points, in the hope of getting a reasonably close
approximation to [or estimate of] the true M(x) curve. the [assumed] mathematical
expression for M(x) is called the model for M(x). the curve is fitted using a
regression routine [such as in excel] and results in a fitted [data dependent]
curve m(x), which is an estimate of M(x).
for any given x, m(x) is then used as the predictor of a y-value to be obtained at that x.
the simplest sort of regression analysis assumes that M(x) is (nearly) a straight
line. [this will usually be so if one restricts x to a fairly narrow range of
values. any smooth curve looks straight over a short distance.] that is, one
adopts the model
M(x)= A + Bx.
one can use a regression routine [as in excel] to fit a straight line through the
cluster of observed (x,y) points and get
m(x) = a + bx.
the 'a' and 'b', called the fitted regression coefficients, are calculated from the observed data by the regression routine and estimate the true but unknown 'A' and 'B'.
so for a weight of 15 lbs, say, you would predict that the customer's equipment
would give a y-reading of m(15) = a + 15b.
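to make the straight-line case concrete, here is a minimal sketch in python of what a regression routine does to get 'a' and 'b'. the five (x,y) pairs are made up for illustration [they are not from E74 or from any real calibration], and the name fit_line is just a label chosen here.

```python
# hypothetical calibration data (invented for illustration only):
# x = known weights in lb, y = readings from the customer's equipment
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [10.2, 20.3, 30.1, 40.4, 50.3]

def fit_line(x, y):
    """least-squares straight line: returns (a, b) for m(x) = a + b*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # sums of squares and cross-products about the means
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx          # fitted slope
    a = ybar - b * xbar    # fitted intercept
    return a, b

a, b = fit_line(x, y)
pred_15 = a + 15 * b       # predicted reading for a 15 lb weight
print(f"m(x) = {a:.4f} + {b:.4f} x")
print(f"predicted reading at 15 lb: {pred_15:.3f}")
```

in an excel template the same two numbers should come out of the SLOPE and INTERCEPT worksheet functions applied to the y and x columns.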
for a given x, the extent to which actual [or potential] y-values differ from
their true average value M(x) is measured by a quantity called the standard
deviation (SD) about the regression curve. if SD is small, the y-values adhere
closely to the true regression curve M(x) and one would expect to get good
information about M(x) from the cluster of observed (x,y) values. that is, one
would then expect to be able to fit an m(x) that is close to M(x). so SD also bears
on the extent to which the fitted regression curve m(x) differs from the true M(x).
this is elaborated on a bit more below.
[consequently, SD also reflects the extent to which a predicted value m(x) will be
close to actual y values.]
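as a rough sketch of how SD is estimated from data in the straight-line case: one takes the residuals [observed y minus fitted value], squares and sums them, divides by n - 2 [n points minus the two fitted coefficients] and takes the square root. the data below are invented for illustration, and the divisor is the usual regression convention - check the specs for the exact divisor they want.

```python
import math

# hypothetical calibration data (invented for illustration only)
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [10.2, 20.3, 30.1, 40.4, 50.3]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx              # fitted slope
a = ybar - b * xbar        # fitted intercept

# residuals: how far each observed reading is from the fitted line
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# estimated standard deviation about the line, with n - 2 in the divisor
sd = math.sqrt(sum(r * r for r in resid) / (n - 2))
print(f"estimated sd about the fitted line: {sd:.4f}")
```

for a straight line, excel's STEYX worksheet function should return this same quantity.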
when a straight line is not an appropriate representation for M(x), one tries a
more complex model. a polynomial is a good choice because any smooth curve can be
closely approximated by a polynomial of sufficiently high degree [over a bounded
range of x-values]. one hopes that the degree of the polynomial need not be very
high to get an adequate approximation to M(x).
the specs apparently say to take a polynomial as the model for M(x).
just as in the case of fitting a straight line [a polynomial of degree 1], a
regression routine can be used to fit a polynomial model of the form
($) m(x) = a + b_1 x + b_2 x^2 + ... + b_k x^k
through the observed (x,y) cluster. here x^2 means x-squared, etc, and the degree
of the polynomial is k.
how this is actually done with excel is to use the regression tool in its analysis
toolpak add-in [which comes with excel but which must be activated]. given several
predictor columns, that tool performs what is called multiple regression.
in itself, multiple regression involves more than one explanatory variable [like
x_1 = temperature and x_2 = pressure in a chemical reaction, for which the response
is y = yield].
in general multiple regression, one has k (say) explanatory variables
x_1, x_2, ..., x_k
and a response y, observed on each of n (say) trials. multiple regression assumes
a model of the form
($$) M(x_1, ..., x_k) = A + B_1 x_1 + B_2 x_2 + ... + B_k x_k.
to do polynomial regression, one creates additional predictor variables by taking
x_1 = x, x_2 = x^2, ..., x_k = x^k. then the fitted version of ($$) turns into the
polynomial ($).
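the 'additional predictor variables' are nothing more than extra columns computed from x. a small sketch, with made-up weights and a helper name [power_columns] invented here for illustration:

```python
def power_columns(x, k):
    """for each weight x_i, return the row [x_i, x_i^2, ..., x_i^k]."""
    return [[xi ** j for j in range(1, k + 1)] for xi in x]

# hypothetical known weights in lb (invented for illustration only)
x = [10.0, 20.0, 30.0]

rows = power_columns(x, 3)
for row in rows:
    # one row per calibration point: x, x^2, x^3
    print(row)
```

in a spreadsheet this corresponds to one column holding x, a second holding x^2, and so on, with all of those columns together supplied to the regression routine as the predictors.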
[how does one know what value of k to choose? it may be specified, for example,
in the specs you are using. if not, you've got a more complicated situation to
deal with. i can't go into all of that here - but sometimes trial and error taking
k = 1, 2 or 3 will suffice. the formal statistical methods for determining a 'good'
value of k to use are somewhat complicated - unfortunately - and require rather
more time, space and effort to describe. they are not so easy to implement in excel
either. for that, one is better off using a statistical package such as SAS.]
assuming that you have decided on a value of k to use [sometimes past experience
is a guide - do the specs have anything to say about that?], the multiple
regression routine fits a polynomial m(x) as in ($) through the observed data
cluster. m(x) is then used as a stand-in for M(x) to predict y for any given x.
the regression routine also provides an estimate i'll call sd of the quantity SD
mentioned above.
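here is a sketch of what such a routine computes: the fitted coefficients of ($) and the estimate sd, tried for k = 1, 2 and 3. everything in it is an assumption made for illustration - the data are invented, the helper names [solve, fit_poly, resid_sd] are not from any library, and sd uses the common divisor n - (k + 1), which should be checked against the specs. it solves the so-called normal equations directly, which is adequate for small k and modest x values; real packages use numerically more careful methods.

```python
import math

def solve(A, rhs):
    """solve the square system A z = rhs by gaussian elimination
    with partial pivoting (adequate for the tiny systems used here)."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back-substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def fit_poly(x, y, k):
    """least-squares polynomial of degree k.
    returns [a, b_1, ..., b_k] for m(x) = a + b_1 x + ... + b_k x^k."""
    X = [[xi ** j for j in range(k + 1)] for xi in x]   # columns 1, x, ..., x^k
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k + 1)]
           for i in range(k + 1)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k + 1)]
    return solve(XtX, Xty)                              # normal equations

def resid_sd(x, y, coef):
    """sd about the fitted curve, divisor n - (number of coefficients)."""
    resid = [yi - sum(c * xi ** j for j, c in enumerate(coef))
             for xi, yi in zip(x, y)]
    return math.sqrt(sum(r * r for r in resid) / (len(x) - len(coef)))

# hypothetical calibration data with a slight curvature (invented)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.05, 2.07, 3.12, 4.21, 5.33, 6.48, 7.66, 8.87]

results = {}
for k in (1, 2, 3):
    coef = fit_poly(x, y, k)
    results[k] = (coef, resid_sd(x, y, coef))
    print(f"k={k}: coefficients {[round(c, 4) for c in coef]}, "
          f"sd {results[k][1]:.5f}")
```

comparing the printed sd values across k is the crude trial-and-error approach: a large drop in sd when k is raised suggests the extra term is doing real work, while little or no change suggests it is not needed.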
recall that SD [and thus also sd] measures the extent to which actual y-values
differ from the true regression curve M(x). it also [somewhat less directly] sheds
light on how good m(x) is as an estimate of M(x).
if SD = 0, one always has y = M(x): i.e. the true regression curve [polynomial,
in your case] perfectly predicts the y-readings. then also sd = 0 and the fitted
m(x) = the true M(x).
usually sd > 0, and then one understands that there ARE (random) discrepancies
between m(x) and M(x), as well as between actual y-readings and predictions given
by m(x) [or between y and M(x), were the latter available].
sd is actually converted (by the regression routine) into what is called [an
estimated] standard error (se) that goes along with the fitted m(x) given by ($).
se is different for different values of x and is better denoted by se(x). se(x)
more directly represents the discrepancy between m(x) and the actual M(x):
by converting se(x) to a margin of error me(x) = 2se(x), one can get a 95%
confidence interval for M(x): with 95% confidence, M(x) is in the range
m(x) ± me(x).
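[btw - the multiplier 2 used here to get me(x) from se(x) should actually be 1.96
to have 95% confidence for the interval m(x) ± me(x) when the number of calibration
points is large. one sometimes rounds up to 2 for convenience. with only a handful
of points, the exact multiplier comes from the t-distribution and is somewhat
larger than 2.]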
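for the straight-line case the conversion from sd to se(x) can be written out explicitly. the sketch below uses the standard textbook formula se(x) = sd * sqrt(1/n + (x - xbar)^2 / Sxx), which is not taken from the specs, together with invented data and the rounded multiplier 2 discussed above. with only five points the exact multiplier would be larger, so treat the interval as illustrative.

```python
import math

# hypothetical calibration data (invented for illustration only)
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [10.2, 20.3, 30.1, 40.4, 50.3]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = sxy / sxx
a = ybar - b * xbar

# estimated sd about the fitted line
sd = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

def se(x0):
    """estimated standard error of the fitted value m(x0) = a + b*x0."""
    return sd * math.sqrt(1.0 / n + (x0 - xbar) ** 2 / sxx)

x0 = 15.0
m_x0 = a + b * x0
me = 2 * se(x0)                      # margin of error with the rounded multiplier
lower, upper = m_x0 - me, m_x0 + me
print(f"m({x0}) = {m_x0:.3f}, se = {se(x0):.4f}")
print(f"approximate interval for M({x0}): {lower:.3f} to {upper:.3f}")
```

note that se(x) is smallest at the average of the x-values and grows as x moves away from it, which is why it is written as a function of x.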
that is pretty much what the specs are talking about when referring to polynomial regression. [you can forget about the least-squares part of the name - it just
refers to the method the regression routine uses to compute the fitted regression
coefficients a, b_1, ..., b_k in ($).]