<p>^Xiggi, I don’t see how those two statements are contradictory: the comparison is made not between two students taking the same test, but between a group of students taking a new test and a group that took a previous test. I describe the scaling procedure below as clearly as I can; it is consistent with <em>both</em> statements. Regardless, scaling doesn’t work by parsing a sentence on a web site; it is a logical, mathematical procedure, well described in the PDF I linked to above and summarized below. I’m not a language person, so let’s not debate semantics. If you have a procedure in mind that makes mathematical sense, and that is supported by both the CB web site and the technical docs, please describe it in detail.</p>
<p>A standardized test simply doesn’t work unless you have a basis of comparison that lets you separate two effects: a) variations in test difficulty, and b) variations in the ability of the test takers. Imagine a case where two SATs happened to be identical in difficulty but the first group of test takers was much stronger than the second. The distribution of raw scores would go down from the first test to the second. Now imagine a case where the first SAT was easier than the second, but the two groups of test takers were of the same ability level. Again the raw scores would go down. In the first case, the scaled scores should on average be lower for the second group: they are weaker students, after all. The curve should therefore stay the same for both tests. In the second case, the scaled scores should on average be the same: the two groups of students have the same ability, and it would be unfair not to help the second group with a more generous curve.</p>
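<p>To make the ambiguity concrete, here is a minimal simulation sketch. It is my own, not the CB’s; the item-response model and every number in it are made up purely for illustration. The point is that raw scores drop in both cases, which is exactly why raw scores alone can’t tell a weaker group apart from a harder test:</p>
<pre><code># Python sketch: illustrative model and numbers only, not the CB's method
import numpy as np

rng = np.random.default_rng(0)
N_STUDENTS, N_ITEMS = 10000, 67

def raw_scores(ability, difficulty):
    # Crude model: the chance of a correct answer falls as item
    # difficulty rises relative to student ability.
    p = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return rng.binomial(N_ITEMS, p)

strong = rng.normal(0.0, 1.0, N_STUDENTS)   # first group
weak   = rng.normal(-0.3, 1.0, N_STUDENTS)  # weaker second group

# Case a: identical tests, weaker second group -> raw mean drops
print(raw_scores(strong, 0.0).mean(), raw_scores(weak, 0.0).mean())

# Case b: equal-ability groups, harder second test -> raw mean also drops
print(raw_scores(strong, 0.0).mean(), raw_scores(strong, 0.3).mean())
</code></pre>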
<p>In general both the test difficulty and the test takers’ ability level can vary, and neither is known before the test date. How do you determine ability level separately from test difficulty? By comparing the new bunch of test takers with a previous bunch that had the same equating sections. The difference between the two distributions of raw scores on these sections (which contain identical questions) tells the CB “OK, this new bunch of students is lower in ability than the previous bunch, since they didn’t do as well on these identical questions.” Or it could say, “OK, this new bunch is right in line in ability with the previous bunch, because they did just as well on these repeated questions.” Or it could say, “OK, this new bunch is higher in ability than the previous bunch, since they did better on these identical questions.”</p>
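<p>That first step can be sketched in a few lines. Again this is my own toy version with invented numbers; the real equating uses far more sophisticated statistics, but the idea of reading the ability shift off the shared questions is the same:</p>
<pre><code># Python sketch: hypothetical anchor-section raw scores, illustration only
import numpy as np

def ability_shift(anchor_old, anchor_new):
    # Both groups answered these identical equating questions, so any
    # difference in performance reflects ability, not test difficulty.
    # Negative result: the new group is weaker. Positive: stronger.
    return np.mean(anchor_new) - np.mean(anchor_old)

old_group = np.array([22, 25, 19, 28, 24])  # made-up raw scores
new_group = np.array([20, 23, 18, 25, 22])
print(ability_shift(old_group, new_group))  # negative -> new group weaker
</code></pre>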
<p>At this point the CB can correct for these varying ability levels before determining whether the test was easier, harder, or just right. For example, in the case of two groups with the same ability, the two distributions of raw scores on the scored sections can be compared directly. If the second set of scores is lower on average than the first, it is because the new test was a little harder than before, and the curve should be more generous. In the case of the second group being lower in ability, the second distribution of raw scores on the scored sections would have to be shifted up (simplifying nasty math here) before a comparison to the scores of the first group can be made. If the two distributions are then the same, the test difficulty was just right and the curve should be the same as before.</p>
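<p>Putting the two steps together, a crude mean-shift version of the whole correction might look like the sketch below (again mine, with the nasty math simplified away; the CB’s actual procedure is more elaborate). Remove the ability difference measured on the equating sections, and whatever gap remains on the scored sections is attributed to test difficulty and absorbed into the curve:</p>
<pre><code># Python sketch: mean-shift equating, a simplification of the real method
import numpy as np

def curve_adjustment(scored_old, scored_new, anchor_old, anchor_new):
    # 1) Ability difference, read off the identical equating questions.
    shift = np.mean(anchor_new) - np.mean(anchor_old)
    # 2) Remove it from the new group's scored-section average.
    adjusted_new = np.mean(scored_new) - shift
    # 3) Any remaining gap is test difficulty: a positive value means the
    #    new test was harder, so the curve gives back that many raw points.
    return np.mean(scored_old) - adjusted_new

# Made-up raw scores, illustration only
scored_old = np.array([50, 44, 58, 39, 47])
scored_new = np.array([46, 41, 55, 36, 44])
anchor_old = np.array([22, 25, 19, 28, 24])
anchor_new = np.array([22, 24, 19, 28, 25])  # same ability on the anchors
print(curve_adjustment(scored_old, scored_new, anchor_old, anchor_new))
</code></pre>
<p>In that example the anchor scores match, so the whole drop on the scored sections is read as a harder test, and the curve becomes more generous by about that many raw points. If instead the anchors had dropped by the same amount as the scored sections, the adjustment would come out near zero and the curve would stay put.</p>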
<p>All of this leads to several effects: 1) curves aren’t known ahead of time and take after-the-fact number crunching to determine; 2) the curves will vary from one test to another, and the variation in any particular month is random; 3) percentiles are not fixed and the final score distribution is not adjusted to fit a particular bell curve (the % of people scoring 750 and above, e.g., will vary from one test to another, as will the average score); 4) your score does not depend on how well others did on the test.</p>