<p>Introduction</p>
<p>There are many theories about which month’s SAT curve is the “easiest.” One of the most common is that the May test has the kindest curve because that is when many procrastinators who have put off the test finally take it, supposedly dragging down the curve. However, this theory has never been proven. In fact, there is little evidence that any sitting has an easier (or, for that matter, harsher) curve.</p>
<p>Interestingly, those who believe that certain curves will be easier forget an important point: the test is curved to account for difficulty. Kinder curves are made for harder tests, while harsher curves accompany easier tests. This is done to make each sitting essentially equivalent to every other. Nonetheless, it is still worth asking the question: are any SAT curves easier?</p>
<p>This Mini-Study</p>
<p>This mini-study seeks to address the question of SAT curve difficulty. Thanks to data gathered by erikthered, it is possible to view the curves of 20 past SATs. This mini-study uses statistics to analyze whether any of the curves is actually easier (or harsher) than the others.</p>
<p>Throughout this mini-study I employ a bit of statistical jargon. If you haven’t taken stats, don’t fret; just skip down to the conclusion where I explain the results in plain English.</p>
<p>WARNING!!</p>
<p>I must warn against drawing any definitive conclusions!!! I do draw my own personal conclusions at the end, but I do not guarantee them to be 100% correct. The data is likely insufficient to warrant adjusting one’s testing schedule. This mini-study was done purely out of curiosity; I do not believe it is prudent to use it to decide which test to take in hopes of scoring an easier curve. As I said earlier, the tests are curved to adjust for difficulty. Also, there is the chance that I made a mistake somewhere along the line.</p>
<p>Methodology</p>
<p>The data comes from erikthered’s PDF, which lists 20 past curves. Seven were from October, six from May, six from January, and one from March. Because there was insufficient data for March, only May, January, and October were compared. The curves were also compared by section, so there were nine month/section combinations in total.</p>
<p>To measure the difficulty of a curve, I first took the average of the scaled scores at the top 15 raw scores for each sitting (similar to what erikthered did). A higher average indicates a kinder curve (and a harder test), while a lower average indicates a harsher curve (and an easier test). Each month’s data was then pooled and averaged. Two-tailed hypothesis testing was used to compare the three months. Positive t-scores indicate a kinder curve, while negative t-scores indicate a harsher curve. Two-tailed alpha was set to 0.1, so the critical cutoff was roughly 1.943 standard deviations.</p>
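<p>For those who want to see the mechanics, below is a minimal Python sketch of this first test. The curve averages are invented for illustration (they are NOT erikthered’s actual data), and the exact comparison may differ slightly from what I did in my spreadsheet, but the idea is the same: pool each month’s per-sitting averages and t-test them against the overall mean.</p>
[code]
# A minimal sketch of the first test, using made-up curve data.
# curve_avgs[month] = average scaled score over the top 15 raw scores,
# one entry per sitting (values here are invented for illustration).
from scipy import stats

curve_avgs = {
    "October": [706.0, 698.7, 701.3, 704.0, 699.3, 702.7, 700.0],  # 7 sittings
    "May":     [700.7, 703.3, 697.3, 701.3, 699.3, 702.0],          # 6 sittings
    "January": [698.0, 700.7, 702.7, 696.7, 701.3, 699.3],          # 6 sittings
}

# Overall mean across every sitting, regardless of month
all_avgs = [a for month in curve_avgs.values() for a in month]
overall_mean = sum(all_avgs) / len(all_avgs)

ALPHA = 0.1
for month, avgs in curve_avgs.items():
    # Two-tailed one-sample t-test: is this month's mean different from
    # the overall mean? With 7 sittings (df = 6), the critical t-value
    # is about 1.943, matching the cutoff above.
    t_score, p_value = stats.ttest_1samp(avgs, overall_mean)
    verdict = "significant" if p_value < ALPHA else "not significant"
    print(f"{month:8s} t = {t_score:+.3f}, p = {p_value:.3f} ({verdict})")
[/code]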
<hr>
<p>Statistics Box</p>
<p>A common theme that runs through this mini-study is the z-score/t-score. Simply put, the z-score is the number of standard deviations that a data point is from the average. Standard deviation is a measure of how widely spread the data is; roughly, it is the typical distance of a data point from the center.</p>
<p>Another number that is used is the p-value. The p-value is the probability of getting a result at least as extreme as the one observed purely from natural variability. If a p-value is very high (for example, 50%), that indicates a 50% chance that an effect this large could arise from natural variability alone.</p>
<p>Alpha is the maximum acceptable p-value for the result to be considered “statistically significant.” Significance indicates that the result is likely not due to chance and that there is something going on. Statistically significant should not be confused with the common definition of significant, which normally means “a lot.” Results (as you shall soon see) can be statistically significant without actually being significant.</p>
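<p>If a concrete example helps, here is a tiny Python demo of these terms. The numbers are completely made up and have nothing to do with the SAT data:</p>
[code]
# Illustration of z-score, two-tailed p-value, and alpha (toy numbers).
from statistics import mean, stdev
from scipy.stats import norm

data = [10, 12, 9, 11, 13, 10, 12, 11]   # made-up sample
x = 14                                    # the value being tested

z = (x - mean(data)) / stdev(data)        # standard deviations from the mean
p_two_tailed = 2 * (1 - norm.cdf(abs(z))) # area in both tails

ALPHA = 0.1
print(f"z = {z:.3f}, p = {p_two_tailed:.3f}")
print("statistically significant" if p_two_tailed < ALPHA
      else "not statistically significant")
[/code]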
<p>The second test broke each curve down into its top 15 scaled/raw score pairs. Each of the 15 highest scaled scores for each sitting was compared to the average scaled score, across all the curves, at the identical raw score. Thus, each SAT sitting generated 15 data points giving how far above or below average its scaled score was for each raw score. To be honest, I’m not sure if this is allowed in stats, but I went ahead anyways!</p>
<p>Once again, I performed a two-tailed hypothesis test with alpha = 0.1. Because of the large sample size, 1.645 became the critical cutoff. The resulting z-score signified how many standard deviations the curve’s scaled score was from the average scaled score for the corresponding raw score. For each month there were between 90 and 105 data points, depending on the month.</p>
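<p>Here is a rough Python sketch of this second test, again with invented scaled scores rather than the real curve data. It computes the per-raw-score average across all curves, collects each month’s deviations from that average, and z-tests them against zero:</p>
[code]
# Sketch of the second ("broken") test with invented numbers.
import math
from statistics import mean, stdev

# curves[month] = list of sittings; each sitting is its 15 highest
# scaled scores, indexed by raw-score position (values invented).
curves = {
    "October": [
        [800, 780, 770, 760, 750, 740, 730, 720, 710, 700, 690, 680, 670, 660, 650],
        [800, 790, 770, 750, 740, 730, 720, 710, 700, 690, 690, 680, 670, 660, 650],
    ],
    "May": [
        [790, 770, 760, 750, 740, 730, 720, 710, 700, 690, 680, 670, 660, 650, 640],
        [800, 780, 760, 750, 730, 720, 710, 700, 700, 690, 680, 670, 660, 650, 640],
    ],
}

# Average scaled score at each raw-score position across ALL curves
all_sittings = [s for month in curves.values() for s in month]
position_avg = [mean(s[i] for s in all_sittings) for i in range(15)]

CUTOFF = 1.645  # two-tailed critical z at alpha = 0.1, large sample
for month, sittings in curves.items():
    # 15 deviations per sitting: how far each scaled score sits above
    # or below the all-curve average at the same raw score
    devs = [s[i] - position_avg[i] for s in sittings for i in range(15)]
    z = mean(devs) / (stdev(devs) / math.sqrt(len(devs)))
    verdict = "significant" if abs(z) > CUTOFF else "not significant"
    print(f"{month:8s} z = {z:+.3f} ({verdict})")
[/code]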
<p>Results</p>
<p>The table below shows the results. Positive z-scores indicate a kinder curve, while negative z-scores indicate a harsher curve. The absolute value of a z-score indicates the significance of the result, with higher absolute values indicating greater significance. I also averaged the three subjects to get each month’s overall z-score. The top half of the table is for the first test, where each curve was compared as one unit; the bottom half is for the second test, where the curves were broken down into smaller parts. This table does NOT show how much the curves actually were above or below average, only the z-scores.</p>
<table>
<tr><th>Full</th><th>January</th><th>May</th><th>October</th></tr>
<tr><td>Math</td><td>0.2685</td><td>0.710</td><td>-1.115</td></tr>
<tr><td>Crit. Read</td><td>-1.025</td><td>-0.812</td><td>1.788</td></tr>
<tr><td>Writing</td><td>-0.928</td><td>-0.148</td><td>0.352</td></tr>
<tr><td>AVG</td><td>-0.561</td><td>-0.083</td><td>0.342</td></tr>
<tr><th>Broken</th><th>January</th><th>May</th><th>October</th></tr>
<tr><td>Math</td><td>1.351</td><td>2.770</td><td>-3.816</td></tr>
<tr><td>Crit. Read</td><td>-3.511</td><td>-2.803</td><td>5.845</td></tr>
<tr><td>Writing</td><td>-2.648</td><td>0.269</td><td>2.202</td></tr>
<tr><td>AVG</td><td>-1.603</td><td>0.0787</td><td>1.411</td></tr>
</table>
<p>The following table replaces all the numbers with not significant, significant high, and significant low.</p>
<table>
<tr><th>Full</th><th>January</th><th>May</th><th>October</th></tr>
<tr><td>Math</td><td>Not Sig.</td><td>Not Sig.</td><td>Not Sig.</td></tr>
<tr><td>Crit. Read</td><td>Not Sig.</td><td>Not Sig.</td><td>Not Sig.</td></tr>
<tr><td>Writing</td><td>Not Sig.</td><td>Not Sig.</td><td>Not Sig.</td></tr>
<tr><th>Broken</th><th>January</th><th>May</th><th>October</th></tr>
<tr><td>Math</td><td>Not Sig.</td><td>High</td><td>Low</td></tr>
<tr><td>Crit. Read</td><td>Low</td><td>Low</td><td>High</td></tr>
<tr><td>Writing</td><td>Low</td><td>Not Sig.</td><td>High</td></tr>
</table>
<p>Discussion</p>
<p>As you can see from the tables, the results are rather contradictory. Although the curve was found to be significantly harsher or kinder for several month/section combinations, in no instance did both tests agree that a curve was actually different. Because the two tests never both indicated significance for the same combination, any “significance” must be taken with caution.</p>
<p>When the curves were taken as a whole (all three sections), none of the months was significantly different from the average. Averaged across all three sections, no month was more than 0.56 standard deviations from the average in the first test, which corresponds to a high probability (close to 29%) that the result was due to natural variability. Remember, that chance must fall below 10% for the result to be considered significant.</p>
<p>For the broken tests, the results were a bit different. Several month/section combos did display enough significance to be considered different from the mean. Some even reached up to 5.8-sigma confidence (enough to prove the existence of a new particle). However, although the difference was statistically significant for several month/section combos, the actual difference was minuscule. At most, the October critical reading curve was above average (indicating an easier curve) by a mere 5.09 points, and differences of just 2 or 3 points were much more common. The likely reason the results were so statistically significant is that the way the tests were performed created roughly a hundred data points per month, considerably shrinking the standard error (the standard deviation of the average).</p>
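<p>A quick back-of-the-envelope calculation in Python shows why pooling so many points inflates significance. The effect size and spread below are invented, but the square-root relationship is the point:</p>
[code]
# The z-score grows with sqrt(n): a tiny effect clears the 1.645
# cutoff once ~100 data points are pooled. Numbers are illustrative.
import math

effect = 3.0    # average points above/below the all-curve mean
spread = 12.0   # standard deviation of the individual deviations

for n in (6, 15, 100):
    standard_error = spread / math.sqrt(n)  # shrinks as n grows
    z = effect / standard_error
    print(f"n = {n:3d}: standard error = {standard_error:5.2f}, z = {z:.2f}")
# At n = 100 the z-score easily exceeds 1.645 even though the actual
# effect is only 3 points.
[/code]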
<p>When the broken tests were averaged out, none of the months was statistically significantly different from the others. The one that came closest was the January sitting. (The reason it is not significant at alpha = 0.1 is that a two-tailed test was used; had it been one-tailed, the p-value of 0.0545 would have been considered significant at alpha = 0.1.)</p>
<p>Conclusion</p>
<p>The primary conclusions that can be drawn are as follows. Keep in mind that these are merely my conclusions and are not guaranteed to be correct:</p>
<p>[ul]
[li]In no case did both statistical tests indicate significance for any scenario. This suggests that there may be no month/section combination that is sure to be significantly different from the average curve.[/li][li]In the broken test, many of the section/month combos actually were significant. However,** the effect was very small; typically less than 5 points with differences of 2-3 being more common.[/li][li]Variation was enormous.** Although this was not shown, the variance in the broken tests was quite large with differences between the average scaled score for a given raw score and the actual scaled score being as much as 26 points. [/li][li]Thus, even if you were to try and use these results to pick an appropriate date, you could get unlucky and very likely swing the opposite direction.[/ul]</p>[/li]
<p>To put it sweetly and simply for those of you who are getting tired of slogging through AP stats jargon:</p>
<p>No month/section combo was significantly different from the average in both statistical tests. In those that were significant in one test, the effect was so small that it would not warrant changing your test date.</p>
<p>Once again, keep in mind that although the curves may be (very slightly) different from month to month, the tests are created so that the difficulty of the test and the curve cancel out, making each sitting roughly equivalent.</p>
<p>Some of you may have noticed something that I did incorrectly. Because each month represented roughly one third of the data, I violated the 10% condition, which says that no more than 10% of the population may be sampled. However, the reason for this condition is that sampling above 10% causes the sample to resemble the population rather than a normal distribution. Because the population was roughly normal to begin with, I did not think this would be a major issue.</p>
<p>If there is anything that I did incorrectly throughout my mini-study, please tell me and I will try to fix it. I typed this up in a single day, so I didn’t spend a lot of time developing a perfect method.</p>
<p>Finally, thanks to erikthered/fignewton for posting all of the past SAT curves. This mini-study would not have been possible without you.</p>
<hr>
<p>Questions? Comments? Concerns? Compliments? Rants? Rages? Post away!</p>