Page of musings

Course as "Open Source"
In the software business, open source products are released under an appropriate license, and everyone is entitled to modify the source code (that's why "open **source**") to suit their needs, often with the condition (as in "copyleft" licenses such as the GPL) that they, in turn, release their work so that others may use it, and, in turn, modify it again.

In such a model, a product is "completed" only when everyone has lost interest in improving it. The result has been impressive in terms of speed of development, innovation, and quick resolution of glitches that inevitably surface from time to time.

Hopefully, something like this, with all due differences, will happen to open courses: they should not be considered a finished product, and, in fact, are released in the hope of being vastly improved and enriched by any and all interested parties. This refers to content, presentation, technical tools, and any other aspect. This desire for openness poses a strong limitation on the use of non-free tools ("free as in speech", not necessarily "free as in beer", to use a common insider phrase of the open source movement), as they would prevent many of the possible modifications that could be made by people everywhere. We don't want to limit their contributions, as that would make us all losers.

Recall that, at least ideally, this is exactly the model which has allowed scientific research to explode over the last five centuries. There is no "copyright" or "patent" on mathematical theorems, and, while due credit to the discoverer is strictly required, everyone is free to use these theorems as tools for their own research.

Limits to this openness have been created by the need to resort to commercial enterprises for the physical dissemination of results, that is, the need for journals and their publishers, but that has been an external necessity: every author tried to send a (free) copy of their work to any researcher who asked for it. With the escalating cost of scientific journals forcing libraries to reduce their subscriptions drastically, there is a push for something like "open source" journal publishing, which is still in its infancy, but has the strength of necessity.

The conclusion to these considerations is that this course will be based as much as possible on free resources, accessible to all without any limitation except as provided by the relevant open licenses.

Computational Technology

**Disclaimer:** In the following I will refer to open source tools available, to my knowledge, both on Linux and Microsoft Windows platforms. Even though Macintosh OS X systems are Unix based (specifically, on the very open BSD platform), they have strict proprietary closed aspects that may limit the availability of open source tools. Also, I don't have personal experience and familiarity with OS X, and cannot provide informed comments on this.

There is no doubt that a statistics class requires fairly extensive computations, both by the instructor and by the students. Doing them by hand (even with the help of calculators) is not realistic: it automatically restricts the problems to very small data sets, if for no other reason than that we can't expect people to handle a lot of data entry, which in any case would almost certainly produce a host of errors. While this limitation is of relative importance when presenting examples, it is crippling when it comes to realistic application of statistical tools. Small data sets require strong assumptions on the underlying distribution to allow for sensible use of these tools, as the application of the usual limit theorems is at best debatable.

One way out, as experienced by this author in previous real classes, is to provide well formatted tables of data (for easy reading off of quantiles), and pre-calculated convenient summaries (like the sum of the data and the sum of squares of the data). This is sufficient in all problems involving Gaussian tools, which, in any case, will include most problems treated in this course.
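To illustrate why such summaries are enough, here is a minimal sketch, in Python, of how the sample mean and variance can be recovered from just the number of observations, the sum of the data, and the sum of their squares. The function name `summaries_to_stats` and the small data set are made up for illustration; they are not part of the course material.

```python
import math

def summaries_to_stats(n, total, total_sq):
    """Recover mean, sample variance and standard deviation
    from n, sum(x) and sum(x^2) alone (hypothetical helper)."""
    mean = total / n
    # Sample variance with the n - 1 divisor:
    # s^2 = (sum(x^2) - n * mean^2) / (n - 1)
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return mean, variance, math.sqrt(variance)

# Made-up data, used only to produce the two summaries a student would be given.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
total = sum(data)                    # sum of the data
total_sq = sum(x * x for x in data)  # sum of squares of the data

mean, variance, sd = summaries_to_stats(n, total, total_sq)
print(mean, variance, sd)
```

With real data whose mean is large compared to its spread, this shortcut formula can lose precision in floating point arithmetic, so it is best seen as a hand-calculation aid rather than a recommended numerical method.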

Still, it would be preferable to pose problems providing a file of data (or have students search and find the data) and let each student work out everything that's needed out of this file. Hence, it would be vastly preferable if each student had access to a computer equipped with adequate spreadsheet software. While the ideal situation would be for students to have their own computer, this may not always be possible, in which case the school should be able to provide enough computer access to allow students to work adequately. Fortunately, all the software tools that are needed are available for free, and are of high quality. Assuming we can actually produce a live media version of the course (as discussed later), it would not even be necessary to worry about computers not having adequate software installed, as long as they were capable of booting off CD or USB disks.

While any modern spreadsheet will have built-in statistical functions, the spreadsheet I will refer to for this class is //Gnumeric//, http://www.gnumeric.org. This spreadsheet is open source, free to use under a GPL license, and in this sense, fits the context of our project. Moreover, it includes many more built-in tools for probability and statistics than other well-known spreadsheets. In this sense, it combines the best of both aspects. It is available for Linux and Windows.

An underlying assumption here is that the most important outcome of an introductory statistics class is not a given amount of technical expertise in statistical calculations, since these are by now, in most working environments, left to dedicated software, be it a specialized program (like S-Plus, R, SPSS, and so on) or a pre-arranged spreadsheet. In fact, even the manual calculations that are required when cut off from such programs are of a pretty elementary nature, as long as we are limiting ourselves to standard tools. Acquiring basic skills at these levels is not a daunting task, and is often done "on the job", precisely because of its limited difficulty.

What is really important is, in my humble opinion, learning what statistical results really mean, and, even more, **what they do not mean**. While a full discussion of this is completely out of the scope of an introductory course, a good start is to address the scope and the limitations of the tools that we will actually introduce.