![]() Statisticians and machine learners work on similar problems, albeit sometimes with a different aesthetic and perhaps different (but overlapping) skill sets. In what follows, the committee does not make a sharp distinction between “statistics” and “machine learning” and believes that any attempt to do so is becoming increasingly difficult. Statistical modeling represents a powerful approach for understanding and analyzing data (see McCullagh, 2002). In this approach, one can only guess a certain form of the relationship up to some unknown parameters, and the error-or what is missed in this formulation-will be regarded as noise. This approach to data modeling can be regarded as statistical modeling: although there are no precise formulas that can deterministically describe the relationship among observed variables, the distribution underlying the data can be characterized. The number of visits to the website can be constrained or predicted by these additional quantities, and their relationship will lead to a better model for the variable. A better model for this random variable might take into account other observable quantities such as the day of the week, the month of the year, whether the date is near some major event, and so on. These quantities characterize long-term trends of this random variable and, thus, put constraints on its potential values. In order to “model”-or characterize the distribution of-this random variable, statistical quantities (or parameters) might be considered, such as the average number of visits over time, the corresponding variance, and so on. ![]() ![]() For example, the number of people visiting a particular website on a given day is random. Good model building entails both specifying a model rich enough to embody structure that might be of use to the analyst and using a parameter estimation technique that can extract this structure while ignoring noise.ĭata-analytic models are rarely purely deterministic-they typically include a component that allows for unexplained variation or “noise.” This noise is usually specified in terms of random variables, that is, vari-Ībles whose values are not known but are generated from some probability distribution. For example, some measurements might always be positive or take on values from a discrete set. For a model to be realistic and hence more useful, it will typically be constrained to honor known or assumed properties of the data. These model parameters are typically regarded as unknown, so that they need to be estimated from the data. Both of these components can be specified in terms of some unknown model parameters. Typically these equations describe (conditional) probability distributions, which can often be separated into a systematic component and a noise component. Statistical models are usually presented as a family of equations (mathematical formulas) that describe how some or all aspects of the data might have been generated. They also allow one to make predictions and assess their uncertainty. Models make it possible to identify relationships between variables and to understand how variables, working on their own and together, influence an overall system. Statistical models provide a convenient framework for achieving this. The general goal of data analysis is to acquire knowledge from data. ![]()
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |