What your mother never told you about linear regression

If you think regression modeling is beyond you, or if you struggled with algebra in high school, then this article is for you. Of course, it won't hurt everyone else either.

Imagine that you have been given a database containing the age and income of every resident of a certain area. Your boss wants you to use this data to build a model that predicts a person’s income from their age. So you urgently call for statistical help from a certain Doctor Ivanov in Information Systems. Luck is on your side: the doctor is available. Doc Ivanov wisely checks that the data contain no anomalous values that could distort the analysis. Then he works his magic on the data and duly presents you with a mathematical model: “Multiply the age in years by 971.4, add 1536.2, and you get the annual income in dollars. Here is your optimal model.”

You are very grateful to Dr. Ivanov and hurry to prepare a report for your boss. You use the formula to draw a graph with income on the vertical axis and age on the horizontal, and you admire the simplicity with which this rule relates age to income. It is a straight line and, moreover, an optimal one. But its shine fades a little when you notice that, according to this model, the income of 18-year-olds is $19,021 (these young people should have been doing their homework, not earning such sums!). And it disappears completely when you see that the estimated income of 70-year-olds is $69,534, and that each additional year of life brings an automatic raise of $971 (which is unlikely to come from increases in the state pension).
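The arithmetic behind those figures is easy to check. Here is a minimal sketch in Python; the function name predicted_income is our own label for Doc's rule, not something taken from his program.

```python
def predicted_income(age_years):
    """Doc Ivanov's straight-line rule: income = 971.4 * age + 1536.2."""
    return 971.4 * age_years + 1536.2

# Estimated annual income in dollars at a few ages, rounded to whole dollars.
for age in (18, 70, 71):
    print(age, round(predicted_income(age)))
```

Whatever the age, moving one year forward always adds the same 971.4 dollars, which is exactly what makes the predictions for the very young and the very old look odd.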

So why does Dr. Ivanov’s formula look suspicious? Because it is bad. But how can a model be bad when it is “optimal”? It is optimal only if Ivanov made the correct assumption about its form. He assumed that the correct form of the model is a straight line. The computer did its part of the work, choosing the best-fitting straight line from all possible ones by means of a highly respected technique created by Carl Friedrich Gauss (1777-1855).

Catch-22

If it seems to you that there is a Catch-22 here, then you are right. If you had known the correct form of the model from the very beginning, you would not have needed Doctor Ivanov. Doc did not know which form was correct either, so, being busy, he chose the simplest one and assumed that it was a straight line. The straight-line equation looked scientific, at least at that moment, but in fact it was not. Straight lines often reflect credible physical laws in science and engineering, but there is no reason to believe that they apply to economic situations. The algebraic formula is indeed simple and convenient, but who needs a simple description of a bad model?

Did the combined forces of mathematics and a Pentium processor extract exactly what was needed from the data? Not at all. What Doc did happens all too often, because there is always the temptation to apply, without thinking, a widely used tool called linear regression.

Linear regression

The formula that Doc gave you multiplies age by 971.4 and adds 1536.2 to the result. He obtained 971.4 and 1536.2 from a linear regression program that performed all the laborious calculations needed to find these numbers. These numbers define the specific straight line that fits the original data most closely.
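In the standard case, those laborious calculations are the method of least squares: pick the slope and intercept that minimize the sum of squared differences between actual and predicted income. The article does not say which program Doc used, so what follows is only a sketch of the textbook closed-form solution for a single predictor. The function name fit_line and the six data points are invented for illustration and are not real data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical ages and incomes that rise and then fall (assumed, not real).
ages = [20, 30, 40, 50, 60, 70]
incomes = [20000, 45000, 60000, 62000, 50000, 30000]

slope, intercept = fit_line(ages, incomes)
# Residual = actual income minus the income predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(ages, incomes)]
print(round(slope, 2), round(intercept, 2))
print([round(r) for r in residuals])
```

With these made-up numbers the residuals are negative at both ends of the age range and positive in the middle. The line is the best of all straight lines, yet the systematic pattern in its errors suggests that a straight line is the wrong form for this data.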

Linear regression is a mathematical method for estimating a quantitative value (for example, a dollar amount) by “weighting” one or more predictors, such as age, number of children, average bowling score, and so on. It was developed long before digital computers, and its enduring fame owes much to its appeal in academic research.

If we assume that linear regression was the only modeling tool in Doc's arsenal, then we can see how his makeshift model came into being. Such a tool assumes that a straight line is the correct form of the relationship between each predictor and the quantity being estimated. Suppose that, in addition to age, your data included “number of children” as a predictor of income. Entering both predictors into the regression gives a formula of the form:

Income = 1007.8 * Age - 752.35 * Number of children + 933.6

The asterisk is the multiplication sign. The effect of our new variable, “number of children”, is also linear: the estimated income falls by a flat $752.35 for each additional child. We will use this formula, which relates age and number of children to income, to illustrate what is important to know about the numbers that linear regression produces.
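To make the linear effect of each predictor concrete, the formula can be written as a small Python function. The function name and the example age of 40 are our own illustration; only the three coefficients come from the formula above.

```python
def estimated_income(age_years, n_children):
    """Two-predictor rule: 1007.8 * age - 752.35 * children + 933.6."""
    return 1007.8 * age_years - 752.35 * n_children + 933.6

# For a fixed age, each extra child lowers the estimate by the same amount.
for children in range(4):
    print(children, round(estimated_income(40, children), 2))
```

The step from one child to two is the same as the step from none to one, whatever the person's age. That is what "linear" means here.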

1) Quite often, it is incorrectly assumed that 1007.8 is the "weight" of age, and -752.35 is the "weight" of the number of children. If age were expressed in months rather than years, ...

Eye trackers follow the respondent's visual response to a product or its placement in a store. In advertising research, galvanic skin response is monitored. Research into emotions is applied more and more in practice. At GfK, this method is called Emoscan.

The last thing I would like to address is how data are presented to the client. Today, in most cases we deliver research results to the client in electronic form. But electronic presentation of data means more than emailing a report. Clients require automated reporting with constant access through Internet portals. Our company has various kinds of portals, which are constantly being improved. They must be accessible 24 hours a day, 7 days a week. Clients attach great importance to data visualization: presentation design, infographics, drawings, and video. Like all consumers, our clients want a user-friendly product.

What conclusions can be drawn from the current trends in marketing research technologies? Many analysts believe that what is happening today in marketing research is not an evolution but a revolution. It is associated above all with the rapid development of digital technology, but not only with that. Attitudes toward understanding consumers are changing. We are increasingly moving from individual studies to integrated problem solving. For the client, the marketing researcher is becoming not just a supplier of information but a partner and consultant. Methods of data collection and analysis are changing in revolutionary ways. Interaction with respondents is changing too: from mere respondents, they are increasingly becoming participants in the study.

What follows from this? My recommendations are simple: master digital technologies, understand the client’s tasks, and understand the consumer better. Clearly, a lot of work lies behind these simple recommendations.