# Big Data, Complexity and Scientific Method

*Published in Aspenia 63, “Where East meets West”, Aspen Institute Italia, Rome, December 2013. (I thank Maurizio Paglia for his help with the English translation.)*

Nowadays collecting an enormous quantity of data is technically quite feasible. Still, the processing of “big data” cannot by itself improve our ability to forecast natural or social phenomena. In fact, even when the underlying dynamical laws are known, it remains difficult to predict the evolution of systems governed by forces that often give rise to chaotic behavior.

Thanks to the evolution of information technology and the internet, it is now possible to assemble great quantities of data, a situation that seemed unthinkable only a few years ago. The quantity of data stored in digital form is growing exponentially. This raises a new series of problems, from those relating to personal privacy to the quality of the information that can be obtained from data banks. While there are applications in which data really do represent a fundamental innovation (see the development of automatic translators), many ask whether these data by themselves, that is, without a theoretical reference model, can be sufficient to unravel natural or social phenomena, and whether this new situation implies an “end of the theory” of sorts. In fact, there are intrinsic limits to the possibility of extracting information from a large quantity of data. To make our point, let’s start from a “historical” example of how a natural phenomenon was successfully forecast without a theoretical reference model.

THE CHAOTIC BEHAVIOR OF PLANETS

Every civilization we know of has developed some astronomical knowledge: the cycles of the sun and moon were singled out because knowing them meant knowing when to sow and when to reap. Among such civilizations the Maya stand out. They did not formulate a physical model to explain the movement of the celestial bodies, yet they developed very accurate forecasts thanks to astronomical observations stretching over hundreds of years. In particular, they learned how to forecast not only lunar eclipses but also solar eclipses, which are much more difficult to predict. By accumulating the data from their observations, the Maya understood the diverse and subtle periodicities of the planets’ movements without having a physical reference model. This was possible because the physical problem is well posed. Today we know not only that deterministic laws rule the planets’ motion (the law of gravity), but also that the solar system shows chaotic behavior only on time scales much longer than those relevant to human forecasts.

The concept of chaos underlies both the possibility of making accurate forecasts and that of finding periodicities or recurring events in the dynamical evolution of a system. To clarify: in simple terms, the finite precision with which we know the state of the solar system today (the positions of the planets, the moon, etc.) will cause a significant error in the forecast of the planets’ positions a few thousand years from now. This might sound surprising: we would expect that, knowing the laws governing the dynamics of a system, determining the position of a body only requires solving the equations of motion (which are also known) and calculating the various physical quantities starting from the initial conditions, i.e. position and velocity at a given moment in time.

Actually the situation is not so simple, since a system of “many bodies” interacting in a non-linear way rapidly becomes chaotic: a slight variation in the initial conditions produces a great change in position and velocity once the system has evolved over a sufficiently long period of time. For systems interacting through gravity, three bodies (such as the earth, moon and sun) are already enough to give rise to chaotic behavior.

INITIAL CONDITIONS AND THE BUTTERFLY EFFECT

The main feature of chaos was precisely summarized by the French mathematician Henri Poincaré: “Even if it happened that nature’s laws held no more secrets for us, it would still be the case that we would get to know the initial situation only by approximation […] it can happen that small differences in the initial conditions produce huge ones in the final phenomena. A slight error in the former produces an enormous error in the latter. Forecast becomes impossible and we have a fortuitous phenomenon”. The “butterfly effect” (the flutter of a butterfly in Brazil can cause a tornado in Florida) refers to this sensitive dependence on initial conditions. In other words, a small uncertainty in the state of a system at a given time (the butterfly flaps or does not flap its wings) grows exponentially over time, producing a high or very high degree of uncertainty in the future state of the system. Even under ideal conditions, that is, in the absence of external disturbances and with an exact physical model (known deterministic laws governing the dynamical evolution), the error inevitably present in our knowledge of the system’s initial state is destined to grow over time because of the chaos that characterizes most non-linear systems.
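The exponential growth of a small initial uncertainty can be illustrated numerically. The sketch below is not taken from the article: it assumes the logistic map x → 4x(1 − x), a standard textbook example of a chaotic system, as a stand-in for a real physical model, and the function names, the starting value 0.3 and the tolerance 0.1 are illustrative choices. It counts how many iterations two trajectories that start almost together need before they differ by more than the tolerance.

```python
def logistic(x, r=4.0):
    """One step of the logistic map, chaotic for r = 4 (assumed toy model)."""
    return r * x * (1.0 - x)


def steps_until_divergence(x0, delta0, tolerance=0.1, max_steps=1000):
    """Number of steps before two trajectories started at x0 and x0 + delta0
    differ by more than `tolerance`; None if that never happens."""
    a, b = x0, x0 + delta0
    for step in range(1, max_steps + 1):
        a, b = logistic(a), logistic(b)
        if abs(a - b) > tolerance:
            return step
    return None


# Same system, same starting point, two different measurement precisions.
horizon_coarse = steps_until_divergence(0.3, 1e-4)
horizon_fine = steps_until_divergence(0.3, 1e-10)
print("initial error 1e-4  ->", horizon_coarse, "steps")
print("initial error 1e-10 ->", horizon_fine, "steps")
```

Making the initial error a million times smaller lengthens the usable forecast only by a modest number of steps, not by a factor of a million: this is the practical meaning of exponential error growth.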

The error in the initial condition, however small, grows exponentially and becomes relevant to the development of the phenomenon, making forecasts beyond a certain time impossible. Therefore, in making a forecast one has to set a tolerance threshold on the error with which a given quantity, for instance the position of the moon, is to be predicted. This threshold in turn determines the maximum time within which the forecast is deemed acceptable, in other words the predictability horizon. The chaotic nature of the dynamics thus sets intrinsic limits to our capacity to make forecasts. Such limits vary from one system to another: the predictability horizon for eclipses is about a thousand years, while for the weather it ranges from a few hours to a few days depending on weather conditions and on the specific location. This happens because the atmosphere is chaotic and far more complex than the solar system: it is a non-linear “N-body” system, with N much greater than in the solar system.
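The link between the tolerance threshold and the predictability horizon can be made explicit with a standard back-of-the-envelope estimate. The symbols are assumptions introduced here, not notation from the article: $\delta_0$ is the initial error, $\Delta$ the tolerance threshold, and $\lambda$ the largest Lyapunov exponent of the system, i.e. the average exponential growth rate of small errors.

$$
\delta(t) \simeq \delta_0\, e^{\lambda t}
\quad\Longrightarrow\quad
T_p \simeq \frac{1}{\lambda}\,\ln\frac{\Delta}{\delta_0}
$$

Because the horizon $T_p$ depends only logarithmically on $\delta_0$, even a large improvement in measurement precision buys a modest extension of the forecast, while the time scale $1/\lambda$ is what differs so widely between the solar system and the atmosphere.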

DETERMINISTIC LAWS AND FORECASTS

Let’s assume for a moment that a given system is ruled by deterministic laws which are unknown to us. Compared with the previous example, where the laws were known, the complexity of the problem has increased. We now ask whether, by studying a large amount of data describing the evolution of the system, just as the Maya did with the earth-moon-sun system, we can single out the features that help us determine its state at a future time, i.e. the features that enable a reliable forecast. The main idea is to apply to these data the so-called “method of analogs”, which infers the future state of the system from knowledge of its states going back to a relatively remote time in the past. In other words, we look in the past for a situation “close” to the present one and infer from what followed it how the system will evolve: if the time series describing the past evolution contains a situation comparable to the present one, we can hope to learn something about the system’s future even without a model of its evolution.
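A minimal sketch of the method of analogs follows. It is an illustration under stated assumptions, not a reference implementation: the function name `analog_forecast`, the use of a short window of consecutive values as the “situation”, the Euclidean distance as the measure of closeness, and the sine-wave test signal are all choices made here for the example.

```python
import math


def analog_forecast(history, window=3, lead=1):
    """Forecast the value `lead` steps after the last observation.

    Finds the past window of `window` consecutive values closest (in
    Euclidean distance) to the most recent window and returns what
    followed it, together with the distance to that analog.
    """
    n = len(history)
    current = history[n - window:]
    best_distance, best_end = math.inf, None
    # Candidates must not overlap the current window, and the value
    # `lead` steps after them must already be part of the record.
    last_end = n - max(window, lead)
    for end in range(window, last_end + 1):
        candidate = history[end - window:end]
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(candidate, current)))
        if distance < best_distance:
            best_distance, best_end = distance, end
    if best_end is None:
        raise ValueError("record too short to contain an analog")
    return history[best_end - 1 + lead], best_distance


# A strictly periodic signal (period of 20 samples) plays the role of a
# regular, low-dimensional system in which analogs are easy to find.
series = [math.sin(2 * math.pi * k / 20) for k in range(100)]
forecast, distance = analog_forecast(series, window=3, lead=1)
actual = math.sin(2 * math.pi * 100 / 20)
print(f"forecast={forecast:.6f} actual={actual:.6f} analog distance={distance:.2e}")
```

The approach works here because the signal is regular and low-dimensional, so a nearly identical past situation exists even in a short record.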

The Polish mathematician Mark Kac showed that the average return time of a system to a given state grows exponentially with the dimensionality of the system, that is, with the number of relevant variables describing its physical state. From a practical point of view, the regularities of a system with high dimensionality (a system with a sufficiently large number of interacting bodies) appear on time scales which are, and will remain, inaccessible however much digital data banks may grow. The earth-moon-sun system allowed us to single out regularities from historical data series because its dynamics are driven by deterministic laws and the number of bodies is limited to three. This case, however, is a lucky exception.
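In symbols, with notation that is an assumption of this note rather than the article’s: let $A$ be the set of states within a distance $\varepsilon$ of the present one, $\mu(A)$ the fraction of time the system spends in $A$, and $\tau_0$ the sampling interval of the data. Kac’s lemma gives the mean return time, and for a system whose states fill a region of linear size $L$ in $D$ dimensions, $\mu(A)$ scales roughly as $(\varepsilon/L)^D$:

$$
\langle \tau_A \rangle = \frac{\tau_0}{\mu(A)}, \qquad
\mu(A) \sim \left(\frac{\varepsilon}{L}\right)^{D}
\quad\Longrightarrow\quad
\langle \tau_A \rangle \sim \tau_0 \left(\frac{L}{\varepsilon}\right)^{D}
$$

As a rough illustration, asking for an analog accurate to 10% ($\varepsilon/L = 0.1$) requires a record of the order of $10^{3}$ samples for $D = 3$, but of the order of $10^{10}$ samples for $D = 10$.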

Even when a system is ruled by known deterministic laws, its evolution can be forecast only over periods of time determined by the features of the system itself. The law of gravity which moves the planets, the laws of fluid dynamics describing the atmosphere, and the laws of elasticity governing the movement of the earth’s plates that causes earthquakes are all well-known laws of physics, and all of them are deterministic. In spite of this, since the systems to which they apply are made up of many bodies, there is a predictability horizon: a time beyond which a reliable forecast is not possible because the system is chaotic. In the case of earthquakes, for example, the state of the system at a given time can be known only very roughly, which makes reliable forecasts impossible. The situation becomes even more complicated if the deterministic laws ruling the system’s dynamics are not known, or if they do not exist at all, as in systems whose evolution is ruled by statistical laws or by laws that change over time (think of economics or other social sciences).

At this point it is legitimate to ask whether in economics there are laws which govern the dynamics of markets just as the law of gravity governs the planets. As far as we know today, the answer is negative: any such laws certainly depend on time, since different rules for commercial exchange have been adopted in different historical periods, and they also depend on historical and social conditions that cannot be ignored.

SCIENTIFIC EXPERIMENTS AND STATISTICAL CORRELATIONS

In order to grasp the difference between social sciences and natural sciences we must bear in mind that natural laws are by definition universal and unchanging. Knowledge of these laws makes forecasts verifiable through experiments carried out under controlled conditions, which make it possible to eliminate or minimize the effects of external factors not taken into account by the theory. Only in this case, all other conditions being equal, is the result of an experiment universal, that is, repeatable at another place and time.

When these conditions are not met, we must be very cautious in using the mathematical and statistical methods that were developed for the study of natural sciences. The risk is to obtain results which appear to be scientific, i.e. similar to those obtained in the natural sciences, but which in reality are determined by “a priori” assumptions (or by an outright ideological framework) that entered the analysis more or less explicitly.

Let’s try to clarify this central point. In many cases we witness a rather carefree use of statistical analysis to find correlations among variables: even in the absence of a reference model, we look for correlations in the hope that they will allow us to infer the laws regulating the dynamics of a system. Data banks are the ideal place to look for “a posteriori” correlations, that is, correlations which are not expected “a priori” on the basis of a theoretical model but are simply identified in the data, and for which an explanation is sought afterwards. The values of two variables are measured periodically and the correlation coefficient is calculated: it equals 1 if one variable is an increasing linear function of the other, -1 if it is a decreasing linear function, and zero if the two variables are independent. A high correlation, however, does not imply that one variable causes the other; the two variables may simply have a common cause. For example, in Italian cities the number of churches and the number of homicides per year are both proportional to the population, which of course does not mean that an increase in the number of churches produces an increase in the number of homicides, or vice versa!
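The churches-and-homicides example can be reproduced with synthetic numbers. The sketch below makes several assumptions that are not in the article: the cities are randomly generated, the per-capita rates (one church per thousand inhabitants, one homicide per hundred thousand) and the 10% scatter are invented for illustration, and the helper `pearson` is written out by hand so that the example needs only the standard library.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Synthetic cities: hypothetical populations and rates, not real data.
populations = [rng.uniform(10_000, 1_000_000) for _ in range(500)]
# Both totals scale with population; their scatter is independent.
churches = [p * 1e-3 * (1 + rng.gauss(0, 0.1)) for p in populations]
homicides = [p * 1e-5 * (1 + rng.gauss(0, 0.1)) for p in populations]

raw_r = pearson(churches, homicides)

# Remove the common cause by comparing per-capita rates instead of totals.
churches_pc = [c / p for c, p in zip(churches, populations)]
homicides_pc = [h / p for h, p in zip(homicides, populations)]
per_capita_r = pearson(churches_pc, homicides_pc)

print(f"correlation of totals:           {raw_r:.2f}")
print(f"correlation of per-capita rates: {per_capita_r:.2f}")
```

The totals are strongly correlated only because both follow the population; once that common cause is divided out, the correlation between the two rates essentially disappears.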

Another example: the correlation coefficient between the number of personal computers and the number of AIDS sufferers from 1983 to 2004 was 0.99, a very high correlation but a totally irrelevant one. These are two diffusion processes which happened by chance to start and grow together and which are now, again together, slowing down. This example shows how it is possible to find spurious correlations which make no sense whatsoever. This happens when the data are many but the conceptual tools to analyze them are few or, even worse, when there are preconceptions and the data are searched for some sort of correlation that justifies them “a posteriori”. Another example illustrates the problem of spurious “a posteriori” correlations: a study found a statistically significant correlation between a nation’s chocolate consumption and the number of Nobel prizes awarded to its citizens, that is, the higher the chocolate consumption, the higher the number of Nobel prizes. It was even estimated that, in order to increase the number of Nobel prizes per ten million inhabitants by one, per capita chocolate consumption would have to increase by 0.4 kilos. Such a conclusion is clearly senseless: the presence of a correlation does not imply the presence of causality. Many more examples could be given of cases in which the analysis of a large amount of data uncovers correlations between completely independent phenomena.
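Why two unrelated diffusion processes yield a correlation close to 1 can be seen with a small synthetic experiment. The sketch below does not use the real PC or AIDS figures: it assumes that each process follows a logistic “S-curve”, and the midpoints and widths are hypothetical values chosen only so that both curves rise over the same period, 1983 to 2004.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def s_curve(year, midpoint, width):
    """Logistic diffusion curve: fraction of the final level reached by `year`."""
    return 1.0 / (1.0 + math.exp(-(year - midpoint) / width))


years = list(range(1983, 2005))

# Hypothetical parameters for two unrelated processes (not real data).
process_a = [s_curve(y, midpoint=1994, width=3.0) for y in years]
process_b = [s_curve(y, midpoint=1995, width=2.5) for y in years]

r_ab = pearson(process_a, process_b)
r_a_time = pearson(process_a, years)  # correlation with the calendar itself

print(f"process A vs process B: r = {r_ab:.3f}")
print(f"process A vs year:      r = {r_a_time:.3f}")
```

Both coefficients come out close to 1: any two quantities that grow over the same years are strongly correlated with each other, and with the calendar, without any link between them.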

Can we use modern digital data banks to understand what is going to happen in the future, just as the Maya used astronomical data, that is, by finding “regularities” in the time series of a given phenomenon without a reference model? The answer to this question is generally negative, and the “end of the theory” turns out to be a mirage. Even physical systems, which are more manageable because we know their underlying dynamical laws, are governed by forces which, although deterministic, give rise to chaotic behavior and thus to intrinsic difficulties in forecasting their future behavior.

THE MIRAGE OF THE “END OF THE THEORY”

When we do not know the laws governing the evolution of a system, or when such laws are not deterministic and universal (because they change in time or are statistical laws), the situation rapidly becomes intractable. In such cases, can we hope to find correlations in the data linking the changes of a few quantities, and use these correlations, even without understanding their origin, to predict the future behavior of a system? Here too the answer must usually be negative. If we consider a sufficiently complex system, made up of many bodies and governed by deterministic laws, the analysis of historical series does not help to find an analog, that is, a past situation close to the current one from which the future evolution of the system could be inferred. On the other hand, it is possible to use data banks to find “a posteriori” correlations between variables describing the state of a system, always bearing in mind that an “a posteriori” correlation, which is not the genuine prediction of a theory, does not usually imply the existence of a causal link. On the contrary, it can be very misleading: dressed up in a pseudo-scientific analysis, it can be used instrumentally to support theses which are only ideological assumptions.

Big data can be a useful instrument for checking whether the assumptions underlying certain models or theories hold, and this also applies to the social sciences. In economics, for example, financial markets are an ideal laboratory to test certain fundamental concepts. This is particularly important for the mainstream theory of efficient markets, which presumes that deregulated markets are efficient because rational agents rapidly correct any mispricing or valuation error. Likewise, credit card and e-commerce records should make it possible to monitor consumption in real time and thus to test theories of consumer behavior in great detail. This would make it possible to answer the following questions: does the price of goods accurately reflect the underlying reality and guarantee the optimal allocation of resources? Is the price really set in such a way that supply meets demand? Are price changes due to specific information available to operators? In a similar way, the terabytes of data processed every day by financial markets could allow us to compare theories with observations in detail: are markets “in equilibrium” stable? Are great economic crises triggered only by large external disturbances such as hurricanes, earthquakes or political upheavals (or, less dramatically, by the instability of a government coalition), or are they caused by the intrinsic instability of the markets themselves? Therefore, rather than looking for “a posteriori” correlations to lend fleeting empirical support to some theoretical model, we need to be prepared to handle large quantities of data and to learn how to analyze them without preconceptions. At times, though not always, it will then be possible to verify whether the theoretical assumptions underlying important models that interpret social reality, and that are often in open competition with alternative models, are actually borne out by reality.
