In the study of biological systems, two major goals are inference and prediction. Inference creates a mathematical model of the data-generation process to formalize understanding or test a hypothesis about how the system behaves. Prediction, on the other hand, finds generalizable predictive patterns. Both statistics and machine learning can be used for both inference and prediction, but they differ in their approaches and applications.
Classical Statistics vs. Machine Learning
Classical statistics draws population inferences from a sample. It is designed for data with a few dozen input variables and sample sizes that would be considered small to moderate today. In this scenario, the model fills in the unobserved aspects of the system.
Classical statistical modeling makes strong assumptions about data-generating systems, such as linearity, normality, and independence. It is effective when the data are gathered with a carefully controlled experimental design and in the absence of complicated nonlinear interactions.
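To make this concrete, the sketch below fits a simple linear model to synthetic data and reports the estimated slope with its standard error and p-value, the kind of population inference classical modeling is built for. The data, effect size, and noise level are assumptions chosen purely for illustration.

```python
# A minimal illustration of classical inference under linearity, normality and
# independence assumptions; the data here are synthetic and purely illustrative.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
dose = np.linspace(0, 10, 30)                        # controlled input variable
response = 2.0 + 0.5 * dose + rng.normal(0, 1, 30)   # linear effect + Gaussian noise

fit = linregress(dose, response)
print(f"slope = {fit.slope:.2f} +/- {fit.stderr:.2f}, p = {fit.pvalue:.3g}")
```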
Machine learning, by contrast, concentrates on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data. Machine learning methods are particularly helpful when one is dealing with ‘wide data’, where the number of input variables exceeds the number of subjects, in contrast to ‘long data’, where the number of subjects is greater than that of input variables.
Machine learning methods make minimal assumptions about the data-generating systems; they can be effective even when the data are gathered without a carefully controlled experimental design and in the presence of complicated nonlinear interactions. However, despite convincing prediction results, the lack of an explicit model can make machine learning solutions difficult to relate directly to existing biological knowledge.
Statistics vs. Machine Learning: A Comparative Study
Simulation of Gene Expression
To compare classical statistics with machine learning approaches, the expression of 40 genes in two phenotypes (-/+) was simulated. Mean expression varied between phenotypes for every gene, but for the first 30 genes these differences were random and not systematically related to phenotype. The last 10 genes were dysregulated, with systematic differences in mean expression between phenotypes.
Using these average expression values, an RNA-seq experiment was simulated in which the observed counts for each gene were sampled from a Poisson distribution with mean exp(x + ε), where x is the mean log expression, unique to the gene and phenotype, and ε ~ N(0, 0.15) acts as biological variability that varies from subject to subject.
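A minimal sketch of such a simulation is shown below. The number of subjects per phenotype, the baseline expression levels, and the size of the dysregulation effect are assumptions not specified in the text; only the count model, Poisson with mean exp(x + ε) and ε ~ N(0, 0.15), follows the description above.

```python
# Illustrative simulation of the described RNA-seq experiment (assumed parameters).
import numpy as np

rng = np.random.default_rng(0)

n_genes = 40          # 30 null genes + 10 dysregulated genes
n_per_group = 50      # assumed number of subjects per phenotype

# Mean log expression per gene and phenotype: small random differences for all
# genes, plus a systematic shift for the last 10 (dysregulated) genes.
base = rng.normal(loc=3.0, scale=0.5, size=n_genes)
x_minus = base + rng.normal(0, 0.05, size=n_genes)
x_plus = base + rng.normal(0, 0.05, size=n_genes)
x_plus[30:] += 0.8    # assumed effect size for the dysregulated genes

def simulate_counts(x, n_subjects, sigma=0.15):
    """Observed counts ~ Poisson(exp(x + eps)), eps ~ N(0, sigma) per subject."""
    eps = rng.normal(0.0, sigma, size=(n_subjects, len(x)))
    return rng.poisson(np.exp(x + eps))

counts_minus = simulate_counts(x_minus, n_per_group)   # phenotype '-'
counts_plus = simulate_counts(x_plus, n_per_group)     # phenotype '+'
```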
Difference Between Statistics and Machine Learning
In the gene expression simulation, classical statistical methods and machine learning methods were compared. For the classical approach, a two-sample t-test compared the mean expression of each gene between the two phenotypes. For the machine learning approach, a random forest classifier predicted the phenotype from the expression of all 40 genes. Both approaches identified the dysregulated genes with high accuracy.
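A hedged sketch of both analyses is given below, reusing the simulated counts from the previous snippet. The log transform of the counts, the number of trees, and the cross-validation scheme are illustrative assumptions, not details taken from the text.

```python
# Illustrative comparison: per-gene t-tests vs. a random forest on all genes.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Classical approach: two-sample t-test on each gene's log-transformed counts
log_minus = np.log1p(counts_minus)
log_plus = np.log1p(counts_plus)
t_stats, p_values = ttest_ind(log_plus, log_minus, axis=0)
print("Genes with p < 0.05:", np.where(p_values < 0.05)[0])

# Machine learning approach: random forest using all 40 genes at once
X = np.vstack([log_minus, log_plus])
y = np.array([0] * len(log_minus) + [1] * len(log_plus))   # 0 = '-', 1 = '+'
rf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)
print("Cross-validated accuracy:", scores.mean())
```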
However, the classical approach identified the dysregulated genes only because the genes to be tested were pre-selected based on prior knowledge. In contrast, the machine learning approach identified the dysregulated genes without any prior knowledge by considering all 40 genes simultaneously.
The Importance of Feature Selection
One of the challenges of machine learning is feature selection: identifying which input variables (genes) are most important for predicting the output variable (phenotype). In the gene expression simulation, the random forest classifier identified the dysregulated genes as the most important features for predicting the phenotype. However, not all dysregulated genes ranked among the most important features, and some non-dysregulated genes did.
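The sketch below shows one way such importance scores can be inspected for the random forest fitted above; the gene indexing and the choice to look at the top 10 genes are illustrative assumptions.

```python
# Illustrative inspection of random forest feature importances
# (assumes X, y, and rf from the previous snippet).
import numpy as np

rf.fit(X, y)                                   # fit on the full simulated dataset
importances = rf.feature_importances_          # one importance score per gene
ranked = np.argsort(importances)[::-1]         # genes ranked most to least important

# The last 10 genes (indices 30-39) are the truly dysregulated ones; in practice
# the ranking will recover many, but not necessarily all, of them.
print("Top 10 genes by importance:", ranked[:10])
```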
This highlights the importance of careful feature selection in machine learning, as well as the potential for machine learning to identify novel biomarkers that may not be identified using classical statistical methods.
Conclusion
Both classical statistical methods and machine learning methods have their strengths and weaknesses in the study of biological systems.
Classical statistical methods are effective when the data are gathered with a carefully controlled experimental design and in the absence of complicated nonlinear interactions. Machine learning methods are effective when dealing with ‘wide data’ and complicated nonlinear interactions, but may be difficult to directly relate to existing biological knowledge.
The simulation of gene expression showed that both classical statistical methods and machine learning methods identified the dysregulated genes with high accuracy. However, machine learning methods did so without any prior knowledge by considering all genes simultaneously, which highlights their potential to identify novel biomarkers that classical statistical methods may miss. Overall, the choice between a statistical and a machine learning method should depend on the specific research question and the characteristics of the data being analyzed. By understanding the strengths and weaknesses of both approaches, researchers can make informed decisions about which method to use in their research.
FAQs
How are Statistics and Machine Learning connected?
Both statistics and machine learning are centered around analyzing data. Many machine learning techniques have their roots in statistics, and statistical models are frequently used in machine learning to estimate parameters and evaluate model performance.
Can Machine Learning and Statistics be used together?
Absolutely. In fact, they often are. Statistics provides the foundation for understanding data and quantifying uncertainty, while machine learning provides robust tools for pattern recognition and prediction. Using both can offer a powerful approach to data analysis.
How does AI relate to Statistics?
AI, or artificial intelligence, encompasses more than just statistics. While AI uses statistical techniques for tasks like pattern recognition and prediction, it also includes other aspects like knowledge representation, natural language processing, and robotics.
Is Statistics necessary for Machine Learning?
Yes, a foundational understanding of statistics is crucial for machine learning. It helps in understanding how different machine learning algorithms work and how to interpret their results. It also assists in the proper design of machine learning models and the evaluation of their performance.
Which one is better, Statistics or Machine Learning?
Neither is inherently better than the other. The choice between a statistical or machine learning approach depends heavily on the task at hand. Statistics excels at providing understanding and interpretation of data, whereas machine learning shines at making accurate predictions, especially with large and complex datasets.