I learned a lot from Professor YY’s talk on Machine Learning . The following is my notes on the lecture.
1. What is machine learning #
Build a computer program that can learn from data (experiences).
1.1 Examples of machine learning tasks #
-
Email spam filters
-
Handwriting recognition (USPS)
-
Voice recognition (Alexa, Siri)
-
Prediction models of any kind
-
AlphaGo, autonomous driving
-
Chatbot
-
Create a new song, a recipe, or an article
1.2 Topics in machine learning #
-
Do a new task
-
Do a task with new data
-
Do a task better (with new data or method)
-
Measure the performance of an algorithm employed in a task (Why is it performing better; what is it learning)
2. Primary domains of machine learning #
-
Supervised learning: you have the data and the answers
-
Unsupervised learning: you have the data and you want to find patterns in it, for example, find compact representation of high-dimensional data
-
Reinforcement learning: creating an agent that can learn from occasional feedback from the environment.
2.1 Supervised learning #
2.2 Unsupervised learning #
An example task: given many documents, you use ML algorithms to find out how language is structured.
2.3 Reinforcement learning #
The agent receives infrequent feedback (e.g., rewards or punishments) and learn how to perform a task. The feedback is infrequent or occasional in the sense that you cannot know the result until the end, for example, in a Go game .
A prime example of reinforcement learning is AlphaGo.
3. Regression models and machine learning #
Your regression model is indeed doing machine learning.
Machine learning is task-oriented. You can use any tools to achieve this goal. Since statistical models are good at revealing patterns in data, they are widely used in machine learning. That said, tools not grounded in statistics (e.g., evolutionary algorithms, Deep Learning) are also used in machine learning, as long as they work.
Machine learning is more about “solving a task” whereas statistics is more about understanding something.
3.1 Social sciences and machine learning #
Models in social sciences should be (1) simple, and (2) be based on theory. You cannot just throw everything into the model and you should care about the interpretation of your model.
Machine learning is different. Simplicity and interpretation of the models are much less important than their performance.
4. Bias-Variance trade off #
Machine learning prioritizes prediction, so it is prone to overfitting. Applying thousands of features or billions of parameters to machine learning is not unusual.
4.1 How to avoid overfitting: Split the data #
-
Training—test split: In the beginning, split the data into two parts: test dataset, and training data. Test dataset can be small but it should only be used at last to evaluate the model and report the performance (Don’t peek into it until evaluation). Train your model with the training data.
-
Training—validation—test split: train multiple models using training dataset, pick the best model using validation data, and report the final performance of the model using test data.
-
Training—cross validation—validation—test split: I do not fully understand this method.
5. Social scientists V.S. ML scientists #
Suppose the topic of a research project is “poverty of an individual”.
5.1 Social scientists #
The following is how social scientists approach this topic:
-
Any existing theories of poverty? Known factors in existing literature? —— Social scientists focus on existing theories/literature, rather than the task/performance.
-
What are the hypotheses? What to measure? What should be the dataset? —— Social scientists focus on mechanism, hypotheses, measurement, and data
-
Let’s collect a small dataset through a survey. —— Social scientists tend to create their own dataset
-
Let’s run our regression models. —— Social scientists are likely to be restricted by methodologies.
-
Statistically significant? Let’s publish it. —— Much emphasis on significance, effect size, and hypotheses, less about the prediction performance of the model.
5.2 ML scientists #
The following is how machine learning scientists approach this topic:
-
Alright, it’s about predicting a person’s poverty using some features. —— ML focuses on the task.
-
Any labeled dataset? “What are the state-of-the-art models on this dataset”? —— ML focuses on “the performance metric”
-
What are the features we have now? Can we add new features: satellite images? mobile phone usage? social networks? —— ML scientists have a large toolbox
-
Let’s try many models. Or we can develop a new method. —— ML scientists are flexible with methods and are more likely to develop new methods.
-
Make sure we split the dataset and do not look into the test dataset. —— ML scientists emphasize generalizability.
-
Do we bet on the state-of-the-art (SOTA)? Let’s publish. —— ML scientists focus on performance and less interested in understanding
5.3 A cool paper combining these two methods #
Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone metadata . Science, 350(6264), 1073-1076.
6. Some Machine Learning Methods (non-regression ones) #
-
KNN (k-nearest neighbors)
-
Decision Tree
-
Random forest (ensemble learning rocks)
This is the go-to method if you are unsure which methods to use.
- XGBoost
Many Kaggle winners use this method.
- Topic modeling (Bayesian inference)
7. Deep Learning #
Image recognition, language tasks, protein folding, go, word embedding, etc.
8. Summary #
- Is machine learning just a linear regression?
Not really. ML is all about making machines to learn. Statistical models are just one tool in its toolbox.
- Can ML be useful for social scientists?
Yes. With ML, you can (1) extract richer information from diverse datasets (texts, images, networks, etc.), (2) try other models, (3) think about generalizability of the model out-of-sample, (4) try incorporating many features or letting machines learn features, and (5) pay attention to prediction performance of the model.
- What can ML researchers learn from social scientists?
(1) Understand the variables and models, rather than just throwing in complex models; (2) Have some domain knowledge; (3) Collect quality data on your own; (4) Think about social implications of the task
8.1 How to speak MLese #
-
Dependent variables -> labels (classification) or output/target variable (regression)
-
Independent variable -> features
-
Coefficients -> parameters
-
Estimating the model -> training a classifier (or a model)
-
$R^2$
, p value, effect size -> accuracy, F1-score, AUROC -
Regularization
-
Split dataset. Cross-validation
-
Using Random Forest, XGBoost, Ensemble learning, Neural network
9. Other information #
The 3blue1brown channel on YouTube is highly recommended.
ML is not magic. There is lots of work behind the scenes: preparing data, finding out different models, tweaking hyperparameters. Lots of tweaking.
Last modified on 2021-10-05