Disclaimer: The views and opinions expressed in this blog are entirely my own and do not necessarily reflect the views of my current or any previous employer. This blog may also contain links to other websites or resources. I am not responsible for the content on those external sites or any changes that may occur after the publication of my posts.
End Disclaimer
But First, A Thank You Note:
Thank you, thank you, to all my readers.
I started this blog in October of 2023, afraid to fail and unsure of how it would go. As one of my friends said to me, a blog is the digital equivalent of going to the center of the town square, shouting out loud and waiting for either laughs and applause or for people to throw things at you.
I didn’t think anyone would sign up, and I certainly didn’t think anyone would pay to read what I had to say. Both of these things have happened and continue to do so.
Each time I go to put a story behind a paywall and don’t because I think/hope it will help more people, it’s precisely the paying readers that I have to thank. For those patrons whom I know personally, I can hear your voices as I remove the paywall:
“Come on man, make that one free.”
Thank you.
-Otakar
Below are the answers to a test (not the test, of course). Glory and riches await you should you so desire.
I gave a talk to some interns this summer, and afterward several of them, not specifically in CompSci or ML, asked how they could get a better handle on all the nomenclature and how the pieces fit together.
Here is a start. Being able to speak to any and all of these would be a bonus in a multi-hyphenate interview, e.g., “I want to be a finance/marketing/sales person-coder-machine learning-smattering of generative AI”.
Stringing together hyphens is a surefire way to increase your chances of standing out in a pile of resumes, getting hired, and being more desirable to retain during a forced corporate cut.
These are “fast facts”: tidbits, morsels of AI and ML goodness. This list is a starting point and by no means exhaustive. It’s up to you, dear reader, to choose to dive down the rabbit hole.
“This is your last chance. After this, there is no turning back. You take the blue pill - the story ends, you wake up in your bed and believe whatever you want to believe. You take the red pill - you stay in Wonderland and I show you how deep the rabbit hole goes.”
-Morpheus
1. Gradient descent is an optimization algorithm used to minimize loss functions (a toy sketch follows item 6 below).
2. Backpropagation is used to calculate gradients in neural networks.
3. Overfitting occurs when a model performs well on training data but poorly on unseen data.
4. Cross-validation helps assess a model's performance on unseen data.
5. The bias-variance tradeoff is a fundamental concept in machine learning: high bias can lead to underfitting, while high variance can lead to overfitting.
6. L1 regularization (Lasso) can lead to sparse models by driving some weights to zero.
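Before we get to the mind maps, here is the toy sketch promised in fact 1: a minimal example of gradient descent fitting a one-variable linear model by minimizing mean squared error. The data values and learning rate below are made up for illustration, not taken from any particular library or dataset.

```python
import numpy as np

# Toy data: y is roughly 3*x + 2, plus a little noise (made-up values)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 11.0, 13.9])

w, b = 0.0, 0.0   # parameters to learn
lr = 0.01         # learning rate: the step size for each update

for step in range(2000):
    y_hat = w * x + b                 # current predictions
    error = y_hat - y
    grad_w = 2 * np.mean(error * x)   # gradient of MSE with respect to w
    grad_b = 2 * np.mean(error)       # gradient of MSE with respect to b
    w -= lr * grad_w                  # step downhill, against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))       # should land near 3 and 2
```

Backpropagation (fact 2) is what computes these same kinds of gradients automatically through every layer of a neural network.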
Here’s the Broader Mind Map Graphic of how some of the pieces fit:
More Granular Version: it’s a little spaghetti, but I think still helpful:
7. L2 regularization (Ridge) helps prevent overfitting by penalizing large weights.
8. Dropout is a regularization technique that randomly sets a fraction of units to zero during training (toy sketch after the list).
9. Batch normalization normalizes the inputs of each layer to reduce internal covariate shift.
10. The curse of dimensionality refers to various phenomena that arise when analyzing data in high-dimensional spaces.
11. Principal Component Analysis (PCA) is used for dimensionality reduction.
12. t-SNE is a technique for visualizing high-dimensional data in 2D or 3D space.
13. The perceptron is the simplest type of artificial neural network.
14. ReLU (Rectified Linear Unit) is a commonly used activation function in neural networks.
15. Sigmoid function outputs values between 0 and 1, useful for binary classification.
16. Softmax function is used for multi-class classification, outputting probabilities that sum to 1 (sigmoid and softmax sketch after the list).
17. Convolutional Neural Networks (CNNs) are particularly effective for image processing tasks.
18. Recurrent Neural Networks (RNNs) are designed to work with sequence data.
19. Long Short-Term Memory (LSTM) networks are a type of RNN that can learn long-term dependencies.
20. Transformers use self-attention mechanisms and have revolutionized natural language processing.
21. Transfer learning involves using pre-trained models for new, related tasks.
22. Ensemble methods combine multiple models to improve overall performance.
23. Random Forests are an ensemble learning method using multiple decision trees.
24. Boosting is an ensemble technique that combines weak learners sequentially.
25. XGBoost is a popular implementation of gradient boosting.
26. Support Vector Machines (SVMs) find the hyperplane that best separates classes in high-dimensional space.
27. K-means is an unsupervised clustering algorithm.
28. DBSCAN is a density-based clustering algorithm that can find clusters of arbitrary shape.
29. The elbow method is used to determine the optimal number of clusters in K-means (sketch after the list).
30. Naive Bayes is a probabilistic classifier based on Bayes' theorem.
31. A confusion matrix is used to evaluate classification models.
32. Precision is the ratio of true positives to all predicted positives.
33. Recall is the ratio of true positives to all actual positives.
34. F1 score is the harmonic mean of precision and recall (worked example after the list).
35. ROC curve plots true positive rate against false positive rate at various thresholds.
36. AUC (Area Under the Curve) summarizes the performance of a classifier across all possible thresholds.
37. Mean Squared Error (MSE) is a common loss function for regression problems.
38. Cross-entropy loss is commonly used for classification problems.
39. Stochastic Gradient Descent (SGD) updates model parameters using one sample at a time.
40. Mini-batch gradient descent uses a small batch of samples for each update.
41. Learning rate is a hyperparameter that controls the step size in gradient descent.
42. Adam is an adaptive learning rate optimization algorithm.
43. Early stopping is a regularization technique that stops training when validation error stops improving.
44. One-hot encoding is used to represent categorical variables as binary vectors.
45. Feature scaling is important for many machine learning algorithms.
46. Min-max scaling scales features to a fixed range, typically between 0 and 1.
47. Z-score normalization scales features to have zero mean and unit variance (scaling sketch after the list).
48. Word embeddings are dense vector representations of words.
49. Word2Vec is a technique for learning word embeddings.
50. BERT is a transformer-based model for natural language processing.
51. GPT (Generative Pre-trained Transformer) is a decoder-only model designed for natural language generation.
52. Reinforcement learning involves an agent learning to make decisions by interacting with an environment.
53. Q-learning is a model-free reinforcement learning algorithm.
54. The exploration-exploitation tradeoff is a key concept in reinforcement learning.
55. Generative Adversarial Networks (GANs) consist of a generator and a discriminator competing against each other.
56. Variational Autoencoders (VAEs) are generative models that learn latent representations of data.
57. The No Free Lunch Theorem states that no single machine learning algorithm is universally best for all problems.
58. Bagging (Bootstrap Aggregating) involves training multiple models on random subsets of the data.
59. A decision boundary is the surface in feature space at which a classifier switches from predicting one class to another.
60. Dimensionality reduction can help combat the curse of dimensionality.
61. Feature selection involves choosing a subset of relevant features for use in model construction.
62. Imbalanced datasets can lead to biased models if not properly addressed.
63. SMOTE (Synthetic Minority Over-sampling Technique) is used to address class imbalance.
64. A learning curve shows a model's performance on training and validation sets as a function of training set size.
65. Hyperparameter tuning is the process of optimizing a model's hyperparameters.
66. Grid search is an exhaustive search through a manually specified subset of hyperparameters.
67. Random search can be more efficient than grid search for hyperparameter optimization.
68. Bayesian optimization is a global optimization technique for noisy black-box functions.
69. Cross-entropy measures the difference between two probability distributions (worked example after the list).
70. The vanishing gradient problem can occur in deep neural networks, especially with certain activation functions.
71. Exploding gradients can occur when error gradients accumulate, resulting in very large updates to neural network weights.
72. Gradient clipping is a technique to prevent exploding gradients.
73. Beam search is a heuristic search algorithm often used in sequence generation tasks.
74. The cold start problem refers to the challenge of recommending items to new users or recommending new items.
75. Collaborative filtering is a technique used in recommender systems.
76. Content-based filtering makes recommendations based on a user's item preferences and item features.
77. A Boltzmann machine is a type of stochastic recurrent neural network.
78. Self-organizing maps are a type of artificial neural network used for dimensionality reduction.
79. The Vapnik-Chervonenkis (VC) dimension is a measure of the capacity of a classification algorithm.
80. The bias-complexity tradeoff relates model complexity to its bias and variance.
81. Gaussian processes are a method for regression and probabilistic classification.
82. The kernel trick allows algorithms to operate in a high-dimensional feature space without computing coordinates in that space.
83. Active learning is a subfield of machine learning where the algorithm can query a user for labels.
84. Semi-supervised learning uses both labeled and unlabeled data for training.
85. Multi-task learning involves learning multiple related tasks simultaneously.
86. Domain adaptation addresses the situation where the training and test data are from different distributions.
87. Anomaly detection is the identification of rare items, events, or observations that differ significantly from the majority of the data.
88. Time series forecasting involves making predictions about future values based on previously observed values.
89. ARIMA (AutoRegressive Integrated Moving Average) is a popular model for time series forecasting.
90. The Fourier transform is often used in signal processing and time series analysis.
91. Kalman filters are used for estimating unknown variables from noisy measurements.
92. A Markov chain is a stochastic model describing a sequence of possible events.
93. Hidden Markov Models (HMMs) are used for modeling systems that are assumed to be Markov processes with hidden states.
94. The expectation-maximization (EM) algorithm is used to find maximum likelihood estimates of parameters in statistical models.
95. A Restricted Boltzmann Machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs.
96. Adversarial examples are inputs to machine learning models designed to cause the model to make a mistake.
97. Federated learning allows for training models on distributed datasets without exchanging the data itself.
98. Differential privacy is a system for publicly sharing information about a dataset while withholding information about individuals in the dataset.
99. Quantum machine learning explores the intersection of quantum computing and machine learning.
100. Neural Architecture Search (NAS) is a technique for automating the design of artificial neural networks.
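Before the credits roll, here are a few minimal Python sketches to make some of the facts above concrete. All of them use made-up numbers and are illustrations only, not production code.

Fact 8, dropout: a small sketch of “inverted” dropout applied to a vector of activations. The array size and keep probability are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=8)   # pretend these are one layer's outputs
keep_prob = 0.8                    # keep 80% of units, drop the rest

# During training: zero out a random subset of units and rescale the survivors
mask = rng.random(8) < keep_prob
dropped = activations * mask / keep_prob   # "inverted" dropout keeps the expected value unchanged
print(dropped)

# At test time, dropout is simply turned off and the full activations are used.
```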
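Facts 15 and 16, sigmoid and softmax: the input scores below are arbitrary numbers chosen just to show the shape of the outputs.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1); handy for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    # Subtracting the max first is a common numerical-stability trick
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

print(sigmoid(0.0))                        # 0.5
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())                  # three probabilities that sum to 1
```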
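Facts 27 and 29, K-means and the elbow method: one way to eyeball the elbow, assuming scikit-learn is installed. The blob data is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three "true" clusters
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit K-means for a range of k and record the inertia (within-cluster sum of squares)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# The "elbow" is the k where inertia stops dropping sharply; here it should be around 3.
```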
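Facts 31 through 34, precision, recall, and F1, computed straight from the cells of a confusion matrix. The counts are invented.

```python
# Invented counts for a binary classifier
tp, fp = 80, 20   # predicted positive: 80 correct, 20 incorrect
fn, tn = 10, 90   # predicted negative: 10 missed positives, 90 correct negatives

precision = tp / (tp + fp)   # of everything predicted positive, how much was right
recall    = tp / (tp + fn)   # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(round(precision, 3), round(recall, 3), round(f1, 3))   # 0.8, 0.889, 0.842
```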
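Facts 38 and 69, cross-entropy: here measured between a one-hot “true” distribution and a made-up predicted distribution.

```python
import numpy as np

p_true = np.array([0.0, 1.0, 0.0])   # one-hot label: the true class is class 1
q_pred = np.array([0.1, 0.7, 0.2])   # model's predicted probabilities (made up)

# Cross-entropy H(p, q) = -sum(p * log(q)); lower means q is closer to p
cross_entropy = -np.sum(p_true * np.log(q_pred))
print(round(cross_entropy, 3))       # -log(0.7), about 0.357
```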
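Facts 46 and 47, min-max scaling versus z-score normalization, applied to the same made-up feature column.

```python
import numpy as np

feature = np.array([10.0, 20.0, 30.0, 40.0, 100.0])   # made-up raw values

# Min-max scaling: squeeze everything into [0, 1]
min_max = (feature - feature.min()) / (feature.max() - feature.min())

# Z-score normalization: zero mean, unit variance
z_score = (feature - feature.mean()) / feature.std()

print(min_max)   # ranges from 0 to 1
print(z_score)   # mean ~0, standard deviation ~1
```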
If you’ve made it this far, congratulations, you’ve taken the red pill.