Neural Networks Starter Kit: 9 Fundamental Building Blocks for Developers
AI is booming, so rather than treating it as magic, we should arm ourselves with the mathematics and concepts needed to understand what is happening and to prepare for what is coming.
So this article aims to do exactly that: give you the starter kit related to neural networks.
There will be several concepts explained, and using these you will be better equipped to explore deeper concepts and eventually start developing advanced use cases and leverage the tech in today’s world.
So let’s start nice and simple.
1. The Core Building Block: Weights, Biases, and Neurons
To understand complex AI, we must start with the smallest unit: the neuron. A neuron is a mathematical function that takes an input and produces an output.
What is a Neuron?
A neuron simply performs the following calculation:
w.x + b
- x: The Input
- w: The Weight (determines the "steepness" or importance of the input)
- b: The Bias (shifts the output up or down)
We will use these formulas again and again, so consider them one of the tools in your starter kit.
Understanding the Hidden Layer
A hidden layer consists of multiple neurons. Since each neuron has its own weight and bias, they can look at the same input differently. By combining their outputs, the network can learn flexible and complex patterns rather than just simple straight lines.
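To make this concrete, here is a minimal sketch of a hidden layer in NumPy. The weights and biases are made-up values, purely for illustration:

```python
import numpy as np

# One input value fed to a hidden layer of three neurons.
# Each neuron has its own (hypothetical) weight and bias.
x = 2.0
w = np.array([1.0, 2.0, -0.5])   # one weight per neuron
b = np.array([0.0, -1.0, 3.0])   # one bias per neuron

# Every neuron computes w*x + b on the same shared input.
outputs = w * x + b
print(outputs)  # [2. 3. 2.]
```

Even though all three neurons see the same x, their different weights and biases give three different outputs, which is what lets later layers combine them into richer patterns.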
So far, all we have is some multiplication of numbers. For an AI to become smarter, we need to add a few more ingredients. Let's move on to the next one.
Visualizing Weights and Biases
Suppose there are two straight lines:
- w = 1, b = 0 → gentle slope, passes through the origin
- w = 2, b = -1 → steeper slope, shifted downward
From this, we know that:
- Increasing the weight makes the line steeper
- Changing the bias moves the line up or down
Python visualization
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-5, 5, 100)
# Two neurons
y1 = 1 * x + 0 # w=1, b=0
y2 = 2 * x - 1 # w=2, b=-1
plt.figure()
plt.plot(x, y1, label="w=1, b=0")
plt.plot(x, y2, label="w=2, b=-1")
plt.xlabel("Input (x)")
plt.ylabel("Output")
plt.title("Effect of Weight and Bias")
plt.legend()
plt.show()
2. Activation Functions: Adding "Logic" to Neurons
If neurons only performed w.x + b, a neural network would just be a series of straight lines. Activation functions introduce non-linearity, allowing the network to learn complex relationships. Think of them as a weighing scale: they decide how much a neuron should "fire" or activate.
Common Activation Functions
1. ReLU (Rectified Linear Unit):
The modern default. If the input is negative, it outputs 0. If positive, it outputs the same value. For example, you feed in -2 to ReLU, it will give 0. If you feed in 2 to ReLU, it will give 2. Simple. ReLU is what we will mostly be using here. Just think of it as something that prevents negative numbers from coming up.
2. Softplus:
A "smooth" version of ReLU that avoids sharp turns.
3. Sigmoid:
Squashes any input into a value between 0 and 1, making it ideal for probability and binary classification.
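These three functions are short enough to write out directly; here is a sketch in NumPy:

```python
import numpy as np

def relu(x):
    # Negative inputs become 0; positive inputs pass through unchanged.
    return np.maximum(0, x)

def softplus(x):
    # A smooth approximation of ReLU: log(1 + e^x).
    return np.log(1 + np.exp(x))

def sigmoid(x):
    # Squashes any input into the range (0, 1).
    return 1 / (1 + np.exp(-x))

print(relu(-2.0), relu(2.0))   # 0.0 2.0
print(round(sigmoid(0.0), 2))  # 0.5
```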
Now, another thing we often hear about is AI making mistakes, correcting itself, and gradually becoming better.
This ability is made possible by gradients.
3. Gradients: The Engine of Learning
If a neural network makes a mistake, how does it fix itself? It uses Gradients. In our previous look at activation functions, we saw how they transform inputs. But when an output is wrong, the network needs a "map" to find its way back to the correct answer.
What is a Gradient?
Imagine you are standing on a hill. If the ground is steep, you know exactly which direction leads down to the valley. If the ground is flat, you’re lost.
In AI, a gradient is a number that tells us the steepness and direction of a curve at a specific point.
- Large Gradient: Represents a steep slope; the network makes a big update to the weights.
- Zero Gradient: The ground is flat; the neuron stops learning because it doesn't know which way to move.
Why Gradients Matter
Training a neural network is a three-step cycle:
- Prediction: The network makes a guess.
- Loss Calculation: We measure how wrong that guess was.
- Update: We use gradients to adjust the weights and reduce the loss.
Every activation function has its own "gradient curve" that dictates how efficiently the network learns.
Gradients of Popular Activation Functions
1. ReLU Gradient
The gradient of ReLU is binary. Either it’s "on" or it’s "off."
- Gradient = 0 when the input is negative (the output is flat at 0)
- Gradient = 1 when the input is positive (the output equals the input)
ReLU learns very fast when active because the gradient is a constant 1. However, if a neuron always receives negative inputs, the gradient stays at 0, and the neuron "dies" (stops learning entirely).
2. Softplus Gradient
The gradient of the Softplus function is actually the Sigmoid function:

softplus'(x) = sigmoid(x) = 1 / (1 + e^(-x))
Unlike ReLU, Softplus has a smooth transition. Learning always continues because the gradient never abruptly hits zero. This adds stability and prevents "dying neurons," though it is computationally slower than ReLU.
3. Sigmoid Gradient
The sigmoid gradient is strongest in the center and tapers off at the ends:

sigmoid'(x) = sigmoid(x) × (1 − sigmoid(x))
The gradient is only strong near the middle (around 0). For very large positive or negative values, the gradient becomes almost zero. This leads to the Vanishing Gradient Problem, where the network becomes too "flat" to learn anything new—a topic we will dive into in the next article.
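A quick sketch of these three gradient curves in NumPy (written out by hand rather than via autodiff, for clarity):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu_grad(x):
    # 1 where the input is positive, 0 elsewhere.
    return (x > 0).astype(float)

def softplus_grad(x):
    # The derivative of softplus is exactly the sigmoid function.
    return sigmoid(x)

def sigmoid_grad(x):
    # Peaks at 0.25 when x = 0 and vanishes for large |x|.
    s = sigmoid(x)
    return s * (1 - s)

print(relu_grad(np.array([-2.0, 3.0])))  # [0. 1.]
print(sigmoid_grad(0.0))                 # 0.25
print(sigmoid_grad(10.0) < 1e-4)         # True -- a vanishing gradient
```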
The Vanishing Gradient Problem
- In deep neural networks, we often run into a scenario where the "engine" of learning simply stalls. This is known as the Vanishing Gradient Problem.
- If the gradient becomes too small, the network learns extremely slowly or stops learning completely. Because the gradient tells the weights how much to change, a gradient of nearly zero means the weights stay exactly as they are, even if the prediction is wrong.
- Because of this, ReLU is often preferred: its gradient remains 1 for all positive values, keeping the learning process stable.
4. The Math of Prediction: Slope and Regression
Before building deep neural networks, we must understand the fundamental principles of Linear Regression, which is the foundation of predictive modeling. This starts with a simple mathematical concept: the slope.
Understanding Slope
Think of a slope as a measure of responsiveness: it tells us exactly how much Y changes when the value of X changes.
In Machine Learning, the slope represents how strongly an input feature affects the output.
- Positive Slope (+): As X increases, Y increases (e.g., Study hours vs. Exam marks).
- Negative Slope (-): As X increases, Y decreases (e.g., Car age vs. Resale value).
- Zero Slope: Y does not change regardless of X.
- Steepness: A large slope means a small change in X causes a massive jump in Y.
Visualizing Slope
We can visualize this with Python:
import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Step 1: compute averages
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)
# Step 2: compute slope (m)
# We can't use (y2 - y1) / (x2 - x1) here because we have many points,
# so this safely combines their behavior into one best-fit slope
num = sum((x - x_mean) * (y - y_mean))
den = sum((x - x_mean) * (x - x_mean))
m = num / den
# Step 3: compute intercept (b)
b = y_mean - m * x_mean
# Step 4: predicted values
y_pred = m * x + b
# Plot
plt.scatter(x, y)
plt.plot(x, y_pred)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Slope in Machine Learning (no polyfit)")
plt.show()
Linear Regression in Action
Linear regression draws a straight line through data points (like study hours vs. exam marks). By finding the optimal slope (m) and intercept (c), the model can predict outcomes for data it has never seen before.
Plotting the data gives a scatter of points; drawing the regression line over them shows the single best-fit line passing through that cloud.
5. The Chain Rule: Connecting the Links
In machine learning, things are rarely simple. Changing a weight in the first layer of a network doesn't change the final output directly; it causes a "chain reaction" through every subsequent layer.
The Core Concept: Link
Think of the word "chain" as a series of connected events. Let’s use a simple example involving three measurements:
- Weight
- Height
- Shoe Size
In this scenario, we assume the following:
- Weight predicts Height
- Height predicts Shoe Size
If you want to know how much Weight affects Shoe Size, the chain rule says you simply multiply the derivatives (the rate of change) of these links together.
The Formula:
Change in Shoe Size per Weight = (Change in Shoe Size per Height) x (Change in Height per Weight)
Plugging in the values

Let's look at the math to see how this works in practice:
- Link 1: For every 1 unit increase in weight, height increases by 2 units.
- Change in Height / Change in Weight = 2/1 = 2
- Link 2: For every 1 unit increase in height, shoe size increases by 1/4 unit.
- Change in Shoe Size / Change in Height = (1/4)/1 = 1/4
- Using the chain rule, we multiply these links:
- Change in Shoe Size per Weight = (1/4) x 2 = 1/2
For every 1 unit increase in weight, the shoe size increases by 1/2 unit. This simple multiplication allows neural networks to calculate how a tiny change in a deep, early layer affects the final prediction at the end of the chain.
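The whole calculation fits in a few lines; the numbers below are the made-up rates from the example above:

```python
# Link 1: height rises 2 units per unit increase in weight.
d_height_d_weight = 2.0
# Link 2: shoe size rises 1/4 unit per unit increase in height.
d_shoe_d_height = 0.25

# Chain rule: multiply the links to get the end-to-end rate of change.
d_shoe_d_weight = d_shoe_d_height * d_height_d_weight
print(d_shoe_d_weight)  # 0.5
```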
6. Sum of Squared Residuals (SSR): Measuring "Wrongness"
Before a model can improve, it needs a way to quantify its error. When learning backpropagation, this measurement is essential because it defines how we calculate and reduce the model's mistakes. This is where Residuals come in.
What is a Residual?
Suppose you have real data points and a line attempting to fit them. To find out how accurate that line is, we look at:
- Actual Value: What really happened in the data.
- Predicted Value: What your model says should happen.
A residual is simply the difference between the two:
Residual = Actual - Predicted
It tells you exactly how wrong your prediction is for a single data point.
Why Square the Residuals?
We don't just add up the raw residuals. Instead, we square them for two main reasons:
- Ensures Positivity: It prevents positive and negative errors from canceling each other out (which would make the model look more accurate than it actually is).
- Penalizes Outliers: Squaring penalizes large errors more heavily than small ones, forcing the model to pay more attention to significant misses.
The SSR Equation
When you calculate the squared residual for every data point and add them all together, you get the Sum of Squared Residuals (SSR):
SSR = ∑(actual − predicted)²
The "Least Squares Principle" states that the best-fitting model is the one that minimizes this SSR value.
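Computing the SSR takes one line with NumPy; the actual and predicted values here are made up for illustration:

```python
import numpy as np

# Hypothetical actual vs. predicted values.
actual = np.array([1.0, 3.0, 5.0])
predicted = np.array([1.5, 2.0, 5.5])

# Residual = actual - predicted, then square and sum.
residuals = actual - predicted       # [-0.5, 1.0, -0.5]
ssr = np.sum(residuals ** 2)         # 0.25 + 1.0 + 0.25
print(ssr)  # 1.5
```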
Now that we have seen how neural networks measure their "wrongness," we need to see exactly how they perform the optimization to reach that minimum error.
7. Gradient Descent: The Strategy for Optimization
Gradient Descent is the actual process of finding those "best values" for your weights and biases by minimizing the SSR.
Step-by-Step Optimization
Imagine we have a dataset where the true relationship is y = 2x + 1, but we don't know that the intercept is 1. Our goal is to find it using Gradient Descent.
1. Start with a Random Guess
To begin, pick a random value for the parameter. This gives the algorithm a starting point.
- Our Guess: b = 10
- The Result: This makes our predictions way too high (10, 12, 14... instead of 1, 3, 5...).
2. Measure the Error (The Loss Function)
- We quantify how bad our guess was using the Sum of Squared Residuals (SSR).
- If our guess of b = 10 results in a Loss of 405, we know we have a lot of work to do. A large loss means a bad guess; our goal is to drive this number toward zero.
3. Find the Slope (Gradient) of the Loss Function
- To improve the guess, we need to know which direction to move. We take the derivative of the loss function.
- This value (the gradient) tells us which direction to move and how strongly to move. If the gradient is a large positive number, we know we need to decrease our guess significantly.
4. Taking a Step (Learning Rate)
The algorithm moves toward the lowest point by taking "steps." The size of each step is determined by two things:
- The Gradient: A steeper slope means we are far from the goal and should take a bigger step. A flatter slope means we are close and should slow down.
- The Learning Rate: This is a small multiplier (like 0.01) that ensures our steps aren't so large that we accidentally overstep the goal and increase the error.
The update formula is:
b_new = b_old - (Learning Rate x Gradient)
5. Repeat Until Finished
The process repeats: measure the new loss, compute the new gradient, and update the parameter again.
- Step 0: b = 10.00
- Step 1: b = 9.10
- Step 29: b = 1.38
As the steps continue, the intercept b steadily converges toward the true value of 1. The algorithm stops once the steps become infinitesimally small or a set maximum number of steps is reached.
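The whole loop can be sketched in Python. This assumes five data points with x = 0…4 and y = 2x + 1, a slope that is already known to be 2, and a learning rate of 0.01 (the value that reproduces the step from 10.00 down to 9.10 above):

```python
import numpy as np

x = np.arange(5)        # 0, 1, 2, 3, 4
y = 2 * x + 1           # true relationship: y = 2x + 1
slope = 2.0             # assume the slope is already known
b = 10.0                # start from a bad random guess
lr = 0.01               # learning rate

print(np.sum((y - (slope * x + b)) ** 2))  # initial loss: 405.0

for step in range(100):
    pred = slope * x + b
    # d(SSR)/db = sum of -2 * (actual - predicted)
    gradient = np.sum(-2 * (y - pred))
    b -= lr * gradient  # update rule: b_new = b_old - lr * gradient

print(round(b, 2))  # converges toward the true intercept, 1.0
```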
8. Backpropagation: The "Reverse Gear" of Learning
Backpropagation is how we apply the Chain Rule and Gradient Descent together. It starts at the output, calculates how far off the prediction was, and then travels backward through the layers to update the weights and biases.
Setting the Stage
Imagine a simple network with an upper path and a lower path. We have weights and biases for most of the network, but we need to find the optimal value for the final bias, b3.
- Upper Path: Calculates a value based on its specific weights and activations.
- Lower Path: Does the same for the bottom half of the network.
- Final Output: The sum of both paths plus our mystery bias, b3.
Comparing Predicted vs. Observed Data
Initially, we guess b3 = 0. When we plot this against our actual observed data points (at dosages 0.0, 0.5, and 1.0), we see a "green squiggle" that doesn't quite match our targets (0, 1, 0).
To fix this, we calculate the Sum of Squared Residuals (SSR) for different values of b3.
- If b3 = 0, the SSR is high.
- By testing multiple values (0, 1, 2, 4), we notice the error dips and then rises again, forming a U-shaped curve.
Optimizing with the Chain Rule
To find the bottom of that U-shaped error curve without guessing, we use the Chain Rule. We calculate the derivative of the SSR with respect to b3.
The Gradient Descent Step
- Calculate the Slope: At b3 = 0, the math shows a slope of -15.7.
- Determine Step Size: Multiply the slope by a Learning Rate (e.g., 0.1).
- Step Size = -15.7 x 0.1 = -1.57
- Update the Bias:
- New_b3 = 0 - (-1.57) = 1.57
Repeating this process shifts the "green squiggle" upward until the step size becomes nearly zero.
When b3 reaches 2.61, the predicted curve finally aligns with our observed data. The error is minimized, and the model has successfully "learned" the correct bias through backpropagation.
9. Why Standard Neural Networks Fail at Images
Imagine a small 6 × 6 pixel image of the letter O.
- To process this with a standard network, you flatten it into 36 input nodes.
- If the first hidden layer has 100 nodes, you must estimate 3,600 weights.
- For a modern smartphone photo (e.g., 12 megapixels), a standard network would require billions of parameters for just the first layer! This is computationally impossible.
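The parameter counts above are simple multiplications; the 1,000-node hidden layer for the photo is a hypothetical choice, just to show the scale:

```python
# 6x6 image flattened to 36 inputs, feeding 100 hidden nodes:
small = 6 * 6 * 100
print(small)  # 3600 weights

# A 12-megapixel photo (~12 million pixels x 3 color channels),
# feeding a hypothetical 1,000-node first hidden layer:
large = 12_000_000 * 3 * 1_000
print(f"{large:,}")  # tens of billions of weights
```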
CNNs solve this by:
- Reducing inputs: They don't look at every pixel individually at once.
- Handling Shifts: They recognize an "O" even if it moves a few pixels to the left.
- Recognizing Patterns: They focus on correlations, like edges and curves.
Step 1: Convolution (The Filter)
The core of a CNN is the Filter (or Kernel), usually a small 3 × 3 grid.
Creating a Feature Map
To apply the filter, we perform a Dot Product:
- Overlay the filter on a 3x3 section of the image.
- Multiply the overlapping values and add them up.
(0 × 0) + (0 × 0) + (1 × 1) +
(0 × 0) + (1 × 1) + (0 × 0) +
(1 × 1) + (0 × 0) + (0 × 0) = 3
- Add a Bias term.
- Move the filter one pixel to the right and repeat.
The resulting grid is called a Feature Map. To keep the model stable, we pass this map through a ReLU Activation Function, which turns all negative results into 0.
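The dot-product step can be sketched directly in NumPy. The patch and filter below are made-up anti-diagonal grids matching the worked example above:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# A 3x3 image patch and a 3x3 filter (both anti-diagonal here).
patch = np.array([[0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0]])
kernel = np.array([[0, 0, 1],
                   [0, 1, 0],
                   [1, 0, 0]])
bias = 0

# Convolution step: element-wise multiply, sum, add bias, apply ReLU.
value = relu(np.sum(patch * kernel) + bias)
print(value)  # 3
```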
Step 2: Pooling (Reducing Data)
After creating a feature map, we use Pooling to simplify the information and make the model "translation invariant" (meaning it can recognize shapes regardless of their exact position).
- Max Pooling: Look at a small window (e.g., 2x2) and only keep the maximum value. This highlights the most prominent features found by the filter.
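A minimal sketch of 2x2 max pooling over a made-up 4x4 feature map:

```python
import numpy as np

# A hypothetical 4x4 feature map.
feature_map = np.array([[1, 3, 0, 2],
                        [4, 2, 1, 0],
                        [0, 1, 5, 6],
                        [2, 3, 1, 2]])

# Split into 2x2 blocks and keep only the maximum of each block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 2]
#  [3 6]]
```

Each value in the pooled map summarizes a whole 2x2 region, so small shifts in the input barely change the output.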
Step 3: Classification
Once the image has been filtered and pooled, the data is "flattened" into a column of nodes and fed into a Fully Connected Neural Network.
Example: Classifying "O" vs "X"
In a network trained to distinguish these letters:
- The input goes through filters that look for "curves" (O) and "diagonals" (X).
- The Max Pooling ensures that even if the "X" is shifted one pixel to the right, the most important diagonal features are still captured.
- The final output nodes provide a score: 1 for a match, 0 for no match.
The CNN Blueprint
Every Convolutional Neural Network, no matter how complex (like those used in self-driving cars), follows this repeated cycle:
- Convolution: Identify features using filters.
- Activation (ReLU): Remove negative noise.
- Pooling: Shrink the data while keeping important parts.
- Fully Connected Layer: Make the final prediction based on the gathered features.
So these are some of the core concepts to care about when you are learning machine learning. From here, you have a variety of directions to explore, and once you have a solid grasp of these fundamentals, you will be able to tackle and apply advanced use cases.