Compatibility Conditions and Properties
Understanding the compatibility conditions and properties of matrix operations is crucial in machine learning, especially when dealing with neural networks and other complex models.
Compatibility Conditions
Matrix operations have specific requirements for the dimensions of the matrices involved. This is particularly important for matrix multiplication.
Matrix Operations
Matrix operations are fundamental to many machine learning algorithms and techniques. Understanding these operations is crucial for implementing and optimizing ML models efficiently.
Properties of Matrix Operations
Understanding these properties helps in optimizing computations and designing efficient algorithms.
1. Non-commutativity of Matrix Multiplication
Unlike scalar multiplication, matrix multiplication is not commutative. In general, AB ≠ BA.
Example: A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
AB = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix} \neq BA = \begin{bmatrix} 23 & 34 \\ 31 & 46 \end{bmatrix}
ML Application: The order of matrix products matters in neural network computations. For instance, a layer computes WX for weights W and inputs X; reversing the factors to XW generally gives a different result, or is not defined at all.
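This is easy to check numerically. The following is a minimal NumPy sketch (variable names are just illustrative) using the matrices from the example above:

```python
# Minimal sketch: matrix multiplication is not commutative.
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

AB = A @ B  # [[19 22] [43 50]]
BA = B @ A  # [[23 34] [31 46]]

print(np.array_equal(AB, BA))  # False: AB != BA
```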
2. Associativity of Matrix Multiplication
(AB)C = A(BC) for matrices with compatible dimensions.
Example: A (2×2) * (B (2×2) * C (2×1)) = (A (2×2) * B (2×2)) * C (2×1)
ML Application: This property allows for optimizing computations in deep neural networks by grouping operations efficiently.
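As a quick sanity check, the sketch below (assuming NumPy, with shapes chosen to mirror the example) verifies that both groupings give the same product up to floating-point rounding:

```python
# Minimal sketch: (AB)C == A(BC) for compatible shapes.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2))
C = rng.standard_normal((2, 1))

print(np.allclose((A @ B) @ C, A @ (B @ C)))  # True
```

Although the result is the same, the two groupings can differ in cost: computing BC first keeps the intermediate result small (2×1 instead of 2×2), which is one reason frameworks reorder chained products when they can.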
3. Distributivity of Matrix Multiplication over Addition
A(B + C) = AB + AC and (A + B)C = AC + BC for matrices with compatible dimensions.
Example: A * (B + C) = (A * B) + (A * C)
ML Application: This property is useful in backpropagation when computing gradients with respect to multiple parameters.
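A minimal NumPy sketch (shapes are illustrative) confirming both distributive identities:

```python
# Minimal sketch: matrix multiplication distributes over addition.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 2))
C = rng.standard_normal((3, 2))
D = rng.standard_normal((2, 3))

print(np.allclose(A @ (B + C), A @ B + A @ C))  # A(B + C) = AB + AC
print(np.allclose((A + D) @ B, A @ B + D @ B))  # (A + D)B = AB + DB
```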
Addition and Subtraction
Matrix addition and subtraction are performed element-wise between matrices of the same dimensions.
Examples (2 x 2 matrices):
Given: A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}, then A + B = \begin{bmatrix} 1 + 5 & 2 + 6 \\ 3 + 7 & 4 + 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}
Step-by-step explanation. Step 1: Add corresponding elements
(1,1): 1 + 5 = 6
(1,2): 2 + 6 = 8
(2,1): 3 + 7 = 10
(2,2): 4 + 8 = 12
Step 2: Write the result: A + B = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix}
Given: A = \begin{bmatrix} 1 & 2 \\ 7 & 8 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 3 & 4 \end{bmatrix}, then A - B = \begin{bmatrix} 1 - 5 & 2 - 6 \\ 7 - 3 & 8 - 4 \end{bmatrix} = \begin{bmatrix} -4 & -2 \\ 4 & 4 \end{bmatrix}
Step-by-step explanation. Step 1: Subtract corresponding elements
(1,1): 1 - 5 = -4
(1,2): 2 - 6 = -2
(2,1): 7 - 3 = 4
(2,2): 8 - 4 = 4
Step 2: Write the result: A - B = \begin{bmatrix} -4 & -2 \\ 4 & 4 \end{bmatrix}
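Both worked examples can be reproduced in one line each with NumPy; this is a minimal sketch, with A2 and B2 as illustrative names for the second pair of matrices:

```python
# Minimal sketch: element-wise addition and subtraction.
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A + B)    # [[ 6  8] [10 12]]

A2 = np.array([[1, 2], [7, 8]])
B2 = np.array([[5, 6], [3, 4]])
print(A2 - B2)  # [[-4 -2] [ 4  4]]
```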
Example in ML: Updating weights in neural networks. In gradient descent, we update parameters by subtracting the gradient multiplied by the learning rate:
W = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad \text{gradient} = \begin{bmatrix} 0.01 & 0.02 \\ 0.03 & 0.04 \end{bmatrix}, \quad \text{learning rate} = 0.1
W_{\text{new}} = W - (\text{learning rate} \times \text{gradient})
W_{\text{new}} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix} - 0.1 \cdot \begin{bmatrix} 0.01 & 0.02 \\ 0.03 & 0.04 \end{bmatrix} = \begin{bmatrix} 0.099 & 0.198 \\ 0.297 & 0.396 \end{bmatrix}
Step-by-step explanation. Step 1: Multiply the gradient by the learning rate
0.1 \cdot \begin{bmatrix} 0.01 & 0.02 \\ 0.03 & 0.04 \end{bmatrix} = \begin{bmatrix} 0.001 & 0.002 \\ 0.003 & 0.004 \end{bmatrix}
Step 2: Subtract the result from W
\begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix} - \begin{bmatrix} 0.001 & 0.002 \\ 0.003 & 0.004 \end{bmatrix} = \begin{bmatrix} 0.099 & 0.198 \\ 0.297 & 0.396 \end{bmatrix}
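The same update written as a minimal NumPy sketch:

```python
# Minimal sketch: gradient-descent weight update W_new = W - lr * gradient.
import numpy as np

W = np.array([[0.1, 0.2], [0.3, 0.4]])
gradient = np.array([[0.01, 0.02], [0.03, 0.04]])
learning_rate = 0.1

W_new = W - learning_rate * gradient
print(W_new)  # [[0.099 0.198] [0.297 0.396]]
```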
Scalar Multiplication
Scalar multiplication involves multiplying each element of a matrix by a scalar value.
Example (3x3 matrix):
Let's multiply a 3x3 matrix by a scalar:
Given: k = 2, \quad A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}
Then, k \cdot A = 2 \cdot \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} = \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \\ 14 & 16 & 18 \end{bmatrix}
Step-by-step explanation. Step 1: Multiply each element by k
(1,1): 2 * 1 = 2
(1,2): 2 * 2 = 4
(1,3): 2 * 3 = 6
(2,1): 2 * 4 = 8
(2,2): 2 * 5 = 10
(2,3): 2 * 6 = 12
(3,1): 2 * 7 = 14
(3,2): 2 * 8 = 16
(3,3): 2 * 9 = 18
Step 2: Write the result: k \cdot A = \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \\ 14 & 16 & 18 \end{bmatrix}
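In NumPy, scalar multiplication is simply k * A; a minimal sketch of the example above:

```python
# Minimal sketch: scalar multiplication scales every element of the matrix.
import numpy as np

k = 2
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(k * A)  # [[ 2  4  6] [ 8 10 12] [14 16 18]]
```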
Matrix Multiplication
Matrix multiplication is a crucial operation in many ML computations, including neural network layers and linear transformations.
Matrix Multiplication Compatibility
For two matrices A and B to be multiplied:
The number of columns in matrix A must equal the number of rows in matrix B.
If A is an m × n matrix and B is a p × q matrix, then n must equal p.
The resulting matrix will have dimensions m × q (see the shape check sketched below).
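A minimal NumPy sketch of this shape rule (the 3×1 vector x is an illustrative, deliberately incompatible operand):

```python
# Minimal sketch: checking matrix multiplication compatibility by shape.
import numpy as np

A = np.array([[1, 2], [3, 4]])       # 2 x 2
B = np.array([[5, 6], [7, 8]])       # 2 x 2
x = np.array([[2.0], [3.0], [4.0]])  # 3 x 1

print((A @ B).shape)  # (2, 2): columns of A match rows of B

try:
    A @ x  # inner dimensions 2 and 3 do not match
except ValueError as err:
    print("Incompatible shapes:", err)
```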
Example (2x2 matrices):
Given: A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
Then, A \cdot B = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \cdot \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
Step-by-step explanation. Step 1: Multiply row 1 of A with the columns of B
(1,1): (1*5) + (2*7) = 5 + 14 = 19
(1,2): (1*6) + (2*8) = 6 + 16 = 22
Step 2: Multiply row 2 of A with columns of B
(2,1): (3*5) + (4*7) = 15 + 28 = 43
(2,2): (3*6) + (4*8) = 18 + 32 = 50
Step 3: Write the result: AB = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
Example in ML. Forward pass in a neural network layer:
Given:
W = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}, \quad X = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \quad b = \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix}
Then,
Z = W \cdot X + b = \begin{bmatrix} 0.1 \cdot 2 + 0.2 \cdot 3 \\ 0.3 \cdot 2 + 0.4 \cdot 3 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.8 \\ 1.8 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.9 \\ 2.0 \end{bmatrix}
Step-by-step explanation. Step 1: Multiply W and X
Z = W \cdot X = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 0.1 \cdot 2 + 0.2 \cdot 3 \\ 0.3 \cdot 2 + 0.4 \cdot 3 \end{bmatrix} = \begin{bmatrix} 0.8 \\ 1.8 \end{bmatrix}
Step 2: Add bias b
Z = \begin{bmatrix} 0.8 \\ 1.8 \end{bmatrix} + \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.8 + 0.1 \\ 1.8 + 0.2 \end{bmatrix} = \begin{bmatrix} 0.9 \\ 2.0 \end{bmatrix}
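A minimal NumPy sketch of this forward pass (no activation function applied):

```python
# Minimal sketch: forward pass Z = W X + b for a single layer.
import numpy as np

W = np.array([[0.1, 0.2],
              [0.3, 0.4]])
X = np.array([[2.0],
              [3.0]])
b = np.array([[0.1],
              [0.2]])

Z = W @ X + b
print(Z)  # [[0.9] [2.0]]
```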
Transposition
Transposition is the operation of flipping a matrix over its diagonal, switching its rows with its columns.
Transpose Properties
(A^T)^T = A
(AB)^T = B^T A^T
(A + B)^T = A^T + B^T
Example:
Given: A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
Taking the transpose of A: A^T = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}
Then, taking the transpose again: (A^T)^T = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = A
ML Application: These properties are often used in deriving gradient descent algorithms and in simplifying complex matrix expressions in various ML models.
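All three properties are easy to verify numerically; a minimal NumPy sketch:

```python
# Minimal sketch: verifying the transpose properties on small matrices.
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(np.array_equal(A.T.T, A))              # (A^T)^T = A
print(np.array_equal((A @ B).T, B.T @ A.T))  # (AB)^T = B^T A^T
print(np.array_equal((A + B).T, A.T + B.T))  # (A + B)^T = A^T + B^T
```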
Example (3x3 matrix):
Given: A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}
Then, the transpose of A, denoted A^T, is: A^T = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix}
Step-by-step explanation. Step 1: Swap rows and columns
New (1,1) = Old (1,1): 1
New (1,2) = Old (2,1): 4
New (1,3) = Old (3,1): 7
New (2,1) = Old (1,2): 2
New (2,2) = Old (2,2): 5
New (2,3) = Old (3,2): 8
New (3,1) = Old (1,3): 3
New (3,2) = Old (2,3): 6
New (3,3) = Old (3,3): 9
Step 2: Write the result
A^T = \begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix}
Example in ML. Computing the gradient in linear regression:
\text{gradient} = X^T \cdot (y_{\text{pred}} - y) / n_{\text{samples}}
Given: X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}, \quad y = \begin{bmatrix} 5 \\ 11 \\ 17 \end{bmatrix}, \quad w = \begin{bmatrix} 0.5 \\ 1.5 \end{bmatrix}
Transpose of X: X^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}
Predicted values y_{\text{pred}}:
y_{\text{pred}} = X \cdot w = \begin{bmatrix} 1 \cdot 0.5 + 2 \cdot 1.5 \\ 3 \cdot 0.5 + 4 \cdot 1.5 \\ 5 \cdot 0.5 + 6 \cdot 1.5 \end{bmatrix} = \begin{bmatrix} 3.5 \\ 7.5 \\ 11.5 \end{bmatrix}
Error calculation:
\text{error} = y_{\text{pred}} - y = \begin{bmatrix} 3.5 - 5 \\ 7.5 - 11 \\ 11.5 - 17 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -3.5 \\ -5.5 \end{bmatrix}
Gradient calculation:
\text{gradient} = X^T \cdot \text{error} = \begin{bmatrix} 1 \cdot (-1.5) + 3 \cdot (-3.5) + 5 \cdot (-5.5) \\ 2 \cdot (-1.5) + 4 \cdot (-3.5) + 6 \cdot (-5.5) \end{bmatrix} = \begin{bmatrix} -39.5 \\ -50 \end{bmatrix}
Step-by-step explanation. Step 1: Calculate the error:
\text{error} = y_{\text{pred}} - y = \begin{bmatrix} 3.5 - 5 \\ 7.5 - 11 \\ 11.5 - 17 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -3.5 \\ -5.5 \end{bmatrix}
Step 2: Transpose X:
X^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}
Step 3: Multiply X^T by the error:
\text{gradient} = X^T \cdot \text{error} = \begin{bmatrix} 1 \cdot (-1.5) + 3 \cdot (-3.5) + 5 \cdot (-5.5) \\ 2 \cdot (-1.5) + 4 \cdot (-3.5) + 6 \cdot (-5.5) \end{bmatrix} = \begin{bmatrix} -39.5 \\ -50 \end{bmatrix}
Step 4: Divide by n_{\text{samples}} (3 in this case):
\frac{\text{gradient}}{3} = \begin{bmatrix} \frac{-39.5}{3} \\ \frac{-50}{3} \end{bmatrix} \approx \begin{bmatrix} -13.17 \\ -16.67 \end{bmatrix}
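The whole computation condenses to a few lines of NumPy; a minimal sketch using the data above:

```python
# Minimal sketch: gradient = X^T (y_pred - y) / n_samples for linear regression.
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([5.0, 11.0, 17.0])
w = np.array([0.5, 1.5])

y_pred = X @ w                       # [ 3.5  7.5 11.5]
error = y_pred - y                   # [-1.5 -3.5 -5.5]
gradient = X.T @ error / X.shape[0]  # approx [-13.17 -16.67]
print(gradient)
```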
This step-by-step breakdown illustrates how each matrix operation is performed and how it applies in machine learning contexts.
Understanding these compatibility conditions and properties is essential for:
Correctly implementing machine learning algorithms
Optimizing computations for better performance
Debugging issues related to matrix dimensions in neural networks
Deriving new algorithms or simplifying existing ones
In practice, many machine learning libraries handle these compatibility checks automatically, but understanding the underlying principles helps in designing and troubleshooting models effectively.