Artificial Intelligence with PHP
  • Getting Started
    • Introduction
    • Audience
    • How to Read This Book
    • Glossary
    • Contributors
    • Resources
    • Changelog
  • Artificial Intelligence
    • Introduction
    • Overview of AI
      • History of AI
      • How Does AI Work?
      • Structure of AI
      • Will AI Take Over the World?
      • Types of AI
        • Limited Memory AI
        • Reactive AI
        • Theory of Mind AI
        • Self-Aware AI
    • AI Capabilities in PHP
      • Introduction to LLM Agents PHP SDK
      • Overview of AI Libraries in PHP
    • AI Agents
      • Introduction to AI Agents
      • Structure of AI Agent
      • Components of AI Agents
      • Types of AI Agents
      • AI Agent Architecture
      • AI Agent Environment
      • Application of Agents in AI
      • Challenges in AI Agent Development
      • Future of AI Agents
      • Turing Test in AI
      • LLM AI Agents
        • Introduction to LLM AI Agents
        • Implementation in PHP
          • Sales Analyst Agent
          • Site Status Checker Agent
    • Theoretical Foundations of AI
      • Introduction to Theoretical Foundations of AI
      • Problem Solving in AI
        • Introduction
        • Types of Search Algorithms
          • Comparison of Search Algorithms
          • Informed (Heuristic) Search
            • Global Search
              • Beam Search
              • Greedy Search
              • Iterative Deepening A* Search
              • A* Search
                • A* Graph Search
                • A* Graph vs A* Tree Search
                • A* Tree Search
            • Local Search
              • Hill Climbing Algorithm
                • Introduction
                • Best Practices and Optimization
                • Practical Applications
                • Implementation in PHP
              • Simulated Annealing Search
              • Local Beam Search
              • Genetic Algorithms
              • Tabu Search
          • Uninformed (Blind) Search
            • Global Search
              • Bidirectional Search (BDS)
              • Breadth-First Search (BFS)
              • Depth-First Search (DFS)
              • Iterative Deepening Depth-First Search (IDDFS)
              • Uniform Cost Search (UCS)
            • Local Search
              • Depth-Limited Search (DLS)
              • Random Walk Search (RWS)
          • Adversarial Search
          • Means-Ends Analysis
      • Knowledge & Uncertainty in AI
        • Knowledge-Based Agents
        • Knowledge Representation
          • Introduction
          • Approaches to KR in AI
          • The KR Cycle in AI
          • Types of Knowledge in AI
          • KR Techniques
            • Logical Representation
            • Semantic Network Representation
            • Frame Representation
            • Production Rules
        • Reasoning in AI
        • Uncertain Knowledge Representation
        • The Wumpus World
        • Applications and Challenges
      • Cybernetics and AI
      • Philosophical and Ethical Foundations of AI
    • Mathematics for AI
      • Computational Theory in AI
      • Logic and Reasoning
        • Classification of Logics
        • Formal Logic
          • Propositional Logic
            • Basics of Propositional Logic
            • Implementation in PHP
          • Predicate Logic
            • Basics of Predicate Logic
            • Implementation in PHP
          • Second-order and Higher-order Logic
          • Modal Logic
          • Temporal Logic
        • Informal Logic
        • Semi-formal Logic
      • Set Theory and Discrete Mathematics
      • Decision Making in AI
    • Key Application of AI
      • AI in Astronomy
      • AI in Agriculture
      • AI in Automotive Industry
      • AI in Data Security
      • AI in Dating
      • AI in E-commerce
      • AI in Education
      • AI in Entertainment
      • AI in Finance
      • AI in Gaming
      • AI in Healthcare
      • AI in Robotics
      • AI in Social Media
      • AI in Software Development
      • AI in Adult Entertainment
      • AI in Criminal Justice
      • AI in Criminal World
      • AI in Military Domain
      • AI in Terrorist Activities
      • AI in Transforming Our World
      • AI in Travel and Transport
    • Practice
  • Machine Learning
    • Introduction
    • Overview of ML
      • History of ML
        • Origins and Early Concepts
        • 19th Century
        • 20th Century
        • 21st Century
        • Coming Years
      • Key Terms and Principles
      • Machine Learning Life Cycle
      • Problems and Challenges
    • ML Capabilities in PHP
      • Overview of ML Libraries in PHP
      • Configuring an Environment for PHP
        • Direct Installation
        • Using Docker
        • Additional Notes
      • Introduction to PHP-ML
      • Introduction to Rubix ML
    • Mathematics for ML
      • Linear Algebra
        • Scalars
          • Definition and Operations
          • Scalars with PHP
        • Vectors
          • Definition and Operations
          • Vectors in Machine Learning
          • Vectors with PHP
        • Matrices
          • Definition and Types
          • Matrix Operations
          • Determinant of a Matrix
          • Inverse Matrices
          • Cofactor Matrices
          • Adjugate Matrices
          • Matrices in Machine Learning
          • Matrices with PHP
        • Tensors
          • Definition of Tensors
          • Tensor Properties
            • Tensor Types
            • Tensor Dimension
            • Tensor Rank
            • Tensor Shape
          • Tensor Operations
          • Practical Applications
          • Tensors in Machine Learning
          • Tensors with PHP
        • Linear Transformations
          • Introduction
          • LT with PHP
          • LT Role in Neural Networks
        • Eigenvalues and Eigenvectors
        • Norms and Distances
        • Linear Algebra in Optimization
      • Calculus
      • Probability and Statistics
      • Information Theory
      • Optimization Techniques
      • Graph Theory and Networks
      • Discrete Mathematics and Combinatorics
      • Advanced Topics
    • Data Fundamentals
      • Data Types and Formats
        • Data Types
        • Structured Data Formats
        • Unstructured Data Formats
        • Implementation with PHP
      • General Data Processing
        • Introduction
        • Storage and Management
          • Data Security and Privacy
          • Data Serialization and Deserialization in PHP
          • Data Versioning and Management
          • Database Systems for AI
          • Efficient Data Storage Techniques
          • Optimizing Data Retrieval for AI Algorithms
          • Big Data Considerations
            • Introduction
            • Big Data Techniques in PHP
      • ML Data Processing
        • Introduction
        • Types of Data in ML
        • Stages of Data Processing
          • Data Acquisition
            • Data Collection
            • Ethical Considerations in Data Preparation
          • Data Cleaning
            • Data Cleaning Examples
            • Data Cleaning Types
            • Implementation with PHP
          • Data Transformation
            • Data Transformation Examples
            • Data Transformation Types
            • Implementation with PHP ?..
          • Data Integration
          • Data Reduction
          • Data Validation and Testing
            • Data Splitting and Sampling
          • Data Representation
            • Data Structures in PHP
            • Data Visualization Techniques
          • Typical Problems with Data
    • ML Algorithms
      • Classification of ML Algorithms
        • By Methods Used
        • By Learning Types
        • By Tasks Resolved
        • By Feature Types
        • By Model Depth
      • Supervised Learning
        • Regression
          • Linear Regression
            • Types of Linear Regression
            • Finding Best Fit Line
            • Gradient Descent
            • Assumptions of Linear Regression
            • Evaluation Metrics for Linear Regression
            • How It Works by Math
            • Implementation in PHP
              • Multiple Linear Regression
              • Simple Linear Regression
          • Polynomial Regression
            • Introduction
            • Implementation in PHP
          • Support Vector Regression
        • Classification
        • Recommendation Systems
          • Matrix Factorization
          • User-Based Collaborative Filtering
      • Unsupervised Learning
        • Clustering
        • Dimension Reduction
        • Search and Optimization
        • Recommendation Systems
          • Item-Based Collaborative Filtering
          • Popularity-Based Recommendations
      • Semi-Supervised Learning
        • Regression
        • Classification
        • Clustering
      • Reinforcement Learning
      • Distributed Learning
    • Integrating ML into Web
      • Open-Source Projects
      • Introduction to EasyAI-PHP
    • Key Applications of ML
    • Practice
  • Neural Networks
    • Introduction
    • Overview of NN
      • History of NN
      • Basic Components of NN
        • Activation Functions
        • Connections and Weights
        • Inputs
        • Layers
        • Neurons
      • Problems and Challenges
      • How NN Works
    • NN Capabilities in PHP
    • Mathematics for NN
    • Types of NN
      • Classification of NN Types
      • Linear vs Non-Linear Problems in NN
      • Basic NN
        • Simple Perceptron
        • Implementation in PHP
          • Simple Perceptron with Libraries
          • Simple Perceptron with Pure PHP
      • NN with Hidden Layers
      • Deep Learning
      • Bayesian Neural Networks
      • Convolutional Neural Networks (CNN)
      • Recurrent Neural Networks (RNN)
    • Integrating NN into Web
    • Key Applications of NN
    • Practice
  • Natural Language Processing
    • Introduction
    • Overview of NLP
      • History of NLP
        • Ancient Times
        • Medieval Period
        • 15th-16th Century
        • 17th-18th Century
        • 19th Century
        • 20th Century
        • 21st Century
        • Coming Years
      • NLP and Text
      • Key Concepts in NLP
      • Common Challenges in NLP
      • Machine Learning Role in NLP
    • NLP Capabilities in PHP
      • Overview of NLP Libraries in PHP
      • Challenges in NLP with PHP
    • Mathematics for NLP
    • NLP Techniques
      • Basic Text Processing with PHP
      • NLP Workflow
      • Popular Tools and Frameworks for NLP
      • Techniques and Algorithms in NLP
        • Basic NLP Techniques
        • Advanced NLP Techniques
      • Advanced NLP Topics
    • Integrating NLP into Web
    • Key Applications of NLP
    • Practice
  • Computer Vision
    • Introduction
  • Overview of CV
    • History of CV
    • Common Use Cases
  • CV Capabilities in PHP
  • Mathematics for CV
  • CV Techniques
  • Integrating CV into Web
  • Key Applications of CV
  • Practice
  • Robotics
    • Introduction
  • Overview of Robotics
    • History and Evolution of Robotics
    • Core Components
      • Sensors (Perception)
      • Actuators (Action)
      • Controllers (Processing and Logic)
    • The Role of AI in Robotics
      • Object Detection and Recognition
      • Path Planning and Navigation
      • Decision Making and Learning
  • Robotics Capabilities in PHP
  • Mathematics for Robotics
  • Building Robotics
  • Integration Robotics into Web
  • Key Applications of Robotics
  • Practice
  • Expert Systems
    • Introduction
    • Overview of ES
      • History of ES
        • Origins and Early ES
        • Milestones in the Evolution of ES
        • Expert Systems in Modern AI
      • Core Components and Architecture
      • Challenges and Limitations
      • Future Trends
    • ES Capabilities in PHP
    • Mathematics for ES
    • Building ES
      • Knowledge Representation Approaches
      • Inference Mechanisms
      • Best Practices for Knowledge Base Design and Inference
    • Integration ES into Web
    • Key Applications of ES
    • Practice
  • Cognitive Computing
    • Introduction
    • Overview of CC
      • History of CC
      • Differences Between CC and AI
    • CC Compatibilities in PHP
    • Mathematics for CC
    • Building CC
      • Practical Implementation
    • Integration CC into Web
    • Key Applications of CC
    • Practice
  • AI Ethics and Safety
    • Introduction
    • Overview of AI Ethics
      • Core Principles of AI Ethics
      • Responsible AI Development
      • Looking Ahead: Ethical AI Governance
    • Building Ethics & Safety AI
      • Fairness, Bias, and Transparency
        • Bias in AI Models
        • Model Transparency and Explainability
        • Auditing, Testing, and Continuous Monitoring
      • Privacy and Security in AI
        • Data Privacy and Consent
        • Safety Mechanisms in AI Integration
        • Preventing and Handling AI Misuse
      • Ensuring AI Accountability
        • Ethical AI in Decision Making
        • Regulations & Compliance
        • AI Risk Assessment
    • Key Applications of AI Ethics
    • Practice
  • Epilog
    • Summing-up
Powered by GitBook
On this page
  • RubixML Examples
  • PHP-ML Examples
  • Summary
  1. Machine Learning
  2. Data Fundamentals
  3. ML Data Processing
  4. Stages of Data Processing
  5. Data Cleaning

Implementation with PHP

Sure! Here’s a detailed example for data cleaning using RubixML and PHP-ML, two machine learning libraries for PHP. We'll look at how to handle missing values, normalization, and standardization.

For this example, let’s assume you’re working with a dataset containing information about customer transactions, with fields like age, income, spending_score, and some missing values. We'll demonstrate each data cleaning step using the RubixML library, followed by some alternative approaches using PHP-ML.


RubixML Examples

1. Setting Up the Dataset

Let's assume our dataset is a CSV file named customers.csv with the following fields:

age,income,spending_score,tag
25,55000,45,yes
32,?,75,yes
40,72000,?,yes
?,82000,60,yes
28,63000,30,yes

RubixML Example

RubixML provides dedicated transformers for handling missing values, normalization, and standardization. Let’s break down each step.

Step 1: Handling Missing Values

RubixML provides the MissingDataImputer for handling missing values. This imputer allows you to fill in missing values using strategies like Mean, Prior, Percentile or Constant.

require 'vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Strategies\Percentile;
use Rubix\ML\Transformers\MissingDataImputer;
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Strategies\Prior;

// Load the dataset using CSV instead of CsvIterator
$dataset = Labeled::fromIterator(new CSV(dirname(__FILE__) . '/customers.csv', true));

// Create imputer with percentile strategy for numeric values and
// Prior (most frequent value) strategy for categorical values
$imputer = new MissingDataImputer(new Percentile(0.55), new Prior());

$dataset->apply($imputer);

echo "\nAfter Imputation:\n";
foreach ($dataset->samples() as $i => $sample) {
    echo implode(',', $sample) . "\n";
}

Here, MissingDataImputer will replace missing values with the mean of the respective column.

After Imputation:
25,55000,45
32,72000,75
40,72000,30
25,82000,60
28,63000,30

Step 2: Normalization

RubixML has a MinMaxNormalizer that scales values to a range (usually between 0 and 1). This is especially useful for features like income and spending_score that vary widely.

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\MinMaxNormalizer;

// Create a sample dataset with some numerical features
$samples = [
    [100, 500, 25],
    [150, 300, 15],
    [200, 400, 20],
    [50, 200, 10]
];

$labels = ['A', 'B', 'C', 'D'];

// Create a labeled dataset
$dataset = new Labeled($samples, $labels);

// Create a MinMaxNormalizer to scale values between 0 and 1
$normalizer = new MinMaxNormalizer(0, 1);

// Apply normalization to the dataset
$dataset->apply($normalizer);

// Print the normalized values
echo "Normalized Dataset:\n";
print_r($dataset->samples());

The MinMaxNormalizer will now adjust each feature to the 0–1 range, ensuring uniformity across features. The formula for calculating the normalized value of a feature 𝑥𝑥x is: x′=x−min⁡(x)max⁡(x)−min⁡(x)x{\prime} = \frac{x - \min(x)}{\max(x) - \min(x)}x′=max(x)−min(x)x−min(x)​

Normalized Dataset:
Array
(
    [0] => Array
        (
            [0] => 0.33333333333333 // (100-50)/(200-50) = 0.33
            [1] => 1                // (500-200)/(500-200) = 1
            [2] => 1                // (25-10)/(25-10) = 1
        )
    [1] => Array
        (
            [0] => 0.66666666666667 // (150-50)/(200-50) = 0.67
            [1] => 0.33333333333333 // (300-200)/(500-200) = 0.33
            [2] => 0.33333333333333 // (15-10)/(25-10) = 0.33
        )
    // ... and so on
)

Step 3: Standardization

If standardization is more appropriate (for instance, if we’re using algorithms like SVMs that are sensitive to variance), we can apply the ZScaleStandardizer.

require APP_PATH . 'vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\MinMaxNormalizer;
use Rubix\ML\Transformers\ZScaleStandardizer;

// Create a sample dataset with some numerical features
$samples = [
    [100, 500, 25],
    [150, 300, 15],
    [200, 400, 20],
    [50, 200, 10]
];

$labels = ['A', 'B', 'C', 'D'];

// Create a labeled dataset
$dataset = new Labeled($samples, $labels);

// Apply standardization
$standardizer = new ZScaleStandardizer();
$dataset->apply($standardizer);

echo "After Standardization: \n";
print_r($dataset->samples());

The ZScaleStandardizer adjusts the features to have a mean of 0 and a standard deviation of 1, which is ideal for models like Support Vector Machines (SVM) and Principal Component Analysis (PCA).

Standardized Dataset:
Array
(
    [0] => Array
        (
            [0] => -0.44721359549996
            [1] => 1.3416407864999
            [2] => 1.3416407864999
        )

    [1] => Array
        (
            [0] => 0.44721359549996
            [1] => -0.44721359549996
            [2] => -0.44721359549996
        )
    // ... and so on
)

PHP-ML Examples

PHP-ML offers similar functionality, although it is less feature-rich than RubixML. Here’s how to handle some of these tasks with PHP-ML.

Step 1: Handling Missing Values

PHP-ML doesn’t have a built-in MissingDataImputer, but we can write custom code to handle missing values.

use Phpml\Dataset\CsvDataset;

// Load the dataset
$dataset = new CsvDataset('customers.csv', 3);

// Custom function to replace missing values with the mean of the column
function imputeMissingValues($dataset) {
    $samples = $dataset->getSamples();
    $colMeans = [];

    // Calculate the mean for each column
    foreach (range(0, 2) as $colIndex) {
        $colValues = array_column($samples, $colIndex);
        $filteredValues = array_filter($colValues, fn($val) => $val !== null && $val !== '' ? (int)$val : false );
        $colMeans[$colIndex] = $filteredValues ? array_sum($filteredValues) / count($filteredValues) : 0;
    }

    // Replace missing values with the column mean
    foreach ($samples as &$sample) {
        foreach ($sample as $i => &$value) {
            if ($value === null || $value === '' || $value === '?') {
                $value = $colMeans[$i];
            }
        }
    }

    return $samples;
}

// Apply missing value imputation
$samples = imputeMissingValues($dataset);

echo "\nAfter Imputation:\n";
foreach ($samples as $i => $sample) {
    echo implode(',', $sample) . "\n";
}

This function calculates the mean for each column and replaces missing values with the respective column mean.

After Imputation:
25,55000,45
32,68000,75
40,72000,52.5
31.25,82000,60
28,63000,30

Step 2: Normalization and Standardization

Normalization in PHP-ML can be done manually or by looping through each feature. However, PHP-ML also includes some transformers, though they are more limited. Here’s an example of manual Min-Max normalization.

// Create a sample dataset with some numerical features
$samples = [
    [100, 500, 25],
    [150, 300, 15],
    [200, 400, 20],
    [50, 200, 10]
];

function normalize($samples) {
    $minMax = [];

    // Find min and max for each column
    foreach (range(0, count($samples[0]) - 1) as $colIndex) {
        $colValues = array_column($samples, $colIndex);
        $minMax[$colIndex] = [min($colValues), max($colValues)];
    }

    // Normalize each value
    foreach ($samples as &$sample) {
        foreach ($sample as $i => &$value) {
            $min = $minMax[$i][0];
            $max = $minMax[$i][1];
            $value = ($value - $min) / ($max - $min);
        }
    }

    return $samples;
}

$samples = normalize($samples);

// Print the normalized values
echo "Normalized Dataset:\n";
print_r($samples);

This code adjusts the features to have a mean of 0 and a standard deviation of 1.

Normalized Dataset:
Array
(
    [0] => Array
        (
            [0] => 0.33333333333333
            [1] => 1
            [2] => 1
        )
    [1] => Array
        (
            [0] => 0.66666666666667
            [1] => 0.33333333333333
            [2] => 0.33333333333333
        )
    // ... and so on
)

Summary

Using RubixML, we streamlined data cleaning with MissingDataImputer, MinMaxNormalizer, and ZScaleStandardizer. With PHP-ML, custom functions were needed to perform imputation, normalization, and standardization. RubixML is more convenient and feature-rich for these data preprocessing tasks, making it a good choice for machine learning in PHP.

PreviousData Cleaning TypesNextData Transformation

Last updated 1 month ago

To try this code yourself, install the example files from the official GitHub repository:

https://github.com/apphp/ai-with-php-examples