# Implementation with PHP

Sure! Here’s a detailed example for data cleaning using **RubixML** and **PHP-ML**, two machine learning libraries for PHP. We'll look at how to handle missing values, normalization, and standardization.

For this example, let’s assume you’re working with a dataset containing information about customer transactions, with fields like `age`, `income`, `spending_score`, and some missing values. We'll demonstrate each data cleaning step using the RubixML library, followed by some alternative approaches using PHP-ML.

***

### **RubixML** Examples

#### 1. Setting Up the Dataset

Let's assume our dataset is a CSV file named `customers.csv` with the following fields:

```
age,income,spending_score,tag
25,55000,45,yes
32,?,75,yes
40,72000,?,yes
?,82000,60,yes
28,63000,30,yes
```

#### RubixML Example

RubixML provides dedicated transformers for handling missing values, normalization, and standardization. Let’s break down each step.

#### **Step 1: Handling Missing Values**

RubixML provides the `MissingDataImputer` for handling missing values. This imputer allows you to fill in missing values using strategies like `Mean`, `Prior`, `Percentile` or `Constant`.

```php
require 'vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Strategies\Percentile;
use Rubix\ML\Transformers\MissingDataImputer;
use Rubix\ML\Extractors\CSV;
use Rubix\ML\Strategies\Prior;

// Load the dataset using CSV instead of CsvIterator
$dataset = Labeled::fromIterator(new CSV(dirname(__FILE__) . '/customers.csv', true));

// Create imputer with percentile strategy for numeric values and
// Prior (most frequent value) strategy for categorical values
$imputer = new MissingDataImputer(new Percentile(0.55), new Prior());

$dataset->apply($imputer);

echo "\nAfter Imputation:\n";
foreach ($dataset->samples() as $i => $sample) {
    echo implode(',', $sample) . "\n";
}
```

Here, `MissingDataImputer` will replace missing values with the mean of the respective column.

```
After Imputation:
25,55000,45
32,72000,75
40,72000,30
25,82000,60
28,63000,30
```

#### **Step 2: Normalization**

RubixML has a `MinMaxNormalizer` that scales values to a range (usually between 0 and 1). This is especially useful for features like `income` and `spending_score` that vary widely.

```php
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\MinMaxNormalizer;

// Create a sample dataset with some numerical features
$samples = [
    [100, 500, 25],
    [150, 300, 15],
    [200, 400, 20],
    [50, 200, 10]
];

$labels = ['A', 'B', 'C', 'D'];

// Create a labeled dataset
$dataset = new Labeled($samples, $labels);

// Create a MinMaxNormalizer to scale values between 0 and 1
$normalizer = new MinMaxNormalizer(0, 1);

// Apply normalization to the dataset
$dataset->apply($normalizer);

// Print the normalized values
echo "Normalized Dataset:\n";
print_r($dataset->samples());
```

The `MinMaxNormalizer` will now adjust each feature to the 0–1 range, ensuring uniformity across features. The formula for calculating the normalized value of a feature $$𝑥$$ is: $$x{\prime} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

```
Normalized Dataset:
Array
(
    [0] => Array
        (
            [0] => 0.33333333333333 // (100-50)/(200-50) = 0.33
            [1] => 1                // (500-200)/(500-200) = 1
            [2] => 1                // (25-10)/(25-10) = 1
        )
    [1] => Array
        (
            [0] => 0.66666666666667 // (150-50)/(200-50) = 0.67
            [1] => 0.33333333333333 // (300-200)/(500-200) = 0.33
            [2] => 0.33333333333333 // (15-10)/(25-10) = 0.33
        )
    // ... and so on
)
```

#### **Step 3: Standardization**

If standardization is more appropriate (for instance, if we’re using algorithms like SVMs that are sensitive to variance), we can apply the `ZScaleStandardizer`.

```php
require APP_PATH . 'vendor/autoload.php';

use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Transformers\MinMaxNormalizer;
use Rubix\ML\Transformers\ZScaleStandardizer;

// Create a sample dataset with some numerical features
$samples = [
    [100, 500, 25],
    [150, 300, 15],
    [200, 400, 20],
    [50, 200, 10]
];

$labels = ['A', 'B', 'C', 'D'];

// Create a labeled dataset
$dataset = new Labeled($samples, $labels);

// Apply standardization
$standardizer = new ZScaleStandardizer();
$dataset->apply($standardizer);

echo "After Standardization: \n";
print_r($dataset->samples());
```

The `ZScaleStandardizer` adjusts the features to have a mean of 0 and a standard deviation of 1, which is ideal for models like Support Vector Machines (SVM) and Principal Component Analysis (PCA).

```
Standardized Dataset:
Array
(
    [0] => Array
        (
            [0] => -0.44721359549996
            [1] => 1.3416407864999
            [2] => 1.3416407864999
        )

    [1] => Array
        (
            [0] => 0.44721359549996
            [1] => -0.44721359549996
            [2] => -0.44721359549996
        )
    // ... and so on
)
```

***

### PHP-ML Examples

PHP-ML offers similar functionality, although it is less feature-rich than RubixML. Here’s how to handle some of these tasks with PHP-ML.

#### **Step 1: Handling Missing Values**

PHP-ML doesn’t have a built-in `MissingDataImputer`, but we can write custom code to handle missing values.

```php
use Phpml\Dataset\CsvDataset;

// Load the dataset
$dataset = new CsvDataset('customers.csv', 3);

// Custom function to replace missing values with the mean of the column
function imputeMissingValues($dataset) {
    $samples = $dataset->getSamples();
    $colMeans = [];

    // Calculate the mean for each column
    foreach (range(0, 2) as $colIndex) {
        $colValues = array_column($samples, $colIndex);
        $filteredValues = array_filter($colValues, fn($val) => $val !== null && $val !== '' ? (int)$val : false );
        $colMeans[$colIndex] = $filteredValues ? array_sum($filteredValues) / count($filteredValues) : 0;
    }

    // Replace missing values with the column mean
    foreach ($samples as &$sample) {
        foreach ($sample as $i => &$value) {
            if ($value === null || $value === '' || $value === '?') {
                $value = $colMeans[$i];
            }
        }
    }

    return $samples;
}

// Apply missing value imputation
$samples = imputeMissingValues($dataset);

echo "\nAfter Imputation:\n";
foreach ($samples as $i => $sample) {
    echo implode(',', $sample) . "\n";
}
```

This function calculates the mean for each column and replaces missing values with the respective column mean.

```
After Imputation:
25,55000,45
32,68000,75
40,72000,52.5
31.25,82000,60
28,63000,30
```

#### **Step 2: Normalization and Standardization**

Normalization in PHP-ML can be done manually or by looping through each feature. However, PHP-ML also includes some transformers, though they are more limited. Here’s an example of manual Min-Max normalization.

```php
// Create a sample dataset with some numerical features
$samples = [
    [100, 500, 25],
    [150, 300, 15],
    [200, 400, 20],
    [50, 200, 10]
];

function normalize($samples) {
    $minMax = [];

    // Find min and max for each column
    foreach (range(0, count($samples[0]) - 1) as $colIndex) {
        $colValues = array_column($samples, $colIndex);
        $minMax[$colIndex] = [min($colValues), max($colValues)];
    }

    // Normalize each value
    foreach ($samples as &$sample) {
        foreach ($sample as $i => &$value) {
            $min = $minMax[$i][0];
            $max = $minMax[$i][1];
            $value = ($value - $min) / ($max - $min);
        }
    }

    return $samples;
}

$samples = normalize($samples);

// Print the normalized values
echo "Normalized Dataset:\n";
print_r($samples);
```

This code adjusts the features to have a mean of 0 and a standard deviation of 1.

```
Normalized Dataset:
Array
(
    [0] => Array
        (
            [0] => 0.33333333333333
            [1] => 1
            [2] => 1
        )
    [1] => Array
        (
            [0] => 0.66666666666667
            [1] => 0.33333333333333
            [2] => 0.33333333333333
        )
    // ... and so on
)
```

### Summary

Using **RubixML**, we streamlined data cleaning with `MissingDataImputer`, `MinMaxNormalizer`, and `ZScaleStandardizer`. With **PHP-ML**, custom functions were needed to perform imputation, normalization, and standardization. RubixML is more convenient and feature-rich for these data preprocessing tasks, making it a good choice for machine learning in PHP.

{% hint style="info" %}
To try this code yourself, install the example files from the official GitHub repository: <https://github.com/apphp/ai-with-php-examples>
{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://apphp.gitbook.io/artificial-intelligence-with-php/machine-learning/data-fundamentals/ml-data-processing/stages-of-data-processing/data-cleaning/implementation-with-php.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
