Here’s a detailed example of data cleaning with RubixML and PHP-ML, two machine learning libraries for PHP. We'll look at how to handle missing values, normalization, and standardization.
For this example, let’s assume you’re working with a dataset containing information about customer transactions, with fields like age, income, spending_score, and some missing values. We'll demonstrate each data cleaning step using the RubixML library, followed by some alternative approaches using PHP-ML.
RubixML Examples
1. Setting Up the Dataset
Let's assume our dataset is a CSV file named customers.csv with the following fields: age, income, and spending_score. A few cells in the file are missing.
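To make the later steps reproducible, here is a small sketch that writes a hypothetical customers.csv with the three fields age, income, and spending_score. The rows are an assumption, chosen to be consistent with the imputation outputs shown later in this article, and the '?' placeholder for a missing cell is also an assumption rather than something the original file is known to use.

```php
<?php
// Hypothetical sample data: one header row plus five customers.
// '?' marks a missing value (an assumed placeholder).
$rows = [
    ['age', 'income', 'spending_score'],
    ['25', '55000', '45'],
    ['32', '?', '75'],
    ['40', '72000', '?'],
    ['?', '82000', '60'],
    ['28', '63000', '30'],
];

// Join each row with commas and write one row per line.
$lines = array_map(fn (array $row): string => implode(',', $row), $rows);

$path = 'customers.csv';
file_put_contents($path, implode(PHP_EOL, $lines) . PHP_EOL);
```

Each of the three columns has exactly one missing cell, so every column exercises the imputation step.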
RubixML provides dedicated transformers for handling missing values, normalization, and standardization. Let’s break down each step.
Step 1: Handling Missing Values
RubixML provides the MissingDataImputer for handling missing values. This imputer allows you to fill in missing values using strategies like Mean, Prior, Percentile or Constant.
    require 'vendor/autoload.php';

    use Rubix\ML\Datasets\Unlabeled;
    use Rubix\ML\Extractors\CSV;
    use Rubix\ML\Strategies\Percentile;
    use Rubix\ML\Strategies\Prior;
    use Rubix\ML\Transformers\MissingDataImputer;

    // Load the dataset with the CSV extractor (true = the file has a header row).
    // The file has no label column, so an Unlabeled dataset keeps all three
    // fields as features.
    $dataset = Unlabeled::fromIterator(new CSV(__DIR__ . '/customers.csv', true));

    // Percentile strategy for continuous features (p is given on a 0-100 scale)
    // and Prior strategy for categorical features (guesses a value in
    // proportion to how often it occurs).
    $imputer = new MissingDataImputer(new Percentile(55.0), new Prior());

    $dataset->apply($imputer);

    echo "\nAfter Imputation:\n";

    foreach ($dataset->samples() as $sample) {
        echo implode(',', $sample) . "\n";
    }
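Here, MissingDataImputer fills missing continuous values using the Percentile strategy and missing categorical values using the Prior strategy, which picks a replacement at random in proportion to how often each value occurs in the column. Because the CSV extractor returns every field as a string, the columns in this example are treated as categorical, so each gap is filled with a value that already appears in that column and the exact output can vary between runs. To impute numerically instead, convert the columns to numbers first (RubixML provides a NumericStringConverter transformer for this).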
After Imputation:
25,55000,45
32,72000,75
40,72000,30
25,82000,60
28,63000,30
Step 2: Normalization
RubixML has a MinMaxNormalizer that scales values to a range (usually between 0 and 1). This is especially useful for features like income and spending_score that vary widely.
    require 'vendor/autoload.php';

    use Rubix\ML\Datasets\Labeled;
    use Rubix\ML\Transformers\MinMaxNormalizer;

    // Create a sample dataset with some numerical features.
    $samples = [
        [100, 500, 25],
        [150, 300, 15],
        [200, 400, 20],
        [50, 200, 10],
    ];

    $labels = ['A', 'B', 'C', 'D'];

    // Create a labeled dataset.
    $dataset = new Labeled($samples, $labels);

    // Create a MinMaxNormalizer to scale values between 0 and 1.
    $normalizer = new MinMaxNormalizer(0.0, 1.0);

    // Apply normalization to the dataset.
    $dataset->apply($normalizer);

    // Print the normalized values.
    echo "Normalized Dataset:\n";
    print_r($dataset->samples());
The MinMaxNormalizer rescales each feature to the 0 to 1 range so that all features share a common scale. The normalized value of a feature x is:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
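As a worked example of min-max scaling, take the first column of the sample dataset, whose values are 100, 150, 200, and 50. Its minimum is 50 and its maximum is 200, so the first value maps as follows:

```latex
x' = \frac{100 - 50}{200 - 50} = \frac{50}{150} \approx 0.333
```

By the same arithmetic, 150 maps to about 0.667, 200 maps to 1, and 50 maps to 0.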
Step 3: Standardization
If standardization is more appropriate (for instance, when using algorithms like SVMs that are sensitive to the scale of each feature), we can apply the ZScaleStandardizer.
    require 'vendor/autoload.php';

    use Rubix\ML\Datasets\Labeled;
    use Rubix\ML\Transformers\ZScaleStandardizer;

    // Create a sample dataset with some numerical features.
    $samples = [
        [100, 500, 25],
        [150, 300, 15],
        [200, 400, 20],
        [50, 200, 10],
    ];

    $labels = ['A', 'B', 'C', 'D'];

    // Create a labeled dataset.
    $dataset = new Labeled($samples, $labels);

    // Apply standardization.
    $standardizer = new ZScaleStandardizer();

    $dataset->apply($standardizer);

    echo "After Standardization:\n";
    print_r($dataset->samples());
The ZScaleStandardizer adjusts the features to have a mean of 0 and a standard deviation of 1, which is ideal for models like Support Vector Machines (SVM) and Principal Component Analysis (PCA).
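For reference, the z-score behind standardization can be written out and checked by hand. The worked numbers below use the first column of the sample dataset (100, 150, 200, 50) and assume the population standard deviation, dividing by n; a library may use a slightly different estimator, so treat the figures as illustrative.

```latex
z = \frac{x - \mu}{\sigma}, \qquad
\mu = \frac{100 + 150 + 200 + 50}{4} = 125, \qquad
\sigma = \sqrt{\frac{(-25)^2 + 25^2 + 75^2 + (-75)^2}{4}} = \sqrt{3125} \approx 55.90

z_{100} = \frac{100 - 125}{55.90} \approx -0.447
```

Under the same assumption, 150 maps to about 0.447, 200 to about 1.342, and 50 to about -1.342.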
PHP-ML Examples
PHP-ML offers similar functionality, although it is less feature-rich than RubixML. Here’s how to handle some of these tasks with PHP-ML.
Step 1: Handling Missing Values
PHP-ML ships an Imputer class (Phpml\Preprocessing\Imputer) with mean, median, and most-frequent strategies. To make the logic explicit, this example instead handles missing values with custom code.
    require 'vendor/autoload.php';

    use Phpml\Dataset\CsvDataset;

    // Load the dataset with 3 feature columns (the first row is a header).
    // CsvDataset reads the column after the features as the target; this file
    // has none, so only the samples are used.
    $dataset = new CsvDataset('customers.csv', 3);

    // Replace missing values with the mean of their column.
    function imputeMissingValues(CsvDataset $dataset): array
    {
        $samples = $dataset->getSamples();
        $colMeans = [];

        // Calculate the mean of the numeric values in each column.
        foreach (array_keys($samples[0]) as $colIndex) {
            $numeric = array_filter(array_column($samples, $colIndex), 'is_numeric');
            $colMeans[$colIndex] = $numeric ? array_sum($numeric) / count($numeric) : 0;
        }

        // Treat any non-numeric cell (null, '', '?') as missing.
        foreach ($samples as &$sample) {
            foreach ($sample as $i => &$value) {
                if (!is_numeric($value)) {
                    $value = $colMeans[$i];
                }
            }
            unset($value);
        }
        unset($sample);

        return $samples;
    }

    // Apply missing value imputation.
    $samples = imputeMissingValues($dataset);

    echo "\nAfter Imputation:\n";

    foreach ($samples as $sample) {
        echo implode(',', $sample) . "\n";
    }
This function calculates the mean for each column and replaces missing values with the respective column mean.
After Imputation:
25,55000,45
32,68000,75
40,72000,52.5
31.25,82000,60
28,63000,30
Step 2: Normalization and Standardization
PHP-ML includes some transformers of its own, though they are more limited, and scaling can also be done manually by looping over each feature. Here’s an example of manual min-max normalization.
    // Create a sample dataset with some numerical features.
    $samples = [
        [100, 500, 25],
        [150, 300, 15],
        [200, 400, 20],
        [50, 200, 10],
    ];

    // Rescale every column to the 0-1 range.
    function normalize(array $samples): array
    {
        $minMax = [];

        // Find the min and max of each column.
        foreach (array_keys($samples[0]) as $colIndex) {
            $colValues = array_column($samples, $colIndex);
            $minMax[$colIndex] = [min($colValues), max($colValues)];
        }

        // Normalize each value; a constant column maps to 0 to avoid
        // division by zero.
        foreach ($samples as &$sample) {
            foreach ($sample as $i => &$value) {
                [$min, $max] = $minMax[$i];
                $value = $max > $min ? ($value - $min) / ($max - $min) : 0;
            }
            unset($value);
        }
        unset($sample);

        return $samples;
    }

    $samples = normalize($samples);

    // Print the normalized values.
    echo "Normalized Dataset:\n";
    print_r($samples);
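This code rescales each feature to the 0 to 1 range using its column minimum and maximum. Standardization can be written the same way by subtracting the column mean from each value and dividing by the column standard deviation.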
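For the standardization half of this step, a manual version might look like the sketch below. The function name standardize is hypothetical, and the code assumes the population standard deviation (dividing by the number of rows); it uses plain PHP only and is not part of PHP-ML.

```php
<?php
// Standardize each column to zero mean and unit (population) standard deviation.
function standardize(array $samples): array
{
    if ($samples === []) {
        return [];
    }

    $stats = [];

    // Compute the mean and standard deviation of each column.
    foreach (array_keys($samples[0]) as $col) {
        $values = array_column($samples, $col);
        $n = count($values);
        $mean = array_sum($values) / $n;
        $squaredDiffs = array_map(fn ($v) => ($v - $mean) ** 2, $values);
        $stats[$col] = [$mean, sqrt(array_sum($squaredDiffs) / $n)];
    }

    // Replace each value with its z-score; a constant column maps to 0.
    foreach ($samples as &$sample) {
        foreach ($sample as $col => &$value) {
            [$mean, $std] = $stats[$col];
            $value = $std > 0.0 ? ($value - $mean) / $std : 0.0;
        }
        unset($value);
    }
    unset($sample);

    return $samples;
}

$samples = [
    [100, 500, 25],
    [150, 300, 15],
    [200, 400, 20],
    [50, 200, 10],
];

$standardized = standardize($samples);

echo "Standardized Dataset:\n";
print_r($standardized);
```

After this transformation every column has a mean of 0, so features measured on very different scales contribute comparably to distance-based models.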
Using RubixML, we streamlined data cleaning with MissingDataImputer, MinMaxNormalizer, and ZScaleStandardizer. With PHP-ML, we wrote custom functions for the same steps. For these preprocessing tasks, RubixML is the more convenient and feature-rich option, making it a good choice for machine learning in PHP.