Methods in DNA methylation array dataset analysis: A review

Understanding the intricate relationships between gene expression levels and epigenetic modifications in a genome is crucial to comprehending the pathogenic mechanisms of many diseases. With the advancement of DNA Methylome Profiling techniques, the emphasis on identifying Differentially Methylated Regions (DMRs/DMGs) has become crucial for biomarker discovery, offering new insights into the etiology of illnesses. This review surveys the current state of computational tools/algorithms for the analysis of microarray-based DNA methylation profiling datasets, focusing on key concepts underlying the diagnostic/prognostic CpG site extraction. It addresses methodological frameworks, algorithms, and pipelines employed by various authors, serving as a roadmap to address challenges and understand changing trends in the methodologies for analyzing array-based DNA methylation profiling datasets derived from diseased genomes. Additionally, it highlights the importance of integrating gene expression and methylation datasets for accurate biomarker identification, explores prognostic prediction models, and discusses molecular subtyping for disease classification. The review also emphasizes the contributions of machine learning, neural networks, and data mining to enhance diagnostic workflow development, thereby improving accuracy, precision, and robustness.


Introduction
DNA methylation has a long evolutionary history, can be found in all kingdoms of life, including eukaryotic and archaebacterial organisms, and is extensively reported as a significant process for embryonic development and cellular function. This process adds a methyl group to the cytosine of CpG sites in the genome and is closely involved in the regulation of gene expression [1,2]. DNA methylation is a crucial epigenetic modification of the genome, important for cellular reprogramming, tissue differentiation, and proper development, and is connected to many biological processes, including the control of gene expression [3,4]. CpG dinucleotides, which are primarily located in so-called CpG islands (CGIs), undergo DNA methylation at the 5′ position of the cytosine. Approximately 70% of gene promoters reside within CpG islands, chiefly including the promoters of housekeeping genes [5]. DNA methylation at CGIs regulates gene activity, plays a crucial role in gene silencing through promoter methylation, and can contribute to the pathogenesis of diseases [6,7].
Aberrant DNA methylation of imprinted sites has been reported in various forms of cancer, such as colon, breast, liver, bladder, oesophageal, prostate, and bone cancers [8]. Multiple omics-based studies have also revealed that a variety of malignancies, including hepatocellular carcinoma [9], glioblastoma, breast cancer, squamous cell lung cancer, thyroid carcinoma, and leukemia [10], have diverse DNA methylation patterns. Moreover, DNMT mutations, varying DNMT expression levels, and dysregulation of TETs are frequently observed in cancer, pointing towards a strong association between DNA methylation and cancer [11].
Several techniques are available for profiling DNA methylation in genomic DNA. These include targeted bisulfite sequencing with TruSeq Methyl Capture, whole genome bisulfite sequencing (WGBS), methylated DNA immunoprecipitation (MeDIP), pyrosequencing, Illumina Infinium DNA methylation arrays, Nanopore DNA sequencing, and ultra-high performance liquid chromatography coupled with mass spectrometry (UHPLC-MS/MS) [12,13]. TruSeq EPIC sequencing offers targeted coverage of 3.34 million CpG sites, surpassing EPIC-array capabilities [14]. MeDIP-seq extends coverage to approximately 10% of the genome, while RRBS notably covers 85% of CGIs, primarily in promoter regions [15]. WGBS provides comprehensive genome coverage but is resource-intensive [16]. Pyrosequencing offers targeted analysis, while Illumina Infinium assays offer high-resolution single-CpG-site measurements. The data file sizes for methylation beta values typically range from 4 GB (zipped) to 20 GB (unzipped). Costs vary, with sequencing services such as WGBS and RRBS priced at around $300 per sample, while methylation array expenses average $425 per chip, covering reagents and labor costs for multiple samples. The cost also depends on the company providing the platform services. The scalability of the methods ranges from moderate (suitable for multiple samples) to high (suitable for large numbers of samples). Many bioinformatics methods and pipelines, including Bigmelon [17], EpiScanpy [18], EpiMOLAS [19], MADA [20], AmpliconDesign [21], COH-CAP [22], Bicycle [23], and ChAMP [24], have been developed to analyze the high-throughput methylation datasets that the various platforms produce for epigenome-wide association studies and whose output is deposited in established repositories.
Differentially methylated regions (DMRs), which are genomic areas that exhibit noticeable variations in methylation levels between biological states (e.g., normal versus diseased), have been found to be connected to several diseases [25]. Hence, the detection of DMRs is one of the most important problems in understanding disease mechanisms at the molecular level. Even though DNA methylation patterns are stable throughout the growth of normal somatic cells, variation in genomic methylation might be caused by genetic differences or vice versa. However, whether a methylation alteration is a cause or an effect is typically overlooked in traditional DMR analysis. For differential methylation analyses at the cell-type level, Rahmani et al., 2019 investigated the impact of model directionality, that is, whether methylation affects the condition of interest (phenotype) or vice versa [26]. They demonstrated that identifying cell type-specific differential methylation depends significantly on properly accounting for model directionality.
According to current research, the connections between methylation modifications and copy number variations (CNVs) provide a wider, and therefore more useful, picture of the samples under analysis, particularly for tumor data characterized by significant genomic rearrangements [27]. Recent technological developments have made it possible to measure CNVs from DNA methylation data. One of the primary advantages of DNA methylation-based CNV approaches is their ability to combine epigenomic (methylation) and genomic (copy number) information. In 2022, Mariani and colleagues introduced MethylMasteR, an R software package that incorporates CNV calling algorithms based on DNA methylation, making it easier to standardize, compare, and customize CNV investigations [28]. MethylMasteR enables performance evaluation by comparing runtime and memory usage and by assessing the detection of large-scale CNVs in cancer samples using four well-known methylation-based CNV algorithms: ChAMP [24], SeSAMe [29], Epicopy [30], and a modified version of cnAnalysis450k [31].
A DNA methylation dataset encompasses the chromosome number, UCSC reference genome information, the chromosomal coordinates of the CpG island, experimentally determined differentially methylated regions with the specification of disease, regulatory feature details, and Hidden Markov Model Islands, which describe computationally predicted CpG islands. The metadata describes the technical details of the DNA methylation profiling experiment, including the platform used, the title, repositories, and term accession, as well as a summary of the experimental conditions and sample preparation details (for control and case-defined groups) [32]. The size of raw methylation array data files, accessible at the NCBI GEO database, depends on the number of samples and ranges from > 100 MB (for a single sample) to 5-10 GB (for a maximum of 987 patient samples) [33]. A data storage and retrieval system/platform, reference genome databases like the UCSC Genome Browser [34] or ENCODE [35], bioinformatics tools like RStudio, Anaconda, and Bioconductor packages, along with statistical tools such as R, SAS, or SPSS [36], are all necessary to work with large-scale data files from repositories. To process, interpret, and analyze the methylation array data for finding DMRs, computational methods are combined with statistical and data visualization techniques. This is followed by enrichment analysis, which is necessary for the biological interpretation of the observed methylation changes. Fig. 1 illustrates the systematic process, starting with data collection from public repositories, followed by the application of significant computational algorithms to analyze methylation datasets, in line with multiple predefined hypotheses delineated by the researchers.
Given the current worldwide emphasis on integrated disease management, it is impossible to overlook the emergence of novel computational paradigms for fundamental and biomedical investigations. These approaches merit examination for their potential to significantly improve disease diagnosis capabilities. Numerous methylation profiling approaches are in widespread use, but their individual analytical processes go beyond the scope of this review and are therefore not discussed. These techniques include Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS), Methylated DNA Immunoprecipitation (MeDIP), single-cell RRBS (scRRBS), and Next-Generation Sequencing (NGS) [37,38]. Technically, RRBS-seq, scRRBS, MeDIP, BS-Seq, and WGBS have established protocols for detecting methylated cytosines in genomic DNA. In contrast, NGS provides a comprehensive view of nucleotide sequences across entire genomes or specific DNA/RNA regions. Existing literature reviews have thoroughly explored both the experimental profiling techniques related to DNA methylation and the computational methods employed for the analysis of DNA methylation data (array-based and sequence-based data) [15,39,40]. WGBS, RRBS, and NGS are high-throughput sequencing methods whose analysis pipelines comprise library preparation, alignment, quality control, methylation calling, and annotation [41,42]. Similarly, MeDIP and MethylCap-seq data analyses follow these post-sequencing steps [43,44]. Methylation also serves as a potent marker for distinguishing cells under varying conditions or of different cell types, pushing the boundaries of research into single-cell DNA methylation profiling, whose analysis is similar to that of bulk methylation data [45]. Although these technologies are promising, DNA hybridization microarrays are increasingly utilized for their cost-effectiveness, rapid analysis, and broad coverage using a predetermined set of CpG sites. This technology also supports a wide range of experiments, encompassing genotyping, epigenetics, translation profiling, and gene expression analysis. Notably, the Illumina Infinium HumanMethylation BeadChip array is a widely used high-throughput option, providing the most comprehensive genome-wide DNA methylation data available in the GEO database for disease research [46]. Therefore, this review highlights studies that utilize DNA Infinium microarray data to effectively identify methylation markers for disease. The article selection focuses exclusively on identifying methylated regions/sites as diagnostic or prognostic biomarkers for human diseases, supplemented by analyses/methods that enhance their biological relevance. We compiled the articles by searching the web, PubMed, and Google Scholar with the keywords "DNA methylation array analysis in cancer", "DNA methylation array analysis in diseases" AND "Computational methods for analyzing DNA methylation array". The filters (selected parameters) used for the collection of datasets were "Homo sapiens" as the organism and, as the platform, "Illumina arrays including Infinium MethylationEPIC, Infinium HumanMethylation27, and the Infinium HD 450 K methylation array". We selected and analyzed relevant publications from the past 5 years (Table 1 and Table 2) to provide an overview of existing DNA methylation-based biomarker studies and to outline the progression and future aspects of this research domain. We seek to address the practical challenges faced by researchers in the selection of methodologies and to give an updated perspective on algorithms/packages used for processing array-based DNA methylation data. We also aim to present an end-to-end methodological framework that guides the selection of computational algorithms for diverse research outcomes and demonstrates advancement in the analysis of methylation array data for diagnostic and prognostic studies in disease pathology.
Taking account of the multiple hypotheses and the methodologies designed by the investigators, the review has been divided into six sections with their respective subsections. The first section focuses on how investigators retrieve methylation datasets, supported by a variety of filters, to meet the requirements of the research objective. The second section delves into preprocessing requirements and algorithm selection criteria to improve analysis accuracy. The third section addresses exploratory analysis, showcasing the clustering algorithms used by different studies to better characterize and visualize patient samples. The subsequent sections detail the downstream analysis, with separate subsections discussing the algorithms/methods and packages used specifically for identifying DMRs that can serve the diagnosis and prognosis of disease. We also briefly address the algorithms used to identify genome segments with similar methylation patterns under a single condition. In the penultimate section, we explore the integration of feature selection algorithms and the development of ML/DL models, which aid in predicting the estimated risk of disease occurrence and progression and in determining overall survival efficacy for derived significant methylation biomarkers. Lastly, the process used for evaluating the biological relevance and functional significance of methylated regions is discussed in detail. Fig. 2 illustrates the step-wise analysis of the methods followed in the relevant articles, along with the respective hypotheses considered in this review.

Microarray dataset collection and repositories
To begin, the primary DNA methylation workflows make use of global repositories such as the GEO databases, which are supported by the National Institutes of Health (NIH). Additionally, the TCGA-GDC portal (http://cancergenome.nih.gov/) is a collaboration that is supported by the National Cancer Institute (NCI) and the National Human Genome

Fig. 1. Visual representation of the step-wise analysis of methylation data with its significance and the most widely used algorithms in the reviewed manuscripts. Depending upon the researcher's objective, some manuscripts demonstrate the application of all the steps within a single paper, while others showcase the application of a few selected steps.

DNA methylation microarray data analysis
The analysis procedure is broadly analogous to the procedures followed for various other analyses of sequencing data. The complete procedure can be broken down into the following four broad sections, each of which can be further subdivided.

Pre-processing analysis of raw dataset
The algorithms most widely used in published research for pre-processing raw data include SWAN (subset-quantile within array normalisation) [53-55], FunNorm (functional normalization) [54,56], pQuantile (stratified quantile normalization) [57], noob (normal-exponential using out-of-band probes) [54,58], RCP (Regression on Correlated Probes) [59], BMIQ (Beta-Mixture Quantile method) [50,60-63], and the combination of noob and BMIQ, which was shown to perform better than the others [58]. Further details of the normalization algorithms and packages can be found in Supplementary Table III. Such pre-processing methods also apply several common computational manipulations to raw data. These include background signal subtraction, color bias adjustment, and probe type adjustments to reduce the effects of experimental variation early in the pipeline. It is also necessary to check for missing values in the genomic regions of the patient samples, which can lead to technical bias. This can be overcome by employing imputation functions based on k-nearest neighbors with a Euclidean metric for simple data. This is followed by inter- and intra-sample normalization as well as batch effect correction, which are typically taken into account later in the pipeline. Moreover, in epigenome-wide association studies (EWAS), cellular composition can confound the association between primary phenotype and methylation levels. To address this, the researcher can use functions such as estimateCellCounts offered by the Minfi and FlowSorted.Blood.EPIC Bioconductor R packages, a reference-based deconvolution method [64,65], and the methylDeConv R package (capable of analyzing both Illumina 450k and EPIC arrays), to estimate and account for this confounding effect [55,66,67]. Hence, carefully selecting an appropriate pre-processing pipeline is a must to improve statistical efficacy in single dataset analysis and ensure result reliability. Supplementary Table IV details the preprocessing methods utilized by the packages/functions frequently mentioned in the majority of research articles.
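The k-nearest-neighbour imputation mentioned above can be illustrated with a minimal sketch. It is not the implementation of any cited package: the function name `knn_impute`, the toy beta matrix, and the choice to search for neighbours among complete probes only are assumptions made for illustration.

```python
import math

def knn_impute(matrix, k=2):
    """Fill missing beta values (None) in a probes x samples matrix.

    For each probe with gaps, the k most similar complete probes are found
    (Euclidean distance over the samples observed in the target probe) and
    the mean of their values is used at each missing position.
    """
    complete = [row for row in matrix if None not in row]
    if not complete:
        raise ValueError("need at least one probe without missing values")
    imputed = []
    for row in matrix:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def distance(other):
            # Distance is computed on the observed samples only.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=distance)[:k]
        imputed.append([
            v if v is not None else sum(n[j] for n in neighbours) / len(neighbours)
            for j, v in enumerate(row)
        ])
    return imputed

# Hypothetical beta values: 4 probes (rows) x 4 samples (columns).
betas = [
    [0.10, 0.12, 0.11, 0.09],
    [0.80, 0.82, 0.79, 0.85],
    [0.11, 0.13, None, 0.10],  # missing value in the third sample
    [0.78, 0.81, 0.80, 0.83],
]
result = knn_impute(betas, k=1)
```

With `k=1`, the gap in the third probe is filled from the first probe, which is its closest neighbour on the observed samples. Real datasets would normally call for a dedicated package and a larger `k`.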
Despite the diversity of the studies discussed in this review, some common preprocessing steps can be identified in their methodologies. These steps are outlined below:

a) Filtering of probes:
• Remove probes with detection p-value greater than 0.01 or 0.05, low bead count, SNPs, or cross-hybridization potential.
• Remove CpG sites with no or small differences in beta values among tissues and remove probes on the sex chromosomes.
• Reject samples with too many poorly performing or missing probes.
• Some methodologies only selected CpG sites present in the promoter regions.

b) Quality control:
• Apply background subtraction using methylumi or other packages.
• Filter out PCA outliers using various methods.

c) Imputation of missing values:
• Remove or impute missing values using Impute or ENmix packages.
• Use different methods such as KNN, mean, or iterative imputation.
• Use Bayesian Ridge regression for probabilistic estimation of missing values.

d) Batch correction and FDR correction:
• Use the ComBat algorithm to correct for batch effects.
• Use the Benjamini-Hochberg method to control for multiple testing.

e) Normalization of samples:
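Two of the steps listed above, detection p-value filtering and Benjamini-Hochberg correction, can be sketched in a few lines. The probe identifiers, the input layout (dictionaries keyed by probe), and the `max_failed_fraction` parameter are hypothetical and serve only to illustrate the idea; the reviewed studies rely on dedicated packages.

```python
def filter_probes(betas, detection_p, threshold=0.01, max_failed_fraction=0.0):
    """Keep probes whose detection p-value passes in enough samples.

    betas / detection_p: dict mapping probe ID -> list of per-sample values.
    A sample fails for a probe when its detection p-value exceeds threshold.
    """
    kept = {}
    for probe, pvals in detection_p.items():
        failed = sum(p > threshold for p in pvals)
        if failed / len(pvals) <= max_failed_fraction:
            kept[probe] = betas[probe]
    return kept

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical probes measured in two samples.
betas = {"cg_a": [0.1, 0.2], "cg_b": [0.8, 0.9], "cg_c": [0.5, 0.4]}
detection_p = {"cg_a": [0.001, 0.002], "cg_b": [0.2, 0.001], "cg_c": [0.005, 0.009]}
kept = filter_probes(betas, detection_p, threshold=0.01)

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Here `cg_b` is dropped because one sample fails the 0.01 detection threshold. The adjusted p-values would then be compared against the chosen FDR level.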

Exploratory analysis
Numerous studies have shown that utilizing visual inspection and graphical representation of normalized and quality-controlled DNA methylation data aids in detecting global changes in methylation patterns.This preliminary analysis precedes more intricate investigations related to differential DNA methylation.These methods also facilitate the detection of methylation gains or losses, and exploration of methylation levels in specific genomic regions across diverse samples, enabling comparative analysis of methylation patterns.

Clustering samples
Clustering divides the data items in a given set of samples into groups or subgroups. For instance, a clustering-based approach can separate DNA methylation data into clusters of normal and abnormal DNA profiles; hypermethylated and hypomethylated probes in DNA methylation array data; and clusters with and without CIMP (CpG island methylator phenotype).

Clustering methods. PCA (principal component analysis) is a classic method used by Tirosh et al., 2022 for visualizing the comparison of DNA methylome signatures between neuroendocrine tumors and normal samples, for better characterization [68]. Despite its widespread use as a dimension reduction procedure, PCA's main drawbacks are the difficulty of interpreting the independent variables that form the principal components and the need for a large sample size to ensure reliable outcomes. Another significant algorithm is hierarchical clustering, which creates a binary tree by progressively combining similar samples or probes, using a specific similarity measure [69]. Notably, this clustering analysis identified a distinct DNA methylation pattern that effectively distinguished CTC-MCC-41 cells from HT29 cancer cells in colorectal cancer patients [70]. However, the unsupervised nature of hierarchical clustering does not allow the algorithm to use data beyond methylation for forming clusters, so it can fail to predict a phenotype required by the user [71]. Therefore, Kok-Sin et al., 2015 used an advanced method for binary sample distribution analysis, which includes both supervised hierarchical clustering and PCA, for subgrouping the differentially methylated loci and the methylation ratio matrix of genes into hypermethylated and hypomethylated loci to find a significant locus/gene, respectively [72]. For complex data distributions, the t-SNE (t-distributed stochastic neighbor embedding) and NMF (non-negative matrix factorization) algorithms are used for the dimensionality reduction step and also contribute to disease stratification [73]. t-SNE and NMF demonstrate robust performance in managing outliers and in the unsupervised modeling of cancer diagnosis, respectively, outperforming PCA in terms of efficacy [74,75]. However, t-SNE requires precise hyperparameter adjustment and has high computational demands, while NMF is limited by its assumption of non-negative beta-valued methylation data and by its parameter sensitivity [75,76]. More recently, Amor et al., 2022 developed the deep-embedded refined clustering (DERC) method, which performed better than PCA, t-SNE, and NMF by using autoencoders for the unsupervised classification of breast cancer samples among normal samples, achieving an accuracy of 0.99 [77].
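The agglomerative idea behind hierarchical clustering, progressively merging the most similar samples, can be sketched as follows. This is a minimal illustration assuming average linkage and Euclidean distance; the sample names and beta values are hypothetical, and the cited studies may use other linkage rules or similarity measures.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    profiles: dict sample name -> list of beta values (same probes, same order).
    Returns a list of sets of sample names.
    """
    clusters = [{name} for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                dists = [euclidean(profiles[a], profiles[b])
                         for a in clusters[i] for b in clusters[j]]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical beta values at three probes for four samples.
profiles = {
    "tumor_1": [0.85, 0.80, 0.10],
    "tumor_2": [0.88, 0.79, 0.12],
    "normal_1": [0.20, 0.25, 0.70],
    "normal_2": [0.22, 0.21, 0.72],
}
clusters = average_linkage(profiles, n_clusters=2)
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, would yield the dendrogram that makes this method easy to interpret.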
For interpreting the tumor heterogeneity of cancer tissues or classifying tumor samples, classic methods like K-means clustering have been used to stratify samples based on CIMP status, which affects tumor differentiation in colorectal cancer [78]. K-means clustering is user-friendly and adaptable to large samples or diverse methylation patterns, but it may need multiple attempts to overcome its randomness and has a high computational demand, so dimensionality reduction techniques are required for better efficiency [79,80]. Similarly, Hosseini, M et al., 2023 used the hierarchical clustering algorithm for visualization, showing that the significant hypermethylated probes in promoter regions separate into tumor and non-tumor classes. The researchers further utilized Pearson correlation and recursive feature elimination with 10-fold cross-validation (RFECV) for filtering features to identify diagnostic biomarkers in stomach adenocarcinoma [81]. In this respect, hierarchical clustering is often seen as a more user-friendly algorithm than K-means because it provides easily interpretable dendrograms and deeper insight into sample relationships, although it requires careful interpretation owing to its sensitivity to noise and outliers. Apart from these non-parametric approaches, the recursively partitioned mixture model (RPMM) has been applied to cluster DNA methylation and hydroxymethylation data for tumor classification using beta values [82]. Interestingly, Azizgolshani et al., 2021 employed this algorithm to examine the association between the 5hmC signals of CNS tumor samples and overall survival, suggesting that low 5hmC patterns carry an increased risk of recurrence and a poor overall survival (OS) rate [83]. One advantage of the RPMM approach is its computational efficiency over traditional finite mixture models, and it can integrate data related to CpG sites at genomic locations to create biological correlational structures that become the basis of further clustering studies. A notable limitation, however, can be its inability to include established biological correlations of the measured features [71,84].
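The remark that K-means may need multiple attempts to overcome its randomness can be made concrete with a small sketch that restarts the algorithm from several random initialisations and keeps the run with the lowest within-cluster sum of squares. The function names, the number of restarts, and the two-probe toy data are assumptions made for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans_once(points, k, seed, max_iter=100):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster loses all its members.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia

def kmeans(points, k, n_restarts=10):
    """Run K-means from several seeds and keep the tightest solution."""
    runs = [kmeans_once(points, k, seed) for seed in range(n_restarts)]
    return min(runs, key=lambda run: run[1])

# Hypothetical beta values at two probes for six samples.
samples = [
    [0.10, 0.10], [0.12, 0.11], [0.09, 0.10],  # low-methylation group
    [0.90, 0.88], [0.91, 0.90], [0.88, 0.92],  # high-methylation group
]
labels, inertia = kmeans(samples, k=2)
```

On array-scale data, the distance computations dominate the cost, which is one reason the reviewed studies pair K-means with a dimensionality reduction step.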
Several clustering algorithms are utilized to subgroup differentially methylated sites (DMS), serving as prognostic models capable of predicting risk scores or classifying patients into distinct molecular subtypes. These algorithms leverage information from the overall survival analysis of samples to accomplish this task. One study examined the variation in molecular genetic features (determined by hypermethylated/hypomethylated loci) with respect to the prognostic behaviour of each cluster group [85]. Similarly, Yin, X. et al., 2021 evaluated such molecular subgroups of pancreatic cancer samples for poorer prognosis and the associated clinicopathological features [86]. In a related study, BRCA samples were subgrouped through consensus clustering, which utilized methylation data of methylation-driven genes (MDG). This involved subsampling the data matrix and categorizing each subset into 'k' clusters via K-means. The subgroups' overall survival was analyzed using Kaplan-Meier plots, and the significance of differences between clusters was evaluated using the log-rank test [87]. Consensus clustering repeatedly applies a selected clustering algorithm to different subsets of the data, which makes it more robust and dependable than single-run clustering algorithms. It is extensively utilized to identify clusters linked to various clinical outcomes [88].
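The survival comparison between methylation-derived clusters can be sketched with a Kaplan-Meier estimator and a two-group log-rank test. This is a minimal illustration, not the code of the cited studies: the follow-up times and event indicators are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)] for one group.

    times: follow-up time per patient; events: 1 if death observed, 0 if censored.
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted({t for t, e in zip(times_a + times_b, events_a + events_b)
                          if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = observed_minus_expected ** 2 / variance
    # Chi-square with 1 degree of freedom: survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical follow-up (months) for two methylation clusters.
times_1, events_1 = [5, 8, 12, 20, 24], [1, 1, 1, 0, 1]
times_2, events_2 = [15, 22, 30, 34, 40], [1, 0, 1, 0, 0]

curve_1 = kaplan_meier(times_1, events_1)
statistic, p_value = logrank_test(times_1, events_1, times_2, events_2)
```

Plotting each cluster's curve as a step function gives the Kaplan-Meier plot, and the log-rank p-value summarises whether the clusters differ in overall survival.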

Downstream analysis
Downstream analysis refers to the sequence of analytical procedures conducted after the collection and preprocessing of the raw methylation data. It involves multi-sample dataset analysis, such as the identification of regions with significant methylation changes, annotation, enrichment, and classification.

Identification of differentially methylated regions (DMR)
DMRs are composed of closely related DMS/DMCs (differentially methylated sites), which refer to individual CpG sites or small genomic regions that exhibit differential DNA methylation levels. DMRs differ in the methylation level of genomic features, including gene promoters, enhancers, CpG islands, and intergenic regions, particularly across distinct biological conditions (e.g., normal vs. disease) [89]. Most of the algorithms studied focus on DMR/DMC analysis of disease methylation profiling data to screen for hypermethylated and hypomethylated genes, based on user-selected significance thresholds for FDR, log fold change (FC), and p-value. The packages, along with their major roles in computational analysis workflows such as preprocessing, DMR/DMP identification, annotation, and visualization, are described in tabulated format (Supplementary Table V). Theoretically, aberrant DNA methylation affects gene expression in diseased pathways, causing suppression. To investigate this relationship, multiple workflows have been developed to explore the correlation between methylation and gene expression data. These methodologies are examined in the following section.
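The threshold-based screening described above can be sketched as follows. The cut-offs (a mean beta difference of 0.2 and an FDR of 0.05), the probe names, and the input layout are illustrative assumptions, not values taken from the reviewed studies. The M-value shown is the standard logit transform of beta, log2(beta / (1 - beta)), on which log fold changes are commonly computed.

```python
import math

def m_value(beta, eps=1e-6):
    """Logit-transform a beta value; eps guards against 0 and 1."""
    beta = min(max(beta, eps), 1.0 - eps)
    return math.log2(beta / (1.0 - beta))

def mean(values):
    return sum(values) / len(values)

def classify_probes(case, control, adj_pvalues, delta_cutoff=0.2, fdr_cutoff=0.05):
    """Label each probe as 'hyper', 'hypo', or 'ns' (not significant).

    case / control: dict probe -> list of beta values per sample.
    adj_pvalues: dict probe -> FDR-adjusted p-value from a prior test.
    """
    calls = {}
    for probe in case:
        delta = mean(case[probe]) - mean(control[probe])
        if adj_pvalues[probe] < fdr_cutoff and abs(delta) >= delta_cutoff:
            calls[probe] = "hyper" if delta > 0 else "hypo"
        else:
            calls[probe] = "ns"
    return calls

# Hypothetical probes with beta values in three cases and three controls.
case = {
    "cg_hyper": [0.80, 0.85, 0.78],
    "cg_hypo": [0.15, 0.20, 0.10],
    "cg_flat": [0.50, 0.52, 0.49],
}
control = {
    "cg_hyper": [0.30, 0.35, 0.28],
    "cg_hypo": [0.70, 0.65, 0.72],
    "cg_flat": [0.51, 0.50, 0.50],
}
adj_pvalues = {"cg_hyper": 0.001, "cg_hypo": 0.003, "cg_flat": 0.80}

calls = classify_probes(case, control, adj_pvalues)
# Log fold change on the M-value scale for one probe.
log_fc = mean([m_value(b) for b in case["cg_hyper"]]) - mean(
    [m_value(b) for b in control["cg_hyper"]])
```

The adjusted p-values are taken as given here; in practice they would come from a per-probe test such as a moderated t-test followed by multiple-testing correction.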
First, a common strategy involves identifying hypermethylated-downregulated and hypomethylated-upregulated genes through separate analyses of gene expression and DNA methylation datasets. Subsequently, hub genes are retrieved by predicting protein-protein interactions (PPI network) among these identified genes. For instance, aberrantly methylated differentially expressed genes were identified by comparing raw data grouped as tumor and normal samples of oesophageal squamous cell carcinoma and bladder cancer, using GEO2R [90,91]. Also, Cheng et al., 2022 used the DMRcate algorithm offered by the ChAMP pipeline to process 19 carotid atherosclerotic and 15 control aortic tissue samples for screening differentially methylated genes [92]. Further exploring epigenetic regulation in the promoter region, Wang et al., 2021 derived 8029 differential CpG sites with 4940 genes annotated to the promoter region by using the empirical Bayes moderated t-test (limma), after comparing 64 normal samples and 183 periodontitis patient samples. This was followed by weighted co-expression analysis for identifying immune-related co-expression patterns involving differential CpG sites [93]. Using the same method, Feng et al., 2021 analyzed methylation in blood samples of 39 ARDS and 30 control patients, suggesting that hypomethylation may increase hub gene expression [94]. This algorithm was also utilized for preprocessing and differential expression analysis of 371 HCC samples and 50 normal controls to design a diagnostic signature model, giving five top methylated markers [95]. Likewise, Raman et al., 2018 extracted methylated genes and DEGs in pancreatic cancer patients, comparing survival- (<1 year) and survival+ (>2 years) groups using the limma package. The survival signature genes with a high extent of DNA methylation were validated by ROC and Kaplan-Meier survival analysis [96]. The same protocol was followed by Liang et al., 2019 and Zhang et al., 2019 for finding DEGs and DMGs, with an additional Spearman's correlation analysis for predicting the MeDEGs (methylated differentially expressed genes) in colon cancer and glioblastoma multiforme samples, respectively [91,97]. Ma et al., 2020 used the empirical Bayes t-test model for identifying aberrantly methylated DEGs/DMGs by comparing 39 diseased samples with 44 controls, followed by identification of intersecting nodes between the lists of 1313 DEGs, 1405 DMGs, oncogenes, and tumour suppressor genes [98]. Correspondingly, Xia et al., 2023 generated tumor signatures of cervical cancer by performing expression analysis using limma on 306 cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC) samples from TCGA and 215 CESC patients from the GEO portal. This was coupled with the application of the ESTIMATE algorithm, which specializes in estimating the proportion of immune and stromal cells in the samples using gene expression data only (targeting the TCGA portal) [99]. Similarly, a gene co-expression network was constructed by importing DEGs to find 268 co-expressed module genes in Alzheimer's disease. These were subsequently compared with differentially methylated positions (DMPs) identified through ChAMP analysis, leading to the identification of 77 common genes [100]. Thus, limma is frequently utilized for identifying genes, DMPs, and DMRs because its empirical Bayesian methods ensure robust outcomes even with limited sample sizes [101]. Moreover, owing to its comprehensive and specific approach to DNA methylation analysis, the ChAMP pipeline is capable of conducting end-to-end analysis of DNA methylation microarray data (both EPIC and 450k data) and ensures consistency with user-friendly interfaces [24]. To date, much of the literature has focused on the concurrent identification of DEGs and DMPs/DMRs to explore correlations between them. However, some algorithms focus solely on DMR investigation from array-based methylation data. Algorithms such as Comb-p, DMRcate, Bumphunter, and Probe Lasso identify DMRs, for example in promoter regions, and are listed here in decreasing order of performance in terms of power, sensitivity, DMR size, DMR overlap, and time consumption in simulations [102,103]. Accordingly, a recent study by Zhang W et al., 2023 used Comb-p to identify DMRs related to the CSF biomarker in blood samples of Alzheimer's patients, collectively identifying regions that show adjacent low p-values [85,104].
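The idea of identifying regions of adjacent low p-values can be illustrated with a deliberately simplified sketch. It is not the Comb-p algorithm: it combines neighbouring probe p-values with unweighted Stouffer's method and ignores the autocorrelation correction that region-finding tools apply. The window size, cut-offs, and probe coordinates are hypothetical.

```python
from statistics import NormalDist

_norm = NormalDist()

def stouffer(pvalues):
    """Combine one-sided p-values with unweighted Stouffer's method.

    Assumes independence between probes, which real data usually violate.
    """
    clipped = [min(max(p, 1e-300), 1.0 - 1e-12) for p in pvalues]
    z_sum = sum(-_norm.inv_cdf(p) for p in clipped)
    return _norm.cdf(-z_sum / len(clipped) ** 0.5)

def find_regions(probes, window=500, p_cutoff=0.01, max_gap=500, min_probes=2):
    """Group neighbouring probes with low combined p-values into regions.

    probes: list of (position, p_value) on one chromosome, sorted by position.
    Returns a list of (start, end) coordinates.
    """
    smoothed = []
    for pos, _ in probes:
        nearby = [p for q, p in probes if abs(q - pos) <= window]
        smoothed.append((pos, stouffer(nearby)))

    regions, current = [], []

    def close_region():
        if len(current) >= min_probes:
            regions.append((current[0], current[-1]))
        current.clear()

    for pos, p in smoothed:
        if p < p_cutoff:
            if current and pos - current[-1] > max_gap:
                close_region()
            current.append(pos)
        else:
            close_region()
    close_region()
    return regions

# Hypothetical probe positions (bp) and per-probe p-values.
probes = [
    (400, 0.5),
    (1200, 0.001), (1350, 0.002), (1500, 0.0005),
    (5000, 0.6), (5200, 0.4),
]
regions = find_regions(probes)
```

The three neighbouring probes with small p-values are reported as one candidate region, while isolated or non-significant probes are discarded.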
Second, to understand the regulatory mechanisms underlying gene activity, several studies perform correlation analysis between methylation patterns and gene expression levels, revealing how epigenetic marks relate to gene activity [105]. Yanzhao Xu et al., 2021 used Fisher's exact test, offered by the COHCAP R package, to estimate differentially methylated sites in raw data from 170 oesophageal carcinoma samples (cases and controls), followed by WGCNA co-expression network analysis to identify hub genes [105]. As discussed earlier, limma is a popular choice among researchers due to its versatility in analyzing various omics data types. However, COHCAP is also recognized as an efficient algorithm specialized for analyzing methylation data and its influence on gene expression. Additionally, WGCNA co-expression network analysis deepens biological insight into diseases by identifying gene modules (clusters) with similar expression patterns. These modules frequently correspond to biologically relevant pathways or processes and contain related hub genes. Another study by Rodriguez et al., 2022 used hierarchical linear models (HLM) coupled with the Mann-Whitney U test to obtain accurate estimates of the methylation differences between tumour and non-tumour groups of breast cancer patients, in order to account for promoter hypermethylation of WNT1 [106]. HLM is preferred for its robust handling of large amounts of missing data and its ability to estimate individual changes over time with fewer assumptions [107]. Another statistic commonly employed for correlation analysis is Pearson's correlation coefficient, which is used to explore the association between DNA methylation data and expression data related to exons or isoforms [108]. Notably, it is important
to be attentive to potential non-linear relationships between variables, the presence of outliers, and limitations due to restricted data ranges [109]. Furthermore, mapping methylation signals to specific genomic coordinates based on microarray probes is a common approach in methylation data analysis, but it can encounter challenges, resulting in reduced sensitivity and increased false positives. To address this, regression algorithms, such as linear regression models, are employed to find the association between gene expression (dependent variable) and DNA methylation levels (independent variables). Along these lines, a novel combinatorial framework was introduced that integrates DNA methylation data with TCGA gene expression data. This framework combines linear regression, differential expression analysis, and deep learning techniques to enable accurate biological interpretation [110]. Hence, linear regression and deep learning algorithms are good choices for identifying complex relationships between DNA methylation signatures and gene expression data.
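The caution about non-linear relationships can be made concrete with a small numerical sketch that compares Pearson's coefficient, a rank-based (Spearman) coefficient, and an ordinary least-squares fit of expression on methylation. The β-values and expression values below are invented for illustration, and the helper names are assumptions rather than functions of any package cited in this review.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical promoter methylation (beta-values) and expression of one gene.
# Expression falls monotonically but not linearly as methylation rises.
beta = [0.1, 0.2, 0.3, 0.5, 0.7, 0.9]
expression = [12.0, 9.0, 7.5, 6.8, 6.5, 6.4]

r_pearson = pearson(beta, expression)
r_spearman = spearman(beta, expression)
slope, intercept = fit_line(beta, expression)
print(r_pearson, r_spearman, slope, intercept)
```

Here the rank-based coefficient captures the perfectly monotone decrease, while Pearson's coefficient and the fitted slope only summarize its linear component.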
Third, researchers have harnessed statistical methods to unravel the relationship between DNA methylation patterns and various biological processes or pathways. These investigations rely on statistical approaches and some machine learning algorithms to explore epigenetic mechanisms. For instance, Yeung KS et al., 2017 studied systemic lupus erythematosus patients and used a Wilcoxon rank-sum test on a specific region to compare the groups. CpG sites with an adjusted p-value below 0.05 and an absolute mean methylation change exceeding 0.1 were identified as differentially methylated. The resulting hypomethylated gene is related to the type I interferon pathway [111]. The Wilcoxon rank-sum test, also referred to as the Mann-Whitney U test, is a widely used nonparametric method for comparing two independent groups with minimal assumptions about the data distribution, typically applied to medium to large datasets. Though widely used, this test is quite sensitive to outliers, tied values (data points having the same value), and small sample sizes [112]. Moreover, Mohammadnejad et al., 2021 combined the generalized correlation coefficient (GCC) approach (Matie R package), a linear mixed model (lme4 R package), and a kinship model (kinship2 R package) to identify 65 CpGs whose DNA methylation is associated with cognitive function in twin samples. These algorithms were chosen for their suitability for DNA methylation data, accommodating both fixed and random effects and capturing both linear and nonlinear correlations [102]. As discussed earlier, parametric linear regression models can be used to investigate the association between DNA methylation and symptom severity. A recent study by Tang Y et al., 2024 used PANSS scores (positive, negative, general subscale, and total) and covariates (sex, age, education level, and cell type composition) to build linear regression models after using
DMPfinder, and then tested whether methylation probes were linked to treatment response, taking into account antipsychotic details and baseline PANSS total score [113]. Linear regression has a few limitations: it assumes that observations are independent, is sensitive to outliers, and tends to underfit. A similar study derived DMRs/DMGs using DMRcate (offered by ChAMP) and limma to calculate methylation scores that distinguish active psoriasis from remission samples. It linked DMPs to disease activity and aided in the computation of the methylation score [114].
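The Wilcoxon-based calling rule described earlier in this paragraph (adjusted p-value below 0.05 and an absolute mean methylation difference above 0.1) can be sketched as follows. This is a simplified illustration, not the pipeline of any cited study: the rank-sum p-value uses a normal approximation with continuity correction and no tie correction, the Benjamini-Hochberg procedure is assumed as the multiple-testing adjustment, and the probe identifiers and β-values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_sum_p(group_a, group_b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)  # no tie correction
    z = max(abs(u_a - mean_u) - 0.5, 0.0) / sd_u   # continuity correction
    return 2 * (1 - NormalDist().cdf(z))

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def call_dmps(cases, controls, alpha=0.05, min_delta=0.1):
    """Return probes passing both the adjusted p-value and effect-size cut-offs."""
    cpgs = sorted(cases)
    adjusted = benjamini_hochberg(
        [rank_sum_p(cases[c], controls[c]) for c in cpgs])
    return [c for c, q in zip(cpgs, adjusted)
            if q < alpha and abs(mean(cases[c]) - mean(controls[c])) > min_delta]

# Hypothetical beta-values for three probes, eight samples per group.
cases = {
    "cg_A": [0.30 + 0.01 * i for i in range(8)],   # strongly hypomethylated
    "cg_B": [0.511 + 0.001 * i for i in range(8)], # separated, tiny effect
    "cg_C": [0.40 + 0.05 * i for i in range(8)],   # overlapping groups
}
controls = {
    "cg_A": [0.60 + 0.01 * i for i in range(8)],
    "cg_B": [0.501 + 0.001 * i for i in range(8)],
    "cg_C": [0.42 + 0.05 * i for i in range(8)],
}
hits = call_dmps(cases, controls)
print(hits)
```

The second probe shows why the effect-size filter matters: its groups are perfectly separated, so the test is significant, yet the mean difference is far too small to pass the 0.1 threshold.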
Lastly, validating the methylation extent of hub genes derived from gene expression analysis is essential for establishing their potential as epigenetic biomarkers. The majority of the articles use web-based tools for validation. For instance, Luo D et al., 2022 used the MEXPRESS web-based tool to visualize the relationship between gene expression and methylation extent in hub genes related to colorectal cancer [115]. Tong Lin et al., 2021 employed the DNMIVD, SurvivalMeth, and MethSurv tools, respectively, to estimate the correlation between the expression level and methylation density of DEGs, to estimate global methylation, and to find an association between the methylation level of CpG sites of the DEGs and overall survival of hepatocellular carcinoma patients [116]. Another study used the MethHC database to explore the candidate hub gene derived from lung adenocarcinoma patient samples, whose methylation level is negatively correlated with gene expression, affecting the normal functioning of the disease-related pathway [117]. Furthermore, Shijian et al., 2022 found an association between hub gene expression and aberrant methylation data of the disease by using the DiseaseMeth 2.0 database [118]. Further validation can involve single-sample gene set enrichment analysis (ssGSEA), which yields an enrichment score for the hub genes. This analysis is used to investigate the immune infiltration landscape of the sample, which can act as a measure of disease progression [100]. Survival analysis validates hub genes by assessing their impact on patient survival, and the TIMER database is used to analyze the role of immune cells and their correlation with disease progression [99]. Table 1 presents a comprehensive analysis of recent DNA methylation-based publications from the past five years, focusing mainly on the identification of DMRs/DMPs and giving a clear takeaway of the result analysis done in
this section.

Algorithmic approaches for identifying methylation states in a single condition
The genome can be segmented, within a single condition, into distinct methylation patterns, including Unmethylated Regions (UMRs), Low Methylated Regions (LMRs), Fully Methylated Regions (FMRs), and DNA methylation valleys (DMVs). Recently, several R packages based on statistical and AI-based approaches have been used for methylation segmentation, extending beyond array data analysis. HMM segmentation is the most popular method for locating regions whose CpGs share comparable methylation states. The methPipe tool, developed by the Smith Lab, uses a two-state HMM to identify hypo- and hypermethylated regions and to detect allele-specific methylated regions with consecutive methylation values around 50% [124,125]. Malonzo et al., 2023 developed LuxHMM, a probabilistic method and software that uses a hidden Markov model and a Bayesian regression model to segment and infer differential methylation of regions in bisulfite sequencing data [126]. In the same year, the MethyLasso approach was developed for whole-genome datasets, enabling independent analysis of data from different patient conditions and integrating replicates to identify LMRs, UMRs, DMVs, and partially methylated domains (PMDs) by segmenting DNA methylation levels and variation [127]. In cancer studies, genetic alterations such as copy-number variations (CNVs) may affect tumor classification and therapeutic decisions. The R package conumee 2.0, designed for such studies, combines tangent normalization, a genomic binning heuristic, and weighted circular binary segmentation to analyze CNVs using DNA methylation arrays [128]. Applying AI algorithms can increase the efficiency and accuracy of methylation-based classification/segmentation methods. Towards this goal, Liu Y.
et al., 2023 proposed methylClass, an R package that offers an eSVM (ensemble-based support vector machine) model for methylation data classification, improving accuracy and overcoming the long run times of traditional SVM methods. The package also includes novel feature selection methods and multi-omics integration methods, such as the Single-Cell Manifold Preserving Feature Selection (SCMER) method, the JV method performing joint tSNE and UMAP embedding, and Multi-Omics Graph cOnvolutional NETworks (MOGONET) [129].
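The two-state HMM segmentation idea mentioned above can be sketched with a small Viterbi decoder over per-CpG β-values. This is a didactic simplification and not the model of methPipe or LuxHMM: Gaussian emissions, the state means, the standard deviation, and the probability of staying in a state are all assumptions chosen for illustration, and the β-values are invented.

```python
from math import log, pi

def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * log(2 * pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi_two_state(betas, means=(0.1, 0.85), sd=0.15, stay=0.9):
    """Most likely state path; state 0 = hypomethylated, 1 = hypermethylated."""
    log_stay, log_switch = log(stay), log(1 - stay)
    # Uniform prior over the two states at the first CpG.
    scores = [log(0.5) + gaussian_logpdf(betas[0], m, sd) for m in means]
    backpointers = []
    for x in betas[1:]:
        new_scores, pointers = [], []
        for state in (0, 1):
            candidates = [
                scores[prev] + (log_stay if prev == state else log_switch)
                for prev in (0, 1)
            ]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_scores.append(candidates[best_prev]
                              + gaussian_logpdf(x, means[state], sd))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def segments(path):
    """Collapse a state path into (start_index, end_index, state) runs."""
    runs, start = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[start]:
            runs.append((start, i - 1, path[start]))
            start = i
    return runs

# Hypothetical beta-values for ten consecutive CpGs; index 2 is a noisy site.
betas = [0.92, 0.88, 0.45, 0.85, 0.08, 0.12, 0.05, 0.10, 0.90, 0.87]
path = viterbi_two_state(betas)
print(segments(path))
```

Because switching states is penalized, the single intermediate value at index 2 is absorbed into the surrounding hypermethylated segment instead of creating a spurious one-site region.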

Application of ML/DL for methylation array data analysis
Cutting-edge ML/DL techniques have progressively broadened the scope for identifying DMRs/DMCs, offering increased flexibility in detecting patterns within complex, high-dimensional data and yielding numerous potential biomarkers and drug targets [130]. The aim is to identify and characterize genomic regions as methylation features, which are then used to construct prognostic or classification models that assess the prognostic behaviour of the disease and the predictive accuracy of the model. Numerous feature selection methods have been used to discover methylation features that are strongly associated with survival outcomes; they also serve as a dimensionality reduction step. This simplifies the development of the prognostic model, which estimates the risk of disease onset and progression, facilitates the stratification of the population into high- and low-risk groups, and assesses how well the identified key methylation biomarkers predict overall survival. Thus, classifiers are often preferred over purely statistical algorithms for studying the diagnostic and prognostic behaviour of disease biomarkers. This section therefore reviews the research methodologies used to process methylation microarray data with machine learning/deep learning approaches.
For instance, Zheng et al., 2020 demonstrated the ability of a deep neural network (DNN) model to classify cancer origin and predict cancer cell types using 10360 CpG sites from 7339 patients with 18 cancer origins. These 10360 CpG sites were selected using ANOVA and Tukey's honestly significant difference tests [131]. Deep learning algorithms coupled with feature selection methods can reduce data complexity and increase the prediction accuracy of the model. Accordingly, Gomes et al., 2022 introduced an approach that uses a DNN classifier with two feature selection methods: a Wilcoxon rank-sum test to identify the top 685 CpG markers in the 27 K array and the Random Forest algorithm to identify the top 1572 CpG markers in the 450 K array. The selected CpG markers were then used as input to the DNN model, yielding 7 prognostic genes that overlap between the 27 K and 450 K arrays [132]. Using the Wilcoxon rank-sum test for feature selection together with deep learning models can identify key methylation features, enhance predictive accuracy, maintain scalability, and handle non-normal data effectively [75]. Another study by Zhang G. et al., 2021 used both DNA methylation and gene expression datasets to select differentially expressed genes by applying mutual information (MI), along with fold change (FC), the t-test, and the false discovery rate (FDR) test, as feature selection steps. The selected features were fed into the DNN classifier to measure its classification ability and to identify biomarkers for gastric cancer [133]. Using a broader range of statistical methods can potentially improve model robustness, but it can also increase the risk of overfitting when too many features are retained.
Despite extensive research, the clinical implications of DNA methylation in disease prognosis, tumor classification, and survival outcomes remain unclear. Researchers have followed various methodologies to systematically assess the impact of DNA methylation on patients' overall survival, with or without feature selection methods. One such way is to develop a prognostic prediction model that integrates various differential methylation sites derived from high-throughput microarray assay data. For instance, Liu Y et al., 2021 reported an approach for identifying specific CpG sites as methylation features by combining an Epigenome-Wide Association Study (EWAS) performed with the CpGassoc R package and processing of methylation BeadChip assay data with minfi. The selected features were subsequently fed into a Support Vector Machine (SVM) classifier for model training. This model used the β-values of CpG sites derived from EWAS as predictor variables for predicting the diagnosis of Gestational Diabetes Mellitus [134]. While SVMs offer advantages such as handling nonlinear relationships, robustness in high-dimensional spaces, and effectiveness with small sample sizes, this approach has limitations [135]. These include the inability of the minfi package to capture all relevant biological variability and the difficulty of interpreting SVM models [136]. Subsequently, Shu C et
al., 2021 analyzed HIV-positive veterans using an ensemble model (including Random Forest (RF), GLMNET, SVM, and k-nearest neighbours (k-NN)) based on 393 CpG sites derived from a feature selection mechanism, to predict mortality risk [137]. Ensemble methods can lower model variance through bagging and subsampling and increase model robustness, enhancing diagnostic and predictive performance [138]. Additionally, various studies employ the Cox proportional hazards model to examine how various external factors may affect patient survival outcomes. For instance, Xu et al., 2022 demonstrated that the combination of univariate and multivariate Cox models effectively identifies prognosis-related CpG sites. These sites were then categorized into subgroups through consensus clustering to construct prognostic models for lung adenocarcinoma [139]. Also, Guan W et al., 2022 employed the Wilcoxon rank-sum test to assess differences in methylation β-values between thymoma and thymic carcinoma and proposed an approach for identifying candidate methylation sites with potential prognostic impact. For validation, they used univariate Cox regression to find methylation sites closely related to recurrence-free survival (RFS) in thymic epithelial tumors (TETs). However, multivariable Cox regression with forward selection of covariates revealed that only a few characteristic features remained independent prognostic factors for RFS in TETs [140]. Using a similar Cox regression approach, Wu et al., 2021 identified 166 independent prognosis-related CpG sites, which were subjected to consensus clustering to find the cluster with the highest methylation levels at sites associated with the risk scores. Overall, this model can subdivide the cohort into high-risk and low-risk cancer groups, suggesting a poor prognosis for the hypermethylated group [141]. Meanwhile, Peng et al., 2021 employed a beta-mixture model (via the MethylMix package)
for the identification of methylation-driven genes (MDGs). Subsequently, they constructed a prognostic gene panel by combining Cox regression with least absolute shrinkage and selection operator (LASSO) regularization [142]. Additionally, Wang et al., 2021 used Cox proportional hazards models to select an 11-methylation-marker signature related to the overall survival (OS) of patients from 485577 methylation sites, which was then used for nomogram construction. This nomogram significantly enhances the predictive capability of the existing predictor for the OS of patients with stage I-II lung adenocarcinoma [143]. Yin et al., 2021 also described a nomogram model that combines clinical features with the output of a prognostic risk model to provide accurate estimates of patients' long-term survival outcomes. The prognostic risk model was based on 111 differentially methylated CpG sites derived from preliminary steps of clustering into molecular subgroups and DNA methylation analysis of pancreatic samples [86]. Cox regression models offer detailed insights into the impact of DNA methylation on patient survival outcomes, yet they are prone to overfitting on high-dimensional data. Applying feature selection beforehand improves their performance, making it comparable to approaches that integrate machine learning algorithms with feature selection [144].
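The stratification into high-risk and low-risk groups that follows a fitted Cox model can be sketched as a linear risk score with a median cut-off. This is a minimal illustration under stated assumptions: the coefficients are taken as already estimated (for example by a penalized Cox regression), the median is assumed as the cut-off, and the probe names, coefficients, and β-values are hypothetical rather than taken from any cited study.

```python
from statistics import median

def risk_score(beta_values, coefficients):
    """Linear predictor: sum of coefficient times beta-value over the panel."""
    return sum(coef * beta_values[cpg] for cpg, coef in coefficients.items())

def stratify(patients, coefficients):
    """Score every patient and split the cohort at the median risk score."""
    scores = {pid: risk_score(betas, coefficients)
              for pid, betas in patients.items()}
    cutoff = median(scores.values())
    groups = {pid: ("high" if score > cutoff else "low")
              for pid, score in scores.items()}
    return scores, cutoff, groups

# Hypothetical two-CpG panel with pre-estimated Cox coefficients.
coefficients = {"cg_site_1": 1.2, "cg_site_2": -0.8}

# Hypothetical beta-values for four patients.
patients = {
    "P1": {"cg_site_1": 0.9, "cg_site_2": 0.2},
    "P2": {"cg_site_1": 0.3, "cg_site_2": 0.7},
    "P3": {"cg_site_1": 0.6, "cg_site_2": 0.5},
    "P4": {"cg_site_1": 0.2, "cg_site_2": 0.9},
}

scores, cutoff, groups = stratify(patients, coefficients)
print(cutoff, groups)
```

In practice the two groups would then be compared with Kaplan-Meier curves and a log-rank test, and the cut-off would be fixed on the training cohort before being applied to validation data.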
Other sample classification strategies for risk stratification involve multivariate filter-based methods such as PLS-DA (Partial Least-Squares Discriminant Analysis), LDA (Linear Discriminant Analysis), and CFS (Correlation-based Feature Selection), as well as multiple regression analyses such as OPLS-DA (Orthogonal Projections to Latent Structures Discriminant Analysis) and sPLS-DA (Sparse Partial Least Squares Discriminant Analysis). These techniques aid in creating a prognostic classification panel, comprehensively exploring biomarkers and risk factors for disease recurrence [145]. PLS-DA is a supervised counterpart of Principal Component Analysis: a multivariate dimensionality-reduction method that performs feature selection and classification model building to identify biomarkers and to stratify samples into risk groups based on their methylation patterns [146]. An advanced iteration of PLS-DA, termed OPLS-DA, is gaining popularity because it yields more interpretable models by dividing variance into predictive and noise-based parts, which makes models easier to build than with its predecessor. For instance, Agarwal P et al., 2022 employed this classification method to identify the most important metabolites and DMRs, followed by cross-validation to avoid overfitting [147]. The PLS/OPLS-DA model then calculates VIP (variable influence on projection) scores for metabolites based on predictive components, with VIP > 1 or 1.5 used as thresholds for further analysis with linear regression models [146,147]. While effective for DNA methylation data analysis, this computationally intensive method relies on assumptions such as linearity and homoscedasticity, potentially leading to biased outcomes if they are not met in practice. Another supervised method, used by Marie-Claire et al., 2020, is the sPLS-DA algorithm, which merges the feature selection ability of PLS-DA with the predictive strength of logistic
regression, to distinguish between lithium excellent-responders (LiERs) and non-responders (LiNRs) in patients with bipolar disorder type 1. Using sPLS-DA, they identified DMRs by combining Partial Least Squares with Lasso penalization, determined the optimal DMRs via ROC curve analysis, and assessed feature selection stability with bootstrap samples. Leave-one-out cross-validation (LOOCV) was used to evaluate model performance, enabling treatment response prediction based on methylation profiles [148]. While sPLS-DA is effective for binary classification, its direct applicability to multi-class problems or other data types may be limited. Conversely, the CFS algorithm selects attributes based on their usefulness for prediction while minimizing inter-correlation among features to avoid redundancy. It evaluates feature subsets by both predictive ability and correlation, assigning a heuristic merit to each subset, unlike methods that focus solely on individual features. This approach allows heuristic functions to be designed that minimize the cost of reaching the goal [149]. However, a 2023 study demonstrated that mRMR and F-score perform better feature selection for Alzheimer's disease prediction using gene expression data, surpassing the Chi-Square and CFS filters [150]. In the same year, Sharif Rahmani E et al. introduced MBMethPred, an AI-based computational framework utilizing a linear model (LDA) for subgroup classification with 763 medulloblastoma samples. This framework uses LDA for feature selection and an ANN to capture intricate nonlinear relationships between variables, achieving classification accuracy exceeding 96% and utilizing 399 CpGs as prediction biomarkers [151].
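The subset-level heuristic merit used by CFS, described above, can be illustrated with the commonly stated formula merit = k * r_cf / sqrt(k + k(k - 1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below assumes the correlations have already been computed; the probe names and correlation values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def cfs_merit(subset, feature_class_corr, feature_feature_corr):
    """Heuristic merit of a feature subset for correlation-based selection.

    High mean feature-class correlation raises the merit, while high mean
    feature-feature correlation (redundancy) lowers it.
    """
    k = len(subset)
    mean_fc = sum(abs(feature_class_corr[f]) for f in subset) / k
    pairs = list(combinations(sorted(subset), 2))
    mean_ff = (sum(abs(feature_feature_corr[p]) for p in pairs) / len(pairs)
               if pairs else 0.0)
    return k * mean_fc / sqrt(k + k * (k - 1) * mean_ff)

# Hypothetical correlations of three CpG probes with the class label.
feature_class_corr = {"cg_a": 0.80, "cg_b": 0.78, "cg_c": 0.50}

# Hypothetical pairwise correlations; cg_a and cg_b are nearly redundant.
feature_feature_corr = {
    ("cg_a", "cg_b"): 0.95,
    ("cg_a", "cg_c"): 0.10,
    ("cg_b", "cg_c"): 0.10,
}

merit_ab = cfs_merit({"cg_a", "cg_b"}, feature_class_corr, feature_feature_corr)
merit_ac = cfs_merit({"cg_a", "cg_c"}, feature_class_corr, feature_feature_corr)
print(merit_ab, merit_ac)
```

Although cg_b is individually more predictive than cg_c, the pair (cg_a, cg_c) scores higher because its members carry less redundant information, which is the behaviour that distinguishes subset evaluation from ranking individual features.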
To enhance the predictive power and classification accuracy of models developed for patient samples, many studies have employed complex feature selection algorithms rather than traditional methods. This can provide an extensive understanding of the methylation features and their role in the disease. For instance, Wu J et al., 2017 introduced a three-step feature selection method comprising minimum redundancy maximum relevance (mRMR-wrapper method) relying on mutual information theory, differential methylation analysis (filter method), and another wrapper method based on a genetic algorithm. This was integrated with an RF classification model built on the selected candidate probes, which classified the samples into normal lymph node (LN), negative LN metastasis (LN-), and positive LN metastasis (LN+), in order to obtain a biomarker for predicting lymph node metastasis of stomach cancer. The mRMR feature selection method effectively reduced the risk of overfitting in the prediction task, but it may require fine-tuning of parameters, which can be considered a limitation [152]. As previously discussed, Wang et al., 2021 combined prior information on the differentially methylated sites between normal and periodontitis samples with co-expression modules of CpGs (derived from WGCNA analysis) to construct a Support Vector Machine (SVM) classification model with a prediction accuracy of 95.5%. The classifier's high performance on both the training and external datasets indicated that the derived genes had a strong ability to classify periodontitis and provide biological context to features [93]. Instead of integrating feature selection methods, Adeoye et al., 2022 compared ANOVA, mRMR, and LASSO (Least Absolute Shrinkage and Selection Operator) as feature selection techniques for finding DMCs/DMRs as predictive features for machine learning models (SVM, Random Forest, and ExtraTrees), showing that the 11 DMRs
selected through LASSO for the linear SVM model had the best AUC, recall, specificity, and calibration for OSCC detection [153]. From a technical viewpoint, Zhuang, J. et al., 2012 had already reported that classification models built using the Elastic Net and Support Vector Machine (SVM) outperform competing methods such as LASSO and supervised principal components analysis (SPCA) [75]. Still, recent research has made wide use of LASSO as a feature selection mechanism, followed by the construction of SVM, random forest (RF), and deep learning (DL) classification models that show high accuracy in identifying DMCs/DMRs associated with specific diseases, serving as both predictive and diagnostic features. Although SVM and Naïve Bayes are commonly preferred, RF outperformed most algorithms: it handled complex feature interactions and provided high accuracy, stability, and predictive power [144,154]. Building on this robust classification by RF models, Tu et al., 2022 used LASSO as a feature selection mechanism together with RF model construction to select significant methylation features, allowing accurate prediction of cervical cancer occurrence in the samples and supporting the stratification of patient samples into low-risk and high-risk groups [155]. Principal Component Analysis (PCA) is also often chosen as the feature selection algorithm for enhancing model performance. Nguyen et al., 2022 used PCA to find principal components and employed ML algorithms such as deep learning (DL), Support Vector Machine (SVM), and Random Forest (RF) for model construction, resulting in the identification of biomarkers with high predictive value for the biological processes of the samples and the associated disease. This work reports better performance of the deep learning model compared with the other models [156]. Due to its limitations, such as reliance on
original variable data, linear relationship constraints, and oversight of the data's multivariate aspects, PCA is less suitable for complex data analysis. In contrast, methods like t-SNE and UMAP are preferable for their ability to handle non-linear interactions and complexities.
Instead of feature selection methods, numerous studies use feature ranking methods, which can reduce the dimensionality of the data, remove irrelevant or redundant features, and enhance the interpretability and generalization of the model [157]. For instance, recent studies by Jian et al., 2022 and Ren et al., 2022 filtered the methylation probes with the Boruta algorithm, ranked the features using MCFS, LightGBM, and LASSO, and applied incremental feature selection (IFS) with decision tree and random forest algorithms to create six classification models. This helped to extract essential methylation features and to construct efficient classifiers and classification subtyping rules for anal carcinoma, cervical carcinoma, and sarcoma patient samples [158,159]. A similar feature ranking approach has been used to create high-performance classification models that can identify methylation sites and decision rules for COVID-19, lymphoblastic leukemia, and non-Hodgkin's lymphoma. Feature ranking may therefore be a better approach for finding relevant methylation signatures and sample classification rules created by ML classifiers. Its limitations include stringent selection criteria for features, which can lead to the exclusion of relevant features [160][161][162]. Table 2 gives a comprehensive overview of the reviewed articles on classification models using DMRs, tabulating the analysis input, the algorithm used for model construction, and the output of each study.

Annotation and Visualization of DMRs
Annotation is a crucial step in evaluating the biological relevance and functional significance of DMRs. This process involves testing for enrichment of functional annotations within genomic regions that display distinct DNA methylation patterns. Various tools and databases are commonly used for functional enrichment analysis, including GSVA, DAVID, GO, and KEGG pathway analysis. GSVA (Gene Set Variation Analysis) is an unsupervised computational analysis that identifies disease-related molecular pathways associated with DMS [118]. The DAVID database is a reputable choice for researchers seeking integrative and systematic gene annotation. It provides information on biological pathways, protein networks, and gene ontology terms. GO and KEGG pathway analyses aid in identifying signature disease-related genes. GO annotations classify enriched terms into cellular component, biological process, and molecular function categories, while KEGG analysis uncovers relevant molecular and metabolic pathways and interacting networks in the context of the disease [87,90,91,92,94,97,164]. To annotate genes closely associated with methylated sites, one study used the GRCh38 annotation file from the GENCODE project [86]. The STRING database has been used to explore functional proteins and protein-protein interactions (PPI) related to hub genes, contributing to an improved understanding of disease biomarkers [87,91,92]. Subsequently, Cytoscape was employed to visualize the network, and its plugin cytoHubba was used to identify the most significant hub genes within the PPI network [105,165]. Various studies use a range of tools and online databases to analyze hub genes, establishing them as potential disease biomarkers. Databases such as GeneMANIA and miRWalk are used, respectively, to pinpoint genes associated with a predefined list of genes and to map gene-miRNA interaction networks [97,116,164].
Conventional approaches to gene set testing may generate biased P-values owing to variations in gene lengths. For Illumina array-profiled DNA methylation data, methods that adjust for the number of CpGs, rather than gene length, are imperative. MethylGSA addresses this concern by performing gene set testing with adjustment for such biases, enabling the discovery of enriched pathways from prominent databases such as Gene Ontology, KEGG, and Reactome [166]. Notably, the number of CpGs linked to each gene on the 450 K array varies widely, potentially biasing gene set analysis. The gometh function offered by the missMethyl Bioconductor package addresses this by adjusting for the number of CpGs associated with each gene. It takes a vector of significant CpGs as input and applies a hypergeometric test that accounts for the CpG site density per gene on the 450 K/EPIC arrays [167]. The same package also offers the gsameth function, which is designed to assess whether differentially methylated CpG sites are significantly concentrated within gene sets predefined by the researcher. This method systematically evaluates the presence of methylation changes across these gene sets to understand their potential biological impact [168]. In contrast to approaches that focus on identifying individual genes that differ between two states of interest, Gene Set Enrichment Analysis (GSEA) analyzes expression data at the level of gene sets. This offers several advantages, such as enhanced interpretation by identifying pathways and processes, greater reproducibility and interpretability, an enhanced signal-to-noise ratio, and the ability to detect subtle changes in genes within highly correlated sets [169]. Additionally, an advanced iteration of GSEA, known as ebGSEA, was introduced to address the issue of differential probe representation on Illumina Infinium DNA methylation BeadChips. This method prioritizes genes over CpGs, ranking them
based on overall differential methylation levels using all corresponding probes. It offers improved sensitivity and specificity compared to existing methods for EWAS data analysis [170]. Furthermore, GSEA can be adapted for cross-species studies through domain adaptation. This approach, known as the CROSS-species gene set enrichment problem (XGSEP), allows for the analysis of gene expression measured under the same phenotype in different species, which is particularly useful when direct experiments on humans are risky and model organisms such as mice are used instead. The XGSEP method is structured into three stages: GSEA, domain adaptation, and regression [171]. In addition, the GSEA software has been updated to support RNA-seq datasets and single-sample analysis (ssGSEA), expanding its applicability to various biological states and phenotypes [172]. These adaptations enhance the utility of GSEA in modern biological research, allowing for more comprehensive and versatile analyses.
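The hypergeometric over-representation test that underlies the gene set testing discussed in this section can be sketched as follows. This is the plain, unadjusted test: it does not include the correction for the number of CpGs per gene that motivates the array-specific methods described above, and the function name and the counts in the example are illustrative assumptions.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, set_size, n_significant, n_overlap):
    """Upper-tail hypergeometric p-value for gene set over-representation.

    Probability of observing at least `n_overlap` gene-set members when
    `n_significant` genes are drawn without replacement from a universe of
    `universe_size` genes, of which `set_size` belong to the gene set.
    """
    upper = min(set_size, n_significant)
    total = comb(universe_size, n_significant)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical example: 20 genes on the array, a pathway of 5 genes,
# 4 genes flagged as differentially methylated, 3 of them in the pathway.
p_value = hypergeom_enrichment_p(universe_size=20, set_size=5,
                                 n_significant=4, n_overlap=3)
print(p_value)
```

On methylation arrays, genes covered by many probes are more likely to appear in the significant list by chance, so this unadjusted p-value tends to be optimistic for such genes; that is the bias the CpG-number adjustment is meant to remove.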
Visualization of DMS helps to detect inaccuracies in the results, to identify features and explore patterns that are not apparent in tabular outputs, and to investigate the biological processes related to the genomic data in a comprehensible way, allowing the researcher to form hypotheses about the research outcome [173]. Heatmaps use color variation to display different variables, including hypermethylated and hypomethylated CpG sites, while scatter plots illustrate the association between variables such as methylation level and gene expression, or methylation levels at specific CpG sites [174,175,176]. Volcano plots are a form of scatter plot used to visualize and identify DMRs between the studied groups [111,177]. Box plots allow the user to relate sample tissue to methylation value [178,179,180]. A violin plot combines a box plot with a kernel density estimate and has been used to display CHH/CG/CHG methylation levels in specific DMRs [181,182,183]. In addition, the UCSC Genome Browser is available for visualizing particular genome annotations and for analyzing and comparing genomic datasets [184]. The Ensembl database offers reliable genome annotations and tracks gene evolution across species. It also allows related biological data to be mapped onto features derived from the genome [185]. Some web-based applications, such as MethSurv, are designed to be user-friendly and provide visualization of CpG sites, functional analysis, graphical parameters, and survival correlations using Cox proportional-hazards models [186,187,188].
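As an illustration of what a volcano plot encodes, the sketch below converts per-site effect sizes and P-values into plot coordinates and a hyper/hypo/non-significant label. The function name, the thresholds (an absolute methylation difference of 0.2 and a P-value cutoff of 0.05), and the example values are assumptions chosen for the example, not recommendations from the reviewed studies.

```python
from math import log10


def volcano_points(delta_beta, p_values, min_delta=0.2, alpha=0.05):
    """Return (x, y, label) per site for a volcano plot.

    x is the methylation difference between groups, y is -log10(P), and the
    label marks sites that pass both the effect-size and significance cutoffs.
    P-values must be strictly positive.
    """
    points = []
    for d, p in zip(delta_beta, p_values):
        if p < alpha and d >= min_delta:
            label = "hyper"   # significantly more methylated
        elif p < alpha and d <= -min_delta:
            label = "hypo"    # significantly less methylated
        else:
            label = "ns"      # not significant or effect too small
        points.append((d, -log10(p), label))
    return points


# Hypothetical values for three CpG sites.
for point in volcano_points([0.30, -0.25, 0.05], [0.001, 0.01, 0.5]):
    print(point)
```

The resulting tuples can be passed to any plotting library; sites labeled "hyper" or "hypo" occupy the upper right and upper left corners of the plot, respectively.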

Discussion
DNA methylation is one of the earliest and most significant heritable events among the epigenetic marks of the genome; it is associated with gene regulation as well as with the development and progression of underlying disease [157]. Genome-wide methylation profiling has attained widespread popularity for the identification of epigenetic biomarkers (episignatures) that act as predictive tools in clinical studies. This analysis also paves the way for classifying diseases by molecular subtype, guiding treatment choices, and ultimately managing patients' overall life expectancy [181]. A variety of computational tools and algorithms are available for processing and analyzing DNA methylation profiling data, as detailed in several review articles [15,71,89]. This review provides a comprehensive and consolidated overview of diverse aspects of array-based methylation data analysis within a single resource, highlighting the methodologies and workflows that researchers currently follow to find dysregulated methylation sites. In this study, we have outlined existing tools and workflows, evaluated their primary strengths and limitations, and proposed a selection of algorithms that we believe currently offer the most effective approach for analyzing DNA methylation microarray data.
In terms of data sources, public repositories have been identified as the preferred choice among academics and practitioners for analyzing DNA methylation array data. However, given the frequent limitations of current datasets, such as class imbalance and missing data, many researchers also rely on additional data acquired from hospitals and clinics. To assist future researchers and practitioners interested in analyzing DNA methylation array data, we have curated a list of commonly referenced datasets, detailing their origins, sample sizes, representativeness, and selection criteria, in Supplementary Table II. Additionally, Supplementary Table I contains links to and descriptions of public repositories.
The choice of pre-processing method is of utmost importance, as it can drastically affect between-sample variability and the results of the analysis [189]. Most workflows recommend the minfi package for pre-processing array data; other frequently used algorithms are listed with details in Supplementary Tables III and IV. Global changes can be identified by visually inspecting methylation data, which can be achieved through the clustering and dimensionality reduction methods outlined in the referenced studies, such as PCA, hierarchical clustering, K-means clustering, and consensus clustering. We highly recommend consensus clustering (model-based clustering method) and recursive partitioning algorithms, as they are effective for processing high-dimensional data. These methods form distinct methylation subtypes that help classify diseases, which are then analyzed in terms of clinical and molecular traits. Additionally, consensus clustering represents the consensus across multiple runs of a clustering algorithm and evaluates cluster stability under random restarts [88,101].
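The idea of representing consensus across multiple clustering runs can be sketched as a consensus matrix: for every pair of samples, the fraction of runs in which both were clustered together, among the runs in which both were sampled. The Python sketch below assumes the cluster labels from each run are already available (for example, from repeated K-means on subsampled data); the function name and the toy labels are hypothetical.

```python
def consensus_matrix(runs):
    """Pairwise co-clustering frequency across clustering runs.

    runs: list of label lists, one per run, all of the same length.
          A label of None means the sample was left out of that run's subsample.
    Returns an n x n matrix with entries in [0, 1].
    """
    n = len(runs[0])
    together = [[0] * n for _ in range(n)]  # runs where i and j share a cluster
    sampled = [[0] * n for _ in range(n)]   # runs where i and j were both drawn
    for labels in runs:
        for i in range(n):
            if labels[i] is None:
                continue
            for j in range(n):
                if labels[j] is None:
                    continue
                sampled[i][j] += 1
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical runs over four samples; cluster IDs need not match across runs.
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, None]]
for row in consensus_matrix(runs):
    print([round(value, 2) for value in row])
```

Entries close to 0 or 1 indicate sample pairs whose assignment is stable across runs, whereas intermediate values flag unstable pairs. Applying hierarchical clustering to this matrix would then yield the final subtypes, a step omitted here for brevity.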
Downstream analysis involves identifying differentially methylated regions across biological conditions using tools such as the limma package and the ChAMP pipeline, which are known for their efficacy in array-based methylation data analysis. Renowned for its efficacy in gene discovery, limma excels in differential analysis of methylation array, expression microarray, and RNA-seq data. ChAMP provides a comprehensive analysis, including batch effect correction, differential methylation, copy number variation adjustments, cell type heterogeneity management, network analysis, and an interactive GUI [24,101]. We also suggest considering linear regression models coupled with deep learning algorithms to elucidate the intricate associations between DNA methylation signatures and gene expression data, offering insights into clinical variables [190]. Additionally, Cox regression models are highly recommended for identifying prognostic CpG sites associated with disease, offering superior estimates of survival probabilities and cumulative hazards compared with the Kaplan-Meier function [191]. Furthermore, we advocate the use of analytical tools such as MEXPRESS, DNMIVD, SurvivalMeth, MethHC, the DiseaseMeth 2.0 database, ssGSEA, and TIMER to validate biomarker gene methylation and to explore the correlation between gene expression and methylation levels, which are crucial components of a comprehensive methylation analysis workflow.
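For reference, the Kaplan-Meier function mentioned above can be written in a few lines. The sketch below is a minimal, covariate-free estimator intended only to show the baseline that Cox regression extends with predictors such as CpG methylation levels; the function name and the toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time for each patient
    events : 1 if the event (e.g., death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored patients both leave the risk set
    return curve


# Five hypothetical patients; two are censored (event = 0).
for t, s in kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0]):
    print(t, round(s, 3))
```

In a typical workflow, patients would first be split into groups (for example, high versus low methylation at a candidate CpG site) and one curve estimated per group, whereas a Cox model estimates the effect of methylation on the hazard directly and can adjust for clinical covariates.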
Despite the complexities of disease mechanisms and symptoms, our review highlights the utility of ML/DL algorithms in enhancing the efficiency of disease diagnosis and prognosis. We support the use of Deep Neural Network (DNN) models, complemented by robust feature selection techniques, for partitioning DNA methylation profiles into distinct regions based on observed methylation patterns. This approach is advantageous for capturing complex and non-linear patterns within high-dimensional datasets, thereby enhancing predictive accuracy. The reviewed studies also commonly use Support Vector Machines (SVM), which are particularly prominent, along with K-Nearest Neighbors (KNN), Random Forests (RF), Deep Learning (DL), and Decision Trees (DT), all of which are extensively employed in disease diagnosis research [192]. The feature selection methods observed in the referenced articles generally comprise Principal Component Analysis (PCA), the Least Absolute Shrinkage and Selection Operator (LASSO), and sometimes more intricate approaches such as Minimum Redundancy Maximum Relevance (mRMR), as well as feature ranking methods such as Monte Carlo Feature Selection (MCFS), Light Gradient Boosting Machine (LightGBM), and Incremental Feature Selection (IFS). The choice among these methods is influenced by sample size, the biological context of the study, computational limits, and the aim of enhancing predictive accuracy.
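Of the classifiers listed above, K-Nearest Neighbors is the simplest to write out in full. The sketch below classifies a sample from its methylation beta values at a handful of CpG sites that are assumed to have been chosen already by a feature selection step; the function name, the labels, and the beta values are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training samples, using Euclidean distance between beta-value vectors."""
    neighbours = sorted(
        zip(train_X, train_y), key=lambda pair: dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical beta values at two selected CpG sites for four training samples.
train_X = [[0.10, 0.20], [0.15, 0.25], [0.80, 0.90], [0.85, 0.80]]
train_y = ["normal", "normal", "tumor", "tumor"]
print(knn_predict(train_X, train_y, [0.82, 0.88], k=3))  # tumor
```

Because distances become less informative as the number of features grows, a classifier of this kind depends heavily on the preceding feature selection step, which is one reason the two are discussed together in the reviewed studies.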

The effective role of algorithms in the diagnosis of some closely related human diseases
This paper provides a survey of different R packages, statistical algorithms, and machine learning techniques for the diagnosis of different diseases, including numerous cancer types, varying tumor types, atherosclerosis, dementia, diabetes, high blood pressure, periodontitis, acute respiratory syndrome, Alzheimer's disease, schizophrenia, coronary artery disease, HIV, bipolar disorder type I, and knee osteoarthritis. The analytical and computational pipelines used to analyze two similar diseases may differ according to the available data, disease characteristics, and the specific objectives of the analysis. The reviewed literature focuses predominantly on various cancer types, highlighting the use of the limma or ChAMP pipeline for identifying diagnostic CpG sites. These studies also employ WGCNA, HLM, and the Mann-Whitney U test to explore biological function and obtain accurate estimates of methylation differences. Moreover, in cancer research, univariate and multivariate Cox regression analyses are effectively used to identify prognostic factors for developing nomograms that predict patient survival [193]. For the diagnosis of cancer-related diseases, other recommended machine learning (ML) and deep learning (DL) algorithms include DNN, CNN, ANN, RF, and SVM classification models. These models are often enhanced by feature selection algorithms such as MCFS, LightGBM, IFS, ANOVA, mRMR, and LASSO to improve diagnostic accuracy [194]. Similarly, for tumor diagnosis, the MBMethPred package, which integrates ML and neural network models, is suitable for subgroup classification, whereas univariate and multivariate Cox models have demonstrated satisfactory prognostic accuracy. Clustering methods such as RPMM, PCA, tSNE, consensus clustering, NMF, and hierarchical clustering are recommended for cancer diagnosis, with consensus clustering offering robust predictions and K-means excelling in high variance scenarios with identical centroids [195,196]. For neurological disorders, algorithms such as Comb-p and DMPfinder are effective in identifying DMRs as diagnostic biomarkers and are often used in conjunction with linear regression models and feature selection methods such as mRMR and F-score. For immunological diseases, an ensemble approach combining RF, GLMNET, SVM, and k-NN is recommended for robust diagnostic modeling, complementing traditional methods. A schematic of the diseases diagnosed with this methodological framework (including all packages, algorithms, and ML models) is shown in Fig. 3.

The effective role of optimal algorithms in the diagnosis of specific diseases
ML predictive tools enable proactive disease diagnosis and risk evaluation, often before symptoms emerge. Breast cancer, being among the most frequently diagnosed malignancies, has witnessed a surge in research efforts, particularly in the application of ML algorithms to facilitate early detection. ML algorithms such as DNN, LR, RF, SVM, KNN, and DT are therefore extensively used in breast cancer research. Traditional methods, including Cox regression models, HLM, differential analysis, the Mann-Whitney U test, the t-test, and ANOVA, continue to be pivotal for analyzing the diagnostic patterns of breast cancer [197,198]. Analyzing microarray datasets with ML/DL algorithms can help pinpoint key protein biomarkers for early pancreatic cancer detection. Notably, Cox regression models, SVM classifiers using Recursive Feature Elimination (RFE), and Artificial Neural Network (ANN) methodologies are among the most effective for this diagnosis [199]. XGBoost and deep learning models, particularly CNN and DNN models, have demonstrated high accuracy in the early diagnosis of esophageal cancer [200]. Moreover, studies support the incorporation of WGCNA for gaining biological insight into the disease. In cervical cancer research, the integration of traditional and machine learning algorithms, such as differential methylation analysis (using limma), LASSO, and Boruta feature selection, coupled with DT and RF classifier models, is a prevalent way of identifying diagnostic biomarkers [201]. DL models, logistic regression, and Cox regression models have been thoroughly explored for integration into Computer-Aided Detection (CAD) systems for the automated detection and classification of lung cancer [202]. In the context of Alzheimer's disease, the Comb-p algorithm has been used for DMR analysis, while Singular Value Decomposition (SVD), PCA, mRMR, and F-score have been used for feature selection, with a CNN serving as a classifier to enhance disease prediction [203].

Limitations and future prospects
We acknowledge certain inherent limitations of this review, arising from topics that could not be covered within its time and scope constraints. First, several algorithms give importance only to differentially methylated CpG islands, overlooking distal regulatory element regions of the genome, which are believed to provide crucial support for biomarker investigation. Second, correlation analysis between the risk score obtained from a diagnostic model and clinical characteristics, the generation of copy number data, single-cell methylation analysis, and the tumor microenvironment are areas of epigenetic research that are not covered in this review. Additionally, the discussion of computational pipelines and algorithms specific to sequence-based methylation data falls outside the scope of this review. Looking ahead, there is ample opportunity to broaden this study to include multi-omics approaches, which would integrate DNA methylation studies with other epigenomic data, thereby enriching the avenues for disease treatment, diagnosis, and prognosis.
Beyond the methodologies covered within this review, a range of additional statistical algorithms deserve attention. Each of these algorithms has the potential to reshape and elevate epigenetic research by offering fresh perspectives and insight. For example, graph theory-based analysis, which can elucidate biological pathways by integrating DNA methylation, gene expression, and other omics data, has been incorporated in classification studies and has been reported to be superior to other algorithms. This graph-based approach can also deal with the complexity of microarray data and its large number of features (genes) while annotating them with samples [204]. Additionally, Bayesian approaches hold the potential to address inter-cellular methylation heterogeneity and counteract the sparsity often observed in methylation data. Furthermore, the dependence of methylation function on changing environmental conditions can be captured through longitudinal analysis of methylation data, which predicts CpG sites with different stabilities, states, and functionality [205]. Building upon these innovative algorithms and the methodologies outlined in this review, we are equipped with a robust resource that can guide the selection of effective protocols for DNA methylation array data processing and analysis. DNA methylation is a stable and reliable biomarker for disease diagnosis, as it does not change as easily as RNA or protein levels. Despite this, the widespread clinical integration of most candidate genes into molecular diagnostics remains distant. The general limitations faced by computational algorithms concern the complexity of the data, the high variability in datasets relative to the number of samples, non-linear associations in the data, and the validation methods used to control the error rate. Such limitations can influence the accuracy of the analysis and subsequently affect the identification of effective epigenetic biomarkers. To overcome these limitations, a promising approach is to integrate existing workflows into flexible algorithms and pipelines that improve the detection and annotation of methylated sites, handle sample variability, and identify subtle genomic changes.
In the future, we believe that epigenetic research can be advanced by deep learning techniques, which can significantly enhance predictive modeling by capturing intricate features linked to aging, cell types, and disease progression. As a result, they hold the potential to offer valuable diagnostic and prognostic clinical outcomes. Furthermore, flexible algorithms are needed that can segment genomic regions with different methylation patterns, such as stretches in which certain disease-related genes (e.g., tumor suppressor genes in cancer) are silenced. Because computational methods can harness heterogeneous data across several dimensions of biological variation, several developments have already been made toward predicting clinical outcomes more accurately and obtaining a comprehensive view of regulatory landscapes. One such development is the simultaneous profiling of histone modifications and DNA methylation from a single DNA molecule using nanopore sequencing [206]. The resulting Oxford Nanopore Technologies (ONT) data require robust and user-friendly bioinformatics software that provides cloud storage and real-time analysis [207]. Further, the investigation of epigenetic markers to find reliable correlations with the living phenotype is driving improvements in epigenetic editing methods (including established tools like CRISPR-Cas9). This, in turn, calls for computational frameworks for developing machine-learning-based models that incorporate multi-dimensional data features, genetic variation data, and chromatin accessibility data to predict off-target as well as on-target effects [208,209]. Additionally, existing research has integrated linear regression algorithms with methylation profiling to explore correlations between environmental factors, or the exposome, and changes in DNA methylation, at least at 'metastable epialleles' that reflect the developmental environment [210]. This work explores ways of dealing with limitations of existing computational algorithms, such as bias arising from relatedness among samples and unreliable statistical tests when the numbers of cases and controls differ substantially. Moreover, the need for robust bioinformatics pipelines and algorithms may increase with the growing production of whole-genome sequencing (NGS) data used to support the analysis of methylation profiling data for biomarker and SNP estimation. The integration of single-cell epigenomics techniques with transcriptome and epigenomic sequencing data will also necessitate a significant evolution of computational pipelines and strategies, because it must deal with the functional ramifications of uncommon DNA modifications, the presence of rare cell types, and intercellular heterogeneity [211]. By revealing specific molecular and genetic variations at the cellular as well as the genomic level, the directions discussed here can support the creation of novel algorithms and workflows that enhance the efficacy of targeted disease therapy. We strive to create novel and enhanced computational approaches that will advance personalized healthcare and precision epigenetic therapy.

Conclusion
The diversity of measurable epigenetic markers enables the use of epigenetic events as early indicators of human disorders and provides mechanistic clues to disease etiology. The quest to identify epigenetic biomarkers is propelling advances in the medical field, leading toward a paradigm in which personalized treatment strategies significantly enhance patient care, diagnostic ability, and prognostic accuracy. This requires a range of methods for array processing and analysis, in which optimized computational algorithms address crucial aspects such as disease detection accuracy and effective treatment. It is essential to establish a protocol of best practices for the algorithms and packages mentioned earlier in the review, so that research outcomes are of the highest quality and remain relevant to diverse research hypotheses. The recommended methodologies and algorithms were selected based on their computational demands relative to available resources, their user-friendliness, and better validation of results. The goal of this study is to examine the range and efficacy of the algorithms and workflows used in recent research, with a focus on their accuracy in disease prediction through biomarker identification, classification, clustering, and survival analysis. When designing a pipeline for DNA methylation array analysis, it is crucial to incorporate all the major processing and analysis steps outlined previously. An efficient computational workflow uses traditional statistical methods to categorize groups based on methylation profiles, followed by machine learning (ML) techniques to enhance the analysis and ensure prediction accuracy. For DNA methylation data pre-processing and differential analysis, it is advisable to employ the minfi package, complemented by traditional statistical methods such as the empirical Bayes approach (limma), Bumphunter, and DMRcate for their proven effectiveness. Moreover, consensus and recursive partitioning clustering algorithms excel in detecting methylation patterns and characterizing samples. The ML algorithms showing the highest accuracy for disease diagnosis are DNN, RF, and SVM classification models. Furthermore, the combination of Cox regression with logistic regression models offers superior predictions of disease prognosis and patient survival rates. Despite differences in frequency of use and performance metrics, this article shows the promising potential of the discussed algorithms and workflows in disease prediction. Through proper use of the processing and analysis methodologies outlined above, we hope that potential users will obtain the best possible outcomes from array-based data, leading to rapid advancement in human health and disease research.

Reviewer disclosure
Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.
For instance, Consensus clustering is employed by Wang et al., 2021 for subgrouping prognostic methylated CpG site into four methylation clusters, reflecting

Fig. 2. General concept showing the flow of DNA methylation profiling data from experimental methods to data repositories and the provision of DMR analysis algorithms [212].

Fig. 3. Schematic of some related human diseases and the algorithms found effective for their diagnosis.

Table 1
Summary of studies reviewed focused on exploring methods and algorithms for the identification of DMPs/DMRs for different types of breast cancer and other diseased genome profiles.

Table 2
Comprehensive summary of the Machine learning and Deep Learning algorithms utilized in the reviewed studies for classification models, emphasizing their significance in the research.
[131] Highly robust nature Cons: This model is[131](continued on next page) K. Sahoo and V. Sundararajan