
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
![]()

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
Adegoke Oluyemi Borisade 1 , Taiwo Fele 2
1Centre for Energy Research and Development, Obafemi Awolowo University, Ile Ife, Nigeria
2Computer Science Department, The Federal Polytechnic, Ado Ekiti, Nigeria
Abstract - Reinforcement learning (RL) is powerful for complexcontroltasksbutoftenlackstransparency,limitingits useinhigh-stakesreal-worldapplications.Tobridgethisgap, we propose eXActor-Critic, an explainable RL framework basedontheActor-Criticarchitecture,balancingperformance andinterpretability.TheframeworkfeaturesaDual-PathGRU Network, combining Bidirectional GRUs (BiGRUs) for longterm dependencies and Unidirectional GRUs (UniGRUs) for short-term adaptability, with Dynamic Mode Switching to optimize reward-based learning in non-stationary environments. For interpretability, eXActor-Critic employs attention-based saliency maps to identify critical state variablesandPCA/t-SNEvisualizationstorevealhidden-state dynamics.TestedonamodifiedCartPole-v1environmentwith controlled non-stationarity, eXActor-Critic outperformed a traditional Actor-Critic baseline by 154.4% in mean reward (28.409vs.11.167),withstatisticalsignificanceconfirmedvia t-Test (p ≈ 3.03E-159) and Mann-Whitney U Test (p ≈ 8.83E193). Stability mechanisms like Layer Normalization, Automatic Mixed Precision, and Z-score Advantage Normalization ensured robust training. Key contributions include a transparent RL framework with real-time explainability, a dual-path GRU architecture for adaptive learning, and visual interpretability tools (saliency maps, dimensionality reduction). eXActor-Critic advances trustworthyRL,withapplicationsinrobotics,healthcare,and finance, while future work focuses on scaling to larger environments and optimizing RNN architectures.
Key Words: Explainable Reinforcement Learning, ActorCritic,Dual-PathGRU,InterpretableAI,SaliencyMaps,NonStationaryEnvironments.
Reinforcement learning (RL), a machine learning paradigm where agents learn optimal control actions by interactingwiththeirenvironments,hasachievedimpactful results across a wide range of fields. It enables early fault detection for predictive maintenance and scheduling in industrialsystems[1],dynamicpruningandinferencingfor efficient healthcare IoT systems [2], energy-efficient and damage-recovery locomotion in adaptive robotics [3], systemic risk reduction in interbank networks [4], personalized medicine through optimized treatments includingpharmacologicalanemiamanagement[5],resilient energygridsthroughnetworkreconfiguration[6],collision
avoidance and autonomous navigation in UAVs [7] and surface vehicles [8], transparent decision-making in autonomousdrivingandhuman-in-the-loopsystems[9],[10], optimized debt recovery models [11], multi-intersection signal control to reduce traffic congestion [12], flood risk mitigationinurbandrainagesystems[13],adversarialattack detectionforsecureplanetarylandings[14],andsustainable manufacturing through real-time demand response [15]. These applications highlight RL's versatility in addressing complex,real-worldchallenges.
Despite its successes, RL lacks explainability and interpretabilityfeaturesduetotheopaquedecision-making processes of its complex deep neural networks, which obscure how actions are derived from states and rewards [16]. This lack of transparency limits trust and hinders deployment in some critical applications like healthcare, robotics, and autonomous systems, where safety and reliability are paramount [16,17]. Additionally, achieving efficient interpretability without compromising computationalperformanceremainsachallenge,particularly in real-time systems [18]. The complexity of policy representations in high-dimensional or continuous action spacesfurthercomplicatesunderstandingtherelationships betweenstates,actions,andrewards[17].Thesechallenges underscoretheneedforenhancementstomakeRLsystems moretransparent,interpretable,andefficientfortrustworthy decision-making.
Current approaches to achieving interpretability in reinforcementlearningexhibitseveralkeylimitationsthat hinderreal-worlddeployment.SelectiveParticleAttention [19] enables feature selection but struggles with highdimensional spaces and may miss critical features in dynamicenvironments.Dynamicstaterepresentations[20] show promise for automated driving yet depend on predefined structures that limit adaptability to novel scenarios. Model-based approaches [21] offer interpretability but demand substantial computational resources, making them impractical for real-time applications. Layer-wise Relevance Propagation [17] provides explanations but is noise-sensitive and lacks temporal coherence. Policy distillation methods [18] risk oversimplifying complex policies during compression. Blockchain-oriented RL [22] handles parameter configurationwellbutstruggleswithmulti-agentdynamics and environment complexity, while photovoltaic frameworks[23]requireextensivedatasetsandshowsless

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
optimalperformanceinhandlingnonlinearity.Notably,jointaction intrinsic reward approaches [35] demonstrate effectivemulti-agentcooperationbutlackgeneralizabilityto single-agentscenariosandisnotdesignedtoadapttononstationaryconditions.
Inresponsetothesechallenges,weintroduceExActor-Critic Model-4, an explainable RL framework that builds upon Actor-Criticfoundationswithseveralusefulmodifications:
1. Dual-Path GRU Network:CombinesBidirectional GRUs (BiGRU) for long-term dependency capture withUnidirectionalGRUs(UniGRU)forshort-term responsiveness, enabling effective temporal processingacrossvaryingtimehorizons.
2. Automated Optimization System: Integrates Optuna-based hyperparameter tuning with Layer Normalizationtomaintainconsistentgradientflow across both GRU pathways, stabilizing training whilepreservingmodelperformance.
3. Dynamic Mode Switching: AdaptsGRUselection basedonrewardimprovementthresholds(≥+5%), automaticallyoptimizingarchitectureconfiguration fortask-specificperformance.
4. Multi-Modal Interpretability:Combines:
o Interpretable attention mechanisms that guide both Actor and Critic networks to prioritizerelevantfeatures
o Dimensionalityreduction(PCA/t-SNE)for decisiontrajectoryvisualization
o Saliency maps for state variable importanceanalysis
5. Enhanced Stability Mechanisms:
o Automatic Mixed Precision (AMP) for efficientgradientscaling
o Z-scoreAdvantageNormalizationforstable policyupdates
o Validated robustness against observation noise, periodic perturbations, and nonlineardragforces
Unlikedomain-specificsolutions[20,24],ourframeworkis generalizableacrossdiverseapplicationswhileaddressing the computational limitations of prior work [25]. The selectiveGRUactivationreducesinferenceoverheadby37% comparedtoconventionalarchitectures,enablingreal-world deploymentinnon-stationaryenvironments.
Technical Contributions and Evaluation
ExActor-Critic'sevolutionarydevelopmentcomprisesfour progressivevariantssummarizedinTable1.
Experimental validation on modified CartPole-v1 demonstrates:
154.4%meanrewardimprovementoverbaseline
Actor-Critic
42% faster convergence compared to baseline
Actor-Criticsolution[20]
68% reduction in performance variance under dynamicconditions
Theseoutcomessuggesttheproposedframeworkmayoffer advantages particularly in challenging, non-stationary environments where existing approaches [17,19,22] and similartypicallymaybelesssuitable.
Table -1: ExActor-Critic'sevolutionarydevelopment
Model Version Key Features Advancement Over Prior Work
Model-1 GRUs+Attention Mechanisms IntroducedBase Interpretable Architecture
Model-2 Optuna hyperparameter optimization
Addressedscalability limitations[25]
Model-3 BidirectionalGRUs Improvedlong-term dependencycapture
Model-4 Fullstabilitypackage +dynamicswitching Solvednonstationary adaptation[21,35]
Reinforcementlearning(RL)hasemergedasapowerful paradigmfordecision-makingincomplexenvironments,yet itswidespreadadoptioninreal-worldapplicationsremains hindered by the "black-box" nature of modern deep RL systems[16].WhilerecentadvancesinexplainableRL(XRL) have made significant progress through three main approaches;quantifiableinterpretabilitymetrics[26],selfexplaining architectures [27], and post-hoc explanation methods [28]; critical gaps remain in developing frameworks that balance performance, transparency, and generalizability.Thissectionreviewstheseadvanceswhile positioning our ExActor-Critic framework as a comprehensive solution that addresses key limitations of existingapproaches.
TheActor-Criticarchitecturehasbecomethecornerstone ofmodernRLduetoitsstabilityadvantagesoverpurepolicy gradientmethods[31].Recentinnovationshavefocusedon improvingsampleefficiencythroughexperiencereplay[32, 33]andenhancingrobustnessviaconstrainedoptimization techniques[30].However,asnotedin[26],theseadvances have largely overlooked the critical need for built-in explainability, a limitation that becomes particularly problematicinsafety-criticalapplicationslikeautonomous navigation [8] or medical systems [5]. Our ExActor-Critic framework directly addresses this gap by embedding interpretabilitymechanismsateachleveloftheActor-Critic architecture,fromthepolicynetworktothevaluefunction approximation, while maintaining the stability benefits of traditionalimplementations[29].
Current XRL methods can be broadly categorized into three complementary approaches, each with distinct

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
advantages and limitations. Feature importance methods like attention mechanisms [34] and saliency maps [19] provide localized explanations but often fail to capture temporal dependencies in sequential tasks. Architectural approachessuchasinterpretabledecision trees[27]offer globaltransparencybuttypicallysacrificeperformanceon complextasks.Post-hoctechniquesincludingcounterfactual explanations [28] and state space visualizations [41] are flexible but computationally expensive. Our work bridges theseapproachesthroughanovelcombinationofattentionbased feature importance, architectural transparency via GRU-based state representations, and efficient post-hoc visualization achieving comprehensive explainability without compromising on task performance, as demonstratedinourexperimentalresults.
The field has increasingly relied on standardized environmentslikeOpenAIGymforreproducibleevaluation [38]. While this has enabled meaningful comparisons of algorithmicperformance[39],currentbenchmarksoftenfail to adequately assess explainability a critical oversight giventhegrowingemphasisontrustworthyAIsystems.Our evaluationprotocolextendsbeyondconventionalmetricsby introducing quantitative measures of interpretability alongsidetraditionalperformanceindicators.Furthermore, ourmodifiedCartPole-v1environmentwithcontrollednonstationarityprovidesamorerigoroustestbedforassessing robustness in dynamic conditions compared to standard benchmarks[40,41].
Theuseofrecurrentarchitectures,particularlyGRUs,has become prevalent for tasks requiring temporal state representation [20]. While effective in domains like autonomous driving [20] andenergy systems [6], existing implementationssufferfromtwokeylimitations:(1)their unidirectional nature restricts comprehensive state representation,and(2)they lack integratedexplainability features.Ourdual-pathGRUdesigninnovatesbeyondthese approaches by combining bidirectional processing for complete temporal context with built-in attention mechanismsforreal-timeinterpretability.Thisarchitecture not only outperforms conventional GRU implementations [14,23]intermsoftaskperformancebutalsoprovidesthe explanatorycapabilitiesneededfordeploymentinsensitive applications.
Whileacknowledgingthevaluablecontributionsofprior workascapturedinTable2,ourExActor-Criticframework explores new possibilities for enhancing the reliability of reinforcement learning systems.. The subsequent sections detail how our integrated approach to performance and explainabilityachievessuperiorresultscomparedtoexisting methods,asdemonstratedthroughcomprehensiveempirical evaluation.
Table -2: NoveltyofExActor-CriticComparedtoPrevious Works
Reference Key Contribution ExActor-Critic's Novelty
[19]
Selectiveparticle attentionforfeature selectioninDRL
[34] Generalized attention-weighted RLfor interpretability
[29] Adaptivebiasvariancetrade-offin advantage estimationforActorCriticalgorithms
[20] Dynamicand interpretablestate representationfor autonomousdriving usingGRUs
[10] HybridRLandrulebasedcontrollersfor transparent decision-makingin autonomousdriving
[42] Curriculumlearning withHindsight ExperienceReplay forsequentialtasks
Integratessaliency mapsandattention mechanismsfor deeper interpretability
Usesbidirectional GRUsandautomated hyperparameter optimizationfor enhanced performance
Introduces progressivemodel variants(Model-1, Model-2,Model-3)for scalabilityand efficiency
Generalizesto broaderRLtasksand addsdimensionality reductionforbetter visualization
Focusesonend-toendexplainability withoutrelyingon rule-basedsystems
Usesbidirectional GRUstocapture dependenciesfrom bothpastandfuture states
[25] Parameter-exploring policygradientsfor RL Introducesautomated hyperparameter optimizationusing Optunaforscalability
[24] Continuousaction policyforlimitorder tradinginDRL
Designedforgeneral RLtasks,notlimited tospecificdomains likefinance
[43] ModularDRLfor robotnavigation usingrewardand punishment IntegratesGRUs, attention mechanisms,and bidirectionalGRUsfor temporalmodeling

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
Thissectionestablishesthemathematicalandconceptual foundationfortheExActor-Critic framework,startingwith ReinforcementLearning(RL)asasequentialdecision-making problemformalizedbyMarkovDecisionProcesses(MDPs), where an agent maximizes cumulative rewards through environmentinteractions.ItintroducesActor-Criticmethods, combining policy-based and value-based approaches, alongsidekeyconceptslikevaluefunctions,policygradients, andtemporaldifferencelearning,formingthebasisforthe ExActor-Criticframework.



ReinforcementLearning(RL)isaframeworkforsolving sequentialdecision-makingproblems.Itisformallymodeled as a Markov Decision Process (MDP), defined by the tuple , :Thesetofallpossiblestates, :Thesetofall possible actions, : The transition probability function, ,whichdefinestheprobabilityoftransitioningto state from state after taking action , : The reward function, , which specifies the immediate reward received after transitioning from state to state due to action , :Thediscountfactor, ,whichdetermines the importance of future rewards relative to immediate rewards.


The goal of RL is to find a policy , which is a mappingfromstatestoactions,thatmaximizestheexpected cumulativediscountedreward:


Here, isthevaluefunctionunderpolicy ,representing the expected cumulative reward starting from state and following policy . The symbol denotes the expectation operator, which computes the average value of a random variableoveritsprobabilitydistribution.
TheBellmanequationforthevaluefunctionis:

Similarly, the action-value function represents theexpectedcumulativerewardstartingfromstate ,taking action ,andthenfollowingpolicy :

The optimal policy is the policy that maximizes the valuefunctionforallstates:
The Bellman optimality equation for the optimal value function is:

Actor-Critic methods are a class of RL algorithms that combine the strengths of policy-based and value-based approaches.Theyconsistoftwomaincomponents,Actorand Critic.

Actor:Theactorisresponsibleforselectingactionsbased on the current policy , where represents the parameters of the policy. The actor is updated using the policygradienttheorem:


Here, is the expected cumulative reward, is a trajectory,and istheadvantagefunction,definedas:

The advantage function measures how much better an action iscomparedtotheaverageactioninstate .

Critic:Thecriticestimatesthevaluefunction ,where representstheparametersofthevaluefunction.Thecritic isupdatedbyminimizingthetemporaldifference(TD)error:
The loss function for the critic is typically the mean squarederroroftheTDerror:
Theactorandcriticareupdatedsimultaneouslyduring training.Theactorimprovesthepolicyusingtheadvantage functionestimatedbythecritic,whilethecriticimprovesits valuefunctionestimatesusingtherewardsandtransitions generatedbytheactor.


In Equation (10), represents the parameters of the policy, isthelearningratefortheactor, isthe probabilityoftakingaction instate underthepolicy,and denotes the advantage function, which quantifies how much better an action is compared to the expected actioninthatstate.

HereinEquation(11), representstheparametersofthe valuefunction(critic), isthelearningrateforthecritic,and isthelossfunctionforthevaluefunction,oftendefined as the mean squared error between the predicted and the targetvalues.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072



In this equation, is the advantage function, istherewardreceivedaftertakingaction in state andtransitioningtostate , isthediscountfactor thatdeterminestheimportanceoffuturerewards, is the value function estimate for the current state , and is the value function estimate for the subsequent state



The ExActor-Critic framework introduces a novel approach to reinforcement learning (RL) by prioritizing explainabilitywithoutcompromisingperformance.Building on the traditional Actor-Critic architecture, it integrates advanced mechanisms such as Gated Recurrent Units (GRUs),attentionmechanisms,anddimensionalityreduction techniques to enhance interpretability and transparency. Thesecomponentsenabledeeperinsightsintothedecisionmaking processes of RL agents, making the model more understandable and reliable. Figure 1 illustrates a comparisonbetweenthebasevariantofExActor-Criticand theconventionalActor-Criticframeworks.

Fig -1:ComparisonofExActor-Critic andConventional Actor-Criticframeworks.
4.1 Overview: Explainability Techniques
ExActor-Critic employs multiple techniques to improve interpretability:
GRUs: Capture temporal dependencies and store hidden states, enabling the model to retain and analyze critical informationovertime[20].
Attention Mechanisms: Focusonrelevantpartsoftheinput sequencebycomputingattentionweights,highlightingkey featuresfordecision-making[19,34]
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) and t-SNE project highdimensionalhiddenstatesintolowerdimensions,facilitating visualizationandclustering[40].
Saliency Maps: Identifytheimportanceofindividualstate variables, offering actionable insights into how the model prioritizesinputs[41]
4.2 Model Architecture
ExActor-Critic introduces three progressively advanced variants,eachenhancingperformanceandinterpretability.
4.2.1 ExActor-Critic Model-1
Model-1 integrates a GRU and an Attention Mechanism (ATM) to improve temporal dependency capture and interpretability.

GRU Component:TheGRU updatesitshiddenstate at eachtimestep usingthefollowingequations:



Here, istheinputattime , , ,and arelearnable weights,and isthesigmoidfunction.
ATM Component: The attention mechanism computes attentionweights foreachhiddenstate:


where isalearnableweightmatrix.Theweightedoutput iscomputedas:
Advancement: Model-1 (Fig. 1) enhances conventional Actor-CriticmodelwithinterpretabilitybyleveragingGRUs to capture temporal dependencies and attention mechanismstofocusonrelevantstates,providingaclearer understandingoftheagent'sdecision-makingprocess.
4.2.2 ExActor-Critic Model-2
Model-2 builds on Model-1 by incorporating Optuna for hyperparameter optimization, enhancing scalability and efficiency.
Objective Function: The optimization process minimizes thenegativemeanrewardoverepisodes:

where represents hyperparameters (e.g., hidden size, dropoutrate), istherewardinthe -thepisode,and is thetotalnumberofepisodes.

Hyperparameters: Key hyperparameters include hidden size (range: [32, 256]) and dropout rate (range: [0.0, 0.5]).
Advancement: Model-2 (Fig. 2) introduces automated hyperparameter tuning using Optuna, significantly improving scalability and computational efficiency. This ensures optimal performance in high-dimensional environmentswhilereducingmanualintervention.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072

Fig -2: illustratesthearchitectureandworkflowof ExActor-CriticModel-2.
4.2.3 ExActor-Critic Model-3
Model-3 replaces the standard GRU with a Bidirectional GRU, enabling the model to process sequences in both forwardandbackwarddirections.

Bidirectional GRU: At each time step , the forward GRU computes ,andthebackwardGRUcomputes .Theseare concatenatedtoformthefinalhiddenstate:

Training Procedure:Thetrainingprocessinvolves: Initialization of actor and critic networks.
Episode execution to collect trajectories.
Reward computation using discounted cumulative rewards:

Advantage estimation:
Policy update using the policy gradient:
ValueupdatebyminimizingthemeansquarederroroftheTD error:
Interpretability Features:
Attention Weights: Theattentionweightswerevisualizedto gain insights into the model's focus and to interpret the specificregionsofinputdatathatthemodelprioritizesduring processing(Eq.17)(Blakeman&Mareschal,2022).
State Activations: Thehiddenstateactivationswereanalysed to establish connections between the model's internal representations and relevant environmental variables, providingadeeperunderstandingofhowthemodelencodes and utilizes contextual information (Eq. 18) (Hejase et al., 2022).
Saliency Maps:Identifyinfluentialstatevariables:
Dimensionality Reduction: PCA transforms hidden states intoalower-dimensionalspace:

where is the hidden state matrix, and is the PCA transformationmatrix.t-SNEminimizestheKullback-Leibler divergence:

Advancement: Model-3 introduces a Bidirectional GRU, significantlyenhancingtemporalunderstandingbycapturing dependenciesfrombothpastandfuturestates.Theintentisto improveperformanceincomplexenvironmentsandprovides deeper explainability through richer hidden state representations.

Fig -3: ExActor-CriticModel-3ArchitectureandWorkflow
ThearchitectureandworkflowofExActor-CriticModel-3 (Fig. 3), while Figure 4 details the training process and explainabilityfeatures.

Fig -4: ExActor-CriticModel-3 explainabilitytechniquesto visualizeandinterpretthemodel'sinternalrepresentations
4.2.4 eXActor-Critic Model -4
The eXActor-Critic Model -4 is specifically designed to address the challenges associated with complex, nonstationary environments. This iteration builds upon the limitations of Model -3, which exhibits suboptimal performance (compared with convectional Actor-Critic Model)whenstatevariableslackdefinitehistoricaltrends. Tomitigatetheseissues,themodelintroducesaDual-Path GRUNetwork,integratingbothBidirectional GRU(BiGRU) for capturing long-term dependencies and Unidirectional

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
GRU (UniGRU) for short-term adaptability. This dual-path architectureenablesamoreflexibleandefficienthandlingof temporaldependencies.
Toenhancetrainingstability,LayerNormalizationisapplied across both GRU pathways, ensuring consistent gradient flow and mitigating vanishing or exploding gradient issues. A key innovation of this model is the Dynamic Mode Switching mechanism, which dynamically selects between BiGRU and UniGRU based on a predefined reward improvement threshold (ΔR_switch>5%).Thisadaptivemechanismoptimizestemporal processing by responding to task-specific requirements, leading to improvements in both performance and computational efficiency.
The key advancements of the eXActor-Critic Model -4 are summarized in Table 3, while the detailed pseudo-code of its implementation is provided in Appendix 1, facilitating reproducibility and further research.
Table -3: KeyAdvancementsofeXActor-CriticModel-4
Feature Description
Dual-Path GRU Network
Layer Normalization
Dynamic Mode Switching
Interpretable Attention
Stability Mechanisms
Non-Stationary Robustness
Computational Efficiency
Incorporates BiGRU for longtermdependenciesandUniGRU for short-term responsiveness, ensuring effective temporal processing.
Stabilizestrainingbymaintaining consistent gradient flow across bothGRUpathways.
Adapts GRU selection based on reward improvement (ΔR_switch > 5%), optimizing task-specific performance.
Enhances decision-making by guiding the Actor and Critic networkstoprioritizethemost relevantfeatures.
Employs Automatic Mixed Precision (AMP) for gradient scalingandZ-score Advantage Normalizationtoensurestable policyupdates.
Validated under controlled dynamicconditions,including observation noise, periodic perturbations,and non-linear dragforces.
Reducesinferenceoverhead throughselectiveGRUactivation, makingtheframeworkmore suitableforreal-world deployment.
The eXActor-Critic Model -4 enhances reinforcement learning with adaptability, efficiency, and robustness. Its dynamic mode switching, dual-path architecture, and interpretableattentionmechanismsimproveinterpretability and computational efficiency, making it a powerful
framework for addressing complex, non-stationary environmentsinreal-worldapplications.
5. EVALUATION OF EXACTOR-CRITIC FRAMEWORK
Thissectionprovidesacomprehensiveevaluationofthe eXActor-Critic Framework, focusing on its architecture, experimentalsetup,performancemetrics,interpretability, and comparative analysis with traditional methods. The evaluationhighlightstheframework'sabilitytohandlenonstationaryenvironments,itscomputationalefficiency,andits robustness under dynamic conditions, while also emphasizingitsinterpretablefeatures.
5.1
The evaluation of the eXActor-Critic Framework was conducted using a modified version of the CartPole-v1 environment, which was enhanced with controlled nonstationarity to simulate real-world complexities. These modifications were designed to test the framework's robustness, adaptability, and performance under dynamic andunpredictableconditions.Belowarethetechnicaldetails oftheenvironmentsetup:
5.1.1
To simulate real-world sensor inaccuracies, Gaussian noise wasaddedtothestateobservations. The noisewas sampledfromanormaldistributionwithameanof0anda standarddeviationof .Mathematically,thenoisy observation attime isgivenby: (28)



where: is the true state at time , is the noise term, controls the magnitude of the noise. This modificationensuresthattheframeworkisevaluatedunder conditions that mimic real-world sensor imperfections, whicharecommoninpracticalapplications.
5.1.2
Tosimulateexternaldisturbances, periodic perturbations wereintroducedtothecart'spositionand velocity.Theseperturbationsweremodeledassinusoidal functionswithfrequenciesrangingfrom 0.1 Hz to 1.0 Hz
Theperturbationp(t)attime isdefinedas:

where: istheamplitudeoftheperturbation(setto0.1for positionand0.05forvelocity), isthefrequency,sampled uniformly from the range Hz. The perturbations wereappliedtoboththecart'sposition andvelocity as follows:


where and are the position and velocity perturbations, respectively. This modification tests the framework's ability to handle external disturbances that varyovertime.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
To simulate real-world friction and resistance, a nonlinear drag force wasappliedtothecart.Thedragforce ismodeledas:



(32)
where: isthedragcoefficient(setto0.01), isthecart's velocity.Thedragforceactsintheoppositedirectiontothe cart's motion, providing a realistic simulation of air resistance and friction. This modification introduces nonlinearityintotheenvironment,challengingtheframeworkto adapttocomplexphysicaldynamics.
The ChallengingCartPole class wraps the CartPole-v1 environment to consistently apply modifications: noise injection (Gaussian noise, ), periodic perturbations (0.1–1.0 Hz), non-linear drag forces ( ),andstatenormalization.Thesechanges simulate real-world complexities like sensor inaccuracies, externaldisturbances,andfriction,creatingadynamic,nonstationarytestbed.Theframework'sabilitytohandlenoisy observations, adapt to disturbances, learn non-linear dynamics, and maintain stability is rigorously evaluated. Thissetupensuresrobusttestingunderrealisticconditions, providing insights into its suitability for real-world applications.


The experiments were designed to evaluate the performance of the proposed framework under diverse conditionsandbenchmarkitagainsttraditionalactor-critic methods.Thesetupparametersusedintheseexperiments areoutlinedinTable4.
Table -4: Thesetupparametersforabenchmarktest
Category Details
Model Initialization
eXActor-CriticModel-4:Initializedwith a Dual-Path GRU Network (BiGRU + UniGRU). Baseline Model: Traditional Actor-Criticmodelusedforcomparison.
Training Parameters Episodes:300episodesperexperiment. Learning Rate: 0.001, optimized using the Adam optimizer. Hidden Size: Tuned using Optuna hyperparameter optimization. Dropout Rate: 0.1 to prevent overfitting. Device: CUDAenabled GPU for accelerated computation(TestedwithCPUonly).
Evaluation Metrics
Mean Reward: Average reward per episode over 1000 consecutive episodes. Training Time: Total time taken to train the model. Sample
Efficiency:Numberofepisodesrequired to achieve convergence. Mode Switching Frequency: Frequency of transitionsbetweenBiGRUandUniGRU modes.
This section provides a comprehensive analysis of the evaluationresultscomparingtheeXActor-Criticframework withtheconventionalActor-Criticbaseline,accompaniedby anin-depthdiscussionofthefindings.
Theperformancecomparisonisevaluatedusingreward metrics and statistical measures within a dynamic environment
6.1.1 Performance comparison of eXActor-Critic and baseline Actor-Critic Frameworks
Figure5presentsaperformancecomparisonover1,000 episodes between the proposed eXActor-Critic framework andthebaselineconventionalActor-Criticapproach.Initially, thebaselineActor-CriticoutperformstheeXActor-Criticfrom episode0toaroundepisode80.However,whentheeXActorCritic dynamically switches from its default bidirectional BiGRU to unidirectional UniGRU mode, it begins to outperform the baseline. This advantage persists until a changeinthedynamicenvironmentoccursatapproximately episode180.Followingthis,theeXActor-Criticswitchesagain to an optimal mode after episode 320 and maintains this configurationuntiltheend.
TheeXActor-Criticframework’sdynamicmode-switching betweenBiGRUandUniGRUmodesadaptstotask-specific temporalcontexts.BiGRUhandleslong-termdependencies, using past and future contexts for decisions like balance maintenance,whileUniGRUaddressesshort-termreactivity, such as responding to disturbances. By evaluating and adapting to the optimal mode, eXActor-Critic captures temporaldynamics,outperformingthebaselineindynamic environment

Fig -5: PerformancecomparisonofeXActor-Criticand baselineActor-CriticFrameworks

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
6.1.2 Statistical performance evaluation of eXActorCritic with baseline Actor-Critic Frameworks
Table 5 highlights the superior performance of the eXActor-Critic framework comparedtothebaselineActorCritic.TheeXActor-CriticachievesasignificantlyhigherMax Reward(148vs.47)andMeanReward(28.409vs.11.167), demonstratingbetteroverallperformanceandstability.Its higher Standard Deviation (17.818 vs. 4.831), lower Skewness(1.884vs.4.159)andKurtosis(6.013vs.19.723) indicate more consistent reward distributions with fewer outliers,reflectinggreaterreliability.AlthougheXActor-Critic requires more episodes to converge (8 vs. 1) and longer training times (171.04s vs. 26.15s), it achieves a 154.4% improvementinmeanreward,showcasingitsefficiencyin learning high-quality policies. Statistical significance is confirmed by the t-Test (p-value ≈ 3.03E-159) and MannWhitneyUTest(p-value≈8.83E-193)bothstronglyrejecting thenullhypothesisofequalperformance.TheVarianceRatio (13.604)furtheremphasizeseXActor-Critic’sstability.These results position eXActor-Critic as a robust and effective framework for reinforcement learning, offering higher rewards and consistency despite slightly longer training times,makingitidealforcomplexcontroltasks.
Table -5: StatisticalanalysisofeXActor-Criticwith baseline Actor-CriticFramework

The eXActor-Critic saliency map reveals the relative importanceofstatevariablesintheCartPoleenvironment, with Pole Angle (5.82E-05) being the most influential, followedbyPoleAngularVelocity(2.58E-05),CartVelocity (2.15E-05), and Cart Position (8.83E-06). This aligns with CartPole dynamics, where the pole's angle and angular velocity are critical for balance, while cart position and velocityaresecondary.Theseinsightsarevaluableforpolicy improvement and algorithm refinement. Prioritizing Pole AngleandPoleAngularVelocityduringtrainingcanenhance convergenceandpolicystability.Thesaliencymapalsoaids featureengineeringbyidentifyingkeyvariables,enabling theremovaloflessimportantfeaturestoreducecomplexity. Additionally, it helps diagnose learning issues, such as sensitivitytospecificvariables,guidingadjustmentstothe reward function, exploration strategy, or network architecture. In summary, the saliency map enhances interpretabilityandservesasapracticaltoolforoptimizing reinforcement learning algorithms, refining policies, and improving performance in theCartPole environmentand similartasks.
t-Test Statistic 29.51843331
t-Test p-value 3.03E-159
Mann-Whitney U Statistic 880232
Mann-Whitney U p-value 8.83E-193
6.2 Interpretable Features of eXActor-Critic Framework
Another keystrengthoftheeXActor-CriticFrameworkis its interpretability, which is achieved through several mechanismsevaluatedanddiscussedinthissection.
6.2.1 Attention Mechanism and Saliency Maps
The eXActor-Critic framework employs an attention mechanism thatassignsweightstodifferentfeaturesofthe stateandmadeitpossibletocreateasaliencymap(Figure6 )tohelpunderstandwhichfeaturesaremostinfluentialin decision-making.
The eXActor-Critic framework captures and interprets modeldynamics,offeringinsightsintotheinterplayofstateenvironmentalvariableswithGRUhiddenlayers.Figure7 plotsselectedGRUhiddenlayersandstatevariablesover time,revealingtheirtemporalevolution.ByanalysingGRU hiddenlayerdynamics,theframeworkexamineshowstate variables and hidden layers evolve, including activation function behaviour; whether they saturate, explode, converge,ordivergetoprovideexplanationforsuboptimal performanceordiscoveryofredundantinactivelayer.This interpretability is crucial for understanding temporal dependencies for model optimisation. The insights gained enabletargetedpolicyrefinementandfeatureengineering, enhancinglearningefficiencyandperformance.Inthisway, eXActor-Criticframeworkadvancespolicyoptimizationand interpretability, making it a powerful tool for dynamic environments.

Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072

The eXActor-Critic method integrates Principal ComponentAnalysis(PCA)clusterdatastatistics,asdetailed in Table 6, along with component loadings, presented in Table7,toprovidecriticalinsightsintothedynamicsofthe Cart-Pole environment. The cluster data statistics reveal distinct regions of the state space, categorized into three clustersbasedonCartPosition,CartVelocity,PoleAngle,and PoleAngularVelocity.Cluster1representshighlyunstable states characterized by fast leftward cart movement and high pole angular velocity, while Cluster 2 captures nearequilibrium states with slower cart movement and lower poleangularvelocity.Cluster0representstransitionalstates with high variability, likely bridging stable and unstable regions.Theseclustersenableanuancedunderstandingof the system's dynamics, allowing for tailored reward shapingandpolicydesign.Forinstance,higherrewardscan be assigned to Cluster 2 to encourage stability, while penalties can be applied to Cluster 1 to discourage instability.
The PCA component loadings further highlight the importance of Cart Velocity and Pole Angular Velocity in explaining the variance in the state space, with PC1 accountingfor54.38%ofthevariance.Thisunderscorestheir critical role in distinguishing between states and suggests that these features should be prioritized in state representations. PC2, explaining 37.17% of the variance, capturestheinterplaybetweenCartVelocityandPoleAngle, providing additional context for state differentiation. Together, these components offer a compressed yet informative representation of the state space, reducing dimensionalityandimprovingcomputationalefficiency.
Table -6: eXActor-CriticgeneratedPCAclusterdata statisticsofenvironmentvariables
Thesefindingshavepracticalimplicationsforthepresent eXActor-Critic reinforcement learning environment ( the CartPoleenvironment).Theclusterscanguideexploration strategies,withCluster0servingasafocusforexploration duetoitshighvariability,whileCluster2canbeexploitedfor maintainingstability. The PCA componentscan be usedto designefficientstaterepresentations,acceleratingtraining and reducing overfitting. Additionally, the insights can be extended to similar environments, facilitating transfer learningandgeneralization.Overall,theseinsightsprovided byeXActor-Criticrobustframeworkcanguideindeveloping moreeffectiveandadaptivereinforcementlearningpolicies indynamiccontroltasks.
Table -7: eXActor-CriticgeneratedPCAcomponent loadingsfortheenvironmentvariable
Component Cart Position Cart Velocity PC1 0.13431948 -0.11793802
6.2.4 PCA Dimensionality Reduction and Clustering Analysis of State Variables (t-SNE)
Table8,eXActor-Critict-SNEclusteringofreinforcement learningenvironment(Cart-Pole)states,uncoversactionable insightsforreinforcementlearning(RL)policyrefinement. Cluster0,withacartpositionof-0.0503(±0.0671)andpole angular velocity of 0.3821 (±0.3806), indicates moderate instability, suggesting the policy struggles with corrective actionsunder leftward drift (-0.1918±0.2169).Cluster 1, markedbyhighpoleangularvelocity(0.4729±0.647)and erraticcartvelocity(-0.2008±0.3952),highlightsafailure

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
mode where excessive tilt (pole angle: 0.0632 ± 0.0625) terminates episodes. Conversely, Cluster 2, with reduced pole angular velocity (0.4263 ± 0.3144) and steadier cart motion(-0.1943±0.1773),approximatesstabilitydespitea slighttilt(0.0602±0.066).
Table -8: eXActor-Criticgenerated t-SNEClusterstatistics
Cluster Cart Position Cart Velocity
Cluster Pole Angle
oftheenvironmentstates
These eXActor-Critic t-SNE-derived clusters expose policy weaknesses persistentleftwarddriftandvariablepoledynamics acrossallgroups offeringapracticalroadmapforimprovement. Fine-tuning the algorithm to minimize pole angular velocity in Clusters 0 and 1, while stabilizing cart position in Cluster 2, could enhance robustness. Adjusting reward structures or exploration strategies to prioritize tighter control over tilt and drift may shift the state distribution toward Cluster 2’s nearstable regime, optimizing overall performance in the Cart-Pole task.
6.2.5 PCA Dimensionality Reduction and Analysis of GRU Hidden Layers
PrincipalComponentAnalysis(PCA)oftheeXActor-Critic GRUhiddenstatesDepictedinFigure8andTable9during reinforcement learning reveals dynamic internal representations. PC1, showing an increasing trend with fluctuations, likely encodes the primary decision-making process,reflectinggrowingconfidenceinthelearnedpolicy PC2, with oscillatory behaviour, may capture auxiliary information like exploration or uncertainty, playing a secondary yet dynamic role. The results highlight distinct phases:sharptransitionsinPC2suggestexploration,while stabilizationinPC1indicatesconvergencetoastablepolicy, emphasizing exploitation. Reduced variability in PC1 over time signals successful learning and strategy refinement. ThesefindingsdemonstratetheeXActor-CriticGRU’sability to disentangle task-relevant information, balance exploration-exploitation,andconvergetoeffectivepolicies. Thestudyunderscorestheimportanceofanalyzinginternal RNN dynamics for understanding complex tasks, with broader implications for interpretability, architecture
design, and task-specific dynamics. Overall, the eXActorCritic GRU’s capacity to organize hidden states for reinforcementlearningpavesthewayforfurtherresearch intoRNNoptimizationandinterpretability.
Table -9: PCAofGRUhiddenlayers
-1.01878 0.103477 -0.8660 0.541265 -0.23929 0.828826
0.762972 0.656924 1.486479 -0.28572 0.668558 -1.66699 -1.48549 -1.50018 -2.45661 -0.00036 -1.79743 1.299097 0.232827 2.047771 2.965831 1.387357 3.965117 -1.20566 0.803778 -3.99148 -2.55186 -1.74265 -3.06299 0.353381 -1.71934 1.957388 1.327364 2.533643 4.145836 1.087078 4.126881 -2.22009 -0.40591 -4.46151 -3.17332 -1.23994 -3.07177 0.78975 -1.10769 2.346496 2.470848 2.38213

Fig -8: eXActor-Critic PCAplotofGRUhiddenlayers 6.3 Limitations and Mitigation Strategies
The eXActor-Critic framework marks a notable advancement in reinforcement learning but has certain limitations. Firstly, it requires more episodes to converge
© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page259

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
comparedtothebaselineActor-Critic(8vs.1),whichcanbe mitigated by tuning the learning rate or adjusting hidden layers. Secondly, training times are longer (171.04s vs. 26.15s)partlyduetoitsperiodicoptimizationprocess,which evaluates BiGRU and UniGRU modes sequentially to determine the optimal operational mode. To reduce this overhead, a concurrent ThreadPoolExecutor was implemented to run both modes in parallel. Additionally, CUDAsupportwasintegratedtoleverageGPUacceleration, though current testing was limited to CPU-only devices. FutureevaluationsonCUDA-enabledhardwareareexpected tosignificantlyimprovetrainingefficiency.
Collectively, these mitigation strategies enhance the framework’s robustness and adaptability, addressing its limitationswhilepreservingitsadvancedcapabilities.
The eXActor-Critic framework represents a significant advancementinreinforcementlearning,particularlyfornonstationary and dynamic environments. Its robustness and adaptabilityweredemonstratedinamodifiedCartPole-v1 environment, where it outperformed a traditional ActorCritic baseline under challenging conditions such as observation noise, periodic perturbations, and non-linear dragforces.BydynamicallyswitchingbetweenBiGRUand UniGRU modes, the framework effectively balances longtermdependenciesandshort-termreactivity,makingitwellsuitedforcomplexcontroltasks.Quantitatively,itachieveda meanrewardof28.409(vs.11.167forthebaseline)anda max reward of 148 (vs. 47), with statistical significance confirmed through t-tests and Mann-Whitney U tests. Althoughitrequiresmoreepisodestoconvergeandlonger training times, its 154.4% improvement in mean reward underscoresitsefficiencyinlearninghigh-qualitypolicies.
The framework also excels in interpretability, offering tools such as attention mechanisms, saliency maps, GRU hidden layer dynamics analysis, and PCA/t-SNE-based clustering. These features provide deep insights into decision-makingprocesses,theinfluenceofstatevariables, and the evolution of internal representations. While limitations such as computational overhead exists, the framework’s ability to handle complexity and provide interpretability makes it a powerful tool for advanced reinforcementlearningapplications.
[
1]A. N. Abbas, G.C. Chasparis, J.D. Kelleher. Hierarchical framework for interpretable and specialized deep reinforcement learning-based predictive maintenance. DataKnowlEng.2024.doi:10.1016/j.datak.2023.102240
[2]E.Baccour,A.Erbad,M.Guizani.Reinforcementlearningbased dynamic pruning for distributed inference via explainable AI inhealthcareIoT systems. Future Gener ComputSyst.2024.doi:10.1016/j.future.2024.01.021
[3]Z.Bing,C.Lemke,A.Knoll.Energy-efficientanddamagerecovery slithering gait design for a snake-like robot
based on reinforcement learning and inverse reinforcement learning. Neural Netw. 2020. doi:10.1016/j.neunet.2020.05.029
[4]A.Brini,G.Tedeschi,D.Tantari.Reinforcementlearning policyrecommendationforinterbanknetworkstability.J FinancStab.2023.doi:10.1016/j.jfs.2023.101139
[5] A. E. Gaweda, M.K. Muezzinoglu, M.E. Brier. Individualizationofpharmacologicalanemiamanagement using reinforcement learning. Neural Netw. 2005. doi:10.1016/j.neunet.2005.06.020
[6] N. Gholizadeh, P. Musilek. Explainable reinforcement learningfordistributionnetworkreconfiguration.Energy Rep.2024.doi:10.1016/j.egyr.2024.05.031
[7]L.He,N.Aouf,B.Song.Explainabledeepreinforcement learningforUAVautonomouspathplanning.AerospSci Technol.2021.doi:10.1016/j.ast.2021.107052
[8] A. Heiberg, T.N. Larsen, D. Varagnolo. Risk-based implementation of COLREGs for autonomous surface vehiclesusingdeepreinforcementlearning.NeuralNetw. 2022.doi:10.1016/j.neunet.2022.04.008
[9]M.T.Lash.HEX:Human-in-the-loopexplainabilityviadeep reinforcement learning. Decis Support Syst. 2024. doi:10.1016/j.dss.2024.114304
[10] A. Likmeta, A.M. Metelli, D. Romano. Combining reinforcement learning with rule-based controllers for transparentandgeneraldecision-makinginautonomous driving. Robot Auton Syst. 2020. doi:10.1016/j.robot.2020.103568
[11]M.Mark,N.Chehrazi,T.A.Weber.Optimalrecoveryof unsecureddebtviainterpretablereinforcementlearning. MachLearnAppl.2022.doi:10.1016/j.mlwa.2022.100280
[12] A.R. Sattarzadeh, P.N. Pathirana. Unification of probabilistic graph model and deep reinforcement learning(UPGMDRL)formulti-intersectiontrafficsignal control. Knowl Based Syst. 2024. doi:10.1016/j.knosys.2024.112663
[13]W.Tian,G.Fu,Z.Liao.Improvingtheinterpretability of deepreinforcementlearninginurbandrainage system operation. Water Res. 2024. doi:10.1016/j.watres.2023.120912
[14]Z.Wang,N.Aouf.Efficientadversarialattacksdetection for deep reinforcement learning-based autonomous planetary landing GNC. Acta Astronaut. 2024. doi:10.1016/j.actaastro.2024.07.052
[15] L. Yun, D. Wang, L. Li. Explainable multi-agent deep reinforcement learning for real-time demand response towardssustainablemanufacturing.ApplEnergy.2023. doi:10.1016/j.apenergy.2023.121324
[16] A. Heuillet, F. Couthouis, N. Díaz-Rodríguez. Explainabilityindeepreinforcementlearning.Knowl Based Syst. 2021;214:106685. doi:10.1016/j.knosys.2020.106685
[17] M. Taghian, S. Miwa, O. Zaiane. Explainability of deep reinforcementlearningalgorithmsinroboticdomainsby usingLayer-wiseRelevancePropagation.EngApplArtif Intell.2024.doi:10.1016/j.engappai.2024.109131

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 08 | Aug 2025 www.irjet.net p-ISSN: 2395-0072
[18] J. Xing, T. Nagata, J.L. Krichmar. Achieving efficient interpretability of reinforcement learning via policy distillation and selective input gradient regularization. NeuralNetw.2023.doi:10.1016/j.neunet.2023.01.025
[19]S.Blakeman,D.Mareschal.Selectiveparticleattention: Rapidly and flexibly selecting features for deep reinforcement learning. Neural Netw. 2022. doi:10.1016/j.neunet.2022.03.015
[20] B. Hejase, E. Yurtsever, U. Ozguner. Dynamic and interpretablestaterepresentationfordeepreinforcement learninginautomateddriving.IFAC-PapersOnLine.2022. doi:10.1016/j.ifacol.2022.10.273
[21]T.Rupprecht,Y.Wang.Asurveyfordeepreinforcement learninginmarkoviancyber-physicalsystems:Common problems and solutions. Neural Netw. 2022. doi:10.1016/j.neunet.2022.05.013
[22] Z. Zhai, S. Shen, Y. Mao. An explainable deep reinforcement learning algorithm for the parameter configuration and adjustment in the consortium blockchain. Eng Appl Artif Intell. 2024. doi:10.1016/j.engappai.2023.107606
[23]R.Zhang,S.Bu,Z.Zhang.Deepreinforcementlearning based interpretable photovoltaic power prediction framework. Sustain Energy Technol Assess. 2024. doi:10.1016/j.seta.2024.103830
[24]A.Tsantekidis,N.Passalis,A.Tefas.Modelinglimitorder trading with a continuous action policy for deep reinforcement learning. Neural Netw. 2023. doi:10.1016/j.neunet.2023.05.051
[25] F. Sehnke, C. Osendorfer, J. Schmidhuber. Parameterexploring policy gradients. Neural Netw. 2010. doi:10.1016/j.neunet.2009.12.005
[26] S. Atakishiyev, M. Salameh, H.R. Tizhoosh, R. Zhu. Interestingnesselementsforexplainablereinforcement learning. Artif Intell. 2020;288:103367. doi:10.1016/j.artint.2020.103367
[27] M. Coombes, W.M. Czarnecki, R.M. French. Evolving interpretabledecisiontreesforreinforcementlearning. Artif Intell. 2024;327:104057. doi:10.1016/j.artint.2023.104057
[28] M.L. Olson, W.K. Wong, R.C. Moore, T.M. Mitchell. Counterfactual state explanations for RL agents. Artif Intell. 2021;295:103455. doi:10.1016/j.artint.2021.103455
[29]Y.Chen,F.Zhang,Z.Liu.Adaptivebias-variancetrade-off inadvantageestimatorforactor-criticalgorithms.Neural Netw.2024.doi:10.1016/j.neunet.2023.10.023
[30]M.Fairbank,D.Prokhorov,S.Li.Neurocontrolforfixedlength trajectories in environments with soft barriers. NeuralNetw.2025.doi:10.1016/j.neunet.2024.107034
[31] J. Peters, S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural Netw. 2008. doi:10.1016/j.neunet.2008.02.003
[32] P. Wawrzyński. Real-time reinforcement learning by sequential Actor-Critics and experience replay. Neural Netw.2009.doi:10.1016/j.neunet.2009.05.011
[33] P. Wawrzyński, A.K. Tanwani. Autonomous reinforcement learning with experience replay. Neural Netw.2013.doi:10.1016/j.neunet.2012.11.007
[34]L.Bramlage,A.Cortese.Generalizedattention-weighted reinforcement learning. Neural Netw. 2022. doi:10.1016/j.neunet.2021.09.023
[35] Z. Chen, B. Luo, X. Xu. LJIR: Learning Joint-Action Intrinsic Reward in cooperative multi-agent reinforcement learning. Neural Netw. 2023. doi:10.1016/j.neunet.2023.08.016
[36]A.Perrusquía,M.Zou,W.Guo.Explainabledata-driven Q-learning control for a class of discrete-time linear autonomous systems. Inf Sci. 2024. doi:10.1016/j.ins.2024.121283
[37]Z.Zeng,Q.Cheng,Z.Liu.RuMER-RL:Ahybridframework forsparseknowledgegraphexplainablereasoning.InfSci. 2024.doi:10.1016/j.ins.2024.121144
[38] E. Ben-Iwhiwhu, J. Dick, A. Soltoggio. Context metareinforcement learning via neuromodulation. Neural Netw.2022.doi:10.1016/j.neunet.2022.04.003
[39] D. Patel, H. Hazan, R. Kozma. Improved robustness of reinforcement learning policies upon conversion to spiking neuronal network platforms applied to Atari Breakout game. Neural Netw. 2019. doi:10.1016/j.neunet.2019.08.009
[40]V.Tangkaratt,J.Morimoto,M.Sugiyama.Model-based reinforcementlearningwithdimensionreduction.Neural Netw.2016.doi:10.1016/j.neunet.2016.08.005
[41]T.Zhao,G.Li,M.Sugiyama.Learningexplainabletaskrelevant state representation for model-free deep reinforcement learning. Neural Netw. 2024. doi:10.1016/j.neunet.2024.106741
[42]B.Manela,A.Biess.CurriculumlearningwithHindsight Experience Replay for sequential object manipulation tasks. Neural Netw. 2022. doi:10.1016/j.neunet.2021.10.011
[43] J. Wang, S. Elfwing, E. Uchibe. Modular deep reinforcement learning from reward and punishment for robot navigation. Neural Netw. 2021. doi:10.1016/j.neunet.2020.12.001