Data Efficient Reinforcement Learning

Reinforcement learning (RL) has recently emerged as a generic yet powerful solution for learning complex decision-making policies, providing the key foundational underpinnings of recent successes in various domains, such as game playing and robotics. However, many state-of-the-art algorithms are data-hungry and computationally expensive, requiring large amounts of data to succeed. While this is feasible in certain scenarios, it can be costly or infeasible in applications where available data is sparse, such as those arising in the social sciences and healthcare. With the surging interest in applying RL to broader domains, it is imperative to develop an informed view about the usage of data involved in its algorithmic design. This thesis hence focuses on studying the data efficiency of RL, through a structural perspective. Advancement along this direction naturally requires us to understand when and why algorithms are successful to begin with and, building upon such understanding, to further improve the data efficiency of RL. To this end, this thesis begins by taking inspiration from the empirical successes. We consider the popular use of simulation-based Monte Carlo Tree Search (MCTS) in RL, as exemplified by the remarkable achievement of AlphaGo Zero, and probe the data efficiency of incorporating such a key ingredient. Specifically, we investigate the correct form to utilize such a tree structure for estimating values and characterize the corresponding data complexity. These results further enable us to analyze the data complexity of an RL algorithm that combines MCTS with supervised learning, as done in AlphaGo Zero. Having developed a better understanding, as a next step, we improve the algorithmic designs of simulation-based data-efficient RL algorithms that have access to a generative model. We provide such improvements for both bounded and unbounded spaces.
Our first contribution is a structural framework through a novel lens of low-rank representation of the Q-function. The proposed data-efficient RL algorithm exploits the low-rank structure to perform pseudo-exploration by querying/simulating only a selected subset of state-action pairs, via a new matrix estimation technique. Remarkably, this leads to a significant (exponential) improvement in data complexity. Moving to our endeavor with unbounded spaces, one must first address the unique conceptual challenges incurred by the unbounded domains. Inspired by classical queueing systems, we propose an appropriate notion of stability for quantifying "goodness" of policies. Subsequently, by leveraging the stability structure of the underlying systems, we design efficient, adaptive algorithms with a modified, efficient Monte Carlo oracle that guarantee the desired stability with a favorable data complexity that is polynomial with respect to the parameters of interest. Altogether, through new analytical tools and structural frameworks, this thesis contributes to the design and analysis of data-efficient RL algorithms.
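
To make the low-rank idea concrete, here is a minimal sketch. It is not the thesis's matrix estimation technique: it assumes a finite state-action grid, a Q-function that is exactly rank r, and noise-free queries, and it rebuilds the full table from a few "anchor" rows and columns. The function name `reconstruct_from_anchors` and the choice of anchors are hypothetical.

```python
import numpy as np

def reconstruct_from_anchors(q_rows, q_cols, q_cross):
    """Rebuild a low-rank matrix from a few sampled rows and columns.

    q_cols : (n_states, k)   Q-values of all states at the anchor actions
    q_rows : (k, n_actions)  Q-values of the anchor states at all actions
    q_cross: (k, k)          Q-values at anchor states x anchor actions
    """
    return q_cols @ np.linalg.pinv(q_cross) @ q_rows

rng = np.random.default_rng(0)
n_states, n_actions, rank = 50, 40, 3

# Synthetic rank-3 "Q-function" (an assumption made for illustration only).
q_true = rng.normal(size=(n_states, rank)) @ rng.normal(size=(rank, n_actions))

# Hypothetical anchors: the first `rank` states and actions.
anchor_states = np.arange(rank)
anchor_actions = np.arange(rank)

q_cols = q_true[:, anchor_actions]
q_rows = q_true[anchor_states, :]
q_cross = q_true[np.ix_(anchor_states, anchor_actions)]

q_hat = reconstruct_from_anchors(q_rows, q_cols, q_cross)

# Number of distinct entries queried versus the size of the full table.
queried = n_states * rank + rank * n_actions - rank * rank
max_error = float(np.max(np.abs(q_hat - q_true)))
print(f"queried {queried} of {n_states * n_actions} entries, max error {max_error:.2e}")
```

In this idealized setting only the anchor rows and columns are queried, yet every entry is recovered. With noisy simulation estimates the recovery would only be approximate.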
2022 PhD Thesis Award: Kaiqing Zhang, "Reinforcement Learning for Multi-Agent and Robust Control Systems: Towards Large-scale and Reliable Autonomy"

Sample-Efficient Deep Reinforcement Learning for Continuous Control

Reinforcement learning (RL) is a powerful, generic approach to discovering optimal policies in complex sequential decision-making problems. Recently, with flexible function approximators such as neural networks, RL has greatly expanded its realm of applications, from playing computer games with pixel inputs, to mastering the game of Go, to learning parkour movements by simulated humanoids. However, the common RL approaches are known to be sample intensive, making them difficult to apply to real-world problems such as robotics. This thesis makes several contributions toward developing RL algorithms for learning in the wild, where sample-efficiency and stability are critical. The key contributions include Normalized Advantage Functions (NAF), extending Q-learning for continuous action problems; Interpolated Policy Gradient (IPG), unifying prior policy gradient algorithm variants through theoretical analyses on bias and variance; and Temporal Difference Models (TDM), interpreting a parameterized Q-function as a generalized dynamics model for novel temporally abstracted model-based planning. Importantly, this thesis highlights that these algorithms can be seen as bridging gaps between branches of RL – model-based with model-free, and on-policy with off-policy. The proposed algorithms not only achieve substantial improvements over the prior approaches, but also provide novel perspectives on how to mix different branches of RL effectively to gain the best of both worlds. NAF has subsequently been shown to be able to train two 7-DoF robot arms to open doors using only 2.5 hours of real-world experience, making it one of the first demonstrations of deep RL approaches on real robots.
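
As an illustration of how NAF can extend Q-learning to continuous actions, the sketch below assumes the quadratic advantage form Q(s, a) = V(s) - 0.5 (a - mu(s))^T P(s) (a - mu(s)) with P = L L^T positive definite, so the greedy action is simply mu(s). The numbers stand in for network outputs at a single state and are made up for illustration.

```python
import numpy as np

def naf_q_value(action, value, mu, l_factor):
    """Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu), with P = L L^T."""
    precision = l_factor @ l_factor.T  # positive definite by construction
    diff = action - mu
    return value - 0.5 * diff @ precision @ diff

# Hypothetical outputs of the value, mean-action and Cholesky-factor heads.
value = 1.5
mu = np.array([0.2, -0.4])
l_factor = np.array([[1.0, 0.0],
                     [0.5, 2.0]])  # lower triangular, positive diagonal

q_at_mu = naf_q_value(mu, value, mu, l_factor)
q_elsewhere = naf_q_value(np.array([1.0, 1.0]), value, mu, l_factor)
print(q_at_mu, q_elsewhere)
```

Because the advantage term is never positive, Q is maximized at a = mu(s) with value V(s), which gives the max over actions needed by the Q-learning target in closed form.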


Deep multi-agent reinforcement learning

A plethora of real-world problems, such as the control of autonomous vehicles and drones, packet delivery, and many others, consist of a number of agents that need to take actions based on local observations and can thus be formulated in the multi-agent reinforcement learning (MARL) setting. Furthermore, as more machine learning systems are deployed in the real world, they will start having impact on each other, effectively turning most decision-making problems into multiagent proble...

Title: Deep reinforcement learning-based dynamic scheduling
Abstract: Attempts to address the production scheduling problem thus far rely on simplifying assumptions, such as a static environment and an inflexible problem size, which compromise schedule performance in practice due to many unpredictable disruptions to the system. Thus, the study of scheduling in the presence of real-time events, termed dynamic scheduling, continues to attract attention given the agility, flexibility, and timeliness modern production systems must deliver. Additionally, the changing nature of the manufacturing system also raises new challenges to existing scheduling strategies. At the front-end, the development of advanced data creation and exchange frameworks such as the Internet of Things and cyber-physical systems, and their applications to the industrial environment, have created an abundance of industrial data, while at the backend, edge and cloud computing technologies greatly enhance the capacity to process that data. Industrial data must be mined and analyzed so that the investment in infrastructure is not wasted and the production system is managed more effectively and in real time. Many data-driven technologies have been adopted in scheduling research, a promising candidate among them being reinforcement learning (RL), which is able to build a direct mapping from observations of the environment to actions that improve its performance. In this thesis, a deep multi-agent reinforcement learning (deep MARL) architecture is proposed to solve the dynamic scheduling problem (DSP). The deep reinforcement learning (DRL) algorithm is used to train the decentralized scheduling agents, to capture the relationship between information on the factory floor and scheduling objectives, with the aim of making real-time decisions for a manufacturing system with frequent unexpected events.
Two major aspects of deep MARL application to DSP are addressed in this work, namely the conversion from the traditional static scheduling problem (SSP) to dynamic scheduling in a practical context, and the adaptation of existing deep MARL algorithms to solve the scheduling problem in such an environment. Some impractical constraints of traditional studies are removed to create a research context that is closer to actual practice, resulting in a scheduling problem of variable size and scope. Specialized state and action representations that can handle the ever-changing specification of the problem are developed; the criteria for feature selection in a dynamic environment are also discussed. Recent advances in DRL and MARL research are integrated into the proposed approach after selection and adaptation. In addition, various improvements to the common deep MARL architecture are proposed, including a lightweight multilayer perceptron (MLP) encoder that is efficient in handling unstructured industrial data, a training scheme under the multi-agent architecture to improve the stability of training and overall performance, and knowledge-based reward-shaping techniques to decompose the joint reward signal into individual utilities to speed up learning and encourage cooperative behavior between agents. Simulation studies are then conducted for the ablation study and validation. In the first stage, the performance of the proposed approach, either as individual components or as an integrated model, is tested in iterative simulation runs within which a unique instance of production is created. Meanwhile, a set of DRL-based approaches from recent publications are run in parallel. Results suggest that the contribution of each improvement is significant; the integrated architecture also delivers stronger performance than peer DRL-based approaches.
For the validation, a set of priority rules that perform strongly in specified contexts and are widely applied in actual production scheduling is used as the benchmark. The proposed approach also provides a performance gain compared to the strongest rule, with a minor increase in computation cost and negligible latency in decision-making.

Carnegie Mellon University

Meta Reinforcement Learning through Memory

Modern deep reinforcement learning (RL) algorithms, despite being at the forefront of artificial intelligence capabilities, typically require a prohibitive amount of training samples to reach a human-equivalent level of performance. This severe data inefficiency is the major obstruction to deep RL’s practical application: it is often near impossible to apply deep RL to any domain without at least a simulator available. Motivated to address this critical data inefficiency, in this thesis we work towards the design of meta-learning agents that are capable of rapidly adapting to new environments. In contrast to standard reinforcement learning, meta-learning learns over distributions of environments, from which specific tasks are sampled and on which the meta-learner is directly optimized to speed up policy improvement. By exploiting a distribution of tasks which share common substructure with the tasks of interest, the meta-learner can adjust its own inductive biases to enable rapid adaptation at test time.

This thesis focuses on the design of meta-learning algorithms which exploit memory as the main mechanism driving rapid adaptation in novel environments. Meta-learning methods with inter-episodic memory are a class of methods that leverage a memory architecture conditioned on the entire interaction history of a particular environment to produce a policy. The learning dynamics driving policy improvement in a particular task are thus subsumed by the computational process of the sequence model, essentially offloading the design of the learning algorithm to the architecture. While conceptually straightforward, meta-learning using inter-episodic memory is highly effective and remains a state-of-the-art method.

We present and discuss several techniques for meta-learning through memory. The first part of the thesis focuses on the “embodied” class of environments, where an agent has a physical manifestation in an environment resembling the natural world. We exploit this highly structured set of environments to work towards the design of a monolithic embodied agent architecture that has the capabilities of rapid memorization, planning and state inference. In the second part of the thesis, we move to focus on methods that apply in general environments without strong common substructure. First, we re-examine the modes of interaction a meta-learning agent has with the environment: proposing to replace the typically sequential processing of interaction history with a concurrent execution framework where multiple agents act in the environment in parallel. Next, we discuss the use of a general and powerful sequence model for inter-episodic memory, the gated transformer, demonstrating large improvements in performance and data efficiency. Finally, we develop a method that significantly reduces the training cost and acting latency of transformer models in (meta-)reinforcement learning settings, with the aim to both (1) make their use more widespread within the research community, and, (2) unlock their use in real-time and latency-constrained applications, such as in robotics.

Reinforcement Learning Using Neural Networks, with Applications to Motor Control

PhD thesis, by Rémi Coulom


This thesis is a study of practical methods to estimate value functions with feedforward neural networks in model-based reinforcement learning. Focus is placed on problems in continuous time and space, such as motor-control tasks. In this work, the continuous TD(lambda) algorithm is refined to handle situations with discontinuous states and controls, and the vario-eta algorithm is proposed as a simple but efficient method to perform gradient descent. The main contributions of this thesis are experimental successes that clearly indicate the potential of feedforward neural networks to estimate high-dimensional value functions. Linear function approximators have often been preferred in reinforcement learning, but their success is restricted to relatively simple mechanical systems, or requires a lot of prior knowledge. The method presented in this thesis was tested successfully on an original task of learning to swim by a simulated articulated robot, with 4 control variables and 12 independent state variables.
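
For readers unfamiliar with TD(lambda), the following is a minimal sketch of the standard discrete-time, tabular variant with accumulating eligibility traces. It is not the continuous-time, neural-network version refined in the thesis; the function name, step sizes and the toy three-state chain are assumptions made for illustration.

```python
import numpy as np

def td_lambda_episode(values, transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """One pass of tabular TD(lambda) with accumulating eligibility traces.

    transitions: list of (state, reward, next_state, done) tuples.
    Returns an updated copy of `values`.
    """
    values = values.copy()
    traces = np.zeros_like(values)
    for state, reward, next_state, done in transitions:
        # One-step bootstrapped target; terminal states contribute no future value.
        target = reward + (0.0 if done else gamma * values[next_state])
        td_error = target - values[state]
        # Decay all traces, then mark the state just visited.
        traces *= gamma * lam
        traces[state] += 1.0
        # Every recently visited state shares in the TD error.
        values += alpha * td_error * traces
    return values

# Toy chain 0 -> 1 -> 2 (terminal), with reward 1 on the final step.
episode = [(0, 0.0, 1, False), (1, 1.0, 2, True)]
updated = td_lambda_episode(np.zeros(3), episode)
print(updated)
```

After this single episode, state 0 already receives part of the final reward through its decayed trace, which is what distinguishes TD(lambda) from one-step TD.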

Only the first pages of the thesis are in French; the rest is in English.


A few interactive (win32) swimmer demos (click in the window to change swimming direction):

  • swimmer3.exe
  • swimmer4-Slow.exe
  • swimmer4-Fast.exe
  • sw5-l0.exe (beginner)
  • sw5-l5.exe (expert)
  • sw5-l6.exe (performance drop)

Source code of the swimmer simulator:

  • swimmer.tar.bz2
  • RARSRLDemo.exe

Reinforcement learning for robots using neural networks

  Longxin Lin
Title: Reinforcement Learning for Portfolio Management.

Abstract: In this thesis, we develop a comprehensive account of the expressive power, modelling efficiency, and performance advantages of so-called trading agents (i.e., Deep Soft Recurrent Q-Network (DSRQN) and Mixture of Score Machines (MSM)), based on both traditional system identification (model-based approach) as well as on context-independent agents (model-free approach). The analysis provides conclusive support for the ability of model-free reinforcement learning methods to act as universal trading agents, which are not only capable of reducing the computational and memory complexity (owing to their linear scaling with the size of the universe), but also serve as generalizing strategies across assets and markets, regardless of the trading universe on which they have been trained. The relatively low volume of daily returns in financial market data is addressed via data augmentation (a generative approach) and a choice of pre-training strategies, both of which are validated against current state-of-the-art models. For rigour, a risk-sensitive framework which includes transaction costs is considered, and its performance advantages are demonstrated in a variety of scenarios, from synthetic time-series (sinusoidal, sawtooth and chirp waves), simulated market series (surrogate data based), through to real market data (S&P 500 and EURO STOXX 50). The analysis and simulations confirm the superiority of universal model-free reinforcement learning agents over current portfolio management models in asset allocation strategies, with an achieved performance advantage of as much as 9.2% in annualized cumulative returns and 13.4% in annualized Sharpe Ratio.
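
For context on the reported metric, here is a minimal sketch of how an annualized Sharpe ratio is commonly computed from daily returns net of transaction costs. The 252-trading-day convention, the zero risk-free rate, the proportional cost model and the sample numbers are all assumptions made for illustration; the thesis's exact evaluation protocol is not given in the abstract.

```python
import numpy as np

def net_returns(gross_returns, turnover, cost_rate=0.001):
    """Subtract a proportional transaction cost (hypothetical model) from gross returns."""
    return np.asarray(gross_returns, dtype=float) - cost_rate * np.asarray(turnover, dtype=float)

def annualized_sharpe(daily_returns, periods_per_year=252, risk_free_daily=0.0):
    """Annualized Sharpe ratio: sqrt(periods) * mean(excess) / std(excess)."""
    excess = np.asarray(daily_returns, dtype=float) - risk_free_daily
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

# Made-up daily portfolio returns and daily turnover (fraction of portfolio traded).
gross = [0.010, -0.005, 0.007, 0.002]
turnover = [0.5, 0.2, 0.0, 0.3]

sharpe_gross = annualized_sharpe(gross)
sharpe_net = annualized_sharpe(net_returns(gross, turnover))
print(f"gross: {sharpe_gross:.3f}, net of costs: {sharpe_net:.3f}")
```

Comparing the gross and net figures shows how transaction costs enter a risk-adjusted evaluation.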
  1. Data Efficient Reinforcement Learning

    Reinforcement learning (RL) has recently emerged as a generic yet powerful solution for learning complex decision-making policies, providing the key foundational underpinnings of recent successes in various domains, such as game playing and robotics. ... Xu-zhihu-PhD-EECS-2021-thesis.pdf Size: 16.46Mb Format: PDF Description: Thesis PDF. View ...

  2. Exploration and Safety in Deep Reinforcement Learning

    In this thesis, we address these challenges in the deep reinforcement learning setting by modifying the underlying optimization problem that agents solve, incentivizing them to explore in safer or more-efficient ways.

  3. PDF Robust and Adaptive Decision-Making: A Reinforcement Learning Perspective

    A Reinforcement Learning Perspective Wanqi Xue School of Computer Science and Engineering A thesis submitted to the Nanyang Technological University in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2023. Statement of Originality

  4. 2022 PhD Thesis Award: Kaiqing Zhang, "Reinforcement Learning for Multi

    Recent years have witnessed tremendous successes of AI and machine learning, especially reinforcement learning (RL), in solving many decision-making and control tasks. However, many RL algorithms are still miles away from being applied to practical autonomous systems, which usually involve more complicated scenarios with model uncertainty and multiple decision-makers by nature. In this talk, I ...

  5. Towards efficient and robust reinforcement learning via synthetic

    Over the past decade, Deep Reinforcement Learning (RL) has driven many advances in sequential decision-making, including remarkable applications in superhuman Go-playing, robotic control, and automated algorithm discovery. However, despite these successes, deep RL is also notoriously

  6. PDF Deep Learning and Reward Design for Reinforcement Learning

    Deep Learning and Reward Design for Reinforcement Learning by Xiaoxiao Guo A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan 2017 Doctoral Committee: Professor Satinder Singh Baveja, Co-Chair Professor Richard L. Lewis, Co-Chair

  7. PDF Deep Reinforcement Learning for Complex Manipulation Tasks with Sparse

    the learning process. Lastly, HER cannot be applied for sequential manipulation tasks, which significantly limits its practical application. 1.2 Research Objective This thesis is about enabling manipulators to learn new challenging skills from sparse feedback using deep reinforcement learning algorithms. We aim

  8. PDF Deep Reinforcement Learning for Adaptive Control In Robotics

    DEEP REINFORCEMENT LEARNING FOR ADAPTIVE CONTROL IN ROBOTICS By Luke Bhan Thesis Submitted to the Faculty of the Graduate School of Vanderbilt University in partial fulfillment of the requirements for the degree of MASTER of SCIENCE in Computer Science May 13, 2022 Nashville, Tennessee Approved: Gautam Biswas, Ph.D. Marcos Quinnones-Grueiro, Ph.D.

  9. Sample-Efficient Deep Reinforcement Learning for Continuous Control

    Reinforcement learning (RL) is a powerful, generic approach to discovering optimal policies in complex sequential decision-making problems. Recently, with flexible function approximators such as neural networks, RL has greatly expanded its realm of applications, from playing computer games with pixel inputs, to mastering the game of Go, to learning parkour movements by simulated humanoids.

  10. PDF Efficient Reinforcement Learning using Gaussian Processes

    Efficient Reinforcement Learning using Gaussian Processes Marc Peter Deisenroth Dissertation November 22, 2010 Revised October 23, 2011 Original version available at ... Uwe D. Hanebeck for accepting me as an external PhD student and for his longstanding support since my undergraduate student times. I am deeply grateful to my supervisor Dr. Carl ...

  11. Inductive biases and generalisation for deep reinforcement learning

    In this thesis we aim to improve generalisation in deep reinforcement learning. Generalisation is a fundamental challenge for any type of learning, determining how acquired knowledge can be transferred to new, previously unseen situations. We focus on reinforcement learning, a framework describing

  12. Deep multi-agent reinforcement learning

    Deep multi-agent reinforcement learning. Abstract: A plethora of real world problems, such as the control of autonomous vehicles and drones, packet delivery, and many others consists of a number of agents that need to take actions based on local observations and can thus be formulated in the multi-agent reinforcement learning (MARL) setting.

  13. PDF On-Policy Deep Reinforcement Learning

    In this thesis we tackle these issues in the context of on-policy Deep Reinforcement Learning (DRL), both theoretically and algorithmically. This work addresses both the discounted and average reward criteria. In the first part of this thesis, we develop theory for average reward on-policy reinforcement learning by extending recent results

  14. Reinforcement Learning with Deep Q-Networks

    these DNNs have been applied to reinforcement learning tasks with state-of-the-art results using Deep Q-Networks (DQNs) based on the Q-Learning algorithm. However, the DQN training process is different from standard DNNs and poses significant challenges for certain reinforcement learning environments.

  15. PDF Reinforcement Learning with Sparse and Multiple Rewards

    of developing autonomous learning. In this thesis we will present methods to increase the autonomy of reinforcement learning algorithms, i.e., learning without expert pre-engineering, by addressing the issues discussed above. The key points of our research address (1) techniques to deal with multiple conflicting reward

  16. Deep reinforcement learning-based dynamic scheduling

    In this thesis, a deep multi-agent reinforcement learning (deep MARL) architecture is proposed to solve the dynamic scheduling problem (DSP). The deep reinforcement learning (DRL) algorithm is used to train the decentralized scheduling agents, to capture the relationship between information on the factory floor and scheduling objectives, with ...

  17. Meta Reinforcement Learning through Memory

    Meta Reinforcement Learning through Memory. Download (20.02 MB) thesis. posted on 2022-12-02, 11:40 authored by Emilio Parisotto. Modern deep reinforcement learning (RL) algorithms, despite being at the forefront of artificial intelligence capabilities, typically require a prohibitive amount of training samples to reach a human-equivalent level ...

  18. Reinforcement Learning Using Neural Networks, with Applications to

    PhD thesis, by Rémi Coulom. Abstract. This thesis is a study of practical methods to estimate value functions with feedforward neural networks in model-based reinforcement learning. Focus is placed on problems in continuous time and space, such as motor-control tasks. In this work, the continuous TD(lambda) algorithm is refined to handle ...

  19. Tech Reports

    This model represents one small but important step towards more useful dynamics models in model-based reinforcement learning. This thesis concludes with future directions on the synergy of prediction and control in MBRL, primarily focused on state-abstractions, temporal correlation, and future prediction methodologies.

  20. Reinforcement learning and planning for autonomous agent navigation

    PhD thesis Faculty Faculty of Science (FNWI) Institute ... The machine learning paradigm of reinforcement learning (RL) enables learning (neural network) policies for decision making through continuous interaction with the environment. However, if the rewards that are received as feedback are sparse, improving the policy gets difficult and ...

  21. Reinforcement learning for robots using neural networks

    This dissertation concludes that it is possible to build artificial agents that can acquire complex control policies effectively by reinforcement learning and enable its applications to complex robot-learning problems. Reinforcement learning agents are adaptive, reactive, and self-supervised. The aim of this dissertation is to extend the state of the art of reinforcement learning and enable ...

  22. PDF Optimizing Expectations: From Deep Reinforcement Learning to Stochastic

    This thesis is mostly focused on reinforcement learning, which is viewed as an optimization problem: maximize the expected total reward with respect to the parameters of the policy. The first part of the thesis is concerned with making policy gradient meth- ... Reinforcement learning can be viewed as a special case of optimizing an expectation,

  23. [1909.09571] Reinforcement Learning for Portfolio Management

    Reinforcement Learning for Portfolio Management. In this thesis, we develop a comprehensive account of the expressive power, modelling efficiency, and performance advantages of so-called trading agents (i.e., Deep Soft Recurrent Q-Network (DSRQN) and Mixture of Score Machines (MSM)), based on both traditional system identification (model-based ...