Loss Weighting in Multi-task Learning

Background: loss weights in multi-task learning are typically uniform or manually tuned. GradNorm (Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks) addresses this. Task imbalances impede proper…

Overview of Industrial CTR Prediction

Hand-crafted Feature Engineering 1. Feature Interaction Learning: no interaction, pair-wise interaction (inner-product, outer-product, convolutional, attention, etc.), high-order interaction (explicit, implicit). No interaction: LR, GBDT+LR…

LOSS: Wasserstein Distance

A TensorFlow implementation of the Wasserstein distance, following SciPy's wasserstein_distance:
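As a sketch of the computation the post implements, here is a NumPy version mirroring SciPy's `wasserstein_distance` for unweighted 1-D samples (the TensorFlow version replaces these array ops with the corresponding tensor ops — this standalone sketch is not the post's exact code):

```python
import numpy as np

def wasserstein_1d(u_values, v_values):
    """1-D Wasserstein-1 distance between two unweighted samples.

    Mirrors scipy.stats.wasserstein_distance: integrate the absolute
    difference of the two empirical CDFs over the merged support.
    """
    u_sorted = np.sort(u_values)
    v_sorted = np.sort(v_values)
    all_values = np.sort(np.concatenate([u_sorted, v_sorted]))
    # widths of the intervals between consecutive merged sample points
    deltas = np.diff(all_values)
    # empirical CDFs evaluated at the left endpoint of each interval
    u_cdf = np.searchsorted(u_sorted, all_values[:-1], side="right") / len(u_sorted)
    v_cdf = np.searchsorted(v_sorted, all_values[:-1], side="right") / len(v_sorted)
    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

print(wasserstein_1d([0.0, 1.0, 3.0], [5.0, 6.0, 8.0]))  # 5.0
```

Shifting a sample by a constant shifts the distance by that constant, which gives a quick sanity check: the second sample above is the first shifted by 5, so the distance is 5.0.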

Natural Policy Gradient

The key idea underlying policy gradients is to push up the probabilities of actions that lead to higher return, and push down the probabilities of actions that lead to lower return, until you arrive at the optimal policy.
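Concretely, this "push up / push down" is what the standard policy gradient estimator computes: the gradient of expected return with respect to the policy parameters $\theta$ is

```latex
\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}
\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]
```

so trajectories $\tau$ with high return $R(\tau)$ increase the log-probability of their actions, and low-return trajectories decrease it.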

Trust Region Policy Optimization (TRPO)

This is different from normal policy gradient, which keeps new and old policies close in parameter space. But even seemingly small differences in parameter space can have very large differences in performance—so a single bad step can collapse the policy performance. This makes it dangerous to use large step sizes with vanilla policy gradients, thus hurting its sample efficiency. TRPO nicely avoids this kind of collapse, and tends to quickly and monotonically improve performance.
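In the standard formulation (a sketch, not this post's exact notation), TRPO enforces closeness in distribution space by solving a KL-constrained surrogate maximization at each update:

```latex
\theta_{k+1} = \arg\max_{\theta}\;
\mathbb{E}_{s,a \sim \pi_{\theta_k}}
\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)} A^{\pi_{\theta_k}}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s \sim \pi_{\theta_k}}\!\left[ D_{\mathrm{KL}}\big(\pi_{\theta_k}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right] \le \delta
```

The KL bound $\delta$, not a parameter-space step size, controls how far the policy can move, which is what prevents the collapse described above.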

Prioritized Double DQN

This post is mainly summarized from the original PER paper, but with more detailed and illustrative explanations. We will go through the key ideas and implementation of…

Deep Q-Learning Series (DQN)

DQN-series reinforcement learning algorithms learn using Deep Q-Networks. These algorithms include Deep Q-learning (DQN), Double Deep Q-learning (Double…

Temporal Difference Learning from Scratch

This post highlights basic temporal difference learning theory and algorithms that contribute much to more advanced topics like Deep Q-Learning (DQN), double DQN,…

Dynamic Programming in Reinforcement Learning

Dynamic Programming (DP) plays an important role in traditional approaches to solving reinforcement learning problems with known dynamics. This post focuses on a more classical view…

Asynchronous Advantage Actor Critic (A3C)

Policy-based methods directly learn a parameterized policy. Value-based methods learn the values of actions, and then select actions based on their estimated action values. In policy-based…

Policy Gradient Methods Overview

Policy gradient methods offer advantages such as better convergence properties, effectiveness in high-dimensional or continuous action spaces, and the ability to learn stochastic policies. 1.…

Dive into Deep Reinforcement Learning

As part of my focus, I have spent more than two years trying to make every part of the reinforcement learning system link together…

Twin Delayed DDPG (TD3) Walkthrough

This post is organized as follows: 1. Problem Formulation 2. Reducing Overestimation Bias 3. Addressing Variance 4. Pseudocode and Implementation 5. Reference. TD3 Overview: Twin Delayed Deep Deterministic Policy…

Deep Deterministic Policy Gradient (DDPG)

Deep Deterministic Policy Gradient (DDPG) is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy.
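In the standard formulation (a sketch of the usual updates, not this post's exact notation), the two learning problems interlock as follows. The critic $Q_\phi$ minimizes the mean-squared Bellman error against a bootstrapped target built from slowly updated target networks $Q_{\phi_{\text{targ}}}$ and $\mu_{\theta_{\text{targ}}}$:

```latex
L(\phi) = \mathbb{E}_{(s,a,r,s',d) \sim \mathcal{D}}
\left[ \Big( Q_\phi(s,a) - \big( r + \gamma (1-d)\, Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s')) \big) \Big)^2 \right]
```

while the deterministic policy $\mu_\theta$ is updated to maximize the critic's estimate:

```latex
\max_\theta \; \mathbb{E}_{s \sim \mathcal{D}} \left[ Q_\phi(s, \mu_\theta(s)) \right]
```

Because $\mu_\theta$ is deterministic and differentiable, the second objective can be optimized by backpropagating through $Q_\phi$, which is what makes DDPG applicable to continuous action spaces.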

Welcome to this new website!

Dear friend, welcome to this website! I set up this website with the intention of recording and sharing personal knowledge on special topics in the AI domain. Here,…