Off-policy learning for a turbocharged diesel engine
This is Numerical Example 2.2 in Chapter 2 of the book Robust Adaptive Dynamic Programming by Yu Jiang and Zhong-Ping Jiang.
Contents

    Parameters configuration
    Run the simulation and obtain the data matrices \delta_{xx}, I_{xx}, and I_{xu}
    Off-policy learning using the collected online data
    Post-learning simulations
    Plotting results
    Display results
Parameters configuration
xn = 6; un = 2;

% Set the weighting matrices for the cost function
Q = diag([100 0 0 0 0 100]);
R = diag([1 1]);

% Initialize the feedback gain matrix
Kinit = zeros(un,xn);   % K can be set to zero only if A is Hurwitz
K = Kinit;

N = 100;                % Length of the window, should be at least xn^2+2*xn*un
MaxIteration = 10;      % Maximum number of iterations
T = 0.01;               % Length of each integration interval

x0 = [20;5;10;2;-1;-2];                    % Initial condition
expl_noise_freq = (rand(un,100)-.5)*100;   % Exploration noise frequencies

% Matrices to collect online data and perform learning
Dxx=[]; Ixx=[]; Ixu=[];

% Initial condition of the augmented system
X=[x0;kron(x0',x0')';kron(x0,zeros(un,1))]';
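For reference, Q and R above define the standard quadratic cost that the controller should minimize, and the comparison solution [Kopt,Popt] = lqr(A,B,Q,R) computed later is the stabilizing solution of the associated algebraic Riccati equation:

J(x_0;u) = \int_0^{\infty} \big( x^\top Q x + u^\top R u \big)\, dt,
\qquad
A^\top P^{*} + P^{*}A - P^{*} B R^{-1} B^\top P^{*} + Q = 0,
\qquad
K^{*} = R^{-1} B^\top P^{*}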
Run the simulation and obtain the data matrices \delta_{xx}, I_{xx}, and I_{xu}
ifLearned = 0;
x_save=[];
t_save=[];

for i=1:N
    % Simulate the system and, at the same time, collect online info.
    [t,X] = ode45(@(t,x)aug_sys(t,x,K,ifLearned,expl_noise_freq), ...
        [(i-1)*T,i*T],X(end,:));

    % Append new data to the data matrices
    Dxx=[Dxx;kron(X(end,1:xn),X(end,1:xn))-kron(X(1,1:xn),X(1,1:xn))];
    Ixx=[Ixx;X(end,xn+1:xn+xn^2)-X(1,xn+1:xn+xn^2)];
    Ixu=[Ixu;X(end,xn+xn^2+1:end)-X(1,xn+xn^2+1:end)];

    % Keep track of the system trajectories
    x_save=[x_save;X];
    t_save=[t_save;t];
end

ifLearned = 1;        % Mark data collection as finished

% Initialize the previous and current cost matrices
P_old = zeros(xn);
P = eye(xn)*10;

it = 0;               % Counter for iterations
p_save = [];          % Track the cost matrices in all the iterations
k_save = [];          % Track the feedback gain matrix in each iteration

% The system matrices are copied here only for the purpose of analyzing the
% results
A = [  -0.4125   -0.0248    0.0741    0.0089    0          0;
      101.5873   -7.2651    2.7608    2.8068    0          0;
        0.0704    0.0085   -0.0741   -0.0089    0          0.0200;
        0.0878    0.2672    0        -0.3674    0.0044     0.3962;
       -1.8414    0.0990    0         0        -0.0343    -0.0330;
        0         0         0      -359       187.5364   -87.0316];
B = [ -0.0042    0.0064;
      -1.0360    1.5849;
       0.0042    0;
       0.1261    0;
       0        -0.0168;
       0         0];

[Kopt,Popt] = lqr(A,B,Q,R);   % Calculate the ideal solution for comparison purposes

k_save = norm(K-Kopt);        % Keep track of the difference between the actual K
                              % and the ideal value
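The helper aug_sys called by ode45 is a separate file in the book's companion code and is not reproduced in this listing. A minimal sketch is given below, assuming the augmented state stacks the plant state x, the running integral of kron(x,x), and the running integral of kron(x,u), so that the differences taken above yield \delta_{xx}, I_{xx}, and I_{xu}; the exact shape and amplitude of the exploration noise are assumptions here.

function dX = aug_sys(t, X, K, ifLearned, expl_noise_freq)
% Sketch of the augmented dynamics integrated by ode45 (assumed structure).
xn = 6;

% Plant matrices (same values as listed in the main script)
A = [  -0.4125   -0.0248    0.0741    0.0089    0          0;
      101.5873   -7.2651    2.7608    2.8068    0          0;
        0.0704    0.0085   -0.0741   -0.0089    0          0.0200;
        0.0878    0.2672    0        -0.3674    0.0044     0.3962;
       -1.8414    0.0990    0         0        -0.0343    -0.0330;
        0         0         0      -359       187.5364   -87.0316];
B = [ -0.0042    0.0064;
      -1.0360    1.5849;
       0.0042    0;
       0.1261    0;
       0        -0.0168;
       0         0];

x = X(1:xn);                                     % current plant state

if ifLearned
    u = -K*x;                                    % post-learning: feedback only
else
    u = -K*x + sum(sin(expl_noise_freq*t), 2);   % feedback plus exploration noise
end                                              % (noise shape/amplitude assumed)

dX = [A*x + B*u;        % plant dynamics
      kron(x, x);       % integrand accumulated into I_{xx}
      kron(x, u)];      % integrand accumulated into I_{xu}
end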
Off-policy learning using the collected online data
while norm(P-P_old)>1e-8 && it<MaxIteration
    it = it+1;                         % Update the iteration counter
    P_old = P;                         % Update the previous cost matrix
    QK = Q+K'*R*K;                     % Update the Qk matrix

    Theta = [Dxx,-Ixx*kron(eye(xn),K')-Ixu];   % Left-hand side of the key equation
    Xi = -Ixx*QK(:);                           % Right-hand side of the key equation
    pp = pinv(Theta)*Xi;                       % Solve the equations in the least-squares sense

    P = reshape(pp(1:xn*xn), [xn, xn]);
    P = (P + P')/2;                    % Reconstruct the symmetric cost matrix

    BPv = pp(end-(xn*un-1):end);
    K = inv(R)*reshape(BPv,un,xn)/2;   % Get the improved gain matrix

    p_save = [p_save,norm(P-Popt)];    % Keep track of the cost matrix
    k_save = [k_save,norm(K-Kopt)];    % Keep track of the control gains

    disp(['K_', num2str(it), '=']); disp(K);
end
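The "key equation" referenced in the comments is the off-policy policy-evaluation identity from Chapter 2: along trajectories of the real system driven by any input u,

x^\top(t+\delta t)\,P_k\,x(t+\delta t) - x^\top(t)\,P_k\,x(t)
  = -\int_t^{t+\delta t} x^\top \big(Q + K_k^\top R K_k\big)\, x \, d\tau
    + 2\int_t^{t+\delta t} \big(u + K_k x\big)^\top B^\top P_k\, x \, d\tau

Applying vec(.) and Kronecker-product identities to each data window gives one row of the linear system Theta*[vec(P_k); vec(2*B'*P_k)] = Xi assembled above; its least-squares solution yields P_k together with B^\top P_k, from which the improved gain K_{k+1} = R^{-1} B^\top P_k is recovered by the reshape of BPv.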
K_1=
   47.8326    0.7785    2.9006   42.1547  -24.8884   -0.0909
   38.7653    0.4848    0.5799   20.7823  -11.6976    0.0712

K_2=
   20.3865    0.4444    1.2043   27.3859  -15.6164   -0.1539
   18.1106    0.2602    0.0321   11.4403   -6.2478    0.0293

K_3=
   11.3687    0.3886    0.4585   24.9271  -13.7566   -0.1668
    9.6886    0.2055    0.0395    9.3035   -5.0999    0.0184

K_4=
   10.7552    0.3875    0.4006   24.8447  -13.6680   -0.1672
    9.0006    0.2031    0.0681    9.2034   -5.0791    0.0180

K_5=
   10.7537    0.3875    0.4006   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0682    9.2032   -5.0796    0.0180

K_6=
   10.7537    0.3875    0.4006   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0682    9.2032   -5.0796    0.0180

K_7=
   10.7537    0.3875    0.4005   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0682    9.2032   -5.0796    0.0180

K_8=
   10.7537    0.3875    0.4005   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0682    9.2032   -5.0796    0.0180

K_9=
   10.7537    0.3875    0.4005   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0682    9.2032   -5.0796    0.0180

K_10=
   10.7537    0.3875    0.4006   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0682    9.2032   -5.0796    0.0180
Post-learning simulations
[tt,xx]=ode45(@(t,x)aug_sys(t,x,K,ifLearned,expl_noise_freq), ...
    [t(end) 10],X(end,:)');

% Keep track of the post-learning trajectories
t_final = [t_save;tt];
x_final = [x_save;xx];

% For comparison, also get the unlearned simulation
[ttt,xxx]=ode45(@(t,x)aug_sys(t,x,Kinit,ifLearned,expl_noise_freq), ...
    [t(end) 10],X(end,:)');
t_unlearned = [t_save;ttt];
x_unlearned = [x_save;xxx];
Plotting results
figure(1)
subplot(211)
plot(0:length(p_save)-1,p_save,'o',0:length(p_save)-1,p_save,'Linewidth',2);
ylim([-max(p_save)*0.2 max(p_save)*1.2]);
legend('||P_k-P^*||')
xlabel('Number of iterations')

subplot(212)
plot(0:length(k_save)-1,k_save,'^',0:length(k_save)-1,k_save,'Linewidth',2)
ylim([-max(k_save)*0.2 max(k_save)*1.2]);
legend('||K_k-K^*||')
xlabel('Number of iterations')

% Uncomment to save figure
% print('Ch2_ex2_fig1_pk','-depsc')

figure(2)
plot(t_final,x_final(:,1:6),'Linewidth',2)
legend('x_1','x_2','x_3','x_4','x_5','x_6')
xlabel('Time (sec)')

figure(3)
plot(t_final, 3.6*x_final(:,6),'k-', ...
    ttt, 3.6*xxx(:,6),'r--', ...
    'Linewidth',2)
ylim([-400 150])
legend('MAF (Under ADP)', 'MAF (Unlearned)')
xlabel('Time (sec)','FontSize',12)

% Create textarrow
annotation(figure(3),'textarrow',[0.2630173564753 0.218958611481976],...
    [0.166023166023166 0.198841698841699],'String',{'Controller Updated'},...
    'FontSize',14);

% Uncomment to save figure
% print('Ch2_ex2_fig2_y','-depsc')
Display results
disp('Approximate Cost Matrix')
P
disp('Optimal Cost Matrix')
Popt
disp('Approximate Gain Matrix')
K
disp('Optimal Gain Matrix')
Kopt
Approximate Cost Matrix

P =
  766.7586    1.4019   97.3262  119.0937 -111.2552    0.6265
    1.4019    0.1032   -0.6581    3.9893   -1.8221    0.0150
   97.3262   -0.6581   71.6942   -1.3764  -29.0644    0.0230
  119.0937    3.9893   -1.3764  233.8101 -126.0950   -1.1828
 -111.2552   -1.8221  -29.0644 -126.0950   88.0805    0.5859
    0.6265    0.0150    0.0230   -1.1828    0.5859    0.5687

Optimal Cost Matrix

Popt =
  766.7586    1.4020   97.3265  119.0937 -111.2549    0.6265
    1.4020    0.1032   -0.6581    3.9893   -1.8221    0.0150
   97.3265   -0.6581   71.6915   -1.3767  -29.0678    0.0230
  119.0937    3.9893   -1.3767  233.8101 -126.0952   -1.1828
 -111.2549   -1.8221  -29.0678 -126.0952   88.0775    0.5859
    0.6265    0.0150    0.0230   -1.1828    0.5859    0.5687

Approximate Gain Matrix

K =
   10.7537    0.3875    0.4006   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0682    9.2032   -5.0796    0.0180

Optimal Gain Matrix

Kopt =
   10.7537    0.3875    0.4006   24.8446  -13.6677   -0.1672
    8.9983    0.2031    0.0681    9.2032   -5.0796    0.0180
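As a quick sanity check (not part of the original script), the learned matrices printed above should approximately satisfy the LQR relations K = R^{-1} B^\top P and the algebraic Riccati equation; both residuals can be evaluated directly in the workspace:

% Sanity check (not in the original script): verify the LQR relations
% satisfied by the learned pair (P, K).
gain_gap     = norm(K - R\(B'*P));                      % should be near zero
are_residual = norm(A'*P + P*A - P*B*(R\(B'*P)) + Q);   % ARE residual of the learned P
disp([gain_gap, are_residual])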