Can anyone interpret this Q-learning code?

I found this relatively simple Arduino code for controlling a self-learning, two-servo crawling robot, but I just cannot wrap my head around how he is using the R array to represent the servos. Can anyone interpret it for me? His website is here and it gives some detail, but not enough for me: https://planetachatbot.com/q-learning-con-arduino-crawling-robot-espanol-5eb0acf5aaaf

Here is the code:
/*
  Q_Learning Robot
  by: Erick M. Sirpa
*/
#include <Servo.h>

void Mostrar(float Q[][4]);

float distancia;
float tiempo;
int TRIGGER = 8, ECHO = 7;
Servo servo1, servo2;
int valor = 0;
int act = 0;
int ang = 40;
int ang2 = 0;
int ang_t = 0;
int ang2_t = 0;

float Q[16][4] = {
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0},
  {0, 0, 0, 0}};

int action = 0;
int state = 0;
int cont = 0;
float gamma = 0.8;
float Qmax = 0;
float a = 0, b = 0;
int x = 0;
int goal = 15;
void setup() {
  servo1.attach(9);
  servo2.attach(6);

  pinMode(TRIGGER, OUTPUT);
  pinMode(ECHO, INPUT);
  Serial.begin(9600);

  float R[16][4] = {
    { 0, -1,  0, -1},
    {-1, -1,  0,  0},
    {-1, -1,  0,  0},
    {-1, -1, -1,  0},
    { 0,  0,  0, -1},
    {-1,  0,  0,  0},
    {-1,  0,  0,  0},
    {-1,  0, -1,  0},
    { 0,  0,  0, -1},
    {-1,  0,  0,  0},
    {-1,  0,  0,  0},
    {-1,  0, -1,  0},
    {-1,  0,  0, -1},
    {-1, -1,  0,  0},
    {-1, -1, 1000, 0},
    {-1,  0, -1,  0}};

  int pos[16][3] = {
    {0, 2, 0},
    {2, 3, 0},
    {2, 3, 0},
    {3, 3, 0},
    {0, 1, 2},
    {2, 3, 0},
    {2, 3, 0},
    {3, 3, 0},
    {0, 1, 2},
    {2, 3, 0},
    {2, 3, 0},
    {3, 3, 0},
    {1, 2, 1},
    {2, 3, 3},
    {2, 3, 3},
    {1, 3, 3},
  };

  int nstate = 0;
  float diff = 0, d_prom = 0, d_ant = 0, d_new = 0;
  float point = 0;
  int cc = 0;

  for (int d = 0; d < 40; d++) {
    d_prom = dist() + d_prom;
    delay(100);
  }
  d_ant = d_prom / 20;
  Serial.println(d_ant);
  delay(1000);

  for (int epoca = 0; epoca < 10; epoca++) {
    randomSeed(analogRead(0));
    state = random(15);
    ang = 40;
    ang2 = 0;

    while (state != goal) {
      ang_t = ang;
      ang2_t = ang2;
      cc = 0;
      cont++;
      x = random(2);

      action = pos[state][x];

      if (action == 0) {
        nstate = state + 4;
        ang = ang + 20;
        ang2 = 0;
      }
      else if (action == 1) {
        nstate = state - 4;
        ang = ang - 20;
        ang2 = 0;
      }
      else if (action == 2) {
        nstate = state + 1;
        ang2 = ang2 + 45;
      }
      else if (action == 3) {
        nstate = state - 1;
        ang2 = ang2 - 45;
      }

      servoVelocidad(servo1, ang_t, ang, 5);
      servoVelocidad(servo2, ang2_t, ang2, 5);
      d_new = dist();
      diff = d_new - d_ant;
      d_ant = d_new;

      if (diff >= 1.9) {
        point = map(diff, 1, 4, 5, 10);
        R[nstate][action] = point;
        Serial.println(point);
      }
      Serial.println(" ");

      a = -10;
      for (int i = 0; i < 4; i++) {
        if (a < Q[nstate][i]) {
          a = Q[nstate][i];
        }
      }

      Qmax = a * gamma;
      Q[state][action] = R[state][action] + Qmax;
      state = nstate;
    }
  }
  Mostrar(R);
  Serial.println(" ");
  Serial.println(" ");
  Mostrar(Q);
}
void loop() {
  state = random(3);
  ang = 40;
  ang2 = 0;
  while (state != goal) {
    b = -10;

    for (int i = 0; i < 4; i++) {
      if (b <= Q[state][i]) {
        b = Q[state][i];
        act = i;
      }
    }
    ang_t = ang;
    ang2_t = ang2;
    if (act == 0) {
      state = state + 4;
      ang = ang + 20;
      ang2 = 0;
    }
    else if (act == 1) {
      state = state - 4;
      ang = ang - 20;
      ang2 = 0;
    }
    else if (act == 2) {
      state = state + 1;
      ang2 = ang2 + 45;
    }
    else if (act == 3) {
      state = state - 1;
      ang2 = ang2 - 45;
    }

    servoVelocidad(servo1, ang_t, ang, 25);
    servoVelocidad(servo2, ang2_t, ang2, 25);
  }
}
void Mostrar(float Q[][4]) {
  for (int i = 0; i < 16; i++) {
    for (int j = 0; j < 4; j++) {
      Serial.print(Q[i][j]);
      Serial.print(" ");
    }
    Serial.println(" ");
  }
}
float dist() {
  digitalWrite(TRIGGER, LOW);
  delayMicroseconds(2);
  digitalWrite(TRIGGER, HIGH);
  delayMicroseconds(10);
  digitalWrite(TRIGGER, LOW);
  // Compute the distance by timing how long the ECHO pin stays HIGH
  tiempo = pulseIn(ECHO, HIGH);

  distancia = tiempo / 58.00;

  /*
  Serial.print(distancia);
  Serial.println("cm");
  delay(100);
  */
  return distancia;
}
void servoVelocidad(Servo servo, int anguloA, int anguloB, int velocidad) {
  if (anguloA < anguloB) {
    for (int angulo = anguloA; angulo <= anguloB; angulo = angulo + 2) {
      servo.write(angulo);
      delay(velocidad);
    }
  }
  else {
    for (int angulo = anguloA; angulo >= anguloB; angulo = angulo - 2) {
      servo.write(angulo);
      delay(velocidad);
    }
  }
}

I don’t have too much experience in Reinforcement Learning, and I just scrolled through the code a few times, so take my answer with a grain of salt.
The rows of the R matrix are the states (servo positions) and the columns are the actions (move servo1 by +20 or -20 degrees, move servo2 by +45 or -45 degrees), so an element of this matrix gives the reward for taking a given action in a given state. The -1 values mark actions that cannot be taken in the corresponding state. For example, the first row of R is {0, -1, 0, -1} because both servos are at their lowest orientation, so you cannot perform the -20 and -45 actions; the last row is the opposite, {-1, 0, -1, 0}: both servos are at their upmost position, so you cannot perform the +20 and +45 actions. For every other state you have to look at what orientations the servos have, and if a movement cannot be taken you set that action's reward to -1.
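If I'm reading the state numbering right, the state index is just the two servo positions packed into one number from 0 to 15 (the names below are mine, not from the sketch):

// My interpretation, not from the original code: servo1 moves in 20-degree
// steps starting at 40 degrees, servo2 in 45-degree steps starting at 0 degrees,
// four positions each, so the state packs the two position counters together.
int stateFromAngles(int ang, int ang2) {
  int servo1Step = (ang - 40) / 20;  // 0..3 -> 40, 60, 80, 100 degrees
  int servo2Step = ang2 / 45;        // 0..3 -> 0, 45, 90, 135 degrees
  return 4 * servo1Step + servo2Step;
}

That would match the transitions in the sketch: action 0 (servo1 +20) goes to state+4, action 1 (servo1 -20) to state-4, action 2 (servo2 +45) to state+1, and action 3 (servo2 -45) to state-1. The sketch also zeroes ang2 whenever servo1 moves without touching the low part of the state, so the mapping is only approximate there.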
Personal note: I think we shouldn't have a goal (the 1000 value in the R matrix), because in this environment we can keep acting continuously; in a maze environment it makes sense to have a goal state because you don't want your agent to keep taking actions after it has reached the goal. The other thing I don't fully understand is why the R matrix entry for the next state and the chosen action is set from the distance score; the Q matrix should be updated with the current reward, so I don't think this is the correct formula. There is also no need to store the rewards at all, though it can be useful for debugging/visualization.

Thanks so much for looking at it. What you say makes a lot of sense, even with my poor understanding. I will keep working on it. You got a real understanding of what he is doing in a few passes, and I have been working on it for days. But I am starting to get a feel for it now.
Just your comment about the first row, where the servos are in the lowest orientation, is very helpful. That is where I could not visualize what was happening. I think I will build the machine and step through it now. Perhaps the algorithm can even be simplified a bit. I wish the author had commented his code a bit; it would have saved me some time. Still, it is interesting to work a bit with machine learning. I haven't done that before.
I can follow the simple maze applications of Q-learning, but I got lost in how to set up the arrays for the servos. Getting there, though.
Really, really appreciate your time and input; it was very helpful! Thanks.

You can make a nstate, reward = takeAction(state, action) function (you can return multiple values in C by making a struct or class), where state is the current state you are in and action is the action you want to perform in that state; the function gives you back the next state nstate and the value of the received reward. The function checks whether performing the action would drive the servos out of bounds. If not, the distance is measured, the reward is calculated from it, and the state is updated according to the action; if the servos would go out of bounds, you give back a reward of -1 and return the state unchanged. If you want to filter out illegal actions, you can make another function like action = randomValidAction(state), which gives a random valid action for a particular state.
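Roughly something like this, just as a sketch of the idea. StepResult and takeAction are names I made up; it assumes ang, ang2 and the last measured distance d_ant are kept as globals, reuses Erick's dist() and servoVelocidad(), and, unlike the original sketch, it does not zero ang2 when servo1 moves:

// Sketch only: apply an action, move the servos, and return the next state
// plus the reward, or -1 and the unchanged state for an illegal action.
struct StepResult {
  int nstate;
  float reward;
};

StepResult takeAction(int state, int action) {
  StepResult r;
  int s1 = state / 4;  // servo1 position index, 0..3
  int s2 = state % 4;  // servo2 position index, 0..3

  // Would this action push a servo past its end position?
  bool illegal = (action == 0 && s1 == 3) || (action == 1 && s1 == 0) ||
                 (action == 2 && s2 == 3) || (action == 3 && s2 == 0);
  if (illegal) {
    r.nstate = state;
    r.reward = -1;
    return r;
  }

  // Same moves as the original: 0/1 step servo1 by +/-20, 2/3 step servo2 by +/-45.
  if (action == 0) s1++;
  else if (action == 1) s1--;
  else if (action == 2) s2++;
  else s2--;
  r.nstate = 4 * s1 + s2;

  int newAng = 40 + 20 * s1;
  int newAng2 = 45 * s2;
  servoVelocidad(servo1, ang, newAng, 5);
  servoVelocidad(servo2, ang2, newAng2, 5);
  ang = newAng;
  ang2 = newAng2;

  // Reward is simply how much the measured distance changed.
  float d_new = dist();
  r.reward = d_new - d_ant;
  d_ant = d_new;
  return r;
}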

I have one more question and then I think I understand it pretty well, as I have been running the program and printing out results in the monitor.
In the valid-action table (the pos array), I understand that the values are 0-3, which represent moving servo1 and servo2 up and down, and that the rows of the pos array are states. What are the columns of the pos array representing, and why would he choose {0, 2, 0} for the values of the first row?

I don't understand why the pos array is needed, or why it has 3 columns.
What is certain is that the possible actions for the next state are sampled from this array instead of from the R matrix, which is redundant.
I think the author of the code noticed that every row in R shows at most 3 possible actions, which is why pos has 3 columns. For example, R[0] = {0, -1, 0, -1} means we can choose action index 0 or action index 2, so pos[0] = {0, 2, 0} (the second 0 just fills the empty slot). Another example: R[14] = {-1, -1, 1000, 0}, so the possible action indices are 2 and 3, and pos[14] = {2, 3, 3}. And last: R[8] = {0, 0, 0, -1}, so the possible action indices are 0, 1, 2 => pos[8] = {0, 1, 2}. But this hypothesis is not consistent with every row; it could be typos, or maybe I'm not fully understanding what is going on.
A valid action can be sampled directly from the R matrix: count the number of non -1 values in a given state (row), call this value nva, then call vai = random(nva). vai gives you the index of a valid action within the set of valid actions, so you can't use it as an action index directly; you have to count (from zero) the valid actions in the row, and when you arrive at vai you are at the index of a valid action.
Basically this method gives you a uniform sample from the valid actions in a given state (row).
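In code, sampling a valid action directly from R could look something like this (a sketch only; randomValidAction is just my name for it):

// Uniformly pick one of the valid (non -1) actions in a row of R.
int randomValidAction(float R[][4], int state) {
  int nva = 0;  // number of valid actions in this row
  for (int j = 0; j < 4; j++) {
    if (R[state][j] != -1) nva++;
  }
  int vai = random(nva);  // random(n) returns 0..n-1
  for (int j = 0; j < 4; j++) {
    if (R[state][j] != -1) {
      if (vai == 0) return j;  // the vai-th valid action, counting from zero
      vai--;
    }
  }
  return 0;  // not reached as long as the row has at least one valid action
}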


Got it. I agree: there are up to 3 possible choices, and if there are only two legitimate ones, then the other value is a placeholder. I think he has typos. So grateful for your analysis; you are a bright guy. I will build this now and play with the code a bit.
As an aside, I know he modeled his program after this Python program: https://towardsdatascience.com/introduction-to-q-learning-88d1c4f2b49c
so that is probably where he got the pos array idea.


Parting shot: this program is kind of cheating in that it always starts at the top of the servo stroke and proceeds to the goal, which is just the bottom. I think a more sophisticated example of machine learning would be to find the combination of servo movements that propels the machine forward, so the goal would be forward movement and not a particular set of servo positions. But I guess that would require an algorithm other than Q-learning. Still, it was a great project to pique my interest in machine learning, so I have no complaints, and again thanks for your excellent help.

Thanks for your kind words. I'm glad that my observation helped you.
You are right that hardcoding the servo positions is cheating... and it's not just cheating, it's unnecessary, and it might prevent the system from learning the optimal policy. In this case the learner only needs the distance change as the reward, so there is no need for a goal: in the learning phase you do a bunch of random steps (actions) and update the Q matrix, and after learning you just execute the policy by always choosing the action with the highest value in the current state. If the algorithm has learned enough, it should always crawl forward; you don't need to set/reset/check a goal state, and the crawler should start crawling from any given state (roughly like the loop sketched below).
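Roughly, the execution phase could be just this loop (a sketch, reusing the global state and Q matrix from Erick's code and the hypothetical takeAction() from my earlier post):

// Goal-free execution: always take the greedy action and keep going.
// Assumes the Q matrix was filled in during the learning phase.
void loop() {
  int bestAction = 0;
  float bestValue = -10000;
  for (int i = 0; i < 4; i++) {  // pick the action with the highest Q value
    if (Q[state][i] > bestValue) {
      bestValue = Q[state][i];
      bestAction = i;
    }
  }
  StepResult r = takeAction(state, bestAction);  // move the servos, measure reward
  state = r.nstate;  // no goal check: just keep crawling
}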
On the weekend I think I will have the time to implement my version of this Q-learner crawler with continuous learning, if you are interested.

Will be very interested in seeing your version working. I guess you will be using object-oriented code, which I just haven't gotten around to learning. I am building a robot now and should have it up and running this weekend with Erick's code. Then I may play around with my own. My idea was to just randomly select some positions until it comes up with an up-and-down combination that produces movement, and then optimize that combination with random +/- 5 degree tweaks until it finds the best movement: no arrays or Q-learning algorithm necessary. Not very elegant, but simple and perhaps entertaining. Anyway, this is really cool stuff.

Hi notaii. I got the robot built and working. I used Erick's code as-is. It does work, but I noticed a few problems.

  1. When I print out the servo positions that it sends to the servos, they occasionally come up as negative numbers. This is because he randomly selects a state at the beginning of each epoch, so the arm sometimes just sits there and doesn't move because it is generating negative servo positions. I set state = 0 at the start of each epoch and then no negative positions come up (the change is shown after this list). This means every epoch begins from the highest servo positions (40 and 0) and proceeds to the lowest (the goal).
  2. If x = random(2), then he is only using the first two columns of the pos array. Why use a 3-column array then? And that probably means some of the states will never be selected.
  3. Minor problem: he calls the sonar function 40 times to average the results but only divides by 20. No big deal, but it seems sloppy.
  4. Still can't figure out exactly how he arrived at his R array values. I drew out all the possible servo positions (states) and built my own pos array, and it doesn't look anything like his. I will try my values next and see if it works any better.
  Update: actually, I did figure out #4. His R array is correct.
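For the record, the change for #1 is just the first few lines of the epoch loop in setup(); everything else stays the same:

// Fix for #1: start every epoch from state 0 (servo1 at 40, servo2 at 0)
// instead of a random state, so the angle bookkeeping can't go negative.
for (int epoca = 0; epoca < 10; epoca++) {
  randomSeed(analogRead(0));
  state = 0;  // was: state = random(15);
  ang = 40;
  ang2 = 0;
  // ... rest of the while (state != goal) loop unchanged ...
}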

Hi! Sorry for the late reply.
I’m glad that you got the robot working.

  1. Servo angles are not tested for out of bounds; it relies on the Q-learning code not selecting actions that would drive the servos too far (a simple clamp, sketched after this list, would guard against that).

  2. Every R matrix row has at most 3 valid actions, which is why a 3-column pos array was used, but I think the author didn't know that the random function's max argument is non-inclusive, so random(2) will only generate the values 0 and 1.

  3. I don't know why it is only divided by 20, but it shouldn't affect the learning: the goal is to maximize the expected reward over time, and multiplying the reward by a constant greater than 0 doesn't change that.

  4. I only looked at the first and second rows of the R matrix, so I didn't know whether it was correct or not, but a few values in the pos matrix weren't consistent with R.
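If you want a belt-and-braces guard for #1, something like this could be dropped in right before the servo moves (just a sketch; constrain() is the standard Arduino macro, and the limits assume four positions per servo as in the original):

// Clamp the target angles to the legal range before moving the servos.
ang = constrain(ang, 40, 100);   // servo1 positions: 40, 60, 80, 100 degrees
ang2 = constrain(ang2, 0, 135);  // servo2 positions: 0, 45, 90, 135 degrees
servoVelocidad(servo1, ang_t, ang, 5);
servoVelocidad(servo2, ang2_t, ang2, 5);

And for #2, x = random(3) would return 0, 1 or 2, so all three pos columns would get used.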

I haven't built my robot yet, but I'm on it.

Hi notaii. Yes, I had worked out those issues: I corrected his R array and used all 3 values, so that it now uses every state and does not generate negative servo positions. I think it operates more smoothly now; I noticed his robot was kind of jumpy, even after the learning phase.
I am on to working on my own algorithm after I finish this silly little strandbeest I am working on. Jim.

Hi demej00, I want to know why the state update uses state + 4, state - 4, state + 1 and state - 1, and why there are 16 states.
Please help me.

Hi. Sorry, I don't have this robot anymore and I just don't think I can get my head back into it. It seems notaii has a better understanding of it than I did anyway, so perhaps you can ask him. I remember I spent a lot of time adapting Erick's program (see the link above), but his website is in Spanish, so it needs translation, and his code seems to have had errors. Notaii's comments above might also help, though he thinks Erick's code had problems.
Jim.

All right, thank you very much for the reminder. I am trying to understand how this program works. Thank you for your reply! :grinning: