Q function (learning to optimize when the model is not available)

If a model of the environment is not available, the agent has to learn about the environment and an optimal policy by trial and error. In that case the agent uses a Q function, which is defined as follows:

The Q function maps each pair of a state s and an action a to a real number, Q : S x A -> R, where S is the set of states and A is the set of actions. The value Q(s, a) denotes the expected total reward the agent receives if it selects action a in state s.
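
To make the mapping concrete, here is a minimal Python sketch of a Q function stored as a table, with one trial-and-error update. The text above does not say how Q is learned, so the Q-learning update rule used here is an assumption chosen for illustration. The names greedy_action and q_learning_update, the step size alpha, the discount factor gamma, and the two-action toy problem are all hypothetical.

```python
from collections import defaultdict

# Hypothetical tabular Q function: (state, action) -> estimated expected total reward.
# Pairs that have never been tried default to 0.0.
Q = defaultdict(float)


def greedy_action(Q, state, actions):
    """Return the action with the highest current Q value in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])


def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update (an assumed rule, not taken from the text).

    alpha is the step size and gamma the discount factor; both are
    illustrative values.
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = reward + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


actions = ["left", "right"]

# One step of experience: in state 0 the agent tried "right",
# received reward 1.0, and ended up in state 1.
q_learning_update(Q, s=0, a="right", reward=1.0, s_next=1, actions=actions)

# The agent can now read its preferred action off the table.
best = greedy_action(Q, 0, actions)
```

After this single update Q[(0, "right")] is 0.5 and the greedy action in state 0 is "right". A dictionary is enough here because the toy problem has only a handful of state-action pairs; larger problems typically replace the table with a function approximator.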