The problem is that solving this requires three coordinates: X, Y, and depth. Unfortunately, the camera only gives you two.
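To see why the depth is unrecoverable from a single image, here's a minimal pinhole-projection sketch (the focal length `f` is an illustrative value, not from any real camera): every 3D point along the same ray lands on the same pixel.

```python
def project(X, Y, Z, f=1000.0):
    """Project a 3D camera-space point onto the image plane (pinhole model)."""
    return (f * X / Z, f * Y / Z)

print(project(0.5, 0.2, 2.0))   # (250.0, 100.0)
print(project(1.0, 0.4, 4.0))   # (250.0, 100.0) -- same pixel, twice as far away
```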
If you can assume that the floor ahead of you is flat, then you can infer the depth of any pixel that lies on the floor (from the camera's height and tilt), and from that you may be able to calculate the actual size. But this assumption only holds with very well calibrated/mounted cameras in research-y, indoor environments.
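Here's a rough sketch of that flat-floor trick, assuming a camera at a known height `cam_height`, pitched down by `pitch`, with a known vertical FOV. All of these numbers are made up for illustration, and it only works for pixels that actually show floor:

```python
import math

def floor_depth(v, img_h=480, vfov=math.radians(45),
                cam_height=1.2, pitch=math.radians(10)):
    """Ground distance to the flat-floor point seen at image row v."""
    fy = (img_h / 2) / math.tan(vfov / 2)          # focal length in pixels
    ray_angle = math.atan((v - img_h / 2) / fy)    # angle below the optical axis
    angle_below_horizon = pitch + ray_angle
    if angle_below_horizon <= 0:
        return float('inf')                        # ray never hits the floor
    return cam_height / math.tan(angle_below_horizon)

print(floor_depth(400))  # depth of a floor pixel near the bottom of the frame
```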
What you can do is calculate the angle from the iris to the pixel. If you know the horizontal field of view, you can divide it by the width in pixels to get an angle per pixel; same thing for vertical. For very narrow fields of view, that linear relation may be sufficient; for wider fields of view, you need to correct for the tangent, because the pixel offset is proportional to the tangent of the ray's angle, not to the angle itself. Note that angle is not the same thing as size: the same angular extent covers a larger physical size the farther it is from the camera! (Hence, why depth is needed.)
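A quick sketch of both pixel-to-angle mappings, assuming an illustrative `hfov` and `img_w` (not real calibration data):

```python
import math

def pixel_angle(u, img_w=640, hfov=math.radians(60)):
    """Horizontal angle (radians) of the ray through pixel column u, exact."""
    fx = (img_w / 2) / math.tan(hfov / 2)   # focal length in pixels
    return math.atan((u - img_w / 2) / fx)  # tangent-corrected

def pixel_angle_linear(u, img_w=640, hfov=math.radians(60)):
    """Narrow-FOV approximation: constant angle per pixel."""
    return (u - img_w / 2) * (hfov / img_w)

# The linear approximation drifts away from the exact angle at these columns:
for u in (320, 480, 560):
    print(u, math.degrees(pixel_angle(u)), math.degrees(pixel_angle_linear(u)))
```

The wider the FOV, the worse the linear version gets, which is why the tangent correction matters for anything beyond a telephoto-like lens.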
In general, I highly recommend studying actual optics (as in, how cameras work), followed by 3D graphics projection math, before coming back to computer vision.