In our work, we have found the user interfaces of the Vive, Oculus, and PSVR controllers to be the most reliable, the easiest to train users on, and the best suited for fine-tuned gestures and manipulation.
The HoloLens natural input gestures can take a few minutes to learn. The voice recognition, while very reliable even in noisy conditions, can be tricky because of the nuances of language and the wide range of synonyms that exist: Grab/Pinch/Hold/Grasp/Clutch/Seize/Take can all mean the same thing to different users, as can Drop/Place/Release/Let Go/Place Here. Keeping the vocabulary to a minimum while building in as much redundancy as possible is the best design choice.
For Cardboard VR (though the technique can be used in all the other systems), the most common user interface is simple gaze select: the user puts the cross-hair on an icon and holds it there for several seconds. Typically, an animated circle fills to signify that the action has been selected. Gaze select dominates in large part because users need to hold the cardboard devices to their faces with both hands. A one-button tap or finger press can also be used, but it can be unreliable given the rubber-band/magnet/cardboard construction.
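The gaze-select interaction above amounts to a dwell timer that drives the animated fill circle. The following is a minimal sketch under stated assumptions: the host app can poll each frame whether the cross-hair is over a target, and the two-second dwell threshold and the class interface are illustrative choices, not part of any particular SDK.

```python
class DwellSelector:
    """Tracks how long the gaze has rested on a target icon.

    update() returns the fill fraction (0.0-1.0) for the animated
    circle; a return value of 1.0 means the selection has fired.
    """

    def __init__(self, dwell_seconds: float = 2.0):
        self.dwell_seconds = dwell_seconds  # assumed threshold, tunable
        self._gaze_start = None  # timestamp the current gaze began, or None

    def update(self, gazing_at_target: bool, now: float) -> float:
        if not gazing_at_target:
            self._gaze_start = None  # gaze left the icon: reset the circle
            return 0.0
        if self._gaze_start is None:
            self._gaze_start = now  # gaze just arrived: start the timer
        elapsed = now - self._gaze_start
        return min(elapsed / self.dwell_seconds, 1.0)

# Example frame-by-frame usage (timestamps in seconds):
selector = DwellSelector(dwell_seconds=2.0)
print(selector.update(True, 0.0))   # → 0.0  (circle starts empty)
print(selector.update(True, 1.0))   # → 0.5  (half full)
print(selector.update(True, 2.5))   # → 1.0  (selection fires)
print(selector.update(False, 3.0))  # → 0.0  (gaze moved away, reset)
```

Resetting the timer the moment the gaze leaves the icon is the key design choice: it prevents accidental selections when the user is merely sweeping the cross-hair across the scene.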
In all cases, user fatigue, eye strain, and motion sickness should be included among the design considerations.
Collaboration: While all of the systems allow multiple users to interact with the same computer-generated imagery, either in the same physical room or remotely, AR, and specifically the HoloLens, lets each user see both the CGI and the other users' faces. This can provide important non-verbal cues.