In features, "contains x" or words to that effect are short for "contains x to a degree sufficient for x to be recognizable as x to a human". This means, among other things, that unless explicitly stated otherwise, partial visibility or audibility may suffice.
The fact that a segment contains video of physical objects representing the feature target, such as photos, paintings, models, or toy versions of the target, will NOT be grounds for judging the feature to be true for the segment. By contrast, video of the target appearing within the video may be grounds for doing so.
If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall.
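The frame-to-shot rule above can be sketched as follows. This is only an illustration under stated assumptions: the function name `shot_has_feature` and the idea of a list of per-frame boolean judgments are hypothetical and are not part of any evaluation tooling described here.

```python
from typing import Iterable


def shot_has_feature(frame_judgments: Iterable[bool]) -> bool:
    """Return True if the feature is true for at least one frame of the shot.

    frame_judgments is a hypothetical sequence holding one boolean per
    frame (or frame sequence): True where the feature was judged present.
    """
    # One positive frame is enough to make the whole shot positive;
    # a shot with no positive frame is negative.
    return any(frame_judgments)
```

Under this sketch, a shot judged `[False, True, False]` frame by frame counts as containing the feature, while a shot judged `[False, False]` does not.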
When a feature expresses the need for x and y and ..., all of these (x and y and ...) must be perceivable simultaneously in one or more frames of a shot in order for the shot to be considered as meeting the need.
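The simultaneity requirement above can be sketched as follows. This is a minimal illustration, assuming each frame is represented by the hypothetical set of targets a human could recognize in it; the names `shot_meets_need`, `frames`, and `required` are introduced here for illustration only.

```python
from typing import Iterable, Set


def shot_meets_need(frames: Iterable[Set[str]], required: Set[str]) -> bool:
    """Return True if some single frame shows every required target at once.

    frames is a hypothetical per-frame list of recognizable targets;
    required is the set {x, y, ...} named by the feature.
    """
    # The targets must co-occur in one frame; seeing x in one frame and
    # y in a different frame of the same shot is not sufficient.
    return any(required <= visible for visible in frames)


# x and y appear in the same shot, but never in the same frame.
separate = [{"person"}, {"car"}]
# x and y appear together in the second frame.
together = [{"tree"}, {"person", "car", "tree"}]
```

Under these assumptions, `shot_meets_need(separate, {"person", "car"})` is false and `shot_meets_need(together, {"person", "car"})` is true.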