We Have So Much In Common: Modeling Semantic Relational Set Abstractions In Videos
General Terms: Semantics: Semantics is the study of the relationships between words and how we draw meaning from them. For example, a child could be called a child, kid, boy, girl, son, or daughter. Semantic relations connect entities in a text. Semantic relations sit at the crossroads between knowledge and language and, together with entities, make up a good chunk of the meaning of a text.
To explore how groups of words are linked by several semantic relations, one can look at WordNet.
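A minimal sketch of such an exploration using NLTK's WordNet interface (run `nltk.download('wordnet')` once beforehand; the specific synset names below are illustrative):

```python
from nltk.corpus import wordnet as wn

# A synset groups words that share one meaning, e.g. "child" and "kid".
for syn in wn.synsets('child')[:3]:
    print(syn.name(), '->', syn.lemma_names())

# Hypernyms walk up to a more abstract concept.
child = wn.synset('child.n.01')
print(child.hypernyms())

# Path similarity scores how closely two concepts sit in the hierarchy.
boy = wn.synset('male_child.n.01')
print(child.path_similarity(boy))
```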
Gist: For humans, it is easy to identify how two events are related just by looking at them. We inherently decompose two events into general abstract meanings and see which of those abstractions are similar. For example, when we see (a) a human typing on a computer and (b) a GIF of a mouse typing on a calculator, we can relate the hand movements and the object being typed on. Both activities involve pushing buttons and seeing a result on a screen, so we recognize the similarity between them. This is a result of our underlying human decision-making ability; computers cannot do the same out of the box.
The paper proposes "an approach for learning semantic relational set abstractions on videos". "Semantic relational set abstractions" is best read in two parts, "semantic" and "relational set abstractions": if we can find the relational sets between different videos (e.g. similar scenes filmed at different view angles) and distill an abstract meaning from those sets, the proposed method can learn the semantics (meaning) of these abstractions. Once we can do that, it becomes easier to tell which videos are similar, which are different, and how they are similar or different.
More formally, as the paper puts it, this allows our model to perform cognitive tasks such as (a rough sketch follows the list):
- set abstraction (which general concept is in common among a set of videos?)
- set completion (which new video goes well with the set?), and
- odd one out detection (which video does not belong to the set?)
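To make these three tasks concrete, here is a minimal sketch of how they could be phrased on top of video embeddings. This is not the paper's implementation: `encode` is a hypothetical stand-in for a pretrained video encoder, and the mean embedding is a crude substitute for the abstraction the paper actually learns.

```python
import numpy as np

def encode(video):
    # Hypothetical stand-in for a pretrained video encoder (e.g. a 3D CNN);
    # here it returns a deterministic random vector so the sketch runs.
    rng = np.random.default_rng(abs(hash(video)) % (2 ** 32))
    return rng.standard_normal(128)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def set_abstraction(videos):
    # Crude abstraction: the mean embedding of the set's videos.
    return np.mean([encode(v) for v in videos], axis=0)

def set_completion(video_set, candidates):
    # Which new video goes well with the set? The candidate whose
    # embedding is closest to the set's abstraction.
    abstraction = set_abstraction(video_set)
    return max(candidates, key=lambda c: cosine(encode(c), abstraction))

def odd_one_out(video_set):
    # Which video does not belong? The one least similar to the
    # abstraction of the remaining videos.
    def fit(i):
        rest = video_set[:i] + video_set[i + 1:]
        return cosine(encode(video_set[i]), set_abstraction(rest))
    return min(range(len(video_set)), key=fit)
```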
Datasets used:
- K400 (Kinetics-400)
- Multi-Moments in Time