With the ever-increasing amount of research data, the question arises of where this data comes from and what it describes. This thesis provides an overview of data provenance and addresses problems caused by the rapid growth of data by examining existing standards and developing new methods for reconstructing provenance information from the data itself. These methods are applied to several use cases, with particular attention to data similarity and metadata extraction.