Modular framework for similarity-based dataset discovery using external knowledge
Authors
Year
2022
Published in
Data Technologies and Applications. 2022, 56(4), 506-535. ISSN 2514-9288.
Type
Article
Department
Abstract
Purpose
Semantic retrieval and discovery of datasets published as open data remains a challenging task. The datasets inherently originate in the globally distributed web jungle, lacking the luxury of centralized database administration, database schemes, shared attributes, vocabulary, structure and semantics. The existing dataset catalogs provide basic search functionality relying on keyword search in brief, incomplete or misleading textual metadata attached to the datasets. The search results are thus often insufficient. However, there exist many ways of improving the dataset discovery by employing content-based retrieval, machine learning tools, third-party (external) knowledge bases, countless feature extraction methods and description models and so forth.
Design/methodology/approach
In this paper, the authors propose a modular framework for rapid experimentation with methods for similarity-based dataset discovery. The framework consists of an extensible catalog of components prepared to form custom pipelines for dataset representation and discovery.
Findings
The study proposes several proof-of-concept pipelines including experimental evaluation, which showcase the usage of the framework.
Originality/value
To the best of the authors’ knowledge, there is no similar formal framework for experimentation with various similarity methods in the context of dataset discovery. The framework aims to establish a platform for reproducible and comparable research in the area of dataset discovery. The prototype implementation of the framework is available on GitHub.
Open dataset discovery using context-enhanced similarity search
Authors
Year
2022
Published in
Knowledge and Information Systems. 2022, 64(12), 3265-3291. ISSN 0219-1377.
Type
Article
Department
Abstract
Today, open data catalogs enable users to search for datasets with full-text queries in metadata records combined with simple faceted filtering. Using this combination, a user is able to discover a significant number of the datasets relevant to their search intent. However, there still remain relevant datasets that are hard to find because of the enormous sparsity of their metadata (e.g., several keywords). As an alternative, in this paper, we propose an approach to dataset discovery based on similarity search over metadata descriptions enhanced by various semantic contexts. In general, the semantic contexts enrich the dataset metadata in a way that enables the identification of additional datasets relevant to a query that could not be retrieved using just the keyword or full-text search. In an experimental evaluation, we show that context-enhanced similarity retrieval methods increase the findability of relevant datasets, thus improving the retrieval recall that is critical in dataset discovery scenarios. As a part of the evaluation, we created a catalog-like user interface for dataset discovery and recorded streams of user actions, which we used to create the ground truth. For the sake of reproducibility, we published the entire evaluation testbed.
Similarity vs. Relevance: From Simple Searches to Complex Discovery
Authors
Year
2021
Published in
Similarity Search and Applications. Springer, Cham, 2021. p. 104-117. ISSN 0302-9743. ISBN 978-3-030-89656-0.
Type
Conference paper
Department
Abstract
Similarity queries play a crucial role in content-based retrieval. The similarity function itself is regarded as a function of relevance between a query object and objects from the database; the most similar objects are understood as the most relevant. However, such an automatic adoption of similarity as relevance limits the applicability of similarity search in domains like entity discovery, where relevant objects are not supposed to be similar in the traditional meaning. In this paper, we propose a meta-model of data-transitive similarity operating on top of a particular similarity model and a database. This meta-model makes it possible to treat directly non-similar objects x, y as similar if there exists a chain of objects x, i_1, ..., i_n, y in which neighboring members are similar enough. Hence, this approach places similarity in the role of relevance, where objects do not need to be directly similar but still remain relevant to each other (transitively similar). The data-transitive similarity concept allows standard similarity-search methods (queries, joins, rankings, analytics) to be used in more complex tasks, like entity discovery, where relevant results are often complementary or orthogonal to the query, rather than directly similar. Moreover, we show that data-transitive similarity is inherently self-explainable and non-metric. We discuss the approach in the domain of open dataset discovery.
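The chain-based idea in this abstract can be illustrated with a small sketch. It is not the paper's meta-model: the fixed threshold, the breadth-first search, the hop limit, the Jaccard similarity and the toy dataset names are all assumptions made for illustration. Returning the chain itself hints at why such a result is self-explainable.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transitive_chain(x, y, database, sim, threshold, max_hops=3):
    """Return a chain [x, i_1, ..., i_n, y] in which every pair of
    neighbours has sim >= threshold, or None if no such chain with at
    most max_hops links exists. Objects must be hashable."""
    parents = {x: None}
    frontier = deque([(x, 0)])
    while frontier:
        current, hops = frontier.popleft()
        if current == y:
            chain = []
            while current is not None:
                chain.append(current)
                current = parents[current]
            return chain[::-1]
        if hops == max_hops:
            continue  # do not extend chains beyond the hop limit
        for candidate in list(database) + [y]:
            if candidate not in parents and sim(current, candidate) >= threshold:
                parents[candidate] = current
                frontier.append((candidate, hops + 1))
    return None


# Hypothetical toy catalog: dataset name -> keyword set.
KEYWORDS = {
    "air_quality": frozenset({"air", "pollution", "sensors"}),
    "traffic_sensors": frozenset({"sensors", "traffic", "pollution"}),
    "road_closures": frozenset({"traffic", "roads"}),
}


def keyword_sim(a, b):
    return jaccard(KEYWORDS[a], KEYWORDS[b])


chain = transitive_chain("air_quality", "road_closures", KEYWORDS, keyword_sim, 0.2)
print(chain)  # ['air_quality', 'traffic_sensors', 'road_closures']
```

Here "air_quality" and "road_closures" share no keywords, so they are not directly similar, yet they are connected through "traffic_sensors".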
Evaluation Framework for Search Methods Focused on Dataset Findability in Open Data Catalogs
Authors
Year
2020
Published in
Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services. New York: Association for Computing Machinery, 2020. p. 200-209. ISBN 978-1-4503-8922-8.
Type
Conference paper
Department
Abstract
Many institutions publish datasets as Open Data in catalogs; however, their retrieval remains a problematic issue due to the absence of dataset search benchmarking. We propose a framework for evaluating the findability of datasets, regardless of the retrieval models used. As task-agnostic labeling of datasets by ground truth turns out to be infeasible in the general domain of open data datasets, the proposed framework is based on the evaluation of entire retrieval scenarios that mimic complex retrieval tasks. In addition to the framework, we present a proof-of-concept specification and evaluation on several similarity-based retrieval models and several dataset discovery scenarios within a catalog, using our experimental evaluation tool. Instead of the traditional matching of a query with the metadata of all the datasets, in similarity-based retrieval the query is formulated using a set of datasets (query by example), and the datasets most similar to the query set are retrieved from the catalog as a result.
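The query-by-example retrieval described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the Jaccard similarity over keyword sets, the choice to score each dataset by its best match against any example, and the toy catalog are hypothetical and not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def query_by_example(examples, catalog, k=2):
    """Rank catalog datasets (name -> keyword set) by their best
    similarity to any dataset in the non-empty example set, and return
    the names of the top k. The examples themselves are excluded."""
    scored = []
    for name, keywords in catalog.items():
        if name in examples:
            continue
        score = max(jaccard(keywords, catalog[e]) for e in examples)
        scored.append((score, name))
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical toy catalog.
CATALOG = {
    "budget_2019": {"budget", "city", "finance"},
    "budget_2020": {"budget", "city", "finance", "covid"},
    "spending": {"finance", "city", "contracts"},
    "bus_stops": {"transport", "city", "stops"},
    "tree_register": {"trees", "parks"},
}

print(query_by_example({"budget_2019"}, CATALOG))  # ['budget_2020', 'spending']
```

Taking the maximum over the examples favours datasets close to at least one example; an average would instead favour datasets close to the whole query set.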
LinkedPipes ETL: Evolved Linked Data Preparation
Authors
Klímek, J.; Škoda, P.; Nečaský, M.
Year
2016
Published in
The Semantic Web: ESWC 2016 Satellite Events. Cham: Springer International Publishing AG, 2016. p. 95-100. 9989. ISSN 0302-9743. ISBN 978-3-319-47601-8.
Type
Conference paper
Department
Abstract
As Linked Data gains traction, proper support for its publication and consumption is more important than ever. Even though there is a multitude of tools for the preparation of Linked Data, they are still either quite limited, difficult to use or not compliant with recent W3C Recommendations. In this demonstration paper, we present LinkedPipes ETL, a lightweight Linked Data preparation tool. It focuses mainly on a smooth user experience, including on mobile devices, ease of integration based on full API coverage, and universal usage thanks to its library of components. We build on the experience gained through the development and use of UnifiedViews, our previous Linked Data ETL tool, and present four use cases in which our new tool excels in comparison.
Modeling fiscal data with the Data Cube Vocabulary
Authors
Mynarz, J.; Svátek, V.; Karampatakis, S.; Klímek, J.; Bratsas, C.
Year
2016
Published in
Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS'16) co-located with the 12th International Conference on Semantic Systems (SEMANTiCS 2016). Aachen: CEUR Workshop Proceedings, 2016. 1695. ISSN 1613-0073.
Type
Conference paper
Department
Abstract
We present a fiscal data model based on the Data Cube Vocabulary, which we developed for the OpenBudgets.eu project. The model defines component properties out of which data structure definitions for concrete datasets can be composed. Based on initial usage experiments, simple validation constraints have been formulated.
Reusable transformations of Data Cube Vocabulary datasets from the fiscal domain
Authors
Mynarz, J.; Klímek, J.; Dudáš, M.; Škoda, P.; Engels, C.; Musyaffa, F.A.; Svátek, V.
Year
2016
Published in
Proceedings of the 4th International Workshop on Semantic Statistics co-located with 15th International Semantic Web Conference (ISWC 2016). Aachen: CEUR Workshop Proceedings, 2016. vol. 1654. ISSN 1613-0073.
Type
Conference paper
Department
Abstract
Shared data models provide leverage for reusable data transformations. Common modelling patterns and data structures can make data transformations applicable to diverse datasets. Similarly to data models, reusable data transformations promote separation of concerns, prevent duplication of effort, and reduce the time spent processing data. However, unlike data models, which can be shared as RDF vocabularies or ontologies, there is no well-established way of sharing data transformations. We propose a way to share data transformations as ‘pipeline fragments’ for LinkedPipes ETL (LP-ETL), which is an RDF-based data processing tool focused on RDF data. We describe the features of LP-ETL that enable development of reusable transformations as pipeline fragments. Pipeline fragments are represented in RDF as JSON-LD files that can be shared directly or via dereferenceable IRIs. We demonstrate the use of pipeline fragments on data transformations for fiscal data described by the Data Cube Vocabulary (DCV). We cover both generic transformations for any DCV-compliant data, such as DCV validation or DCV to CSV conversion, and transformations specific for the fiscal data used in the OpenBudgets.eu (OBEU) project, including conversion of Fiscal Data Package to RDF or normalization of monetary values. The applicability of these transformations is shown on concrete use cases serving the goals of the OBEU project.
Procurement notice enrichment using product ontologies
Authors
Svátek, V.; Kompuš, P.; Dudáš, M.; Nečaský, M.; Klímek, J.
Year
2015
Published in
Proceedings of the 11th International Conference on Semantic Systems. New York: ACM, 2015. p. 200-203. ISBN 978-1-4503-3462-4.
Type
Conference paper
Department
Abstract
Linked data resources supporting the matchmaking of supply and demand on the procurement market are so far limited. A precise match could be obtained by enriching the procurement notices with detailed types and parameters of the product/service that are explicitly modeled in 'vertical' ontologies for the e-commerce field, in particular in the OPDM project associated with the GoodRelations initiative. We showcase a web-based prototype that allows the contracting authority to (1) fetch a product ontology from the OPDM repository, (2) create forms using relevant concepts from the ontology, and (3) annotate a procurement notice via the form corresponding to the demanded product.
Use Cases for Linked Data Visualization Model
Authors
Klímek, J.; Helmich, J.; Nečaský, M.
Year
2015
Published in
Proceedings of the Workshop on Linked Data on the Web co-located with the 24th International World Wide Web Conference (WWW 2015). Aachen: CEUR Workshop Proceedings, 2015. ISSN 1613-0073.
Type
Conference paper
Department
Abstract
There is a vast amount of Linked Data on the web spread across a large number of datasets. One of the visions behind Linked Data is that the published data is conveniently reusable by others. This, however, depends on many details such as conformance of the data with commonly used vocabularies and adherence to best practices for data modeling. Therefore, when experts want to reuse existing datasets, they still need to analyze them to discover how the data is modeled and what it actually contains. This may include analysis of what entities are there, how they are linked to other entities, which properties from which vocabularies are used, etc. What is missing is a convenient and fast way of seeing what could be usable in the chosen unknown dataset without reading through its RDF serialization. In this paper we describe use cases based on this problem and their realization using our Linked Data Visualization Model (LDVM) and its new implementation. LDVM is a formal base that exploits the Linked Data principles to ensure interoperability and compatibility of compliant analytic and visualization components. We demonstrate the use cases on examples from the Czech Linked Open Data cloud.
Vocabulary for Linked Data Visualization Model
Authors
Klímek, J.; Helmich, J.
Year
2015
Published in
DATESO 2015. Praha: MATFYZPRESS, vydavatelství Matematicko-fyzikální fakulty UK, 2015. p. 28-39. CEUR Workshop Proceedings. ISSN 1613-0073.
Type
Conference paper
Department
Abstract
There is already a vast amount of Linked Data on the web. What is missing is a convenient way of analyzing and visualizing the data that would benefit from the Linked Data principles. In our previous work we introduced the Linked Data Visualization Model (LDVM). It is a formal base that exploits the principles to ensure interoperability and compatibility of compliant components. In this paper we introduce a vocabulary for description of the components and an analytic and visualization pipeline composed of them. We demonstrate its viability on an example from the Czech Linked Open Data cloud.
Application of the Linked Data Visualization Model on Real World Data from the Czech LOD Cloud
Authors
Klímek, J.; Helmich, J.; Nečaský, M.
Year
2014
Published in
Workshop on Linked Data on the Web, LDOW 2014 - Co-located with the 23rd International World Wide Web Conference, WWW 2014. Aachen: CEUR Workshop Proceedings, 2014. CEUR Workshop Proceedings. ISSN 1613-0073.
Type
Conference paper
Department
Abstract
In recent years, the Linked Open Data phenomenon has gained substantial traction. This has led to a vast amount of data being available on the Web in what is known as the LOD cloud. While the potential of this linked data space is huge, it has so far failed to reach non-expert users. At the same time, there is an even larger amount of data that is not yet open, often because its owners are not convinced of its usefulness. In this paper we refine our Linked Data Visualization Model (LDVM) and show its application via its implementation Payola. On a real-world scenario built on real-world Linked Open Data created from Czech open data sources, we show how end-user friendly visualizations can be easily achieved. Our first goal is to show that, using Payola, existing Linked Open Data can be easily mashed up and visualized using an extensible library of analyzers, transformers and visualizers. Our second goal is to give potential publishers of (Linked) Open Data proof that simply publishing their data in the right way can bring them powerful visualizations at virtually no additional cost.
Linked Data in Enterprise Integration
Authors
Auer, S.; Ngomo, A.-C. N.; Frischmuth, P.; Klímek, J.
Year
2014
Published in
Big Data Computing. Boca Raton: CRC Press, 2014. p. 169-203. ISBN 978-1-4665-7837-1.
Type
Book chapter
Department
Abstract
Due to market forces and technological evolution, Big Data computing is developing at an increasing rate. A wide variety of novel approaches and tools have emerged to tackle the challenges of Big Data, creating both more opportunities and more challenges for students and professionals in the field of data computation and analysis.
Presenting a mix of industry cases and theory, Big Data Computing discusses the technical and practical issues related to Big Data in intelligent information management. Emphasizing the adoption and diffusion of Big Data tools and technologies in industry, the book introduces a broad range of Big Data concepts, tools, and techniques. It covers a wide range of research, and provides comparisons between state-of-the-art approaches.
Linked data support for filing public contracts
Authors
Nečaský, M.; Klímek, J.; Mynarz, J.; Knap, T.; Svátek, V.; Stárka, J.
Year
2014
Published in
Computers in Industry. 2014, 65(5), 862-877. ISSN 0166-3615.
Type
Article
Department
Abstract
Management of the tendering phase of the public contract lifecycle is a demanding activity with often irrevocable impact on the subsequent realization phase. We investigate the impact of the linked data technology on this process.
The public contract information itself can be published as linked data. A specialized vocabulary, the Public Contracts Ontology, was designed for this purpose. Extractors and transformers for public contract datasets in various formats (HTML, CSV, XML) were developed to enable conversion into RDF format corresponding to the vocabulary.
Moreover, an application for filing public contracts was implemented. It enables a contracting authority to manage RDF data about itself and its contracts, suppliers to the contracts, to-be-contracted products and services, and actual tenders proposed by bidders. It also provides matchmaking services for finding similar contracts and suitable suppliers for a given call for tenders based on their history, which is a useful feature for contracting authorities.
Translation of Structural Constraints from Conceptual Model for XML to Schematron
Authors
Klímek, J.; Benda, S.; Nečaský, M.
Year
2014
Published in
Journal of Universal Computer Science. 2014, 20(3), 277-301. ISSN 0948-695X.
Type
Article
Department
Abstract
Today, XML (eXtensible Markup Language) is a standard for exchange inside and among IT infrastructures. For the exchange to work, an XML format must be negotiated between the communicating parties. The format is often expressed as an XML schema. In our previous work, we introduced a conceptual model for XML, which utilizes modeling, evolution and maintenance of a set of XML schemas and allows schema designers to export modeled formats into grammar-based XML schema languages like DTD and XML Schema. However, there is another type of XML schema language, called rule-based languages, with Schematron as their main representative. In our preceding conference paper [Benda et al. (2013)] we briefly introduced the process of translation from our conceptual model to Schematron. Expressing XML schemas in Schematron has advantages over grammar-based languages, and in this paper we describe the previously introduced translation in more detail, with a focus on structural constraints and how they are represented in Schematron. We also discuss the possibilities and limitations of translation from our grammar-based conceptual model to the rule-based Schematron.
Visualizing RDF Data Cubes Using the Linked Data Visualization Model
Authors
Helmich, J.; Klímek, J.; Nečaský, M.
Year
2014
Published in
The Semantic Web: ESWC 2014 Satellite Events. Cham: Springer International Publishing AG, 2014. p. 368-373. Lecture Notes in Computer Science. ISSN 0302-9743. ISBN 978-3-319-11954-0.
Type
Conference paper
Department
Abstract
The Data Cube represents one of the basic means for storing, processing and analyzing statistical data. Recently, the RDF Data Cube Vocabulary became a W3C Recommendation, and at the same time interesting datasets using it started to appear. Along with them came the need for compatible visualization tools. The Linked Data Visualization Model (LDVM) is a formalism focused on this area and is implemented by Payola, a framework for analysis and visualization of Linked Data. In this paper, we present the capabilities of LDVM and Payola to visualize RDF Data Cubes as well as other statistical datasets not yet compatible with the Data Cube Vocabulary. We also compare our approach to CubeViz, which is a visualization tool specialized in RDF Data Cube visualizations.
Formal Linked Data Visualization Model
Authors
Brunetti, J. M.; Auer, S.; García, R.; Klímek, J.; Nečaský, M.
Year
2013
Published in
IIWAS '13: Proceedings of the 15th International Conference on Information Integration and Web-based Applications & Services. New York: ACM, 2013. p. 309-318. ISBN 978-1-4503-2113-6.
Type
Conference paper
Department
Abstract
Recently, the amount of semantic data available on the Web has increased dramatically. The potential of this vast amount of data is enormous, but in most cases it is difficult for users to explore and use this data, especially for those without experience with Semantic Web technologies. Applying information visualization techniques to the Semantic Web helps users to easily explore large amounts of data and interact with them. In this article we devise a formal Linked Data Visualization Model (LDVM), which allows data to be dynamically connected with visualizations. We report on our implementation of the LDVM, comprising a library of generic visualizations that enable both users and data analysts to get an overview of, visualize and explore the Data Web and perform detailed analyses of Linked Data.
Linked Open Data for Healthcare Professionals
Authors
Kozák, J.; Nečaský, M.; Dědek, J.; Klímek, J.; Pokorný, J.
Year
2013
Published in
IIWAS '13: Proceedings of the 15th International Conference on Information Integration and Web-based Applications & Services. New York: ACM, 2013. p. 400-409. ISBN 978-1-4503-2113-6.
Type
Conference paper
Department
Abstract
Physicians are overwhelmed by the number of different drugs and by the amount of information they need to know about each of them. Keeping up is, however, almost impossible in the fast-evolving pharmaceutical industry. Although many data sources about drugs are published on the Web, structured or unstructured, it is very time-consuming to search through them. In this paper we identify these data sources according to the information needs of physicians. We show that they can be relatively easily integrated using the Linked Data principles and, in the case of unstructured data, NLP methods. An application on top of the integrated data sets is presented as a possible tool for clinical decision support.