new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Mar 11

SIRL: Similarity-based Implicit Representation Learning

When robots learn reward functions using high capacity models that take raw state directly as input, they need to both learn a representation for what matters in the task -- the task ``features" -- as well as how to combine these features into a single objective. If they try to do both at once from input designed to teach the full reward function, it is easy to end up with a representation that contains spurious correlations in the data, which fails to generalize to new settings. Instead, our ultimate goal is to enable robots to identify and isolate the causal features that people actually care about and use when they represent states and behavior. Our idea is that we can tune into this representation by asking users what behaviors they consider similar: behaviors will be similar if the features that matter are similar, even if low-level behavior is different; conversely, behaviors will be different if even one of the features that matter differs. This, in turn, is what enables the robot to disambiguate between what needs to go into the representation versus what is spurious, as well as what aspects of behavior can be compressed together versus not. The notion of learning representations based on similarity has a nice parallel in contrastive learning, a self-supervised representation learning technique that maps visually similar data points to similar embeddings, where similarity is defined by a designer through data augmentation heuristics. By contrast, in order to learn the representations that people use, so we can learn their preferences and objectives, we use their definition of similarity. In simulation as well as in a user study, we show that learning through such similarity queries leads to representations that, while far from perfect, are indeed more generalizable than self-supervised and task-input alternatives.

  • 5 authors
·
Jan 2, 2023

Tversky Neural Networks: Psychologically Plausible Deep Learning with Differentiable Tversky Similarity

Work in psychology has highlighted that the geometric model of similarity standard in deep learning is not psychologically plausible because its metric properties such as symmetry do not align with human perception. In contrast, Tversky (1977) proposed an axiomatic theory of similarity based on a representation of objects as sets of features, and their similarity as a function of common and distinctive features. However, this model has not been used in deep learning before, partly due to the challenge of incorporating discrete set operations. We develop a differentiable parameterization of Tversky's similarity that is learnable through gradient descent, and derive neural network building blocks such as the Tversky projection layer, which unlike the linear projection layer can model non-linear functions such as XOR. Through experiments with image recognition and language modeling, we show that the Tversky projection layer is a beneficial replacement for the linear projection layer, which employs geometric similarity. On the NABirds image classification task, a frozen ResNet-50 adapted with a Tversky projection layer achieves a 24.7% relative accuracy improvement over the linear layer adapter baseline. With Tversky projection layers, GPT-2's perplexity on PTB decreases by 7.5%, and its parameter count by 34.8%. Finally, we propose a unified interpretation of both projection layers as computing similarities of input stimuli to learned prototypes, for which we also propose a novel visualization technique highlighting the interpretability of Tversky projection layers. Our work offers a new paradigm for thinking about the similarity model implicit in deep learning, and designing networks that are interpretable under an established theory of psychological similarity.

  • 3 authors
·
May 20, 2025

SDSC:A Structure-Aware Metric for Semantic Signal Representation Learning

We propose the Signal Dice Similarity Coefficient (SDSC), a structure-aware metric function for time series self-supervised representation learning. Most Self-Supervised Learning (SSL) methods for signals commonly adopt distance-based objectives such as mean squared error (MSE), which are sensitive to amplitude, invariant to waveform polarity, and unbounded in scale. These properties hinder semantic alignment and reduce interpretability. SDSC addresses this by quantifying structural agreement between temporal signals based on the intersection of signed amplitudes, derived from the Dice Similarity Coefficient (DSC).Although SDSC is defined as a structure-aware metric, it can be used as a loss by subtracting from 1 and applying a differentiable approximation of the Heaviside function for gradient-based optimization. A hybrid loss formulation is also proposed to combine SDSC with MSE, improving stability and preserving amplitude where necessary. Experiments on forecasting and classification benchmarks demonstrate that SDSC-based pre-training achieves comparable or improved performance over MSE, particularly in in-domain and low-resource scenarios. The results suggest that structural fidelity in signal representations enhances the semantic representation quality, supporting the consideration of structure-aware metrics as viable alternatives to conventional distance-based methods.

  • 2 authors
·
Jul 19, 2025 1

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Nowadays, recognition-synthesis-based methods have been quite popular with voice conversion (VC). By introducing linguistics features with good disentangling characters extracted from an automatic speech recognition (ASR) model, the VC performance achieved considerable breakthroughs. Recently, self-supervised learning (SSL) methods trained with a large-scale unannotated speech corpus have been applied to downstream tasks focusing on the content information, which is suitable for VC tasks. However, a huge amount of speaker information in SSL representations degrades timbre similarity and the quality of converted speech significantly. To address this problem, we proposed a high-similarity any-to-one voice conversion method with the input of SSL representations. We incorporated adversarial training mechanisms in the synthesis module using external unannotated corpora. Two auxiliary discriminators were trained to distinguish whether a sequence of mel-spectrograms has been converted by the acoustic model and whether a sequence of content embeddings contains speaker information from external corpora. Experimental results show that our proposed method achieves comparable similarity and higher naturalness than the supervised method, which needs a huge amount of annotated corpora for training and is applicable to improve similarity for VC methods with other SSL representations as input.

  • 5 authors
·
May 16, 2023

Hyp-OW: Exploiting Hierarchical Structure Learning with Hyperbolic Distance Enhances Open World Object Detection

Open World Object Detection (OWOD) is a challenging and realistic task that extends beyond the scope of standard Object Detection task. It involves detecting both known and unknown objects while integrating learned knowledge for future tasks. However, the level of "unknownness" varies significantly depending on the context. For example, a tree is typically considered part of the background in a self-driving scene, but it may be significant in a household context. We argue that this contextual information should already be embedded within the known classes. In other words, there should be a semantic or latent structure relationship between the known and unknown items to be discovered. Motivated by this observation, we propose Hyp-OW, a method that learns and models hierarchical representation of known items through a SuperClass Regularizer. Leveraging this representation allows us to effectively detect unknown objects using a similarity distance-based relabeling module. Extensive experiments on benchmark datasets demonstrate the effectiveness of Hyp-OW, achieving improvement in both known and unknown detection (up to 6 percent). These findings are particularly pronounced in our newly designed benchmark, where a strong hierarchical structure exists between known and unknown objects. Our code can be found at https://github.com/tldoan/-HYP-OW-AAAI-2024-

  • 6 authors
·
Jun 25, 2023

Automatic Malware Description via Attribute Tagging and Similarity Embedding

With the rapid proliferation and increased sophistication of malicious software (malware), detection methods no longer rely only on manually generated signatures but have also incorporated more general approaches like machine learning detection. Although powerful for conviction of malicious artifacts, these methods do not produce any further information about the type of threat that has been detected neither allows for identifying relationships between malware samples. In this work, we address the information gap between machine learning and signature-based detection methods by learning a representation space for malware samples in which files with similar malicious behaviors appear close to each other. We do so by introducing a deep learning based tagging model trained to generate human-interpretable semantic descriptions of malicious software, which, at the same time provides potentially more useful and flexible information than malware family names. We show that the malware descriptions generated with the proposed approach correctly identify more than 95% of eleven possible tag descriptions for a given sample, at a deployable false positive rate of 1% per tag. Furthermore, we use the learned representation space to introduce a similarity index between malware files, and empirically demonstrate using dynamic traces from files' execution, that is not only more effective at identifying samples from the same families, but also 32 times smaller than those based on raw feature vectors.

  • 5 authors
·
May 15, 2019

Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning

The Vision Transformer architecture has shown to be competitive in the computer vision (CV) space where it has dethroned convolution-based networks in several benchmarks. Nevertheless, convolutional neural networks (CNN) remain the preferential architecture for the representation module in reinforcement learning. In this work, we study pretraining a Vision Transformer using several state-of-the-art self-supervised methods and assess the quality of the learned representations. To show the importance of the temporal dimension in this context we propose an extension of VICReg to better capture temporal relations between observations by adding a temporal order verification task. Our results show that all methods are effective in learning useful representations and avoiding representational collapse for observations from Atari Learning Environment (ALE) which leads to improvements in data efficiency when we evaluated in reinforcement learning (RL). Moreover, the encoder pretrained with the temporal order verification task shows the best results across all experiments, with richer representations, more focused attention maps and sparser representation vectors throughout the layers of the encoder, which shows the importance of exploring such similarity dimension. With this work, we hope to provide some insights into the representations learned by ViT during a self-supervised pretraining with observations from RL environments and which properties arise in the representations that lead to the best-performing agents. The source code will be available at: https://github.com/mgoulao/TOV-VICReg

  • 2 authors
·
Sep 22, 2022

CoDiEmb: A Collaborative yet Distinct Framework for Unified Representation Learning in Information Retrieval and Semantic Textual Similarity

Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter's deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.

  • 6 authors
·
Aug 15, 2025

ProtoECGNet: Case-Based Interpretable Deep Learning for Multi-Label ECG Classification with Contrastive Learning

Deep learning-based electrocardiogram (ECG) classification has shown impressive performance but clinical adoption has been slowed by the lack of transparent and faithful explanations. Post hoc methods such as saliency maps may fail to reflect a model's true decision process. Prototype-based reasoning offers a more transparent alternative by grounding decisions in similarity to learned representations of real ECG segments, enabling faithful, case-based explanations. We introduce ProtoECGNet, a prototype-based deep learning model for interpretable, multi-label ECG classification. ProtoECGNet employs a structured, multi-branch architecture that reflects clinical interpretation workflows: it integrates a 1D CNN with global prototypes for rhythm classification, a 2D CNN with time-localized prototypes for morphology-based reasoning, and a 2D CNN with global prototypes for diffuse abnormalities. Each branch is trained with a prototype loss designed for multi-label learning, combining clustering, separation, diversity, and a novel contrastive loss that encourages appropriate separation between prototypes of unrelated classes while allowing clustering for frequently co-occurring diagnoses. We evaluate ProtoECGNet on all 71 diagnostic labels from the PTB-XL dataset, demonstrating competitive performance relative to state-of-the-art black-box models while providing structured, case-based explanations. To assess prototype quality, we conduct a structured clinician review of the final model's projected prototypes, finding that they are rated as representative and clear. ProtoECGNet shows that prototype learning can be effectively scaled to complex, multi-label time-series classification, offering a practical path toward transparent and trustworthy deep learning models for clinical decision support.

  • 7 authors
·
Apr 11, 2025

CLIPC8: Face liveness detection algorithm based on image-text pairs and contrastive learning

Face recognition technology is widely used in the financial field, and various types of liveness attack behaviors need to be addressed. Existing liveness detection algorithms are trained on specific training datasets and tested on testing datasets, but their performance and robustness in transferring to unseen datasets are relatively poor. To tackle this issue, we propose a face liveness detection method based on image-text pairs and contrastive learning, dividing liveness attack problems in the financial field into eight categories and using text information to describe the images of these eight types of attacks. The text encoder and image encoder are used to extract feature vector representations for the classification description text and face images, respectively. By maximizing the similarity of positive samples and minimizing the similarity of negative samples, the model learns shared representations between images and texts. The proposed method is capable of effectively detecting specific liveness attack behaviors in certain scenarios, such as those occurring in dark environments or involving the tampering of ID card photos. Additionally, it is also effective in detecting traditional liveness attack methods, such as printing photo attacks and screen remake attacks. The zero-shot capabilities of face liveness detection on five public datasets, including NUAA, CASIA-FASD, Replay-Attack, OULU-NPU and MSU-MFSD also reaches the level of commercial algorithms. The detection capability of proposed algorithm was verified on 5 types of testing datasets, and the results show that the method outperformed commercial algorithms, and the detection rates reached 100% on multiple datasets. Demonstrating the effectiveness and robustness of introducing image-text pairs and contrastive learning into liveness detection tasks as proposed in this paper.

  • 5 authors
·
Nov 29, 2023

Self-supervised Learning of Echocardiographic Video Representations via Online Cluster Distillation

Self-supervised learning (SSL) has achieved major advances in natural images and video understanding, but challenges remain in domains like echocardiography (heart ultrasound) due to subtle anatomical structures, complex temporal dynamics, and the current lack of domain-specific pre-trained models. Existing SSL approaches such as contrastive, masked modeling, and clustering-based methods struggle with high intersample similarity, sensitivity to low PSNR inputs common in ultrasound, or aggressive augmentations that distort clinically relevant features. We present DISCOVR (Distilled Image Supervision for Cross Modal Video Representation), a self-supervised dual branch framework for cardiac ultrasound video representation learning. DISCOVR combines a clustering-based video encoder that models temporal dynamics with an online image encoder that extracts fine-grained spatial semantics. These branches are connected through a semantic cluster distillation loss that transfers anatomical knowledge from the evolving image encoder to the video encoder, enabling temporally coherent representations enriched with fine-grained semantic understanding. Evaluated on six echocardiography datasets spanning fetal, pediatric, and adult populations, DISCOVR outperforms both specialized video anomaly detection methods and state-of-the-art video-SSL baselines in zero-shot and linear probing setups, and achieves superior segmentation transfer.

  • 7 authors
·
Jun 13, 2025

Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding

Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data. In the realm of computer vision, pretrained vision transformers (ViTs) have played a pivotal role in advancing transfer learning. Nonetheless, the escalating cost of finetuning these large models has posed a challenge due to the explosion of model size. This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks, obviating the need for finetuning, with the intention of emulating human-like capabilities in generalisation and recognition of unseen objects. To this end, we propose an evaluation protocol for zero-shot segmentation based on a prompting patch. Given a point on the target object as a prompt, the algorithm calculates the similarity map between the selected patch and other patches, upon that, a simple thresholding is applied to segment the target. Another evaluation is intra-object and inter-object similarity to gauge discriminatory ability of SSP ViTs. Insights from zero-shot segmentation from prompting and discriminatory abilities of SSP led to the design of a simple SSP approach, termed MMC. This approaches combines Masked image modelling for encouraging similarity of local features, Momentum based self-distillation for transferring semantics from global to local features, and global Contrast for promoting semantics of global features, to enhance discriminative representations of SSP ViTs. Consequently, our proposed method significantly reduces the overlap of intra-object and inter-object similarities, thereby facilitating effective object segmentation within an image. Our experiments reveal that MMC delivers top-tier results in zero-shot semantic segmentation across various datasets.

  • 6 authors
·
Aug 22, 2023

L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT

The multilingual Sentence-BERT (SBERT) models map different languages to common representation space and are useful for cross-language similarity and mining tasks. We propose a simple yet effective approach to convert vanilla multilingual BERT models into multilingual sentence BERT models using synthetic corpus. We simply aggregate translated NLI or STS datasets of the low-resource target languages together and perform SBERT-like fine-tuning of the vanilla multilingual BERT model. We show that multilingual BERT models are inherent cross-lingual learners and this simple baseline fine-tuning approach without explicit cross-lingual training yields exceptional cross-lingual properties. We show the efficacy of our approach on 10 major Indic languages and also show the applicability of our approach to non-Indic languages German and French. Using this approach, we further present L3Cube-IndicSBERT, the first multilingual sentence representation model specifically for Indian languages Hindi, Marathi, Kannada, Telugu, Malayalam, Tamil, Gujarati, Odia, Bengali, and Punjabi. The IndicSBERT exhibits strong cross-lingual capabilities and performs significantly better than alternatives like LaBSE, LASER, and paraphrase-multilingual-mpnet-base-v2 on Indic cross-lingual and monolingual sentence similarity tasks. We also release monolingual SBERT models for each of the languages and show that IndicSBERT performs competitively with its monolingual counterparts. These models have been evaluated using embedding similarity scores and classification accuracy.

  • 5 authors
·
Apr 22, 2023

Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding

Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations - a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences, requires the propagation of all query-candidate sentence-pairs throughout a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE) - a model that is based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: Given a cross-attentive teacher model (e.g. a fine-tuned BERT), we train a sentence embedding based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMO variants and other sentence embedding methods, while accelerating computation of the query-candidate sentence-pairs similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.

  • 6 authors
·
Aug 14, 2019

Contrastive Learning and Mixture of Experts Enables Precise Vector Embeddings

The advancement of transformer neural networks has significantly elevated the capabilities of sentence similarity models, particularly in creating effective vector representations of natural language inputs. However, these models face notable challenges in domain-specific contexts, especially in highly specialized scientific sub-fields. Traditional methods often struggle in this regime, either overgeneralizing similarities within a niche or being overly sensitive to minor differences, resulting in inaccurate text classification and subpar vector representation. In an era where retrieval augmentation and search are increasingly crucial, precise and concise numerical representations are essential. In this paper, we target this issue by assembling niche datasets using co-citations as a similarity metric, focusing on biomedical domains. We employ two key strategies for fine-tuning state-of-the-art models: 1. Domain-specific Fine-Tuning, which tailors pretrained models to a single domain, and 2. Universal Applicability with Mixture of Experts (MoE), adapting pretrained models with enforced routing for multiple domains simultaneously. Our training approach emphasizes the use of abstracts for faster training, incorporating Multiple Negative Rankings loss for efficient contrastive learning. Notably, our MoE variants, equipped with N experts, achieve the efficacy of N individual models, heralding a new era of versatile, One-Size-Fits-All transformer networks for various tasks. This methodology marks significant advancements in scientific text classification metrics and holds promise for enhancing vector database search and compilation.

  • 4 authors
·
Jan 28, 2024

2D Matryoshka Sentence Embeddings

Common approaches rely on fixed-length embedding vectors from language models as sentence embeddings for downstream tasks such as semantic textual similarity (STS). Such methods are limited in their flexibility due to unknown computational constraints and budgets across various applications. Matryoshka Representation Learning (MRL) (Kusupati et al., 2022) encodes information at finer granularities, i.e., with lower embedding dimensions, to adaptively accommodate ad hoc tasks. Similar accuracy can be achieved with a smaller embedding size, leading to speedups in downstream tasks. Despite its improved efficiency, MRL still requires traversing all Transformer layers before obtaining the embedding, which remains the dominant factor in time and memory consumption. This prompts consideration of whether the fixed number of Transformer layers affects representation quality and whether using intermediate layers for sentence representation is feasible. In this paper, we introduce a novel sentence embedding model called Two-dimensional Matryoshka Sentence Embedding (2DMSE). It supports elastic settings for both embedding sizes and Transformer layers, offering greater flexibility and efficiency than MRL. We conduct extensive experiments on STS tasks and downstream applications. The experimental results demonstrate the effectiveness of our proposed model in dynamically supporting different embedding sizes and Transformer layers, allowing it to be highly adaptable to various scenarios.

  • 5 authors
·
Feb 22, 2024

DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entries

This paper presents a significant improvement on the previous conference paper known as DefSent. The prior study seeks to improve sentence embeddings of language models by projecting definition sentences into the vector space of dictionary entries. We discover that this approach is not fully explored due to the methodological limitation of using word embeddings of language models to represent dictionary entries. This leads to two hindrances. First, dictionary entries are constrained by the single-word vocabulary, and thus cannot be fully exploited. Second, semantic representations of language models are known to be anisotropic, but pre-processing word embeddings for DefSent is not allowed because its weight is frozen during training and tied to the prediction layer. In this paper, we propose a novel method to progressively build entry embeddings not subject to the limitations. As a result, definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries, so that sentence embeddings of noticeably better quality are attainable. We abbreviate our approach as DefSent+ (a plus version of DefSent), involving the following strengths: 1) the task performance on measuring sentence similarities is significantly improved compared to DefSent; 2) when DefSent+ is used to further train data-augmented models like SIMCSE, SNCSE, and SynCSE, state-of-the-art performance on measuring sentence similarities can be achieved among the approaches without using manually labeled datasets; 3) DefSent+ is also competitive in feature-based transfer for NLP downstream tasks.

  • 1 authors
·
May 25, 2024

A Massive Scale Semantic Similarity Dataset of Historical English

A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.

  • 2 authors
·
Jun 30, 2023

Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard

BEIR is a benchmark dataset for zero-shot evaluation of information retrieval models across 18 different domain/task combinations. In recent years, we have witnessed the growing popularity of a representation learning approach to building retrieval models, typically using pretrained transformers in a supervised setting. This naturally begs the question: How effective are these models when presented with queries and documents that differ from the training data? Examples include searching in different domains (e.g., medical or legal text) and with different types of queries (e.g., keywords vs. well-formed questions). While BEIR was designed to answer these questions, our work addresses two shortcomings that prevent the benchmark from achieving its full potential: First, the sophistication of modern neural methods and the complexity of current software infrastructure create barriers to entry for newcomers. To this end, we provide reproducible reference implementations that cover the two main classes of approaches: learned dense and sparse models. Second, there does not exist a single authoritative nexus for reporting the effectiveness of different models on BEIR, which has led to difficulty in comparing different methods. To remedy this, we present an official self-service BEIR leaderboard that provides fair and consistent comparisons of retrieval models. By addressing both shortcomings, our work facilitates future explorations in a range of interesting research questions that BEIR enables.

  • 6 authors
·
Jun 12, 2023

Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches

Text classification of unseen classes is a challenging Natural Language Processing task and is mainly attempted using two different types of approaches. Similarity-based approaches attempt to classify instances based on similarities between text document representations and class description representations. Zero-shot text classification approaches aim to generalize knowledge gained from a training task by assigning appropriate labels of unknown classes to text documents. Although existing studies have already investigated individual approaches to these categories, the experiments in literature do not provide a consistent comparison. This paper addresses this gap by conducting a systematic evaluation of different similarity-based and zero-shot approaches for text classification of unseen classes. Different state-of-the-art approaches are benchmarked on four text classification datasets, including a new dataset from the medical domain. Additionally, novel SimCSE and SBERT-based baselines are proposed, as other baselines used in existing work yield weak classification results and are easily outperformed. Finally, the novel similarity-based Lbl2TransformerVec approach is presented, which outperforms previous state-of-the-art approaches in unsupervised text classification. Our experiments show that similarity-based approaches significantly outperform zero-shot approaches in most cases. Additionally, using SimCSE or SBERT embeddings instead of simpler text representations increases similarity-based classification results even further.

  • 3 authors
·
Nov 29, 2022

Learning semantic sentence representations from visually grounded language without lexical knowledge

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state-of-the-art on two popular image-caption retrieval benchmark data sets: MSCOCO and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. Importantly, this result shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics. These findings demonstrate the importance of visual information in semantics.

  • 2 authors
·
Mar 27, 2019

Representation, Exploration and Recommendation of Music Playlists

Playlists have become a significant part of our listening experience because of the digital cloud-based services such as Spotify, Pandora, Apple Music. Owing to the meteoric rise in the usage of playlists, recommending playlists is crucial to music services today. Although there has been a lot of work done in playlist prediction, the area of playlist representation hasn't received that level of attention. Over the last few years, sequence-to-sequence models, especially in the field of natural language processing, have shown the effectiveness of learned embeddings in capturing the semantic characteristics of sequences. We can apply similar concepts to music to learn fixed length representations for playlists and use those representations for downstream tasks such as playlist discovery, browsing, and recommendation. In this work, we formulate the problem of learning a fixed-length playlist representation in an unsupervised manner, using Sequence-to-sequence (Seq2seq) models, interpreting playlists as sentences and songs as words. We compare our model with two other encoding architectures for baseline comparison. We evaluate our work using the suite of tasks commonly used for assessing sentence embeddings, along with a few additional tasks pertaining to music, and a recommendation task to study the traits captured by the playlist embeddings and their effectiveness for the purpose of music recommendation.

  • 3 authors
·
Jul 1, 2019

Vector representations of text data in deep learning

In this dissertation we report results of our research on dense distributed representations of text data. We propose two novel neural models for learning such representations. The first model learns representations at the document level, while the second model learns word-level representations. For document-level representations we propose Binary Paragraph Vector: a neural network models for learning binary representations of text documents, which can be used for fast document retrieval. We provide a thorough evaluation of these models and demonstrate that they outperform the seminal method in the field in the information retrieval task. We also report strong results in transfer learning settings, where our models are trained on a generic text corpus and then used to infer codes for documents from a domain-specific dataset. In contrast to previously proposed approaches, Binary Paragraph Vector models learn embeddings directly from raw text data. For word-level representations we propose Disambiguated Skip-gram: a neural network model for learning multi-sense word embeddings. Representations learned by this model can be used in downstream tasks, like part-of-speech tagging or identification of semantic relations. In the word sense induction task Disambiguated Skip-gram outperforms state-of-the-art models on three out of four benchmarks datasets. Our model has an elegant probabilistic interpretation. Furthermore, unlike previous models of this kind, it is differentiable with respect to all its parameters and can be trained with backpropagation. In addition to quantitative results, we present qualitative evaluation of Disambiguated Skip-gram, including two-dimensional visualisations of selected word-sense embeddings.

  • 1 authors
·
Jan 7, 2019

VacancySBERT: the approach for representation of titles and skills for semantic similarity search in the recruitment domain

The paper focuses on deep learning semantic search algorithms applied in the HR domain. The aim of the article is developing a novel approach to training a Siamese network to link the skills mentioned in the job ad with the title. It has been shown that the title normalization process can be based either on classification or similarity comparison approaches. While classification algorithms strive to classify a sample into predefined set of categories, similarity search algorithms take a more flexible approach, since they are designed to find samples that are similar to a given query sample, without requiring pre-defined classes and labels. In this article semantic similarity search to find candidates for title normalization has been used. A pre-trained language model has been adapted while teaching it to match titles and skills based on co-occurrence information. For the purpose of this research fifty billion title-descriptions pairs had been collected for training the model and thirty three thousand title-description-normalized title triplets, where normalized job title was picked up manually by job ad creator for testing purposes. As baselines FastText, BERT, SentenceBert and JobBert have been used. As a metric of the accuracy of the designed algorithm is Recall in top one, five and ten model's suggestions. It has been shown that the novel training objective lets it achieve significant improvement in comparison to other generic and specific text encoders. Two settings with treating titles as standalone strings, and with included skills as additional features during inference have been used and the results have been compared in this article. Improvements by 10% and 21.5% have been achieved using VacancySBERT and VacancySBERT (with skills) respectively. The benchmark has been developed as open-source to foster further research in the area.

  • 3 authors
·
Jul 31, 2023