{"id":176542,"date":"2025-11-06T12:55:34","date_gmt":"2025-11-06T12:55:34","guid":{"rendered":"https:\/\/ktromedia.com\/?p=176542"},"modified":"2025-11-06T12:55:34","modified_gmt":"2025-11-06T12:55:34","slug":"7-advanced-feature-engineering-tricks-for-text-data-using-llm-embeddings","status":"publish","type":"post","link":"http:\/\/ktromedia.com\/?p=176542","title":{"rendered":"7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings"},"content":{"rendered":"<div id=\"\">\n<div style=\"width: 810px\" class=\"wp-caption aligncenter\">\n<p class=\"wp-caption-text\">7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings<br \/>Image by Editor<\/p>\n<\/div>\n<h2>Introduction<\/h2>\n<p>Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models\u2014such as those used in <strong><a href=\"https:\/\/scikit-learn.org\/\" target=\"_blank\">scikit-learn<\/a><\/strong>\u2014to improve downstream performance.<\/p>\n<p>This article presents seven advanced Python examples of feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings, thereby enhancing the accuracy and robustness of downstream machine learning models that rely on text, in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.<\/p>\n<p><strong>Common setup for all examples<\/strong><\/p>\n<p>Unless stated otherwise, the seven example tricks below make use of this common setup. 
We rely on <strong><a href=\"https:\/\/www.sbert.net\/\" target=\"_blank\">Sentence Transformers<\/a><\/strong> for embeddings and <strong><a href=\"https:\/\/scikit-learn.org\/\" target=\"_blank\">scikit-learn<\/a><\/strong> for modeling utilities.<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400a83246775161\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\">!pip install sentence-transformers scikit-learn -q&#13;\nfrom sentence_transformers import SentenceTransformer&#13;\nimport numpy as np&#13;\n&#13;\n# Load a lightweight LLM embedding model; builds 384-dimensional embeddings&#13;\nmodel = SentenceTransformer(&quot;all-MiniLM-L6-v2&quot;)<\/textarea><\/p>\n<\/div>\n<h2>1. Combining TF-IDF and Embedding Features<\/h2>\n<p>The first example shows how to jointly extract\u2014given a source text dataset like <code>fetch_20newsgroups<\/code>\u2014both TF-IDF and LLM-generated sentence-embedding features. 
We then combine these feature types to train a logistic regression model that classifies news texts based on the combined features, often boosting accuracy by capturing both lexical and semantic information. (For brevity, the accuracy below is measured on the training data; in practice, evaluate on a held-out test split.)<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400a8d462869789\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\">from sklearn.datasets import fetch_20newsgroups&#13;\nfrom sklearn.feature_extraction.text import TfidfVectorizer&#13;\nfrom sklearn.linear_model import LogisticRegression&#13;\nfrom sklearn.preprocessing import StandardScaler&#13;\n&#13;\n# Loading data&#13;\ndata = fetch_20newsgroups(subset=&quot;train&quot;, categories=['sci.space', 'rec.autos'])&#13;\ntexts, y = data.data[:500], data.target[:500]&#13;\n&#13;\n# Extracting features of two broad types&#13;\ntfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()&#13;\nemb = model.encode(texts, show_progress_bar=False)&#13;\n&#13;\n# Combining features and training ML model&#13;\nX = np.hstack([tfidf, StandardScaler().fit_transform(emb)])&#13;\nclf = LogisticRegression(max_iter=1000).fit(X, y)&#13;\nprint(&quot;Accuracy:&quot;, clf.score(X, y))<\/textarea><\/p>\n<\/div>\n<h2>2. Topic-Aware Embedding Clusters<\/h2>\n<p>This trick takes a few sample text sequences, generates embeddings using the preloaded language model, applies K-Means clustering on these embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example\u2019s cluster identifier (its \u201ctopic class\u201d) to build a new feature representation. 
It is a useful strategy for creating compact topic meta-features.<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400a91214879039\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\">from sklearn.cluster import KMeans&#13;\nfrom sklearn.preprocessing import OneHotEncoder&#13;\n&#13;\ntexts = [&quot;Tokyo Tower is a popular landmark.&quot;, &quot;Sushi is a traditional Japanese dish.&quot;,&#13;\n         &quot;Mount Fuji is a famous volcano in Japan.&quot;, &quot;Cherry blossoms bloom in the spring in Japan.&quot;]&#13;\n&#13;\nemb = model.encode(texts)&#13;\ntopics = KMeans(n_clusters=2, n_init=&quot;auto&quot;, random_state=42).fit_predict(emb)&#13;\ntopic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(-1, 1))&#13;\n&#13;\nX = np.hstack([emb, topic_ohe])&#13;\nprint(X.shape)<\/textarea><\/p>\n<\/div>\n<h2>3. Semantic Anchor Similarity Features<\/h2>\n<p>This simple strategy computes similarity to a small set of fixed \u201canchor\u201d (or reference) sentences used as compact semantic descriptors\u2014essentially, semantic landmarks. Each column in the similarity-feature matrix contains the similarity of the text to one anchor. The main value lies in allowing the model to learn relationships between the text\u2019s similarity to key concepts and a target variable\u2014useful for text classification models.<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400a94836933603\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\">from sklearn.metrics.pairwise import cosine_similarity&#13;\n&#13;\nanchors = [&quot;space mission&quot;, &quot;car performance&quot;, &quot;politics&quot;]&#13;\nanchor_emb = model.encode(anchors)&#13;\ntexts = [&quot;The rocket launch was successful.&quot;, &quot;The car handled well on the track.&quot;]&#13;\nemb = model.encode(texts)&#13;\n&#13;\nsim_features = cosine_similarity(emb, anchor_emb)&#13;\nprint(sim_features)<\/textarea><\/p>\n<\/div>\n<h2>4. Meta-Feature Stacking via Auxiliary Sentiment Classifier<\/h2>\n<p>For text associated with labels such as sentiments, the following feature-engineering technique adds extra value. A meta-feature is built as the prediction probability returned by an auxiliary classifier trained on the embeddings. 
This meta-feature is stacked with the original embeddings, resulting in an augmented feature set that can improve downstream performance by exposing potentially more discriminative information than raw embeddings alone. Note that in a real pipeline you should generate such meta-features with out-of-fold predictions to avoid target leakage, since the auxiliary classifier below also predicts on its own training rows.<\/p>\n<p>A slight additional setup is needed for this example:<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400a99361582656\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\n!pip install sentence-transformers scikit-learn -q&#13;<br \/>\n&#13;<br \/>\nfrom sentence_transformers import SentenceTransformer&#13;<br \/>\nfrom sklearn.model_selection import train_test_split&#13;<br \/>\nfrom sklearn.linear_model import LogisticRegression&#13;<br \/>\nfrom sklearn.preprocessing import StandardScaler  # Import StandardScaler&#13;<br \/>\nimport numpy as np&#13;<br \/>\n&#13;<br \/>\nembedder = SentenceTransformer(&quot;all-MiniLM-L6-v2&quot;)  # 384-dim&#13;<br \/>\n&#13;<br \/>\n# Small dataset containing texts and sentiment labels&#13;<br \/>\ntexts = [&quot;I love this!&quot;, &quot;This is terrible.&quot;, &quot;Amazing quality.&quot;, &quot;Not good at all.&quot;]&#13;<br \/>\ny = np.array([1, 0, 1, 0])&#13;<br \/>\n&#13;<br \/>\n# Obtain embeddings from the embedder LLM&#13;<br \/>\nemb = embedder.encode(texts, show_progress_bar=False)&#13;<br \/>\n&#13;<br \/>\n# Train an auxiliary classifier on embeddings&#13;<br \/>\nX_train, X_test, y_train, y_test = train_test_split(&#13;<br \/>\n    emb, y, 
test_size=0.5, random_state=42, stratify=y&#13;<br \/>\n)&#13;<br \/>\nmeta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)&#13;<br \/>\n&#13;<br \/>\n# Leverage the auxiliary model&#8217;s predicted probability as a meta-feature&#13;<br \/>\nmeta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1)  # Prob of positive class&#13;<br \/>\n&#13;<br \/>\n# Augment original embeddings with the meta-feature&#13;<br \/>\n# Do not forget to scale again for consistency&#13;<br \/>\nscaler = StandardScaler()&#13;<br \/>\nemb_scaled = scaler.fit_transform(emb)&#13;<br \/>\nX_aug = np.hstack([emb_scaled, meta_feature])  # Stack features together&#13;<br \/>\n&#13;<br \/>\nprint(&#8220;emb shape:&#8221;, emb.shape)&#13;<br \/>\nprint(&#8220;meta_feature shape:&#8221;, meta_feature.shape)&#13;<br \/>\nprint(&#8220;augmented shape:&#8221;, X_aug.shape)&#13;<br \/>\nprint(&#8220;meta clf accuracy on test slice:&#8221;, meta_clf.score(X_test, y_test))<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\" style=\"font-size: 12px !important; line-height: 15px !important;\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>30<\/p>\n<p>31<\/p>\n<p>32<\/p>\n<p>33<\/p>\n<p>34<\/p>\n<p>35<\/p>\n<p>36<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-o\">!<\/span><span 
class=\"crayon-e\">pip <\/span><span class=\"crayon-e\">install <\/span><span class=\"crayon-v\">sentence<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-e\">transformers <\/span><span class=\"crayon-v\">scikit<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-v\">learn<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-i\">q<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">sentence_transformers <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">model_selection <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">train_test_split<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">linear_model <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">LogisticRegression<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">preprocessing <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">StandardScaler<\/span><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-p\"># Import StandardScaler<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">np<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">embedder<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><span class=\"crayon-sy\">(<\/span><span 
class=\"crayon-s\">&#8220;all-MiniLM-L6-v2&#8221;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-p\"># 384-dim<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Small dataset containing texts and sentiment labels<\/span><\/p>\n<p><span class=\"crayon-v\">texts<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8220;I love this!&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;This is terrible.&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;Amazing quality.&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;Not good at all.&#8221;<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-v\">y<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-t\">array<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Obtain embeddings from the embedder LLM<\/span><\/p>\n<p><span class=\"crayon-v\">emb<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">embedder<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">encode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">texts<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">show_progress_bar<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">False<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Train an auxiliary classifier on embeddings<\/span><\/p>\n<p><span class=\"crayon-v\">X_train<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X_test<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y_train<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y_test<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">train_test_split<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">emb<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">test_size<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">42<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">stratify<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-i\">y<\/span><\/p>\n<p><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">meta_clf<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-e\">LogisticRegression<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">max_iter<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">1000<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X_train<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y_train<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Leverage the auxiliary model&#8217;s predicted probability as a meta-feature<\/span><\/p>\n<p><span class=\"crayon-v\">meta_feature<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">meta_clf<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">predict_proba<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">emb<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">reshape<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-p\"># Prob of positive class<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Augment original embeddings with the meta-feature<\/span><\/p>\n<p><span class=\"crayon-p\"># Do not forget to scale again for consistency<\/span><\/p>\n<p><span class=\"crayon-v\">scaler<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">StandardScaler<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">emb_scaled<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">scaler<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit_transform<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">emb<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">X_aug<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">hstack<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">emb_scaled<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">meta_feature<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-p\"># Stack features together<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;emb shape:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">emb<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;meta_feature shape:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">meta_feature<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span 
class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;augmented shape:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X_aug<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;meta clf accuracy on test slice:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">meta_clf<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">score<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">X_test<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">y_test<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<h2>5. Embedding Compression and Nonlinear Expansion<\/h2>\n<p>This strategy applies PCA dimensionality reduction to compress the raw embeddings built by the LLM and then polynomially expands these compressed embeddings. 
It may sound odd at first, but this can be an effective way to capture nonlinear structure while keeping the feature count manageable.<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400a9c088561428\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\n!pip install sentence-transformers scikit-learn -q&#13;<br \/>\n&#13;<br \/>\nfrom sentence_transformers import SentenceTransformer&#13;<br \/>\nfrom sklearn.decomposition import PCA&#13;<br \/>\nfrom sklearn.preprocessing import PolynomialFeatures&#13;<br \/>\nimport numpy as np&#13;<br \/>\n&#13;<br \/>\n# Loading a lightweight embedding language model&#13;<br \/>\nembedder = SentenceTransformer(&#8220;all-MiniLM-L6-v2&#8221;)&#13;<br \/>\n&#13;<br \/>\ntexts = [&#8220;The satellite was launched into orbit.&#8221;,&#13;<br \/>\n         &#8220;Cars require regular maintenance.&#8221;,&#13;<br \/>\n         &#8220;The telescope observed distant galaxies.&#8221;]&#13;<br \/>\n&#13;<br \/>\n# Obtaining embeddings&#13;<br \/>\nemb = embedder.encode(texts, show_progress_bar=False)&#13;<br \/>\n&#13;<br \/>\n# Compressing with PCA and enriching with polynomial features&#13;<br \/>\npca = PCA(n_components=2).fit_transform(emb)  # n_components must not exceed the number of texts&#13;<br \/>\npoly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pca)&#13;<br \/>\n&#13;<br \/>\nprint(&#8220;Original shape:&#8221;, emb.shape)&#13;<br \/>\nprint(&#8220;After PCA:&#8221;, pca.shape)&#13;<br 
\/>\nprint(&#8220;After polynomial expansion:&#8221;, poly.shape)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\" style=\"font-size: 12px !important; line-height: 15px !important;\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-o\">!<\/span><span class=\"crayon-e\">pip <\/span><span class=\"crayon-e\">install <\/span><span class=\"crayon-v\">sentence<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-e\">transformers <\/span><span class=\"crayon-v\">scikit<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-v\">learn<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-i\">q<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">sentence_transformers <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">decomposition <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">PCA<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">preprocessing 
<\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">PolynomialFeatures<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">np<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Loading a lightweight embedding language model<\/span><\/p>\n<p><span class=\"crayon-v\">embedder<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;all-MiniLM-L6-v2&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">texts<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8220;The satellite was launched into orbit.&#8221;<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-s\">&#8220;Cars require regular maintenance.&#8221;<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-s\">&#8220;The telescope observed distant galaxies.&#8221;<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Obtaining embeddings<\/span><\/p>\n<p><span class=\"crayon-v\">emb<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">embedder<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">encode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">texts<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">show_progress_bar<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">False<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Compressing with PCA and enriching with polynomial features<\/span><\/p>\n<p><span class=\"crayon-v\">pca<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">PCA<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_components<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit_transform<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">emb<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-p\"># n_components must not exceed the number of texts<\/span><\/p>\n<p><span class=\"crayon-v\">poly<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">PolynomialFeatures<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">degree<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">include_bias<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">False<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit_transform<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">pca<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;Original shape:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">emb<\/span><span 
class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;After PCA:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">pca<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;After polynomial expansion:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">poly<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<h2>6. Relational Learning with Pairwise Contrastive Features<\/h2>\n<p>The goal here is to build pairwise relational features from text embeddings. Features constructed in a contrastive fashion, such as element-wise differences and products, highlight where two texts agree and where they differ. 
This is particularly effective for predictive tasks that inherently involve comparing texts, such as paraphrase or duplicate detection.<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400aa3191068731\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\n!pip install sentence-transformers -q&#13;<br \/>\nfrom sentence_transformers import SentenceTransformer&#13;<br \/>\nimport numpy as np&#13;<br \/>\n&#13;<br \/>\n# Loading embedder&#13;<br \/>\nembedder = SentenceTransformer(&#8220;all-MiniLM-L6-v2&#8221;)&#13;<br \/>\n&#13;<br \/>\n# Example text pairs&#13;<br \/>\npairs = [&#13;<br \/>\n    (&#8220;The car is fast.&#8221;, &#8220;The vehicle moves quickly.&#8221;),&#13;<br \/>\n    (&#8220;The sky is blue.&#8221;, &#8220;Bananas are yellow.&#8221;)&#13;<br \/>\n]&#13;<br \/>\n&#13;<br \/>\n# Generating embeddings for both sides&#13;<br \/>\nemb1 = embedder.encode([p[0] for p in pairs], show_progress_bar=False)&#13;<br \/>\nemb2 = embedder.encode([p[1] for p in pairs], show_progress_bar=False)&#13;<br \/>\n&#13;<br \/>\n# Building contrastive features: absolute difference and element-wise product&#13;<br \/>\nX_pairs = np.hstack([np.abs(emb1 - emb2), emb1 * emb2])&#13;<br \/>\n&#13;<br \/>\nprint(&#8220;Pairwise feature shape:&#8221;, X_pairs.shape)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" 
data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\" style=\"font-size: 12px !important; line-height: 15px !important;\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-o\">!<\/span><span class=\"crayon-e\">pip <\/span><span class=\"crayon-e\">install <\/span><span class=\"crayon-v\">sentence<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-v\">transformers<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-i\">q<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">sentence_transformers <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">np<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Loading embedder<\/span><\/p>\n<p><span class=\"crayon-v\">embedder<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;all-MiniLM-L6-v2&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Example text pairs<\/span><\/p>\n<p><span class=\"crayon-v\">pairs<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> 
<\/span><span class=\"crayon-sy\">[<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;The car is fast.&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;The vehicle moves quickly.&#8221;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;The sky is blue.&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;Bananas are yellow.&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Generating embeddings for both sides<\/span><\/p>\n<p><span class=\"crayon-v\">emb1<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">embedder<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">encode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">p<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">p<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">pairs<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">show_progress_bar<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">False<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">emb2<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">embedder<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">encode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">p<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">p<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">pairs<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">show_progress_bar<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">False<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Building contrastive features: absolute difference and element-wise product<\/span><\/p>\n<p><span class=\"crayon-v\">X_pairs<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">hstack<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">abs<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">emb1<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">emb2<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e \">emb1 *<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">emb2<\/span><span class=\"crayon-sy\">]<\/span><span 
class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;Pairwise feature shape:&#8221;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">X_pairs<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<h2>7. Cross-Modal Fusion<\/h2>\n<p>The last trick combines LLM embeddings with simple linguistic or numeric features\u2014such as punctuation ratio or other domain-specific engineered features. It contributes to more holistic text-derived features by uniting semantic signals with handcrafted linguistic aspects. Here is an example that measures punctuation in the text.<\/p>\n<div id=\"urvanov-syntax-highlighter-69064ae400aa9733280654\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\n!pip install sentence-transformers -q&#13;<br \/>\nfrom sentence_transformers import SentenceTransformer&#13;<br \/>\nimport numpy as np, re&#13;<br \/>\n&#13;<br \/>\n# Loading embedder&#13;<br \/>\nembedder = SentenceTransformer(&#8220;all-MiniLM-L6-v2&#8221;)&#13;<br \/>\n&#13;<br \/>\ntexts = [&#8220;Mars mission 2024!&#8221;, &#8220;New electric car model launched.&#8221;]&#13;<br \/>\n&#13;<br \/>\n# Computing embeddings&#13;<br \/>\nemb = embedder.encode(texts, 
show_progress_bar=False)&#13;<br \/>\n&#13;<br \/>\n# Adding simple numeric text features&#13;<br \/>\nlengths = np.array([len(t.split()) for t in texts]).reshape(-1, 1)&#13;<br \/>\npunct_ratio = np.array([len(re.findall(r&#8221;[^\\w\\s]&#8221;, t)) \/ len(t) for t in texts]).reshape(-1, 1)&#13;<br \/>\n&#13;<br \/>\n# Combining all features&#13;<br \/>\nX = np.hstack([emb, lengths, punct_ratio])&#13;<br \/>\n&#13;<br \/>\nprint(&#8220;Final feature matrix shape:&#8221;, X.shape)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div class=\"urvanov-syntax-highlighter-nums-content\" style=\"font-size: 12px !important; line-height: 15px !important;\">\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-o\">!<\/span><span class=\"crayon-e\">pip <\/span><span class=\"crayon-e\">install <\/span><span class=\"crayon-v\">sentence<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-v\">transformers<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-i\">q<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">sentence_transformers <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">np<\/span><span 
class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">re<\/span><\/p>\n<p>\u00a0<\/p>\n<p># Loading embedder<\/p>\n<p>embedder = SentenceTransformer(\"all-MiniLM-L6-v2\")<\/p>\n<p>\u00a0<\/p>\n<p>texts = [\"Mars mission 2024!\", \"New electric car model launched.\"]<\/p>\n<p>\u00a0<\/p>\n<p># Computing embeddings<\/p>\n<p>emb = embedder.encode(texts, show_progress_bar=False)<\/p>\n<p>\u00a0<\/p>\n<p># Adding simple numeric text features<\/p>\n<p>lengths = np.array([len(t.split()) for t in texts]).reshape(-1, 1)<\/p>\n<p>punct_ratio = np.array([len(re.findall(r\"[^\\w\\s]\", t)) \/ len(t) for t in texts]).reshape(-1, 1)<\/p>\n<p>\u00a0<\/p>\n<p># Combining all features<\/p>\n<p>X = np.hstack([emb, lengths, punct_ratio])<\/p>\n<p>\u00a0<\/p>\n<p>print(\"Final feature matrix shape:\", X.shape)<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<h2>Wrapping Up<\/h2>\n<p>We explored 
seven advanced feature-engineering tricks that help extract more information from raw text, going beyond LLM-generated embeddings alone. These practical strategies can boost downstream machine learning models that take text as input by capturing complementary lexical, semantic, relational, and handcrafted signals.<\/p>\n<\/p><\/div>\n","protected":false},"author":1}