{"id":180412,"date":"2026-06-29T18:32:06","date_gmt":"2026-06-29T18:32:06","guid":{"rendered":"https:\/\/ktromedia.com\/?p=180412"},"modified":"2026-06-29T18:32:06","modified_gmt":"2026-06-29T18:32:06","slug":"clustering-unstructured-text-with-llm-embeddings-and-hdbscan","status":"publish","type":"post","link":"https:\/\/ktromedia.com\/?p=180412","title":{"rendered":"Clustering Unstructured Text with LLM Embeddings and HDBSCAN"},"content":{"rendered":"<div id=\"\">\n<p>In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data.<\/p>\n<p>Topics we will cover include:<\/p>\n<ul>\n<li>How to generate text embeddings for raw documents using a pre-trained sentence-transformers model.<\/li>\n<li>How to reduce the dimensionality of those embeddings with UMAP to prepare them for clustering.<\/li>\n<li>How to apply HDBSCAN to automatically discover topic clusters and visualize the results.<\/li>\n<\/ul>\n<div style=\"width: 810px\" class=\"wp-caption aligncenter\"><\/p>\n<p class=\"wp-caption-text\">Clustering Unstructured Text with LLM Embeddings and HDBSCAN<\/p>\n<\/div>\n<h2>Introduction<\/h2>\n<p>The current era of <strong>Generative AI<\/strong> seems to primarily focus on chat interfaces and prompts, but the range of applications of <strong>large language models<\/strong>, or LLMs for short, is not limited to just that. Indeed, one of their most powerful downstream abilities consists of turning raw, messy, unstructured text into semantically rich mathematical representations called <strong>embeddings<\/strong>. 
Once that\u2019s done, we can use these text representations for a variety of machine learning tasks, clustering included.<\/p>\n<p>In particular, embeddings can be combined with density-based <strong>clustering techniques<\/strong> like <strong>HDBSCAN<\/strong> to discover hidden topics, patterns, or categories in a collection of text documents, all without the need for prior labeling.<\/p>\n<p>This article shows how to construct a text clustering pipeline from scratch. We will use a freely available text dataset and an open-source model trained to generate embeddings, i.e. a so-called embedding model. The icing on the cake: every step relies on free, modern Python libraries, including a ready-made implementation of HDBSCAN.<\/p>\n<h2>Step-by-Step Walkthrough<\/h2>\n<p>First, let\u2019s install the key Python libraries we will need:<\/p>\n<ul>\n<li><strong>Sentence transformers<\/strong>, to load a pre-trained embedding model from Hugging Face. The model used here is public, so it downloads without credentials; a Hugging Face <a href=\"https:\/\/huggingface.co\/docs\/hub\/security-tokens\" target=\"_blank\">access token<\/a> is only required for gated or private models.<\/li>\n<li><strong>Umap-learn<\/strong>, to reduce the dimensionality of the embeddings.<\/li>\n<\/ul>\n<p>Likewise, if you are working in a local IDE instead of a cloud notebook environment and don\u2019t have <strong>scikit-learn<\/strong> and <strong>pandas<\/strong>, you may need to install them too. Note that the <code>HDBSCAN<\/code> class used later requires scikit-learn 1.3 or newer.<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcf8c255528643\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px 
">
!important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\n!pip install sentence-transformers umap-learn <\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-o\">!<\/span><span class=\"crayon-e\">pip <\/span><span class=\"crayon-e\">install <\/span><span class=\"crayon-v\">sentence<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-e\">transformers <\/span><span class=\"crayon-v\">umap<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">learn<\/span><span class=\"crayon-h\"> <\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Now we move on to the code, starting with the data. The <code>fetch_20newsgroups<\/code> function from scikit-learn, which downloads a collection of newsgroup posts labeled by topic, will do. Note that even though the dataset contains labels, we will not use them for clustering: we pretend not to know this information and group the documents purely by similarity. 
Also, we sample down the dataset to 150 instances, which will be representative enough for our example.<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcf9e187033994\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nimport pandas as pd&#13;<br \/>\nfrom sklearn.datasets import fetch_20newsgroups&#13;<br \/>\n&#13;<br \/>\n# Fetching a highly targeted subset of data (~150-200 docs)&#13;<br \/>\ncategories = [&#8216;sci.space&#8217;, &#8216;sci.med&#8217;, &#8216;rec.autos&#8217;]&#13;<br \/>\nnewsgroups = fetch_20newsgroups(subset=&#8221;train&#8221;, categories=categories, remove=(&#8216;headers&#8217;, &#8216;footers&#8217;, &#8216;quotes&#8217;))&#13;<br \/>\n&#13;<br \/>\n# Sampling down into a representative, illustrative subset&#13;<br \/>\ndf = pd.DataFrame({&#8216;text&#8217;: newsgroups.data, &#8216;true_label&#8217;: newsgroups.target})&#13;<br \/>\ndf = df[df[&#8216;text&#8217;].str.strip().str.len() &gt; 100].sample(150, random_state=42).reset_index(drop=True)&#13;<br \/>\n&#13;<br \/>\nprint(f&#8221;Loaded {len(df)} text documents.&#8221;)&#13;<br \/>\nprint(&#8220;\\nSample document:&#8221;)&#13;<br \/>\nprint(df[&#8216;text&#8217;].iloc[0][:150] + &#8220;&#8230;&#8221;)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td 
class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">pandas <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">pd<\/span><\/p>\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">datasets <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-v\">fetch<\/span><span class=\"crayon-sy\">_<\/span>20newsgroups<\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Fetching a highly targeted subset of data (~150-200 docs)<\/span><\/p>\n<p><span class=\"crayon-v\">categories<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8216;sci.space&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;sci.med&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;rec.autos&#8217;<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-v\">newsgroups<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">fetch_20newsgroups<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">subset<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;train&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">categories<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">categories<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">remove<\/span><span 
class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;headers&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;footers&#8217;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;quotes&#8217;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Sampling down into a representative, illustrative subset<\/span><\/p>\n<p><span class=\"crayon-v\">df<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">pd<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">DataFrame<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-s\">&#8216;text&#8217;<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">newsgroups<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">data<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8216;true_label&#8217;<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">newsgroups<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">target<\/span><span class=\"crayon-sy\">}<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">df<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8216;text&#8217;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-e\">strip<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">str<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">&gt;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">100<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">sample<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">150<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">42<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">reset_index<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">drop<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;Loaded {len(df)} text documents.&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;\\nSample document:&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8216;text&#8217;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">iloc<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">0<\/span><span class=\"crayon-sy\">]<\/span><span 
class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-cn\">150<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&#8220;&#8230;&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Output:<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfa4580964728\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nLoaded 150 text documents.&#13;<br \/>\n&#13;<br \/>\nSample document:&#13;<br \/>\n&#13;<br \/>\nOkay Mr. Dyer, we&#8217;re properly impressed with your philosophical skills and&#13;<br \/>\nability to insult people. 
You&#8217;re a wonderful speaker and an adept politic&#8230;<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-i\">Loaded<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">150<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">text <\/span><span class=\"crayon-v\">documents<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">Sample <\/span><span class=\"crayon-v\">document<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">Okay <\/span><span class=\"crayon-v\">Mr<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">Dyer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">we<\/span><span class=\"crayon-s\">&#8216;re properly impressed with your philosophical skills and<\/span><\/p>\n<p><span class=\"crayon-s\">ability to insult people. 
You&#8217;<\/span><span class=\"crayon-i\">re<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">a<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">wonderful <\/span><span class=\"crayon-e\">speaker <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">an <\/span><span class=\"crayon-e\">adept <\/span><span class=\"crayon-v\">politic<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>The next step is to obtain the embeddings from raw texts. To do this, we load <code>all-MiniLM-L6-v2<\/code> from Hugging Face\u2019s sentence-transformers library. This is a lightweight yet effective model to obtain embeddings quickly.<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfa9680303542\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nfrom sentence_transformers import SentenceTransformer&#13;<br \/>\n&#13;<br \/>\n# Loading the free, open-source model&#13;<br \/>\nmodel = SentenceTransformer(&#8216;all-MiniLM-L6-v2&#8217;)&#13;<br \/>\n&#13;<br \/>\n# Encoding text documents into dense vector embeddings&#13;<br \/>\nprint(&#8220;Generating embeddings&#8230;&#8221;)&#13;<br \/>\nembeddings = model.encode(df[&#8216;text&#8217;].tolist(), show_progress_bar=True)&#13;<br \/>\n&#13;<br \/>\nprint(f&#8221;Embedding matrix shape: 
{embeddings.shape}&#8221;)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">sentence_transformers <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">SentenceTransformer<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Loading the free, open-source model<\/span><\/p>\n<p><span class=\"crayon-v\">model<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">SentenceTransformer<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8216;all-MiniLM-L6-v2&#8217;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Encoding text documents into dense vector embeddings<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;Generating embeddings&#8230;&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">embeddings<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">model<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">encode<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8216;text&#8217;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tolist<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span 
class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">show_progress_bar<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">&#8220;Embedding matrix shape: {embeddings.shape}&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Since the embedding dimension is originally too high for clustering purposes, we now apply a dimensionality reduction technique by using the UMAP algorithm from the namesake library installed earlier:<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfad553572469\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nimport umap&#13;<br \/>\n&#13;<br \/>\n# Reducing embedding dimensions to 5, to retain enough density information for clustering&#13;<br \/>\nreducer = umap.UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)&#13;<br \/>\nreduced_embeddings = reducer.fit_transform(embeddings)&#13;<br \/>\n&#13;<br \/>\nprint(f&#8221;Reduced matrix shape: {reduced_embeddings.shape}&#8221;)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" 
data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">umap<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Reducing embedding dimensions to 5, to retain enough density information for clustering<\/span><\/p>\n<p><span class=\"crayon-v\">reducer<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">umap<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">UMAP<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">n_neighbors<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">15<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">n_components<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">min_dist<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.0<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">random_state<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">42<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">reduced_embeddings<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">reducer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit_transform<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">embeddings<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span 
class=\"crayon-s\">&#8220;Reduced matrix shape: {reduced_embeddings.shape}&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Now our numerical embedding vectors associated with news articles consist of five dimensions (attributes) only. Let\u2019s see if this compact representation is meaningful enough to obtain insightful clustering by applying the HDBSCAN algorithm, which is a density-based clustering approach:<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfb1053210277\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nfrom sklearn.cluster import HDBSCAN&#13;<br \/>\n&#13;<br \/>\n# Initializing HDBSCAN&#13;<br \/>\n# min_cluster_size=8: we specified that each cluster must have at least 8 documents&#13;<br \/>\nclusterer = HDBSCAN(min_cluster_size=8, min_samples=3, store_centers=&#8221;centroid&#8221;)&#13;<br \/>\ndf[&#8216;cluster&#8217;] = clusterer.fit_predict(reduced_embeddings)&#13;<br \/>\n&#13;<br \/>\n# Counting instances per cluster&#13;<br \/>\ncluster_counts = df[&#8216;cluster&#8217;].value_counts()&#13;<br \/>\nprint(&#8220;\\nCluster Distribution:&#8221;)&#13;<br \/>\nprint(cluster_counts)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td 
class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-v\">sklearn<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">cluster <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">HDBSCAN<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Initializing HDBSCAN<\/span><\/p>\n<p><span class=\"crayon-p\"># min_cluster_size=8: we specified that each cluster must have at least 8 documents<\/span><\/p>\n<p><span class=\"crayon-v\">clusterer<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">HDBSCAN<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">min_cluster_size<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">8<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">min_samples<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">store_centers<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&#8216;centroid&#8217;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8216;cluster&#8217;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">clusterer<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">fit_predict<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">reduced_embeddings<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Counting 
instances per cluster<\/span><\/p>\n<p><span class=\"crayon-v\">cluster_counts<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&#8216;cluster&#8217;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">value_counts<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&#8220;\\nCluster Distribution:&#8221;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">cluster_counts<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p><strong>Important<\/strong>: the clustering results are partly influenced by the hyperparameter settings we defined for HDBSCAN. 
I recommend you try out other configurations for the minimum cluster size and other hyperparameters to explore how this affects results.<\/p>\n<p>Result:<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfb6251275206\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nCluster Distribution:&#13;<br \/>\ncluster&#13;<br \/>\n0    101&#13;<br \/>\n1     49&#13;<br \/>\nName: count, dtype: int64<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-e\">Cluster <\/span><span class=\"crayon-v\">Distribution<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-i\">cluster<\/span><\/p>\n<p><span class=\"crayon-cn\">0<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-cn\">101<\/span><\/p>\n<p><span class=\"crayon-cn\">1<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0 <\/span><span class=\"crayon-cn\">49<\/span><\/p>\n<p><span class=\"crayon-v\">Name<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">count<\/span><span 
class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dtype<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">int64<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>It looks like HDBSCAN detected two clusters associated with high-density regions in the data space. Are there also noisy points that were not assigned to either of these two clusters? Let\u2019s check:<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfbb663585569\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nfor cluster_id in sorted(df['cluster'].unique()):&#13;<br \/>\n    if cluster_id == -1:&#13;<br \/>\n        print(\"\\n=== CLUSTER: NOISE \/ UNCLASSIFIED ===\")&#13;<br \/>\n    else:&#13;<br \/>\n        print(f\"\\n=== CLUSTER: Discovered Topic #{cluster_id} ===\")&#13;<br \/>\n        &#13;<br \/>\n    # Getting up to 3 sample texts from this cluster&#13;<br \/>\n    samples = df[df['cluster'] == cluster_id]['text'].head(3).tolist()&#13;<br \/>\n    for i, sample in enumerate(samples, 1):&#13;<br \/>\n        clean_sample = \" \".join(sample.split())[:120]&#13;<br \/>\n        print(f\"  {i}. 
{clean_sample}...\")<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">cluster_id <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sorted<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">'cluster'<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">unique<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">if<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cluster_id<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">\"\\n=== CLUSTER: NOISE \/ UNCLASSIFIED ===\"<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">else<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">\"\\n=== CLUSTER: Discovered Topic #{cluster_id} ===\"<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Getting up to 3 sample texts from this cluster<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">samples<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">'cluster'<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">==<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cluster_id<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">'text'<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">head<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-cn\">3<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tolist<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sample <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">samples<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">clean_sample<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">\" \"<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">join<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">sample<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">split<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-cn\">120<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">\"\u00a0\u00a0{i}. 
{clean_sample}...\"<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Output:<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfc0416254317\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\n=== CLUSTER: Discovered Topic #0 ===&#13;<br \/>\n  1. Okay Mr. Dyer, we&#8217;re properly impressed with your philosophical skills and ability to insult people. You&#8217;re a wonderful &#8230;&#13;<br \/>\n  2. I was at an interesting seminar at work (UK&#8217;s R.A.L. Space Science Dept.) on this subject, specifically on a small-scale&#8230;&#13;<br \/>\n  3. This is the second post which seems to be blurring the distinction between real disease caused by Candida albicans and t&#8230;&#13;<br \/>\n&#13;<br \/>\n=== CLUSTER: Discovered Topic #1 ===&#13;<br \/>\n  1. It&#8217;s great that all these other cars can out-handle, out-corner, and out- accelerate an Integra. But, you&#8217;ve got to ask &#8230;&#13;<br \/>\n  2. l diamond star cars (Talon\/Eclipse\/Laser) put out 190 hp in the turbo models, and 195 hp in the AWD turbo models, These &#8230;&#13;<br \/>\n  3. Sorry for the mis-spelling, but I forgot how to spell it after my series of exams and NO-on hand reference here. 
Is it s&#8230;<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-o\">===<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">CLUSTER<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">Discovered <\/span><span class=\"crayon-i\">Topic<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\">#0 ===<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-cn\">1.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">Okay <\/span><span class=\"crayon-v\">Mr<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">Dyer<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">we<\/span><span class=\"crayon-s\">&#8216;re properly impressed with your philosophical skills and ability to insult people. 
You&#8217;<\/span><span class=\"crayon-i\">re<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">a<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">wonderful<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-cn\">2.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">I<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">was <\/span><span class=\"crayon-e\">at <\/span><span class=\"crayon-e\">an <\/span><span class=\"crayon-e\">interesting <\/span><span class=\"crayon-e\">seminar <\/span><span class=\"crayon-e\">at <\/span><span class=\"crayon-e\">work<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">UK<\/span><span class=\"crayon-s\">&#8216;s R.A.L. Space Science Dept.) on this subject, specifically on a small-scale&#8230;<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a03. This is the second post which seems to be blurring the distinction between real disease caused by Candida albicans and t&#8230;<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-s\">=== CLUSTER: Discovered Topic #1 ===<\/span><\/p>\n<p><span class=\"crayon-s\">\u00a0\u00a01. 
It&#8217;<\/span><span class=\"crayon-i\">s<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">great <\/span><span class=\"crayon-e\">that <\/span><span class=\"crayon-e\">all <\/span><span class=\"crayon-e\">these <\/span><span class=\"crayon-e\">other <\/span><span class=\"crayon-e\">cars <\/span><span class=\"crayon-e\">can <\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-v\">handle<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-v\">corner<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">out<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">accelerate <\/span><span class=\"crayon-e\">an <\/span><span class=\"crayon-v\">Integra<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">But<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">you<\/span>&#8216;<span class=\"crayon-e\">ve <\/span><span class=\"crayon-e\">got <\/span><span class=\"crayon-st\">to<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">ask<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-cn\">2.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">l<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">diamond <\/span><span class=\"crayon-e\">star <\/span><span class=\"crayon-e\">cars<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">Talon<\/span><span 
class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">Eclipse<\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-v\">Laser<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">put <\/span><span class=\"crayon-i\">out<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">190<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">hp <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">the <\/span><span class=\"crayon-e\">turbo <\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">195<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">hp <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">the <\/span><span class=\"crayon-e\">AWD <\/span><span class=\"crayon-e\">turbo <\/span><span class=\"crayon-v\">models<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">These<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0<\/span><span class=\"crayon-cn\">3.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">Sorry <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">the <\/span><span class=\"crayon-v\">mis<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-v\">spelling<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">but<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">I<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">forgot <\/span><span class=\"crayon-e\">how 
<\/span><span class=\"crayon-st\">to<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">spell <\/span><span class=\"crayon-e\">it <\/span><span class=\"crayon-e\">after <\/span><span class=\"crayon-e\">my <\/span><span class=\"crayon-e\">series <\/span><span class=\"crayon-e\">of <\/span><span class=\"crayon-e\">exams <\/span><span class=\"crayon-st\">and<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">NO<\/span><span class=\"crayon-o\">&#8211;<\/span><span class=\"crayon-e\">on <\/span><span class=\"crayon-e\">hand <\/span><span class=\"crayon-e\">reference <\/span><span class=\"crayon-v\">here<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">Is<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">it<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>All 150 data points in the sample were assigned to one of the two identified clusters, with none labeled as noise, suggesting that the news articles may be easily separable by topic.<\/p>\n<p>For extra insight, we can visualize the clusters with the supplementary code below, which draws a scatterplot for every pairwise combination of the five UMAP dimensions that describe each data point:<\/p>\n<div id=\"urvanov-syntax-highlighter-6a40101bbcfc4557775450\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\" style=\" margin-top: 12px; margin-bottom: 12px; font-size: 12px !important; line-height: 15px !important;\">\n<p><textarea wrap=\"soft\" class=\"urvanov-syntax-highlighter-plain print-no\" 
data-settings=\"dblclick\" readonly=\"readonly\" style=\"-moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4; font-size: 12px !important; line-height: 15px !important;\"><br \/>\nimport matplotlib.pyplot as plt&#13;<br \/>\nimport seaborn as sns&#13;<br \/>\nimport itertools&#13;<br \/>\n&#13;<br \/>\n# Creating a DataFrame for the 5 reduced embeddings and cluster labels&#13;<br \/>\nreduced_df = pd.DataFrame(reduced_embeddings, columns=[f'UMAP_D{i+1}' for i in range(reduced_embeddings.shape[1])])&#13;<br \/>\nreduced_df['cluster'] = df['cluster']&#13;<br \/>\n&#13;<br \/>\n# Getting all unique pairwise combinations of the 5 dimensions&#13;<br \/>\ndim_pairs = list(itertools.combinations(reduced_df.columns[:-1], 2))&#13;<br \/>\n&#13;<br \/>\nnum_plots = len(dim_pairs)&#13;<br \/>\nnum_cols = 3&#13;<br \/>\nnum_rows = (num_plots + num_cols - 1) \/\/ num_cols&#13;<br \/>\n&#13;<br \/>\nplt.figure(figsize=(num_cols * 5, num_rows * 4))&#13;<br \/>\n&#13;<br \/>\nfor i, (dim1, dim2) in enumerate(dim_pairs):&#13;<br \/>\n    plt.subplot(num_rows, num_cols, i + 1)&#13;<br \/>\n    sns.scatterplot(&#13;<br \/>\n        x=dim1,&#13;<br \/>\n        y=dim2,&#13;<br \/>\n        hue=\"cluster\",&#13;<br \/>\n        data=reduced_df,&#13;<br \/>\n        palette=\"viridis\",&#13;<br \/>\n        s=70,&#13;<br \/>\n        alpha=0.7,&#13;<br \/>\n        legend='full'&#13;<br \/>\n    )&#13;<br \/>\n    plt.title(f'{dim1} vs {dim2}')&#13;<br \/>\n    plt.xlabel(dim1)&#13;<br \/>\n    plt.ylabel(dim2)&#13;<br \/>\n    plt.grid(True, linestyle=\"--\", alpha=0.6)&#13;<br \/>\n&#13;<br \/>\nplt.tight_layout()&#13;<br \/>\nplt.show()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\" style=\"\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div 
class=\"urvanov-syntax-highlighter-nums-content\" style=\"font-size: 12px !important; line-height: 15px !important;\">\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\" style=\"font-size: 12px !important; line-height: 15px !important; -moz-tab-size:4; -o-tab-size:4; -webkit-tab-size:4; tab-size:4;\">\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-v\">matplotlib<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">pyplot <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">plt<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">seaborn <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">sns<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-i\">itertools<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Creating a DataFrame for the 5 reduced embeddings and cluster labels<\/span><\/p>\n<p><span class=\"crayon-v\">reduced_df<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">pd<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">DataFrame<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">reduced_embeddings<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">columns<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">[<\/span><span 
class=\"crayon-i\">f<\/span><span class=\"crayon-s\">'UMAP_D{i+1}'<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">range<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">reduced_embeddings<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">shape<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">reduced_df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">'cluster'<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">'cluster'<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Getting all unique pairwise combinations of the 5 dimensions<\/span><\/p>\n<p><span class=\"crayon-v\">dim_pairs<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">list<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">itertools<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">combinations<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">reduced_df<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">columns<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">]<\/span><span 
class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">2<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">num_plots<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">len<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">dim_pairs<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">num_cols<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">3<\/span><\/p>\n<p><span class=\"crayon-v\">num_rows<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">num_plots<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">num_cols<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-c\">\/\/ num_cols<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">figure<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">figsize<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-e \">num_cols *<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">5<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e \">num_rows *<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">4<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span 
class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">dim1<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">dim2<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">enumerate<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">dim_pairs<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">subplot<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">num_rows<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">num_cols<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">i<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">+<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">sns<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">scatterplot<\/span><span class=\"crayon-sy\">(<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">x<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">dim1<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">y<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">dim2<\/span><span 
class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">hue<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">'cluster'<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">data<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">reduced_df<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">palette<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">'viridis'<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">s<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">70<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">alpha<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.7<\/span><span class=\"crayon-sy\">,<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">legend<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">'full'<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">title<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">f<\/span><span class=\"crayon-s\">'{dim1} vs {dim2}'<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">xlabel<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">dim1<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">ylabel<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">dim2<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">grid<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-t\">True<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">linestyle<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">'--'<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">alpha<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">0.6<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">tight_layout<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">plt<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">show<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Result:<\/p>\n<div style=\"width: 810px\" class=\"wp-caption aligncenter\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/1782757923_60_Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png\" alt=\"Clustering visualizations\" width=\"800\" height=\"706\"\/><\/div>\n<p>By trying different configurations 
for HDBSCAN, you may obtain a number of clusters other than two. Just give it a try!<\/p>\n<h2>Wrapping Up<\/h2>\n<p>Having built the text clustering pipeline, it is worth recapping why combining LLM embeddings with HDBSCAN pays off. First, embeddings obtained through sentence-transformers capture, to a useful extent, the semantic meaning and linguistic nuances of the original text. Second, HDBSCAN does not need the number of clusters to be specified in advance, and it labels points in low-density regions as noise (cluster label -1) instead of forcing them into a cluster, where they would distort group-level statistics.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data. Topics we will cover include: How to generate text embeddings for raw documents using a pre-trained sentence-transformers model. 
How to reduce the<\/p>\n","protected":false},"author":1,"featured_media":180413,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[42],"tags":[],"class_list":{"0":"post-180412","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-ai"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Clustering Unstructured Text with LLM Embeddings and HDBSCAN - Ktromedia<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/ktromedia.com\/?p=180412\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Clustering Unstructured Text with LLM Embeddings and HDBSCAN - Ktromedia\" \/>\n<meta property=\"og:description\" content=\"In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data. Topics we will cover include: How to generate text embeddings for raw documents using a pre-trained sentence-transformers model. 
How to reduce the\" \/>\n<meta property=\"og:url\" content=\"https:\/\/ktromedia.com\/?p=180412\" \/>\n<meta property=\"og:site_name\" content=\"Ktromedia\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/KTROMedia\/\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-29T18:32:06+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1774\" \/>\n\t<meta property=\"og:image:height\" content=\"887\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"KTRO TEAM\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"KTRO TEAM\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/ktromedia.com\/?p=180412#article\",\"isPartOf\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180412\"},\"author\":{\"name\":\"KTRO TEAM\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b\"},\"headline\":\"Clustering Unstructured Text with LLM Embeddings and 
HDBSCAN\",\"datePublished\":\"2026-06-29T18:32:06+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180412\"},\"wordCount\":2006,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/ktromedia.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180412#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png\",\"articleSection\":[\"\u4eba\u5de5\u667a\u80fd\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/ktromedia.com\/?p=180412#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/ktromedia.com\/?p=180412\",\"url\":\"https:\/\/ktromedia.com\/?p=180412\",\"name\":\"Clustering Unstructured Text with LLM Embeddings and HDBSCAN - Ktromedia\",\"isPartOf\":{\"@id\":\"https:\/\/ktromedia.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180412#primaryimage\"},\"image\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180412#primaryimage\"},\"thumbnailUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png\",\"datePublished\":\"2026-06-29T18:32:06+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/ktromedia.com\/?p=180412#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/ktromedia.com\/?p=180412\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ktromedia.com\/?p=180412#primaryimage\",\"url\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png\",\"contentUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png\",\"width\":1774,\"height\":887},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/ktromedia.com\/?p=180412#breadcrumb\",\"itemListElement\":[{\"@ty
pe\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/ktromedia.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Clustering Unstructured Text with LLM Embeddings and HDBSCAN\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/ktromedia.com\/#website\",\"url\":\"https:\/\/ktromedia.com\/\",\"name\":\"Ktromedia\",\"description\":\"KTRO MEDIA Crypto News\",\"publisher\":{\"@id\":\"https:\/\/ktromedia.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/ktromedia.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/ktromedia.com\/#organization\",\"name\":\"Ktromedia\",\"url\":\"https:\/\/ktromedia.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png\",\"contentUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png\",\"width\":250,\"height\":250,\"caption\":\"Ktromedia\"},\"image\":{\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/KTROMedia\/\",\"https:\/\/www.linkedin.com\/company\/ktro-media\/\",\"https:\/\/t.me\/ktrogroup\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b\",\"name\":\"KTRO TEAM\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/ktromedia.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png\",\"contentUrl\":\"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png\",\"caption\":\"KTRO TEAM\"},\"description\":\"KTRO MEDIA 
\u662f\u4e00\u5bb6\u5168\u7403\u6027\u7684\u534e\u6587WEB3\u5a92\u4f53\u516c\u53f8\u3002\u6211\u4eec\u81f4\u529b\u4e8e\u4e3a\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u9886\u57df\u63d0\u4f9b\u6700\u65b0\u7684\u65b0\u95fb\u3001\u89c1\u89e3\u548c\u8d8b\u52bf\u5206\u6790\u3002\u6211\u4eec\u7684\u5b97\u65e8\u662f\u4e3a\u5168\u7403\u7528\u6237\u63d0\u4f9b\u9ad8\u8d28\u91cf\u3001\u5168\u9762\u7684\u8d44\u8baf\u670d\u52a1\uff0c\u8ba9\u4ed6\u4eec\u66f4\u597d\u5730\u4e86\u89e3\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u884c\u4e1a\u7684\u6700\u65b0\u52a8\u6001\u3002\u6211\u4eec\u4e5f\u5e0c\u671b\u80fd\u5e2e\u5230\u66f4\u591a\u4f18\u79c0\u7684WEB3\u4ea7\u54c1\u627e\u5230\u66f4\u591a\u66f4\u597d\u7684\u8d44\u6e90\u597d\u8ba9\u8fd9\u9886\u57df\u53d8\u5f97\u66f4\u6210\u719f\u3002 \u6211\u4eec\u7684\u62a5\u9053\u8303\u56f4\u6db5\u76d6\u4e86\u533a\u5757\u94fe\u3001\u52a0\u5bc6\u8d27\u5e01\u3001\u667a\u80fd\u5408\u7ea6\u3001DeFi\u3001NFT \u548c Web3 \u751f\u6001\u7cfb\u7edf\u7b49\u9886\u57df\u3002\u6211\u4eec\u7684\u62a5\u9053\u4e0d\u4ec5\u6765\u81ea\u884c\u4e1a\u5185\u7684\u4e13\u5bb6\uff0c\u5148\u950b\u8005\u4e5f\u5305\u62ec\u4e86\u6211\u4eec\u81ea\u5df1\u7684\u5206\u6790\u548c\u89c2\u70b9\u3002\u6211\u4eec\u5728\u5404\u4e2a\u56fd\u5bb6\u548c\u5730\u533a\u90fd\u8bbe\u6709\u56e2\u961f\uff0c\u4e3a\u8bfb\u8005\u63d0\u4f9b\u672c\u5730\u5316\u7684\u62a5\u9053\u548c\u5206\u6790\u3002 \u9664\u4e86\u65b0\u95fb\u62a5\u9053\uff0c\u6211\u4eec\u8fd8\u63d0\u4f9b\u5e02\u573a\u7814\u7a76\u548c\u54a8\u8be2\u670d\u52a1\u3002\u6211\u4eec\u7684\u4e13\u4e1a\u56e2\u961f\u53ef\u4ee5\u4e3a\u60a8\u63d0\u4f9b\u6709\u5173\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u884c\u4e1a\u7684\u6df1\u5165\u5206\u6790\u548c\u5e02\u573a\u8d8b\u52bf\uff0c\u5e2e\u52a9\u60a8\u505a\u51fa\u66f4\u660e\u667a\u7684\u6295\u8d44\u51b3\u7b56\u3002 
\u6211\u4eec\u7684\u4f7f\u547d\u662f\u6210\u4e3a\u5168\u7403\u534e\u6587\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u884c\u4e1a\u6700\u53d7\u4fe1\u8d56\u7684\u4fe1\u606f\u6765\u6e90\u4e4b\u4e00\u3002\u6211\u4eec\u5c06\u7ee7\u7eed\u4e0d\u65ad\u52aa\u529b\uff0c\u4e3a\u8bfb\u8005\u63d0\u4f9b\u6700\u65b0\u3001\u6700\u5168\u9762\u3001\u6700\u53ef\u9760\u7684\u4fe1\u606f\u670d\u52a1\u3002\",\"sameAs\":[\"https:\/\/ktromedia.com\"],\"url\":\"https:\/\/ktromedia.com\/?author=1\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Clustering Unstructured Text with LLM Embeddings and HDBSCAN - Ktromedia","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/ktromedia.com\/?p=180412","og_locale":"en_US","og_type":"article","og_title":"Clustering Unstructured Text with LLM Embeddings and HDBSCAN - Ktromedia","og_description":"In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data. Topics we will cover include: How to generate text embeddings for raw documents using a pre-trained sentence-transformers model. How to reduce the","og_url":"https:\/\/ktromedia.com\/?p=180412","og_site_name":"Ktromedia","article_publisher":"https:\/\/www.facebook.com\/KTROMedia\/","article_published_time":"2026-06-29T18:32:06+00:00","og_image":[{"width":1774,"height":887,"url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png","type":"image\/png"}],"author":"KTRO TEAM","twitter_card":"summary_large_image","twitter_misc":{"Written by":"KTRO TEAM","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/ktromedia.com\/?p=180412#article","isPartOf":{"@id":"https:\/\/ktromedia.com\/?p=180412"},"author":{"name":"KTRO TEAM","@id":"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b"},"headline":"Clustering Unstructured Text with LLM Embeddings and HDBSCAN","datePublished":"2026-06-29T18:32:06+00:00","mainEntityOfPage":{"@id":"https:\/\/ktromedia.com\/?p=180412"},"wordCount":2006,"commentCount":0,"publisher":{"@id":"https:\/\/ktromedia.com\/#organization"},"image":{"@id":"https:\/\/ktromedia.com\/?p=180412#primaryimage"},"thumbnailUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png","articleSection":["\u4eba\u5de5\u667a\u80fd"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/ktromedia.com\/?p=180412#respond"]}]},{"@type":"WebPage","@id":"https:\/\/ktromedia.com\/?p=180412","url":"https:\/\/ktromedia.com\/?p=180412","name":"Clustering Unstructured Text with LLM Embeddings and HDBSCAN - 
Ktromedia","isPartOf":{"@id":"https:\/\/ktromedia.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/ktromedia.com\/?p=180412#primaryimage"},"image":{"@id":"https:\/\/ktromedia.com\/?p=180412#primaryimage"},"thumbnailUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png","datePublished":"2026-06-29T18:32:06+00:00","breadcrumb":{"@id":"https:\/\/ktromedia.com\/?p=180412#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/ktromedia.com\/?p=180412"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ktromedia.com\/?p=180412#primaryimage","url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png","contentUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2026\/06\/Clustering-Unstructured-Text-with-LLM-Embeddings-and-HDBSCAN.png","width":1774,"height":887},{"@type":"BreadcrumbList","@id":"https:\/\/ktromedia.com\/?p=180412#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/ktromedia.com\/"},{"@type":"ListItem","position":2,"name":"Clustering Unstructured Text with LLM Embeddings and HDBSCAN"}]},{"@type":"WebSite","@id":"https:\/\/ktromedia.com\/#website","url":"https:\/\/ktromedia.com\/","name":"Ktromedia","description":"KTRO MEDIA Crypto 
News","publisher":{"@id":"https:\/\/ktromedia.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/ktromedia.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/ktromedia.com\/#organization","name":"Ktromedia","url":"https:\/\/ktromedia.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/","url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png","contentUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/11\/ktroicon.png","width":250,"height":250,"caption":"Ktromedia"},"image":{"@id":"https:\/\/ktromedia.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/KTROMedia\/","https:\/\/www.linkedin.com\/company\/ktro-media\/","https:\/\/t.me\/ktrogroup"]},{"@type":"Person","@id":"https:\/\/ktromedia.com\/#\/schema\/person\/612bf2fbac107722ea365932cdd35f5b","name":"KTRO TEAM","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/ktromedia.com\/#\/schema\/person\/image\/","url":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png","contentUrl":"https:\/\/ktromedia.com\/wp-content\/uploads\/2025\/10\/cropped-Untitled-design-7-1-150x150.png","caption":"KTRO TEAM"},"description":"KTRO MEDIA 
\u662f\u4e00\u5bb6\u5168\u7403\u6027\u7684\u534e\u6587WEB3\u5a92\u4f53\u516c\u53f8\u3002\u6211\u4eec\u81f4\u529b\u4e8e\u4e3a\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u9886\u57df\u63d0\u4f9b\u6700\u65b0\u7684\u65b0\u95fb\u3001\u89c1\u89e3\u548c\u8d8b\u52bf\u5206\u6790\u3002\u6211\u4eec\u7684\u5b97\u65e8\u662f\u4e3a\u5168\u7403\u7528\u6237\u63d0\u4f9b\u9ad8\u8d28\u91cf\u3001\u5168\u9762\u7684\u8d44\u8baf\u670d\u52a1\uff0c\u8ba9\u4ed6\u4eec\u66f4\u597d\u5730\u4e86\u89e3\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u884c\u4e1a\u7684\u6700\u65b0\u52a8\u6001\u3002\u6211\u4eec\u4e5f\u5e0c\u671b\u80fd\u5e2e\u5230\u66f4\u591a\u4f18\u79c0\u7684WEB3\u4ea7\u54c1\u627e\u5230\u66f4\u591a\u66f4\u597d\u7684\u8d44\u6e90\u597d\u8ba9\u8fd9\u9886\u57df\u53d8\u5f97\u66f4\u6210\u719f\u3002 \u6211\u4eec\u7684\u62a5\u9053\u8303\u56f4\u6db5\u76d6\u4e86\u533a\u5757\u94fe\u3001\u52a0\u5bc6\u8d27\u5e01\u3001\u667a\u80fd\u5408\u7ea6\u3001DeFi\u3001NFT \u548c Web3 \u751f\u6001\u7cfb\u7edf\u7b49\u9886\u57df\u3002\u6211\u4eec\u7684\u62a5\u9053\u4e0d\u4ec5\u6765\u81ea\u884c\u4e1a\u5185\u7684\u4e13\u5bb6\uff0c\u5148\u950b\u8005\u4e5f\u5305\u62ec\u4e86\u6211\u4eec\u81ea\u5df1\u7684\u5206\u6790\u548c\u89c2\u70b9\u3002\u6211\u4eec\u5728\u5404\u4e2a\u56fd\u5bb6\u548c\u5730\u533a\u90fd\u8bbe\u6709\u56e2\u961f\uff0c\u4e3a\u8bfb\u8005\u63d0\u4f9b\u672c\u5730\u5316\u7684\u62a5\u9053\u548c\u5206\u6790\u3002 \u9664\u4e86\u65b0\u95fb\u62a5\u9053\uff0c\u6211\u4eec\u8fd8\u63d0\u4f9b\u5e02\u573a\u7814\u7a76\u548c\u54a8\u8be2\u670d\u52a1\u3002\u6211\u4eec\u7684\u4e13\u4e1a\u56e2\u961f\u53ef\u4ee5\u4e3a\u60a8\u63d0\u4f9b\u6709\u5173\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u884c\u4e1a\u7684\u6df1\u5165\u5206\u6790\u548c\u5e02\u573a\u8d8b\u52bf\uff0c\u5e2e\u52a9\u60a8\u505a\u51fa\u66f4\u660e\u667a\u7684\u6295\u8d44\u51b3\u7b56\u3002 
\u6211\u4eec\u7684\u4f7f\u547d\u662f\u6210\u4e3a\u5168\u7403\u534e\u6587\u533a\u5757\u94fe\u548c\u91d1\u878d\u79d1\u6280\u884c\u4e1a\u6700\u53d7\u4fe1\u8d56\u7684\u4fe1\u606f\u6765\u6e90\u4e4b\u4e00\u3002\u6211\u4eec\u5c06\u7ee7\u7eed\u4e0d\u65ad\u52aa\u529b\uff0c\u4e3a\u8bfb\u8005\u63d0\u4f9b\u6700\u65b0\u3001\u6700\u5168\u9762\u3001\u6700\u53ef\u9760\u7684\u4fe1\u606f\u670d\u52a1\u3002","sameAs":["https:\/\/ktromedia.com"],"url":"https:\/\/ktromedia.com\/?author=1"}]}},"_links":{"self":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts\/180412","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=180412"}],"version-history":[{"count":1,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts\/180412\/revisions"}],"predecessor-version":[{"id":180414,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/posts\/180412\/revisions\/180414"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=\/wp\/v2\/media\/180413"}],"wp:attachment":[{"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=180412"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=180412"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ktromedia.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=180412"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}