With a vector database we can search our own data and have a large language model generate answers from it. But relying on vector retrieval alone, the generated answers are often unsatisfactory: they still lack a lot of contextual information, so we need additional ways to enrich the context.
To build a knowledge graph, we first need to "understand" the text and extract the "key" information from it, and then construct the graph from that information. We will use a small or large language model to accomplish this.
Tip
You could also train your own NLP model and use it for entity recognition, but that can be time-consuming and labor-intensive, and the results are not necessarily better.
Define the structure you want to recognize according to your own needs; for example, I want to extract technical terms, proper nouns, and a summary.
```csharp
public class NerResultDto
{
    /// <summary>
    /// Technical terms
    /// </summary>
    public string[]? TechNoun { get; set; }

    /// <summary>
    /// Proper nouns
    /// </summary>
    public string[]? ProperNoun { get; set; }

    /// <summary>
    /// Summary
    /// </summary>
    public string? Summary { get; set; }
}
```
Then we use the language model to extract the entity information, for example:
string prompt = $$""" {{text}} 请分析该句子,识别其中的内容,以便进行分类 要识别的类别为: 技术名词,专有名词,摘要;识别的内容保持原语言表示 返回的json格式如下:{ "TechNoun":["",""], "ProperNoun":["",""], "Summary":"" } 仅返回json本身内容作为最终结果,不需要任何格式化内容,不要添加解释。 """;
We also need to define a structure for relation extraction, i.e. the relationships between entities.
```csharp
public class RelationDto
{
    public required string Subject { get; set; }

    public required string Relation { get; set; }

    public required string Target { get; set; }
}
```
The prompt looks like this:
string prompt = $$""" 请分析以下文本内容,识别其中的技术名词、专有名词和概念等,并提取它们之间的关系。 如果文本中包含多个关系,请提取所有关系。如"属于","包含","依赖于"等关系 将结果以 JSON 数组的形式输出,每个元素包含 "subject" (实体1), "relation" (关系), 和 "target" (实体2) 字段。 返回的json格式如下:[ {"subject":"","relation":"","target":""} ] 仅返回json本身内容作为最终结果,不需要任何格式化内容,不要添加解释。 文本内容: {{text}} """;
The complete `KnowledgeProcessing` class:
```csharp
/// <summary>
/// Knowledge content processing
/// </summary>
public class KnowledgeProcessing
{
    private readonly IChatCompletionService _chat;
    private readonly ILogger<KnowledgeProcessing> _logger;

    public KnowledgeProcessing(IChatCompletionService chat, ILogger<KnowledgeProcessing> logger)
    {
        _chat = chat;
        _logger = logger;
    }

    /// <summary>
    /// Entity recognition
    /// </summary>
    /// <param name="text"></param>
    /// <returns></returns>
    public async Task<NerResultDto?> NerAsync(string text)
    {
        string prompt = $$"""
            {{text}}
            请分析该句子,识别其中的内容,以便进行分类
            要识别的类别为: 技术名词,专有名词,摘要;识别的内容保持原语言表示
            要求返回的json格式如下:{ "TechNoun":["",""], "ProperNoun":["",""], "Summary":"" }
            返回内容严格遵循json格式,是合法的JSON,不能有其他内容。请注意,json的键名必须与上面一致,值可以是空数组或空字符串。
            """;

        var response = await _chat.GetChatMessageContentsAsync(prompt);
        var result = response.FirstOrDefault()?.Content;
        if (result != null)
        {
            // The model may wrap the JSON in a ```json fence; strip it before deserializing.
            result = MarkdownProcessing.RemoveCodeBlock(result, "json");
            try
            {
                return JsonSerializer.Deserialize<NerResultDto>(result);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "NerAsync error: {result}", result);
            }
        }
        return null;
    }

    /// <summary>
    /// Relation extraction
    /// </summary>
    /// <param name="text"></param>
    /// <returns></returns>
    public async Task<List<RelationDto>?> RelationExtractionAsync(string text)
    {
        string prompt = $$"""
            请分析以下文本内容,识别其中的技术名词、专有名词和概念等,并提取它们之间的关系。
            如果文本中包含多个关系,请提取所有关系。如"属于","包含","依赖于"等关系
            将结果以 JSON 数组的形式输出,每个元素包含 "Subject" (实体1), "Relation" (关系), 和 "Target" (实体2) 字段。如果没有对应的Target,则不要返回该元素。
            返回的json格式如下:[ {"Subject":"","Relation":"","Target":""} ]
            返回内容严格遵循json格式,是合法的JSON,不能有其他内容。请注意,json的键名必须与上面一致,值可以是空数组或空字符串。
            文本内容:
            {{text}}
            """;

        var response = await _chat.GetChatMessageContentsAsync(prompt);
        var result = response.FirstOrDefault()?.Content;
        if (result != null)
        {
            result = MarkdownProcessing.RemoveCodeBlock(result, "json");
            try
            {
                return JsonSerializer.Deserialize<List<RelationDto>>(result);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "RelationExtractionAsync error: {result}", result);
            }
        }
        return null;
    }
}
```
Now that we know how to extract entities and relations, we need to store this information for later use.
We represent entity relations with `RelationDto`; since it is just a list, we can serialize it to JSON and store it locally.
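As a rough sketch (the file name is a placeholder, not from the original project), persisting and reloading the relation list could look like this:

```csharp
using System.Text.Json;

// Assumed file name; adjust to your own configuration.
var path = "relations.json";

// relations: in practice this is the List<RelationDto> produced by RelationExtractionAsync.
List<RelationDto> relations = [];
await File.WriteAllTextAsync(path, JsonSerializer.Serialize(relations));

// Later, load the relations back before building the in-memory graph.
var loaded = JsonSerializer.Deserialize<List<RelationDto>>(await File.ReadAllTextAsync(path)) ?? [];
```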
To be able to query these relations, we load the data, turn it into a graph structure, and then run queries against it.
We can persist the knowledge graph to a file and load it into memory when needed; `QuikGraph` can handle loading and querying the graph.
```csharp
/// <summary>
/// Stores relation data as a graph and handles queries against it
/// </summary>
public static class GraphDataProcessing
{
    private readonly static BidirectionalGraph<string, RelationEdge> Graph = new();

    public static void AddRelation(RelationDto relation)
    {
        if (!Graph.ContainsVertex(relation.Subject))
        {
            Graph.AddVertex(relation.Subject);
        }
        if (!Graph.ContainsVertex(relation.Target))
        {
            Graph.AddVertex(relation.Target);
        }
        Graph.AddEdge(new RelationEdge(relation.Subject, relation.Target, relation.Relation));
    }

    public static IEnumerable<RelationDto> QueryRelations(string subject, string target)
    {
        return Graph.Edges
            .Where(edge => edge.Source == subject || edge.Target == target)
            .Select(edge => new RelationDto
            {
                Subject = edge.Source,
                Relation = edge.Relation,
                Target = edge.Target
            });
    }
}

public class RelationEdge : Edge<string>
{
    public string Relation { get; }

    public RelationEdge(string source, string target, string relation) : base(source, target)
    {
        Relation = relation;
    }
}
```
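As a quick sanity check, a relation can be added and queried like this; the entity names here are made up purely for illustration:

```csharp
// Hypothetical data, just to show how GraphDataProcessing is used.
GraphDataProcessing.AddRelation(new RelationDto
{
    Subject = "Semantic Kernel",
    Relation = "依赖于",
    Target = ".NET"
});

// Passing the same entity as subject and target returns every edge touching that entity.
foreach (var r in GraphDataProcessing.QueryRelations("Semantic Kernel", "Semantic Kernel"))
{
    Console.WriteLine($"{r.Subject} {r.Relation} {r.Target}");
}
```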
Now that the key concepts and building blocks are clear, the next step is to prepare the data. We need to split the documents into paragraphs, embed them into the vector database, and extract entity relations to build the knowledge graph. I will use `markdown` documents as the example here.
This time we split the text into paragraphs and then vectorize them.
```csharp
using Markdig;

namespace ApiService.Processing;

/// <summary>
/// Markdown document processing
/// </summary>
public class MarkdownProcessing
{
    /// <summary>
    /// Split by second-level headings
    /// </summary>
    /// <param name="content"></param>
    /// <returns></returns>
    public static List<string> SplitText(string content)
    {
        // Split on "## " headings, then convert each chunk to plain text.
        var paragraphs = content.Split("\n## ", StringSplitOptions.RemoveEmptyEntries).ToList();
        return paragraphs.Select(s => Markdown.ToPlainText(s)).ToList();
    }
}
```
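The `KnowledgeProcessing` class above also calls `MarkdownProcessing.RemoveCodeBlock` to strip the ```json fence that models often wrap around their output. That helper is not shown in the original listing; a minimal sketch of what it might look like, added to the same class:

```csharp
/// <summary>
/// Hypothetical helper: removes a surrounding ```language ... ``` fence if one is present.
/// </summary>
public static string RemoveCodeBlock(string content, string language)
{
    var text = content.Trim();
    if (text.StartsWith("```"))
    {
        // Drop the opening fence line (``` or ```json) and a trailing ``` line.
        var firstLineBreak = text.IndexOf('\n');
        if (firstLineBreak >= 0)
        {
            text = text[(firstLineBreak + 1)..];
        }
        if (text.EndsWith("```"))
        {
            text = text[..^3];
        }
    }
    return text.Trim();
}
```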
Then we use the local Ollama model `nomic-embed-text` to generate the embeddings and store them in `qdrant`.
First, define the vector store model:
```csharp
public class DocumentEmbedding
{
    public static string DocName = "dusi";

    [VectorStoreRecordKey]
    public Guid Id { get; set; } = Guid.NewGuid();

    [VectorStoreRecordData]
    public string Hash { get; set; } = string.Empty;

    [VectorStoreRecordData]
    public string Content { get; set; } = string.Empty;

    [VectorStoreRecordVector(768)]
    public ReadOnlyMemory<float>? DescriptionEmbedding { get; set; }

    [VectorStoreRecordData(IsFilterable = true)]
    public List<string> Tags { get; set; } = [];
}
```
Then write the code that embeds the content and stores it:
```csharp
/// <summary>
/// Embed and store
/// </summary>
/// <param name="content"></param>
/// <returns></returns>
private async Task EmbedAndSaveAsync(IVectorStoreRecordCollection<Guid, DocumentEmbedding> collection, string content)
{
    var paragraph = MarkdownProcessing.SplitText(content);

    var hash = MD5.HashData(Encoding.UTF8.GetBytes(content));
    var md5 = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();

    // Search with a zero vector and a hash filter just to check whether this file was already embedded.
    ReadOnlyMemory<float> zeroVector = new float[768];
    var searchResult = await collection.VectorizedSearchAsync(zeroVector, new VectorSearchOptions<DocumentEmbedding>
    {
        Filter = d => d.Hash == md5,
        Top = 1
    });
    if (await searchResult.Results.AnyAsync())
    {
        _logger.LogInformation("➡️ skip exist embed...{name}", md5);
        return;
    }

    var embeddings = await _embed.GenerateEmbeddingsAsync(paragraph);
    List<DocumentEmbedding> sentencesEmbeddings = [];
    for (int i = 0; i < embeddings.Count; i++)
    {
        sentencesEmbeddings.Add(new DocumentEmbedding
        {
            Hash = md5,
            Content = paragraph[i],
            DescriptionEmbedding = embeddings[i]
        });
    }

    try
    {
        if (sentencesEmbeddings.Count > 0)
        {
            // UpsertBatchAsync returns the keys as an async stream; enumerate it to execute the upsert.
            var embedRes = collection.UpsertBatchAsync(sentencesEmbeddings);
            await foreach (var item in embedRes)
            {
            }
        }
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "Error during upsert");
    }
}
```
The `collection` is created as follows:

```csharp
var collection = _vectorStore.GetCollection<Guid, DocumentEmbedding>(DocumentEmbedding.DocName);
await collection.CreateCollectionIfNotExistsAsync();
```
Here `_embed` is the injected `ITextEmbeddingGenerationService`, and `_vectorStore` is the injected `IVectorStore`.
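For reference, a minimal registration sketch, assuming the Semantic Kernel Ollama and Qdrant connector packages are used; the extension method names, parameters, and model ids below are assumptions and may differ between connector versions:

```csharp
var builder = WebApplication.CreateBuilder(args);

// Assumed registration extensions from the Ollama / Qdrant Semantic Kernel connectors;
// check the package version you use for the exact names and signatures.
builder.Services.AddOllamaTextEmbeddingGeneration("nomic-embed-text", new Uri("http://localhost:11434"));
builder.Services.AddOllamaChatCompletion("qwen2.5", new Uri("http://localhost:11434")); // placeholder model id
builder.Services.AddQdrantVectorStore("localhost");

// Application services; lifetimes here are assumptions, not from the original project.
builder.Services.AddScoped<KnowledgeProcessing>();
builder.Services.AddSingleton<SearchService>();
builder.Services.AddHostedService<Worker>();
```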
First, define the storage structure for the knowledge graph data:
```csharp
public class KnowledgeGraphDto
{
    public HashSet<string> HashSet { get; set; } = [];

    public List<RelationDto> RelationDtos { get; set; } = [];
}
```
The `HashSet` is used to avoid processing the same data twice.
Then we write a method that extracts the knowledge graph data:
```csharp
/// <summary>
/// Knowledge graph extraction
/// </summary>
/// <param name="content"></param>
/// <returns></returns>
private async Task GraphKnowledgeAsync(string content, KnowledgeGraphDto knowledgeGraph)
{
    var paragraph = MarkdownProcessing.SplitText(content);
    using (var scope = serviceProvider.CreateScope())
    {
        var processing = scope.ServiceProvider.GetRequiredService<KnowledgeProcessing>();
        var relationList = new List<RelationDto>();
        foreach (var text in paragraph)
        {
            // Hash the paragraph (not the whole file) so each paragraph is deduplicated individually.
            var md5 = MD5.HashData(Encoding.UTF8.GetBytes(text));
            var hash = BitConverter.ToString(md5).Replace("-", "").ToLowerInvariant();

            // knowledgeGraph is shared across parallel calls, so guard access to it.
            bool exists;
            lock (knowledgeGraph)
            {
                exists = knowledgeGraph.HashSet.Contains(hash);
            }
            if (exists)
            {
                _logger.LogInformation("➡️ skip exist graph...{name}", hash);
                continue;
            }

            var relation = await processing.RelationExtractionAsync(text);
            if (relation != null)
            {
                relationList.AddRange(relation);
                lock (knowledgeGraph)
                {
                    knowledgeGraph.HashSet.Add(hash);
                }
            }
        }

        lock (knowledgeGraph)
        {
            knowledgeGraph.RelationDtos.AddRange(relationList);
        }
    }
}
```
We use a background service (`IHostedService`) to run the whole pipeline:
```csharp
public class Worker(
    ILogger<Worker> _logger,
    ITextEmbeddingGenerationService _embed,
    IVectorStore _vectorStore,
    IConfiguration configuration,
    IServiceProvider serviceProvider
    ) : IHostedService
{
    public async Task StartAsync(CancellationToken cancellationToken)
    {
        var searchPath = configuration["Resources:ContentPath"];
        if (string.IsNullOrWhiteSpace(searchPath))
        {
            _logger.LogError("Search path is not configured.");
            return;
        }
        _logger.LogInformation("✨ start embed content from {path}", searchPath);

        var collection = _vectorStore.GetCollection<Guid, DocumentEmbedding>(DocumentEmbedding.DocName);
        await collection.CreateCollectionIfNotExistsAsync();

        var mdFiles = Directory.GetFiles(searchPath, "*.md", SearchOption.AllDirectories);
        var parallelOptions = new ParallelOptions
        {
            MaxDegreeOfParallelism = 4, // Specify the desired number of threads
            CancellationToken = cancellationToken
        };

        // Embed all markdown files into the vector store.
        await Parallel.ForEachAsync(mdFiles, parallelOptions, async (mdFile, ct) =>
        {
            var mdContent = await File.ReadAllTextAsync(mdFile, ct);
            await EmbedAndSaveAsync(collection, mdContent);
            _logger.LogInformation("🆕 [{number}] Embedded!", mdFile);
        });
        _logger.LogInformation("✅ All files embedded!");

        // Knowledge graph
        _logger.LogInformation("✨ start graph knowledge from {path}", searchPath);
        var dataPath = configuration["Resources:DataPath"];
        if (dataPath == null)
        {
            _logger.LogError("dataPath is null");
            return;
        }
        var dataFilePath = Path.Combine(dataPath, "knowledge.json");
        var knowledgeGraph = new KnowledgeGraphDto();
        if (File.Exists(dataFilePath))
        {
            var jsonContent = await File.ReadAllTextAsync(dataFilePath, cancellationToken);
            knowledgeGraph = JsonSerializer.Deserialize<KnowledgeGraphDto>(jsonContent) ?? new KnowledgeGraphDto();
        }

        await Parallel.ForEachAsync(mdFiles, parallelOptions, async (mdFile, ct) =>
        {
            var mdContent = await File.ReadAllTextAsync(mdFile, ct);
            await GraphKnowledgeAsync(mdContent, knowledgeGraph);
            _logger.LogInformation("🆕 Got Knowledge:[{number}]", mdFile);
        });
        _logger.LogInformation("✅ All files graph knowledge!");

        // Save the knowledge graph to disk.
        var json = JsonSerializer.Serialize(knowledgeGraph);
        await File.WriteAllTextAsync(dataFilePath, json, cancellationToken);
        _logger.LogInformation("✅ Save knowledge graph to {path}", dataFilePath);
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;

    // ...plus the EmbedAndSaveAsync and GraphKnowledgeAsync methods shown above
}
```
Note
Data processing takes time; it can be run from a console program or a background service. Avoid reprocessing data that has already been handled.
At this point we have the following data: the document paragraphs embedded in `qdrant`, and the extracted entity relations saved to `knowledge.json`.
Next, let's outline the query flow. When a user asks a question, we:

1. Run entity recognition on the question to pull out the key terms.
2. Query the knowledge graph for relations involving those terms.
3. Run a vector search on the question itself and on each matched relation.
4. Hand all of the retrieved text to the chat model as context for generating the answer.
Important
This flow is not fixed; adjust it to fit your actual scenario.
Following this approach, we write a `SearchService` class to handle queries:
```csharp
public class SearchService
{
    private readonly ITextEmbeddingGenerationService _embed;
    private readonly IVectorStoreRecordCollection<Guid, DocumentEmbedding> _collection;
    private readonly KnowledgeProcessing _knowledgeProcessing;
    private readonly ILogger<SearchService> _logger;

    public SearchService(
        ITextEmbeddingGenerationService embed,
        IVectorStore vectorStore,
        IConfiguration configuration,
        ILogger<SearchService> logger,
        KnowledgeProcessing knowledgeProcessing)
    {
        _embed = embed;
        _logger = logger;
        _collection = vectorStore.GetCollection<Guid, DocumentEmbedding>(DocumentEmbedding.DocName);
        _knowledgeProcessing = knowledgeProcessing;

        var dataPath = configuration["Resources:DataPath"];
        if (dataPath == null)
        {
            return;
        }
        if (!Directory.Exists(dataPath))
        {
            Directory.CreateDirectory(dataPath);
        }
        var dataFilePath = Path.Combine(dataPath, "knowledge.json");

        // Load the knowledge graph from disk into the in-memory graph.
        if (File.Exists(dataFilePath))
        {
            var jsonContent = File.ReadAllText(dataFilePath);
            var knowledgeGraph = JsonSerializer.Deserialize<KnowledgeGraphDto>(jsonContent);
            var relationDtos = knowledgeGraph?.RelationDtos
                .Where(r => !string.IsNullOrWhiteSpace(r.Target) && !string.IsNullOrWhiteSpace(r.Subject))
                .ToList();
            relationDtos?.ForEach(GraphDataProcessing.AddRelation);
        }
    }

    /// <summary>
    /// Search
    /// </summary>
    /// <param name="searchContent"></param>
    /// <param name="searchCount"></param>
    /// <returns></returns>
    public async Task<List<string>> SearchAsync(string searchContent, int searchCount = 10)
    {
        var searchResults = new List<string>();

        // Vector search on the question itself.
        var vectorResults = await SearchVectorAsync(searchContent);
        if (vectorResults?.Length > 0)
        {
            searchResults.AddRange(vectorResults ?? []);
        }

        // Query the knowledge graph, then vector-search each matched relation.
        var relationQueryList = await SearchGraphAsync(searchContent);
        if (relationQueryList?.Length > 0)
        {
            relationQueryList = [.. relationQueryList.Take(searchCount)];
            foreach (var item in relationQueryList)
            {
                _logger.LogInformation("🔍 Graph knowledge search: {item}", item);
                var vectorResult = await SearchVectorAsync(item);
                if (vectorResult != null && vectorResult.Length > 0)
                {
                    searchResults.AddRange(vectorResult);
                }
            }
        }
        else
        {
            _logger.LogWarning("⚠️ No Graph knowledge items");
        }
        return searchResults;
    }

    /// <summary>
    /// Vector search
    /// </summary>
    /// <param name="searchContent"></param>
    /// <returns></returns>
    public async Task<string[]?> SearchVectorAsync(string searchContent)
    {
        var results = Array.Empty<string>();
        var vector = await _embed.GenerateEmbeddingAsync(searchContent);
        var result = await _collection.VectorizedSearchAsync(vector, new VectorSearchOptions<DocumentEmbedding>
        {
            Top = 2,
        });
        if (await result.Results.AnyAsync())
        {
            await foreach (var item in result.Results)
            {
                if (!string.IsNullOrEmpty(item.Record.Content))
                {
                    results = [.. results, item.Record.Content];
                }
            }
        }
        return results;
    }

    /// <summary>
    /// Entity recognition
    /// </summary>
    /// <param name="searchContent"></param>
    /// <returns></returns>
    public async Task<string[]?> SearchNerAsync(string searchContent)
    {
        var res = Array.Empty<string>();
        var data = await _knowledgeProcessing.NerAsync(searchContent);
        if (data != null)
        {
            if (data.TechNoun != null)
            {
                foreach (var item in data.TechNoun)
                {
                    if (!string.IsNullOrWhiteSpace(item))
                    {
                        res = [.. res, item];
                    }
                }
            }
            if (data.ProperNoun != null)
            {
                foreach (var item in data.ProperNoun)
                {
                    if (!string.IsNullOrWhiteSpace(item))
                    {
                        res = [.. res, item];
                    }
                }
            }
            if (!string.IsNullOrWhiteSpace(data.Summary))
            {
                res = [.. res, data.Summary];
            }
        }
        var logRes = string.Join(",", res);
        _logger.LogInformation("➡️ Ner result: {res}", logRes);
        return res;
    }

    /// <summary>
    /// Search the knowledge graph
    /// </summary>
    /// <param name="searchContent"></param>
    /// <returns></returns>
    public async Task<string[]?> SearchGraphAsync(string searchContent)
    {
        var results = Array.Empty<string>();
        var nerResults = await SearchNerAsync(searchContent);
        if (nerResults == null || nerResults.Length == 0)
        {
            return null;
        }

        var relationResult = new List<RelationDto>();
        foreach (var ner in nerResults)
        {
            var relations = GraphDataProcessing.QueryRelations(ner, ner);
            if (relations != null)
            {
                relationResult.AddRange(relations);
            }
        }
        // RelationDto is a class, so deduplicate by its fields rather than by reference.
        relationResult = [.. relationResult.DistinctBy(r => (r.Subject, r.Relation, r.Target))];
        if (relationResult.Count > 0)
        {
            foreach (var item in relationResult)
            {
                if (!string.IsNullOrWhiteSpace(item.Target))
                {
                    results = [.. results, $"{item.Subject} {item.Relation} {item.Target}"];
                }
            }
        }
        return results;
    }
}
```
Then use `SearchService` in the API endpoint to handle queries:
```csharp
public static async Task SearchAsync(
    HttpContext httpContext,
    QuestionModel question,
    IChatCompletionService chat,
    SearchService search
    )
{
    httpContext.Response.ContentType = "text/plain;charset=utf-8";

    var searchResults = await search.SearchAsync(question.Content);
    string searchContent = string.Empty;
    if (searchResults?.Count > 0)
    {
        foreach (var item in searchResults)
        {
            searchContent += item + Environment.NewLine;
        }
    }

    string systemPrompt = $@"
以下是从本地文档中搜索到的相关内容:
{searchContent}
仅根据上述搜索结果来回答用户的问题。如果没有足够的内容来回答,则提示没有找到相关信息。
";

    ChatHistory history = [];
    // Add the system prompt first, then the user question.
    history.AddSystemMessage(systemPrompt);
    history.AddUserMessage(question.Content);

    var response = await chat.GetChatMessageContentsAsync(history);
    foreach (var item in response)
    {
        await httpContext.Response.WriteAsync(item.Content ?? "");
    }
    await httpContext.Response.CompleteAsync();
}
```
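The handler above is written for minimal APIs. The original article does not show `QuestionModel` or the route mapping, so both below are assumptions:

```csharp
// Hypothetical request model for the endpoint.
public class QuestionModel
{
    public required string Content { get; set; }
}
```

It could then be mapped in `Program.cs` with something like `app.MapPost("/api/search", SearchAsync)`, where the route and the containing class are placeholders.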
Tip
A complete sample project with the full source code is available.
To get more accurate answers, we use the knowledge graph to link related data and enrich the context.
Here a small local model handles entity recognition and relation extraction. Because small models have limited capabilities, they sometimes recognize or extract entities incorrectly, and in particular may fail to return valid `Json`, which means some relations are lost. A more capable large model can be used for this part instead.