BinQuery: A Novel Framework for Natural Language-Based Binary Code Retrieval
Binary Function Retrieval (BFR) is crucial in reverse engineering for identifying specific functions in binary code, especially those associated with malicious behavior or vulnerabilities. Traditional BFR methods rely on heuristics and often lack the efficiency and adaptability needed for large-scale or diverse binary analysis tasks. To address these challenges, we present BinQuery, a Natural Language-based BFR (NL-based BFR) framework that uses natural language queries to retrieve relevant binary functions with improved flexibility and precision. BinQuery introduces techniques that bridge the information gap between binary code and natural language, aligns the two at a fine granularity to improve retrieval accuracy, and leverages Large Language Models (LLMs) to refine queries and generate diverse descriptions. Evaluated on the ViC and Magma datasets, BinQuery surpasses current state-of-the-art methods, achieving a 42.55% increase in recall@1 on ViC and a 4x improvement on Magma. Our framework marks a significant advance for NL-based BFR, improving the efficacy of binary analysis for both general reverse engineering and vulnerability discovery.
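For readers unfamiliar with the reported metric, recall@k is commonly computed as the fraction of queries whose correct function appears among the top k retrieved candidates. The following is a minimal sketch under that common definition, assuming exactly one correct function per query; it is not taken from the BinQuery evaluation code, and the function `recall_at_k`, the queries, and the binary function names are all hypothetical.

```python
from typing import Dict, List


def recall_at_k(
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, str],
    k: int = 1,
) -> float:
    """Fraction of queries whose correct function is in the top-k results.

    Assumes one correct function per query (an assumption of this sketch).
    """
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = sum(
        1 for query, ranked in rankings.items()
        if ground_truth[query] in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical toy data: each query maps to a ranked list of function names.
rankings = {
    "parse a PNG chunk header": ["png_read_chunk", "png_crc_check", "inflate_block"],
    "compute a CRC32 checksum": ["inflate_block", "crc32_update", "png_crc_check"],
}
ground_truth = {
    "parse a PNG chunk header": "png_read_chunk",
    "compute a CRC32 checksum": "crc32_update",
}

r1 = recall_at_k(rankings, ground_truth, k=1)  # only the first query hits at rank 1
r2 = recall_at_k(rankings, ground_truth, k=2)  # both queries hit within the top 2
```

In this toy example recall@1 is 0.5 and recall@2 is 1.0, since the second query's correct function is ranked second.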