📱
Mobile-Agent: A Technical Deep Dive

Last modified: September 11, 2024

Created: August 22, 2024


💬

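Congratulations on getting the basic baseline up and running.

Next, we will walk through the core techniques behind Mobile-Agent step by step.

Don't forget the team-formation sheet: https://docs.qq.com/sheet/DTW50em50a2FCSnlX?tab=BB08J2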

Introduction to Mobile-Agent

[Figure: Mobile-Agent-v2 workflow diagram]

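Mobile-Agent-v2 is a mobile device operation assistant, proposed jointly by Alibaba and Beijing Jiaotong University, that achieves effective navigation through multi-agent collaboration. Building on OCR, open-set object detection, and multimodal models, the framework assembles the multi-agent system shown in the workflow figure above to carry out user instructions.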

Initial information acquisition

Code walkthrough

Let's start with run.py.

Code block

# Excerpt from run.py: adb_path, temp_file, and get_perception_infos are defined elsewhere in the file.
import os
import shutil

iter = 0
while True:
    iter += 1
    if iter == 1:  # First iteration: capture the initial screen state
        screenshot_file = "./screenshot/screenshot.jpg"  # Where the screenshot is saved
        # get_perception_infos takes the adb path and the screenshot path, and returns
        # the perception entries plus the screen width and height
        perception_infos, width, height = get_perception_infos(
            adb_path, screenshot_file
        )
        shutil.rmtree(temp_file)  # Delete the temp folder and its contents
        os.mkdir(temp_file)  # Recreate an empty temp folder

        keyboard = False  # Whether the keyboard is currently shown on screen
        keyboard_height_limit = 0.9 * height  # Only the bottom 10% of the screen is checked
        for perception_info in perception_infos:
            # Skip entries located above the keyboard region
            if perception_info["coordinates"][1] < keyboard_height_limit:
                continue
            # An "ADB Keyboard" label in that region means the keyboard is active
            if "ADB Keyboard" in perception_info["text"]:
                keyboard = True
                break
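
The keyboard check at the end of this excerpt can be isolated into a small function. The sketch below is only an illustration, not code from the repository: the name `is_keyboard_active`, the `ratio` parameter, and the sample data are hypothetical. It assumes each entry's `coordinates` is already an `[x, y]` center point, which is what `get_perception_infos` returns.

```python
def is_keyboard_active(perception_infos, height, ratio=0.9):
    """Return True if an "ADB Keyboard" label sits in the bottom strip of the screen."""
    limit = ratio * height  # same threshold as keyboard_height_limit in run.py
    for info in perception_infos:
        _, y = info["coordinates"]
        if y < limit:
            continue  # entry is above the keyboard region
        if "ADB Keyboard" in info["text"]:
            return True
    return False


# Hypothetical perception entries for a screen that is 2400 pixels tall
sample = [
    {"text": "text: Search", "coordinates": [540, 200]},
    {"text": "text: ADB Keyboard {ON}", "coordinates": [540, 2300]},
]
print(is_keyboard_active(sample, height=2400))      # True
print(is_keyboard_active(sample[:1], height=2400))  # False
```

Restricting the check to the bottom 10% of the screen avoids false positives when the words "ADB Keyboard" appear elsewhere, for example in a settings list.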

In the run.py excerpt above, the first step is to collect information about the current screen. This is done by the get_perception_infos function:

Code block

def get_perception_infos(adb_path, screenshot_file):
    # Take a screenshot of the device
    get_screenshot(adb_path)
    # Read the screen width and height from the screenshot
    width, height = Image.open(screenshot_file).size
    # Run OCR to get the on-screen text and its bounding boxes
    text, coordinates = ocr(screenshot_file, ocr_detection, ocr_recognition)
    # Merge text blocks that lie close to each other
    text, coordinates = merge_text_blocks(text, coordinates)
    # Compute the center of each text block
    center_list = [
        [(coordinate[0] + coordinate[2]) / 2, (coordinate[1] + coordinate[3]) / 2]
        for coordinate in coordinates
    ]
    # Draw a red dot at each center for visualization
    draw_coordinates_on_image(screenshot_file, center_list)
    # Store each text block and its bounding box in perception_infos
    perception_infos = []
    for i in range(len(coordinates)):
        perception_info = {"text": "text: " + text[i], "coordinates": coordinates[i]}
        perception_infos.append(perception_info)
    # Detect icons on the screen with groundingdino_model and get their bounding boxes
    coordinates = det(screenshot_file, "icon", groundingdino_model)
    # Append the icon boxes to perception_infos
    for i in range(len(coordinates)):
        perception_info = {"text": "icon", "coordinates": coordinates[i]}
        perception_infos.append(perception_info)
    # Collect each icon's bounding box and its index (id) in perception_infos
    image_box = []
    image_id = []
    for i in range(len(perception_infos)):
        if perception_infos[i]["text"] == "icon":
            image_box.append(perception_infos[i]["coordinates"])
            image_id.append(i)
    # Crop each bounding box out of the screenshot and save it as <id>.jpg
    for i in range(len(image_box)):
        crop(screenshot_file, image_box[i], image_id[i])
    # List every file in the icon (temp) folder
    images = get_all_files_in_folder(temp_file)
    if len(images) > 0:
        # Sort the icon crops numerically by id
        images = sorted(images, key=lambda x: int(x.split("/")[-1].split(".")[0]))
        # Recover the icon ids from the file names
        image_id = [int(image.split("/")[-1].split(".")[0]) for image in images]
        # Describe each icon with a multimodal LLM and store the results in icon_map
        icon_map = {}
        prompt = "This image is an icon from a phone screen. Please briefly describe the shape and color of this icon in one sentence."
        if caption_call_method == "local":
            for i in range(len(images)):
                image_path = os.path.join(temp_file, images[i])
                icon_width, icon_height = Image.open(image_path).size
                # Crops that are too tall or cover too much of the screen are not captioned
                if (
                    icon_height > 0.8 * height
                    or icon_width * icon_height > 0.2 * width * height
                ):
                    des = "None"
                else:
                    des = generate_local(tokenizer, model, image_path, prompt)
                icon_map[i + 1] = des  # icon_map keys are 1-based
        else:
            for i in range(len(images)):
                images[i] = os.path.join(temp_file, images[i])
            icon_map = generate_api(images, prompt)
        # Write the descriptions in icon_map back into perception_infos
        for i, j in zip(image_id, range(1, len(image_id) + 1)):
            if icon_map.get(j):
                perception_infos[i]["text"] = "icon: " + icon_map[j]
    # Replace each bounding box [x1, y1, x2, y2] with its integer center [x, y]
    for i in range(len(perception_infos)):
        perception_infos[i]["coordinates"] = [
            int(
                (
                    perception_infos[i]["coordinates"][0]
                    + perception_infos[i]["coordinates"][2]
                )
                / 2
            ),
            int(
                (
                    perception_infos[i]["coordinates"][1]
                    + perception_infos[i]["coordinates"][3]
                )
                / 2
            ),
        ]

    return perception_infos, width, height
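
The index bookkeeping in the second half of get_perception_infos is easy to misread. Icon crops are named after their position in perception_infos, the file names are sorted numerically, and icon_map is keyed 1..N in that sorted order. The sketch below replays just that logic on made-up data, with no models involved. The helper names `box_center` and `attach_captions` and all sample values are hypothetical, and it assumes bare file names such as "2.jpg".

```python
def box_center(box):
    """Convert an [x1, y1, x2, y2] box into an integer [x, y] center."""
    x1, y1, x2, y2 = box
    return [int((x1 + x2) / 2), int((y1 + y2) / 2)]


def attach_captions(perception_infos, crop_files, icon_map):
    """Write each caption back onto the icon entry it belongs to.

    crop_files: names like "2.jpg"; the stem is the entry's index in perception_infos.
    icon_map: captions keyed 1..N, following the numerically sorted crop order.
    """
    ordered = sorted(crop_files, key=lambda name: int(name.split(".")[0]))
    ids = [int(name.split(".")[0]) for name in ordered]
    for position, info_index in enumerate(ids, start=1):
        caption = icon_map.get(position)
        if caption:
            perception_infos[info_index]["text"] = "icon: " + caption
    return perception_infos


# Made-up data: eleven entries, two of which (index 2 and 10) are icons
infos = [{"text": "text: label %d" % i, "coordinates": [0, 0, 10, 10]} for i in range(11)]
infos[2] = {"text": "icon", "coordinates": [40, 400, 140, 500]}
infos[10] = {"text": "icon", "coordinates": [200, 400, 300, 500]}

crop_files = ["10.jpg", "2.jpg"]  # string order puts "10.jpg" first
icon_map = {1: "a blue gear", 2: "a red envelope"}  # 1 -> 2.jpg, 2 -> 10.jpg

attach_captions(infos, crop_files, icon_map)
for info in infos:
    info["coordinates"] = box_center(info["coordinates"])

print(infos[2])   # {'text': 'icon: a blue gear', 'coordinates': [90, 450]}
print(infos[10])  # {'text': 'icon: a red envelope', 'coordinates': [250, 450]}
```

Sorting with `int(...)` matters once there are ten or more entries: a plain string sort would place "10.jpg" before "2.jpg" and pair captions with the wrong icons.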