Indoor scene generation has become a popular research topic in recent years. It not only provides computer vision tasks with naturally labeled indoor scene datasets, helping them better understand scenes, but also has applications in many real-world scenarios, such as robot navigation. The diversity of indoor scene layouts makes scene generation a highly challenging task. This paper reviews the research progress in indoor scene generation algorithms in recent years, summarizing and classifying generation algorithms based on scene input, scene context, scene representation, scene generation methods, and furniture placement order. It analyzes the development, advantages, and disadvantages of indoor scene generation algorithms in three branches: example-free object-relationship-based generation, example-free human-activity-based generation, and example-and-object-relationship-based generation. Furthermore, this paper summarizes the shortcomings of existing algorithms and points out potential directions for future indoor scene generation algorithms.
1 Introduction
In recent years, virtual indoor scenes have been widely used in virtual reality, augmented reality, open-world gaming, and robotics. However, designing indoor scenes is time-consuming, and modeling them requires complex scene design tools. Therefore, automated indoor scene generation has become a hot topic for researchers and has experienced rapid development.
The goal of indoor scene generation is to place furniture in a three-dimensional space of fixed size and structure while satisfying the functional and physical constraints of a real-world indoor scene. The attributes of furniture in three-dimensional space can be described by position, orientation, and category, so the essence of indoor scene generation is determining which categories of furniture to place and how to arrange them in three-dimensional space. Similar to indoor scene generation, 3D model generation determines the arrangement of model components, and floor plan generation arranges rooms; solutions for floor plans, 3D models, and indoor scenes can therefore be mutually referenced or combined. Because their functions are fixed, 3D models have relatively fixed layout rules between their components and offer limited variation. Compared to 3D model generation, indoor scene generation presents the following three challenges:
(1) There is a great deal of freedom in the arrangement of furniture in an indoor scene. The same furniture may have multiple reasonable layouts, so the relationship between furniture is relatively more complex.
(2) 3D models have fixed computer-understandable representations, such as point clouds, voxels, and meshes. How to abstract indoor scenes into computer-understandable scene representations is a challenge.
(3) Indoor scene generation also requires consideration of more specific constraints, such as corridor connectivity and field of view.

Mainstream indoor scene generation algorithms use the contextual relationships between objects to structure the scene layout, while a small number of algorithms based on human activities and object contextual relationships form a new branch. Generating indoor scenes without any reference information is a very challenging task.
Figure 1 Scene generation algorithm framework
As a result, a series of example-based scene generation algorithms have emerged. This paper addresses the three challenges mentioned above, categorizing scene generation tasks into three branches: example-free object-relationship-based generation, example-free human-activity-based generation, and example-and-object-relationship-based generation. The paper then describes and analyzes these scene generation algorithms.
Figure 1 illustrates the algorithmic framework involved in the entire scene generation process, where the indoor scene renderings are from the 3D-FRONT dataset.
2 Indoor Scene Dataset
Computer vision tasks such as image detection, image segmentation, and intrinsic decomposition based on indoor scene images have been extensively studied, aiming to enable computers to better understand indoor scenes, much like humans. However, labeled data for different tasks in the vision domain is extremely difficult to obtain. Therefore, the field urgently needs naturally labeled indoor scene datasets to simulate real indoor scenes, thereby alleviating the burden of labeled data.
The earliest indoor scene dataset was the SceneNet dataset proposed by Handa et al., which only provided a small number of indoor scenes with 3D models. Song et al. proposed the widely used large-scale indoor scene dataset SUNCG; however, these indoor scenes were designed by amateur designers, so there is a certain gap between them and real-world scenes. Unlike synthetic indoor scene datasets composed of 3D models, Dai et al. proposed ScanNet, an RGB-D scanned image dataset based on real-world scene scans and containing rich annotations. The InteriorNet dataset proposed by Li et al. used more high-quality Computer-Aided Design (CAD) models, and professional designers designed nearly 20 million indoor scenes based on them, rendering images that are closer to photographic quality. However, it does not publicly disclose the corresponding 3D models, only providing images for research use. Unlike the aforementioned datasets (none of which contain realistic annotations of indoor scene structures), the StructureNet dataset proposed by Mo et al. provides indoor scenes designed by professional designers with annotations of scene structure information, offering more reliable annotation data for tasks such as room structure prediction. Subsequently, Fu et al. provided the 3D-FRONT interior scene dataset, which consists of real interior scene data used by users in the home decoration field. Nearly half of the room scenes were considered by designers to be high-quality scenes with certain design concepts. The Hypersim dataset proposed by Roberts et al. provides 3D models, as well as rendered images with instance and semantic segmentation annotations, and image decoupling representation images, making it the most comprehensive interior scene dataset in terms of annotation information to date.
3 Classification of Indoor Scene Generation Algorithms
The development of indoor scene generation algorithms has yielded many outstanding research results. This paper abstracts five classification criteria from existing algorithms and summarizes them from different perspectives, analyzing and comparing their advantages and disadvantages to help readers better understand the current state of development of indoor scene generation algorithms. The specific classifications are shown in Table 1.
3.1 Classification based on scene input
Based on whether there are reference examples in the scene input, indoor scene generation algorithms can be divided into example-free scene generation algorithms and example-based scene generation algorithms. Example-free scene generation algorithms often summarize rules and abstract energy functions from large-scale indoor scene datasets, or incorporate layout patterns into probability statistics and deep learning priors, thereby generating a reasonable indoor scene from scratch. Example-based scene generation algorithms accept input forms such as text, sketches, images, and 3D information, requiring the generated scene to match the input to a certain extent; this is a conditional scene generation task.
In practical applications of scene generation, it is often necessary to incorporate user preferences. Therefore, example-based generation algorithms can better interact with humans and have greater application prospects. However, when a large number of diverse virtual indoor scenes are required, example-free scene generation algorithms have a greater advantage.
3.2 Classification based on scene context
Based on the different methods of modeling scene context information, indoor scene generation algorithms can be divided into those based on relationships between objects and those based on relationships between humans and objects. Most indoor scene generation algorithms consider the relationships between objects, which can be used to determine the spatial relationships of furniture placement and the co-occurrence relationships of furniture categories. A smaller number use implicit methods to learn the contextual information of scene layout, such as using the attention mechanism of neural networks or automatically learning parameters from CNNs and DNNs. Considering that the placement of objects is closely related to human activities, algorithms based on the relationships between humans and objects have emerged in recent years. These mainly include three forms: human pose-object, human action-object, and human part-object.
When designers create indoor scenes with computer software, they rarely annotate the regions where people act or the actions people might perform, so virtual indoor scene datasets containing human behavior are scarce. Moreover, setting complexity aside, modeling the relationships between objects is the easiest approach to implement, and future algorithms will likely continue to favor it. However, existing algorithms still cannot avoid manually defined relationships between objects, such as support or surround relationships. Employing attention mechanisms to learn these relationships automatically can better address this issue.
3.3 Classification based on scene expression methods
Indoor scene generation algorithms are mainly categorized into graph structures, hierarchical structures, image structures, and matrix structures based on their scene representation methods. Graph structures consist of sets of nodes and edges, offering flexibility and intuitiveness and allowing connections to be added between any two objects; therefore, most algorithms employ graph structures for representation.
Table 1 Classification of Indoor Scene Generation Algorithms
Note: In scene context, O represents object-object relationship, and P represents person-object relationship; in scene representation, G represents graph structure, H represents hierarchical structure, I represents image structure, and M represents matrix structure; in generation order, Seq represents sequential generation, and Syn represents synchronous generation; in scene input, N represents no reference, T represents text input, S represents sketch input, I represents image input, and D represents 3D information input; in generation algorithm, C represents traditional algorithm, and L represents deep learning algorithm.
Hierarchical structures consist of a set of nodes with parent-child relationships, where each child node has exactly one parent. Generally, the entire scene serves as the root node and furniture or furniture components as leaf nodes, with directed relationships between nodes; generating indoor scenes hierarchically aligns with designers' design thinking. Considering that furniture is generally arranged on a two-dimensional plane, some researchers use a top-down view to represent the scene; this image-based representation shows the positional relationships between furniture on a coordinate plane more intuitively. A matrix-based approach first represents the attributes of each furniture item as a vector, then stacks all furniture vectors into a matrix that represents the overall scene layout. While the matrix representation is the simplest and does not require defining contextual relationships between objects, it also fails to reflect the relationships between furniture during generation, resulting in poor interpretability.
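To make these representations concrete, the sketch below encodes the same toy bedroom as a graph, a hierarchy, and a matrix (an image structure would simply be a top-down rendering of the same coordinates). All object names, relation labels, and attribute layouts are illustrative choices, not taken from any particular paper.

```python
# The same toy bedroom in three of the representations discussed above.
furniture = ["bed", "nightstand", "lamp"]

# Graph structure: nodes plus labelled edges between any pair of objects.
scene_graph = {
    "nodes": furniture,
    "edges": [("nightstand", "next-to", "bed"),
              ("lamp", "on-top-of", "nightstand")],
}

# Hierarchical structure: the room is the root; each child has one parent.
scene_tree = ("room", [("bed", []),
                       ("nightstand", [("lamp", [])])])

# Matrix structure: one fixed-length attribute vector per object
# (here: category id, x, y, orientation), stacked row by row.
CATEGORY = {"bed": 0, "nightstand": 1, "lamp": 2}
scene_matrix = [
    [CATEGORY["bed"],        1.0, 0.8, 0.0],
    [CATEGORY["nightstand"], 2.2, 0.2, 0.0],
    [CATEGORY["lamp"],       2.2, 0.2, 0.0],
]
```

Note how the graph names relations explicitly, the hierarchy fixes a parent for every object, and the matrix carries no relational information at all, which is exactly the interpretability trade-off discussed above.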
Currently, deep learning is the dominant algorithm for scene generation, and the emergence of graph convolutional neural networks (GCNNs) allows graph-represented scenes to leverage the advantages of deep learning and automatically learn scene layout patterns. Furthermore, the inherent flexibility of graph structures in adding and deleting nodes enables deep learning algorithms to better interact with humans. Therefore, using GCNNs for deep learning to represent scenes using graph structures offers significant advantages.
3.4 Classification based on scene generation method
Based on different scene generation methods, indoor scene generation algorithms can be mainly divided into traditional algorithms and deep learning algorithms. Before the advent of deep learning, traditional methods modeled scene layout patterns with rule-based, optimization-based, and probabilistic-statistical approaches in order to determine layout parameters. After the emergence of large-scale indoor scene datasets, deep learning algorithms, with their powerful feature learning and extraction capabilities, became the primary approach to indoor scene generation. Traditional algorithms require significant manual effort to abstract the layout patterns of indoor scenes and are relatively time-consuming during generation. In contrast, deep learning algorithms, through end-to-end generative neural networks, can automatically learn layout patterns and quickly generate layouts, making them the mainstream scene generation algorithms today.
3.5 Classification based on furniture placement order
Based on the order in which furniture is placed in a scene, indoor scene generation algorithms can be divided into two types: sequential iteration and synchronous generation. The disadvantage of sequential iteration is that later-placed objects cannot influence earlier-placed objects due to their sequential order, and earlier-placed objects cannot predict the categories of later-placed objects. However, its advantage is that objects that cannot fit can be discarded, so as long as the algorithm is correct, it can always generate a reasonable scene. The advantage of synchronous generation is that it considers the placement of all other furniture when arranging furniture. The disadvantage is that furniture cannot be discarded during the placement process, which may result in unreasonable scenes.
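The discard behavior that distinguishes sequential iteration can be illustrated with a minimal placement loop: furniture is tried one item at a time, and an item that never finds a collision-free position is simply dropped rather than forcing an unreasonable scene. The room size, item sizes, and rejection test are all hypothetical.

```python
import random

ROOM_W, ROOM_D = 4.0, 3.0  # hypothetical room footprint in metres

def fits(placed, item, x, y):
    """Reject placements that leave the room or overlap placed furniture."""
    w, d = item["w"], item["d"]
    if x < 0 or y < 0 or x + w > ROOM_W or y + d > ROOM_D:
        return False
    for p in placed:  # axis-aligned rectangle overlap test
        if not (x + w <= p["x"] or p["x"] + p["w"] <= x or
                y + d <= p["y"] or p["y"] + p["d"] <= y):
            return False
    return True

def place_sequentially(queue, tries=50, seed=0):
    """Place furniture one by one; an item that never fits is discarded."""
    rng = random.Random(seed)
    placed = []
    for item in queue:
        for _ in range(tries):
            x = rng.uniform(0, ROOM_W - item["w"])
            y = rng.uniform(0, ROOM_D - item["d"])
            if fits(placed, item, x, y):
                placed.append({**item, "x": x, "y": y})
                break  # move on; items that exhaust all tries are dropped
    return placed

layout = place_sequentially([
    {"name": "bed",       "w": 2.0, "d": 1.6},
    {"name": "wardrobe",  "w": 1.0, "d": 0.6},
    {"name": "oversized", "w": 5.0, "d": 5.0},  # can never fit
])
```

A synchronous method would instead optimize all positions jointly and could not silently drop the oversized item, which is the trade-off described above.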
4 Indoor Scene Generation Algorithms
The goal of indoor scene layout generation is to determine the size, category, position, and orientation of furniture in three-dimensional space. Currently, mainstream scene generation algorithms generate scenes based on object relationships without example constraints. This section reviews example-free object-relationship-based scene generation algorithms, categorizing them into traditional and deep learning algorithms, and outlines their scene representation methods and generation orders. Building upon these mainstream algorithms, other scene generation methods have emerged, including example-free scene modeling based on human activities and scene generation based on both examples and object relationships.
This section provides a detailed overview and analysis of scene generation algorithms, categorized into three main types: example-free scene generation algorithms based on object relationships, example-free scene generation algorithms based on human activities, and scene generation algorithms based on examples and object relationships. This aims to help readers better understand the development trends of scene generation algorithms.
4.1 Example-free scene generation algorithm based on object relationships
4.1.1 Traditional generation methods
Early methods for studying automated indoor scene generation were mainly divided into three types: rule-based scene generation, energy-optimization-based scene generation, and probability-statistics-based scene generation. Traditional methods require researchers to fully utilize their generalization and abstraction abilities, using limited knowledge and capabilities to abstract possible scene layout patterns and represent them using algorithms.
Xu et al. were the first to propose a rule-based, sequentially iterative scene generation algorithm that places furniture one by one into an indoor scene. During the placement process, the algorithm considers the possible positions of the furniture based on the available planes, the planes' support capacity, and the distances between objects. It also incorporates the semantic relationships between furniture in the real-world scene into the rules, making the furniture placement more rational. Furthermore, the algorithm adds physical constraints such as the non-interpenetration of objects, stability, and friction between objects to avoid disharmonious scene placements.
Yu et al. and Merrell et al. proposed representing layout rules as an energy function over layout variables, generating reasonable layouts by optimizing this function. The semantic relevance between furniture can be summarized as hierarchical, spatial, and pairwise relationships, which are incorporated into the energy function as scene context information; functional and visual placement rules can also be added to constrain furniture placement. The two algorithms optimize the energy function from different problem-solving perspectives. Because the search space is large, the former uses simulated annealing to gradually determine furniture placements and obtain a reasonable indoor scene. The latter, observing that the diversity of layouts yields multiple reasonable solutions and hence an energy function with multiple peaks, uses Markov chain Monte Carlo (MCMC) sampling to optimize the energy function.
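A minimal sketch of the energy-optimization idea, assuming a toy one-dimensional energy (a pairwise distance term plus a room-boundary term) and a standard simulated-annealing loop with a Metropolis acceptance rule; real systems optimize far richer energies over many furniture variables.

```python
import math
import random

def energy(x_sofa, x_table):
    """Toy layout energy: sofa and table should sit about 1 m apart,
    and both should stay within a 4 m wall span."""
    pair = (abs(x_sofa - x_table) - 1.0) ** 2            # pairwise distance term
    walls = sum(max(0.0, x - 4.0) ** 2 + max(0.0, -x) ** 2
                for x in (x_sofa, x_table))               # room-boundary term
    return pair + walls

def anneal(steps=5000, t0=1.0, seed=1):
    rng = random.Random(seed)
    x = [rng.uniform(0, 4), rng.uniform(0, 4)]
    e = energy(*x)
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-3                   # linear cooling schedule
        cand = [xi + rng.gauss(0, 0.1) for xi in x]       # perturb the layout
        ec = energy(*cand)
        # Metropolis rule: accept improvements, sometimes accept worse moves
        # while the temperature is still high.
        if ec < e or rng.random() < math.exp((e - ec) / t):
            x, e = cand, ec
    return x, e

(x_sofa, x_table), e_final = anneal()
```

The high-temperature phase lets the chain escape local minima, which is why both cited works prefer stochastic search over plain gradient descent for multi-peaked layout energies.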
Fisher et al. proposed a probabilistic model based on Gaussian mixture models and Bayesian models to learn the layout priors of a scene and generate new indoor scenes similar to user-given example scenes. This model can predict the types of furniture that can be placed and the most likely location for that furniture in the scene space based on the co-occurrence probability of paired objects in spatial locations. To increase the diversity of new scene layouts, the paper also proposes a clustering algorithm based on scene context information to provide interchangeable furniture categories. In addition to considering paired object relationships, Kermani et al. also used relationships involving more than two objects to represent the scene context. Unlike the previous methods that only considered the contextual relationships between local furniture, Liu et al. constructed a hierarchical syntactic probabilistic model using a given large-scale indoor scene dataset to summarize the contextual relationships of the entire indoor scene. This hierarchical relationship learned from the dataset, treating it as a whole, increases the rationality of the layout. Henderson et al. also arranged furniture in a hierarchical order: main objects, small objects, ceiling objects, and wall objects.
4.1.2 Deep Learning Generation Methods
Deep learning is a data-driven representation learning method. With the emergence of large-scale indoor scene datasets, scene generation using deep learning has become possible. Traditional generation methods require manual definition of specific rules, optimization functions, or density functions, while deep learning can directly utilize neural networks with special structures to implicitly learn and represent this information, avoiding the complexity of manual definition. Furthermore, scene priors learned from large-scale indoor scene datasets using deep learning can effectively supplement limited human experience in indoor scene design.
Wang and Ritchie et al. proposed a scene generation model based on convolutional neural networks (CNNs) that can generate scenes quickly and flexibly. Although interior scenes exist in three-dimensional space, gravity dictates that most objects are arranged on a two-dimensional plane. Therefore, this model takes a top-down view of the scene as input and uses different CNNs to predict the category, position, orientation, and size of furniture, adding furniture to the scene iteratively. Representing the scene graph as a top-down view enables pixel-level fine-grained reasoning and allows the use of CNNs specifically developed for image understanding to learn complex interior scene structures.
Unlike image-tiled representations of interior scenes, Li et al. argue that the structure of interior scenes is inherently hierarchical, proposing to abstract interior scenes into a hierarchical tree structure that includes support relationships, co-occurrence relationships, surrounding relationships, and wall-leaning relationships. First, a recurrent neural network encodes furniture attributes and their relative positions to other furniture from the leaf nodes upwards, based on the abstracted hierarchical tree structure. Then, it decodes the scene layout information of each furniture node, such as category, size, and orientation, from the root node downwards. Finally, it is trained using a variational autoencoder, generated from randomly sampled noise.
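The bottom-up pass over the hierarchy can be sketched as a recursive fold: each node's code combines its own embedding with its children's codes. The pairwise averaging below is a placeholder for the learned recursive encoder network, and the embeddings are invented for illustration.

```python
def encode(node, embed, dim=4):
    """Bottom-up recursive encoding of a scene hierarchy into one
    fixed-size vector. `embed` maps a furniture name to a length-`dim`
    vector; averaging stands in for the learned encoder network."""
    name, children = node
    code = list(embed.get(name, [0.0] * dim))
    for child in children:                       # fold children into the parent
        child_code = encode(child, embed, dim)
        code = [(a + b) / 2.0 for a, b in zip(code, child_code)]
    return code

embed = {"bed":        [1, 0, 0, 0],
         "nightstand": [0, 1, 0, 0],
         "lamp":       [0, 0, 1, 0]}
tree = ("room", [("bed", []), ("nightstand", [("lamp", [])])])
root_code = encode(tree, embed)  # one vector summarizing the whole scene
```

In the full model a matching decoder unfolds such a root code back into per-node layout attributes, and the variational autoencoder makes the root code sampleable from noise.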
Similar to the hierarchical tree structure, Zhang et al. also employed a variational autoencoder generative network structure. This structure arranges all object features in the scene into a fixed-size matrix as a representation of the indoor scene. The parameterized matrix is input into a sparsely connected feedforward neural network to learn coarse-grained global layout information of the indoor scene. Simultaneously, a directed distance field is used to map the scene to a two-dimensional space to learn fine-grained local layout information. Yang et al. also adopted a matrix representation. In addition to the generative network containing information about individual objects, this study also used the generative network to encode the relative attributes of objects. Finally, a Bayesian method was used to optimize the final layout by combining individual object attributes, relative attributes, and parameter priors.
With the emergence and development of graph convolutional neural networks (GCNNs), indoor scenes represented by graph structures can be encoded with scene priors using deep learning, and graph structures are the most intuitive way to represent the contextual relationships between objects. Message passing networks are a commonly used framework for implementing GCNNs. Zhou et al. proposed a method for scene enhancement using graph representations of indoor scenes, and utilized support, surround, proximity, and co-occurrence relationships abstracted from a large dataset of indoor scenes, along with a message passing attention mechanism, to allow the model to focus more on the most relevant scene context for predicting new objects. Wang et al. decomposed scene generation into two steps: first, within the framework of a decision model, a graph convolutional neural network was used to generate a scene layout plan; then, a convolutional neural network was used to instantiate the furniture represented by each node in the graph structure into specific spatial locations. Luo et al. combined graph convolutional neural networks with conditional variational autoencoders to propose an end-to-end algorithm for generating indoor scenes. Dhamo et al., based on graph convolutional neural networks and conditional variational autoencoders, trained the system by adding an enhanced scene that differs from the original image before the decoder. This enabled them to generate indoor scenes by modifying the scene graph according to human preferences.
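The message-passing idea behind these graph networks reduces to a few lines: each node repeatedly updates its feature with an aggregate of its neighbors' features. Real GCN layers use learned weight matrices, nonlinearities, and attention over messages; this sketch keeps only the structural skeleton.

```python
def message_pass(node_feats, edges, rounds=1):
    """Toy message passing on a scene graph: every node's feature is
    averaged with the mean of its neighbours' features each round."""
    feats = {k: list(v) for k, v in node_feats.items()}
    nbrs = {k: [] for k in feats}                 # undirected adjacency
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for _ in range(rounds):
        new = {}
        for n, f in feats.items():
            if not nbrs[n]:
                new[n] = f
                continue
            msg = [sum(vals) / len(nbrs[n])       # mean over neighbours
                   for vals in zip(*(feats[m] for m in nbrs[n]))]
            new[n] = [(a + b) / 2 for a, b in zip(f, msg)]
        feats = new
    return feats

feats = message_pass(
    {"bed": [1.0, 0.0], "nightstand": [0.0, 1.0], "lamp": [0.0, 0.0]},
    [("bed", "nightstand"), ("nightstand", "lamp")],
)
```

After one round, every node's feature already mixes in information from its graph neighborhood, which is how scene context propagates to the prediction of a new object.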
Previous scene generation methods made assumptions about the relationships between furniture. Wang et al. proposed to abstract the scene into a sequence of object attributes, transforming the scene generation task into a sequence generation task. They used the Transformer structure to generate the scene and implicitly learned the relationships between furniture objects through the attention mechanism in the Transformer.
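Casting a scene as an attribute sequence is straightforward to sketch: each object contributes a category token followed by quantized position and orientation tokens, and an autoregressive Transformer would then predict each token from its prefix. The token layout and quantization step below are illustrative assumptions, not the cited paper's exact scheme.

```python
def scene_to_sequence(objects, step=0.1):
    """Flatten a scene into a token sequence: each object contributes a
    category token followed by quantized x, y, and orientation tokens."""
    quantize = lambda v: int(round(v / step))
    tokens = ["<start>"]
    for o in objects:
        tokens += [o["category"],
                   quantize(o["x"]), quantize(o["y"]), quantize(o["theta"])]
    tokens.append("<end>")
    return tokens

seq = scene_to_sequence([
    {"category": "bed",  "x": 1.0, "y": 0.8, "theta": 0.0},
    {"category": "lamp", "x": 2.2, "y": 0.2, "theta": 1.5},
])
```

Because self-attention lets every token attend to every earlier token, no furniture relationships need to be defined by hand; they are learned from the sequences themselves.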
4.2 Example-free scene generation algorithm based on human activities
The relationships between objects in real-world scenarios are complex and diverse, making it difficult to extract crucial contextual relationships for modeling. Real-world scene layouts are often divided according to functional areas and are closely related to human activities; therefore, scene layouts can be analyzed by modeling the relationships between human activities and objects.
Human posture can predict future action tendencies, and action is the medium through which interaction occurs between humans and objects; therefore, there is a certain contextual relationship between human posture and objects. Jiang et al. constructed a probability density function to learn the contextual relationship between the support, accessibility, and usability of objects and six standard human postures. During scene generation, the model first infers possible human postures and positions based on existing objects, and then samples from the probability density function, centered on the human posture, to find the possible placement position of the next object.
Fisher et al. directly used motion modeling to create relationships between people and objects, and based on this, generated new scenes similar to a given coarsely scanned scene in both function and geometric properties. This study extracted a geometric scene template with functional areas from the scanned scene and placed a virtual human agent in the corresponding functional areas. The virtual human agent determined actions to interact with objects based on scene functions, such as gazing, touching, back support, and hip support, arranging the objects in a motion-related configuration.
Ma et al. studied an action-driven scene generation framework that determines scene layout by simulating object placement altered by human actions. First, an action model is learned using labeled images, where each action type combines one or more human poses, one or more object categories, and information representing spatial relationships between people and objects, as well as between objects. Then, the scene is generated by sampling action sequences. Unlike other algorithms, since an action may involve multiple human poses and objects, this framework can simultaneously trigger the placement of a series of objects after the action is determined. Furthermore, all actions in the scene have a certain sequential relationship, making the generation of the entire scene more consistent.
Unlike other studies that focus on fixed human postures, Savva et al. controlled human postures with motion attributes, allowing for greater freedom in the interaction between human postures and the scene. This research established human posture attributes that reflect the relationship between body parts and nearby objects, constructed a probabilistic model from a large-scale dataset, and integrated human posture estimation into the scene generation task to generate more realistic scenes. Qi et al. proposed using an AND-OR graph related to spatial attributes to represent indoor scenes, encoding contextual relationships related to human activities into Markov random fields at terminal nodes, and then generating new scenes through sampling. Fu et al.'s research, given an empty scene and some furniture categories, expanded the categories based on the relationships between human activities and objects to construct complete scene functional areas.
4.3 Scene Generation Algorithms Based on Examples and Object Relationships
The goal of automated indoor scene generation is to reduce the time and effort spent on design and layout. However, some applications still require a degree of human interaction to generate indoor scene layouts that meet user needs. Furthermore, generating indoor scenes without any reference requires learning every possible scene layout, which is difficult to achieve. An example greatly narrows the space of admissible layouts, making the generation task much simpler. This section introduces example-based scene generation algorithms for four input forms: text, sketches, images, and 3D information.
4.3.1 Text Input
Using natural language descriptions to obtain scene layouts is a relatively simple method. Natural language, as a way for people to express their thoughts in daily life, requires no training when describing scenes. Seversky and Coyne were among the first to propose language-driven scene generation methods. This method uses natural language to describe the relationships between objects and their spatial locations in detail, thus mapping natural language to scenes. This approach can generate indoor scenes that conform to linguistic expression, but it limits the freedom and diversity of users in expressing the scene.
The method can only generate fixed scene layouts. Chang et al. proposed parsing natural language into a scene template containing the objects to be placed and how they should be arranged, and then expanding it with implicit positional relationships based on prior spatial positional knowledge learned from the dataset. Ma et al.'s algorithm not only considers the spatial positional relationships between objects, but also models the probability of objects appearing in pairs. Therefore, this method supports not only implicit positional relationship expansion, but also implicit object category expansion. This method of enhancing scenes by using implicit, general layout rules extracted from the dataset allows users to avoid providing explicit layout information as in most previous methods. Chang et al. attempted to associate descriptions with objects to find 3D models more suitable for text descriptions, and also transformed the rules into text-based interactive scene editing operations, developing a user-friendly UI.
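The template-parsing step can be illustrated with a deliberately tiny parser: it recognizes only a fixed vocabulary of `<object> <relation> <object>` clauses and emits a scene template of objects and relations, onto which implicit relations learned from a dataset could later be grafted. The relation vocabulary and regular expressions are invented for this sketch.

```python
import re

# A tiny parser in the spirit of template-based methods: it understands
# only "<object> <relation> <object>" clauses over an invented vocabulary.
RELATIONS = {"on": "on-top-of", "next to": "next-to", "facing": "facing"}

def parse(sentence):
    """Map a sentence to a scene template of objects and relations."""
    objects, relations = set(), []
    for phrase, rel in RELATIONS.items():
        for m in re.finditer(rf"(\w+) {phrase} (?:the )?(\w+)", sentence):
            a, b = m.group(1), m.group(2)
            objects.update([a, b])
            relations.append((a, rel, b))
    return {"objects": sorted(objects), "relations": relations}

template = parse("a lamp on the nightstand next to the bed")
```

A full system would then expand this template with implicit objects and positions (e.g. a bedroom usually co-occurs with a wardrobe) before grounding each object in a 3D model.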
4.3.2 Sketch Input
Sketches are also a simple way for users to express scene layouts. Modelers create corresponding 3D interior scenes based on pre-drawn conceptual sketches by interior designers. Existing 3D scene design tools require modelers to repeatedly perform the two steps of model finding and model placement to place furniture one by one into the interior scene. In automated algorithms that generate scenes from given sketches, Shin et al. also adopted a similar process: first, identify individual objects from the sketch; then, find the corresponding 3D model in the model library based on the object's visual features; finally, place it in 3D space. Decomposing the 3D model into parts and completing the part-level model finding and placement can also achieve the generation of 3D models from model sketches. However, the retrieval and placement of individual objects often leads to ambiguity. To address this, Xu et al. proposed extracting furniture combinations with co-occurrence and spatial relationships from the dataset to achieve collaborative retrieval and placement of multiple objects. This method greatly reduces user intervention.
4.3.3 Image Input
Mobile phones are everyday portable devices with camera functions; obtaining a scene image requires nothing more than pressing the shutter button. Therefore, generating scenes from RGB images is a viable option for users and has been extensively researched. Huang et al. proposed a holistic scene grammar that represents the joint distribution of scene functions and geometric constraints to describe 3D scene structure, using a Monte Carlo method to find the scene whose rendering is most similar to the real image. Nie et al. divided image reconstruction into three sub-tasks: scene layout structure estimation, object detection, and mesh reconstruction; in essence, this still detects the objects to be placed and projects them into 3D space based on the camera pose. Their study used an attention mechanism to compute a weighted sum of the convolutional features of all objects, incorporating contextual information into the per-object 3D prediction. Xiao et al. used a more complex graph convolutional neural network, incorporating contextual information through message passing. Zhang et al. combined the two approaches: the attention mechanism proposed by Nie et al. first produces an initial layout, after which the scene layout is further optimized with the graph convolutional neural network proposed by Xiao et al.
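The attention-weighted summation of per-object features used in these pipelines can be written in a few lines: dot-product scores between a query and each object feature pass through a softmax, and the features are averaged under those weights. Dimensions and the scoring function here are illustrative, not taken from the cited papers.

```python
import math

def attention_pool(query, feats):
    """Attention-weighted sum of per-object feature vectors: softmax over
    dot-product scores between `query` and each feature."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in feats]
    m = max(scores)                                   # for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(feats[0])
    return [sum(weights[i] * feats[i][d] for i in range(len(feats)))
            for d in range(dim)]

pooled = attention_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Objects whose features align with the query dominate the pooled context vector, so each object's 3D prediction is informed mostly by the objects most relevant to it.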
4.3.4 Three-dimensional information input
Generating 3D scenes from 2D RGB images is prone to errors, while RGB-D scenes or scanned scenes based on depth information provide clearer 3D scene information. However, RGB-D images are often noisy. Therefore, Chen et al. proposed using object context relationships learned from a database to constrain reconstruction, ensuring semantic similarity between the reconstructed furniture and the scanned furniture. Hampali et al. used the Monte Carlo method to search for possible furniture sets in RGB-D images to minimize the difference between the reconstructed scene and the real scene. Fisher et al. used human-object context relationships to generate new scenes similar to noisy scanned scenes. Avetisyan et al., after detecting objects in the scanned scene, optimized the placement of furniture using object-object context relationships.
5 Summary and Future Outlook
This paper provides a comprehensive analysis and description of scene generation algorithms. It covers a wide range of methods, from traditional rule-based, probabilistic, and optimization-function-based methods to deep learning methods such as graph convolutional neural networks, deep neural networks, and convolutional neural networks. It also analyzes the advantages and disadvantages of each scene generation algorithm along several axes: from object-object contextual relationships to human-object contextual relationships, from example-free to example-based generation models, from matrix, hierarchical, and image structures to graph structures, and from sequential to synchronous generation. Finally, it examines the development of these algorithms in recent years.
Currently, indoor scene generation algorithms still face challenges. While mainstream deep learning methods can learn some prior scene knowledge, they still require predefined spatial relationships and co-occurrence relationships to aid scene understanding. Furthermore, the predefined relationships can only express a limited range of scene contexts. The attention mechanism in neural networks can effectively address this issue, but it only represents the closeness of connections between objects and does not contain any semantic information. Therefore, integrating semantic relationship prediction into scene generation is one direction for future research.
The most intuitive and currently most promising methods for scene representation are graph and image structures. Graph structures can ignore the position of furniture in three-dimensional space and construct connections between any nodes, but the nodes in this representation do not have a clear order. Image structures, on the other hand, are arranged in a two-dimensional coordinate system, so the scene representation naturally captures the positional relationships between furniture. Therefore, combining graph and image structures for scene prediction is a worthwhile research topic. Existing algorithms combine graphs and images, but they employ a two-step strategy. Future research could explore training an end-to-end network to combine the two.
Yang Miao 1 Chen Baoquan 2*
1. School of Computer Science and Technology, Shandong University
2. Frontier Computing Research Center, Peking University
Reprinted from "Integration Technology"