AnyHome: Open-Vocabulary Generation of
Structured and Textured 3D Homes
Rao Fu*†
Zehao Wen*
Zichen Liu*
Srinath Sridhar
*Equal Contribution
†Corresponding Author
[Preprint]
[Code]

Abstract

We introduce AnyHome, a framework that translates open-vocabulary descriptions, ranging from simple labels to elaborate paragraphs, into well-structured and textured 3D indoor scenes at house scale. Inspired by cognitive theories, AnyHome employs an amodal structured representation to capture 3D spatial cues from textual narratives and then uses egocentric inpainting to enrich these scenes. To this end, we begin with specially designed template prompts for Large Language Models (LLMs), which enable precise control over the textual input. We then use intermediate representations to maintain the consistency of the spatial structure, ensuring that the 3D scenes align closely with the textual description. Next, we apply a Score Distillation Sampling process to refine the placement of objects. Lastly, an egocentric inpainting process enhances the realism and appearance of the scenes. AnyHome stands out for its hierarchical structured representation combined with the versatility of open-vocabulary text interpretation, which allows extensive customization of indoor scenes at various levels of granularity. We demonstrate that AnyHome can reliably generate diverse indoor scenes, characterized by detailed spatial structures and textures, all corresponding to free-form textual inputs.
Structure generation using amodal structured representation.
Egocentric exploration of the textured scene.


Results: Open-Vocabulary Generation

We show open-vocabulary generation results, including bird's-eye views (left), egocentric views (middle), and egocentric tours (right). AnyHome comprehends and extends users' textual inputs and produces structured scenes with realistic textures. It can create a serene and culturally rich environment ("Japanese tea house"), synthesize unique house types ("cat cafe"), and render a more dramatic and stylized ambiance ("haunted house").


Results: Open-Vocabulary Editing

These examples showcase the capability to modify room types, layouts, object appearances, and overall design through free-form user input. AnyHome also supports comprehensive style alterations and sequential edits, all made possible by its hierarchical structured geometric representation and robust text controllability.


Method


Taking free-form textual input, our pipeline generates house-scale scenes by: i) comprehending and elaborating the user's textual input by querying an LLM with modulated prompts; ii) converting textual descriptions into structured geometry using intermediate representations; iii) employing a Score Distillation Sampling (SDS) process with a differentiable renderer to refine object layouts egocentrically; and iv) applying depth-conditioned texture inpainting for texture generation.
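To make stage ii) concrete, here is a minimal sketch of what a hierarchical structured representation could look like: a house as a hierarchy of rooms, each holding placed objects. All class, field, and function names below are illustrative assumptions, not the paper's actual API; the LLM-driven stage that would populate this structure is stubbed out with a hard-coded example.

```python
from dataclasses import dataclass, field

# Hypothetical structured intermediate representation (illustrative only).
# In the real pipeline, an LLM expands the user's description and emits
# structured geometry; here stage (i)+(ii) is replaced by a hard-coded stub.

@dataclass
class PlacedObject:
    label: str       # open-vocabulary object label, e.g. "tatami mat"
    position: tuple  # (x, y) floor-plan coordinates in meters

@dataclass
class Room:
    room_type: str                  # e.g. "tea room"
    footprint: list                 # floor-plan polygon vertices, in meters
    objects: list = field(default_factory=list)

@dataclass
class House:
    description: str                # the free-form input text
    rooms: list = field(default_factory=list)

def build_house(description: str) -> House:
    """Stub for the text-to-structure stages: returns one furnished room."""
    room = Room(
        room_type="tea room",
        footprint=[(0, 0), (4, 0), (4, 3), (0, 3)],
        objects=[PlacedObject("tatami mat", (2.0, 1.5))],
    )
    return House(description=description, rooms=[room])

house = build_house("Japanese tea house")
```

Later stages (SDS-based layout refinement and depth-conditioned inpainting) would then operate on a structure like this, which is what makes room types, layouts, and object placements individually editable.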



More Coming Soon...


For inquiries about creating unique scenes or to request the generated mesh files, kindly reach out to Rao at rao_fu@brown.edu.



Technical Details

R. Fu, Z. Wen, Z. Liu, S. Sridhar.
AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes.
(hosted on arXiv)


[Bibtex]


This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.